The purpose of these exercises is to give the workshop participant an initial exposure to the resources available on the Web. These simple exercises are a "step-by-step" approach for analyzing DNA and protein sequences and for visualizing biochemical and macromolecular structures. (1) All of the links explained or described here can be accessed at http://www.uib.es/depart/dba/MolBio/ .
Exercise #1: Locating a DNA/RNA or Protein Sequence
Entrez
Swiss-Prot
Exercise #2: Comparing a New Sequence Against Sequence Databases
BLAST
FASTA
Exercise #3: Performing Multiple Sequence Alignments
MSA
Exercise #4: Performing Structure Predictions
ProtScale
SWISS-MODEL
Exercise #5: Visualization of Biochemical and Protein Structures
Chemicals with Pharmaceutical Activity
Klotho: Biochemical Compounds
Protein Data Bank 3D Browser
RasMol
Additional Information
References
Exercise #1: Locating a DNA/RNA or Protein Sequence
Searching for DNA Sequences
There are a variety of ways to access sequences from the Web (4). One place to start is a site that acts as a hub for other sites, such as NCBI. The NCBI search tool not only searches databases for DNA (and protein) sequences, but also quickly finds scientific articles about the sequences being searched for. Here is a short example of how to use NCBI to find the DNA sequence of the type I regulatory subunit of cAMP-dependent protein kinase in cows.
1. First access the site http://www.ncbi.nlm.nih.gov
2. Type in the keyword kinase and SEARCH.
3. ADD the keyword cAMP and SEARCH.
4. ADD the keyword regulatory and SEARCH.
5. ADD the keyword bovine and SEARCH.
(Ideally, one should receive only a few hits. Narrowing the search this way saves time compared with viewing the over 21,696 entries returned by the original search with the keyword kinase.) One can now select RETRIEVE DOCUMENTS, and the following is displayed:
-------------------------------------------------------------------------------------------------------------
D83380
Sea urchin mRNA for catalytic subunit of cAMP-dependent histone kinase, complete cds
gi|1199787|dbj|D83380|SUHCAMPB [1199787]
(View GenBank report, FASTA report, ASN.1 report, Graphical view, 1 MEDLINE link, 1 protein link, or 6 nucleotide neighbors)
D83379
Sea urchin mRNA for regulatory subunit of cAMP-dependent histone kinase, complete cds
gi|1199785|dbj|D83379|SUHCAMPA [1199785]
(View GenBank report, FASTA report, ASN.1 report, Graphical view, 1 MEDLINE link, 1 protein link, or 1 nucleotide neighbor)
J05692
B.taurus cAMP-dependant protein kinase regulatory subunit RII-beta mRNA, complete cds
gi|163669|gb|J05692|BOVRIIB [163669]
(View GenBank report, FASTA report, ASN.1 report, Graphical view, 1 MEDLINE link, 1 protein link, or 12 nucleotide neighbors)
K00833
bovine camp-dependent protein kinase, type 1 regulatory subunit (r-i) mrna
gi|163533|gb|K00833|BOVPKIRI [163533]
(View GenBank report, FASTA report, ASN.1 report, Graphical view, 1 MEDLINE link, 1 protein link, or 5 nucleotide neighbors)
M82914
Bos taurus anchor protein regulatory subunit (AKAP75) gene, complete cds
gi|162637|gb|M82914|BOVAKAP [162637]
(View GenBank report, FASTA report, ASN.1 report, Graphical view, 1 MEDLINE link, 1 protein link, or 1 nucleotide neighbor)
--------------------------------------------------------------------------------------------------------------
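The effect of adding keywords one at a time can be illustrated offline. The following is a minimal sketch, not the way the NCBI server works: it assumes a small in-memory list of (accession, title) pairs copied from the display above, and the function name narrow is hypothetical. The real search also matches fields such as the organism, so it returns more entries than this title-only filter.

```python
def narrow(records, keywords):
    """Keep only records whose title contains every keyword (case-insensitive)."""
    hits = records
    for word in keywords:
        # Each added keyword can only shrink the hit list, like ADD + SEARCH.
        hits = [(acc, title) for acc, title in hits
                if word.lower() in title.lower()]
    return hits


# Three titles taken from the sample display (illustrative subset).
records = [
    ("D83380", "Sea urchin mRNA for catalytic subunit of cAMP-dependent "
               "histone kinase, complete cds"),
    ("D83379", "Sea urchin mRNA for regulatory subunit of cAMP-dependent "
               "histone kinase, complete cds"),
    ("K00833", "bovine camp-dependent protein kinase, type 1 regulatory "
               "subunit (r-i) mrna"),
]

result = narrow(records, ["kinase", "cAMP", "regulatory", "bovine"])
for accession, title in result:
    print(accession, title)
```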
After the server returns a list of possible DNA sequences, one can select the GenBank or FASTA report desired. However, one will note that these entries are DNA or RNA sequences rather than protein sequences. A DNA or RNA sequence can be submitted and translated by using the Translate Tool at the ExPASy site (http://www.expasy.ch/tools/dna.html), or one can go to another site that searches only for protein sequences.
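To see what a translation tool does with a nucleotide sequence, here is a minimal sketch that translates one reading frame with the standard genetic code. It is not the ExPASy implementation; the function name translate and the choice to report unrecognized codons as X are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(sequence):
    """Translate the first reading frame of a DNA or RNA sequence."""
    dna = sequence.upper().replace("U", "T")  # treat RNA like DNA
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons with ambiguous letters are reported as X (assumption).
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


print(translate("ATGAAAAAGACC"))
```

A full tool would also translate the other two frames and the reverse strand; this sketch covers only frame 1.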
Searching for Protein Sequences
1. Locate the link to the Swiss-Prot website.
2. Select the by description or identification link, enter the protein name in the description box on the page, and then submit the request.
The server will search for all amino acid sequences related to the words entered. Once the server returns a list of possible proteins, one must select which species' sequence to retrieve. Most sequences that are analyzed are from cow (bovine), mouse (murine), human (Homo sapiens), fly (Drosophila), Escherichia coli (E. coli), or yeast (Saccharomyces cerevisiae). For instance, if kinase bovine regulatory is submitted, the following is obtained:
--------------------------------------------------------------------------------------------------------------
Search in SWISS-PROT for: kinase bovine regulatory
(Release 35 and updates up to 13-Jun-1998)
Number of sequences found: 7
Note that the selected sequences can be saved to a file to be later retrieved; to do so, go to the bottom of this page.
Please choose one of the following entries:
AK75_BOVIN A-KINASE ANCHOR PROTEIN 75 (AKAP 75) (CAMP-DEPENDENT PROTEIN KINASE REGULATORY SUBUNIT II HIGH AFFINITY BINDING PROTEIN) (P75) - BOS TAURUS (BOVINE)
CD5R_BOVIN CYCLIN-DEPENDENT KINASE 5 ACTIVATOR PRECURSOR (CDK5 ACTIVATOR) (TAU PROTEIN KINASE II 23 KD SUBUNIT) (TPKII REGULATORY SUBUNIT) (P23) (P25) (P35) {GENE: NCK5A} - BOS TAURUS (BOVINE)
KAP0_BOVIN CAMP-DEPENDENT PROTEIN KINASE TYPE I-ALPHA REGULATORY CHAIN {GENE: PRKAR1A} - BOS TAURUS (BOVINE)
[continued]
------------------------------------------------------------------------------------------------------------
3. Once the protein sequence and information are displayed, they can be saved as a text file from the browser.
Exercise #2: Comparing a New Sequence Against Sequence Databases
BLAST (Basic Local Alignment Search Tool)
This analysis is useful if one has a new DNA or protein sequence and wants to compare it to other sequences in GenBank. One can compare DNA with DNA, protein with protein, or a combination of these two.
1. Use a sequence from experimental data or obtain one from Exercise #1.
2. Locate the BLAST site at http://www.ncbi.nlm.nih.gov/BLAST/
3. One can choose between either ADVANCED or BASIC BLAST.
4. Paste the sequence into the appropriate box.
Now, one must choose between Blastn, Blastp, tBlastn, tBlastx, or Blastx (see Additional Information). If one inputs a protein sequence, then select Blastp. Important: If the sequence is not already in FASTA format, one must add a line beginning with the symbol ">", then a carriage return, and then the sequence.
Example of a new protein sequence:
----------------------------------------------------------------------------------------------------------
>
MKKTILAIAIPALFASAANAAVIYDKDGTTFDVYGRVQANYYGDTNEADSTAASGYKDVDGELKGSSRL
GWSGKIALNNTWSGIAKTEWQVSAENSANKFDSRHIYVGFDGTQYGKVIFGQTDTAFYDVLEPTDIFNEW
GSEGNFYDGRQEGQVIYSNAIGGFKGKVSYQTNDDQAVKVADVAGGIKTTVFPDVKRKYAYAAAVGYDFD
FGLGFNGGYAYSDLEGKTTDASGKKSEWALGAHYAINGFNFAGVYTQAEVKNDTTGYKDEGRGYELAATY
NVDAWTFLAGYNFKEGKENANLSGSSYEDLMDATLLGVQYSFTSKLKAYTEYKINGVSGKDDDFTVALQY
NF
----------------------------------------------------------------------------------------------------------
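Putting a raw sequence into this form can be automated. The sketch below is one possible helper, under the assumption that the raw text contains only residue letters plus stray spaces, digits, or line breaks; the function name to_fasta, the default description, and the 60-character line width are illustrative choices, not requirements of the BLAST server.

```python
def to_fasta(raw, description="query", width=60):
    """Wrap a raw sequence as FASTA: a '>' line followed by the residues."""
    # Keep letters plus the '*' and '-' symbols; drop spaces, digits, newlines.
    residues = "".join(ch for ch in raw if ch.isalpha() or ch in "*-").upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + description + "\n" + "\n".join(lines) + "\n"


# The start of the example sequence, pasted with stray whitespace.
raw_text = "mkktilaiai palfasaana\n avIYDKDGTT"
print(to_fasta(raw_text, description="new protein", width=20))
```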
The default parameters can be altered once one knows how each changes the search, but for now just select SUBMIT.
One will receive a list similar to the one shown below.
-------------------------------------------------------------------------------------
BLASTP 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= (351 letters)
Database: Non-redundant GenBank CDS translations+PDB+SwissProt+SPupdate+PIR
313,805 sequences; 94,785,530 total letters
Searching..................................................done
                                                                 Score     E
Sequences producing significant alignments:                      (bits)  Value
gi|1465755 (U59311) OmpL [Photobacterium sp. SS9]                  110   1e-23
gnl|PID|d1015697 (D90775) Outer membrane protein F precursor (O... 109   2e-23
gi|3273514 (AF035618) porin OmpN [Escherichia coli]                109   2e-23
gi|148373 (M28296) outer membrane protein [Enterobacter cloacae]   104   8e-22
pdb|1PHO| Phosphoporin (Phoe)                                      102   4e-21
[continued...........]
---------------------------------------------------------------------------------------------------------
At the end of the list, one will obtain many alignments like the one shown below:
-----------------------------------------------------------------------------------------------------------
gi|1465755 (U59311) OmpL [Photobacterium sp. SS9]
Length = 341
Score = 110 bits (272), Expect = 1e-23
Identities = 99/366 (27%), Positives = 152/366 (41%), Gaps = 43/366 (11%)
Query: 2 KKTXXXXXXXXXXXXXXXXXXXYDKDGTTFDVYGRVQAN-YYGDTNEADSTAASGYKDVD 60
KK Y + ++ V GR +A D N+ ++ + +V
Sbjct: 3 KKLIALAVAAASISSVATAAEVYSDETSSLAVGGRFEARAVLADVNKDENVTNTASSEVS 62
Query: 61 GELKGSSRLGWSGKIALNNTWSGIAKTEWQVSAENSANKFDSRHIYVGFDGTQYGKVIFG 120
K R+ +GK + + G+ E + S+ +S N ++R+ Y G G+QYG++++G
Sbjct: 63 D--KSRVRINVAGKTDITEDFYGVGFFEKEFSSADSDND-ETRYAYAGV-GSQYGQLVYG 118
Query: 121 QTDTAFYDVLEPTDIFNEWGSE-GNFYDG--RQEGQVIYSNAIGGFKGKVSYQTNDDQAV 177
+ D + + + TDI G+E GN R + + Y +G F D+ V
Sbjct: 119 KADGSLGMLTDFTDIMAYHGNEAGNKLAAADRTDNNLSY---VGSFD-----LNGDNLTV 170
Query: 178 KVADVAGGIKTTVFPDVKRKYAYAAAVGYDFDFGLGFNGGYAYSDLEGKTTDASGKKSEW 237
K V GG Y+AA Y D GLGF GY D + K +
Sbjct: 171 KANYVFGGSD--------ENEGYSAAAMYAMDMGLGFGAGYGEQDGQSSKNGNEDKTGKQ 222
Query: 238 ALGA-HYAINGFNFAGVYTQAE---VKNDTTGYKDEGRGYELAATYNVDAWTFLAGYNFK 293
A GA Y I+ F F+G+Y + V ND DE GYE AA Y F+ YNF
Sbjct: 223 AFGAISYTISDFYFSGLYQDSRNTVVNNDLI---DESTGYEFAAAYTYGKAVFITTYNF- 278
Query: 294 EGKENANLSGSSYEDLMDATLLGVQYSFTSKLKAYTEYKINGVSGK--------DDDFTV 345
E++N SG + DL D + + Y F + Y YK N + D+F +
Sbjct: 279 --LEDSNASGDA-SDLRDSIAIDGTYYFNKNFRTYASYKFNLLDANSSTTKAQASDEFVL 335
Query: 346 ALQYNF 351
+Y+F
Sbjct: 336 GARYDF 341
[continued]
------------------------------------------------------------------------------------------------------------------
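The Identities and Gaps figures in such a report can be recounted from any pair of aligned Query and Sbjct strings. The sketch below is a simplified illustration, not the BLAST code; the function name alignment_stats is hypothetical, and it counts only identical residues and gap columns (the Positives figure would additionally need the scoring matrix).

```python
def alignment_stats(query, subject):
    """Count identical columns and gap columns in one aligned segment."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


# Last segment of the alignment shown above (Query 346-351, Sbjct 336-341).
identities, gaps, length = alignment_stats("ALQYNF", "GARYDF")
print(identities, gaps, length)

# Whole-number percentages obtained by integer division of the report totals.
print(100 * 99 // 366, 100 * 152 // 366, 100 * 43 // 366)
```

Integer division of the totals in the report gives 27, 41, and 11, which matches the percentages printed by the server.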
5. If the gi|1465755 link is selected, then the sequence of this protein is retrieved.
Finally, one can narrow the search by changing the parameters on the input page. (See Additional Information for a list of parameters.)
FASTA (FastA)
FASTA is a good sequence alignment tool when one accepts gaps in the whole sequence comparison. Here is an example.
1. Access a FASTA site at http://www.fasta.genome.ad.jp/ , or for a more extensive query try accessing FASTA at http://bioweb.pasteur.fr/seqanal/interfaces/fasta.html.
2. Enter the protein sequence of interest, modify the parameters if needed (see Additional Information), and submit the sequence.
The information received is similar to the truncated example below.
------------------------------------------------------------------------------------------------------
FASTA searches a protein or DNA sequence data bank
version 3.0t74 December, 1996
< 20   173     0:==
  22     0     0:          one = represents 103 library sequences
  24     0     0:
  26     3     1:*
  28    12    16:*
  30    74    95:*
  32   352   368:===*
  34   957   998:=========*
  36  2001  2050:===================*
  38  3842  3387:================================*=====
  40  5132  4725:=============================================*====
  42  6034  5776:========================================================*==
  44  6180  6371:===========================================================*
  46  6100  6489:===========================================================*
  48  6106  6213:===========================================================*
  50  5501  5669:====================================================== *
  52  4724  4984:============================================== *
  54  4160  4257:=========================================*
  56  3499  3556:==================================*
  58  2904  2919:============================*
  60  2386  2365:======================*=
  62  1851  1896:==================*
  64  1505  1508:==============*
  66  1194  1192:===========*
  68   969   937:=========*
25083768 residues in 69113 sequences
statistics extrapolated from 50000 to 68920 sequences
Expectation fit: rho(ln(x))= 6.0807+/-0.000545; mu= 2.6432+/- 0.030; mean_var=84.8171+/-17.696
Kolmogorov-Smirnov statistic: 0.0142 (N=29) at 42
FASTA (3.06 Sept, 1996) function (optimized, /bio/db/fasta/matrix/aa/blosum50 matrix) ktup: 2
join: 37, opt: 25, gap-pen: -12/ -2, width: 16 reg.-scaled
Scan time: 8.014
The best scores are:                                 initn init1 opt  z-sc E(69082)
sp:PHOE_CITFR OUTER MEMBRANE PORE PROTEIN    ( 351)   192    77  393 450.1 2e-18
sp:PHOE_KLEPN OUTER MEMBRANE PORE PROTEIN    ( 351)   213    78  381 436.6 1.1e-17
sp:PHOE_ECOLI OUTER MEMBRANE PORE PROTEIN    ( 351)   274    99  377 432.1 2e-17
sp:OMPF_ECOLI OUTER MEMBRANE PROTEIN F PRE   ( 362)   146    79  367 420.6 8.8e-17
[continued...]
-------------------------------------------------------------------------------------------------------
Exercise #3: Performing Multiple Sequence Alignments
There are several protein (and DNA) alignment tools on the Web. One common tool is ClustalW, which is very good and allows one to refine the output. However, to perform a fast and simple alignment, one can access the MSA tool at http://www.ibc.wustl.edu/ibc/msa.html . Once one has reached this page, the following steps can be carried out.
1. Select the type of input (raw sequence, Swiss-Prot ID, GenPept ID, or PIR ID).
2. Input the sequences into the correct boxes.
3. Then select RUN MSA to begin the alignment.
The output looks similar to the truncated version below:
-----------------------------------------------------------------------------------------------------------
Multiply Sequence Alignment Output
Running MSA with options:
Here is the Input Data File
>Seq 1
MKKTILAIAIPALFASAANAAVIYDKDGTTFDVYGRVQANYYGDTNEADSTAASGYKDVDGELKGSSRL...
>Seq 2
+++++MMKRNILAVI+VPALLVAGTA+NAAEIYNKDG+NKVDLYGKAV+GLHYFSKGNG+ENSYGGNGD...
>Seq 3
AEIYNKDGNKLDLYGKIDGLHYFSDDKDVDGDQTYMRLGVKGETQINDQLTGYGQWEYNVQANNTESSS...
Here is the Output from the Run
*** Heuristic Multiple Alignment ***
132 ****************231 ********231
-MKKTILAIAIPALF-ASAANAAVIYDKDGTTFDVYGRVQA-NYYGDTNEADSTAASGYKDVDGELKGSSRLDGW
MMKRNILAVIVPALLVAGTANAAEIYNKDGNKVDLYGKAVGLHYFSKGNGENS-----Y-GGNGDMDTYARL-GF
----------------------AEIYNKDGNKLDLYGKIDGLHYFSDDKD-----------VDGD-QTYMRL-GV
SGKIALNNTWSGIAKTEWQVSAENSANKFDSR--HIYVGFDG-TQYGKV---IF-GQTDTAFYDVLEPTDIFNEW
KGETQINSDLTGYGQWEYNFQGNNSEGADAQTGNKTRLAFAG-LKYADVGSFDYDGRNYGVVYDALGYTDMLPEF
KGETQINDQLTGYGQWEYNVQANNTESSSDQA--WTRLAFAGDLKFGDAGSFDY-GRNYGVVYDVTSWTDVLPEF
DGS---EGNFYDGRQEGQVIYSNA-IGGFKGKVSYQTND-DQAVKVADVAGGIKTTVFPDVKRKYAYAAAVGYD-
GGDTAYSDDFFVGRVGGVATYRNSNFFGLVDGLNFAVQY------LDGKN-ERDTAR---RSNGDGVGGSISYE-
GGDTYGSDNFLQSRANGVATYRNSDFFGLVDGLNFALQYQGKNGSVSGEDGATNNGRGALKQNGDGFGTSVTYDI
FDDFGLGFNGGYAYSDLEGKTTD----ASGKKSEWALGAHYAINGFNFAG--------------------
YEGFGIVG--AYG---------------------------------------------------------
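FDGISAGF?AYANSKRTDDQNQLLLGEGDHAETYTGGLKYDANNIYLATQYTQTYNDATRAGSLGFANK
-----------------------------------------------------------------------------------------------------------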
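Once an alignment is in hand, a quick way to read it is to mark the columns in which every sequence carries the same residue. The following is a minimal sketch of that idea, not part of the MSA server; the function name conserved_columns is hypothetical, and the three rows are a short fragment taken from the same columns of the first alignment block in the sample output.

```python
def conserved_columns(rows):
    """Return a marker line with '*' under columns identical in all rows."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    markers = []
    for column in zip(*rows):
        same = len(set(column)) == 1 and column[0] != "-"
        markers.append("*" if same else " ")
    return "".join(markers)


# Fragment of the three aligned sequences from the sample output.
rows = ["AVIYDKDG", "AEIYNKDG", "AEIYNKDG"]
for row in rows:
    print(row)
print(conserved_columns(rows))
```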
Exercise #4: Performing Structure Predictions
Secondary Structure
1. Select and access a server that predicts structural properties.
There are several algorithms that predict structural properties. One of the sites that one can use is ProtScale (http://www.expasy.ch/cgi-bin/protscale.pl).
It is very important that one uses several algorithms for predicting structural properties, because each method of analysis is based on different mathematical or pattern relationships. The following is a list of common algorithms:
Garnier, Osguthorpe and Robson
Chou and Fasman
Deleage & Roux
2. Paste the raw protein sequence into the appropriate input box and submit the request.
One will obtain an immediate response through the Web site, usually in a graphical format. Typically, an arbitrary scale for each algorithm is on the Y-axis, and the residue number is on the X-axis. Here is an example of such a report:
If this plot were emphasizing alpha-helical content, then the stretches of sequence that have a high propensity for secondary structure are the regions from residue 35 to residue 55 and from residue 125 to residue 140. One must keep in mind that the lower-ranking regions of amino acids don't imply the absence of secondary structure; this plot simply shows the probability of alpha-helical structure. In order to determine beta strands, turns, loops, etc., one needs to return to the original input page, change the analysis performed, and repeat the analysis.
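Plots of this kind are typically produced by averaging a per-residue scale over a sliding window. The sketch below illustrates the technique with the Kyte-Doolittle hydropathy scale, which is swapped in here purely as an example; a helix-propensity plot would use a different table of values. The function name sliding_window_profile and the window of 9 residues are assumptions for illustration, not settings taken from the ProtScale page.

```python
# Kyte-Doolittle hydropathy values (illustrative choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(sequence, scale, window=9):
    """Return (residue number, mean score) for each full window position."""
    if window % 2 == 0 or window > len(sequence):
        raise ValueError("window must be odd and no longer than the sequence")
    half = window // 2
    profile = []
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        mean = sum(scale[aa] for aa in segment) / window
        profile.append((center + 1, mean))  # residue numbers start at 1
    return profile


# First 30 residues of the example protein used in Exercise #2.
profile = sliding_window_profile("MKKTILAIAIPALFASAANAAVIYDKDGTT", KYTE_DOOLITTLE)
peak_position, peak_score = max(profile, key=lambda item: item[1])
print(peak_position, round(peak_score, 2))
```

Plotting the score against the residue number gives the X-axis and Y-axis layout described in the text.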
3. Compare the collected data or plots.
Next, one will need to compare the data obtained from the several algorithms. Typically, if all or most of the analyses recognize the same regions of sequence as having secondary structure, then it is highly probable that those regions do have the predicted structure. Once one locates these regions, one must decide exactly where each secondary structure begins and ends. This is where human judgement becomes important. Finally, it is useful to keep some type of table or master diagram that documents which regions of the protein sequence have specific secondary structure.
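The comparison step can be organized as a simple majority vote. The sketch below assumes each algorithm's result has already been transcribed by hand into a string with one letter per residue (H for helix, E for strand, C for coil); that encoding, the function names, and the toy predictions are all hypothetical. It does not replace the human judgement mentioned above; it only produces the master table.

```python
from collections import Counter


def consensus(predictions):
    """Majority vote per residue; '?' where no state has a strict majority."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must cover the same residues")
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        result.append(state if votes > len(predictions) / 2 else "?")
    return "".join(result)


def segments(states, target="H"):
    """Return 1-based (start, end) ranges of consecutive target states."""
    found, start = [], None
    for position, state in enumerate(states, start=1):
        if state == target and start is None:
            start = position
        elif state != target and start is not None:
            found.append((start, position - 1))
            start = None
    if start is not None:
        found.append((start, len(states)))
    return found


# Toy per-residue predictions from three methods (made-up data).
predictions = ["HHHHCCEE", "HHHCCCEE", "CHHHCEEE"]
agreed = consensus(predictions)
print(agreed, segments(agreed, "H"), segments(agreed, "E"))
```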
3-D modeling
1. One can receive a predicted 3-D structure of a protein sequence from SWISS-MODEL (http://www.expasy.ch/swissmod/SM_TOPPAGE.html).
2. Click on First Approach Mode on the new frame of the new page.
3. Enter the necessary information requested on the First Approach Mode page.
Paste the amino acid sequence into the dialog box. Note: It is suggested that one select the Short mode because the Normal mode will send more information than the average user needs. The analysis can take from a few minutes to hours depending on the complexity of the sequence submitted. Many times one will not receive any results, or only a segment of the sequence will be modeled.
4. Once one receives an e-mail file from SWISS-MODEL, one can save that file as a text file and view it in RasMac or WebLab ViewerLite (http://www.msi.com/weblab/) as described in Exercise #5.
Exercise #5: Visualization of Biochemical and Protein Structures
Viewing Pharmaceutical Biochemicals
1. Locate the Chemicals with Pharmaceutical Activity database (http://www.chem.ox.ac.uk/mom/chemical-database/).
2. Search or select a chemical of interest from the alphabetical index.
For example, if penicillin V is selected, the following interactive model is displayed.
-----------------------------------------------------------------------------------------------------
Penicillin_V
C16H18N2O5S
Molecular weight: 350.389
--------------------------------------------------------------------------------------------------------------
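The value 350.389 shown with the formula is consistent with the molecular weight computed from C16H18N2O5S. The following is a minimal sketch of that calculation; the function name molecular_weight is hypothetical, the atomic masses are rounded standard values for only the five elements needed, and the parser handles simple formulas without parentheses.

```python
import re

# Rounded standard atomic masses for the elements in the example.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def molecular_weight(formula):
    """Sum atomic masses for a simple formula such as C16H18N2O5S."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_MASS[symbol] * (int(count) if count else 1)
    return total


print(round(molecular_weight("C16H18N2O5S"), 3))
```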
Note: If one does not have the correct plug-in installed, one must follow the instructions given by the links at this site. After installation, one must restart the computer and access this site again.
Viewing Metabolic Biochemical Structures
1. One can view general biochemicals at Klotho: Biochemical Compounds Declarative Database (http://www.ibc.wustl.edu/klotho/).
2. One can find a compound by selecting COMPOUND LISTING. One can search or scroll down the list to find a compound.
3. After selecting a compound, the molecule is displayed, and one has the choice of displays: Interactive Viewer, Static 3D GIF Image, or Contents. Note: one needs the correct Plug-in for the Interactive Viewer.
1. Locate the PROTEIN DATA BANK (PDB)(http://www.rcsb.org/pdb/index.html)
2. Select the 3DB Browser link.
3. Type in the name of the protein (e.g., cAMP protein kinase) in keyword and "search." If the structure of the protein has been solved, the server will return a list of structures. Below is an example of what one will receive if the structure has been solved.
------------------------------------------------------------------------
-------------------------------------------------------------------------
4. Select one of the files by highlighting it, and click on "Explore."
5. To view the structure, click on View Structure. To obtain the complete PDB file, select Download/Display and select the file format complete with coordinates--TEXT.
--------------------------------------------------------------------
---------------------------------------------------------------------
6. Now, if complete with coordinates was selected, then one is ready to download the document. A truncated file is shown below:
--------------------------------------------------------------------------------------------------
HEADER HYDROLASE (SULFHYDRYL PROTEINASE) 14-MAY-91 1PE6 1PE6 2
COMPND PAPAIN (E.C.4.3.22.2) COMPLEX WITH E-64-C 1PE6 3
SOURCE PAPAYA (CARICA $PAPAYA) FRUIT LATEX 1PE6 4
AUTHOR D.YAMAMOTO,K.MATSUMOTO,H.OHISHI,T.ISHIDA,M.INOUE, 1PE6 5
AUTHOR 2 K.KITAMURA,H.MIZUNO 1PE6 6
REVDAT 2 15-MAY-95 1PE6A 1 SEQRES 1PE6A 1
REVDAT 1 15-APR-93 1PE6 0 1PE6 7
JRNL AUTH D.YAMAMOTO,K.MATSUMOTO,H.OHISHI,T.ISHIDA,M.INOUE, 1PE6 8
JRNL AUTH 2 K.KITAMURA,H.MIZUNO 1PE6 9
JRNL TITL REFINED X-RAY STRUCTURE OF PAPAIN(DOT)E-64-C 1PE6 10
JRNL TITL 2 COMPLEX AT 2.1-ANGSTROMS RESOLUTION 1PE6 11
JRNL REF J.BIOL.CHEM. V. 266 14771 1991 1PE6 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 1PE6 13
[continued.....]
--------------------------------------------------------------------------------------------------
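Because a PDB file is plain text in which each line starts with a record name, its header can be inspected with a few lines of code before the file is opened in a viewer. The sketch below groups lines by their first word; it is a simplification that works on the whitespace-separated lines as displayed here, whereas real PDB files use fixed column positions, and the function name group_records is hypothetical.

```python
def group_records(lines):
    """Group the text of each line under its leading record name."""
    records = {}
    for line in lines:
        parts = line.split(None, 1)  # record name, then the rest of the line
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        text = parts[1].strip() if len(parts) > 1 else ""
        records.setdefault(name, []).append(text)
    return records


# A few header lines from the truncated 1PE6 file shown above.
sample = [
    "HEADER HYDROLASE (SULFHYDRYL PROTEINASE) 14-MAY-91 1PE6 1PE6 2",
    "SOURCE PAPAYA (CARICA $PAPAYA) FRUIT LATEX 1PE6 4",
    "AUTHOR D.YAMAMOTO,K.MATSUMOTO,H.OHISHI,T.ISHIDA,M.INOUE, 1PE6 5",
    "AUTHOR 2 K.KITAMURA,H.MIZUNO 1PE6 6",
]
records = group_records(sample)
print(sorted(records))
print(len(records["AUTHOR"]), "AUTHOR lines")
```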
7. Click on Save full entry to file link. Designate where to save the file. It is suggested that a simple name like "PKA protein" be used instead of something like PDB Short entry for 1PE6.
8. Later, one can view the molecule in a visualization application like WebLab ViewerLite or RasMol (RasMac).
How to view PDB coordinates in RasMol (RasMac)
RasMol
1. RasMol can be downloaded from http://www.umass.edu/microbio/rasmol/, and there is a tutorial on how to use RasMol at http://ahab.life.uiuc.edu/bioph410.html.
2. After downloading RasMol and installing it properly, open the application.
3. Pull down the FILE menu and select OPEN in order to read the desired text file. The file needs to be in TEXT or another acceptable format because the software reads the PDB text file and produces a three-dimensional structure which can be rotated freely. It is important that the user is in the MAIN WINDOW in order to view the molecule. The software allows the user to switch from a wireframe to a ribbon to a space-filling format in a matter of seconds to minutes depending on the size of the file and the speed of the computer. Two examples are shown below:
-----------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------
4. The molecular models can then be printed in color using a color ink-jet printer or a color laser printer. The structures can also be saved as PICT, GIF, and other formats, or they can be copied and pasted into word processing programs such as Microsoft Word (v5.0 or higher).
5. To view another molecule, one must CLOSE the current file and OPEN a new one.
Additional Information
BLAST search: main parameters (adapted from the NCBI page)
Program    Input Sequence    Database Used
BLASTN     DNA               DNA
BLASTP     protein           protein
BLASTX     DNA               protein
TBLASTN    protein           DNA
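The table can be read as a lookup from the pair (input type, database type) to a program name. The sketch below encodes only the four rows listed above; the function names and the letter-based guess of the sequence type are assumptions for illustration, and the guess would misclassify a peptide made up solely of the letters A, C, G, T, U, and N.

```python
# The four combinations listed in the table above.
BLAST_PROGRAMS = {
    ("DNA", "DNA"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("DNA", "protein"): "BLASTX",
    ("protein", "DNA"): "TBLASTN",
}


def guess_type(sequence):
    """Heuristic: call it DNA if only nucleotide letters are present."""
    letters = set(sequence.upper())
    return "DNA" if letters <= set("ACGTUN") else "protein"


def choose_program(query_type, database_type):
    """Look up the program for a query type and a database type."""
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_program(guess_type("MKKTILAIAIPALF"), "protein"))
print(choose_program(guess_type("ATGAAAAAGACC"), "protein"))
```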
HISTOGRAM
Display a histogram of scores for each search; default is yes.
DESCRIPTIONS (100 is good)
Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 100 descriptions. See also EXPECT and CUTOFF.
ALIGNMENTS (50 is good)
Restricts database sequences to the number specified for which high-scoring segment pairs (HSPs) are reported; the default limit is 50. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT and CUTOFF below), only the matches ascribed the greatest statistical significance are reported.
EXPECT (lower is more stringent)
The statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Fractional values are acceptable.
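The relationship between a bit score and its expected number of chance matches can be approximated as E = m x n x 2^(-S'), where m is the query length, n is the total database length, and S' is the bit score. The sketch below applies this to the first BLAST hit in Exercise #2; it is a simplified estimate that ignores the effective-length corrections BLAST applies, so it only approximates the reported value, and the function names are hypothetical.

```python
def expect_from_bits(bit_score, query_length, database_length):
    """Approximate E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def is_reported(e_value, expect_threshold=10.0):
    """A match is reported only if its E-value does not exceed EXPECT."""
    return e_value <= expect_threshold


# Top hit from Exercise #2: 110 bits, 351-letter query, 94,785,530 letters.
estimate = expect_from_bits(110, 351, 94_785_530)
print(f"{estimate:.1e}", is_reported(estimate))
```

The estimate falls within an order of magnitude of the 1e-23 printed in the report, which is the level of agreement one would expect from the uncorrected formula.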
CUTOFF (higher is more stringent)
Cutoff score for reporting high-scoring segment pairs. The default value is calculated from the EXPECT value (see above). HSPs are reported for a database sequence only if the statistical significance ascribed to them is at least as high as would be ascribed to a lone HSP having a score equal to the CUTOFF value. Higher CUTOFF values are more stringent, leading to fewer chance matches being reported. Typically, significance thresholds can be more intuitively managed using EXPECT.
MATRIX (for proteins, low PAM values are good for short sequences)
Specify an alternate scoring matrix for BLASTP, BLASTX, TBLASTN and TBLASTX. The default matrix is BLOSUM62 (Henikoff & Henikoff, 1992). The valid alternative choices include: PAM40, PAM120, PAM250 and IDENTITY. No alternate scoring matrices are available for BLASTN; specifying the MATRIX directive in BLASTN requests returns an error response.
STRAND
Restrict a TBLASTN search to just the top or bottom strand of the database sequences; or restrict a BLASTN, BLASTX or TBLASTX search to just reading frames on the top or bottom strand of the query sequence.
FILTER
Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or segments consisting of short-periodicity internal repeats, as determined by the XNU program of Claverie & States (Computers and Chemistry, 1993), or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
FASTA format description (adapted)
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
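Reading this format back is equally simple: every line starting with ">" opens a new record, and the lines that follow are joined into one sequence. The sketch below is a minimal reader under that description; the function name parse_fasta is hypothetical, and the sample text is shortened from the example above.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)  # sequence data for the current record
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


# Shortened version of the example entry (first residues of two lines).
sample_text = (">gi|532319|pir|TVFV2E|TVFV2E envelope protein\n"
               "ELRLRYCAPAGF\n"
               "QIWQKHRTSNDS\n")
for description, sequence in parse_fasta(sample_text):
    print(description, len(sequence))
```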
Sequence Format
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
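Sequences copied from database reports often carry position numbers and spacing, which is where the stray digits come from. The sketch below shows one way to strip them before submission; it takes the "remove" option described above rather than substituting N or X, and the function name clean_query is hypothetical.

```python
def clean_query(raw):
    """Remove digits and whitespace and map letters to upper case."""
    cleaned = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue  # drop position numbers, spaces, and line breaks
        cleaned.append(ch.upper())
    return "".join(cleaned)


# A numbered, blocked layout similar to a GenBank-style sequence listing.
numbered = "  1 atgaaaaaga ccattctggc\n 21 gattgcgatt"
print(clean_query(numbered))
```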
The nucleic acid codes supported are:
A --> adenosine            M --> A C (amino)
C --> cytidine             S --> G C (strong)
G --> guanine              W --> A T (weak)
T --> thymidine            B --> G T C
U --> uridine              D --> G A T
R --> G A (purine)         H --> A C T
Y --> T C (pyrimidine)     V --> G C A
K --> G T (keto)           N --> A G C T (any)
                           - gap of indeterminate length
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:
A alanine                    P proline
B aspartate or asparagine    Q glutamine
C cystine                    R arginine
D aspartate                  S serine
E glutamate                  T threonine
F phenylalanine              U selenocysteine
G glycine                    V valine
H histidine                  W tryptophan
I isoleucine                 Y tyrosine
K lysine                     Z glutamate or glutamine
L leucine                    X any
M methionine                 * translation stop
N asparagine                 - gap of indeterminate length
References
1. León, D.A., Miranda, J. and Uridil, S., Structural Analysis and Modeling of Proteins on the Web: An Investigation for Biochemistry Undergraduates, Journal of Chemical Education, June 1998.
2. Understanding Our Genetic Inheritance. The US Human Genome Project. The First Five Years FY 1991-1995, NIH publication No. 90-1590, April 1990.
3. Benton, D., Bioinformatics, TIBTECH, vol. 14, Aug. 1996.
4. Peruski, L.F. Jr. and Peruski, A.H., The Internet and the New Biology: Tools for Genomic and Molecular Research, ASM, Washington, D.C., 1997.