Computer Tutorial with Exercises


Summary

The purpose of these exercises is to give the workshop participant an initial exposure to the resources available on the Web. These simple exercises are a "step-by-step" approach for analyzing DNA and protein sequences and for visualizing biochemical and macromolecular structures. (1) All of the links explained or described here can be accessed at http://www.uib.es/depart/dba/MolBio/ .


Index





Exercise #1
Locating a DNA/RNA or Protein Sequence
                    Entrez
                    Swiss-Prot

Exercise #2
Comparing a New Sequence Against Sequence Databases
                    BLAST
                    FASTA

Exercise #3
Performing Multiple Sequence Alignments
                    MSA

Exercise #4
Performing Structure Predictions
                    ProtScale
                    SWISS-MODEL

Exercise #5
Visualization of Biochemical and Protein Structures
                    Chemicals with Pharmaceutical Activity
                    Klotho: Biochemical Compounds
                    Protein Data Bank 3D Browser
                    RasMol

Additional Information

References


Exercise #1: Locating a DNA/RNA or Protein Sequence

Searching for DNA sequences

There are a variety of ways to access sequences from the Web (4). One place to start is a site that acts as a hub for other sites, such as NCBI. The NCBI search tool not only searches databases for DNA (and protein) sequences, but also quickly finds scientific articles about the sequences one is searching for. Here is a short example of how to use NCBI to find the DNA sequence of the type I regulatory subunit of cAMP-dependent protein kinase in cows.

1. First access the site http://www.ncbi.nlm.nih.gov

2. Type in the keyword kinase and SEARCH.

3. ADD the keyword cAMP and SEARCH.

4. ADD the keyword regulatory and SEARCH.

5. ADD the keyword bovine and SEARCH.

(Ideally, one should receive only a few hits. Narrowing the search this way saves time compared with viewing the more than 21,696 entries returned by the original search with the keyword kinase alone.) One can now select RETRIEVE DOCUMENTS, and the following is displayed:
-------------------------------------------------------------------------------------------------------------
D83380
Sea urchin mRNA for catalytic subunit of cAMP-dependent histone kinase, complete cds
gi|1199787|dbj|D83380|SUHCAMPB [1199787]
(View GenBank report,FASTA report,ASN.1 report,Graphical view,1 MEDLINE link, 1
protein link, or 6 nucleotide neighbors )

D83379
Sea urchin mRNA for regulatory subunit of cAMP-dependent histone kinase, complete cds
gi|1199785|dbj|D83379|SUHCAMPA [1199785]
(View GenBank report,FASTA report,ASN.1 report,Graphical view,1 MEDLINE link, 1
protein link, or 1 nucleotide neighbor )

J05692
B.taurus cAMP-dependant protein kinase regulatory subunit RII-beta mRNA, complete cds
gi|163669|gb|J05692|BOVRIIB [163669]
(View GenBank report,FASTA report,ASN.1 report,Graphical view,1 MEDLINE link, 1
protein link, or 12 nucleotide neighbors )

K00833
bovine camp-dependent protein kinase, type 1 regulatory subunit (r-i) mrna
gi|163533|gb|K00833|BOVPKIRI [163533]
(View GenBank report,FASTA report,ASN.1 report,Graphical view,1 MEDLINE link, 1
protein link, or 5 nucleotide neighbors )

M82914
Bos taurus anchor protein regulatory subunit (AKAP75) gene, complete cds
gi|162637|gb|M82914|BOVAKAP [162637]
(View GenBank report,FASTA report,ASN.1 report,Graphical view,1 MEDLINE link, 1
protein link, or 1 nucleotide neighbor )
--------------------------------------------------------------------------------------------------------------
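The stepwise narrowing in steps 2 to 5 amounts to joining the keywords with a Boolean AND. The following is a minimal sketch in Python, assuming the NCBI E-utilities esearch address as a programmatic counterpart of the web form; the tutorial itself uses only the browser interface, and the function name build_entrez_query is hypothetical. The block only builds the query address and does not contact the server.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint, which is not part of this tutorial.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_entrez_query(keywords, db="nucleotide"):
    """Join the keywords with AND and return a search URL (no request is sent)."""
    term = " AND ".join(keywords)
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

# The four keywords added one at a time in steps 2 to 5.
url = build_entrez_query(["kinase", "cAMP", "regulatory", "bovine"])
print(url)
```

Each added keyword can only shrink the result set, which is why the hit count drops from thousands of entries to a handful.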
After the server returns a list of possible sequences, one can select the GenBank or FASTA report for the entry of interest. However, one will note that the entries returned are DNA or RNA sequences rather than protein sequences. To obtain the protein sequence, the DNA or RNA sequence can be submitted to the Translate Tool at the ExPASy site (http://www.expasy.ch/tools/dna.html), or one can go to another site that searches only for protein sequences.
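To make clear what a translation tool does with a nucleotide sequence, here is a minimal sketch in Python using the standard genetic code. It is not the ExPASy implementation; the function name translate and the choice to report unrecognized codons as X are assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(sequence, frame=0):
    """Translate a DNA or RNA string in one forward reading frame (0, 1 or 2)."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        # Codons containing ambiguity codes are reported as X (assumption).
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)

print(translate("ATGAAAAAAACC"))
```

A complete tool would also translate the other two forward frames and the three frames of the reverse complement, so that one can pick the frame with the longest open reading frame.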

Searching for Protein Sequences

1. Locate the link to the Swiss-Prot website.

2. Select the by description or identification link, enter the protein name in the description box on the page, and then submit the request.

The server will search for all amino acid sequences whose description matches the words entered. Once the server returns a list of possible proteins, one must select the species whose sequence is to be retrieved. Most sequences that are analyzed are from cow (bovine), mouse (murine), human (Homo sapiens), fly (Drosophila), Escherichia coli (E. coli), or yeast (Saccharomyces cerevisiae). For instance, if kinase bovine regulatory is submitted, the following is obtained:
--------------------------------------------------------------------------------------------------------------
Search in SWISS-PROT for: kinase bovine regulatory

(Release 35 and updates up to 13-Jun-1998)

Number of sequences found: 7
Note that the selected sequences can be saved to a file to be later retrieved; to do so, go to
the bottom of this page.
Please choose one of the following entries:

AK75_BOVIN
A-KINASE ANCHOR PROTEIN 75 (AKAP 75) (CAMP-DEPENDENT PROTEIN
KINASE REGULATORY SUBUNIT II HIGH AFFINITY BINDING PROTEIN)
(P75) - BOS TAURUS (BOVINE)

CD5R_BOVIN
CYCLIN-DEPENDENT KINASE 5 ACTIVATOR PRECURSOR (CDK5
ACTIVATOR) (TAU PROTEIN KINASE II 23 KD SUBUNIT) (TPKII
REGULATORY SUBUNIT) (P23) (P25) (P35) {GENE: NCK5A} - BOS TAURUS
(BOVINE)

KAP0_BOVIN
CAMP-DEPENDENT PROTEIN KINASE TYPE I-ALPHA REGULATORY CHAIN
{GENE: PRKAR1A} - BOS TAURUS (BOVINE)

[continued]
------------------------------------------------------------------------------------------------------------
3. Once the protein sequence and information are displayed, they can be saved as a text file from the browser.
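The entry names in the listing above end in a species code after the underscore (BOVIN for Bos taurus), so choosing a species can be sketched as a filter on that suffix. This is a minimal Python sketch; the function name is hypothetical, and the KAP0_HUMAN entry is added to the list only to show an entry being rejected.

```python
def entries_for_species(entry_names, species_code):
    """Keep the entry names whose species suffix matches species_code."""
    suffix = "_" + species_code.upper()
    return [name for name in entry_names if name.upper().endswith(suffix)]

# First three names are from the listing above; the last is added for illustration.
hits = ["AK75_BOVIN", "CD5R_BOVIN", "KAP0_BOVIN", "KAP0_HUMAN"]
print(entries_for_species(hits, "bovin"))
```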


Exercise #2: Comparing a New Sequence Against Sequence Databases

BLAST (Basic Local Alignment Search Tool)

This analysis is useful if one has a new DNA or protein sequence and wants to compare it to other sequences in GenBank. One can compare DNA with DNA, protein with protein, or a combination of the two.

1. Use a sequence from experimental data or obtain one from Exercise #1.

2. Locate the BLAST site at http://www.ncbi.nlm.nih.gov/BLAST/

3. One can choose between either ADVANCED or BASIC BLAST.

4. Paste the sequence into the appropriate box.

Now, one must choose between Blastn, Blastp, tBlastn, tBlastx, or Blastx (see Additional Information). If one inputs a protein sequence, then select Blastp. Important: If the sequence is not already in FASTA format, one must add a line containing the symbol ">", then a carriage return, and then the sequence.

Example of a new protein sequence:
----------------------------------------------------------------------------------------------------------
>
MKKTILAIAIPALFASAANAAVIYDKDGTTFDVYGRVQANYYGDTNEADSTAASGYKDVDGELKGSSRL
GWSGKIALNNTWSGIAKTEWQVSAENSANKFDSRHIYVGFDGTQYGKVIFGQTDTAFYDVLEPTDIFNEW
GSEGNFYDGRQEGQVIYSNAIGGFKGKVSYQTNDDQAVKVADVAGGIKTTVFPDVKRKYAYAAAVGYDFD
FGLGFNGGYAYSDLEGKTTDASGKKSEWALGAHYAINGFNFAGVYTQAEVKNDTTGYKDEGRGYELAATY
NVDAWTFLAGYNFKEGKENANLSGSSYEDLMDATLLGVQYSFTSKLKAYTEYKINGVSGKDDDFTVALQY
NF
----------------------------------------------------------------------------------------------------------
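Adding the ">" line by hand is easy to get wrong when a sequence has been copied with spaces or line breaks in it. The following minimal Python sketch wraps a raw sequence in the form shown above; the function name to_fasta and the 70-character line width are assumptions.

```python
import textwrap

def to_fasta(raw_sequence, description="", width=70):
    """Return the sequence with a '>' description line and wrapped sequence lines."""
    # Remove any spaces or line breaks that came along with the pasted sequence.
    sequence = "".join(raw_sequence.split()).upper()
    lines = textwrap.wrap(sequence, width)
    return ">" + description + "\n" + "\n".join(lines) + "\n"

print(to_fasta("mkktilaiai palfasaana", "new protein"))
```

Leaving the description empty gives the bare ">" line used in the example above.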
The default parameters can be altered once one knows how each
changes the search, but for now just select SUBMIT.
One will receive a list similar to the one shown below.
-------------------------------------------------------------------------------------
BLASTP 2.0.4 [Feb-24-1998]
Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.
Query=(351 letters)

Database: Non-redundant GenBank CDS
translations+PDB+SwissProt+SPupdate+PIR
313,805 sequences; 94,785,530 total letters

Searching..................................................done
 
 

                                                                 Score     E
Sequences producing significant alignments:                      (bits)  Value
gi|1465755 (U59311) OmpL [Photobacterium sp. SS9]                  110 1e-23
gnl|PID|d1015697 (D90775) Outer membrane protein F precursor (O... 109 2e-23
gi|3273514 (AF035618) porin OmpN [Escherichia coli]                109 2e-23
gi|148373 (M28296) outer membrane protein [Enterobacter cloacae]   104 8e-22
pdb|1PHO| Phosphoporin (Phoe)                                      102 4e-21
[continued...........]
---------------------------------------------------------------------------------------------------------

At the end of the list, one will obtain many alignments like the one shown below:

-----------------------------------------------------------------------------------------------------------
gi|1465755 (U59311) OmpL [Photobacterium sp. SS9]
Length = 341
Score = 110 bits (272), Expect = 1e-23
Identities = 99/366 (27%), Positives = 152/366 (41%), Gaps = 43/366 (11%)

Query: 2 KKTXXXXXXXXXXXXXXXXXXXYDKDGTTFDVYGRVQAN-YYGDTNEADSTAASGYKDVD   60
         KK                    Y  + ++  V GR +A     D N+ ++   +   +V
Sbjct: 3 KKLIALAVAAASISSVATAAEVYSDETSSLAVGGRFEARAVLADVNKDENVTNTASSEVS   62

Query: 61 GELKGSSRLGWSGKIALNNTWSGIAKTEWQVSAENSANKFDSRHIYVGFDGTQYGKVIFG  120
             K   R+  +GK  +   + G+   E + S+ +S N  ++R+ Y G  G+QYG++++G
Sbjct: 63 D--KSRVRINVAGKTDITEDFYGVGFFEKEFSSADSDND-ETRYAYAGV-GSQYGQLVYG  118

Query: 121 QTDTAFYDVLEPTDIFNEWGSE-GNFYDG--RQEGQVIYSNAIGGFKGKVSYQTNDDQAV 177
           + D + + +   TDI    G+E GN      R +  + Y   +G F          D+ V
Sbjct: 119 KADGSLGMLTDFTDIMAYHGNEAGNKLAAADRTDNNLSY---VGSFD-----LNGDNLTV 170

Query: 178 KVADVAGGIKTTVFPDVKRKYAYAAAVGYDFDFGLGFNGGYAYSDLEGKTTDASGKKSEW 237
           K   V GG              Y+AA  Y D  GLGF  GY   D +        K  +
Sbjct: 171 KANYVFGGSD--------ENEGYSAAAMYAMDMGLGFGAGYGEQDGQSSKNGNEDKTGKQ 222

Query: 238 ALGA-HYAINGFNFAGVYTQAE---VKNDTTGYKDEGRGYELAATYNVDAWTFLAGYNFK 293
           A GA  Y I+ F F+G+Y +     V ND     DE  GYE AA Y      F+  YNF
Sbjct: 223 AFGAISYTISDFYFSGLYQDSRNTVVNNDLI---DESTGYEFAAAYTYGKAVFITTYNF- 278

Query: 294 EGKENANLSGSSYEDLMDATLLGVQYSFTSKLKAYTEYKINGVSGK--------DDDFTV 345
              E++N SG  + DL D  +  + Y F    + Y  YK N +            D+F +
Sbjct: 279 --LEDSNASGDA-SDLRDSIAIDGTYYFNKNFRTYASYKFNLLDANSSTTKAQASDEFVL 335

Query: 346 ALQYNF 351
             +Y+F
Sbjct: 336 GARYDF 341

[continued]
------------------------------------------------------------------------------------------------------------------
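The Identities, Positives and Gaps figures in the report are simple ratios over the length of the alignment (366 columns here). The following minimal Python sketch shows the arithmetic. Rounding down is an assumption, chosen because it reproduces the three percentages printed in this report, and the function names are hypothetical.

```python
def percent(count, alignment_length):
    """Whole-number percentage, rounded down (assumption, see lead-in)."""
    return 100 * count // alignment_length

def count_identities(query_row, subject_row):
    """Count columns where both aligned rows carry the same residue."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for q, s in zip(query_row, subject_row) if q == s and q != "-")

# Figures taken from the alignment header above.
print(percent(99, 366), percent(152, 366), percent(43, 366))
```

Positives are counted the same way as identities, except that a column also counts when the two residues score above zero in the substitution matrix.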
 

5. If the gi|1465755 link is selected, then the sequence of this protein is retrieved.
Finally, one can narrow the search by changing the parameters on the
input page. (See Additional Information for a list of parameters.)
 

FASTA (FastA)

FASTA is a good sequence alignment tool when gaps are acceptable in a comparison over the whole sequence. Here is an example.

1. Access a FASTA site at http://www.fasta.genome.ad.jp/ , or for a more extensive query try accessing FASTA at http://bioweb.pasteur.fr/seqanal/interfaces/fasta.html.

2. Enter the protein sequence of interest, modify the parameters if needed (see
Additional Information), and submit the sequence.

The information received is similar to the truncated example below.

------------------------------------------------------------------------------------------------------
FASTA searches a protein or DNA sequence data bank
version 3.0t74 December, 1996

< 20 173 0:==
22 0 0: one = represents 103 library sequences
24 0 0:
26 3 1:*
28 12 16:*
30 74 95:*
32 352 368:===*
34 957 998:=========*
36 2001 2050:===================*
38 3842 3387:================================*=====
40 5132 4725:=============================================*====
42 6034 5776:========================================================*==
44 6180 6371:===========================================================*
46 6100 6489:===========================================================*
48 6106 6213:===========================================================*
50 5501 5669:====================================================== *
52 4724 4984:============================================== *
54 4160 4257:=========================================*
56 3499 3556:==================================*
58 2904 2919:============================*
60 2386 2365:======================*=
62 1851 1896:==================*
64 1505 1508:==============*
66 1194 1192:===========*
68 969 937:=========*

25083768 residues in 69113 sequences
statistics extrapolated from 50000 to 68920 sequences
Expectation fit: rho(ln(x))= 6.0807+/-0.000545; mu= 2.6432+/- 0.030;
mean_var=84.8171+/-17.696
Kolmogorov-Smirnov statistic: 0.0142 (N=29) at 42
 

FASTA (3.06 Sept, 1996) function (optimized, /bio/db/fasta/matrix/aa/blosum50 matrix) ktup: 2
join: 37, opt: 25, gap-pen: -12/ -2, width: 16 reg.-scaled

Scan time: 8.014

The best scores are: initn init1 opt z-sc E(69082)

sp:PHOE_CITFR OUTER MEMBRANE PORE PROTEIN ( 351) 192 77 393 450.1 2e-18
sp:PHOE_KLEPN OUTER MEMBRANE PORE PROTEIN ( 351) 213 78 381 436.6 1.1e-17
sp:PHOE_ECOLI OUTER MEMBRANE PORE PROTEIN ( 351) 274 99 377 432.1 2e-17
sp:OMPF_ECOLI OUTER MEMBRANE PROTEIN F PRE ( 362) 146 79 367 420.6 8.8e-17
[continued...]
-------------------------------------------------------------------------------------------------------
 



 
 

Exercise #3: Performing Multiple Sequence Alignments

There are several protein (and DNA) alignment tools on the Web. One common tool is ClustalW, which performs well and lets one refine the output. However, if one wants to perform a fast and simple alignment, one can access the MSA tool at http://www.ibc.wustl.edu/ibc/msa.html . Once this page has loaded, the following steps can be carried out.

         1.  Select the type of input (raw sequence, Swiss-Prot ID, GenPept ID, or PIR ID).
         2.  Input the sequence into the correct boxes.
         3.  Then select RUN MSA to begin alignment.
The output looks similar to the truncated version below:
-----------------------------------------------------------------------------------------------------------
Multiple Sequence Alignment Output

Running MSA with options:

Here is the Input Data File

>Seq 1
MKKTILAIAIPALFASAANAAVIYDKDGTTFDVYGRVQANYYGDTNEADSTAASGYKDVDGELKGSSRL...

>Seq 2
+++++MMKRNILAVI+VPALLVAGTA+NAAEIYNKDG+NKVDLYGKAV+GLHYFSKGNG+ENSYGGNGD...

>Seq 3
AEIYNKDGNKLDLYGKIDGLHYFSDDKDVDGDQTYMRLGVKGETQINDQLTGYGQWEYNVQANNTESSS...

Here is the Output from the Run

 *** Heuristic Multiple Alignment ***

132                      ****************231                    ********231

-MKKTILAIAIPALF-ASAANAAVIYDKDGTTFDVYGRVQA-NYYGDTNEADSTAASGYKDVDGELKGSSRLDGW
MMKRNILAVIVPALLVAGTANAAEIYNKDGNKVDLYGKAVGLHYFSKGNGENS-----Y-GGNGDMDTYARL-GF
----------------------AEIYNKDGNKLDLYGKIDGLHYFSDDKD-----------VDGD-QTYMRL-GV

SGKIALNNTWSGIAKTEWQVSAENSANKFDSR--HIYVGFDG-TQYGKV---IF-GQTDTAFYDVLEPTDIFNEW
KGETQINSDLTGYGQWEYNFQGNNSEGADAQTGNKTRLAFAG-LKYADVGSFDYDGRNYGVVYDALGYTDMLPEF
KGETQINDQLTGYGQWEYNVQANNTESSSDQA--WTRLAFAGDLKFGDAGSFDY-GRNYGVVYDVTSWTDVLPEF

DGS---EGNFYDGRQEGQVIYSNA-IGGFKGKVSYQTND-DQAVKVADVAGGIKTTVFPDVKRKYAYAAAVGYD-
GGDTAYSDDFFVGRVGGVATYRNSNFFGLVDGLNFAVQY------LDGKN-ERDTAR---RSNGDGVGGSISYE-
GGDTYGSDNFLQSRANGVATYRNSDFFGLVDGLNFALQYQGKNGSVSGEDGATNNGRGALKQNGDGFGTSVTYDI

FDDFGLGFNGGYAYSDLEGKTTD----ASGKKSEWALGAHYAINGFNFAG--------------------
YEGFGIVG--AYG---------------------------------------------------------
FDGISAGF?AYANSKRTDDQNQLLLGEGDHAETYTGGLKYDANNIYLATQYTQTYNDATRAGSLGFANK
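
Once sequences are aligned, one usually looks for columns in which every sequence carries the same residue. The following minimal Python sketch marks such columns with an asterisk. It is not part of the MSA program, the function name is hypothetical, and the three short rows are columns 23 to 30 of the first alignment block above.

```python
def conservation_line(aligned_rows):
    """Return a string with '*' under each fully conserved, gap-free column."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    marks = []
    for column in zip(*aligned_rows):
        conserved = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if conserved else " ")
    return "".join(marks)

rows = ["AVIYDKDG", "AEIYNKDG", "AEIYNKDG"]
for row in rows:
    print(row)
print(conservation_line(rows))
```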


 

Exercise #4: Performing Structure Predictions

Secondary Structure

1. Select and access a server that predicts structural properties.
There are several algorithms that predict structural properties. One of
the sites that one can use is ProtScale (http://www.expasy.ch/cgi-bin/protscale.pl ).
It is very important that one uses several algorithms for predicting
structural properties, because each method of analysis is based on different mathematical or
pattern relationships. The following is a list of common algorithms:

Garnier, Osguthorpe and Robson
Chou and Fasman
Deleage & Roux

2. Paste the raw protein sequence into the appropriate input box and submit the request.

One will obtain an immediate response through the Web site, usually in graphical format. Typically the Y-axis shows an arbitrary scale for each algorithm, and the X-axis shows the residue number. Here is an example of such a report:


 

If this plot were emphasizing alpha helical content, then the stretches of sequence
that have a high propensity for this secondary structure would be the regions from
residue 35 to residue 55 and from residue 125 to residue 140. One must keep in mind that the lower ranking regions of amino acids don't imply the absence of secondary structure; this plot simply shows the probability of alpha helical structure. In order to determine beta strands, turns, loops, etc., one needs to return to the original input page, change the analysis performed, and continue the analysis.
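
Profiles of this kind are typically produced by assigning each amino acid a value from a scale and averaging the values over a window that slides along the sequence. The following is a minimal Python sketch of that calculation, not the ProtScale code. It uses the Kyte and Doolittle hydropathy scale as a stand-in for a helix propensity scale, and the window size of 9 and the function name are assumptions.

```python
# Kyte and Doolittle hydropathy values, used here only as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def window_profile(sequence, scale, window=9):
    """Return (residue number, mean scale value) for each full window."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    values = [scale[aa] for aa in sequence.upper()]  # KeyError on unknown codes
    half = window // 2
    profile = []
    for center in range(half, len(values) - half):
        chunk = values[center - half:center + half + 1]
        profile.append((center + 1, sum(chunk) / window))  # 1-based numbering
    return profile

for position, score in window_profile("MKKTILAIAIPALFASAANA", KYTE_DOOLITTLE):
    print(position, round(score, 2))
```

Plotting the score against the residue number gives a curve like the one described above; changing the scale changes what property the peaks represent.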
 

3. Compare your collected data or plots

Next, one will need to compare the data obtained from the several
algorithms. Typically, if all or most of the analyses recognize the same regions
of sequence as having secondary structure, then it is highly probable that those regions
do have the predicted structure. Once one locates these regions, one must
decide exactly where each secondary structure begins and ends. This is where human judgement becomes important. Finally, it is useful to keep some type of table or master diagram that documents which regions of the protein sequence have specific secondary structure.
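
The comparison described here can be kept in a simple master table by taking a vote at each residue. The following minimal Python sketch assumes each algorithm's result has been written as one letter per residue (H for helix, E for strand, C for coil). The three prediction strings are made up for illustration, and the function name and the min_agree parameter are hypothetical.

```python
from collections import Counter

def consensus(predictions, min_agree=2):
    """Return the majority state per residue, or '?' where too few methods agree."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must cover the same residues")
    result = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        result.append(state if votes >= min_agree else "?")
    return "".join(result)

# Hypothetical output of three different algorithms for the same 8 residues.
predictions = ["HHHHCCEE", "HHHCCCEE", "CHHHCEEE"]
print(consensus(predictions))
print(consensus(predictions, min_agree=3))
```

The positions marked "?" are the boundaries where human judgement is needed to decide where each structure begins and ends.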

3-D modeling

1. One can receive a predicted 3-D structure of a protein sequence from SWISS-MODEL
(http://www.expasy.ch/swissmod/SM_TOPPAGE.html).

2. Click on First Approach Mode on the new frame of the new page.

3. Enter the necessary information requested on the First Approach Mode page.

Paste the amino acid sequence into the dialog box. Note: It is suggested that one select the Short mode because the Normal mode will send more information than the average user needs. The analysis can take from a few minutes to hours depending on the complexity of the sequence submitted. Many times one will not receive any results, or only a segment of the sequence will be modeled.

4. Once one receives an e-mail file from SWISS-MODEL, one can save that file as a text file and view it in RasMac or WebLab ViewerLite (http://www.msi.com/weblab/) as described in Exercise #5.
 
 



 
 

Exercise #5: Visualization of Biochemical and Protein Structures

Viewing Pharmaceutical Biochemicals

1. Locate the Chemicals with Pharmaceutical Activity database (http://www.chem.ox.ac.uk/mom/chemical-database/).

2. Search or select a chemical of interest from the alphabetical index.
For example, if penicillin V is selected, the following interactive
model is displayed.
-----------------------------------------------------------------------------------------------------
Penicillin_V
C16H18N2O5S : 350.389


--------------------------------------------------------------------------------------------------------------
Note: If one does not have the correct plug-in installed, one must follow the instructions given by the links at this site. After installation, one must restart the computer and access this site again.

Viewing Metabolic Biochemical Structures

1. One can view general biochemicals at Klotho: Biochemical Compounds Declarative Database (http://www.ibc.wustl.edu/klotho/).

2. One can find a compound by selecting COMPOUND LISTING. One can search or scroll down the list to find a compound.

3. After selecting a compound, the molecule is displayed, and one has the choice of displays: Interactive Viewer, Static 3D GIF Image, or Contents. Note: one needs the correct Plug-in for the Interactive Viewer.

Downloading Structures from the Protein Data Bank

1. Locate the PROTEIN DATA BANK (PDB)(http://www.rcsb.org/pdb/index.html)

2. Select the 3DB Browser link.

3. Type the name of the protein (e.g., cAMP protein kinase) in the keyword box and select "search." If the structure of the protein has been solved, the server returns a list of structures. Below is an example of what one will receive if the structure has been solved.

 ------------------------------------------------------------------------

-------------------------------------------------------------------------
 

4. Select one of the files by highlighting it, and click on "Explore."

5. If one wants to view the structure, click on View Structure. If one wants the complete PDB file, select Download/Display and select the file format complete with coordinates--TEXT.
--------------------------------------------------------------------

---------------------------------------------------------------------
6. Now, if complete with coordinates was selected, then one is ready to
download the document. A truncated file is shown below:
--------------------------------------------------------------------------------------------------
HEADER HYDROLASE (SULFHYDRYL PROTEINASE) 14-MAY-91 1PE6 1PE6 2
COMPND PAPAIN (E.C.3.4.22.2) COMPLEX WITH E-64-C 1PE6 3
SOURCE PAPAYA (CARICA $PAPAYA) FRUIT LATEX 1PE6 4
AUTHOR D.YAMAMOTO,K.MATSUMOTO,H.OHISHI,T.ISHIDA,M.INOUE, 1PE6 5
AUTHOR 2 K.KITAMURA,H.MIZUNO 1PE6 6
REVDAT 2 15-MAY-95 1PE6A 1 SEQRES 1PE6A 1
REVDAT 1 15-APR-93 1PE6 0 1PE6 7
JRNL AUTH D.YAMAMOTO,K.MATSUMOTO,H.OHISHI,T.ISHIDA,M.INOUE, 1PE6 8
JRNL AUTH 2 K.KITAMURA,H.MIZUNO 1PE6 9
JRNL TITL REFINED X-RAY STRUCTURE OF PAPAIN(DOT)E-64-C 1PE6 10
JRNL TITL 2 COMPLEX AT 2.1-ANGSTROMS RESOLUTION 1PE6 11
JRNL REF J.BIOL.CHEM. V. 266 14771 1991 1PE6 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 1PE6 13
[continued.....]
--------------------------------------------------------------------------------------------------

7. Click on the Save full entry to file link and designate where to save the file. It is suggested that a simple name like "PKA protein" be used instead of something like PDB Short entry for 1PE6.

8. Later, one can view the molecule in a visualization application like WebLab ViewerLite or RasMol (RasMac).
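
Because a PDB file is plain text, its annotation records can also be read without a viewer. The following minimal Python sketch groups lines by their leading record name. In a real PDB file the record name occupies fixed columns; splitting on the first run of spaces is a simplification that suits the space-collapsed listing shown above. The function name is hypothetical.

```python
def read_pdb_records(lines, wanted=("HEADER", "COMPND", "SOURCE", "AUTHOR")):
    """Collect the text of selected annotation records, keyed by record name."""
    records = {}
    for line in lines:
        parts = line.split(None, 1)  # record name, then the rest of the line
        if len(parts) == 2 and parts[0] in wanted:
            records.setdefault(parts[0], []).append(parts[1].strip())
    return records

# A few lines copied from the truncated file shown above.
sample = [
    "HEADER HYDROLASE (SULFHYDRYL PROTEINASE) 14-MAY-91 1PE6 1PE6 2",
    "SOURCE PAPAYA (CARICA $PAPAYA) FRUIT LATEX 1PE6 4",
    "AUTHOR D.YAMAMOTO,K.MATSUMOTO,H.OHISHI,T.ISHIDA,M.INOUE, 1PE6 5",
    "AUTHOR 2 K.KITAMURA,H.MIZUNO 1PE6 6",
    "JRNL REF J.BIOL.CHEM. V. 266 14771 1991 1PE6 12",
]
for name, texts in read_pdb_records(sample).items():
    print(name, texts)
```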

How to view PDB coordinates in RasMol (RasMac)

RasMol

1. RasMol can be downloaded from http://www.umass.edu/microbio/rasmol/, and there is a tutorial on how to use RasMol at http://ahab.life.uiuc.edu/bioph410.html.

2. After downloading RasMol and installing it properly, open the application.

3. Scroll down the FILE menu and select OPEN in order to read the text file desired. The file needs to be in a TEXT or an acceptable format because the software reads the PDB text file and produces a three-dimensional structure which can be rotated freely. It is important that the user is in the MAIN WINDOW in order to view the molecule. The software allows the user to switch from a wireframe to a ribbon to a space-filling format in a matter of seconds to minutes depending on the size of the file and the speed of the computer. Two examples are shown below:
-----------------------------------------------------------------------------------------------------------------------------------

-----------------------------------------------------------------------------------------------------------------------------------
4. The molecular models can then be printed in color using a color
ink-jet printer or a color laser printer. The structures can also be saved as PICT, GIF,
and other formats, or they can be copied and pasted into word processing programs
such as Microsoft Word (v5.0 or higher).

5. To view another molecule one must CLOSE the current file and OPEN a new one.
 
 
 



 
 

Additional Information

BLAST Search main parameters adapted from NCBI page

Program    Input Sequence    Database Used
BLASTN     DNA               DNA
BLASTP     protein           protein
BLASTX     DNA               protein
TBLASTN    protein           DNA
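
The table can be read as a lookup from the type of query and the type of database to the program name. A minimal Python sketch of that lookup follows; the function name is hypothetical, and only the four combinations listed in the table are covered.

```python
# (query type, database type) -> program, as listed in the table.
BLAST_PROGRAMS = {
    ("DNA", "DNA"): "BLASTN",
    ("protein", "protein"): "BLASTP",
    ("DNA", "protein"): "BLASTX",
    ("protein", "DNA"): "TBLASTN",
}

def choose_blast_program(query_type, database_type):
    """Return the program for this combination, or raise if it is not listed."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'DNA' or 'protein'") from None

print(choose_blast_program("protein", "protein"))
```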

HISTOGRAM
Display a histogram of scores for each search; default is yes.

DESCRIPTIONS 100 is good
Restricts the number of short descriptions of matching sequences reported to the number specified;
default limit is 100 descriptions. See also EXPECT and CUTOFF.

ALIGNMENTS 50 is good
Restricts database sequences to the number specified for which high-scoring segment pairs
(HSPs) are reported; the default limit is 50. If more database sequences than this happen to satisfy
the statistical significance threshold for reporting (see EXPECT and CUTOFF below), only the
matches ascribed the greatest statistical significance are reported.

EXPECT  lower is more stringent
The statistical significance threshold for reporting matches against database sequences; the default
value is 10, such that 10 matches are expected to be found merely by chance, according to the
stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is
greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds
are more stringent, leading to fewer chance matches being reported. Fractional values are
acceptable.
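
The effect of the EXPECT threshold can be sketched as a filter on the E value of each match. The following minimal Python example uses the five hits from the BLAST report in Exercise #2; the function name is hypothetical.

```python
# (identifier, score in bits, E value) from the report in Exercise #2.
hits = [
    ("gi|1465755", 110, 1e-23),
    ("gnl|PID|d1015697", 109, 2e-23),
    ("gi|3273514", 109, 2e-23),
    ("gi|148373", 104, 8e-22),
    ("pdb|1PHO|", 102, 4e-21),
]

def filter_by_expect(matches, expect=10.0):
    """Drop matches whose E value is greater than the EXPECT threshold."""
    return [m for m in matches if m[2] <= expect]

print(len(filter_by_expect(hits)))             # default threshold of 10
print(len(filter_by_expect(hits, expect=1e-22)))
```

With the default of 10 all five hits are kept, while a fractional threshold such as 1e-22 keeps only the three most significant ones.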

CUTOFF  higher is more stringent
Cutoff score for reporting high-scoring segment pairs. The default value is calculated from the
EXPECT value (see above). HSPs are reported for a database sequence only if the statistical
significance ascribed to them is at least as high as would be ascribed to a lone HSP having a score
equal to the CUTOFF value. Higher CUTOFF values are more stringent, leading to fewer chance
matches being reported. Typically, significance thresholds can be more intuitively managed using
EXPECT.

MATRIX for proteins, low PAM values are good for short sequences
Specify an alternate scoring matrix for BLASTP, BLASTX, TBLASTN and TBLASTX. The
default matrix is BLOSUM62 (Henikoff & Henikoff, 1992). The valid alternative choices include:
PAM40, PAM120, PAM250 and IDENTITY. No alternate scoring matrices are available for
BLASTN; specifying the MATRIX directive in BLASTN requests returns an error response.

STRAND
Restrict a TBLASTN search to just the top or bottom strand of the database sequences; or restrict a
BLASTN, BLASTX or TBLASTX search to just reading frames on the top or bottom strand of
the query sequence.

FILTER
Mask off segments of the query sequence that have low compositional complexity, as determined
by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993), or segments
consisting of short-periodicity internal repeats, as determined by the XNU program of Claverie &
States (Computers and Chemistry, 1993), or, for BLASTN, by the DUST program of Tatusov
and Lipman (in preparation). Filtering can eliminate statistically significant but biologically
uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or
proline-rich regions), leaving the more biologically interesting regions of the query sequence
available for specific matching against database sequences.

FASTA format description (adapted)
A sequence in FASTA format begins with a single-line description, followed by lines of sequence
data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK
MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL
LAAVEAQQQMLKLTIWGVK
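
The format described above is simple enough to read with a few lines of code. The following minimal Python sketch splits FASTA text into description and sequence pairs; the function name is hypothetical, and the short test string reuses the description line of the example with a shortened sequence.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records

text = ">gi|532319|pir|TVFV2E|TVFV2E envelope protein\nELRLRYCAPAGF\nALLKCNDADY\n"
for description, sequence in parse_fasta(text):
    print(description, len(sequence))
```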

Sequence Format
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid
codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single
hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and
* are acceptable letters (see below). Before submitting a request, any numerical digits in the query
sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic
acid residue or X for unknown amino acid residue).
The nucleic acid codes supported are:

A --> adenosine          M --> A C (amino)
C --> cytidine           S --> G C (strong)
G --> guanine            W --> A T (weak)
T --> thymidine          B --> G T C
U --> uridine            D --> G A T
R --> G A (purine)       H --> A C T
Y --> T C (pyrimidine)   V --> G C A
K --> G T (keto)         N --> A G C T (any)
                         - gap of indeterminate length
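
The ambiguity codes in this table can be expanded mechanically, for example to test whether a concrete DNA sequence fits a pattern written with them. The following is a minimal Python sketch with a hypothetical function name; treating U as T is a simplification for matching against DNA, and the gap symbol is not handled.

```python
import re

# Each code mapped to the bases it stands for, following the table above.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Turn a pattern with ambiguity codes into a regular expression string."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]  # KeyError on unsupported symbols
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

regex = iupac_to_regex("GAYR")
print(regex, bool(re.fullmatch(regex, "GATA")))
```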

For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino
acid codes are:

A  alanine                   P  proline
B  aspartate or asparagine   Q  glutamine
C  cysteine                  R  arginine
D  aspartate                 S  serine
E  glutamate                 T  threonine
F  phenylalanine             U  selenocysteine
G  glycine                   V  valine
H  histidine                 W  tryptophan
I  isoleucine                Y  tyrosine
K  lysine                    Z  glutamate or glutamine
L  leucine                   X  any
M  methionine                *  translation stop
N  asparagine                -  gap of indeterminate length
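
Sequences copied from database reports often carry residue numbers and spaces, which must be cleaned out before submission as described under Sequence Format. The following minimal Python sketch applies those rules to a protein query. The function name is hypothetical, and digits are simply removed here rather than replaced by letter codes.

```python
# The accepted amino acid codes from the table above.
AMINO_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")

def clean_protein_query(raw):
    """Upper-case the query, strip digits and whitespace, and check the codes."""
    cleaned = "".join(
        ch for ch in raw.upper() if not ch.isdigit() and not ch.isspace()
    )
    unsupported = sorted(set(cleaned) - AMINO_CODES)
    if unsupported:
        raise ValueError("unsupported codes: " + "".join(unsupported))
    return cleaned

print(clean_protein_query("1 mkktilaiai palfasaana 20"))
```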


References

1. León, D.A., Miranda, J. & Uridil, S., Structural Analysis and Modeling of Proteins on the Web: An Investigation for Biochemistry Undergraduates, Journal of Chemical Education, June 1998.

2. Understanding Our Genetic Inheritance. The US Human Genome Project. The First Five Years FY 1991-1995, NIH publication No. 90-1590, April 1990.

3. Benton, David, Bioinformatics, TIBTECH, vol. 14, Aug. 1996.

4. Peruski, L.F. Jr. and Peruski, A.H., The Internet and the New Biology: Tools for Genomic and Molecular Research, ASM, Washington, D.C., 1997.