Using BLAST For Performing Sequence Alignment
The BLAST algorithm, available on many public and private genomics-related Web
sites, is a widely used sequence similarity search algorithm developed at the National
Center for Biotechnology Information (NCBI; Altschul et al., 1994; Altschul et al.,
1997; Altschul et al., 2005). BLAST is popular partly because it is among the fastest
similarity search algorithms, and also because the NCBI has made it freely available to the
biological community, both for use by the individual researcher through the Web interface
surveyed in this unit, and for use by bioinformatics specialists who can download the
BLAST programs for installation on their own computers. NCBI has further enhanced
the value of BLAST searches performed on its Web site by adding extensive hyperlinks to
BLAST output, so that a BLAST search on the NCBI becomes a portal into a vast range
of related resources on the NCBI and other public databases (McGinnis and Madden,
2004).
This unit presents examples of using the BLAST algorithm on the NCBI and ENSEMBL
Web sites; there are probably hundreds of other public Web servers that include BLAST
searching features. Since the BLAST program is available for standalone installation to
be run on computers at one’s own institution, the reader should consult local experts
for information about any local BLAST servers that might be available. Readers who
work with confidential or proprietary sequence data may be prohibited from submitting
such sequences to computers outside their own institution. The details of other public
or private BLAST interfaces may differ from the NCBI and ENSEMBL Web servers
described in this unit, but the general principles will be the same.
There are several programs in the BLAST suite; the choice among them is dictated by
which type of query sequence one has (amino acid or nucleotide), which type of database
one wishes to search (amino acid or nucleotide), and whether one needs certain more-
advanced algorithms, such as PSI-BLAST. PSI-BLAST can be an extremely powerful
way of finding more distant relatives of a query sequence, but can also generate highly
misleading results unless used with care (Jones and Swindells, 2002), so the reader should
consult local experts on sequence similarity searching before using PSI-BLAST.
Historically many Web servers asked the user to select from a list of BLAST programs,
but today the trend is for the Web server to determine this automatically from the
types of sequences used for query and database. Should the reader encounter a BLAST
server that requires selecting from a list of BLAST programs, NCBI offers guidance for
selecting a BLAST program at the following URL:
http://www.ncbi.nlm.nih.gov/blast/producttable.shtml#pstab.
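The selection rule described above (type of query, type of database, and whether a translated comparison is wanted) can be sketched as a small lookup. This is only an illustration: the function name `choose_blast_program` and its `translated` flag are hypothetical, and the program pairings are those described elsewhere in this unit.

```python
def choose_blast_program(query_type, database_type, translated=False):
    """Return the BLAST program name for a query/database combination.

    query_type and database_type are "protein" or "nucleotide".
    translated=True requests a six-frame translated comparison of a
    nucleotide query against a nucleotide database (tblastx).
    The function and its arguments are illustrative, not an NCBI API.
    """
    if translated:
        if (query_type, database_type) != ("nucleotide", "nucleotide"):
            raise ValueError("tblastx needs a nucleotide query and database")
        return "tblastx"
    table = {
        ("protein", "protein"): "blastp",
        ("nucleotide", "nucleotide"): "blastn",
        ("nucleotide", "protein"): "blastx",
        ("protein", "nucleotide"): "tblastn",
    }
    try:
        return table[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"unknown combination: {query_type!r} vs {database_type!r}"
        ) from None
```

For example, `choose_blast_program("nucleotide", "protein")` returns `"blastx"`. More-advanced programs such as PSI-BLAST are deliberately left out of this sketch.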
In this unit, three Basic Protocols and one Support Protocol will be presented: Basic
Protocol 1, Blastp, Searching the NCBI protein databases using a protein query sequence;
Basic Protocol 2, Blastn, Searching the NCBI nucleotide databases using a nucleotide
query sequence; Basic Protocol 3, Searching the ENSEMBL human genomic nucleotide
database using a nucleotide query sequence; and Support Protocol 1, Downloading
protein and nucleotide sequences from the NCBI databases.
Identifying
Candidate Genes
in Genomic DNA
Contributed by Matthew D. Healy
Current Protocols in Human Genetics (2007) 6.8.1-6.8.23
Copyright © 2007 by John Wiley & Sons, Inc.
Supplement 52
BASIC PROTOCOL 1
BLASTP: SEARCHING THE NCBI PROTEIN DATABASES USING A PROTEIN QUERY SEQUENCE
In this protocol, the reader will perform a protein-protein BLAST search on the NCBI
Web server and then interpret the results. The choices of program parameters given in this
example are suggested as a starting point for general-purpose protein sequence analyses.
If one has available both protein and nucleotide sequences for a locus of interest, a protein
search is likely to be more biologically informative than is a nucleotide search.
Materials
A computer with Internet access running any standard Web browser such as Internet
Explorer or Mozilla. While a broadband connection is preferable, most features
of the NCBI BLAST interface perform acceptably over a dialup connection.
1. Open a Web browser and navigate to the NCBI main page at http://www.ncbi.nih.gov
or http://ncbi.nlm.nih.gov.
IMPORTANT NOTE: Since the NCBI BLAST interface opens new browser windows, if
the reader has pop-up blocking software installed it may be necessary to configure that
software so it will let the NCBI Web site open pop-up windows.
2. Near the top of the page, click on BLAST to open the main BLAST page, shown in
Figure 6.8.1.
Note that in recent years NCBI has revised the design of this page so that its current
organization is focused less on algorithmic details and more on biology. Figure 6.8.1
shows the NCBI BLAST page, as of May 2006.
Links on the left side of the page point to BLAST documentation and technical information
at the NCBI. Links under Nucleotide take the user to BLAST search submission forms
for searching nucleotide sequence databases at NCBI using a nucleotide query sequence.
Links under Protein take the user to BLAST search submission forms for searching protein
sequence databases at NCBI using a protein query sequence. Links under Translated are
for doing translated searches (nucleotide query versus protein database, protein query
versus nucleotide database, or nucleotide query versus nucleotide database with all six-
frame conceptual translations of query and database sequences compared as protein
sequences). Note that NCBI changes the design of their BLAST page fairly frequently, so
the page as viewed live over the Internet may differ somewhat from this screenshot. PSI-
BLAST and PHI-BLAST are more-advanced variations of the BLAST algorithm that are
useful in certain contexts. Links under Special and Meta are described in the Commentary
section of this unit.
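The "six-frame conceptual translation" mentioned above for translated searches can be illustrated with a short sketch. It assumes the standard genetic code and a plain A/C/G/T sequence; the function names are hypothetical, and codons containing any other character are reported as "X". It shows the idea only and is not how the BLAST programs themselves are implemented.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, -1, -2, -3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Calling `six_frame_translations("ATGGCCTAA")` gives six short peptides, one per frame; a translated search compares each of these as a protein sequence.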
3. For this protocol, click on the words “Protein-protein BLAST (blastp)” in the section
of the page headed Protein. This will open the protein-protein BLAST search page
at NCBI, which is divided into several sections. Scroll down to view its full content.
The next few steps in this protocol will briefly describe what to do in each section; further
details about the significance of the various options available for BLAST searching are
given below in the Critical Parameters section.
4. Figure 6.8.2 shows the first section of the protein-protein BLAST form, where
the user enters basic options that are needed for every search, such as the query
sequence to be used. Type the NCBI RefSeq accession number NP_006715.2 into
the Search box of this section, as shown in Figure 6.8.2, or the NCBI sequence
identifier gi|55956902.
Most parameter names in NCBI forms are hyperlinked to detailed help pages, so the
reader should get in the habit of mouse-clicking on keywords in all NCBI forms.
IMPORTANT NOTE: While a RefSeq accession number can be entered as shown in
Figure 6.8.2, an NCBI sequence identifier must be entered with the gi| prefix.
Figure 6.8.1 The BLAST homepage at the National Center for Biotechnology Information
(http://www.ncbi.nlm.nih.gov/BLAST/).
5. For a novel sequence, paste it in the FASTA or raw format (see the Support Protocol
for a description of the FASTA sequence format).
NOTE: The NCBI BLAST Web server does not allow pasting multiple query sequences,
so readers needing to perform batch searches should consult local experts to learn about
any such facilities that may be available on in-house systems.
Figure 6.8.2 The basic options portion of the NCBI protein-protein BLAST search page.
Figure 6.8.3 The advanced options portion of the NCBI protein-protein BLAST search page.
For this example, leave the “Choose database” option at its default setting of “nr” for
the nonredundant protein sequence database maintained at NCBI; this database includes
translations of all coding sequences in GenBank and RefSeq, plus curated public protein
sequence databases from several sources. The term “nonredundant” means that all
database entries having the identical sequence are combined into one entry with multiple
annotations. Finally, the Do CD-Search option should be left as it is; by default it will
be checked.
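The meaning of “nonredundant” given above can be made concrete with a small sketch that merges entries having identical sequences into one entry carrying all of their identifiers. The function name and the toy records are hypothetical; this illustrates the concept and is not NCBI's actual database build procedure.

```python
def collapse_identical(records):
    """Merge records that share an identical sequence.

    records: iterable of (identifier, sequence) pairs.
    Returns a list of (sequence, [identifiers]) in first-seen order.
    """
    merged = {}
    for identifier, sequence in records:
        key = sequence.upper()  # treat upper- and lower-case letters alike
        merged.setdefault(key, []).append(identifier)
    return list(merged.items())


# Made-up identifiers and sequences, for illustration only.
toy_records = [
    ("entryA", "MKTAYIAK"),
    ("entryB", "MKTAYIAK"),
    ("entryC", "MSLLTEVE"),
]
nonredundant = collapse_identical(toy_records)
```

Here `nonredundant` holds two entries: the first lists both `entryA` and `entryB` against the single shared sequence, and the second lists `entryC` alone.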
6. Figure 6.8.3 shows the second section of the protein-protein BLAST form, where the
user enters more advanced options affecting how the search will be performed. The
default values for these options are appropriate in most cases, and will be explained
later in this unit under Critical Parameters. For this example, the reader should make
all options match those shown in Figure 6.8.3.
Note that NCBI often makes minor revisions to these Web forms, so the default choices
for some options may be slightly different from what is depicted in the figures.
7. Finally, click the BLAST button in the basic options section of the BLAST form.
A new Web page should appear, with the words Your request has been
successfully submitted and put into the Blast Queue near the top.
Figure 6.8.4 The formatting options portion of the NCBI protein-protein BLAST search page.
8. For this example, choose formatting options as depicted in Figure 6.8.4 and click the
Format button. A new browser window will appear, and initially it will probably say
WAITING with some status information. When the search is done, this page will be
replaced by the search results.
IMPORTANT NOTE: Do not close the BLAST window while waiting for the search
to complete or while viewing the results. Once a search has been finished the reader
can try out different formatting options by changing the options in the BLAST window
and clicking the Format button. Since the search results are saved on the NCBI Web
site for several hours, the effects of changing the formatting options can be seen almost
instantly; the search does not need to be repeated.
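Readers who script their searches rather than using the Web form follow the same submit, wait, then retrieve cycle described above. The sketch below only builds request URLs and reads the status out of a response; it makes no network calls. The endpoint and the parameter names (CMD, PROGRAM, DATABASE, QUERY, RID, FORMAT_OBJECT) are assumptions drawn from NCBI's separately documented URL interface, not from this unit, and should be checked against current NCBI documentation before use. The helper names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL interface; verify before use.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def submit_url(query, program="blastp", database="nr"):
    """Build the URL that would submit a search (assumed CMD=Put)."""
    params = {"CMD": "Put", "PROGRAM": program,
              "DATABASE": database, "QUERY": query}
    return BLAST_URL + "?" + urlencode(params)


def status_url(rid):
    """Build the URL that would ask for a search's status (assumed CMD=Get)."""
    params = {"CMD": "Get", "FORMAT_OBJECT": "SearchInfo", "RID": rid}
    return BLAST_URL + "?" + urlencode(params)


def parse_rid(response_text):
    """Extract the request identifier from a submission response, if present."""
    match = re.search(r"RID\s*=\s*(\S+)", response_text)
    return match.group(1) if match else None


def parse_status(response_text):
    """Extract a status word such as WAITING or READY, if present."""
    match = re.search(r"Status\s*=\s*(\w+)", response_text)
    return match.group(1) if match else None
```

A script would fetch `submit_url(...)`, keep the identifier returned by `parse_rid`, and then fetch `status_url(rid)` at polite intervals until `parse_status` no longer reports WAITING.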
Figure 6.8.6 Graphical overview of where BLAST hits align to the query sequence. For the color
version of this figure go to http://www.currentprotocols.com.
Figure 6.8.8 Pairwise alignment of the query sequence gi|55956902|ref|NP_006715.2 and the hit
sequence gi|60688294|gb|AAH91661.1 from zebrafish (Danio rerio) as generated by the BLAST
Web site at NCBI.
At this point the reader should experiment with different settings of the formatting options
to see how the output page changes. In particular, the reader is advised to try all
possibilities for the “Alignment view” because each serves to emphasize different aspects of
the search results.
Figures 6.8.6, 6.8.7, and 6.8.8 show portions of the output from performing a BLAST
search as described in this example. These results and their interpretation are described
in more detail below in the Discussion section.
BASIC PROTOCOL 2
BLASTN: SEARCHING THE NCBI NUCLEOTIDE DATABASES USING A NUCLEOTIDE QUERY SEQUENCE
Materials
A computer with Internet access running any standard Web browser such as Internet
Explorer or Mozilla. While a broadband connection is preferable, most features
of the NCBI BLAST interface perform acceptably over a dialup connection.
1. Open a Web browser and navigate to the NCBI main page at http://www.ncbi.nih.gov
or http://ncbi.nlm.nih.gov.
NOTE: Since the NCBI BLAST interface opens new browser windows, if the reader has
pop-up blocking software installed it may be necessary to configure that software so it
will let the NCBI Web site open pop-up windows.
2. Near the top of the page, click on BLAST to open the main BLAST page. Click on
the words “Nucleotide-nucleotide BLAST (blastn)” in the section of the page headed
Nucleotide. This will open the nucleotide-nucleotide BLAST search page at NCBI,
which is divided into several sections.
Note that in recent years NCBI has revised the design of this page so that its current
organization is focused less on algorithmic details and more on biology. Figure 6.8.1
shows the NCBI BLAST page as of May 2006.
The next few steps in this Protocol will briefly describe what to do in each section; further
details about the significance of the various options available for BLAST searching are
given below in the Critical Parameters section.
3. Figure 6.8.9 shows the first section of the nucleotide-nucleotide BLAST form, where
the user enters basic options that are needed for every search, such as the query
sequence to be used. Type the NCBI sequence identifier gi|55956901 into the
Search box of this section, as shown in Figure 6.8.9.
IMPORTANT NOTE: While a RefSeq accession number can be entered without any
prefix, an NCBI sequence identifier must be entered with the gi| prefix as shown.
Figure 6.8.9 The basic options portion of the NCBI nucleotide-nucleotide BLAST search page.
Figure 6.8.10 The advanced options portion of the NCBI nucleotide-nucleotide BLAST search
page.
4. For a novel sequence, paste it in the FASTA or raw format (see the Support Protocol
below for a description of the FASTA sequence format). As shown in Figure 6.8.9,
the reader should choose to search the “nr” database, which is a comprehensive
nucleotide sequence database maintained by NCBI.
IMPORTANT NOTE: The NCBI BLAST Web server does not allow pasting multiple query
sequences, so readers needing to perform batch searches should consult local experts to
learn about any such facilities that may be available on in-house systems.
IMPORTANT NOTE: The default nucleotide database, like the default protein database,
is called “nr.” However, NCBI no longer maintains a nonredundant nucleotide sequence
database, so this database will in fact contain many identical sequences; the name “nr”
is maintained for historical reasons.
5. Figure 6.8.10 shows the second section of the nucleotide-nucleotide BLAST form, where
the user enters more advanced options affecting how the search will be performed.
The default values for these options are appropriate in most cases, and will be
explained later in this unit under Critical Parameters.
Until recently, NCBI had the “Low complexity” option under “Choose filter” checked by
default in both their protein query form and their nucleotide query form. Now the default
for protein queries is for this to be unchecked as shown in Figure 6.8.10. The author of
this unit prefers having the option unchecked for all searches.
Figure 6.8.11 The formatting options portion of the NCBI nucleotide-nucleotide BLAST search
page.
Figure 6.8.12 Graphical overview of hits relative to the query sequence on the NCBI BLAST
interface. For the color version of this figure go to http://www.currentprotocols.com.
Figure 6.8.13 List of hits from BLAST search output at NCBI.
Figure 6.8.14 Further down the list of hits, the raw scores begin getting smaller and the statistical
“E” scores are not as small, but these hits are still very strong hits.
6. Finally, click the BLAST button in the basic options section of the BLAST
form.
After the BLAST button has been clicked to submit a search, a new Web page should appear,
with the words Your request has been successfully submitted and
put into the Blast Queue near the top. The lower section of this new page, as
shown in Figure 6.8.11, has options that affect how the output from this BLAST search
will be formatted.
7. For this example, leave the formatting options at their default values and click the
Format button. A new browser window will appear, and initially it will probably say
WAITING with some status information. When the search is done, this page will be
replaced by the search results.
Figure 6.8.15 Pairwise alignment between gi|55956901|ref|NM_006724.2 and gi|50741727|ref|XM_419617.1
generated by a BLAST search submitted to the NCBI Web site.
IMPORTANT NOTE: Do not close the BLAST window while waiting for the search to
complete or while viewing the results. Once a search has been finished the reader can try
out different formatting options by changing the options in the BLAST window and clicking
the Format button. Since the search results are saved on the NCBI Web site for several
hours, the effects of changing the formatting options can be seen almost instantly—the
search does not need to be repeated.
At this point the reader should experiment with different settings of the formatting options
to see how the output page changes. In particular, the reader is advised to try all
possibilities for the “Alignment view” because each serves to emphasize different aspects of
the search results.
Figures 6.8.12, 6.8.13, 6.8.14, and 6.8.15 show portions of the output from performing
a BLAST search as described in this example. These results and their interpretation
are described in more detail below in the Discussion section of this unit.
BASIC PROTOCOL 3
SEARCHING THE ENSEMBL HUMAN GENOMIC NUCLEOTIDE DATABASE USING A NUCLEOTIDE QUERY SEQUENCE
In this protocol, the reader will use a nucleotide sequence, downloaded from NCBI as
described in Support Protocol 1, to perform a BLAST search of the ENSEMBL human
genome database and view the hits in their genomic context. The ENSEMBL database
provides a rich graphical environment for browsing the human genome; BLAST searching
provides a powerful way to locate regions of interest in this vast biological landscape.
In effect, one is performing an in silico hybridization using one’s query sequence as the
probe much as in pregenome days one might have used fluorescence-in-situ hybridization
(FISH) for mapping the chromosomal location of a probe. UNIT 6.9 of Current Protocols
in Human Genetics describes the ENSEMBL database and user interface in some detail;
here the focus is on its BLAST interface.
Materials
A computer with Internet access running any standard Web browser such as
Internet Explorer or Mozilla. Since ENSEMBL is very graphics-rich, a
broadband connection is strongly preferable to a dial-up connection.
1. Navigate to the ENSEMBL BLAST form at http://www.ensembl.org/Multi/blastview.
As shown in Figure 6.8.16, paste a nucleotide query sequence into the
text field near the top of the form. For this example, use the FASTA-formatted
nucleotide sequence for XM_849999 from NCBI (the Support Protocol explains how
to download this sequence from NCBI). As shown near the bottom of Figure 6.8.16,
the “dna queries” checkbox should be selected since a nucleotide sequence is being
used.
Figure 6.8.16 Sequence entry section of the BLAST submission page at the ENSEMBL Web
site (http://www.ensembl.org/Multi/blastview).
Figure 6.8.17 Specifying the search options on the ENSEMBL BLAST search page.
Figure 6.8.18 Retrieving BLAST output from the ENSEMBL BLAST search.
IMPORTANT NOTE: Although the ENSEMBL BLAST Web service has the ability to
obtain the query sequence from a database, it only recognizes a subset of NCBI RefSeq
accession numbers, and at the time of writing could not find XM_849999 in its database. In
the author’s experience, the ENSEMBL BLAST Web service does not handle numbers and
spaces in the sequence quite as well as does the NCBI BLAST Web service; it is therefore
recommended that the reader always submit the query sequence in FASTA format.
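When a sequence has been copied from a display that includes position numbers and spacing, it can be tidied into FASTA format before pasting. The following is a minimal sketch of that cleanup; the function name `to_fasta`, the record name, and the 60-character line width are illustrative choices rather than requirements of any server.

```python
import re


def to_fasta(name, raw_sequence, width=60):
    """Strip digits and whitespace from raw_sequence and return FASTA text.

    The first line is '>' followed by name; sequence lines are wrapped
    at `width` characters. Other characters are left untouched.
    """
    letters = re.sub(r"[\d\s]+", "", raw_sequence).upper()
    lines = [letters[i:i + width] for i in range(0, len(letters), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# A made-up fragment with position numbers and spaces, for illustration.
print(to_fasta("query1", "1 acgt acgt\n11 ac", width=4), end="")
```

The printed record starts with the line `>query1`, followed by the sequence letters only, in upper case.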
2. Figure 6.8.17 shows the portion of the ENSEMBL BLAST form where search
options are specified. As shown in Figure 6.8.17, for species pick “Homo sapiens.”
In the next section pick “Genomic sequence” and check “dna database.” Select the
BLASTN program under “Select the Search Tool.” Pick the “Distant homologies”
setting for “Search sensitivity.” Finally click the RUN button to start the search. The
Web browser should change to a page showing the status of the BLAST search, as
shown in Figure 6.8.18, but initially where the Figure shows the number of hits it will
probably say your search is still pending. Clicking the Retrieve button will display
status messages such as Job Queued until the search is done, at which point the
results will appear as shown in Figure 6.8.18. Clicking the View button will display the
results.
3. To check the status of the search, click the Retrieve button. When the search is
finished, a View button will appear as shown in Figure 6.8.18; to the left of this
button it will say 23 alignments, 12 hits or something similar. Click the
View button to view the hits.
For many readers, the most interesting part of this view will be the image showing the
chromosomal locations of the BLAST hits.
Figure 6.8.19 Karyotype view of hits from ENSEMBL BLAST search. Clicking one of the red
arrows will cause a pop-up menu to appear.
Figure 6.8.20 Click on the Chromosome 6 hit, then pick ContigView from this menu.
IMPORTANT NOTE: Unlike the NCBI BLAST Web server, which automatically updates
the status of a search until it is finished, the search status on the ENSEMBL BLAST
Web server will only be updated when the user clicks on the Retrieve button. Also, the
ENSEMBL BLAST server can take 15 min or longer when it is busy, so the user might
need to wait a while for results.
Figure 6.8.21 Part of the ENSEMBL view of Chromosome 6 in a region surrounding this BLAST hit. For
the color version of this figure go to http://www.currentprotocols.com.
SUPPORT PROTOCOL 1
DOWNLOADING PROTEIN AND NUCLEOTIDE SEQUENCES FROM THE NCBI DATABASES
Materials
A computer with Internet access running any standard Web browser such as
Internet Explorer or Mozilla. While a broadband connection is preferable, most
features of the NCBI search interface perform acceptably over a dialup
connection. Note that if the reader is downloading a large amount of sequence
data from NCBI, a fast connection is essential.
1. Open a Web browser and navigate to the NCBI main page at http://www.ncbi.nih.gov
or http://ncbi.nlm.nih.gov.
IMPORTANT NOTE: Since the NCBI Web interface opens new browser windows, if
the reader has pop-up blocking software installed it may be necessary to configure that
software so it will let the NCBI Web site open pop-up windows.
2. Near the top of the page is a two-part search form that says Search then a pull-down
menu of databases, then “for” and a place to enter a keyword. Select Nucleotide for
the type of database to search, and enter XM_849999 for the keyword.
3. Click the GO button.
4. In the search results page is a pull-down list of formats beside the word Display, and
the option Summary will be selected. Change the selection in this menu to FASTA
and release the mouse button. Immediately, the page format should change to display
this sequence in FASTA format as shown in Figure 6.8.22.
As can be seen from this example, this is a fairly simple sequence file format, which can
be used to store one or many sequences (however, the NCBI BLAST Web interface does
not allow submitting more than one query sequence at once).
IMPORTANT NOTE: The reader should copy this FASTA sequence into a text editor and
save a local copy of it, because this FASTA sequence will be needed to perform the BLAST
search of ENSEMBL as described in Basic Protocol 3 above.
A line beginning with the greater-than (>) character is a definition line or defline and
succeeding lines that do not begin with the > character have the sequence corresponding
to that defline. An NCBI defline will have one or more identifiers, each consisting of a
prefix and one or more accession number(s) and/or name(s), with vertical bars between
each prefix or accession number and the next prefix, accession number, or name.
Here are a few examples of deflines from NCBI; note that the format varies depending on
the database:
>gi|1705892|sp|P51677|CCR3_HUMAN
The above has NCBI “gi-number” 1705892, SwissProt accession number P51677, and
SwissProt name CCR3_HUMAN
>gi|37655185|gb|AAO65970.2|
The above has NCBI “gi-number” 37655185 and GenBank accession AAO65970.2
>gi|30581170|ref|NP_847899.1|
The above has NCBI “gi-number” 30581170 and RefSeq accession NP_847899.1
>gi|30581169|ref|NM_178329.1|
The above has NCBI “gi-number” 30581169 and RefSeq accession NM_178329.1
>gi|56554118|pdb|1U19|B
The above has NCBI “gi-number” 56554118 and is associated with Chain B of the Protein
Databank structure entry 1U19.
Figure 6.8.22 Sequence in FASTA format showing a fairly simple sequence file format, which
can be used to store one or many sequences.
Figure 6.8.23 A sequence in Genbank format.
The official documentation for these and other NCBI defline formats is a text file called
formatdb README that comes with NCBI’s version of the stand-alone BLAST package,
so if the reader’s institution has a local installation of NCBI BLAST then it will include
a copy of this document. If a local copy of this file is not available, the reader can easily
find dozens of copies on public BLAST servers by typing formatdb README into an
Internet search engine (at the time of writing, the author of this unit found 66 copies of
this file with the Google search engine).
IMPORTANT NOTE: FASTA sequence files from non-NCBI sources will also have > at
the start of each defline, but may not follow the NCBI defline format—there are dozens of
different defline formats used by public sequence databases.
5. Now, change the Display option to Genbank and release the mouse. The displayed
format should change to display this sequence in Genbank format as shown in
Figure 6.8.23.
The reader should experiment with other formatting and search options, because a
thorough familiarity with downloading NCBI data in various formats is an extremely useful
skill when working with genetic sequences.
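The NCBI defline layout described in step 4 (a > character followed by prefixes and accession numbers separated by vertical bars) lends itself to simple parsing. The sketch below handles only the shapes shown in the examples of this protocol: an optional gi number, then a database prefix, an accession, and an optional trailing name or chain. The function and the dictionary keys are hypothetical, and deflines from non-NCBI sources may not fit this pattern at all.

```python
def parse_ncbi_defline(defline):
    """Split an NCBI-style defline into its labelled parts.

    Returns a dict with any of the keys 'gi', 'database', 'accession',
    and 'extra' (a name or chain; empty string when absent).
    """
    if not defline.startswith(">"):
        raise ValueError("a defline must start with '>'")
    tokens = defline[1:].split(None, 1)  # identifier precedes any description
    if not tokens:
        raise ValueError("defline has no identifier")
    fields = tokens[0].split("|")

    parsed = {}
    position = 0
    if len(fields) >= 2 and fields[0] == "gi":
        parsed["gi"] = fields[1]
        position = 2
    if position < len(fields) and fields[position]:
        parsed["database"] = fields[position]
        parsed["accession"] = (
            fields[position + 1] if position + 1 < len(fields) else ""
        )
        parsed["extra"] = (
            fields[position + 2] if position + 2 < len(fields) else ""
        )
    return parsed
```

For the SwissProt example from step 4, the result has `gi` of `1705892`, `database` of `sp`, `accession` of `P51677`, and `extra` of `CCR3_HUMAN`.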
COMMENTARY
Background Information
Primary sequence comparison of genetic entities descended from a common ancestor
often, though not always, reveals more similarities than would be expected by chance.
Therefore, when one has in hand a nucleotide or amino acid sequence, one may
wish to identify other sequences that are similar to it. The observation that two
sequences have a significantly-greater-than-random degree of similarity provides
evidence for the hypothesis that the sequences are homologous.
Sequence similarity search algorithms such as BLAST are powerful and justly popular
because genetic entities that share a common evolutionary ancestor will often, but not
always, have related functions. Because humans share a common ancestor with other
mammals, sequence comparisons can extend experimental knowledge about a rat gene to a
homologous gene in humans. Because many genes have been duplicated in the course of
evolution, sequence comparisons can extend experimental knowledge about one rat gene to
other, related rat genes (and in many cases to human genes as well). Much of what is
thought to be known in genetics is based on such extrapolations from the small subset of
genes that have been thoroughly characterized in the laboratory to the thousands of other
putative genes that have not.
Like such other algorithms as FASTA, Fig. 6.8.1) has one option that can be conve-
BLAST is a “heuristic” approximation to nient: Retrieve Results. Each time one submits
the more rigorous Smith-Waterman algorithm. a search to the NCBI BLAST servers, the re-
While Smith-Waterman is generally regarded sponse includes a Request ID number which
as the “gold standard” algorithm for pairwise identifies the search. If one copies that num-
local alignment of two genetic sequences, it ber, it can be pasted into the Retrieve Results
consumes far too much CPU time to be practi- form at any time up to about 24 hr later and the
cal without the use of massively-parallel com- search results viewed. This can be especially
puting hardware. However, if the reader’s insti- handy if one has a search running when it is
tution has access to such facilities, then using time to go home: e-mail the RID number to
more rigorous algorithms may find similarities yourself and view the output from home.
that BLAST cannot find. The Special section near the bottom left
of the NCBI BLAST page (as shown in Fig.
Other Types of BLAST 6.8.1), lists several more variations on BLAST
The NCBI BLAST Web site offers a large that may be of particular interest to researchers
and increasing number of variations on the ba- in human genetics:
sic algorithm.
GEO BLAST
Translated BLAST GEO is the Gene Expression Omnibus,
Sometimes one wants to compare a nu- a publicly accessible database of transcrip-
cleotide query sequence against the protein tional profiling experiments (see Chapter 11)
databases, or vice versa. This can be particu- conducted with various types of microarrays.
larly useful as a means to find the correct read- One common way of entering this vast re-
ing frame in an uncharacterized nucleotide se- source is by pasting the sequence of a gene of
quence. The most commonly-used translated interest into GEO BLAST and searching for
search algorithms appear in the Translated section of the NCBI BLAST page as shown in
Figure 6.8.1 above. BLASTX compares a nucleotide query against protein databases, and
TBLASTN compares a protein query against a nucleotide database. The TBLASTX algorithm
compares a nucleotide query sequence against a nucleotide database, taking all possible
six-frame translations of both query and database into account. Use of TBLASTX was more
common some years ago when protein databases were much less comprehensive than they are
today; nowadays it is rare to encounter a coding sequence that has no matches in the
public protein sequence databases.
Note that translated BLAST generates the six-frame translations of the nucleotide
sequence, and then compares these as protein sequences. If a nucleotide sequence has many
insertions or deletions (indels), then translated BLAST may not work well, because a
single-nucleotide indel shifts the reading frame of everything downstream of it.
Specialized algorithms (such as TFASTX, TFASTY, and FRAMESEARCH) do exist for translated
searches that take single-nucleotide indels into account, but because these algorithms
are extremely CPU-intensive, few public Web servers offer them. If the reader often works
with such sequences, local experts should be consulted to learn what resources for such
analyses might be available at the reader’s institution. The Meta section near the bottom
right of the NCBI BLAST page (as shown in
experiments in which that gene was profiled.
SNP BLAST
The SNP databases at NCBI (see UNIT 7.11) contain information about millions of
single-nucleotide polymorphisms; this tool is a good way to find SNPs in a sequence of
interest. It is important in many cases to include upstream and downstream UTR regions
because many SNPs are in these regions. Another good way to look for SNPs in or near a
gene of interest is by searching the ENSEMBL database as described in Basic Protocol 3
above, since ENSEMBL contains SNP information.
The reader is cautioned that in many cases the public SNP databases identify the wrong
strand. After one has found SNPs in a region of interest, their orientation should be
checked by BLAST-searching ENSEMBL with the flanking sequences from the SNP record.
Blast 2 sequences
Sometimes a user wants to see the alignment between a query sequence and one specific
target sequence. One can of course BLAST against the entire database in the standard
manner and scroll down to the desired alignment, but in such cases it is more convenient
to paste both sequences into the Blast 2 Sequences tool. This tool also returns results
almost instantly, since it skips the search phase.
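The six-frame translation that translated BLAST performs on a nucleotide sequence, as
described in the discussion of translated searches above, can be sketched in a few lines
of Python. This is only an illustration of the idea, not NCBI's implementation: the
function names and the example sequence are hypothetical, the standard genetic code is
assumed, and any codon containing a character other than A, C, G, or T is simply
reported as X.

```python
# Illustrative sketch of six-frame translation (hypothetical helper names;
# assumes the standard genetic code and an unambiguous DNA alphabet).

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    # Codons with ambiguous or unexpected characters become 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the three forward (+1..+3) and three reverse (-1..-3) frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # Hypothetical 9-base example: ATG GCC TAA reads as M, A, stop in frame +1.
    print(six_frame_translations("ATGGCCTAA"))
    # {'+1': 'MA*', '+2': 'WP', '+3': 'GL', '-1': 'LGH', '-2': '*A', '-3': 'RP'}
```

Inserting or deleting a single base moves every downstream codon into a different frame,
which is one way to see why a sequence with many indels tends to align poorly in a
translated search.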
Screening for vector contamination
Contamination with extraneous sequences such as cloning vectors, PCR primers, adaptors,
and other oligonucleotides that may be used in laboratory work has caused enormous
problems with nearly all large genetic sequence databases, because two otherwise
unrelated sequences that happen to be contaminated with the same commercial vector
sequence can produce bogus matches in BLAST searches. Therefore, when one has a novel
sequence it is wise to check it for such contamination using VecScreen at NCBI. It is
extremely important to do this before submitting a sequence to any public database. Note
that VecScreen is designed only to flag possible contamination, so it does not report
which vector sequence is present. As the NCBI documentation for VecScreen notes, “this
can usually be deduced from the cloning history of the sequenced DNA,” since the user
should know which vectors were employed.
In recent years NCBI and other database curators have expended considerable effort on
scanning their databases for vector contamination. However, the problem has not been
completely eliminated. Also, some spurious annotations caused by misleading matches to
vector sequence may remain. Therefore, any time the reader gets surprising matches in a
BLAST search, it is wise to check for vector contamination.
Critical Parameters
The most important factor in a successful BLAST search is using the best query sequence
available. Short query sequences are particularly difficult to analyze by BLAST because
there are likely to be many spurious hits with a short query sequence.
The default choices for the various parameters governing the BLAST algorithm provided by
the NCBI Web site are recommended for most general-purpose use. There is extensive
documentation about all aspects of BLAST on the NCBI Web site. Many of these online
documents can be found by clicking links on the BLAST Web pages.
The documentation about BLAST at NCBI ranges from simple tutorials to very advanced
technical information, so if a particular document is at the wrong level, keep exploring.
If the reader cannot find the answer to his or her question in those documents, he or she
can email the NCBI help desk; the people at NCBI are very good about responding to
questions from users.
Choice of database
For most purposes, the “nr” protein and nucleotide databases will be the appropriate
choices. However, NCBI does make a number of other databases available for BLAST
searching. Since the list of these databases changes frequently, the best source of
information about them is the NCBI Web site.
Filtering
There is one change the author of this unit typically makes from the default choices made
by NCBI: he usually turns off low-complexity filtering. NCBI prefers to have filtering
enabled by default for nucleotide query sequences (and until recently also had filtering
enabled by default for protein query sequences) because if the query has many regions of
repetitive sequence, the BLAST algorithm can waste spectacular amounts of CPU time
examining spurious matches to those low-complexity regions. However, the filtering can
also cause BLAST to miss matches that are biologically significant. Also, such statistics
as the percentage identity between two sequences will be inaccurate when low-complexity
filtering is enabled. The reader should try some searches with filtering enabled and then
with filtering disabled, and compare the results.
One filtering option that is often useful is the “Mask lowercase” option. For example,
suppose the query sequence contains a zinc finger (ZNF) domain and BLAST finds many
otherwise unrelated sequences that also have ZNF domains. To remove these uninformative
hits, the ZNF domain can be changed to lower case and the search repeated with the “Mask
lowercase” option checked.
Matrix selection
For most protein searches, the default choice of BLOSUM62 is appropriate. If the reader
is only interested in finding very close relatives of the query sequence, it may be
appropriate to use the BLOSUM80 matrix instead, as this will reduce the number of hits
from sequences that are not highly similar to the query sequence. On the other hand, if
no hits are found with BLOSUM62, using BLOSUM45 will increase the sensitivity of BLAST to
distant matches.
For most purposes nowadays the BLOSUM matrices are preferred, although according to NCBI
the PAM30 or PAM70 matrix may reduce the number of spurious hits for a short query
sequence. However, BLAST searches with short query sequences almost always generate large
numbers
of spurious matches no matter what choices one makes about search parameters. There are
search algorithms that do somewhat better than BLAST at separating signal from noise for
such tasks as determining whether an oligonucleotide primer is likely to bind at
off-target sites. However, these algorithms are extremely CPU-intensive and may therefore
be impractical unless the reader’s institution has access to specialized bioinformatics
computing resources.
Note that when one changes the choice of matrix in the NCBI BLAST Web interface, the gap
penalty parameters automatically change. This is because there is no good theoretical
basis for choosing gap penalty values, but NCBI has found by experience that for each
matrix certain values usually work best. If an alignment has few gaps, then it matters
little what values are chosen for the gap parameters. But if there are many gaps, the
reader should experiment with changing the values of the gap parameters and see how
changing them affects the alignment. Regions where small changes of gap penalties cause
big changes of the alignment are suspect and should not be trusted.
Compositional adjustments
Regions where the amino acid or nucleotide sequence is atypical can cause spurious
matches to be given misleadingly low probability values. Figure 6.8.3 shows the “No
adjustment” option selected; NCBI has recently changed this, and, as of November 2006,
“compositional adjustment” is selected by default for protein query sequences. Changing
this option will cause the BLAST significance estimates to be adjusted to account for
compositional bias. According to the documentation of this feature at the NCBI Web site,
the adjusted significance value might be more accurate or it might be less accurate. The
author of this unit advises taking a conservative approach: if a hit has a borderline
E-score (between about 0.001 and 1), try the search both with and without compositional
adjustment and use whichever E-score is higher.
Taxonomic and ENTREZ filters
In some cases the reader will wish to focus on a specific taxonomic group; in this case
the “select from all organisms” option shown in Figures 6.8.3 and 6.8.10 can be changed
to restrict the output. Sometimes the “Limit by entrez query” option can be convenient;
for information about ENTREZ queries see UNIT 6.10 of Current Protocols in Human
Genetics.
Other parameters
In most cases the other BLAST parameters should be left unchanged from their default
values, with the exception of the Expect score. As discussed below under Anticipated
Results, the Expect score, or E-score, is an index of statistical significance. The
E-score is the number of hits similar in quality to this hit that would be expected to
have occurred by chance alone. Typically, somewhere between an E-score of 0.001 and an
E-score of 1, hits that represent true biology will be outnumbered by hits that are not
meaningful (theoretically, at an E-score of 1 junk hits should be about as common as real
hits, but for various reasons BLAST sometimes tends to overestimate the significance of
weak hits).
Therefore, if the reader is only interested in finding a few close relatives to the
sequence of interest, rather than in hundreds of distantly related sequences, he or she
can pick a smaller cutoff value for Expect and remove most of the hits.
A few years ago, the NCBI BLAST Web page had a plethora of other options; today these are
all subsumed by the text entry field near the bottom of Figure 6.8.3 labeled “Other
advanced.” Most readers will never need to type anything here unless advised to do so by
a local bioinformatics specialist. Table 6.8.1 lists common problems with BLAST searches
and their solutions.
Anticipated Results
The default output format on the NCBI BLAST Web site has several major sections. At the
top is some header information about BLAST, but the first portion of interest to a
typical biologist is the graphical summary of hits as shown in Figures 6.8.6 and 6.8.12.
This picture shows where the database sequences match the query sequence, and the color
of the line representing each hit indicates its quality (with red for the highest-scoring
hits). Frequently, as is the case in these examples, the hits will be clustered below
specific segments of the query sequence; these clusters probably correspond to functional
domains that are highly conserved. For a protein search it can be instructive to compare
these regions to conserved domain search results like those depicted in Figure 6.8.5.
Next is a table of hit sequences, with the highest-scoring hits coming first, as shown in
Figures 6.8.7, 6.8.8, 6.8.13, and 6.8.14. The Score column gives the similarity score as
calculated by BLAST using the matrix and
gap parameters that were chosen when the search was submitted. The “E” column gives the
expected number of hits with this, or a better, score that would appear purely by chance
when searching the NCBI ‘nr’ database; here all these values are zero, indicating that
these hits are highly unlikely to have occurred by chance. This table has extensive
hyperlinks to other information at NCBI; the reader should move the computer mouse
around, watching for places where the cursor becomes a pointing hand (indicating
hyperlinks), and right-click those links. By right-clicking, one can pick New Window and
see where the link points without losing one’s place in the BLAST output.
Table 6.8.1 Troubleshooting BLAST searches
In Figure 6.8.8, the Expect value of 4e-174 means the BLAST algorithm estimates the
number of random hits with this raw score expected to happen by chance when searching the
‘nr’ database is extremely small. This is, therefore, a highly significant hit. Note
that, while a hit with strong sequence similarity is probably also biologically
significant, the converse is not true. Proteins can have a significant biological
relationship but lack sufficient sequence similarity to be detected by BLAST or similar
algorithms. Therefore, from the LACK of significant sequence similarity between two
proteins one cannot necessarily conclude no biological relationship exists, only that
BLAST could not find evidence for such a relationship.
Notice that in Figure 6.8.15 these human and chicken sequences have about 82% identity of
sequence in this region; such a high degree of conservation since the last common
ancestor of mammals and birds suggests that this kinase must have an important biological
function.
To interpret such a table, it is very important to think about the biological relevance
of the hit descriptions. The annotations of sequences in large public databases vary
widely in quality; particularly vexing is the “transitive annotation” problem (Jackson et
al., 2003). Many annotations in the databases are not based on experimental evidence, but
on BLAST searches, and there can be many steps between the annotation and actual wet-lab
biology. Usually, however, one can discern some common biological themes when looking
over the table of hits from a BLAST search. For instance, most of the annotations in the
examples above refer to MAP kinase activity, so it would be reasonable to conclude that
the query sequences in these examples are kinases. However, if the descriptions did not
appear to share common biological themes, one should be cautious about drawing
conclusions from them.
Of the many numbers appearing in the results from a BLAST search, the most informative is
usually the “E value” or “Expect score.” Biologists often talk about percentage identity,
but when searching very large databases it is common to find short chance hits with very
high percentage identity, especially with nucleotide searches. The statistical
calculations behind the E-score take into account the length of the match, the size of
the database, and the typical distribution of raw BLAST similarity scores. Generally, the
smaller the E-score the less likely it is for this match to have occurred by chance. If
the E-score is more than about 0.01 that indicates a rather weak match; if the E-score is
less than about 0.00001 that indicates a strong match (note that many of the E-scores for
the examples given in this unit
are so small that BLAST presents them in scientific notation). The range between 0.00001
and 0.01 is where one must make a judgment call whether to trust this match; for this,
biological insight is needed.
The longest section of BLAST output presents individual pairwise alignments between the
query sequence and each hit sequence, as shown in Figures 6.8.7, 6.8.8, and 6.8.15. As
noted above, many biologists look first to the percentage identity value given above each
alignment, but those values can sometimes be misleading. It is important to become
familiar with alignments by looking at a number of examples. The reader should use query
sequences with which he or she is especially familiar to perform some BLAST searches
similar to the examples in this unit and invest some time scrolling through the
alignments. Note how the overall character of the alignments changes as one scrolls down
the report. Often there will be clear taxonomic patterns; first there will be strong hits
from species closely related to the one from which the query sequence was derived. After
the close relatives there may be weaker hits from more-distant species. Eventually there
will usually be a point where the descriptions no longer point to common biological
themes, but look more random (hits below this point are unlikely to be informative).
Time Considerations
In most cases, performing a BLAST search will only take a few minutes, but understanding
its biological implications can take much longer. To extract full value from the NCBI
BLAST interface one should spend considerable time exploring related data by clicking on
hyperlinks in the output. No amount of algorithmic sophistication can replace biological
insight.
The author recommends right-clicking on the hyperlinks and opening them in new browser
windows, so that the context provided by the BLAST output is not lost.
Literature Cited
Altschul, S.F., Boguski, M.S., Gish, W., and Wootton, J.C. 1994. Issues in searching
molecular sequence databases. Nat. Genet. 6:119-129.
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and
Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database
search programs. Nucleic Acids Res. 25:3389-3402.
Altschul, S.F., Wootton, J.C., Gertz, E.M., Agarwala, R., Morgulis, A., Schaffer, A.A.,
and Yu, Y.K. 2005. Protein database searches using computationally adjusted substitution
matrices. FEBS Journal 272:5099-5100.
Gilks, W.R., Audit, B., de Angelis, D., Tsoka, S., and Ouzounis, C.A. 2005. Percolation
of annotation errors through hierarchically structured protein sequence databases. Math.
Biosci. 193:223-234.
Jackson, D.G., Healy, M.D., Davison, D.B. 2003. Bioinformatics: Not just for sequences
anymore. Biosilico 1:103-111.
Jones, D.T. and Swindells, M.B. 2002. Getting the most from PSI-BLAST. Trends Biochem.
Sci. 27:161-164.
McGinnis, S., and Madden, T.L. 2004. BLAST: At the core of a powerful and diverse set of
sequence analysis tools. Nucleic Acids Res. 32:W20-W25.
Key References
Altschul et al., 1997. See above.
In the late 1990s NCBI made major improvements to the BLAST algorithms; this paper
summarizes how those improvements work and why they matter.
Korf, I., Yandell, M., and Bedell, J. 2003. BLAST. O’Reilly Media, Sebastopol, CA.
This is an entire book dedicated to the BLAST program, from a leading publisher of
technical books.
Woodford, N. 2004. Public databases: retrieving and manipulating sequences for beginners.
Methods Mol. Biol. 266:17-28.
This general discussion of how to use the major sequence databases can supplement Support
Protocol 1 of this Unit.
Contributed by Matthew D. Healy
Bristol Myers Squibb Pharmaceutical Research Institute
Wallingford, Connecticut
Current Protocols in Human Genetics Supplement 52