Last updated: May 2017
Tutorial on NCBI BLAST
Websites used in this tutorial:
BLAST homepage: http://blast.ncbi.nlm.nih.gov/Blast.cgi
The BLAST help page is accessible as a tab from the main BLAST page. It has information about all
the different BLAST programs and databases.
Activities:
Do a BLASTN and a BLASTP using the NCBI BLAST homepage.
Restrict BLAST searches to particular species/taxonomic groups
Use Blast2Sequences to compare sequences with different parameters
Follow ENTREZ links from BLAST output to find additional information about the query
sequence. This information includes conserved protein domains (CDD), taxonomic reports,
structure links and publications.
HINT: Sign in to your myNCBI account before starting your BLAST analyses. If you
accidentally close a window, you can always return to the BLAST homepage and find your
searches under the Recent Results tab.
Nucleotide BLAST:
In this section of the tutorial you will use Protein Kinase Inhibitor alpha (PKI alpha) from
mouse as the query in the search. Go to the NCBI homepage and search the Nucleotide
database for protein kinase inhibitor using the Refseq ID: NM_008862.
• Once the record is displayed, click on the FASTA link
• Highlight and copy (CTRL+C) the sequence, including the definition line that begins
with >NM_008862.
Figure 1: Mouse PKIalpha sequence in FASTA format
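If you prefer to check the copied record before pasting it, the FASTA layout (one definition line starting with ">", then wrapped sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta is hypothetical, and the bases in the example are placeholders, not the real NM_008862 sequence.

```python
def parse_fasta(text):
    """Split a single-record FASTA string into (definition_line, sequence)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA text must begin with a '>' definition line")
    header = lines[0][1:]            # drop the leading '>'
    sequence = "".join(lines[1:])    # join the wrapped sequence lines
    return header, sequence


# Placeholder record: the bases below are made up, NOT the real NM_008862 sequence.
example = """>NM_008862 placeholder definition line
ACGTACGTAC
GGTTAACC
"""

header, sequence = parse_fasta(example)
accession = header.split()[0]        # first word of the definition line
print(accession, len(sequence))
```

The check that the text starts with ">" mirrors the instruction above: if the definition line is not copied, BLAST has no title to put in the job title box.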
Now that we have a sequence to use for our search, open the BLAST homepage in your
browser.
• Click on the “Nucleotide BLAST” link.
• Paste the sequence into the text window at the top of the page.
BCHM 6280 2017 NCBI BLAST tutorial Page 1 of 11
• Notice that as you tab out of the text box for the sequence, the definition line from the
sequence is added to the job title box. Change this to a more descriptive title such as
mousePKIalpha_nr.
• If you log into myNCBI before initiating BLAST searches, all previous searches are
available for 36 hours.
• For Database, select Nucleotide collection (nr/nt).
• Under program selection, choose Highly similar sequences (megablast).
• To see the parameters of the different programs, click on the Algorithm parameters link
below the BLAST button. Note the Word size, Match/mismatch scores, etc.
• NOTE that any time a default parameter is not used, that parameter is highlighted in
yellow.
• Leave all other menu choices with default settings and launch the query by pressing
the BLAST button.
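The same choices made on the web form (program, database, megablast) can also be written down as request parameters. The sketch below only builds a URL and does not contact NCBI. It assumes the parameter names of NCBI's public BLAST URL API (CMD, PROGRAM, MEGABLAST, DATABASE, QUERY) and the database name "nt"; check the current BLAST help pages before relying on any of them. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Address of the BLAST homepage used in this tutorial (https form).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_megablast_request(query, database="nt"):
    """Return a submission URL for a megablast search (nothing is sent)."""
    params = {
        "CMD": "Put",          # assumed URL API command for submitting a search
        "PROGRAM": "blastn",   # nucleotide query against a nucleotide database
        "MEGABLAST": "on",     # "Highly similar sequences (megablast)"
        "DATABASE": database,  # assumed name of the nucleotide collection
        "QUERY": query,        # accession number or FASTA text
    }
    return BLAST_URL + "?" + urlencode(params)


url = build_megablast_request("NM_008862")
print(url)
```

Writing the settings down this way is mainly useful as a record of exactly which non-default options a search used.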
The next page returned is a status page with a Job Title that contains the name of your query
sequence or the job title you typed into the text box. The next 5 lines give you, in order:
• Request ID: an 11-character string which you can use to retrieve results later
• Status of the search
• Time submitted
• Current time
• Time since submission
After some time, the results will appear, or you can go on and do another search while waiting.
The BLAST output is divided into 4 sections. The bottom 3 can be expanded/collapsed with the
blue arrows to the left of each section.
• Header region which lists the parameters of the search
• Graphical summary
• Descriptions
• Alignments
Figure 2: BLAST output sections
Figure 3: BLAST graphical summary
Within the graphical output:
• Bars correspond to regions of similarity.
• Color coding is based on alignment scores.
• Information is displayed by moving the mouse over the bars.
• Bars are hot links to the actual alignments displayed below on the page.
Figure 4: BLAST output descriptions section
Within the descriptions section:
• The links under the Description column take you to the corresponding alignment
• The links under the Accession column take you to the Nucleotide record.
• The E-value is a statistical measure of the significance of the alignment
• The Query coverage indicates how much of the query sequence was aligned
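To make the Query coverage column concrete, here is a small sketch of one way such a percentage can be computed: count every query position that falls inside at least one aligned range, then divide by the query length. The function name and the example ranges are hypothetical, and BLAST's own calculation may differ in detail.

```python
def query_coverage(hsp_ranges, query_length):
    """Percent of query positions covered by at least one aligned range.

    hsp_ranges: list of (start, end) query coordinates, 1-based and inclusive.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(hsp_ranges):
        start = max(start, last_end + 1)   # skip positions already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return 100.0 * covered / query_length


# Two overlapping aligned ranges on a 100-base query (made-up numbers).
print(query_coverage([(1, 50), (40, 80)], 100))
```

Overlapping ranges are counted once, which is why a hit with several alignments to the same region does not exceed 100% coverage.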
Figure 5: BLAST alignments section
Within the alignments section:
• Each alignment with a score above the threshold is shown, up to the maximum number of alignments set for the search.
• The sequence listed is the one which matched the query sequence
• Just above the alignment are shown the Score, Expect value, Identities, Gaps and
which strands of the query and database sequence aligned.
• The Related Information gives you the option to view the matching sequence in
other databases, such as Gene.
NOTE: Change the program selection from Megablast to BLASTN and repeat the search. Note how
many more hits you get with BLASTN than with Megablast. If you have a relatively short sequence
(<500 bp) of unknown origin, then you would want to use BLASTN rather than Megablast to
identify the sequence. If you know there is a high likelihood of a close match to your sequence
in the database, then you would use Megablast.
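One reason BLASTN finds more hits than Megablast is the Word size parameter: a search is seeded by exact matches of at least that length. The sketch below is a simplified illustration, not the BLAST algorithm itself. It assumes two ungapped sequences that are already lined up, and it assumes word sizes of 28 for Megablast and 11 for BLASTN (check the Algorithm parameters section for the values in use). The sequences are made up.

```python
def longest_exact_run(query, subject):
    """Longest run of identical positions in two equal-length, ungapped,
    pre-aligned sequences."""
    best = run = 0
    for q, s in zip(query, subject):
        run = run + 1 if q == s else 0
        best = max(best, run)
    return best


def can_seed(query, subject, word_size):
    """True if some exact match is at least word_size long."""
    return longest_exact_run(query, subject) >= word_size


# Made-up 60-base query; the subject is a copy with 3 substituted bases.
query = "ACGTTGCAAGCTTAG" * 4
swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
subject = "".join(
    swap[base] if i in (14, 29, 44) else base
    for i, base in enumerate(query)
)

print(longest_exact_run(query, subject))
print("word size 11:", can_seed(query, subject, 11))
print("word size 28:", can_seed(query, subject, 28))
```

The two sequences are 95% identical, yet no exact match reaches 28 bases, so a search that needs a 28-base seed would not start an alignment here while one that needs 11 bases would.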
Protein BLAST:
Next we will do a BLASTP using the mouse PKI alpha protein sequence.
• Retrieve the protein record for mouse PKIa (NP_032888).
• Obtain the FASTA formatted sequence and copy it.
• Return to the BLAST home page.
• Choose the Protein BLAST link from the BLAST homepage.
• Paste the mouse protein sequence into the text box.
• Select the database Non-redundant protein sequences (nr)
• Click on the arrow to the left of algorithm parameters to expand that area. The
parameters should be set as shown in Figure 6.
• Make sure the algorithm blastp is selected and click the BLAST button at the bottom.
Figure 6: Default parameters for BLASTP
You should get a status page very similar to the one you saw with the BLASTN search. When
the search is done, an output page appears with the same sections as we saw with BLASTN,
but with two additions. First, if conserved protein domains were identified in the query
sequence, they are shown above the graphic for the alignments. Second, within the alignments
section, the top of each alignment has an additional count of the positives within the
alignment. Positives are aligned positions whose residue pair has a positive score in the
BLOSUM scoring matrix.
Figure 7: BLASTP alignment with scoring parameters.
If you want to know more about the conserved protein domains found in your query
sequence, click on the graphics shown at the top of the BLAST output page. This will take you
to the Conserved Domain database and display a summary page for that domain.
Figure 8: Conserved domain summary page.
On the conserved domain summary page, you can view the consensus sequence for that
conserved protein domain. By clicking on the PKI superfamily graphic, a page will open with
more comprehensive information about the conserved domain. For PKI, this is shown in
Figure 9.
Figure 9: Conserved Domain Database listing for PKI conserved domain.
Included on this page is the source for the Conserved Domain, which in this case is a PFAM
model. A description of the conserved domain is given. Click on the [+] links located to the
left of the Links, Statistics and Interactive view bars. Under the Links menu, the Taxa link
shows the largest taxonomic group in which this domain has been found. In the case of PKI,
the group listed is Euteleostomi. Click on “Euteleostomi” to view what taxonomic groups are
included. In this case, Euteleostomi represents bony vertebrates. The presence or absence of
the protein domain in various taxonomic groups is also a clue as to the species in which you
might expect to find homologous proteins.
To return to the BLAST output page, close the window that shows the Conserved Domain
information or click on the tab with the BLAST results.
Note the following features of the sequences in the match list:
• Description of matching protein with links to alignments
• Accession numbers with links to the records
• Scores in bits
• Statistical significance of each score, given as an E value.
• Hits are displayed if the E value is below the E threshold set (default is 10).
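The bit score and the E value in this list are linked by a standard relationship: E is approximately m × n × 2^(−S'), where S' is the bit score, m the query length and n the database length. The sketch below applies that relationship with made-up lengths. BLAST itself uses corrected "effective" lengths, so the numbers it reports will not match this simple version exactly. The function names are hypothetical.

```python
def expect_from_bit_score(bit_score, query_length, database_length):
    """Approximate E value: m * n * 2 ** (-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def is_reported(e_value, e_threshold=10):
    """Hits are displayed only if the E value is below the threshold."""
    return e_value < e_threshold


# Made-up example: a 100-residue query against 1,000,000 database residues.
for bits in (20, 40, 60):
    e_value = expect_from_bit_score(bits, 100, 1_000_000)
    print(bits, e_value, is_reported(e_value))
```

Each additional bit halves the E value, and the same bit score becomes less significant as the database grows.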
Follow the link to the highest scoring pair by clicking on the description and note:
• Complete description of the subject sequence.
• Exact sequence match to the probe.
• Score in bits with raw score in parenthesis.
• Expect value, shown in scientific (power of 10) notation.
• Identities = (100%), Positives = (100%).
• Alignment of query and subject sequences; this is a single, uninterrupted HSP. (HSP
stands for High Scoring Pair).
Scroll down to the next few hits:
• These are very similar to the query sequence.
• The further down the list, the less similar the sequences are to the query sequence
Note the identities and positives. Identities correspond to exact matches and positives are
similarities based on the scoring matrix used.
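The difference between identities and positives can be sketched in a few lines. The example below is illustrative only: the two aligned peptides are made up, and the scoring table holds just a handful of BLOSUM62 entries, enough for this toy alignment. A real calculation would load the full matrix.

```python
# A handful of BLOSUM62 off-diagonal scores, enough for this toy example.
BLOSUM62_SUBSET = {
    frozenset("KR"): 2,
    frozenset("IV"): 3,
    frozenset("DE"): 2,
    frozenset("FY"): 3,
    frozenset("DL"): -4,
}


def identities_and_positives(query, subject):
    """Count identical positions and positively scoring positions in two
    aligned sequences of equal length ('-' marks a gap)."""
    identities = positives = 0
    for q, s in zip(query, subject):
        if "-" in (q, s):
            continue                      # gaps are neither
        if q == s:
            identities += 1
            positives += 1                # identical residues always score > 0
        elif BLOSUM62_SUBSET[frozenset((q, s))] > 0:
            positives += 1                # conservative substitution
    return identities, positives


# Made-up aligned peptides.
print(identities_and_positives("KIVDDF", "RVVELY"))
```

Only one position is identical, but five of six are positives, because K/R, I/V, D/E and F/Y are substitutions between chemically similar residues.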
Taxonomy Report:
Up to now, you’ve only looked at the search summary report. The BLAST search also returns
4 other reports: taxonomy, distance tree, related structures and multiple alignments.
Scroll up to the top of the BLAST report and click on “Taxonomy report”. It is located just
above the Query information and is a single text link.
• Scroll down and look at the organisms represented in the taxonomy report. Note that
most of the hits fall within rodents and primates, with fewer matches to lagomorphs,
fish, birds and amphibians.
• Are there any sequences that do not fall under the taxonomy tree? What are they?
• Think about what is not represented in this list in terms of other eukaryotic species. Are
they absent because there are no homologs in those species or because of limitations of
the BLAST search? The default BLAST search returns only 100 matches. What is the highest
E value listed in the report? Is it still lower than the default cut-off of 10? If so, then to
confirm the lack of a match, you should redo the BLAST search and either increase the
number of hits returned or restrict the database by excluding groups that have a very
large number of matches.
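A quick way to see which groups dominate a capped hit list is to tally the organism names. The sketch below assumes that each protein hit title ends with the organism in square brackets, as is common in the nr database. The titles are invented for illustration and the function name is hypothetical.

```python
import re
from collections import Counter


def organisms_in_titles(titles):
    """Count hits per organism, reading the trailing [Organism name]."""
    counts = Counter()
    for title in titles:
        match = re.search(r"\[([^\[\]]+)\]\s*$", title)
        if match:
            counts[match.group(1)] += 1
    return counts


# Invented hit titles, for illustration only.
titles = [
    "example protein, isoform A [Mus musculus]",
    "example protein, isoform B [Mus musculus]",
    "example protein [Homo sapiens]",
    "title without an organism",
]
counts = organisms_in_titles(titles)
for organism, n in counts.most_common():
    print(organism, n)
```

If one or two organisms account for most of the 100 matches returned by default, that is a hint that more distant homologs may have been pushed off the list.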
Figure 10: List of taxonomic groups represented in BLAST search of Mouse PKI to NR database
There are several links on each line of the Tax BLAST Report. Within the Lineage report,
clicking on the Organism name (a) or Blast name (b) will open up the Taxonomy browser for
that group. Clicking on the number of hits (c) will open up a page with all of the protein
database matches from that group. Clicking on links in the (d) column will bring up the
alignments from that group.
If you scroll down further in the taxonomy report, you will see the Organism Report. This
report lists the score for each organism, listed from the most similar to least similar. Again,
you can click on a link under Description to see the alignment or click on an Accession link to
see the NCBI protein database record.
Scroll to the top of your blast output page. There should be a link “Edit and Resubmit”.
Click this and then scroll down to the parameters section. Click on the + button to open the
menu and change the Max Target Sequences from 100 to 1000. Click the BLAST button to
start the search.
• Once the search is complete, examine the number of hits, their respective scores and
look at the taxonomy report
• Do you now see matches to other non-vertebrate eukaryotes? How do the e-values
compare to matches within vertebrate species? Would you call these homologs?
You can also leave the database set to NR and then restrict it by taxonomic group or exclude
taxonomic groups. If you start to type vertebrate and pause, it will finish it for you. You can
also type in mouse or fungi or bacteria or Gallus gallus or use the TaxID number. Restricting
the search to a smaller database speeds up the search and leaves you fewer hits to sort
through. Restricting the database also allows you to determine if there are
matches in specific taxonomic groups that may have been missed because of the limit on the
number of search results returned.
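The two changes described here, a larger hit limit and a taxonomic restriction, can be written as request parameters. The sketch only assembles the parameters and does not contact NCBI. HITLIST_SIZE and ENTREZ_QUERY are assumed to be the relevant BLAST URL API parameter names, and the "txid...[Organism]" form is assumed Entrez query syntax; confirm both in the BLAST help pages. Excluding a group is not shown.

```python
from urllib.parse import urlencode


def restricted_blastp_params(query, taxid, max_targets=1000):
    """Parameters for a blastp search limited to one taxonomic group."""
    return {
        "CMD": "Put",                              # assumed submit command
        "PROGRAM": "blastp",
        "DATABASE": "nr",
        "QUERY": query,
        "HITLIST_SIZE": str(max_targets),          # Max Target Sequences
        "ENTREZ_QUERY": f"txid{taxid}[Organism]",  # restrict by TaxID
    }


# Mouse PKI alpha protein restricted to house mouse (taxid 10090).
params = restricted_blastp_params("NP_032888", 10090)
print(urlencode(params))
```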
Specialized BLAST searches
If you want to compare two sequences directly, you can do that via either the BLASTN or
BLASTP interfaces. From within the BLASTN page, click the box titled Align two or more
sequences located just below the Job Title section.
You will use three homologs of the Toll-like receptor 3 from human, mouse and zebrafish.
• Nucleotide records: NM_003265, NM_126166 & NM_001013269 (human, mouse & zebrafish)
• Protein records: NP_003256, NP_569054 & NP_001013287 (human, mouse & zebrafish)
The parameters that can be changed will depend on whether you want to align nucleotide or
protein sequences. We will start with nucleotide sequences.
• Start out by leaving everything as default.
• Align the human and mouse mRNAs first by putting their Refseq accession numbers
into the text box.
• You can also paste in the sequence in FASTA format or upload it from a text file.
• Click the Align button.
On the next page, there are the same sections as you get with any BLASTN search, but with the
addition of a section titled Dot Matrix View. Expand it by clicking on the + sign next to it. The
image should look like that shown in Figure 11. Note the query coverage of 92% with 81%
identity.
Figure 11: Dot Matrix view of the alignment of human and mouse TLR3 transcript sequences
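The idea behind a dot matrix can be sketched with a toy example: put one sequence along each axis and mark every position where a short word from one sequence exactly matches a word in the other. Similar regions then show up as diagonal lines. This is a generic illustration, not the method BLAST uses to draw its Dot Matrix View; the function names, word length and sequences are made up.

```python
def dot_matrix(seq_a, seq_b, word=3):
    """Return (i, j) pairs where seq_a[i:i+word] equals seq_b[j:j+word]."""
    dots = []
    for i in range(len(seq_a) - word + 1):
        for j in range(len(seq_b) - word + 1):
            if seq_a[i:i + word] == seq_b[j:j + word]:
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, word=3):
    """Draw the matrix as text: seq_a across, seq_b down, '*' for a match."""
    dots = set(dot_matrix(seq_a, seq_b, word))
    rows = []
    for j in range(len(seq_b) - word + 1):
        rows.append("".join(
            "*" if (i, j) in dots else "."
            for i in range(len(seq_a) - word + 1)
        ))
    return "\n".join(rows)


print(render("ACGTAC", "CGTA"))
```

A long unbroken diagonal means the two sequences align over that whole stretch; gaps or shifts in the diagonal correspond to insertions, deletions or diverged regions.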
Using the Edit and Resubmit button at the top of the report, return to the submission page and
align the human and zebrafish mRNAs using the default parameters. Did you get a significant
result? Change the program selection to More dissimilar sequences (discontiguous
megablast) and resubmit. Now look at the Dot Matrix view of the result. Notice that the region of
alignment is much shorter than for the human and mouse. These two sequences have
diverged a fair amount. Use the Edit and Resubmit button and change the program to Somewhat
similar sequences (blastn). Now the region of alignment is a bit longer.
Now align the human and mouse TLR3 protein and human and zebrafish TLR3 protein
sequences. Compare the results of the nucleotide alignments to those of the protein
alignments. Which is a more sensitive method to find related sequences?
Using Primer-BLAST
Primer-BLAST can be used to design PCR primers as well as check their specificity. The
interface is not quite as well developed as the other BLAST programs. The results are not
stored the way other BLAST results are kept and there is no link for editing and resubmitting
the same query. However, it does provide a quick way to check your primers. The primer
design algorithm is Primer3, which is the standard algorithm used by most sequence analysis
programs.
From the BLAST home page, scroll down and click on the link Primer-BLAST under the Specialized
searches section. It brings up a window with 4 sections that have a number of options. The
default options are set to design and test primers for human mRNAs. Before you start
designing primers, you should have some plan for what portion of the template DNA you want
to amplify. For this exercise, we are going to design primers to amplify a region of the mouse
Tlr3 gene. This transcript is ~4300 bp long and covers ~15 kb in the genome.
For example, if we want to amplify a region of the gene covering the 2nd or 3rd exon, we
can identify the exon/intron locations from the Refseq mRNA record (NM_126166). Scroll
down in this record and you will see the 5 exons defined by their positions relative to the
transcript. The first 4 exon positions relative to the entire transcript are:
Exon 1: 1-87
Exon 2: 88-282
Exon 3: 283-352
Exon 4: 353-805
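The exon table above can be used to check which exons a chosen stretch of the transcript touches. The sketch below uses only the four exon positions listed here. Exon 5 is not listed, so any position past 805 is reported as lying beyond exon 4 rather than being assigned to an exon. The function name is hypothetical.

```python
# Exon positions on the NM_126166 transcript, copied from the list above.
EXONS = {1: (1, 87), 2: (88, 282), 3: (283, 352), 4: (353, 805)}


def exons_overlapping(start, end, exons=EXONS):
    """Exon numbers that overlap transcript positions start..end (inclusive)."""
    return [
        number
        for number, (exon_start, exon_end) in sorted(exons.items())
        if start <= exon_end and end >= exon_start
    ]


# A stretch that starts in exon 1 and ends in exon 2 crosses a junction.
print(exons_overlapping(80, 250))
# A stretch entirely inside exon 3.
print(exons_overlapping(300, 340))
```

A range that returns more than one exon number crosses at least one exon/exon junction on the transcript.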
In this example, we will target a PCR product size of 300-500 bp. Now return to the
Primer-BLAST page and we will go through it by section.
Section: PCR Templates
This is where you define the template or target sequence for amplification. Our plan is
to amplify a 300-500 bp region near or across exons 2 and 3 of the mouse Tlr3 gene.
We also want the primers to span an intron/exon junction.
• Type in or paste in the Refseq accession number for the mouse Tlr3 gene
(NM_126166).
• Define a Range for the forward and reverse primers so that the product size
comes close to 300-500. Generally, when designing primers, you define a range
and let the primer-picking program select the actual primers as the algorithm
takes into account factors such as melting temperature which affect the
compatibility of the 2 primers to work in the same reaction. I would suggest a
range of ~200 bp for each primer. I used 80-250 for the forward primer and
600-850 for the reverse primer.
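Before submitting, it can help to work out which product sizes the two ranges allow. The sketch below assumes that each primer must lie entirely within its range and that the primers are 20 bases long. The primer length is an assumption for illustration, not a Primer-BLAST setting taken from this tutorial, and the function names are hypothetical.

```python
def product_size_bounds(fwd_range, rev_range, primer_len=20):
    """Smallest and largest product size allowed by two primer ranges.

    Ranges are (first, last) template positions, 1-based and inclusive.
    primer_len is an assumed primer length.
    """
    fwd_lo, fwd_hi = fwd_range
    rev_lo, rev_hi = rev_range
    latest_fwd_start = fwd_hi - primer_len + 1   # forward primer pushed right
    earliest_rev_end = rev_lo + primer_len - 1   # reverse primer pushed left
    smallest = earliest_rev_end - latest_fwd_start + 1
    largest = rev_hi - fwd_lo + 1
    return smallest, largest


def window_is_feasible(bounds, min_size, max_size):
    """True if some product size within the bounds falls in min..max."""
    smallest, largest = bounds
    return smallest <= max_size and largest >= min_size


bounds = product_size_bounds((80, 250), (600, 850))
print(bounds, window_is_feasible(bounds, 200, 600))
```

Under these assumptions the ranges 80-250 and 600-850 allow products of roughly 390 to 770 bp, so only part of that span fits a 200-600 bp size window. This is one reason primer design can fail for narrow ranges.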
Section: Primer Parameters.
• Set the PCR product size to a Min=200 and Max = 600
Section: Exon/intron selection
• Leave the Exon junction span set to No Preference. You can review the other
options.
Section: Primer Pair Specificity Checking parameters
• Enable search for primer pairs specific to intended PCR template
• Set Database to Refseq mRNA
• Change the Organism to house mouse (taxid:10090).
• Then select the Get Primers button.
• Depending on the transcript, you may see a dialog box/specificity warning page
that lists the different transcript isoforms that can be included in your search.
Select all of them and continue.
• Sometimes the primer design fails for the given parameters. Usually increasing
the range for both primers will give the program enough leeway to find
compatible primers. When you do get output, it should look something like
Figure 12.
Figure 12: Graphical Primer-BLAST output
Spend a few moments looking at the information provided in the Primer-BLAST output. Note
that the exon boundaries are indicated by black bars of alternate height placement. The red
bar indicates where the protein translation starts. All of the primer pairs span exon 3 and
extend into the protein coding region.
To test the specificity of already designed primers:
• On the exercise 3 home page is a set of primers to test
• Copy the forward and reverse primer sequences into the appropriate text boxes under
the section Primer Parameters.
• Under Primer Pair Specificity Checking parameters, change the Database to Refseq mRNA.
• In Organism, change to house mouse (taxid:10090) (or the organism you are working in).
NOTE: Make sure to leave the template, range boxes and Min. and Max. product sizes blank if
you want to test primers
• Click the Get Primers button to execute.
• What mRNAs were matched and what are the predicted product sizes?
• Repeat, changing the Database to Genome (reference assembly from selected
organisms).
• What organism was matched and what predicted product size did you get? Was there a
perfect match in the genomic sequence?
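The predicted product size reported for a primer pair follows from where the two primers bind. The sketch below shows the arithmetic on a made-up template with exact matches only. Primer-BLAST itself also considers mismatches and searches a whole database, which this toy version does not attempt. All sequences and function names here are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def predicted_product_size(template, forward, reverse):
    """Product size for exact primer matches, or None if either is missing."""
    fwd_start = template.find(forward)
    if fwd_start == -1:
        return None
    # The reverse primer pairs with the reverse complement of its own sequence.
    rev_site = reverse_complement(reverse)
    rev_start = template.find(rev_site, fwd_start + len(forward))
    if rev_start == -1:
        return None
    return rev_start + len(reverse) - fwd_start


# Made-up 34-base template and 6-base primers, for illustration only.
template = "AAAATTCGGCTA" + "C" * 10 + "GATTGCCAAGTT"
print(predicted_product_size(template, "TTCGGC", "CTTGGC"))
```

The product runs from the first base of the forward primer to the last base of the reverse primer's binding site, which is why the size includes both primers.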