Bioinformatics Intro

Uploaded by

NICHOLAS BARASA
Introduction to bioinformatics

What is bioinformatics?

• an emerging interdisciplinary research area

• deals with the computational management and analysis of
biological information: genes, genomes, proteins, cells,
ecological systems, medical information, robots, artificial
intelligence...
The Core of Bioinformatics to date
•Relationships between sequence, 3D structure, and protein functions

[Figure: an example amino acid sequence, shown alongside its 3D structure and protein functions]
TDQAAFDTNIVTLTRFVM
EQGRKARGTGEMTQLLNS
LCTAVKAISTAVRKAGIA
HLYGIAGSTNVTGDQVKK
LDVLSNDLVINVLKSSFA
TCVLVTEEDKNAIIVEPE
KRGKYVVCFDPLDGSSNI
DCLVSIGTIFGIYRKNST
DEPSEKDALQPGRNLVAA
GYALYGSATMLV

•Properties and evolution of genes, genomes, proteins, metabolic
pathways in cells

•Use of this knowledge for prediction, modelling, and design

“The holy grail of bioinformatics”
GCTCCTCACTGTCTGTGTTTATTCTTTTAGCTTCTTCAGA
TCTTTTAGTCTGAGGAAGCCTGGCATGTGCAAATGAAG
TTAACCTAA...
> 500,000 genes sequenced to date

Expected number of unique protein structures: ~ 700-1,000
Basic concepts

• conceptual foundations of bioinformatics:
evolution
protein folding
protein function

• bioinformatics builds mathematical models of these processes
to infer relationships between components of complex
biological systems
Information processing in cells

[Diagram labels: nucleic acids; proteins; coding regions; regulatory sites; transcripts]

One-to-many mappings!
Context-dependence!
Global approaches: Toward a new Systems Biology

[Diagram: global cell state]
Genome
Genome activation patterns: transcriptomics
Protein population: proteomics
Organisation: tissue, cells, molecular complexes (imaging, EM, X-ray, NMR)

•How does the spatial and temporal organisation of living matter
give rise to biological processes?
Global approaches: Toward a new Systems Biology

[Diagram labels: Perturbation; Living cell; Dynamic response; Biological knowledge (computerised); Sequence information; Structural information; Bioinformatics; Mathematical modelling; Simulation; “Virtual cell”]
•Basic principles
•Practical applications
We do not yet know whether the information in the genome is sufficient to
reconstruct an entire biological system. Information on the building blocks is
not enough; information on their interactions is essential.

External environment

Internal environment

Metabolic net

Genetic networks

DNA hRNA mRNAs proteins


Bioinformatics in context

[Diagram: Bioinformatics at the centre, surrounded by:]
Mathematics/computer science
Genomics
Molecular biology
Biophysics
Molecular evolution
Ethical, legal, and social implications
Current challenges to users
• Potential hurdles:
methods are in flux and not fully developed
resources are scattered and heterogeneous

• Remedies:
Web resources
navigation guides
integration of tools and databanks

http://www.biochem.ucl.ac.uk/~nagl/bioinformatics.html
Biological databases
The challenge

(Boguski, 1999)

In 1995, the number of genes in the database started to exceed the number of papers
on molecular biology and genetics in the literature!
Data types
primary data (primary database):
sequence, e.g. DNA (AATGCGTATAGGC) or amino acid (DMPVERILEALAVE)

secondary data (secondary db):
secondary protein structure, e.g., alpha-helices, beta-strands
“motifs”: regular expressions, blocks, profiles, fingerprints

tertiary data (tertiary db):
tertiary protein structure
atomic co-ordinates; domains, folding units


Primary biological databases

• Nucleic acid
EMBL
GenBank
DDBJ (DNA Data Bank of Japan)

• Protein
PIR
MIPS
SWISS-PROT
TrEMBL
NRL-3D
International nucleotide data banks

GenBank (USA): NLM, NCBI
EMBL (Europe): EMBL, EBI
DDBJ (Japan): NIG, CIB
Coordination: International Advisory Meeting, Collaborative Meeting
Also shown: TrEMBL, NRDB
GenBank file format
Swiss-Prot
SWISS-PROT file format
Other primary protein databases

• TrEMBL (translated EMBL) in SWISS-PROT format
rapid access to sequence data from genome projects
computer-annotated supplement to SWISS-PROT
translations of all coding sequences (CDS) in EMBL

• SP-TrEMBL

• REM-TrEMBL: immunoglobulins, T-cell receptors, short fragments,
synthetic and patented sequences
Other primary protein databases

The Protein Information Resource (PIR)

• integrated system of protein sequence databases and derived
related databases, e.g., alignment databases
• rapid searching, comparison, and pattern matching of protein
sequences
• retrieval of descriptive, bibliographic, feature, and
concurrent cross-reference information
• aims to be comprehensive and consistently annotated
PIR: related databases

NRL-3D Sequence-Structure Database

• produced by PIR from sequence and annotation information
extracted from three-dimensional structures in the Protein
Databank (PDB)

• allows keyword and similarity searches


PIR: related databases

PATCHX integrated with PIR

• a non-redundant database of protein sequences produced by MIPS, the
European branch of PIR-International

The PIR Protein Sequence Database and PATCHX together provide the
most complete collection of protein sequence data currently available in
the public domain.
Composite protein sequence dbs
NRDB      OWL       MIPSX (PIR+PATCHX)   SP+TrEMBL
PIR       PIR       PIR                  TrEMBL
SP        SP        SP                   SP
PDB       GenBank   MIPSOwn
GenPept   NRL-3D    NRL-3D
                    MIPSH
                    PIRMOD
                    MIPSTrn
                    EMTrans
                    GBTrans
                    Kabat
                    PseqIP
OWL composite database

• By accession number
• By database code
• By text
• By sequence
• By title
• By author
• By query language
• By regular expression
Direct OWL access:
OWL is only released every 6-8 weeks

OWL Blast server


Two other useful sites
INFOBIOGEN-The Public Catalog of Databases
http://www.infobiogen.fr/services/dbcat/

KEGG-Kyoto Encyclopedia of Genes and Genomes


http://www.genome.ad.jp/kegg/
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to
computerize current knowledge of molecular and cellular biology in
terms of the information pathways that consist of interacting molecules
or genes and to provide links from the gene catalogs produced by
genome sequencing projects.
Sequence Retrieval System (SRS)

Database browser that allows users to
•retrieve
•link
•access
entries from all interconnected resources.
Users can formulate queries across a range of different database types.
Sequence Alignment
What is sequence alignment?
➢ Sequence alignment is a way of arranging DNA, RNA, or protein
sequences to identify regions of similarity that may be a
consequence of functional, structural, or evolutionary
relationships between the sequences.
➢ Comparing two sequences (pairwise alignment) or more than two
(multiple alignment) means searching for a series of individual
characters or patterns that are in the same order in the
sequences.
➢ There are two types of alignment: local and global.
Global alignment vs Local alignment

➢ Global alignment attempts to match as much of the sequences as possible.
The tool for global alignment is based on the Needleman-Wunsch algorithm.

➢ Local alignment tries to find the regions with the highest density of matches.
The tool for local alignment is based on the Smith-Waterman algorithm.

➢ Both algorithms are derived from the basic dynamic programming algorithm.
Global alignment
L G P S S K Q T G K G S - S R I W D N
L N - I T K S A G K G A I M R L G D A

Local alignment
- - - - - - - T G K G - - - - - - - -
- - - - - - - A G K G - - - - - - - -
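
The global case can be sketched in a few lines of Python. This is a minimal illustration of the Needleman-Wunsch recurrence, not the implementation of any particular tool; the function name and the scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for readability.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b).

    Scoring values are illustrative assumptions, not taken from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))
```

Calling `needleman_wunsch("AGC", "AC")` aligns every character of both sequences and inserts a gap where the shorter sequence has no counterpart.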
Why do sequence alignment?
➢ Sequence alignment is useful for discovering structural,
functional and evolutionary information in biological sequences.
➢ Sequences that are very much alike may have similar secondary
and 3D structure, similar function, and likely a common ancestral
sequence. It is extremely unlikely that such sequences acquired
their similarity by chance.
-- For DNA molecules with n nucleotides, the probability of a chance
match is very low: $P = 4^{-n}$.
-- For proteins with n amino acids, the probability is even lower:
$P = 20^{-n}$.
➢ Sequence alignment makes the following tasks easier: 1. annotation
of new sequences; 2. modelling of protein structures; 3. design and
analysis of gene expression experiments
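
To get a feel for how quickly these chance probabilities shrink, here is a small Python sketch. It assumes, as the formulas above implicitly do, that all residues are equally frequent and independent; the function name is hypothetical.

```python
def chance_identity_probability(n, alphabet_size):
    """Probability that two random sequences of length n are identical,
    assuming equally frequent, independent residues (a simplification)."""
    return alphabet_size ** -n


# Compare DNA (4 letters) with protein (20 letters) for a few lengths.
for n in (5, 10, 20):
    p_dna = chance_identity_probability(n, 4)
    p_protein = chance_identity_probability(n, 20)
    print(f"n={n:2d}  DNA: {p_dna:.3e}  protein: {p_protein:.3e}")
```

Real sequences have unequal residue frequencies, so these values are only a rough guide.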
Methods of pairwise alignment

➢ Dot matrix analysis


➢ The dynamic programming (DP) algorithm
➢ Word methods
What is Dot matrix analysis
➢ Dot matrix analysis is a method for comparing two sequences
to look for possible alignments (Gibbs and McIntyre 1970)
➢ The algorithm for a dot matrix:
1. One sequence (A) is listed across the top of the matrix and the
other (B) is listed down the left side
2. Starting from the first character in B, one moves across the page
keeping in the first row and placing a dot in every column where the
character in A is the same
3. The process is continued until all possible comparisons between
A and B are made
4. Any region of similarity is revealed by a diagonal row of dots
5. Isolated dots not on a diagonal represent random matches
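
The steps above can be sketched directly in Python. The function names are hypothetical, and the text rendering is only one simple way to view the matrix.

```python
def dot_matrix(a, b):
    """Rows follow sequence B (left side), columns follow sequence A (top).
    A cell is True where the two characters are identical."""
    return [[x == y for x in a] for y in b]


def render(matrix, a, b):
    """Draw the matrix as text, with '*' for a dot and '.' for no match."""
    lines = ["  " + " ".join(a)]
    for ch, row in zip(b, matrix):
        lines.append(ch + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render(dot_matrix("GATTACA", "GATTACA"), "GATTACA", "GATTACA"))
```

Comparing a sequence with itself gives a full main diagonal; off-diagonal dots come from characters that recur elsewhere in the sequence.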
What can Dot matrix analysis do?

➢ Detection of matching regions can be improved by filtering
out random matches; this can be achieved by using a sliding
window
➢ It can be used to assess repetitiveness in a single
sequence, such as direct and inverted repeats within the
sequence
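
A sliding-window filter can be sketched as follows. The window size and the match threshold are assumptions picked for the example, and placing the dot at the start of the window is one of several possible conventions.

```python
def windowed_dot_matrix(a, b, window=3, min_matches=3):
    """Place a dot at (row i, column j) only if at least `min_matches` of
    the `window` characters starting at b[i] and a[j] are identical.
    Rows follow B, columns follow A."""
    dots = [[False] * len(a) for _ in b]
    for i in range(len(b) - window + 1):
        for j in range(len(a) - window + 1):
            matches = sum(b[i + k] == a[j + k] for k in range(window))
            if matches >= min_matches:
                dots[i][j] = True
    return dots
```

Passing the same sequence as both `a` and `b` highlights direct repeats as diagonals parallel to the main diagonal.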
Dynamic programming algorithm

➢ The approach compares every pair of characters in the two sequences
and generates an alignment that is optimal under the chosen scoring system.
➢ The method can be useful in aligning nucleotide to protein sequences.
➢ The method is computationally demanding: the dynamic programming
recursion fills a table whose size grows with the product of the two
sequence lengths.
➢ Algorithmic improvements as well as increasing computer capacity
make it possible to align a query sequence against a large DB in a few
minutes.
➢ Two approaches for dynamic programming: top-down and bottom-up.
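
As an illustration of the bottom-up approach, the sketch below computes the best local alignment score in the style of Smith-Waterman, filling the table row by row. The scoring values and the function name are assumptions; only the score is returned, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, computed bottom-up one row at a time.
    Scoring values are illustrative assumptions."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Keeping only two rows is enough when the score alone is needed; recovering the alignment requires the full table or extra bookkeeping.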
The procedure of the dynamic programming algorithm
➢ The alignment procedure depends upon a scoring system based on the
probability that:
1) a particular amino acid pair is found in alignments of related proteins
($p_{xy}$);
2) the same amino acid pair is aligned by chance ($p_x p_y$);
3) introduction of a gap would be a better choice as it increases the score.
➢ A substitution matrix is built from the (logarithm of the) ratio of the
first two probabilities. There are many such matrices; two of them, PAM and
BLOSUM, are discussed in the next few slides.
➢ The scores for introducing and extending a gap are chosen together with
the matrix and represent prior knowledge and some assumptions.
One of them is quite simple: if the penalty for a gap is too high, a
reasonable alignment between slightly different sequences will never be
achieved, but if it is too low, an optimal alignment is hardly possible.
Other assumptions are based on sophisticated statistical procedures.
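
The ratio of the two probabilities is usually turned into a log-odds score. The Python sketch below shows the arithmetic; the probabilities in the usage line are made-up numbers, the half-bit scaling (`scale=2.0`) is an assumption, and the gap penalty follows one common convention (opening cost plus a per-residue extension cost) with assumed default values.

```python
import math


def log_odds_score(p_xy, p_x, p_y, scale=2.0):
    """Rounded log-odds score: scale * log2(p_xy / (p_x * p_y)).
    Positive when the pair is seen more often than expected by chance."""
    return round(scale * math.log2(p_xy / (p_x * p_y)))


def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Score for a gap of `length` residues under an assumed affine scheme:
    a one-off opening cost plus an extension cost per residue."""
    return -(gap_open + gap_extend * length)


# Hypothetical frequencies: the pair occurs 4x more often than chance predicts.
print(log_odds_score(0.04, 0.1, 0.1), affine_gap_penalty(3))
```

A pair seen less often than chance predicts receives a negative score, which is why substitution matrices contain both positive and negative entries.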
Word methods
➢Word methods, also known as k-tuple methods, are heuristic
methods that are not guaranteed to find an optimal alignment
solution, but are significantly more efficient than dynamic
programming.

➢ The typical tools that use this method are BLAST and FASTA.
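
The core idea of a word method can be sketched as an index of k-letter words. This is a simplified illustration with hypothetical function names: it finds exact word matches only, whereas real tools add steps such as scoring similar words and extending the seeds.

```python
from collections import defaultdict


def build_word_index(sequence, k):
    """Map every k-letter word in `sequence` to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        index[sequence[pos:pos + k]].append(pos)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word is shared."""
    index = build_word_index(subject, k)
    seeds = []
    for q_pos in range(len(query) - k + 1):
        for s_pos in index.get(query[q_pos:q_pos + k], []):
            seeds.append((q_pos, s_pos))
    return seeds


print(find_seeds("TGKG", "LNITKSAGKGAIMR", k=3))
```

Because only shared words are examined, most of the full dynamic programming table is never computed, which is where the speed-up comes from.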
Why do we want to compare sequences?

Evolutionary relationships
• Phylogenetic trees can be constructed based on comparison of the
sequences of a molecule (example: 16S rRNA) taken from different
species
• Residues conserved during evolution play an important role

Prediction of protein structure and function


• Proteins which are very similar in sequence generally have similar
3D structure and function as well
• By searching a sequence of unknown structure against a database
of known proteins the structure and/or function can in many cases
be predicted
BLAST

BLAST (Basic Local Alignment Search Tool) allows rapid sequence
comparison of a query sequence against a database.

The BLAST algorithm is fast, accurate, and web-accessible.
Why use BLAST?

BLAST searching is fundamental to understanding the relatedness
of any favorite query sequence to other known proteins or DNA
sequences.

Applications include
• identifying orthologs and paralogs
• discovering new genes or proteins
• discovering variants of genes or proteins
• investigating expressed sequence tags (ESTs)
• exploring protein structure and function
Four components to a BLAST search

(1) Choose the sequence (query)

(2) Select the BLAST program

(3) Choose the database to search

(4) Choose optional parameters

Then click “BLAST”

Types of Blast searching

• blastp compares an amino acid query sequence against a
protein sequence database

• blastn compares a nucleotide query sequence against a
nucleotide sequence database

• blastx compares the six-frame conceptual protein translation
products of a nucleotide query sequence against a protein
sequence database

• tblastn compares a protein query sequence against a
nucleotide sequence database translated in six reading frames

• tblastx compares the six-frame translations of a nucleotide
query sequence against the six-frame translations of a
nucleotide sequence database.

by Bob Friedman
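
The six-frame translation used by blastx, tblastn, and tblastx can be sketched as follows. The sketch assumes the standard genetic code and plain A/C/G/T input; the function names are hypothetical, and any codon containing another character is translated as `X`.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Three frames on the given strand (+1..+3), three on its reverse
    complement (-1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For example, `six_frame_translations("ATGGCC")` returns a dictionary with six entries, one per reading frame.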
Routine BlastP search

• Query: FASTA formatted text or Genbank ID#
• Protein database
• Run
by Bob Friedman
E value Threshold
• Alignments will be reported with E-values less than or
equal to the Expect value threshold
• Setting a larger E threshold will result in more reported hits
• Setting a smaller E threshold will result in fewer reported hits

Kerfeld and Scott, PLoS Biology 2011
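
The effect of the threshold can be sketched on a list of hits. The hit records below are invented for illustration, and the dictionary layout is an assumption rather than an actual BLAST output format.

```python
def filter_hits(hits, e_threshold=10.0):
    """Keep hits whose E-value is at or below the threshold, best first."""
    kept = [hit for hit in hits if hit["evalue"] <= e_threshold]
    return sorted(kept, key=lambda hit: hit["evalue"])


# Hypothetical hits, not real search results.
example_hits = [
    {"subject": "seqA", "evalue": 2e-50},
    {"subject": "seqB", "evalue": 0.003},
    {"subject": "seqC", "evalue": 7.5},
]

for threshold in (10.0, 0.01, 1e-10):
    names = [hit["subject"] for hit in filter_hits(example_hits, threshold)]
    print(threshold, names)
```

The threshold only changes which hits are reported; it does not change how any hit is scored.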
BlastP parameters

• Restrict by taxonomic group
• Filter repetitive regions
• Statistical cut-off
• Size of words in look-up table
• Similarity matrix (cost of gaps)
by Bob Friedman
BLAST as an Experiment:
Parameters to manipulate in a BLAST search

• Expect
• Word size
• Matrix
• Gap costs
• Filter
• Mask
Kerfeld and Scott, PLoS Biology 2011
Blast databases
• EST - Expressed Sequence Tags; cDNA
• wgs - whole genome shotgun reads
• Reference genome sequences
• NR - non-redundant DNA or amino acid sequence database
• NT - NR database excluding EST, STS, GSS, HTGS
• PDB - DNA or amino acid sequences accompanied by 3D structures
• STS - Sequence Tagged Sites; short genomic markers for mapping
• Swissprot - well-annotated amino acid sequences

• Also, to obtain an organism-specific sequence set:
ftp://ftp.ncbi.nih.gov/genomes/
by Bob Friedman
Example of web based BLAST

program: BLASTP
sequence: vma1
gi:137464

BLink provides similar


information
E – Value of a Blast Hit
Helps us to determine whether or not an alignment occurs by chance.
Blast Hit

• A match between a word and a database entry.
• A word is a short stretch of the query (3 residues for proteins,
11 for nucleotides).
• Scores are tracked using a scoring matrix (for example BLOSUM62 or
a PAM matrix).
Blast Hit
• The e-value is used as a cut-off to define a BLAST hit.
• The lower the e-value, the more significant the hit.
• Go for a higher e-value if you are expecting higher diversity between your
sequence and the database.
• The e-value depends on the size of the database.
• An observed E value of 1 indicates that, in a database of the current size,
you could expect to see 1 match with a similar score (S) simply by chance.
• E-value computations also take into account the length of the query
sequence, so nearly identical short alignments have inherently high
e-values: short query sequences have a greater chance of spurious (fake)
hits than longer sequences.
Blast Hit
• E-values have to be interpreted in the specific context of the search
that generated them. For any given query sequence and database,
the lower the e-value, the greater the probability that a score equal to
the observed score would not be expected by chance. So people adjust
their e-value cutoff based on what they are seeing, and how stringent
(or relaxed) they feel they need to be to extract the best information
for their particular question of interest.
• Just as with a statistical p-value, there are no hard and fast rules, and
cutoffs or thresholds are inherently arbitrary, as one tries to balance
stringency with the need to return enough information to be useful.
Blast Hit
• There is an old rule of thumb that an e-value less than 0.1 means
that the hit is significant. This was true some 15 years ago,
but it no longer holds when you search against a really big
databank like the NCBI nr.
• A remark: when you search a small oligonucleotide against a
complete genome or transcriptome in order to test the specificity of a
primer or siRNA, it is recommended to set the cutoff high (100 or even
1000, de facto not using a cutoff). The reason is that here you really
want to find all sequences that give a 100% or close to 100% hit. Since
all of these are really significant, the probability of finding a hit when
searching a random databank of the same size is not relevant.
Blast Hit
• It depends on your sample. If your sample is fully sequenced and
available in the database, you can choose 1e-4 to 1e-10 (lower means more
confidence). If it is not in the database, you can choose 1e-30. In this
case you increase the number of hits, and then you can select the best
hit based on what you are seeing.
The hit list

• BLAST lists the best matches (hits)


• For each hit, BLAST provides:
• Accession number – links to Genbank flatfile
• Description
• “G” = genome link
• E-value
• An indicator of how good the match to the query sequence is
• Score
• Link to an alignment
What is an E-value?

• E-value
• The chance that the match could be random

• The lower the E-value, the more significant the match


• $E = 10^{-4}$ is considered the cutoff point
• E = 0 means that the two sequences are statistically identical
E values: $E = k\,m\,N\,e^{-\lambda S}$
m = query size
k = minor constant
N = database size
λ = constant to adjust for the scoring matrix
S = score of the high-scoring segment pair (HSP)

E (expect) value: Expectation value. The number of chance alignments with scores
equivalent to or better than S that are expected to occur in a database search by
chance. The lower the E value, the more significant the score.
• The E value decreases exponentially as the Score (S) that is assigned to a match between
two sequences increases.
• The E value depends on the size of database and the scoring system in use.
• When the Expect value threshold is increased from the default value of 10, more hits can
be reported.

Bit score: The bit score is calculated from the raw score by normalizing with the
statistical variables that define a given scoring system. Therefore, bit scores from
different alignments, even those employing different scoring matrices can be
compared.
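
The formula and the bit-score normalisation can be checked numerically. In the sketch below the bit score is taken as $S' = (\lambda S - \ln k)/\ln 2$, which gives $E = m N 2^{-S'}$; the values of λ and k in the usage lines are placeholder assumptions, since the real values depend on the scoring matrix and gap costs in use.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """E = k * m * N * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalised score S' = (lambda * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, query_len, db_len):
    """Same E-value expressed through the bit score: E = m * N * 2^(-S')."""
    return query_len * db_len * 2 ** -bits


# Placeholder statistical parameters (assumptions for illustration only).
LAM, K = 0.267, 0.041

for score in (40, 60, 80):
    print(score, e_value(score, 250, 5_000_000, LAM, K))
```

Doubling the database size doubles E for the same score, while each additional point of raw score reduces E by a constant factor.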

Tips:
• Repeated amino acid stretches (e.g. poly glutamine) are unlikely to reflect
meaningful similarity between the query and the match.
• If these are present, use BLAST filters to mask low-complexity regions.
• RepeatMasker can be used to mask repeats before blasting
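
To illustrate what masking does, here is a simplified stand-in based on Shannon entropy in a sliding window. It is not the algorithm used by the BLAST filters or by RepeatMasker; the window size, entropy threshold, and function names are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of `window`."""
    total = len(window)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(window).values()
    )


def mask_low_complexity(seq, window=12, min_entropy=2.0, mask_char="X"):
    """Replace every residue that falls inside a low-entropy window."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


# Hypothetical protein fragment with a poly-glutamine stretch in the middle.
print(mask_low_complexity("MKTAYIAKQR" + "Q" * 15 + "LEDSVWHGFP"))
```

With these settings a few residues flanking the repeat are masked as well, because windows that overlap the repeat still have low entropy.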
Establishing a significant “hit”

Blast’s E-value indicates statistical significance of a sequence match


Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular
sequence features by using general scoring schemes. PNAS 87:2264-8

E-value is the Expected number of sequence (HSP) matches in a database of
n sequences
• database size is arbitrary
• multiple testing problem
• E-value calculated from many assumptions
• E-value depends on size of data bank.

Examples:
E-value = 1: expect the match to occur in the database by chance once

E-value = 0.05: expect a 5% chance of the match occurring by chance

E-value = $1 \times 10^{-20}$: strict match between protein domains
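
The link between an E-value and a probability can be made explicit. If the number of chance hits is assumed to follow a Poisson distribution with mean E, the probability of seeing at least one such hit is $P = 1 - e^{-E}$. The sketch below applies that relation; the function name is hypothetical.

```python
import math


def e_to_p(e_value):
    """Probability of at least one chance hit, assuming Poisson-distributed
    chance hits with mean equal to the E-value."""
    return -math.expm1(-e_value)


for e in (1.0, 0.05, 1e-20):
    print(e, e_to_p(e))
```

For small E the two numbers are almost identical, which is why E = 0.05 is read as roughly a 5% chance; for E near 1 or above they diverge.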


BLAST search output: tabular output

High scores, low E values

Cut-off: 0.05? $10^{-10}$?
Why set the E value to 20,000?

Suppose you perform a search with a short query (e.g. 9 amino acids).
There are not enough residues to accumulate a big score (or a small
E value).

Indeed, a match of 9 out of 9 residues could yield a small score with
an E value of 100 or 200. And yet, this result could be “real” and of
interest to you.

By setting the E value cutoff to 20,000 you do not change the way the
search was done, but you do change which results are reported to you.
Sometimes a real match has an E value > 1

real
match?

…try a reciprocal BLAST to confirm
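
The logic of a reciprocal check can be sketched without running BLAST at all. The sketch assumes two sets of hits are already available: the hits for the query searched against the database, and the hits obtained when each candidate is searched back against the query's source set. The data layout, identifiers, and function names are hypothetical.

```python
def best_hit(hits):
    """Subject with the lowest E-value, or None if there are no hits."""
    if not hits:
        return None
    return min(hits, key=lambda hit: hit["evalue"])["subject"]


def is_reciprocal_best_hit(query_id, forward_hits, reverse_hits_by_subject):
    """True if the query's best hit, searched back, returns the query
    as its own best hit."""
    candidate = best_hit(forward_hits)
    if candidate is None:
        return False
    return best_hit(reverse_hits_by_subject.get(candidate, [])) == query_id


# Invented example: a marginal forward hit that is confirmed in reverse.
forward = [{"subject": "protB", "evalue": 2.5}, {"subject": "protC", "evalue": 8.0}]
reverse = {"protB": [{"subject": "protA", "evalue": 1.8}]}
print(is_reciprocal_best_hit("protA", forward, reverse))
```

A match that survives the reverse search is more plausible than its E-value alone suggests, though it is still not proof of a real relationship.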


Database searching: E-values in BLAST

BLAST uses precomputed extreme value distributions to calculate E-values
from alignment scores.

For this reason BLAST only allows certain combinations of substitution
matrices and gap penalties.

This also means that the fit is based on a different data set than the one you
are working on.

A word of caution: BLAST tends to overestimate the significance of its matches.

E-values from BLAST are fine for identifying sure hits.

One should be careful using BLAST’s E-values to judge whether a marginal hit can be
trusted (e.g., you may want to use E-values of $10^{-4}$ to $10^{-5}$).
