Lecture_1_Introduction_Bioinformatics.pptx
College of Basic and Applied Sciences (CBAS)
School of Physical and Mathematical Sciences (SPMS)
2021/2022/2nd
Semester
CSCD 606
Bioinformatics
Lecture 1 – Introduction of Basic
Concepts in Bioinformatics
Course Lecturer: Dr Kofi Sarpong Adu-Manu
Contact Information: ksadu-manu@ug.edu.gh
Introduction
Bioinformatics is a branch of
science that integrates
computer science, mathematics
and statistics, chemistry and
engineering for analysis,
exploration, integration and
exploitation of biological
sciences data, in Research and
Development.
Bioinformatics deals with
storage, retrieval, analysis and
interpretation of biological
data using computer based
software and tools.
Bioinformatics
… What is Bioinformatics?...
https://www.youtube.com/watch?v=J3HVVi2k2No
https://www.youtube.com/watch?v=7Hk9jct2ozY
https://www.youtube.com/watch?v=9kOGOY7vthk
https://www.youtube.com/watch?v=gG7uCskUOrA
• A 1st
perspective of the field
of bioinformatics is the cell.
– Bioinformatics has emerged as a
discipline as biology has
become transformed by the
emergence of molecular
sequence data
• A 2nd
perspective of
bioinformatics is the organism.
– Broadening our view from the
level of the cell to the organism,
we can consider the individual’s
genome (collection of genes),
including the genes that are
expressed as RNA transcripts
and the protein products.
– For an individual organism, bioinformatics tools can therefore be
applied to describe changes through developmental time, changes
across body regions, and changes in a variety of physiological or
pathological states.
… What is Bioinformatics?...
• A 3rd
perspective of the field of
bioinformatics is represented by the
tree of life.
– The scope of bioinformatics includes
all of life on Earth, including the
three major branches of bacteria,
archaea, and eukaryotes.
• Viruses, which exist on the borderline
of the definition of life, are not
depicted here.
– For all species, the collection and
analysis of molecular sequence data
allow us to describe the complete
collection of DNA that comprises
each organism (the genome).
– We can further learn the variations
that occur between species and
among members of a species, and
we can deduce the evolutionary
history of life on Earth
… What is Bioinformatics?...
• In a practical sense, bioinformatics is a science that involves
– collecting,
– manipulating,
– analyzing,
– transmitting
huge quantities of data,
• uses computers whenever appropriate.
• bioinformatics refers to computational bioinformatics.
…What is Bioinformatics?...
Bioinformatics
• an interdisciplinary field that develops
– methods and software tools for understanding
biological data
• combines
– computer science,
– statistics,
– mathematics,
– engineering
to analyze and interpret biological data
…What is Bioinformatics?...
• has been used for in silico analyses of biological queries
using mathematical and statistical techniques.
• [In silico (pseudo-Latin for "in silicon") is an expression
used to mean "performed on computer or via
computer simulation".]
• primary goal is to increase the understanding of biological
processes.
• focuses on developing and applying computationally
intensive techniques to achieve this goal.
…What is Bioinformatics?...
• Techniques used include
– pattern recognition, data mining, machine learning
algorithms, and visualization
• Analyzing biological data to produce meaningful
information involves writing and running software
programs that use algorithms from
– graph theory, artificial intelligence, soft computing,
data mining, signal processing, image processing, and
computer simulation.
…What is Bioinformatics?...
• The algorithms in turn depend on theoretical
foundations such as
– discrete mathematics
– control theory
– system theory
– information theory
– statistics
...What is Bioinformatics?...
• Bioinformatics derives knowledge from computer
analysis of biological data that can consist of the
information stored in the
– genetic code,
– experimental results from various sources,
– patient statistics,
– scientific literature.
• Research in bioinformatics includes method
development for
– storage,
– retrieval,
– analysis
of the data.
• Bioinformatics emerged in the mid-1990s.
• From 1965 to 1978, Margaret O. Dayhoff established the first database of protein
sequences, published annually as a series of volumes entitled “Atlas of Protein
Sequence and Structure”.
• From 1977, DNA sequences began to accumulate slowly in the literature, and it
became more common to predict protein sequences by translating sequenced
genes than by sequencing proteins directly.
• Thus the number of uncharacterised proteins began to increase.
• By the early 1980s, there were enough DNA sequences to justify the establishment of the
first nucleotide sequence databases. GenBank, launched in 1982, is now maintained by the
National Center for Biotechnology Information (NCBI), USA, which serves as a primary
databank provider.
History of Bioinformatics
History of Bioinformatics (contd..)
• The European Molecular Biology Laboratory (EMBL) nucleotide sequence data library was
established in 1980; it is now maintained at the European Bioinformatics Institute (EBI). The
aim of this data library was to collect, organize and distribute
nucleotide sequence data and related information.
• In 1984, the National Biomedical Research Foundation (NBRF)
established the Protein Information Resource (PIR).
• In 1986, the DNA Data Bank of Japan (DDBJ) was established.
• All these data banks operate in close collaboration and
regularly exchange data.
Management and analysis of the rapidly accumulating
sequence data required new computer software and
statistical tools.
This attracted scientists from computer science and
mathematics to the fast emerging field of
bioinformatics.
History of Bioinformatics (contd..)
Goals of Bioinformatics
• The ultimate goal of bioinformatics is to better understand a living
cell and how it functions at the molecular level.
• By analyzing raw molecular sequence and structural data,
bioinformatics research can generate new insights and provide a
“global” perspective of the cell
• The reason that the functions of a cell can be better understood by
analyzing sequence data is ultimately because the flow of genetic
information is dictated by the “central dogma” of biology in
which DNA is transcribed to RNA, which is translated to proteins.
Objectives of Bioinformatics
1. Development of new algorithms and
statistics for assessing the relationships
among large sets of biological data.
2. Application of these tools for the analysis and
interpretation of the various biological data.
3. Development of databases for efficient
storage, access and management of the large body
of biological information.
• DNA is transcribed to
messenger RNA in the cell
nucleus, which is in turn
translated to protein in the
cytoplasm.
• The Central Dogma, shown
here from a structural
perspective, can also be
depicted from an information
flow perspective
The Central Dogma of Molecular Biology
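The information-flow view of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a full implementation: the function names are hypothetical, the input is assumed to be the coding strand, and `CODON_TABLE` is a deliberately tiny subset of the standard genetic code, so any codon outside it is reported as "X".

```python
# Tiny subset of the standard genetic code, enough for the example below.
CODON_TABLE = {
    "AUG": "M",                           # start codon (methionine)
    "GCU": "A", "GCC": "A",               # alanine
    "UGG": "W",                           # tryptophan
    "AAA": "K", "AAG": "K",               # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",   # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" = not in subset
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTGGAAATAA"
mrna = transcribe(dna)      # DNA -> RNA (transcription)
protein = translate(mrna)   # RNA -> protein (translation)
print(mrna, protein)
```

In real work a library such as Biopython would normally be used for this, with a complete codon table.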
Path to the Bioinformatics
– 1st,
• Learn Biology.
– 2nd,
• Decide and pick a problem that interests you for experiment.
– 3rd,
• Find and learn about the Bioinformatics tools.
– 4th,
• Learn the Computer Programming Languages.
– Perl, Python, R, Java, etc.
– 5th,
• Experiment on your computer and learn different programming
techniques.
Why is Bioinformatics Important?
• Application areas include
– Medicine
– Pharmaceutical drug design
– Toxicology
– Molecular evolution
– Biosensors
– Biomaterials
– Biological computing models
– DNA computing
Key Applications
What skills are needed?
• Well-grounded in one of the following areas:
– Computer science
– Molecular biology
– Statistics
• Working knowledge and appreciation in the others!
Scope of Computational Biology
Computational
Biology
Bioinformatics
Genomics
Proteomics
Functional
genomics
Structural
bioinformatics
Bioinformatics Software: Two Cultures
[Figure: software arranged between two cultures, web-based or graphical user interface (GUI) and command line (often Linux). Items shown: central resources (NCBI, EBI); genome browsers (UCSC, Ensembl); GUI software (Partek, MEGA, RStudio, BioMart, IGV); Galaxy (web access to NGS tools, browser data); Biopython, Python, BioPerl, R (manipulate data files); next-generation sequencing tools; data analysis software (sequences, proteins, genomes).]
• Many bioinformatics tools and resources are available on the
internet, such as major genome browsers and major portals (NCBI,
Ensembl, UCSC).
• These are:
– accessible (requiring no programming expertise)
– easy to browse to explore their depth and breadth
– very popular
– familiar (available on any web browser on any platform)
Bioinformatics Software: Two Cultures
• Many bioinformatics tools and resources are available on the
command-line interface (sometimes abbreviated CLI).
– These are often on the Linux platform (or other Unix-like
platforms such as the Mac command line).
– They are essential for many bioinformatics and genomics
applications.
– Most bioinformatics software is written for the Linux platform.
• Many bioinformatics datasets are so large (e.g. high-throughput
technologies generate millions to billions or even trillions of data
points) that command-line tools are required to manipulate the data.
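One reason scripts cope with very large datasets is that they can process a file one record at a time instead of loading it all into memory. The sketch below assumes sequences stored in FASTA format (a header line starting with ">" followed by sequence lines), which the slides do not name; `read_fasta` and the two-record input are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)   # emit the previous record
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)           # emit the last record


# Hypothetical input; in practice this would be an open file handle.
example = io.StringIO(">seq1\nACGT\nACGT\n>seq2\nGGCC\n")
lengths = {name: len(seq) for name, seq in read_fasta(example)}
print(lengths)
```

Because `read_fasta` is a generator, only one record is held in memory at any moment, whatever the size of the file.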
Bioinformatics Software: Two Cultures
• Should you learn to use the Linux operating
system?
– Yes, if you want to use mainstream
bioinformatics tools.
• Should you learn Python or Perl or R or another
programming language?
– It’s a good idea if you want to go deeper
into bioinformatics, but also, it depends
what your goals are.
– Many software tools can be run in Linux on
the command-line without needing to
program.
• Think of this figure like a map.
– Where are you now?
– Where do you want to go?
[Figure: map of web-based/GUI and command-line (CLI) bioinformatics software, as shown earlier under "Bioinformatics Software: Two Cultures".]
Some web-based (GUI) and command-line (CLI) software
• Many informatics disciplines have emerged in recent years.
• Bioinformatics is distinguished by its particular focus on DNA
and proteins (impacting its databases, its tools, and its entire
culture).
Tool makers and tool users across informatics disciplines
Limitations
• Bioinformatics predictions are not formal proofs of any concepts.
• They do not replace the traditional experimental research methods of actually
testing hypotheses.
• In addition, the quality of bioinformatics predictions depends on the quality of
data and the sophistication of the algorithms being used.
• Sequence data from high throughput analysis often contain errors.
• If the sequences are wrong or annotations incorrect, the results from the
downstream analysis are misleading as well.
• That is why it is so important to maintain a realistic perspective of the role of
bioinformatics.
Limitations (contd..)
• Most algorithms lack the capability and sophistication to truly
reflect reality.
• Errors in sequence alignment, for example, can affect the outcome
of structural or phylogenetic analysis
• The outcome of computation also depends on the computing power
available.
• Many accurate but exhaustive algorithms cannot be used because of
the slow rate of computation.
• Instead, less accurate but faster algorithms have to be used. This is a
necessary trade-off between accuracy and computational feasibility
New Themes
The bioinformatics field is undergoing major expansion.
• Providing more reliable and rigorous computational tools for
sequence, structural, and functional analysis is expected
• Development of tools for elucidation of the functions and
interactions of all gene products in a cell.
• Requires integration of disparate fields of biological knowledge and
a variety of complex mathematical and statistical tools.
• System-level simulation and integration are considered the future of
bioinformatics
• Transform biology from a qualitative science to a quantitative and
predictive science
Components of Bioinformatics
Data
Database
Database Mining Tools
Data
• Nucleic acid sequences
– Raw DNA sequences
– Genomic sequence tags (GSTs)
– DNA sequences
– Expressed sequence tags (ESTs)
– Organellar DNA sequences
– RNA sequences
• Protein sequences
• Protein structures
• Metabolic pathways
• Gel pictures
• Literature
Databases
A database is a vast collection of data pertaining to a
specific topic e.g. nucleotide sequence, protein
sequence etc., in an electronic environment.
• They are the heart of bioinformatics.
• Computerized storehouse of data (records).
• Allows extraction of specified records.
• Allows adding, changing, removing, and merging
of records.
• Uses standardized formats.
Types of Databases
• Flat file format: A flat file is a long text file that
contains many entries separated by a delimiter, a special character
such as a vertical bar (|).
• Relational database management systems: Relational databases
can be created using a special programming language called
structured query language (SQL).
• Object-oriented database management systems: object-oriented
databases have been developed that store data as objects. In an
object-oriented programming language, an object can be considered
as a unit that combines data and mathematical routines that act on
the data. The database is structured such that the objects are linked
by a set of pointers defining predetermined relationships between
the objects.
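The difference between a flat file and a relational database can be illustrated with Python's built-in `sqlite3` module. This is only a sketch: the record layout, accession IDs, and table name are hypothetical and do not reflect the schema of any real biological database.

```python
import sqlite3

# Hypothetical flat file: one entry per line, fields separated by "|".
flat_file = """ID001|retinol-binding protein|Homo sapiens
ID002|beta-lactoglobulin|Bos taurus
"""

# Flat-file access: split every line on the delimiter.
records = [line.split("|") for line in flat_file.strip().splitlines()]

# Relational access: load the same records into a table and query with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
)
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", records)
rows = conn.execute(
    "SELECT name FROM protein WHERE organism = ?", ("Homo sapiens",)
).fetchall()
conn.close()
print(rows)
```

With the flat file, every search means scanning and splitting text; with the relational table, the SQL query states what is wanted and the database engine decides how to find it.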
Biological Databases
• Biological databases use all three types of database structures: flat
files, relational, and object oriented.
• Despite the obvious drawbacks of using flat files in database
management, many biological databases still use this format.
• The justification for this is that this system involves minimum
amount of database design and the search output can be easily
understood by working biologists.
• Biological databases can be roughly divided into three categories:
primary databases, secondary databases, and specialized
databases
Categorization of Biological Databases
• Primary databases contain original biological data. They are archives of raw
sequence or structural data submitted by the scientific community. GenBank and
Protein Data Bank (PDB) are examples of primary databases.
• Secondary databases contain computationally processed or manually curated
information, based on original information from primary databases. Translated
protein sequence databases containing functional annotation belong to this
category.
• Specialized databases are those that cater to a particular research interest. For
example, Flybase, HIV sequence database, and Ribosomal Database Project are
databases that specialize in a particular organism or a particular type of data.
Databases: Types
• Sequence Databases
• Structural Databases
• Enzyme Databases
• Microarray Databases
• Clinical Databases
• Pathway Databases
• Chemical Databases
• Integrated Databases
• Bibliographic Databases
Kusum Yadav, Department of Biochemistry
Nucleotide Sequence Databases
– NCBI - GenBank: (www.ncbi.nlm.nih.gov/GenBank)
– EMBL: (www.ebi.ac.uk/embl)
– DDBJ: (www.ddbj.nig.ac.jp)
The three databases exchange data and are updated on a
daily basis, and accession numbers are consistent across them.
There are no legal restrictions on the use of these
databases. However, some sequences in the
databases are patented.
The International Nucleotide Sequence
Database Collaboration (INSDC)
National Center for Biotechnology Information
(NCBI)
EMBL Database
European Molecular Biology Laboratory (EMBL):
• Maintained by European Bioinformatics Institute (EBI)
• GSS (genome survey sequences)
• HTC (high-throughput cDNA sequences)
• HTG (high-throughput genomic sequences)
• EST (expressed sequence tag)
• Patents
European Bioinformatics Institute (EBI)
• Developed in 1986 as a collaboration with
EMBL and GenBank.
• Produced, maintained and distributed by the
National Institute of Genetics, Japan.
• Sequences are submitted via a web-based data
submission tool.
DDBJ (DNA Data Bank of Japan)
GenomeNet, Japan
• ESTs - Expressed Sequence Tags
– dbEST (http://www.ncbi.nlm.nih.gov/dbEST)
• GenBank subset with additional EST-specific data
• Implemented in a Sybase relational database
• SNPs - Single Nucleotide Polymorphisms
– dbSNP (http://www.ncbi.nlm.nih.gov/SNP/)
• Very similar to dbEST in philosophy and
implementation
• Many commercial databases
– Celera, Incyte, etc.
Other Databases
Protein sequence database
• Functions as repository of raw data: two types
• Primary
• Secondary
Protein structure database
Protein Databases
Primary databases
1. SWISS-PROT: Maintained by groups at the Swiss Institute of Bioinformatics (SIB).
• Annotates the sequences
• Describes protein functions
• Describes domain structures
• Describes post-translational modifications
• Provides a high level of annotation
• Minimum level of redundancy
• High level of integration with other databases
2. TrEMBL:
• Computer-annotated supplement of SWISS-PROT that contains all the
translations of EMBL nucleotide entries not yet integrated in SWISS-PROT.
3. PIR: Protein Information Resource, a division of NBRF in the US.
• Collaborated with the Munich Information Centre for Protein
Sequences (MIPS) and the Japan International Protein Information Database
(JIPID).
• One can search for entries
• One can do sequence similarity searches
• PIR also produces NRL-3D (a db of sequences extracted from 3D structures)
Swiss-Prot
Secondary databases
• Secondary dbs compile and filter sequence data from different primary dbs.
• These dbs contain information derived from protein sequences and help the user
determine whether a new sequence belongs to a known protein family.
1. PROSITE:
• db of short protein sequence patterns and profiles that characterise biologically significant sites in proteins
• It is based on regular expressions describing characteristic sequences of specific protein families and domains.
• It is part of SWISS-PROT, and maintained in the same way
2. PRINTS
• PRINTS provides a compendium of protein fingerprints (groups of conserved motifs that characterise a protein
family)
• Now has a relational version, "PRINTS-S"
3. BLOCKS
• Blocks are patterns without gaps in aligned protein families defined by PROSITE, found by pattern-searching and statistical
sampling algorithms.
• Automatically determined ungapped conserved segments
4. Pfam
• Db of protein families defined as domains
• For each domain, it contains a multiple alignment of a set of defining sequences and the other
sequences in SWISS-PROT and TrEMBL that can be matched to the alignment.
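Since PROSITE patterns are described above as regular expressions, a short sketch can show how such a pattern might be searched for in a sequence. The converter below is a hypothetical helper that handles only a subset of the PROSITE syntax (x, [..], {..} and repeat counts), the pattern `N-{P}-[ST]-{P}` is the well-known N-glycosylation site pattern used here purely as an example, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern to a Python regular expression.

    Handles only: x (any residue), [..] (allowed residues),
    {..} (forbidden residues) and repeats such as x(2) or x(2,4).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden set
        element = element.replace("x", ".")                     # any residue
        element = element.replace("(", "{").replace(")", "}")   # repeat count
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTL"  # hypothetical protein sequence
hits = [(m.start(), m.group()) for m in re.finditer(regex, sequence)]
print(regex, hits)
```

The second candidate site in the example ("NPTL") is rejected because proline follows the asparagine, which the pattern forbids.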
Protein Structural Database
1. PDB (Protein Data Bank):
• Main db of 3D structures of biological macromolecules (determined by
X-ray crystallography and NMR).
• PDB entries contain the atomic coordinates, and some structural parameters connected
with the atoms or computed from the structures (secondary structure).
• PDB provides the primary archive of all 3D structures for macromolecules such as proteins,
DNA, RNA and various complexes.
2. SCOP (Structural Classification of Proteins):
• The db was started with the objective of classifying protein 3D structures in a hierarchical
scheme of structural classes.
• It is based on data in a primary db, but adds information through analysis and
organization (such as classification of 3D structures into hierarchical scheme of folds,
super-families and families)
3. CATH (Class, architecture, topology, homologous super-family):
• CATH performs hierarchical classification of protein domain structures.
• Clusters proteins at four major structural levels
Enzyme Database
BRENDA [BRaunschweig ENzyme DAtabase]
• ENZYME, a part of ExPASy (Expert
Protein Analysis System, the proteomics
server of the Swiss Institute of Bioinformatics)
Clinical Databases
Generally contain information from the Human
Human Gene Mutation Database, Cardiff, UK:
http://www.hgmd.org
Registers known mutations in the human genome and the
diseases they cause.
OMIM database
Online Mendelian Inheritance in Man
http://www.ncbi.nlm.nih.gov/Omim
The OMIM database contains abstracts and texts describing genetic
disorders to support genomics efforts and clinical genetics. It provides gene
maps, and known disorder maps in tabular listing formats. Contains
keyword search.
Kyoto Encyclopedia of Genes and
Genomes (KEGG)
www.genome.jp/kegg/
Database and associated software which
integrates several databases such as,
Pathway database
Genes database
Genome database
Drug database
Reaction database
Compound database
KO database etc.
Bibliographic Databases
Used for searching for reference articles
PubMed
1. It enables users to do keyword searches, provides links to a
selection of full articles, and has text mining capabilities, e.g.
it provides links to related articles and GenBank entries,
among others.
2. It contains entries for more than 30 million abstracts of
scientific publications.
Database Mining Tools (Analysis Tools)
Utilization of various databases requires the use of
suitable search engines and analysis tools. These tools
are called Database mining tools and the process of data
utilization is known as database mining.
Some Analysis Tools are as follows:
Analysis Tools
Analysis Tool: Function
BLAST (NCBI, USA): Used to analyse sequence information and detect homologous sequences
ENTREZ (NCBI, USA): Used to access literature (abstracts), sequence and structure db
DNAPLOT (EBI, UK): Sequence alignment tool
LOCUS LINK (NCBI, USA): Accessing information on homologous genes
LIGAND (GenomeNet, Japan): A chemical db; allows search for a combination of enzymes and links to all publicly accessible db
BRITE (GenomeNet, Japan): Biomolecular relations in information transmission and expression db; links to all publicly accessible db
TAXONOMY BROWSER (NCBI, USA): Taxonomic classification of various species as well as genetic information
STRUCTURE: Supports the Molecular Modelling Database (MMDB) and software tools for structure analysis
BLAST
(Basic Local Alignment Search Tool) for Homology Analyses
• BLASTn
– Nucleotide query vs nucleotide database
• BLASTp
– protein query vs protein database
• BLASTx
– automatic 6-frame translation of nucleotide query vs protein database
– If you have a DNA sequence and you want to know what protein (if any) it
encodes, you can perform a BLASTx search.
• tBLASTn
– protein query vs automatic 6-frame translation of nucleotide database
– You can use this program to ask whether a DNA or ESTs db contains a
nucleotide sequence encoding a protein that matches your protein of
interest.
• tBLASTx
– automatic 6-frame translation of nucleotide query vs automatic 6-frame
translation of nucleotide database
BLAST (Basic Local Alignment Search Tool) for Homology Analyses
Program: Input, Database (number of reading-frame comparisons)
BLASTn: DNA, DNA (1)
BLASTp: protein, protein (1)
BLASTx: DNA, protein (6)
tBLASTn: protein, DNA (6)
tBLASTx: DNA, DNA (36)
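The "6-frame translation" used by BLASTx, tBLASTn and tBLASTx can be sketched as follows. This is an illustration of the idea only, not BLAST's implementation: the function names are hypothetical, the standard genetic code is assumed, and the input sequence is a made-up example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; '*' marks a stop, 'X' an unknown codon."""
    return "".join(CODONS.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Translate three forward and three reverse-complement reading frames."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCTGGAAATAA")
for name, peptide in sorted(frames.items()):
    print(name, peptide)
```

Translating both the query and the database in six frames, as tBLASTx does, gives 6 x 6 = 36 frame combinations, which is why that search is the most expensive of the five.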
SEQUENCE ALIGNMENT
What is Sequence Alignment ?
A sequence alignment is a way of arranging the sequences of DNA
or protein to identify regions of similarity that may be a
consequence of functional, structural, or evolutionary
relationships between the sequences.
Definitions
Similarity
The extent to which nucleotide or protein sequences are related. It is based
upon identity plus conservation.
Identity
The extent to which two sequences are invariant.
Conservation
Changes at a specific position of an amino acid (or, less commonly, DNA)
sequence that preserve the physicochemical properties of the original
residue.
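The three definitions above can be turned into numbers for a pair of already aligned sequences. The sketch below makes two assumptions that are not from the slides: conservation is judged with a simple, hypothetical grouping of amino acids by physicochemical class (real tools use substitution matrices), and columns containing a gap are left out of the denominator.

```python
# Hypothetical physicochemical classes; real tools use substitution matrices.
GROUPS = ["AVLIM", "FWY", "KRH", "DE", "STNQ", "C", "G", "P"]
GROUP_OF = {aa: index for index, group in enumerate(GROUPS) for aa in group}


def identity_and_similarity(a: str, b: str) -> tuple:
    """Return (identity, similarity) fractions for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Assumption: columns with a gap in either sequence are ignored.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    identical = sum(x == y for x, y in columns)
    conserved = sum(
        x != y and GROUP_OF.get(x) is not None and GROUP_OF.get(x) == GROUP_OF.get(y)
        for x, y in columns
    )
    total = len(columns)
    return identical / total, (identical + conserved) / total


# Short aligned fragments used only as an example.
identity, similarity = identity_and_similarity("GTWYAMAKK", "GTWYSLAMA")
print(f"identity={identity:.2f} similarity={similarity:.2f}")
```

In this example five of nine columns are identical and one more (M aligned with L) is a conservative change, so similarity is identity plus conservation, as defined above.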
Types of Sequence Alignment
• Pairwise Sequence Alignment (PSA)
• Multiple Sequence Alignment (MSA)
• The process of lining up two sequences to achieve
maximal levels of identity (and conservation, in the case
of amino acid sequences) for the purpose of assessing
the degree of similarity and the possibility of
homology.
• Pairwise sequence alignment is the most
fundamental operation of bioinformatics.
PSA
PSA Importance
PSA is the most fundamental operation in bioinformatics. There are
four basic uses of pairwise sequence alignment:
1. to decide if 2 proteins or genes are related structurally or
functionally
2. to identify domains or motifs shared between proteins
3. basis of BLAST searching
4. to analyze genomes
Reason why pairwise alignment uses protein sequences: protein
sequences can be more informative than DNA
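To make "lining up two sequences" concrete, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach, which the slides do not name). The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions; real protein alignments use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            aligned_a.append(a[i - 1])
            aligned_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best_score, top, bottom = needleman_wunsch("GATTACA", "GCATGCU")
print(best_score)
print(top)
print(bottom)
```

Several alignments can share the best score; the traceback here simply prefers a diagonal step, then a gap in the second sequence, then a gap in the first.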
1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50 RBP
. ||| | . |. . . | : .||||.:| :
1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin
51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97 RBP
: | | | | :: | .| . || |: || |.
45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93 lactoglobulin
98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
|| ||. | :.|||| | . .|
94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
. | | | : || . | || |
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
PSA of retinol-binding protein 4 and β-lactoglobulin
Pairwise alignment of retinol-binding
protein and β-lactoglobulin
[Figure: the RBP and lactoglobulin alignment repeated, with a callout marking an identity (bar, "|"). Page 46]
Pairwise alignment of retinol-binding protein and β-lactoglobulin
[Figure: the RBP and lactoglobulin alignment repeated, with callouts marking somewhat similar residues (one dot, ".") and very similar residues (two dots, ":"). Page 46]
Pairwise alignment of retinol-binding
protein and β-lactoglobulin
[Figure: the RBP and lactoglobulin alignment repeated, with callouts marking an internal gap and a terminal gap.]
• Homologs: similar sequences in different organisms derived
from a common ancestor sequence.
• Orthologs : homologous sequences in different related species
that arose from a common ancestral gene during speciation.
Orthologs are presumed to have similar biological function.
e.g. Human and rat myoglobins both transport oxygen in
muscle
• Paralogs: homologous genes within the same organism
e.g. human α and β globins are paralogs. Paralogs are the
result of gene duplication events
• Xenologs: similar sequences that have arisen out of horizontal
transfer events (symbiosis, viruses, etc)
Sequence Analyses for relatedness
• Partial or complete alignment of three or
more related proteins/ nucleotide sequences
• Conserved domain analysis
• Primer Designing
Multiple sequence Alignment
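One use of a multiple alignment, conserved domain analysis, comes down to asking which columns are the same in every sequence. The sketch below assumes the alignment has already been produced by an alignment tool; the three short sequences and the function name are hypothetical.

```python
from collections import Counter

# Hypothetical, already aligned sequences ("-" marks a gap).
msa = [
    "MKT-LLA",
    "MKTALLA",
    "MRT-LIA",
]


def column_conservation(alignment: list) -> list:
    """For each column, return (most common symbol, fraction of rows with it)."""
    summary = []
    for column in zip(*alignment):
        symbol, count = Counter(column).most_common(1)[0]
        summary.append((symbol, count / len(column)))
    return summary


conservation = column_conservation(msa)
consensus = "".join(symbol for symbol, _ in conservation)
conserved_columns = [
    index
    for index, (symbol, fraction) in enumerate(conservation)
    if fraction == 1.0 and symbol != "-"
]
print(consensus, conserved_columns)
```

Runs of fully conserved columns are natural candidates for conserved domains and for primer design, the two applications listed above.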
Tools of Multiple Alignment
• CLUSTALW
• T-Coffee
• MUSCLE
• KALIGN
• CLC & GCG WorkBench
Various categories of Analyses
1. Analysis of a single gene (protein) sequence
– Similarity with other known genes
– Phylogenetic trees; evolutionary relationships
– Identification of well-defined domains in the
sequence
– Sequence features (physical properties, binding
sites, modification sites)
– Prediction of sub-cellular localization
– Prediction of protein secondary and tertiary
structures
2. Analysis of whole genomes
– Location of various genes on the chromosomes,
correlation with function or evolution
– Expansion/duplication of gene families
– Which gene families are present, which
missing?
– Presence or absence of biochemical pathways
– Identification of "missing" enzymes
– Large-scale events in the evolution of organisms
3. Analysis of genes and genomes with respect to
function (Functional Annotation)
– Transcriptomics: expression analysis; microarray data
(mRNA/transcript analyses)
– Proteomics: protein qualitative and
quantitative analyses, covalent modifications
– Comparison and analysis of
biochemical pathways
– Deletion or mutant genotypes vs phenotypes
– Identification of essential genes, or
genes involved in specific processes
4. Comparative genomics
– Identifying pathogen-specific unique
targets for designing novel drugs.
• Phylogenetic trees aim to reconstruct the history of
successive divergence that took place during evolution
between the sequences considered and their common ancestor.
• Nucleic acid and protein sequences are used to infer
phylogenetic relationships.
• Molecular phylogeny methods propose
phylogenetic trees from a given set of aligned sequences.
Phylogenetic Analysis
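Many tree-building methods start from pairwise distances computed on aligned sequences. As a minimal sketch of that first step, the code below computes the p-distance (the proportion of differing sites) for every pair; the four aligned sequences and their names are hypothetical, and turning the distances into a tree is left to dedicated tools.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in sites) / len(sites)


# Hypothetical aligned sequences of equal length.
alignment = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTACGTTC",
    "species_C": "ACGAACGTTG",
    "species_D": "TCGAACCTTG",
}

distances = {
    (first, second): p_distance(alignment[first], alignment[second])
    for first, second in combinations(alignment, 2)
}
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

Distance-based methods repeatedly join the closest pair, so in this example species_A and species_B would be grouped first.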
Phylogenetic Analysis Tools
MEGA
PHYLIP
PAUP
Treeview
ODEN
PHYLOWIN
TREECON
DENDRON

Lecture_1_Introduction_Bioinformatics.pptx

  • 1.
    College of Basicand Applied Sciences (CBAS) School of Physical and Mathematical Sciences (SPMS) 2021/2022/2nd Semester CSCD 606 Bioinformatics Lecture 1 – Introduction of Basic Concepts in Bioinformatics Course Lecturer: Dr Kofi Sarpong Adu-Manu Contact Information: ksadu-manu@ug.edu.gh
  • 2.
  • 3.
    Bioinformatics is abranch of science that integrates computer science, mathematics and statistics, chemistry and engineering for analysis, exploration, integration and exploitation of biological sciences data, in Research and Development. Bioinformatics deals with storage, retrieval, analysis and interpretation of biological data using computer based software and tools. Bioinformatics
  • 4.
    … What isBioinformatics?... https://www.youtube.com/watch?v=J3HVVi2k2No https://www.youtube.com/watch?v=7Hk9jct2ozY https://www.youtube.com/watch?v=9kOGOY7vthk https://www.youtube.com/watch?v=gG7uCskUOrA • A 1st perspective of the field of bioinformatics is the cell. – Bioinformatics has emerged as a discipline as biology has become transformed by the emergence of molecular sequence data
  • 5.
    • A 2nd perspectiveof bioinformatics is the organism. – Broadening our view from the level of the cell to the organism, we can consider the individual’s genome (collection of genes), including the genes that are expressed as RNA transcripts and the protein products. – For an individual organism, bioinformatics tools can therefore be applied to describe changes through developmental time, changes across body regions, and changes in a variety of physiological or pathological states. … What is Bioinformatics?...
  • 6.
    • A 3rd perspectiveof the field of bioinformatics is represented by the tree of life. – The scope of bioinformatics includes all of life on Earth, including the three major branches of bacteria, archaea, and eukaryotes. • Viruses, which exist on the borderline of the definition of life, are not depicted here. – For all species, the collection and analysis of molecular sequence data allow us to describe the complete collection of DNA that comprises each organism (the genome). – We can further learn the variations that occur between species and among members of a species, and we can deduce the evolutionary history of life on Earth … What is Bioinformatics?...
  • 7.
    What is Bioinformatics? (continued) • In a practical sense, bioinformatics is a science that – involves collecting, manipulating, analyzing, and transmitting huge quantities of data, and – uses computers whenever appropriate. • Here, bioinformatics refers to computational bioinformatics.
  • 8.
    Bioinformatics • is an interdisciplinary field that develops methods and software tools for understanding biological data • combines – computer science, – statistics, – mathematics, – engineering to analyze and interpret biological data.
  • 9.
    What is Bioinformatics? (continued) • Bioinformatics has been used for in silico analyses of biological queries using mathematical and statistical techniques. • [In silico (pseudo-Latin for "in silicon") is an expression meaning "performed on a computer or via computer simulation."] • Its primary goal is to increase the understanding of biological processes. • It focuses on developing and applying computationally intensive techniques to achieve this goal.
  • 10.
    What is Bioinformatics? (continued) • Techniques used include – pattern recognition, data mining, machine learning algorithms, and visualization. • Analyzing biological data to produce meaningful information involves writing and running software programs that use algorithms from – graph theory, artificial intelligence, soft computing, data mining, signal processing, image processing, and computer simulation.
  • 11.
    What is Bioinformatics? (continued) • The algorithms in turn depend on theoretical foundations such as – discrete mathematics – control theory – systems theory – information theory – statistics
  • 12.
    What is Bioinformatics? (continued) • Bioinformatics derives knowledge from computer analysis of biological data, which can consist of the information stored in the – genetic code, – experimental results from various sources, – patient statistics, – scientific literature. • Research in bioinformatics includes method development for – storage, – retrieval, – analysis of the data.
  • 13.
    History of Bioinformatics • Bioinformatics emerged as a recognized discipline in the mid-1990s. • From 1965 to 1978, Margaret O. Dayhoff established the first database of protein sequences, published as a series of volumes entitled "Atlas of Protein Sequence and Structure". • From 1977, DNA sequences began to accumulate slowly in the literature, and it became more common to predict protein sequences by translating sequenced genes than by sequencing proteins directly. • Thus the number of uncharacterised proteins began to increase. • By the early 1980s there were enough DNA sequences to justify the establishment of the first nucleotide sequence databases. GenBank, launched in 1982, is now maintained by the National Center for Biotechnology Information (NCBI), USA, which serves as a primary databank provider.
  • 14.
    History of Bioinformatics (continued) • The EMBL Data Library was established by the European Molecular Biology Laboratory (EMBL) in 1980 and is now maintained by the European Bioinformatics Institute (EBI). The aim of this data library was to collect, organize, and distribute nucleotide sequence data and related information. • In 1986, the DNA Data Bank of Japan (DDBJ) was established at the National Institute of Genetics, Japan. • In 1984, the National Biomedical Research Foundation (NBRF) established the Protein Information Resource (PIR). • All these data banks operate in close collaboration and regularly exchange data.
  • 15.
    History of Bioinformatics (continued) Management and analysis of the rapidly accumulating sequence data required new computer software and statistical tools. This attracted scientists from computer science and mathematics to the fast-emerging field of bioinformatics.
  • 16.
    Goals of Bioinformatics • The ultimate goal of bioinformatics is to better understand a living cell and how it functions at the molecular level. • By analyzing raw molecular sequence and structural data, bioinformatics research can generate new insights and provide a "global" perspective of the cell. • The functions of a cell can be better understood by analyzing sequence data because the flow of genetic information follows the "central dogma" of biology: DNA is transcribed to RNA, which is translated to protein.
  • 17.
    Objectives of Bioinformatics 1. Development of new algorithms and statistics for assessing the relationships among large sets of biological data. 2. Application of these tools to the analysis and interpretation of various biological data. 3. Development of databases for efficient storage, access, and management of the large body of biological information.
  • 18.
    The Central Dogma of Molecular Biology • DNA is transcribed to messenger RNA in the cell nucleus, which is in turn translated to protein in the cytoplasm. • The Central Dogma, shown here from a structural perspective, can also be depicted from an information-flow perspective.
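    The information-flow view of the central dogma can be sketched in a few lines of Python. This is a minimal illustration, not a production tool: it assumes the input is the coding (sense) strand, uses the standard genetic code, starts at the first base rather than searching for a start codon, and the toy gene is made up.

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


toy_gene = "ATGGCCTGGTAA"  # made-up coding strand, for illustration only
mrna = transcribe(toy_gene)
print(mrna)
print(translate(mrna))
```

    Real genes need more care (start-codon search, introns, alternative genetic codes); libraries such as Biopython handle those cases.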
  • 19.
    Path to Bioinformatics – 1st: learn biology. – 2nd: pick a problem that interests you to experiment on. – 3rd: find and learn about the bioinformatics tools. – 4th: learn computer programming languages (Perl, Python, R, Java, etc.). – 5th: experiment on your computer and learn different programming techniques.
  • 20.
    Why is Bioinformatics Important? • Application areas include – Medicine – Pharmaceutical drug design – Toxicology – Molecular evolution – Biosensors – Biomaterials – Biological computing models – DNA computing
  • 21.
  • 22.
    What skills are needed? • Well-grounded in one of the following areas: – Computer science – Molecular biology – Statistics • Working knowledge and appreciation of the others!
  • 23.
    Scope of Computational Biology. Diagram labels: Computational Biology; Bioinformatics; Genomics; Proteomics; Functional genomics; Structural bioinformatics.
  • 24.
    Bioinformatics Software: Two Cultures. Figure labels: Web-based or graphical user interface (GUI); Command line (often Linux); Central resources (NCBI, EBI); Genome browsers (UCSC, Ensembl); Biopython, Python, BioPerl, R: manipulate data files; Next-generation sequencing tools; Data analysis software: sequences, proteins, genomes; GUI software (Partek, MEGA, RStudio, BioMart, IGV); Galaxy (web access to NGS tools, browser data).
  • 25.
    Bioinformatics Software: Two Cultures • Many bioinformatics tools and resources are available on the internet, such as major genome browsers and major portals (NCBI, Ensembl, UCSC). • These are: – accessible (requiring no programming expertise) – easy to browse to explore their depth and breadth – very popular – familiar (available on any web browser on any platform)
  • 26.
    Bioinformatics Software: Two Cultures • Many bioinformatics tools and resources are available through the command-line interface (sometimes abbreviated CLI). – These are often on the Linux platform (or other Unix-like platforms such as the Mac command line). – They are essential for many bioinformatics and genomics applications. – Most bioinformatics software is written for the Linux platform. • Many bioinformatics datasets are so large (e.g. high-throughput technologies generate millions to billions or even trillions of data points) that command-line tools are required to manipulate the data.
  • 27.
    • Should you learn to use the Linux operating system? – Yes, if you want to use mainstream bioinformatics tools. • Should you learn Python or Perl or R or another programming language? – It's a good idea if you want to go deeper into bioinformatics, but it also depends on your goals. – Many software tools can be run in Linux on the command line without needing to program. • Think of this figure like a map. – Where are you now? – Where do you want to go? (Figure: the "two cultures" map of web-based/GUI and command-line (CLI) tools.)
  • 28.
    Some web-based (GUI) and command-line (CLI) software
  • 29.
    Some web-based (GUI) and command-line (CLI) software (continued)
  • 30.
    Tool makers and tool users across informatics disciplines • Many informatics disciplines have emerged in recent years. • Bioinformatics is distinguished by its particular focus on DNA and proteins (impacting its databases, its tools, and its entire culture).
  • 31.
    Limitations • Bioinformatics predictions are not formal proofs of any concept. • They do not replace the traditional experimental research methods of actually testing hypotheses. • In addition, the quality of bioinformatics predictions depends on the quality of the data and the sophistication of the algorithms being used. • Sequence data from high-throughput analysis often contain errors. • If the sequences are wrong or the annotations incorrect, the results of the downstream analysis are misleading as well. • That is why it is so important to maintain a realistic perspective on the role of bioinformatics.
  • 32.
    Limitations (continued) • Most algorithms lack the capability and sophistication to truly reflect reality. • Errors in sequence alignment, for example, can affect the outcome of structural or phylogenetic analysis. • The outcome of computation also depends on the computing power available. • Many accurate but exhaustive algorithms cannot be used because they are too slow to compute. • Instead, less accurate but faster algorithms have to be used. This is a necessary trade-off between accuracy and computational feasibility.
  • 33.
    New Themes • The field of bioinformatics is undergoing major expansion. • More reliable and rigorous computational tools for sequence, structural, and functional analysis are expected. • Tools are being developed to elucidate the functions and interactions of all gene products in a cell. • This requires integration of disparate fields of biological knowledge and a variety of complex mathematical and statistical tools. • System-level simulation and integration are considered the future of bioinformatics. • The aim is to transform biology from a qualitative science into a quantitative and predictive science.
  • 34.
  • 35.
    Data • Nucleic acid sequences – Raw DNA sequences – Genomic sequence tags (GSTs) – DNA sequences – Expressed sequence tags (ESTs) – Organellar DNA sequences – RNA sequences • Protein sequences • Protein structures • Metabolic pathways • Gel pictures • Literature
  • 36.
    Databases. A database is a vast collection of data pertaining to a specific topic (e.g. nucleotide sequences, protein sequences) held in an electronic environment. • Databases are the heart of bioinformatics. • A database is a computerized storehouse of data (records). • It allows extraction of specified records. • It allows adding, changing, removing, and merging of records. • It uses standardized formats.
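    As an illustration of what a standardized format buys you, the sketch below reads FASTA-style text into a dictionary of records. FASTA is not named in the slides; it is chosen here only because it is a widely used sequence format. The record names and sequences are made up, and the parser assumes every header line has a non-empty identifier after ">".

```python
def parse_fasta(text: str) -> dict:
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after '>'; the rest is a description.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up records, for illustration only.
TOY_FASTA = """\
>toy_seq_1 made-up record
ATGGCC
TGGTAA
>toy_seq_2
ATGAAATAA
"""

records = parse_fasta(TOY_FASTA)
for name, sequence in records.items():
    print(name, len(sequence))
```

    Because every record follows the same layout, extracting, adding, or merging records reduces to simple, repeatable operations.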
  • 37.
    Types of Databases • Flat file format: a flat file is a long text file that contains many entries separated by a delimiter, a special character such as a vertical bar (|). • Relational database management systems: relational databases store data in tables and are created and queried using a special programming language called Structured Query Language (SQL). • Object-oriented database management systems: object-oriented databases store data as objects. In an object-oriented programming language, an object can be considered a unit that combines data with the routines that act on the data. The database is structured such that the objects are linked by a set of pointers defining predetermined relationships between the objects.
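    The difference between flat-file and relational access can be sketched with Python's built-in sqlite3 module. Everything in the example is hypothetical: the accessions, the three-field layout, and the table name are invented for illustration and do not correspond to any real database schema.

```python
import sqlite3

# Made-up flat-file records: accession|organism|sequence, one entry per line.
FLAT_FILE = """\
TOY0001|Homo sapiens|ATGGCCTGGTAA
TOY0002|Mus musculus|ATGGCATGGTAA
TOY0003|Homo sapiens|ATGAAATAA
"""

# Flat-file access: every query means scanning and splitting the text.
records = [tuple(line.split("|")) for line in FLAT_FILE.splitlines()]

# Relational access: load the same records into a table and query with SQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entry (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
)
conn.executemany("INSERT INTO entry VALUES (?, ?, ?)", records)

human = conn.execute(
    "SELECT accession, LENGTH(sequence) FROM entry "
    "WHERE organism = ? ORDER BY accession",
    ("Homo sapiens",),
).fetchall()
print(human)
conn.close()
```

    The flat file needs no design work up front, which is one reason it remains popular; the relational table pays for its setup with declarative queries and indexing.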
  • 38.
    Biological Databases • Biological databases use all three types of database structure: flat files, relational, and object-oriented. • Despite the obvious drawbacks of using flat files in database management, many biological databases still use this format. • The justification is that this system involves a minimum amount of database design and the search output can be easily understood by working biologists. • Biological databases can be roughly divided into three categories: primary databases, secondary databases, and specialized databases.
  • 39.
    Categorization of Biological Databases • Primary databases contain original biological data. They are archives of raw sequence or structural data submitted by the scientific community. GenBank and the Protein Data Bank (PDB) are examples of primary databases. • Secondary databases contain computationally processed or manually curated information, based on original information from primary databases. Translated protein sequence databases containing functional annotation belong to this category. • Specialized databases are those that cater to a particular research interest. For example, FlyBase, the HIV sequence database, and the Ribosomal Database Project are databases that specialize in a particular organism or a particular type of data.
  • 40.
    Databases: Types • Sequence databases • Structural databases • Enzyme databases • Microarray databases • Clinical databases • Pathway databases • Chemical databases • Integrated databases • Bibliographic databases
  • 41.
    Nucleotide Sequence Databases: the International Nucleotide Sequence Database Collaboration (INSDC) – NCBI GenBank: (www.ncbi.nlm.nih.gov/GenBank) – EMBL: (www.ebi.ac.uk/embl) – DDBJ: (www.ddbj.nig.ac.jp) The three databases are updated and exchange data on a daily basis, and their accession numbers are consistent. There are no legal restrictions on the use of these databases. However, they contain some patented sequences. Kusum Yadav, Department of Biochemistry
  • 42.
    National Center for Biotechnology Information (NCBI)
  • 43.
    EMBL Database. European Molecular Biology Laboratory (EMBL): • Maintained by the European Bioinformatics Institute (EBI) • GSS (genome survey sequences) • HTC (high-throughput cDNA sequences) • HTG (high-throughput genomic sequences) • EST (expressed sequence tags) • Patents
  • 44.
  • 45.
    DDBJ (DNA Data Bank of Japan) • Established in 1986, in collaboration with EMBL and GenBank. • Produced, maintained, and distributed by the National Institute of Genetics, Japan. • Sequences are submitted via a web-based data submission tool.
  • 46.
  • 47.
    Other Databases • ESTs (Expressed Sequence Tags) – dbEST (http://www.ncbi.nlm.nih.gov/dbEST) • GenBank subset with additional EST-specific data • Implemented in a Sybase relational database • SNPs (Single Nucleotide Polymorphisms) – dbSNP (http://www.ncbi.nlm.nih.gov/SNP/) • Very similar to dbEST in philosophy and implementation • Many commercial databases – Celera, Incyte, etc.
  • 48.
    Protein Databases • Protein sequence databases: function as repositories of raw data; two types – Primary – Secondary • Protein structure databases
  • 49.
    Primary databases 1. SWISS-PROT: maintained by groups at the Swiss Institute of Bioinformatics (SIB). • Annotates the sequences • Describes protein functions • Describes domain structures • Describes post-translational modifications • Provides a high level of annotation • Minimum level of redundancy • High level of integration with other databases 2. TrEMBL: • Computer-annotated supplement to SWISS-PROT that contains all the translations of EMBL nucleotide entries not yet integrated into SWISS-PROT. 3. PIR: Protein Information Resource, a division of NBRF in the US. • Collaborates with the Munich Information Centre for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID). • One can search for entries • One can run sequence similarity searches • PIR also produces NRL-3D (a database of sequences extracted from 3D structures).
  • 50.
  • 51.
    Secondary databases • Secondary databases compile and filter sequence data from different primary databases. • They contain information derived from protein sequences and help the user determine whether a new sequence belongs to a known protein family. 1. PROSITE: • Database of short protein sequence patterns and profiles that characterise biologically significant sites in proteins. • It is based on regular expressions describing characteristic sequences of specific protein families and domains. • It is part of SWISS-PROT and is maintained in the same way. 2. PRINTS: • Provides a compendium of protein fingerprints (groups of conserved motifs that characterise a protein family). • Now has a relational version, "PRINTS-S". 3. BLOCKS: • Patterns without gaps in aligned protein families defined by PROSITE, found by pattern searching and statistical sampling algorithms. • Automatically determined ungapped conserved segments. 4. Pfam: • Database of protein families defined as domains. • For each domain, it contains a multiple alignment of a set of defining sequences and the other sequences in SWISS-PROT and TrEMBL that can be matched to the alignment.
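    Because PROSITE patterns are essentially regular expressions, a simple pattern can be converted to Python's regex syntax and run against a sequence. The converter below is a rough sketch covering only the basic syntax elements (residues separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden residues, "(n)" or "(n,m)" for repeats, "<" and ">" as terminus anchors outside brackets). The test protein is made up; the pattern N-{P}-[ST]-{P} is the well-known N-glycosylation site motif.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern such as 'N-{P}-[ST]-{P}' to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden residues
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminus anchors
        parts.append(element)
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
toy_protein = "MKNGSAANPTQ"  # made-up sequence, for illustration only

# 0-based start positions of non-overlapping matches.
hits = [match.start() for match in re.finditer(n_glyc, toy_protein)]
print(n_glyc, hits)
```

    Note that re.finditer reports non-overlapping matches only; a real scanner would also need to handle overlapping sites and the full PROSITE syntax.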
  • 52.
    Protein Structural Database 1. PDB (Protein Data Bank): • Main database of 3D structures of biological macromolecules (determined by X-ray crystallography and NMR). • PDB entries contain the atomic coordinates and some structural parameters connected with the atoms or computed from the structures (secondary structure). • PDB provides the primary archive of all 3D structures for macromolecules such as proteins, DNA, RNA, and various complexes. 2. SCOP (Structural Classification of Proteins): • Started with the objective of classifying protein 3D structures in a hierarchical scheme of structural classes. • It is based on data in a primary database, but adds information through analysis and organization (such as classification of 3D structures into a hierarchical scheme of folds, superfamilies, and families). 3. CATH (Class, Architecture, Topology, Homologous superfamily): • Performs hierarchical classification of protein domain structures. • Clusters proteins at four major structural levels.
  • 53.
    Enzyme Database BRENDA [BRaunshchweigENzyme DAtabase]  Enzyme, a part of ExPaSy (Expert Protein Analysis System, the proteomic server of Swiss Institute of Bioinformatics)
  • 54.
    Clinical Databases. Generally contain human-derived information. • Human Gene Mutation Database, Cardiff, UK: http://www.hgmd.org Registers known mutations in the human genome and the diseases they cause. • OMIM database (Online Mendelian Inheritance in Man): http://www.ncbi.nlm.nih.gov/Omim The OMIM database contains abstracts and texts describing genetic disorders to support genomics efforts and clinical genetics. It provides gene maps and known-disorder maps in tabular listing formats, and supports keyword search.
  • 55.
    Kyoto Encyclopedia of Genes and Genomes (KEGG): www.genome.jp/kegg/ A database and associated software that integrates several databases, such as: • Pathway database • Genes database • Genome database • Drug database • Reaction database • Compound database • KO database, etc.
  • 56.
    Bibliographic Databases. Used for searching for reference articles. PubMed: 1. It enables users to do keyword searches, provides links to a selection of full articles, and has text-mining capabilities, e.g. it provides links to related articles and GenBank entries, among others. 2. It contains entries for more than 30 million abstracts of scientific publications.
  • 57.
    Database Mining Tools (Analysis Tools). Utilization of the various databases requires suitable search engines and analysis tools. These tools are called database mining tools, and the process of data utilization is known as database mining. Some analysis tools are as follows:
  • 58.
    Analysis Tools (analysis tool: function)
    – BLAST (NCBI, USA): used to analyse sequence information and detect homologous sequences
    – ENTREZ (NCBI, USA): used to access literature (abstracts), sequence, and structure databases
    – DNAPLOT (EBI, UK): sequence alignment tool
    – LOCUS LINK (NCBI, USA): accessing information on homologous genes
    – LIGAND (GenomNet, Japan): a chemical database; allows searching for a combination of enzymes and links to all publicly accessible databases
    – BRITE (GenomNet, Japan): biomolecular relations in information transmission and expression database; links to all publicly accessible databases
    – TAXONOMY BROWSER (NCBI, USA): taxonomic classification of various species as well as genetic information
    – STRUCTURE: supports the Molecular Modelling Database (MMDB) and software tools for structure analysis
  • 59.
    BLAST (Basic Local Alignment Search Tool) for Homology Analyses • BLASTn – nucleotide query vs nucleotide database • BLASTp – protein query vs protein database • BLASTx – automatic 6-frame translation of nucleotide query vs protein database – If you have a DNA sequence and you want to know what protein (if any) it encodes, you can perform a BLASTx search. • tBLASTn – protein query vs automatic 6-frame translation of nucleotide database – You can use this program to ask whether a DNA or EST database contains a nucleotide sequence encoding a protein that matches your protein of interest. • tBLASTx – automatic 6-frame translation of nucleotide query vs automatic 6-frame translation of nucleotide database
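    The "6-frame translation" used by BLASTx, tBLASTn, and tBLASTx can be illustrated directly: three reading frames on the given strand plus three on its reverse complement. The sketch below is not how BLAST is implemented; it only shows the idea, assuming the standard genetic code and a made-up input sequence. Stop codons are written as "*" and codons with unexpected letters as "X".

```python
# Standard genetic code, built from the conventional T, C, A, G codon ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna: str, offset: int) -> str:
    """Translate one reading frame starting at the given offset (0, 1, or 2)."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[pos:pos + 3], "X")
        for pos in range(offset, len(dna) - 2, 3)
    )


def six_frame_translation(dna: str) -> dict:
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(dna, offset)
        frames[f"-{offset + 1}"] = translate_frame(reverse, offset)
    return frames


frames = six_frame_translation("ATGGCCTGGTAA")  # made-up sequence
for name, peptide in sorted(frames.items()):
    print(name, peptide)
```

    tBLASTx translates both the query and the database this way, which is why it compares 6 x 6 = 36 frame combinations and is the most expensive of the five programs.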
  • 60.
    BLAST (Basic Local Alignment Search Tool) for Homology Analyses
    Program   Input     Database   Frame combinations searched
    BLASTn    DNA       DNA        1
    BLASTp    protein   protein    1
    BLASTx    DNA       protein    6
    tBLASTn   protein   DNA        6
    tBLASTx   DNA       DNA        36
  • 61.
  • 62.
    What is Sequence Alignment? A sequence alignment is a way of arranging DNA or protein sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
  • 63.
    Definitions • Similarity: the extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation. • Identity: the extent to which two sequences are invariant. • Conservation: changes at a specific position of an amino acid (or, less commonly, DNA) sequence that preserve the physicochemical properties of the original residue.
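    These three definitions can be made concrete for a pair of already-aligned sequences. The sketch below makes two assumptions that are not from the slides: the groups of "conserved" amino acids are a simplified, hand-picked set (real tools use substitution matrices such as BLOSUM62), and percentages are computed over ungapped columns only (other conventions divide by the full alignment length). The aligned rows are made up.

```python
# Assumed, simplified groups of physicochemically similar amino acids.
CONSERVATIVE_GROUPS = [
    set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ"),
]


def alignment_stats(row_a: str, row_b: str, gap_chars: str = "-.") -> dict:
    """Return percent identity and similarity for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    identical = conserved = aligned = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a in gap_chars or b in gap_chars:
            continue  # gapped columns are not counted as aligned pairs
        aligned += 1
        if a == b:
            identical += 1
        elif any(a in group and b in group for group in CONSERVATIVE_GROUPS):
            conserved += 1
    if aligned == 0:
        raise ValueError("no ungapped columns to compare")
    return {
        "aligned": aligned,
        "identity": 100.0 * identical / aligned,
        "similarity": 100.0 * (identical + conserved) / aligned,
    }


stats = alignment_stats("MKV-LT", "MRIALS")  # made-up aligned rows
print(stats)
```

    Similarity is always at least as large as identity, since it counts identical positions plus conservative substitutions.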
  • 64.
    Types of Sequence Alignment • Pairwise Sequence Alignment (PSA) • Multiple Sequence Alignment (MSA)
  • 65.
    Pairwise Sequence Alignment (PSA) • The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology. • Pairwise sequence alignment is the most fundamental operation of bioinformatics.
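    One classic way to line up two sequences is global alignment by dynamic programming (the Needleman-Wunsch algorithm). The slides do not name a specific algorithm, so the sketch below is offered only as an example of the idea; the scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions, and real protein alignments use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two residues
                score[i - 1][j] + gap,                   # gap in seq_b
                score[i][j - 1] + gap,                   # gap in seq_a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


best_score, row_a, row_b = needleman_wunsch("GATTACA", "GATACA")
print(best_score)
print(row_a)
print(row_b)
```

    The table has (m + 1) x (n + 1) cells, so time and memory grow with the product of the sequence lengths. That cost is why database search tools such as BLAST rely on faster heuristics instead of exhaustive alignment.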
  • 66.
    PSA Importance. PSA is the most fundamental operation in bioinformatics. There are four basic uses of pairwise sequence alignment: 1. to decide whether two proteins or genes are related structurally or functionally 2. to identify domains or motifs shared between proteins 3. as the basis of BLAST searching 4. to analyze genomes. Why pairwise alignment often uses protein sequences: protein sequences can be more informative than DNA.
  • 67.
    PSA of retinol-binding protein 4 (RBP) and β-lactoglobulin. Dots within the sequences mark gaps.
      1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG  50 RBP
      1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD.  44 lactoglobulin
     51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE  97 RBP
     45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK  93 lactoglobulin
     98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
     94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin
    137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
    136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
  • 68.
    Pairwise alignment of retinol-binding protein and β-lactoglobulin (same alignment as on the previous slide). Callout: identity (bar, |).
  • 69.
    Pairwise alignment of retinol-binding protein and β-lactoglobulin (same alignment). Callouts: somewhat similar (one dot); very similar (two dots).
  • 70.
    Pairwise alignment of retinol-binding protein and β-lactoglobulin (same alignment). Callouts: internal gap; terminal gap.
  • 72.
    Sequence Analyses for Relatedness • Homologs: similar sequences in different organisms derived from a common ancestral sequence. • Orthologs: homologous sequences in different related species that arose from a common ancestral gene during speciation. Orthologs are presumed to have similar biological functions, e.g. human and rat myoglobins both transport oxygen in muscle. • Paralogs: homologous genes within the same organism, e.g. human α and β globins. Paralogs are the result of gene duplication events. • Xenologs: similar sequences that have arisen from horizontal transfer events (symbiosis, viruses, etc.).
  • 73.
    Multiple Sequence Alignment • Partial or complete alignment of three or more related protein or nucleotide sequences • Conserved domain analysis • Primer design
  • 74.
    Tools for Multiple Alignment • CLUSTALW • T-Coffee • MUSCLE • KALIGN • CLC & GCG WorkBench
  • 75.
    Various Categories of Analyses 1. Analysis of a single gene (protein) sequence – Similarity with other known genes – Phylogenetic trees; evolutionary relationships – Identification of well-defined domains in the sequence – Sequence features (physical properties, binding sites, modification sites) – Prediction of subcellular localization – Prediction of protein secondary and tertiary structures
  • 76.
    2. Analysis of whole genomes – Location of various genes on the chromosomes, correlation with function or evolution – Expansion/duplication of gene families – Which gene families are present, and which are missing? – Presence or absence of biochemical pathways – Identification of "missing" enzymes – Large-scale events in the evolution of organisms
  • 77.
    3. Analysis of genes and genomes with respect to function (functional annotation) – Transcriptomics: expression analysis; microarray data (mRNA/transcript analyses) – Proteomics: qualitative and quantitative protein analyses, covalent modifications – Comparison and analysis of biochemical pathways – Deletion or mutant genotypes vs phenotypes – Identification of essential genes, or genes involved in specific processes
  • 78.
    4. Comparative genomics • Identifying pathogen-specific unique targets for designing novel drugs.
  • 79.
    Phylogenetic Analysis • Phylogenetic trees aim to reconstruct the history of successive divergence events that took place during evolution between the sequences considered and their common ancestor. • Nucleic acid and protein sequences are used to infer phylogenetic relationships. • Molecular phylogeny methods propose phylogenetic trees from a given set of aligned sequences.
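    Many tree-building methods start from pairwise distances between aligned sequences. The sketch below computes the simplest such measure, the p-distance (the fraction of ungapped columns that differ), for a made-up alignment with hypothetical taxon names. It illustrates only this first step; building the tree itself (for example by UPGMA or neighbor joining) is left to dedicated tools such as those listed in these slides.

```python
from itertools import combinations


def p_distance(row_a: str, row_b: str, gap: str = "-") -> float:
    """Fraction of ungapped aligned positions at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


# Made-up aligned sequences with hypothetical taxon names.
alignment = {
    "taxon_A": "ATGCATGC",
    "taxon_B": "ATGCATGA",
    "taxon_C": "ATGAATAA",
    "taxon_D": "TTGAATAT",
}

distances = {
    (x, y): p_distance(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}
closest = min(distances, key=distances.get)

for pair, distance in distances.items():
    print(pair, distance)
print("closest pair:", closest)
```

    The pair with the smallest distance would be joined first by a clustering method such as UPGMA. The p-distance underestimates divergence for distant sequences, which is why model-based corrections are normally applied.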
  • 80.
    Phylogenetic Analysis Tools • MEGA • PHYLIP • PAUP • TreeView • ODEN • PHYLOWIN • TREECON • DENDRON

Editor's Notes

  • #21 Overview of various subfields of bioinformatics. Biocomputing tool development is at the foundation of all bioinformatics analysis. The applications of the tools fall into three areas: sequence analysis, structure analysis, and function analysis. There are intrinsic connections between different areas of analyses represented by bars between the boxes.
  • #23 Bioinformatics differs from a related field known as computational biology. Bioinformatics is limited to sequence, structural, and functional analysis of genes and genomes and their corresponding products and is often considered computational molecular biology. However, computational biology encompasses all biological areas that involve computation. For example, mathematical modeling of ecosystems, population dynamics, application of the game theory in behavioral studies, and phylogenetic construction using fossil records all employ computational tools, but do not necessarily involve biological macromolecules.
  • #31 Having recognized the power of bioinformatics, it is also important to realize its limitations and avoid over-reliance on and over-expectation of bioinformatics output. In fact, bioinformatics has a number of inherent limitations. Fighting a battle without intelligence is inefficient and dangerous. It is no stretch of the analogy to say that fighting diseases or other biological problems using bioinformatics is like fighting battles with intelligence.