This document provides an overview of bioinformatics. It defines bioinformatics as the science of collecting, analyzing and conceptualizing biological data through computational techniques. It discusses that bioinformatics involves managing, organizing and processing biological information from databases, as well as analyzing, visualizing and sharing biological data over the internet. It also outlines some of the goals of bioinformatics like organizing the human and mouse genomes, as well as some applications like genomic and protein sequence analysis, protein structure prediction, and characterizing genomes.
Introduction to bioinformatics, which combines biological data with informatics techniques for analysis and conceptualization.
Discusses biological data quantities, including human genome size (3 billion base pairs), and goals of bioinformatics such as data management and visualization.
Bioinformatics defined through molecular biology aspects and central dogma, highlighting data processing and statistical analysis in biological contexts.
Details computational methods in bioinformatics, including algorithm development and the human genome project, emphasizing data integration and accessibility.
Information analysis processes in bioinformatics, focusing on data management, hypothesis derivation, and the significance of quick analysis.
Different domains of bioinformatics applications including computational biology, medical informatics, and drug development.
Main goals in the post-genomics era, including gene annotation and prediction of gene functions from DNA sequences.
Assesses methods for gene identification within genomes and comparative genomics for evolutionary studies.
Focus on structural genomics, emphasizing protein structures, functionality, and evolutionary relationships derived from 3D structures.
Different biological databases including sequence, structure, and literature databases, and their interrelation for data accessibility.
Applications in genomics, including gene finding, characterization of genomic elements, and comparative analyses across species.
Applications in protein sequence analysis including alignment methods, prediction of secondary and tertiary structures, and evolutionary implications.
Overall characterization of genomes, including expression analysis, comparisons among organisms, and statistical evaluations.
Closure and thanks, likely summarizing the significance of bioinformatics in modern biological research.
Science of collecting, analyzing and conceptualizing
biological data by applying informatics techniques.
Bioinformatics
Biology + Informatics = Bioinformatics
Manage biological information
Organize biological information using databases
Process, analyze, and visualize biological data
Share biological information to the public using the Internet.
Goals of Bioinformatics
Bio-informatics
Bioinformatics is conceptualizing biology in terms of
molecules (in the sense of physical-chemistry)
applying “informatics” techniques (derived from
disciplines such as applied math, CS, and statistics)
to understand and organize the information
associated with these molecules, on a large scale.
Bioinformatics is a practical discipline with many
applications.
Definition
Biological Information
Central Dogma of Molecular Biology
DNA
-> RNA
-> Protein
-> Phenotype
-> DNA
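As a rough illustration of the DNA -> RNA -> Protein flow above, the sketch below transcribes a coding-strand DNA string and translates it with the standard genetic code. The function names `transcribe` and `translate` and the example sequence are illustrative assumptions, not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon (*)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")  # hypothetical example sequence
print(mrna)                     # AUGGCCUAA
print(translate(mrna))          # MA
```

The sketch reads a single frame from the first base and ignores splicing, so it only mirrors the simplest reading of the diagram.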
Molecules
Sequence, Structure, Function,
Interaction
Processes
Mechanism, Specificity,
Regulation
Central Paradigm
for Bioinformatics
Genomic Sequence Information
-> mRNA (level)
-> Protein Sequence
-> Protein Structure
-> Protein Function
-> Protein Interaction
-> Phenotype
Large Amounts of Information
Statistical
Computer Processing
Could not have been achieved without bioinformatics
Goals
3 billion DNA subunits
Discover all the human genes
Make them accessible for further biological study
then ?
Need to bring together and store vast amounts of information from:
Lab equipment and experiments
Computer Analysis
Human Analysis
Make visible to the world's scientists
Human genome project
How to analyze information
Data
–Management.
–Analysis.
–Derive Hypothesis.
–Design and Implement an in silico experiment.
–Confirm in the wet lab.
1. Find an answer quickly
Most in silico biology is faster than in vitro
2. Massive amounts of data to analyze
Need to make use of all information
Not possible to do analysis by hand
Can't organize and store information using only lab notebooks
Automation is key
However!
Verification ?
Why bioinformatics
1. Computational biology
Computing methods for classical biology
Primarily concerned with evolutionary, population and
theoretical biology
Cellular/Molecular biology ?
2. Medical informatics
Computing methods to improve communication,
understanding, and management of medical data
Data Manipulation
Applications
3. Chemoinformatics
Chemical and biological technology, for drug design
and development
4. Genomics
Analysis and comparison of the entire genome of a
single species or of multiple species
Genomics existed before any genomes were
completely sequenced, but in a very primitive state
5. Proteomics
Study of how the genome is expressed in proteins, and of
how these proteins function and interact
Concerned with the actual states of specific cells, rather
than the potential states described by the genome
6. Pharmacogenomics
The application of genomic methods to identify drug
targets
For example, searching entire genomes for potential drug
receptors, or by studying gene expression patterns in
tumors
7. Pharmacogenetics
The use of genomic methods to determine what
causes variations in individual response to drug
treatments
The goal is to identify drugs that may only be
effective for subsets of patients, or to tailor drugs for
specific individuals or groups
Comparison between the full drafts of the human and chimp
genomes revealed that they differ only by 1.23%
How human are chimps?
Perhaps not surprising!!!
So where are we different?
Human ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp ATAGGGG--GGATGCGGGCCCTATACCC
Mouse ATAGCG---GGATGCGGCGC-TATACCA
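One way to put numbers on the alignment above is to count matching, mismatching and gapped columns between pairs of rows. The sketch below is only an illustration: the helper name `compare_aligned` is an assumption, and the three strings are the slide's rows with the spacing around the gap characters removed.

```python
def compare_aligned(seq_a, seq_b, gap="-"):
    """Count matches, mismatches and gap columns in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            gaps += 1          # column where either sequence has a gap
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


human = "ATAGCGGGGGGATGCGGGCCCTATACCC"
chimp = "ATAGGGG--GGATGCGGGCCCTATACCC"
mouse = "ATAGCG---GGATGCGGCGC-TATACCA"

print(compare_aligned(human, chimp))  # (25, 1, 2)
print(compare_aligned(human, mouse))  # (21, 3, 4)
```

On this short toy fragment the human row is closer to the chimp row than to the mouse row; the fragment is far too small to say anything about genome-wide percentages.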
The protein three-dimensional structure can tell
much more than the sequence alone
Protein-ligand complexes
Functional sites
Fold
Evolutionary relationship
Shape and electrostatics
Active sites
protein complexes
Biological processes
The different types of data are collected in databases
Sequence databases
Structural databases
Databases of Experimental Results
All databases are connected
Resources and Databases
3-dimensional structures of proteins, nucleic acids,
molecular complexes etc
3-D data are available thanks to techniques such as NMR
and X-ray crystallography
Structure Databases
Data such as experimental microarray images: gene
expression data
Proteomic data: protein expression data
Metabolic pathways, protein-protein interaction
data, regulatory networks
Databases of Experimental
Results
PubMed
Service of the National Library of Medicine
http://www.ncbi.nlm.nih.gov/pubmed/
Literature Databases
Each database contains specific information
Like other biological systems, these databases are also
interrelated
Putting it all Together
Application I: Genomics
Finding Genes in Genomic DNA
introns
exons
Promoters
Characterizing Repeats in Genomic DNA
Statistics
Patterns
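A very simple way to gather statistics on repeated patterns in a DNA string is to count overlapping k-mers. The sketch below is an illustration only: `count_kmers`, `repeated_kmers` and the example sequence are assumptions, and real repeat characterization relies on far more sophisticated methods.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def repeated_kmers(sequence, k, min_count=2):
    """Return (k-mer, count) pairs seen at least min_count times."""
    return [
        (kmer, n)
        for kmer, n in count_kmers(sequence, k).most_common()
        if n >= min_count
    ]


dna = "ATATATGCGC"  # hypothetical example sequence
print(repeated_kmers(dna, 2))
```

For the example string, the dinucleotide AT occurs three times, while TA and GC occur twice each.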
Expression Analysis
Time Course Clustering
Identifying regulatory Regions
Measuring Differences
• Genome Comparisons
Ortholog Families
Genome annotation
Evolutionary Phylogenetic
trees
• Characterizing Intergenic
Regions
Finding Pseudogenes
Patterns
• Duplications in the Genome
Large scale genomic
alignment
Application II: Protein Sequence
Sequence Alignment
non-exact string matching,
gaps
How to align two strings
optimally via Dynamic
Programming
Local vs Global Alignment
Suboptimal Alignment
Hashing to increase speed
(BLAST, FASTA)
Amino acid substitution
scoring matrices
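The dynamic programming idea for optimal global alignment of two strings can be sketched as below, in the style of the Needleman-Wunsch algorithm. This is a minimal sketch under stated assumptions: the function name and the match, mismatch and linear gap scores are illustrative choices, where a real protein alignment would use an amino acid substitution matrix.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,               # gap in b
                score[i][j - 1] + gap,               # gap in a
            )
    return score[-1][-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7
print(global_alignment_score("ACGT", "ACT"))         # 1
```

A local alignment variant would additionally clamp each cell at zero and report the maximum cell rather than the bottom-right corner.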
Multiple Alignment and
Consensus Patterns
How to align more than one
sequence and then fuse the
result in a consensus
representation
Transitive Comparisons
HMMs, Profiles
Motifs
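Once several sequences are aligned, one simple consensus representation takes the most common residue in each column. The sketch below assumes the alignment is already given; the function name `consensus` and the toy sequences are illustrative, and profiles or HMMs keep much more information than this single string.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Return the most common non-gap character of each alignment column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    result = []
    for column in zip(*aligned):
        counts = Counter(c for c in column if c != gap)
        # A column made only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)


print(consensus(["ATGC", "ATGA", "ATCC"]))  # ATGC
```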
Scoring schemes and
Matching statistics
How to tell if a given
alignment or match is
statistically significant
A P-value (or an E-value)?
Score Distributions
(extreme value distribution)
Low Complexity Sequences
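For ungapped local alignments, the extreme value statistics mentioned above are commonly summarized by the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), with the P-value given by 1 - exp(-E). The sketch below simply evaluates these two formulas; the function names are assumptions, and the `lam` and `k` values in the example are hypothetical placeholders, since real values depend on the scoring matrix and gap penalties in use.

```python
import math


def e_value(score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least `score`."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one chance hit, given the expected count e."""
    return -math.expm1(-e)  # equals 1 - exp(-e), accurate for small e


# Hypothetical parameters, for illustration only.
e = e_value(score=60, query_len=300, db_len=1_000_000, lam=0.3, k=0.1)
print(e, p_value(e))
```

For small E the two quantities are nearly equal, which is why tools can report either one for strong hits.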
Evolutionary Issues
Rates of mutation and change
Application III: Protein Structure
Secondary Structure
“Prediction”
via Propensities
Neural Networks, Genetic Algorithms
Simple Statistics
Transmembrane Regions
Assessing Secondary Structure
Prediction
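A classic propensity-style heuristic for spotting candidate transmembrane regions is a sliding-window average of per-residue hydropathy. The sketch below uses the Kyte-Doolittle hydropathy scale; the function names, the window of 19 residues and the threshold of 1.6 are assumptions chosen as commonly quoted defaults, not values taken from the slides.

```python
# Kyte-Doolittle hydropathy values per amino acid (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(protein, window=19, threshold=1.6):
    """Start positions of windows whose mean hydropathy exceeds threshold."""
    profile = hydropathy_profile(protein, window)
    return [i for i, mean in enumerate(profile) if mean > threshold]


toy_protein = "L" * 19 + "D" * 19  # hypothetical sequence, not a real protein
print(candidate_tm_windows(toy_protein))
```

Windows that pass the threshold are only candidates; assessing such predictions against known structures is a separate step.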
Tertiary Structure Prediction
Fold Recognition
Threading
Ab initio
Function Prediction
Active site identification
Relation of Sequence Similarity to
Structural Similarity
Overall Occurrence of a
Certain Feature in the
Genome
e.g. how many kinases in
Yeast
Compare Organisms and
Tissues
Expression levels in
Cancerous vs Normal
Tissues
Databases, Statistics
Example Application IV:
Overall Genome Characterization
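The cancerous-versus-normal expression comparison listed above is often summarized per gene as a log2 fold change. The sketch below is a minimal illustration: the function name, the pseudocount and the gene names and expression values are all hypothetical, and a real analysis would add normalization and statistical testing.

```python
import math


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2 ratio of tumor to normal expression; pseudocount avoids log(0)."""
    return math.log2((tumor + pseudocount) / (normal + pseudocount))


# Hypothetical expression levels: gene -> (tumor, normal).
expression = {
    "geneA": (7.0, 1.0),
    "geneB": (3.0, 3.0),
    "geneC": (0.0, 15.0),
}

for gene, (tumor, normal) in expression.items():
    print(gene, log2_fold_change(tumor, normal))
```

Positive values indicate higher expression in the cancerous sample, negative values lower expression, and zero no change.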