Bioinformatics simple

Bioinformatics
• The science of collecting, analyzing, and conceptualizing biological data by applying informatics techniques.
• Biology + Informatics = Bioinformatics

• Biological data + computer analysis
• Mouse genome: 2.5 billion base pairs
• Human genome: 3 billion base pairs

Goals of Bioinformatics
• Manage biological information
  • Organize biological information using databases
  • Process, analyze, and visualize biological data
  • Share biological information with the public over the Internet

Definition
• Bio – informatics
• Bioinformatics conceptualizes biology in terms of molecules (in the sense of physical chemistry) and applies “informatics” techniques (derived from disciplines such as applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale.
• Bioinformatics is a practical discipline with many applications.

[Diagram: bioinformatics in relation to computational biology, systems biology, and genomics]

Biological Information
• Central Dogma of Molecular Biology
  DNA -> RNA -> Protein -> Phenotype -> DNA
• Molecules
  • Sequence, Structure, Function, Interaction
• Processes
  • Mechanism, Specificity, Regulation
• Central Paradigm for Bioinformatics
  Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Protein Interaction -> Phenotype
• Large Amounts of Information
  • Statistical
  • Computer Processing
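
The DNA -> RNA -> Protein steps of the central dogma can be sketched in a few lines of Python. This is only a minimal illustration: the function names `transcribe` and `translate` and the sample sequence are made up for this example, the input is assumed to be the coding strand, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is keyed by DNA codons
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCAAGTAA"            # hypothetical toy sequence
mrna = transcribe(dna)
print(mrna)                     # AUGGCCAAGUAA
print(translate(mrna))          # MAK
```

Real sequences also need handling for ambiguous bases, alternative genetic codes, and reading-frame choice, which this sketch leaves out.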

Systems Analysis
Information Theory
Graph Theory
Robotics
Algorithms
Artificial Intelligence
Statistics

Domains of bioinformatics
• Bioinformaticists: develop new software and algorithms
• Bioinformaticians: use different algorithms and computer software

Human Genome Project
• Could not have been achieved without bioinformatics
• Goals
  • Determine the sequence of the 3 billion DNA subunits
  • Discover all the human genes
  • Make them accessible for further biological study
• Then?
  • Need to bring together and store vast amounts of information from
    • Lab equipment and experiments
    • Computer analysis
    • Human analysis
  • Make it visible to the world’s scientists

How to analyze information
• Data
  • Management
  • Analysis
  • Derive a hypothesis
  • Design and implement an in silico experiment
  • Confirm in the wet lab

Why bioinformatics
1. Find an answer quickly
  • Most in silico biology is faster than in vitro work
2. Massive amounts of data to analyze
  • Need to make use of all information
  • Not possible to do the analysis by hand
  • Can’t organize and store information using only lab notebooks
  • Automation is key
• However: verification?

Applications
1. Computational biology
  • Computing methods for classical biology
  • Primarily concerned with evolutionary, population, and theoretical biology
  • Cellular/molecular biology?
2. Medical informatics
  • Computing methods to improve communication, understanding, and management of medical data
  • Data manipulation

Applications (continued)
3. Chemoinformatics
  • Chemical and biological technology for drug design and development
4. Genomics
  • Analysis and comparison of the entire genome of a single species or of multiple species
  • Genomics existed before any genomes were completely sequenced, but in a very primitive state

Applications (continued)
5. Proteomics
  • Study of how the genome is expressed in proteins, and of how these proteins function and interact
  • Concerned with the actual states of specific cells, rather than the potential states described by the genome
6. Pharmacogenomics
  • The application of genomic methods to identify drug targets
  • For example, searching entire genomes for potential drug receptors, or studying gene expression patterns in tumors

7. Pharmacogenetics
  • The use of genomic methods to determine what causes variations in individual response to drug treatments
  • The goal is to identify drugs that may be effective only for subsets of patients, or to tailor drugs to specific individuals or groups

The “post-genomics” era
Main goal: ?
• Annotation
• Comparative genomics
• Structural genomics
• Functional genomics
Annotation
• Identify the genes within a given sequence of DNA
• Identify the sites that regulate the gene
• Predict the function

How do we identify a gene in a genome?
• A gene is characterized by several features (promoter, ORF…)
• Some are easier and some harder to detect…
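
As a rough illustration of detecting one of the easier features, the sketch below scans the three forward reading frames for open reading frames (ORFs) that run from an ATG start codon to the first in-frame stop codon. It is a toy under stated assumptions, not a gene finder: the function name `find_orfs`, the `min_codons` threshold, and the example sequence are hypothetical, and promoters, introns, and the reverse strand are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end, sequence) tuples for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon (start codon included).
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # close it and keep scanning
    return sorted(orfs)


print(find_orfs("GGATGAAACCCTAGTT"))
# [(2, 14, 'ATGAAACCCTAG')]
```

In practice a length threshold of two codons is far too permissive; it is kept small here only so the toy sequence produces a hit.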
Comparative genomics

How human are chimps?
Comparison of the full drafts of the human and chimp genomes revealed that they differ by only 1.23%.
Perhaps not surprising!

So where are we different?
Human  ATAGCGGGGGGATGCGGGCCCTATACCC
Chimp  ATAGGGG--GGATGCGGGCCCTATACCC
Mouse  ATAGCG---GGATGCGGCGC-TATACCA
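
One simple way to quantify "where we differ" in an alignment like this is to count matching, mismatching, and gapped columns. The sketch below does that for the three aligned sequences from the slide (with the gap spacing removed). The function names are hypothetical, and computing identity over gap-free columns only is one convention among several.

```python
def compare_aligned(a, b):
    """Return (matches, mismatches, gap_columns) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def percent_identity(a, b):
    """Percent identity over the columns where neither sequence has a gap."""
    matches, mismatches, _ = compare_aligned(a, b)
    compared = matches + mismatches
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared


human = "ATAGCGGGGGGATGCGGGCCCTATACCC"
chimp = "ATAGGGG--GGATGCGGGCCCTATACCC"
mouse = "ATAGCG---GGATGCGGCGC-TATACCA"

print(compare_aligned(human, chimp))   # (25, 1, 2)
print(compare_aligned(human, mouse))   # (21, 3, 4)
print(round(percent_identity(human, mouse), 1))   # 87.5
```

On this short fragment the chimp sequence is closer to the human one than the mouse sequence is, but a 28-column toy example says nothing about genome-wide divergence.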

Structural genomics
The three-dimensional structure of a protein can tell much more than the sequence alone:
• Protein-ligand complexes
• Functional sites
• Fold
• Evolutionary relationship
• Shape and electrostatics
• Active sites
• Protein complexes
• Biological processes

Resources and Databases
The different types of data are collected in databases:
• Sequence databases
• Structural databases
• Databases of experimental results
All databases are connected.

Sequence databases
• Gene database
• Genome database
• Disease-related mutation database

Structure Databases
• Three-dimensional structures of proteins, nucleic acids, molecular complexes, etc.
• 3-D data is available thanks to techniques such as NMR and X-ray crystallography

Databases of Experimental Results
• Experimental data such as microarray images (gene expression data)
• Proteomic data (protein expression data)
• Metabolic pathways, protein-protein interaction data, regulatory networks

Literature Databases
• PubMed
• Service of the National Library of Medicine
• http://www.ncbi.nlm.nih.gov/pubmed/

Putting it all Together
• Each database contains specific information
• Like biological systems, these databases are interrelated
GENOMIC DATA: GenBank, DDBJ, EMBL
ASSEMBLED GENOMES: GoldenPath, WormBase, TIGR
PROTEIN: PIR, SWISS-PROT
STRUCTURE: PDB, MMDB, SCOP
LITERATURE: PubMed
PATHWAY: KEGG, COG
DISEASE: LocusLink, OMIM, OMIA
GENES: RefSeq, AllGenes, GDB
SNPs: dbSNP
ESTs: dbEST, UniGene
MOTIFS: BLOCKS, Pfam, Prosite
GENE EXPRESSION: Stanford MGDB, NetAffx, ArrayExpress

Applications I: Genomics
• Finding Genes in Genomic DNA
  • Introns
  • Exons
  • Promoters
• Characterizing Repeats in Genomic DNA
  • Statistics
  • Patterns
• Expression Analysis
  • Time Course Clustering
  • Identifying Regulatory Regions
  • Measuring Differences
• Genome Comparisons
  • Ortholog Families
  • Genome Annotation
  • Evolutionary Phylogenetic Trees
• Characterizing Intergenic Regions
  • Finding Pseudogenes
  • Patterns
• Duplications in the Genome
  • Large-scale Genomic Alignment
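
A very simple starting point for the "statistics" and "patterns" of repeats is to count how often each substring of length k (a k-mer) occurs. The sketch below is an assumption-laden toy: `count_kmers` and `repeated_kmers` are hypothetical helper names, overlapping occurrences are counted, and real repeat characterization uses dedicated tools and much longer sequences.

```python
from collections import Counter


def count_kmers(dna, k):
    """Count every overlapping substring of length k in the sequence."""
    if k < 1:
        raise ValueError("k must be at least 1")
    dna = dna.upper()
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))


def repeated_kmers(dna, k, min_count=2):
    """Return the k-mers that occur at least min_count times."""
    return {kmer: n for kmer, n in count_kmers(dna, k).items() if n >= min_count}


print(dict(count_kmers("ATATAT", 2)))      # {'AT': 3, 'TA': 2}
print(repeated_kmers("ACGTACGTAA", 4))     # {'ACGT': 2, 'CGTA': 2}
```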

Application II: Protein Sequence
• Sequence Alignment
  • Non-exact string matching, gaps
  • How to align two strings optimally via Dynamic Programming
  • Local vs Global Alignment
  • Suboptimal Alignment
  • Hashing to increase speed (BLAST, FASTA)
  • Amino acid substitution scoring matrices
• Multiple Alignment and Consensus Patterns
  • How to align more than one sequence and then fuse the result in a consensus representation
  • Transitive Comparisons
  • HMMs, Profiles
  • Motifs
• Scoring Schemes and Matching Statistics
  • How to tell if a given alignment or match is statistically significant
  • A P-value (or an E-value)?
  • Score Distributions (extreme value distribution)
  • Low Complexity Sequences
• Evolutionary Issues
  • Rates of mutation and change
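
To make "aligning two strings optimally via dynamic programming" concrete, here is a minimal sketch of global alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name `global_alignment` are assumptions for illustration; protein alignment in practice uses amino acid substitution matrices and usually affine gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one optimal global alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),   # align the two characters
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(global_alignment("ACGT", "AGT"))   # (2, 'ACGT', 'A-GT')
```

The table takes time and memory proportional to the product of the two lengths, which is why database search tools such as BLAST and FASTA use hashing heuristics rather than full dynamic programming against every sequence.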

Application III: Protein Structure
• Secondary Structure “Prediction”
  • Via propensities
  • Neural networks, genetic algorithms
  • Simple statistics
  • Transmembrane regions
  • Assessing secondary structure prediction
• Tertiary Structure Prediction
  • Fold recognition
  • Threading
  • Ab initio
• Function Prediction
  • Active site identification
• Relation of Sequence Similarity to Structural Similarity

Example Application IV: Finding Homologs
Core

Example Application IV: Overall Genome Characterization
• Overall Occurrence of a Certain Feature in the Genome
  • e.g. how many kinases in yeast
• Compare Organisms and Tissues
  • Expression levels in cancerous vs normal tissues
• Databases, Statistics

Thanks
