DATABASES IN BIOINFORMATICS
Introduction
 Rapid growth in biological information
 Biology has become a data-rich science
 Gene sequences
 Amino acid sequences in proteins
 Motifs and domains in proteins
 Structural data from XRD & NMR
 Metabolic pathways
 Protein-protein interactions
 Gene expression data from DNA microarrays
Biological databases
 A biological database is a collection of data that is structured, searchable, periodically updated, and cross-referenced.
 Some databases are multifunctional
 The major purposes of databases are:
 Availability of biological data
 Systematization of data
 Analysis of computed biological data
History
 1956; first sequence database when insulin
was sequenced
 51 amino acids
 Atlas of protein sequences and structures in
1965 by Margaret Day Hoff et al was a
printed book.
 Became base for PIR protein information
resource
 First nucleotide sequence: yeast tRNA
 77 bases
 During this time 3D structure of proteins was
being studied and renowned PDB was made.
…
 The first published genome of a free-living organism was that of the bacterium Haemophilus influenzae, in 1995
 Genome?
 All genes? Or all DNA?
 Why are complete genomes interesting?
Aspects of genome analysis
 Ab initio gene prediction
 Locus
 Gene identification by EST (expressed sequence tags)
 Gene prediction via EST
 Gene prediction via comparison (coding and regulatory regions)
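Ab initio gene prediction works from the sequence alone, without ESTs or a comparison genome. The sketch below is only a toy illustration of that idea: it scans the forward strand for open reading frames (an ATG followed in frame by a stop codon). Real ab initio predictors rely on statistical models; the function name `find_orfs`, the `min_codons` threshold, and the example sequence are assumptions made up for this illustration.

```python
# Naive ORF scan: a toy stand-in for ab initio gene prediction.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, frame) for each ORF on the forward strand.

    start/end are 0-based and end-exclusive; the span includes the stop codon.
    min_codons counts codons from the start codon up to (not including) the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Made-up example sequence: ATG AAA TTT GGG TAA embedded in frame 2.
example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))  # -> [(2, 17, 2)]
```

The reverse strand, overlapping ORFs, and introns are deliberately ignored here to keep the sketch short.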
Features of biological databases
1) Data heterogeneity
2) High volume data
3) Uncertainty
4) Data Curation
5) Large scale data integration
6) Data sharing
7) Dynamic and subject to change
Classification scheme for biological databases
Data type
Maintenance status
Data access
Data source
Database design
Organism
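One way to picture the six classification dimensions is as fields of a small record, so that databases can be grouped along any one of them. This is a minimal sketch only: the class `DatabaseProfile`, the helper `filter_by`, and the field values are illustrative assumptions, and anything the slides do not state is left as "unspecified".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseProfile:
    """One database described along the six classification dimensions."""
    name: str
    data_type: str = "unspecified"
    maintenance_status: str = "unspecified"  # here: the maintaining institution
    data_access: str = "unspecified"
    data_source: str = "unspecified"         # primary or secondary
    database_design: str = "unspecified"
    organism: str = "unspecified"


def filter_by(profiles, **criteria):
    """Return the profiles whose fields equal every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, field) == value for field, value in criteria.items())
    ]


# Illustrative entries; values not given in the slides stay "unspecified".
profiles = [
    DatabaseProfile("GenBank", data_type="nucleotide sequence",
                    maintenance_status="NCBI", data_access="publicly available",
                    data_source="primary"),
    DatabaseProfile("DDBJ", data_type="nucleotide sequence",
                    data_access="publicly available", data_source="primary"),
    DatabaseProfile("Swiss-Prot", data_type="protein sequence",
                    maintenance_status="SIB/EBI"),
]

print([p.name for p in filter_by(profiles, data_type="nucleotide sequence")])
```

Filtering on `data_type` here returns GenBank and DDBJ; the same call works for any other dimension.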
Data type
 Genome database
 Sequence database
 Structure database
 Microarray database
 Chemical database
 Pathway database
 Enzyme database
 Disease database
 Literature database
Based on maintenance status
NCBI, EMBL, SIB
Based on data access
1) Publicly available
2) Available with copyright
3) Browsing only, accessible but not
downloadable
4) Academic but not freely available
5) Proprietary commercial
6) Restricted
Based on data sources
Primary databases
 Contains original data from the
researchers
 Public or open access mostly
 GenBank (NCBI)
 EMBL
 SWISS-PROT
 NDB
Secondary databases
 Derived from entries in primary databases
 Manually created or automatically generated
 Swiss-Prot, being manually curated, is often given as an example of a secondary database
Thanks…
Biological sequence databases
Lecture # 5
By: Hira Shahzad
DDBJ
 DNA Data Bank of Japan
 Nucleotide sequence database
 Established in 1986
 Works in collaboration with EMBL and NCBI
 After 20 years, a further collaborative project, the INSDC (International Nucleotide Sequence Database Collaboration), was formed
 INSDC members: EMBL, GenBank, DDBJ
SWISS-PROT
 Protein sequence database
 Maintained by the Swiss Institute of Bioinformatics (SIB) in Switzerland together with the European Bioinformatics Institute (EBI)
 The output format is the Swiss-Prot file format
 This format is explained under molecular file formats
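To make the Swiss-Prot file format concrete: each line of a record starts with a two-letter line code (ID, AC, DE, SQ, ...), the sequence follows the SQ line on indented lines, and `//` ends the record. The sketch below reads those few line types from a single record. The function name `parse_swissprot_record` and the record itself (entry name, accession numbers, sequence) are made up for illustration, and most real line types are ignored.

```python
def parse_swissprot_record(text):
    """Extract ID, accessions, description and sequence from one record."""
    fields = {}
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):       # record terminator
            break
        code, data = line[:2], line[5:].strip()
        if in_sequence:
            # Sequence lines are indented and split into blocks by spaces.
            seq_parts.append(data.replace(" ", ""))
        elif code == "SQ":
            in_sequence = True          # sequence starts on the next line
        elif code.strip():
            fields.setdefault(code, []).append(data)
    accessions = [
        acc.strip()
        for ac_line in fields.get("AC", [])
        for acc in ac_line.split(";")
        if acc.strip()
    ]
    return {
        "id": fields["ID"][0].split()[0],
        "accessions": accessions,
        "description": " ".join(fields.get("DE", [])),
        "sequence": "".join(seq_parts),
    }


# Hypothetical record, not a real database entry.
RECORD = """\
ID   DEMO_HUMAN              Reviewed;          12 AA.
AC   X00001; X00002;
DE   RecName: Full=Hypothetical demo protein;
SQ   SEQUENCE   12 AA;  1234 MW;  0000000000000000 CRC64;
     MKTAYIAKQR QI
//
"""

entry = parse_swissprot_record(RECORD)
print(entry["id"], entry["accessions"], len(entry["sequence"]))
```

For real work an established parser such as the one in Biopython would be the safer choice; this sketch only shows how the line codes structure a record.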
Good luck
