KEMBAR78
Bioinformatic_Databases and Sequence Analysis | PPT
Biological Databases
Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb,BBDB, BCGD,Beanref,Biolmage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD4OLbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Picty_cDB, DIP, DOGS, DOMO, DPD, DPlnteract, ECDC,
ECGC, EC02DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, Genbank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HlVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP5
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, 0-lycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnaiRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene,Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-
MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc .................. !!!!
What we expect from a database..!!
• Sequence, functional, structural information,
related bibliography
• Well Structured and Indexed
• Well cross-referenced (with other databases)
• Periodically updated
• Tools for analysis and visualization
PRIMARY DATABASES
 Contains bio-molecular data in its original form.
 Experimental results are submitted directly into the database by researchers, and
the data are essentially archival in nature.
 Once given a database accession number, the data in primary databases are
never changed.
 Examples :- GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT
and PIR for protein sequences and PDB for molecular structures.
Biological Sequence Databases
• Sequence databases
• Structure databases
Nucleotide databases
• International Nucleotide Sequence
Database Collaboration (INSDC)
– NCBI
– EMBL
– DDBJ
Standard contents of a sequence
database
• Sequences
• Accession number
• References
• Taxonomic data
• Annotation/curation
• Keywords
• Cross-references
• Documentation
Genbank
• An annotated collection of all publicly
available nucleotide and proteins
• Set up in 1979 at the LANL (Los Alamos).
• Maintained since 1992 NCBI (Bethesda).
NCBI
• Very comprehensive biological database
• GENBANK: The nucleotide sequence database
• Provides 42 different resource
• Provides a simple and easy to use web
interface
http://www.ncbi.nlm.nih.gov/
• Sequence submission: done using Bankit or
Sequin
• Search Engine for data retrieval: Entrez
• Retrieves information across all the resources
under NCBI
Example: PubMed, taxonomy, SNP, PubChem
etc.
Tools for analysis
• BLAST
• Primer-BLAST
• ORF finder
• Genome workbench
GenBank file format
Protein Sequence databases
• UniProt
• PFAM
• Prosite
• Motif scan
UniProt
• Universal Protein Resource
• Formed through the merger of :
– SIB
– EBI-SwissProt
– TrEMBL
– PIR
Uniprot features
• Blast
• Align
• Retrieve
• ID mapping
Structural databases
• PDB – Protein Data Bank
• CATH
• SCOP – Structural Classification of Proteins
Protein DataBank (PDB)
• Important in solving real problems in molecular
biology
• Protein Databank
• PDB Established in 1972 at Brookhaven National
Laboratory (BNL)
• Sole international repository of macromolecular
structure data
• Moved to Research Collaboratory
for Structural Bioinformatics
http://www.rcsb.org/
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14EMARK 3 AUTHORS
HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
wwPDB
• Contains information about experimentally
determined structures of proteins, nucleic
acids, and complex assemblies
• RCSB-PDB, PDBe, PDBj, BMRB – repositories of
protein structure data
• Files in PDB, mmCIF, PDBML/XML formats
• Advanced search – provides comprehensive
information about a protein.
• Sequence info, domain info, sequence
similarity, literature, apart from the details of
the structure.
• Cross referenced to SCOP and CATH
CATH
• Classification of proteins based on domain
structures
• Each protein chopped into individual domains
and assigned into homologous superfamilies.
• Hierarchial domain classification of PDB
entries.
CATH hierarchy
• Class – derived from secondary structure content is assigned
automatically
• Architecture – describes gross orientation of secondary
structures, independent of connectivity
• Topology – clusters structures according to their
topological connections and numbers of secondary
structures
• Homologous superfamily – this level groups
together protein domains which are thought to
share a common ancestor and can therefore be
described as homologous
SCOP
• Description of structural and evolutionary
relationships between all the proteins with
known structures
• Uses the PDB entries
• Search using keywords or PDB identifiers
Hierarchy in SCOP
• Class
• Fold
• Superfamily
• Family
• Species
SECONDARY DATABASES
Contains data derived from the results of
analysing primary data
Manually created or automatically generated
Contains more relevant and useful information
structured to specific requirements
Example :- PROSITE, PRINTS, BLOCKS, Pfam
Pfam
• Proteins contain conserved regions
• Based on the conserved regions, proteins are
classified into families
• Provides links to external databases like PDB,
SCOP, CATH etc.
Pfam: Features
• Sequence search
• View Pfam family
• View a clan
• View a sequence
• View a structure
• Keyword search
PROSITE http://ca.expasy.org/
prosite/
BLOCKS
COMPOSITE DATABASES
 Collection of various primary database sequences
 Renders sequence searching highly efficient as it
searches multiple resources
 Examples :- NRDB (Non Redundant Database), OWL,
MIPSX, SWISS PROT + TrEMBL
Thank you

Bioinformatic_Databases and Sequence Analysis

  • 1.
  • 2.
    Some databases inthe field of molecular biology… AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb,BBDB, BCGD,Beanref,Biolmage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD4OLbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Picty_cDB, DIP, DOGS, DOMO, DPD, DPlnteract, ECDC, ECGC, EC02DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, Genbank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HlVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP5 Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, 0-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnaiRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene,Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS- MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc .................. !!!!
  • 3.
    What we expectfrom a database..!! • Sequence, functional, structural information, related bibliography • Well Structured and Indexed • Well cross-referenced (with other databases) • Periodically updated • Tools for analysis and visualization
  • 4.
    PRIMARY DATABASES  Containsbio-molecular data in its original form.  Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.  Once given a database accession number, the data in primary databases are never changed.  Examples :- GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT and PIR for protein sequences and PDB for molecular structures.
  • 5.
    Biological Sequence Databases •Sequence databases • Structure databases
  • 6.
    Nucleotide databases • InternationalNucleotide Sequence Database Collaboration (INSDC) – NCBI – EMBL – DDBJ
  • 7.
    Standard contents ofa sequence database • Sequences • Accession number • References • Taxonomic data • Annotation/curation • Keywords • Cross-references • Documentation
  • 8.
    Genbank • An annotatedcollection of all publicly available nucleotide and proteins • Set up in 1979 at the LANL (Los Alamos). • Maintained since 1992 NCBI (Bethesda).
  • 9.
    NCBI • Very comprehensivebiological database • GENBANK: The nucleotide sequence database • Provides 42 different resource • Provides a simple and easy to use web interface http://www.ncbi.nlm.nih.gov/
  • 10.
    • Sequence submission:done using Bankit or Sequin • Search Engine for data retrieval: Entrez • Retrieves information across all the resources under NCBI Example: PubMed, taxonomy, SNP, PubChem etc.
  • 11.
    Tools for analysis •BLAST • Primer-BLAST • ORF finder • Genome workbench
  • 12.
  • 14.
    Protein Sequence databases •UniProt • PFAM • Prosite • Motif scan
  • 15.
    UniProt • Universal ProteinResource • Formed through the merger of : – SIB – EBI-SwissProt – TrEMBL – PIR
  • 16.
    Uniprot features • Blast •Align • Retrieve • ID mapping
  • 17.
  • 18.
    • PDB –Protein Data Bank • CATH • SCOP – Structural Classification of Proteins
  • 19.
    Protein DataBank (PDB) •Important in solving real problems in molecular biology • Protein Databank • PDB Established in 1972 at Brookhaven National Laboratory (BNL) • Sole international repository of macromolecular structure data • Moved to Research Collaboratory for Structural Bioinformatics http://www.rcsb.org/
  • 20.
    PDB: example HEADER LYASE(OXO-ACID)01-OCT-91 12CA 12CA 2 COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3 SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5 AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6 REVDAT 1 15-OCT-92 12CA 0 12CA 7 JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8 JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9 JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10 JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11 JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12 JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13 REMARK 1 12CA 14EMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20 REMARK 3 R VALUE 0.170 12CA 21 REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22 REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23 REMARK 4 12CA 24 REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25 REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26 REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27 ………
  • 21.
    wwPDB • Contains informationabout experimentally determined structures of proteins, nucleic acids, and complex assemblies • RCSB-PDB, PDBe, PDBj, BMRB – repositories of protein structure data • Files in PDB, mmCIF, PDBML/XML formats
  • 22.
    • Advanced search– provides comprehensive information about a protein. • Sequence info, domain info, sequence similarity, literature, apart from the details of the structure. • Cross referenced to SCOP and CATH
  • 23.
    CATH • Classification ofproteins based on domain structures • Each protein chopped into individual domains and assigned into homologous superfamilies. • Hierarchial domain classification of PDB entries.
  • 24.
    CATH hierarchy • Class– derived from secondary structure content is assigned automatically • Architecture – describes gross orientation of secondary structures, independent of connectivity • Topology – clusters structures according to their topological connections and numbers of secondary structures • Homologous superfamily – this level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous
  • 25.
    SCOP • Description ofstructural and evolutionary relationships between all the proteins with known structures • Uses the PDB entries • Search using keywords or PDB identifiers
  • 26.
    Hierarchy in SCOP •Class • Fold • Superfamily • Family • Species
  • 27.
    SECONDARY DATABASES Contains dataderived from the results of analysing primary data Manually created or automatically generated Contains more relevant and useful information structured to specific requirements Example :- PROSITE, PRINTS, BLOCKS, Pfam
  • 28.
    Pfam • Proteins containconserved regions • Based on the conserved regions, proteins are classified into families • Provides links to external databases like PDB, SCOP, CATH etc.
  • 29.
    Pfam: Features • Sequencesearch • View Pfam family • View a clan • View a sequence • View a structure • Keyword search
  • 30.
  • 31.
  • 32.
    COMPOSITE DATABASES  Collectionof various primary database sequences  Renders sequence searching highly efficient as it searches multiple resources  Examples :- NRDB (Non Redundant Database), OWL, MIPSX, SWISS PROT + TrEMBL
  • 34.

Editor's Notes

  • #6 Each database exchange data every day. Each database has its own sequence submission and retrieval tools They follow a standardized annotation The Collaboration created a Feature Table Definition that outlines legal features and syntax
  • #9 Currently, NCBI receives and processes about 20,000 direct submission sequences per month, in addition to the approximately 200,000 bulk submissions that are processed automatically. Collaboration with EMBL and DDBJ
  • #10 Database continues to grow at exponential rate. Doubling in size every 10 months Has sequences of 250,000 distinct organisms
  • #11 All tools can be downloaded and used on your local workstations as standalone.