Bioinformatic Databases and Sequence Analysis

Some databases in the field of molecular biology…
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
GCRDB, GDB, GENATLAS, GenBank, GeneCards,
Genline, GenLink, GENOTK, GenProtEC, GIFTS,
GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP,
Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us,
MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase,
OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB,
PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD,
PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE,
SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase,
SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D,
SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE,
SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB,
TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE,
VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD,
YPM, etc.

What we expect from a database
• Sequence, functional, structural information,
related bibliography
• Well structured and indexed
• Well cross-referenced (with other databases)
• Periodically updated
• Tools for analysis and visualization

PRIMARY DATABASES
• Contain biomolecular data in its original form.
• Experimental results are submitted directly into the database by researchers, and
the data are essentially archival in nature.
• Once assigned, a database accession number is a permanent identifier for the entry;
later revisions by the submitter are tracked as new versions of the same record.
• Examples: GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT
and PIR for protein sequences, and PDB for molecular structures.

Biological Sequence Databases
• Sequence databases
• Structure databases

Nucleotide databases
• International Nucleotide Sequence
Database Collaboration (INSDC)
– NCBI (GenBank)
– EMBL-EBI (EMBL nucleotide database, now ENA)
– DDBJ

Standard contents of a sequence
database
• Sequences
• Accession number
• References
• Taxonomic data
• Annotation/curation
• Keywords
• Cross-references
• Documentation

GenBank
• An annotated collection of all publicly
available nucleotide sequences and their protein translations
• Set up in 1979 at LANL (Los Alamos).
• Maintained since 1992 by NCBI (Bethesda).

NCBI
• Very comprehensive collection of biological databases
• GenBank: the nucleotide sequence database
• Provides 42 different resources
• Provides a simple, easy-to-use web
interface
http://www.ncbi.nlm.nih.gov/

• Sequence submission: done using BankIt or
Sequin
• Search engine for data retrieval: Entrez
• Retrieves information across all the resources
under NCBI
Examples: PubMed, Taxonomy, SNP, PubChem,
etc.
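
Entrez can also be queried programmatically through NCBI's E-utilities web service. The sketch below only builds the request URLs for the ESearch and EFetch endpoints and does not contact the network. The function names and the example search term are illustrative assumptions, not part of the original slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that looks up record IDs matching `term` in `db`.

    Hypothetical helper for illustration; it returns a URL string only.
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db, ids, rettype="fasta"):
    """Build an EFetch URL that retrieves the records with the given IDs."""
    query = urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


if __name__ == "__main__":
    # Example search term chosen for illustration only.
    print(esearch_url("nucleotide", "carbonic anhydrase II", retmax=5))
    # "EXAMPLE_ID" is a placeholder, not a real accession.
    print(efetch_url("nucleotide", ["EXAMPLE_ID"]))
```

Keeping URL construction separate from the actual HTTP request makes the query parameters (`db`, `term`, `id`, `rettype`) easy to inspect before anything is sent to NCBI.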

Tools for analysis
• BLAST
• Primer-BLAST
• ORFfinder
• Genome Workbench
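
To give a feel for what an ORF-finding tool does, here is a minimal sketch that scans the three forward reading frames of a DNA string for ATG...stop open reading frames. It is not the NCBI ORFfinder algorithm: it assumes the standard stop codons, ignores the reverse strand and alternative genetic codes, and the function name and `min_codons` parameter are made up for illustration.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    `start` is the index of the ATG, `end` is exclusive and includes the
    stop codon, and `frame` is 0, 1 or 2. `min_codons` counts the codons
    before the stop codon, including the initial ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame one codon at a time.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATTTTAGGG"
    for start, end, frame in find_orfs(seq):
        print(frame, seq[start:end])
```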

Protein Sequence databases
• UniProt
• Pfam
• PROSITE
• Motif scan

UniProt
• Universal Protein Resource
• Maintained by the UniProt Consortium:
– EBI
– SIB
– PIR
• Formed through the merger of:
– Swiss-Prot
– TrEMBL
– PIR-PSD

UniProt features
• BLAST
• Align
• Retrieve
• ID mapping

Protein structure databases
• PDB – Protein Data Bank
• CATH – Class, Architecture, Topology, Homologous superfamily
• SCOP – Structural Classification of Proteins

Protein Data Bank (PDB)
• Important for solving real problems in molecular
biology
• Established in 1971 at Brookhaven National
Laboratory (BNL)
• Sole international repository of macromolecular
structure data
• Now managed by the Research Collaboratory
for Structural Bioinformatics (RCSB)
http://www.rcsb.org/

PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
………
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
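
Header records like those above can be grouped by record name with a few lines of Python. This is a simplified sketch: the PDB format is fixed-column, with the record name in columns 1-6, but because the column layout of the example above was collapsed, the code splits on whitespace instead. The function name is a made-up helper, not part of any PDB library.

```python
from collections import defaultdict


def group_pdb_records(lines):
    """Group PDB header lines by record name (first whitespace-delimited token).

    Returns a dict mapping each record name (HEADER, COMPND, REMARK, ...)
    to the list of remaining text on each line of that type, in file order.
    """
    records = defaultdict(list)
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        text = parts[1].strip() if len(parts) > 1 else ""
        records[name].append(text)
    return dict(records)


if __name__ == "__main__":
    sample = [
        "HEADER LYASE(OXO-ACID) 01-OCT-91 12CA",
        "AUTHOR S.K.NAIR,D.W.CHRISTIANSON",
        "REMARK 3 R VALUE 0.170",
        "REMARK 3 RMSD BOND ANGLES 1.3 DEGREES",
    ]
    for name, texts in group_pdb_records(sample).items():
        print(name, texts)
```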

wwPDB
• Contains information about experimentally
determined structures of proteins, nucleic
acids, and complex assemblies
• RCSB-PDB, PDBe, PDBj, BMRB – repositories of
protein structure data
• Files in PDB, mmCIF, PDBML/XML formats

• Advanced search – provides comprehensive
information about a protein.
• Sequence info, domain info, sequence
similarity, literature, apart from the details of
the structure.
• Cross-referenced to SCOP and CATH

CATH
• Classification of proteins based on domain
structures
• Each protein is chopped into individual domains,
which are assigned to homologous superfamilies.
• Hierarchical domain classification of PDB
entries.

CATH hierarchy
• Class – derived from secondary structure content; assigned
automatically
• Architecture – describes gross orientation of secondary
structures, independent of connectivity
• Topology – clusters structures according to their
topological connections and numbers of secondary
structures
• Homologous superfamily – this level groups
together protein domains which are thought to
share a common ancestor and can therefore be
described as homologous
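
The four levels above are usually written as a dotted numeric code, one number per level (C.A.T.H). The sketch below parses such a code and returns the identifier of the node at any depth. The class and method names are hypothetical, the class-name table reflects the names CATH commonly lists for classes 1 to 4 (an assumption here, not taken from the slides), and the code in the usage example is for illustration only.

```python
from dataclasses import dataclass

# Names commonly given to the main CATH classes (assumption, see lead-in).
CATH_CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


@dataclass(frozen=True)
class CathCode:
    """A CATH superfamily identifier split into its four hierarchy levels."""

    cath_class: int
    architecture: int
    topology: int
    superfamily: int

    @classmethod
    def parse(cls, code):
        """Parse a dotted 'C.A.T.H' string into its four integer levels."""
        parts = code.split(".")
        if len(parts) != 4:
            raise ValueError(f"expected four dot-separated levels, got {code!r}")
        return cls(*(int(p) for p in parts))

    def level(self, depth):
        """Return the dotted prefix for depth 1 (Class) through 4 (Homologous superfamily)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        nums = (self.cath_class, self.architecture, self.topology, self.superfamily)
        return ".".join(str(n) for n in nums[:depth])


if __name__ == "__main__":
    code = CathCode.parse("3.40.50.300")  # illustrative code
    print(CATH_CLASS_NAMES[code.cath_class], code.level(2), code.level(3))
```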

SCOP
• Description of structural and evolutionary
relationships between all the proteins with
known structures
• Uses the PDB entries
• Search using keywords or PDB identifiers

Hierarchy in SCOP
• Class
• Fold
• Superfamily
• Family
• Protein
• Species

SECONDARY DATABASES
• Contain data derived from the results of
analysing primary data
• Manually created or automatically generated
• Contain more relevant and useful information
structured to specific requirements
• Examples: PROSITE, PRINTS, BLOCKS, Pfam

Pfam
• Proteins contain conserved regions
• Based on the conserved regions, proteins are
classified into families
• Provides links to external databases like PDB,
SCOP, CATH etc.

Pfam: Features
• Sequence search
• View Pfam family
• View a clan
• View a sequence
• View a structure
• Keyword search

PROSITE
http://ca.expasy.org/prosite/

COMPOSITE DATABASES
• Collection of sequences from various primary databases
• Renders sequence searching highly efficient, as one
search covers multiple resources
• Examples: NRDB (Non-Redundant Database), OWL,
MIPSX, SWISS-PROT + TrEMBL
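
The idea of a composite, non-redundant collection can be sketched as follows: entries from several source databases are merged so that each distinct sequence is stored once, with pointers back to every source entry. This assumes exact sequence identity as the redundancy criterion, which is a simplification; real composite databases apply their own rules. The function name and the toy data are invented for illustration.

```python
def build_nonredundant(sources):
    """Merge {db_name: {entry_id: sequence}} into one record per distinct sequence.

    Returns a dict mapping each unique (upper-cased) sequence to the list of
    'db_name:entry_id' labels under which it appears in the source databases.
    """
    merged = {}
    for db_name, entries in sources.items():
        for entry_id, seq in entries.items():
            key = seq.upper()  # treat sequences as case-insensitive
            merged.setdefault(key, []).append(f"{db_name}:{entry_id}")
    return merged


if __name__ == "__main__":
    # Toy data; the IDs and sequences are made up.
    toy_sources = {
        "SWISS-PROT": {"entry1": "MKTAYIAK"},
        "PIR": {"entryA": "MKTAYIAK", "entryB": "MSLLTEVE"},
    }
    for seq, labels in build_nonredundant(toy_sources).items():
        print(seq, labels)
```

A search then runs once over the merged set instead of once per source database, which is where the efficiency gain comes from.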
