Bioinformatics and Databases in Biological Science
Introduction to Databases
INTRODUCTION
DATA
Data is raw, unorganized facts that need to be processed.
Example:- Each student's test score is one piece of data.
INFORMATION
• When data is processed, organized, structured
or presented in a given context so as to make
it useful, it is called information.
• Example:- the average score of a class or of the
entire school is information that can be
derived from the given data.
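The step from data to information can be sketched in a few lines of Python. The class names and scores below are made up for illustration; only the idea (raw scores in, averages out) comes from the slide.

```python
from statistics import mean

# Hypothetical raw data: one test score per student, grouped by class.
scores = {"class_a": [70, 80, 90], "class_b": [60, 80]}

def class_averages(scores_by_class):
    """Turn raw scores (data) into one average per class (information)."""
    return {name: mean(values) for name, values in scores_by_class.items()}

def school_average(scores_by_class):
    """Average over every student in every class."""
    all_scores = [s for values in scores_by_class.values() for s in values]
    return mean(all_scores)

print(class_averages(scores))
print(school_average(scores))
```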
Database
• A database is a collection of data in an organized
manner, which is accessible in various ways.
• Biological Databases serve a critical purpose in the
collection and organization of data related to biological
systems.
• They provide computational support and a user-friendly
interface that let a researcher analyse biological data
in a meaningful way.
• A database is a computerized archive used to store and organize
data in such a way that information can be retrieved easily via a
variety of search criteria.
• Databases are composed of computer hardware and software for
data management.
• The chief objective of the development of a database is to organize
data in a set of structured records to enable easy retrieval of
information.
• Each record, also called an entry, should contain a number of fields
that hold the actual data items, for example, fields for names, phone
numbers, addresses, dates.
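A minimal sketch of records, fields and retrieval by search criteria, assuming an in-memory list stands in for the database. The `Record` class, the `search` helper and the two sample entries are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """One entry; each attribute is a field holding a data item."""
    name: str
    phone: str
    address: str
    date: str

def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]

# Hypothetical entries, for illustration only.
db = [
    Record("A. Example", "555-0100", "1 Main St", "2001-01-01"),
    Record("B. Example", "555-0101", "1 Main St", "2002-02-02"),
]

print(search(db, address="1 Main St"))
print(search(db, name="B. Example", address="1 Main St"))
```

Because every record has the same fields, the same `search` call works for any combination of criteria.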
WHAT ARE BIOLOGICAL
DATABASES?
Different classifications of
databases
• Type of data
• nucleotide sequences
• protein sequences
• protein sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways
Different classifications of
databases….
• Primary or derived databases
• Primary databases: experimental results submitted directly into the database
• Secondary databases: results of analysis of primary databases
• Aggregate of many databases
• Links to other data items
• Combination of data
• Consolidation of data
Different classifications of
databases….
• Availability
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics
TYPES OF DATABASES
• Primary Databases
• Secondary Databases
PRIMARY DATABASES
• Contain bio-molecular data in its original form.
• Experimental results are submitted directly into the database by researchers,
and the data are essentially archival in nature.
• Once an entry is given a database accession number, that identifier is
permanent, and the archived data are generally revised only by the original submitters.
• Examples :- GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT
and PIR for protein sequences, and PDB for molecular structures.
GenBank
• Database from NCBI that includes sequences from
publicly available resources.
http://www.ncbi.nlm.nih.gov/genbank/
NCBI and Entrez
• One of the largest and most comprehensive
database collections, belonging to the NIH
(National Institutes of Health, USA)
• Entrez is the search engine of NCBI
• Search for :
genes, proteins, genomes, structures,
diseases, publications and more.
• http://www.ncbi.nlm.nih.gov/
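An Entrez search can also be expressed as a URL. The sketch below only builds the address and makes no network request. It assumes NCBI's E-utilities `esearch` endpoint, which is not named in the slides, and the query string with its `[Gene Name]` and `[Organism]` field tags is an illustrative assumption.

```python
from urllib.parse import urlencode

# Base address of NCBI's E-utilities search service (assumption: not in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(db, term, retmax=5):
    """Build an Entrez search URL; nothing is sent over the network."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

# Illustrative query: database name and search term are example values.
url = build_esearch_url("nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]")
print(url)
```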
GenBank
• An annotated collection of all publicly available nucleotide
sequences and their protein translations
• Set up in 1979 at the LANL (Los Alamos).
• Maintained since 1992 by NCBI (Bethesda).
GenBank file format
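A rough sketch of reading a GenBank flat file, assuming the usual layout: keyword lines such as LOCUS, DEFINITION and ACCESSION, then the sequence after ORIGIN, ending with `//`. The record below is invented for illustration and is not a real GenBank entry. The parser ignores continuation lines and the FEATURES table.

```python
# Invented toy record, for illustration only (not a real GenBank entry).
TOY_RECORD = """\
LOCUS       TOY0001 24 bp DNA linear SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY0001
VERSION     TOY0001.1
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

def parse_genbank(text):
    """Collect top-level keyword fields and the sequence of one record."""
    fields, seq_parts, in_origin = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_origin:                      # sequence lines: drop numbers and spaces
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line and not line.startswith(" "):
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(seq_parts)
    return fields

record = parse_genbank(TOY_RECORD)
print(record["ACCESSION"], len(record["sequence"]))
```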
EMBL
• European Molecular Biology Laboratory
• Nucleic acid database from EBI
(European Bioinformatics Institute)
• Produced in collaboration with DDBJ and GenBank
• Search engine: SRS (Sequence Retrieval System)
http://www.ebi.ac.uk/
DDBJ
• DNA Data Bank of Japan
• Started in 1986 in collaboration with GenBank
• Produced and maintained at NIG
(National Institute of Genetics)
http://www.ddbj.nig.ac.jp/
SWISS PROT
• Annotated protein sequence database established
in 1986
• Consists of sequence entries made up of different
line formats
• Similar format to EMBL
• http://www.ebi.ac.uk/uniprot/
• http://us.expasy.org/sprot/sprot-top.html
PIR
• Protein Information Resource
• A division of the National Biomedical Research
Foundation (NBRF) in the U.S.
• One can search for entries or do a sequence
similarity search at the PIR site.
http://pir.georgetown.edu/
TrEMBL
• Translated European Molecular Biology Laboratory
• Computer-annotated supplement of SWISS PROT.
• Contains all the translations of EMBL nucleotide
sequence entries not yet integrated in SWISS PROT.
http://www.ebi.ac.uk/trembl/
Protein Data Bank (PDB)
• Important in solving real problems in
molecular biology
• Established in 1971 at Brookhaven National
Laboratory (BNL)
• Sole international repository of macromolecular
structure data
• Moved to the Research Collaboratory
for Structural Bioinformatics (RCSB)
http://www.rcsb.org/
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
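The excerpt can be read record by record, since each line starts with a record name (HEADER, COMPND, REMARK and so on). The sketch below groups a few of those lines by record name and pulls out the R value. It splits on whitespace because the columns in the excerpt have been collapsed; real PDB files are fixed-column, so this is an illustration rather than a general parser. The helper names are hypothetical.

```python
from collections import defaultdict

# Lines copied from the 12CA excerpt above (whitespace already collapsed).
EXCERPT = """\
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
"""

def group_records(text):
    """Group lines by record name, dropping the trailing 'ID serial' pair."""
    records = defaultdict(list)
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 3:
            continue
        name, body = tokens[0], tokens[1:-2]
        records[name].append(" ".join(body))
    return dict(records)

def r_value(records):
    """Return the R value from the REMARK 3 lines, or None if absent."""
    for remark in records.get("REMARK", []):
        if remark.startswith("3 R VALUE"):
            return float(remark.split()[-1])
    return None

records = group_records(EXCERPT)
print(records["AUTHOR"])
print(r_value(records))
```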
COMPOSITE DATABASES
• Collection of various primary database sequences
• Renders sequence searching highly efficient as it
searches multiple resources
• Examples :- NRDB (Non Redundant Database), OWL,
MIPSX, SWISS PROT + TrEMBL
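One way to picture a composite, non-redundant collection is to merge entries from several sources and keep each distinct sequence once. This is a simplified sketch that treats only exactly identical sequences as redundant, and it is not a description of how NRDB or OWL are actually built. The source labels, accessions and sequences are hypothetical.

```python
def merge_non_redundant(*sources):
    """Map each distinct sequence to the list of entries that carry it."""
    merged = {}
    for source in sources:
        for database, accession, sequence in source:
            merged.setdefault(sequence.upper(), []).append(f"{database}:{accession}")
    return merged

# Hypothetical entries: (database, accession, sequence).
source_a = [("DB_A", "A001", "MKTAYIAK"), ("DB_A", "A002", "MGSSHHHH")]
source_b = [("DB_B", "B001", "mktayiak"), ("DB_B", "B002", "MLLAVLYC")]

composite = merge_non_redundant(source_a, source_b)
for sequence, entries in composite.items():
    print(sequence, entries)
```

A single search over `composite` then covers both sources, and each sequence is compared only once.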
SECONDARY DATABASES
• Contain data derived from the results of analysing
primary data
• Manually created or automatically generated
• Contain more relevant and useful information
structured to specific requirements
• Examples :- PROSITE, PRINTS, BLOCKS, Pfam
PROSITE
• Database of protein families
• Can be searched using patterns written like regular
expressions, similar to those used by Unix commands
• Members of a family exhibit these patterns,
so we can search across families
http://ca.expasy.org/prosite/
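The link between PROSITE patterns and regular expressions can be sketched as a small translation. It assumes the common pattern syntax: `-` separates positions, `x` is any residue, `[..]` lists allowed residues, `{..}` lists forbidden ones, `(n)` or `(n,m)` gives repeats, and `<` / `>` anchor the ends. Rarer cases are not handled. The function names and the test sequence are hypothetical; the pattern shown is the familiar N-glycosylation site form.

```python
import re

# Character-by-character translation from PROSITE syntax to regex syntax.
_TRANSLATION = {
    "-": "",     # position separator
    "x": ".",    # any residue
    "{": "[^",   # forbidden residues
    "}": "]",
    "(": "{",    # repeat counts
    ")": "}",
    "<": "^",    # N-terminal anchor
    ">": "$",    # C-terminal anchor
}

def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    return "".join(_TRANSLATION.get(ch, ch) for ch in pattern.rstrip("."))

def find_matches(pattern, sequence):
    """Return (start index, matched text) for each non-overlapping match."""
    regex = prosite_to_regex(pattern)
    return [(m.start(), m.group()) for m in re.finditer(regex, sequence)]

# Made-up sequence, for illustration only.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_matches("N-{P}-[ST]-{P}.", "MKNGSAANPTL"))
```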
BLOCKS
• Motifs/blocks are created by automatically
detecting the most conserved regions
of each protein family.
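The idea of detecting conserved regions can be illustrated on a toy alignment. The sketch below marks columns where every sequence has the same residue and reports runs of at least `min_len` such columns. This is a deliberately simple stand-in, not the procedure the BLOCKS database actually uses, and the aligned sequences are invented.

```python
def conserved_blocks(alignment, min_len=3):
    """Return (start, end) column ranges, end exclusive, that are fully conserved."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # A column is conserved if it holds one distinct residue and no gap.
    conserved = [
        len({seq[i] for seq in alignment}) == 1 and alignment[0][i] != "-"
        for i in range(length)
    ]
    blocks, start = [], None
    for i, ok in enumerate(conserved + [False]):  # sentinel closes a final run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                blocks.append((start, i))
            start = None
    return blocks

# Invented aligned sequences, for illustration only.
family = ["MKTAYIAKQR", "MRTAYIGKQR", "MKTAYLAKQR"]
for start, end in conserved_blocks(family):
    print(start, end, family[0][start:end])
```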
PRIMARY VS SECONDARY DATABASES
