Bioinformatics and Databases in Biological Science

DATA
Data is raw, unorganized facts that need to be processed.
Example:- Each student's test score is one piece of data.
INFORMATION
• When data is processed, organized, structured
or presented in a given context so as to make
it useful, it is called information.
• Example:- the average score of a class or of the
entire school is information that can be
derived from the given data.
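
The step from data to information can be sketched in a few lines of Python. This is only an illustration: the student names and scores below are invented, and the two summaries are just examples of what could be derived.

```python
from statistics import mean

# Data: raw, individual test scores (made-up values for illustration).
scores = {"Asha": 72, "Ben": 85, "Chen": 90, "Dina": 65}

# Information: summaries obtained by processing the raw data.
class_average = mean(scores.values())
top_student = max(scores, key=scores.get)

print(f"Class average: {class_average}")
print(f"Highest score: {top_student}")
```

Each entry in `scores` means little on its own; the average and the top scorer only appear once the data are processed together.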

Database
• A database is a collection of data in an organized
manner, which is accessible in various ways.
• Biological Databases serve a critical purpose in the
collection and organization of data related to biological
systems.
• They provide computational support and a user-friendly
interface that let a researcher analyse biological data
meaningfully.

• A database is a computerized archive used to store and organize
data in such a way that information can be retrieved easily via a
variety of search criteria.
• Databases are composed of computer hardware and software for
data management.
• The chief objective of the development of a database is to organize
data in a set of structured records to enable easy retrieval of
information.
• Each record, also called an entry, should contain a number of fields
that hold the actual data items, for example, fields for names, phone
numbers, addresses, dates.
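
A minimal sketch of records, fields and retrieval by different search criteria, using Python's built-in sqlite3 module. The table name, field names and all entries are hypothetical and chosen only to mirror the example fields above.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Each record (entry) has four fields.
conn.execute(
    "CREATE TABLE contacts (name TEXT, phone TEXT, address TEXT, joined TEXT)"
)
records = [
    ("Ada Example", "555-0101", "1 First Street", "2020-01-15"),
    ("Bo Sample", "555-0102", "2 Second Street", "2021-06-30"),
    ("Cy Placeholder", "555-0103", "3 Third Street", "2021-11-02"),
]
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?, ?)", records)

# Search criterion 1: look up a record by name.
by_name = conn.execute(
    "SELECT phone FROM contacts WHERE name = ?", ("Bo Sample",)
).fetchall()

# Search criterion 2: look up records by date (ISO dates sort as text).
since_2021 = conn.execute(
    "SELECT name FROM contacts WHERE joined >= ? ORDER BY name", ("2021-01-01",)
).fetchall()

print(by_name)
print(since_2021)
```

The same stored records answer two different questions, which is the point of organizing data into structured fields.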

WHAT ARE BIOLOGICAL DATABASES?

Different classifications of
databases
• Type of data
• nucleotide sequences
• protein sequences
• protein sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways

Different classifications of databases (continued)
• Primary or derived databases
• Primary databases: experimental results submitted directly into the database
• Secondary databases: results of analysis of primary databases
• Aggregate of many databases
• Links to other data items
• Combination of data
• Consolidation of data

Different classifications of databases (continued)
• Availability
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics

TYPES OF DATABASES
• Primary Databases
• Secondary Databases

PRIMARY DATABASES
• Contain biomolecular data in its original form.
• Experimental results are submitted directly into the database by researchers,
and the data are essentially archival in nature.
• Once an entry is given a database accession number, it remains part of the
permanent archival record under that number.
• Examples :- GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT
and PIR for protein sequences, and PDB for macromolecular structures.

GenBank
• Database from NCBI; includes sequences from
publicly available resources.
http://www.ncbi.nlm.nih.gov/genbank/

NCBI and Entrez
• One of the largest and most comprehensive
databases, belonging to the NIH (National
Institutes of Health, USA)
• Entrez is the search engine of NCBI
• Search for :
genes, proteins, genomes, structures,
diseases, publications and more.
• http://www.ncbi.nlm.nih.gov/
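
Besides the web interface, NCBI exposes Entrez searches through its E-utilities web service. The sketch below only builds an ESearch URL and does not contact the server. The helper name `build_esearch_url` and the search term are assumptions made for illustration; the `db`, `term` and `retmax` parameters are standard ESearch parameters.

```python
from urllib.parse import urlencode

# Base address of the ESearch utility in NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=5):
    """Return an ESearch URL for the given Entrez database and query term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_BASE}?{urlencode(params)}"


# Hypothetical query: nucleotide records mentioning carbonic anhydrase in human.
url = build_esearch_url(
    "nucleotide", "carbonic anhydrase AND Homo sapiens[Organism]"
)
print(url)
```

Changing `database` to another Entrez database name (for example `protein`) reuses the same query against a different collection.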

GenBank
• An annotated collection of all publicly available nucleotide
sequences and their protein translations
• Set up in 1979 at LANL (Los Alamos National Laboratory).
• Maintained since 1992 by NCBI (Bethesda).

EMBL
• European Molecular Biology Laboratory
• Nucleic acid database from EBI
(European Bioinformatics Institute)
• Produced in collaboration with DDBJ and GenBank
• Search engine – SRS (Sequence Retrieval System)
http://www.ebi.ac.uk/

DDBJ
• DNA Data Bank of Japan
• Started in 1986 in collaboration with GenBank
• Produced and maintained at NIG
(National Institute of Genetics)
http://www.ddbj.nig.ac.jp/

SWISS-PROT
• Annotated protein sequence database established
in 1986
• Consists of sequence entries made up of different
line formats
• Similar format to EMBL
• http://www.ebi.ac.uk/uniprot/
• http://us.expasy.org/sprot/sprot-top.html
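
SWISS-PROT and EMBL flat files tag each line with a two-character line code (for example ID, AC, DE, SQ), and an entry ends with `//`. The sketch below groups the lines of one entry by code. The entry itself is invented for illustration (identifier, accession and sequence are not real), and the parser is a simplified assumption, not a full implementation of the format.

```python
# A made-up entry in SWISS-PROT/EMBL-like layout (not a real record).
ENTRY = """\
ID   EXAMPLE_HUMAN
AC   X00000;
DE   Hypothetical example protein.
SQ   SEQUENCE
     MKTAYIAKQR QISFVKSHFS
//
"""


def parse_entry(text):
    """Group lines by their two-character code and join the sequence lines."""
    fields, sequence = {}, []
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "//":          # end-of-entry marker
            break
        if code.strip():          # a coded line such as ID, AC, DE, SQ
            fields.setdefault(code, []).append(content)
        else:                     # sequence lines start with blanks
            sequence.append(content.replace(" ", ""))
    return fields, "".join(sequence)


fields, sequence = parse_entry(ENTRY)
print(fields["ID"], fields["AC"], sequence)
```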

PIR
• Protein Information Resource
• A division of the National Biomedical Research
Foundation (NBRF) in the U.S.
• One can search for entries or do a sequence
similarity search at the PIR site.
http://pir.georgetown.edu/

TrEMBL
• Translated EMBL (European Molecular Biology Laboratory)
• Computer-annotated supplement to SWISS-PROT.
• Contains all the translations of EMBL nucleotide
sequence entries not yet integrated into SWISS-PROT.
http://www.ebi.ac.uk/trembl/

Protein Data Bank (PDB)
• Important in solving real problems in
molecular biology
• Established in 1971 at Brookhaven National
Laboratory (BNL)
• Sole international repository of macromolecular
structure data
• Now managed by the Research Collaboratory
for Structural Bioinformatics (RCSB)
http://www.rcsb.org/

PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
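
Each line of the entry above starts with a record type (HEADER, AUTHOR, REMARK and so on), so a program can group lines by that keyword. The sketch below does this for a few lines copied from the example. It splits on the first space, which suits the flattened text shown here; real PDB files use fixed column positions, so this is an assumption made for illustration rather than a proper PDB parser.

```python
from collections import defaultdict

# A few records copied from the example entry (12CA).
PDB_TEXT = """\
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
"""


def group_records(text):
    """Map each record type to the list of its line contents."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        record_type, _, rest = line.partition(" ")
        records[record_type].append(rest.strip())
    return dict(records)


records = group_records(PDB_TEXT)
for record_type, lines in records.items():
    print(record_type, len(lines))
```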

COMPOSITE DATABASES
• Collections of sequences drawn from various primary databases
• Make sequence searching highly efficient, because a single
search covers multiple resources
• Examples :- NRDB (Non-Redundant Database), OWL,
MIPSX, SWISS-PROT + TrEMBL
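
The idea of a non-redundant composite can be sketched as merging several sources and keeping each distinct sequence once, while remembering where it came from. The entry identifiers and sequences below are invented, and exact sequence identity is only one possible redundancy criterion; real composite databases apply their own rules.

```python
def merge_non_redundant(*sources):
    """Merge (source_name, {entry_id: sequence}) pairs, one key per sequence."""
    merged = {}
    for source_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            # Identical sequences share one key; all origins are recorded.
            merged.setdefault(key, []).append(f"{source_name}:{entry_id}")
    return merged


# Hypothetical entries from two primary databases.
swissprot = {"P1": "MKTAYIAK", "P2": "MSDNELQ"}
pir = {"A1": "MKTAYIAK", "A2": "MGGHWT"}

nr = merge_non_redundant(("SWISS-PROT", swissprot), ("PIR", pir))
for sequence, origins in nr.items():
    print(sequence, origins)
```

Four input entries collapse to three sequences, so one search over `nr` covers both sources without scanning the duplicate twice.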

SECONDARY DATABASES
• Contain data derived from the results of analysing
primary data
• Manually created or automatically generated
• Contain more relevant and useful information,
structured to specific requirements
• Examples :- PROSITE, PRINTS, BLOCKS, Pfam

PROSITE
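• Database of protein families
• Can be searched using patterns written like
regular expressions
• The pattern syntax is similar to that used in Unix commands
• Members of a family exhibit these patterns,
• so a pattern can be used to search over families
http://ca.expasy.org/prosite/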
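
Because PROSITE patterns behave like regular expressions, they can be translated into one. The sketch below handles the core syntax (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for the sequence ends). The function name and the test sequence are assumptions for illustration; the pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        if repeat:
            core += "{" + repeat + "}"
        regex.append(prefix + core + suffix)
    return "".join(regex)


glyco = prosite_to_regex("N-{P}-[ST]-{P}")
sequence = "MKNGSAANPTL"  # made-up sequence
hits = [(m.start(), m.group()) for m in re.finditer(glyco, sequence)]
print(glyco, hits)
```

The second asparagine in the made-up sequence is followed by proline, which the pattern excludes, so only the first site is reported.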

BLOCKS
• Motifs/blocks are created by automatically detecting
the most conserved regions of each protein family.
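
A toy illustration of detecting conserved regions: given already-aligned sequences of equal length, report ungapped runs of columns in which one residue dominates. This is not the procedure used to build BLOCKS; the function, its thresholds and the aligned sequences are all assumptions made to show the general idea.

```python
from collections import Counter


def conserved_blocks(alignment, min_identity=1.0, min_length=3):
    """Return (start, end) column ranges whose columns are all conserved."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    blocks, start = [], None
    for col in range(length):
        residues = [seq[col] for seq in alignment]
        residue, count = Counter(residues).most_common(1)[0]
        conserved = residue != "-" and count / len(alignment) >= min_identity
        if conserved and start is None:
            start = col                      # a new run begins
        elif not conserved and start is not None:
            if col - start >= min_length:    # keep runs that are long enough
                blocks.append((start, col))
            start = None
    if start is not None and length - start >= min_length:
        blocks.append((start, length))
    return blocks


# Made-up aligned sequences from one hypothetical protein family.
alignment = [
    "MKACDEFGLLHWKR",
    "MRACDEFGAVHWKQ",
    "MTACDEFGSIHWKN",
]
blocks = conserved_blocks(alignment)
print([alignment[0][start:end] for start, end in blocks])
```

Lowering `min_identity` below 1.0 would tolerate some variation within a column, which is closer to how conserved regions look in real families.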

PRIMARY VS SECONDARY DATABASES
