Primary and secondary databases ppt by puneet kulyana
This document provides an introduction to databases used for biological data. It defines key terms like data, information, and databases. It describes different types of biological databases including primary databases that contain original experimental data, and secondary databases that contain derived or analyzed data. Examples of primary databases include GenBank, EMBL, and PDB, while secondary databases include PROSITE, PRINTS, and Pfam that contain conserved protein motifs and families. The document also compares primary and secondary databases.
DATA & INFORMATION
DATA
Data is raw, unorganized facts that need to be processed.
Example: Each student's test score is one piece of data.
INFORMATION
When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
Example: The average score of a class or of the entire school is information that can be derived from the given data.
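The test-score example can be sketched in a few lines of Python. The scores below are hypothetical values chosen only for illustration; the point is that each score is data, and the average computed from them is information.

```python
from statistics import mean

# Hypothetical test scores; each value is one piece of data.
scores = [72, 85, 90, 64, 79]

# Processing the data (here, averaging) yields information about the class.
class_average = mean(scores)
print(f"Class average: {class_average}")
```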
COMPARISON BETWEEN DATA & INFORMATION
Definition (Oxford Dictionaries)
  Data: Facts and statistics collected together for reference or analysis.
  Information: Facts provided or learned about something or someone; data as processed, stored, or transmitted by a computer.
Refers to
  Data: Raw data.
  Information: Analyzed data.
Description
  Data: Qualitative or quantitative variables that can be used to make ideas or conclusions.
  Information: A group of data which carries news and meaning.
In the form of
  Data: Numbers, letters, or a set of characters.
  Information: Ideas and inferences.
Collected via
  Data: Measurements, experiments, etc.
  Information: Linking data and making inferences.
Represented in
  Data: A structure, such as tabular data, a data tree, a data graph, etc.
  Information: Language, ideas, and thoughts based on the data.
Interrelation
  Data: Information that is collected.
  Information: Data that has been processed.
TYPES OF DATA & INFORMATION
Databases are categorized based on the data type. A few examples are listed below:
1. Type of data: Sequence of biomolecules viz., DNA, RNA, proteins
   Example(s): GenBank, EMBL, DDBJ, Swiss-Prot, PIR
   Weblinks:
   (i) www.ncbi.nlm.nih.gov/genbank/
   (ii) https://www.ebi.ac.uk/embl/
   (iii) www.ddbj.nig.ac.jp/
   (iv) http://web.expasy.org/docs/swiss-prot_guideline.html
   (v) http://pir.georgetown.edu/
2. Type of data: Bio-molecular structures
   Example(s): PDB
   Weblinks: http://www.rcsb.org/pdb/home/home.do
3. Type of data: Bibliography/scientific literature
   Example(s): PubMed, Scopus (search engines)
   Weblinks:
   (i) www.ncbi.nlm.nih.gov/pubmed
   (ii) www.scopus.com
4. Type of data: Patent databases
   Example(s): USPTO
   Weblinks: www.uspto.gov/
5. Type of data: Metabolic pathways / molecular interactions
   Example(s): KEGG
   Weblinks: http://www.genome.jp/kegg/pathway.htm
DATABASE???
A database is a collection of data in an organized manner, which is accessible in various ways.
Biological databases serve a critical purpose in the collation and organization of data related to biological systems.
They provide computational support and a user-friendly interface that let a researcher analyze biological data meaningfully.
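The idea of organized data that is "accessible in various ways" can be illustrated with a small in-memory SQLite table. This is only a sketch: the table name `resources` and its columns are assumptions made for this example, and the rows reuse database names mentioned in this presentation rather than any real schema.

```python
import sqlite3

# Hypothetical schema: one row per biological database and the data it holds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (name TEXT PRIMARY KEY, data_type TEXT)")
conn.executemany(
    "INSERT INTO resources VALUES (?, ?)",
    [
        ("GenBank", "nucleotide sequence"),
        ("EMBL", "nucleotide sequence"),
        ("DDBJ", "nucleotide sequence"),
        ("Swiss-Prot", "protein sequence"),
        ("PIR", "protein sequence"),
        ("PDB", "structure"),
    ],
)

# Access path 1: look up a single record by its key.
pdb_type = conn.execute(
    "SELECT data_type FROM resources WHERE name = ?", ("PDB",)
).fetchone()[0]

# Access path 2: list every record that holds a given type of data.
protein_dbs = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM resources WHERE data_type = ? ORDER BY name",
        ("protein sequence",),
    )
]

print(pdb_type)
print(protein_dbs)
```

The same organized collection answers two different questions without being restructured, which is the practical difference between a database and a loose pile of records.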
PRIMARY DATABASES
Contain bio-molecular data in its original form.
Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
Once given a database accession number, the data in primary databases are never changed.
Examples: GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT and PIR for protein sequences, and PDB for molecular structures.
GenBank
Database from NCBI; includes sequences from publicly available resources.
http://www.ncbi.nlm.nih.gov/genbank/
EMBL
European Molecular Biology Laboratory
Nucleic acid database from EBI (European Bioinformatics Institute)
Produced in collaboration with DDBJ and GenBank
Search engine – SRS (Sequence Retrieval System)
http://www.ebi.ac.uk/
DDBJ
DNA Data Bank of Japan
Started in 1986 in collaboration with GenBank
Produced and maintained at NIG (National Institute of Genetics)
http://www.ddbj.nig.ac.jp/
SWISS PROT
Annotated sequence database established in 1986
Consists of sequence entries of different line formats
Similar format to EMBL
http://us.expasy.org/sprot/sprot-top.html
http://www.ebi.ac.uk/uniprot/
PIR
Protein Information Resource
A division of the National Biomedical Research Foundation (NBRF) in the U.S.
One can search for entries or do a sequence similarity search at the PIR site.
http://pir.georgetown.edu/
TrEMBL
Translated European Molecular Biology Laboratory
Computer-annotated supplement of SWISS PROT.
Contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS PROT.
http://www.ebi.ac.uk/trembl/
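To make "translation of a nucleotide sequence" concrete, here is a minimal sketch that translates a DNA coding sequence into protein using the standard genetic code. It is not TrEMBL's actual pipeline, which works from annotated coding regions; the function name `translate` and the convention of stopping at the first stop codon are assumptions for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, reading frame 1, up to the first stop."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Unknown codons (e.g. containing N) are reported as X.
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGGCCTAA"))
```

Here ATG encodes Met (M), GCC encodes Ala (A), and TAA is a stop codon, so the translated protein is the two-residue sequence MA.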
SECONDARY DATABASES
Contain data derived from the results of analysing primary data
Manually created or automatically generated
Contain more relevant and useful information structured to specific requirements
Examples: PROSITE, PRINTS, BLOCKS, Pfam
PROSITE
Families of proteins
Can search using regular expressions
Similar to Unix commands using wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
Families exhibit these patterns
So we can search over families
http://ca.expasy.org/prosite/
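The pattern above maps closely onto an ordinary regular expression. The sketch below converts a PROSITE-style pattern into Python `re` syntax; the helper name `prosite_to_regex` is hypothetical, and it handles only the elements shown on this slide (residue sets, `x`, exclusions in braces, and repeat counts), not the full PROSITE syntax. The test sequence is invented for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue part from an optional repeat such as (4) or (2,3).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            part = "."                       # x: any residue
        elif core.startswith("{") and core.endswith("}"):
            part = "[^" + core[1:-1] + "]"   # {ED}: any residue except E or D
        else:
            part = core                      # single residue or [AC]-style set
        if repeat:
            part += "{" + repeat + "}"       # x(4) becomes .{4}
        parts.append(part)
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)

# Hypothetical protein sequence used only to demonstrate the search.
hit = re.search(regex, "MKAGVKLMNQT")
print(hit.group() if hit else "no match")
```

For the slide's pattern the converted expression is `[AC].V.{4}[^ED]`, and in the invented sequence it matches the stretch AGVKLMNQ: A, any residue, V, four arbitrary residues, then a residue that is neither E nor D.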
PRINTS
Most protein families are characterized not by one, but by several conserved motifs
Fingerprints are groups of conserved motifs excised from sequence alignments
Taken together, they provide diagnostic family signatures. They are the basis of the PRINTS database, and are stored in the form of aligned motifs.
Input about protein families is done manually
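The fingerprint idea, several motifs taken together rather than one, can be sketched as follows. This is a deliberate simplification: here a motif is an exact substring and the motifs must appear in order, whereas PRINTS stores aligned motifs and scores matches against them. The function name, the motifs, and the sequences are all hypothetical.

```python
def count_motifs_in_order(sequence: str, motifs: list[str]) -> int:
    """Count how many motifs occur in the sequence, left to right, without overlap."""
    position = 0
    found = 0
    for motif in motifs:
        index = sequence.find(motif, position)
        if index == -1:
            continue  # this motif is missing; keep looking for the later ones
        found += 1
        position = index + len(motif)
    return found


# Hypothetical three-motif fingerprint and two invented sequences.
fingerprint = ["GKT", "DEAD", "SAT"]
full_match = count_motifs_in_order("MAGKTLLLDEADRRRSATQ", fingerprint)
partial_match = count_motifs_in_order("MAGKTLLLSATQ", fingerprint)

print(full_match, "of", len(fingerprint))
print(partial_match, "of", len(fingerprint))
```

A sequence that matches every motif is a stronger candidate for family membership than one matching only some of them, which is why a fingerprint is more diagnostic than a single pattern.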
Pfam
Maintained by the Sanger Centre (Cambridge)
Protein families aligned using HMMs (Hidden Markov Models)
Given a new sequence, find families which the sequence might fit into
Sequence coverage
11912 families
Split into Pfam-A (high quality) and Pfam-B (low quality)
http://pfam.sanger.ac.uk/