Primary and secondary databases ppt by puneet kulyana
INTRODUCTION TO DATABASES
By:-
• PUNEET
• NEERAJ
• KARTIK
• VARUN
INDEX/CONTENTS
• Introduction
• Data & Information
• Database
• Biological Databases
• Types of Databases
  - Primary Databases
  - Secondary Databases
  - Composite Databases
• References
INTRODUCTION
DATA & INFORMATION
DATA
Data is raw, unorganized facts that need to
be processed.
Example:- Each student's test score is one
piece of data.
INFORMATION
When data is processed, organized,
structured or presented in a given context
so as to make it useful, it is called
information.
Example:- The average score of a class
or of the entire school is information that
can be derived from the given data.
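The student-score example can be sketched in a few lines of Python. The names and scores below are invented for illustration: each individual score is a piece of data, and the class average computed from them is information.

```python
from statistics import mean

# Hypothetical test scores (raw data); the values are invented for illustration.
scores = {"student_a": 72, "student_b": 85, "student_c": 90, "student_d": 65}

# Processing the data (here, averaging) yields information about the class.
class_average = mean(list(scores.values()))

print(f"Class average: {class_average}")
```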
COMPARISON BETWEEN DATA & INFORMATION
Aspect | DATA | INFORMATION
Definition (Oxford Dictionaries) | Facts and statistics collected together for reference or analysis | Facts provided or learned about something or someone; data as processed, stored, or transmitted by a computer
Refers to | Raw data | Analyzed data
Description | Qualitative or quantitative variables that can be used to make ideas or conclusions | A group of data which carries news and meaning
In the form of | Numbers, letters, or a set of characters | Ideas and inferences
Collected via | Measurements, experiments, etc. | Linking data and making inferences
Represented in | A structure, such as tabular data, a data tree, a data graph, etc. | Language, ideas, and thoughts based on the data
Interrelation | Information that is collected | Data that has been processed
TYPES OF DATA & INFORMATION
Databases are categorized based on the data type. A few examples are listed below:-
S. No. | Type of data | Example(s) | Weblinks
1. | Sequences of biomolecules, viz. DNA, RNA, proteins | GenBank, EMBL, DDBJ, Swiss-Prot, PIR | (i) www.ncbi.nlm.nih.gov/genbank/ (ii) https://www.ebi.ac.uk/embl/ (iii) www.ddbj.nig.ac.jp/ (iv) http://web.expasy.org/docs/swiss-prot_guideline.html (v) http://pir.georgetown.edu/
2. | Bio-molecular structures | PDB | http://www.rcsb.org/pdb/home/home.do
3. | Bibliography/scientific literature | PubMed, Scopus (search engine) | (i) www.ncbi.nlm.nih.gov/pubmed (ii) www.scopus.com
4. | Patent databases | USPTO | www.uspto.gov/
5. | Metabolic pathways / molecular interactions | KEGG | http://www.genome.jp/kegg/pathway.htm
DATABASE???
A database is a
collection of data
in an organized
manner, which is
accessible in
various ways.
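As a minimal sketch of "organized" and "accessible in various ways", the Python snippet below stores a few records in an in-memory SQLite table and retrieves them by key, by a property, and as a summary. The table name, accession strings, and values are hypothetical and are not taken from any real biological database.

```python
import sqlite3

# In-memory database with a hypothetical table of sequence entries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("X0001", "Organism A", 1200),  # invented records
        ("X0002", "Organism B", 450),
        ("X0003", "Organism A", 3100),
    ],
)

# Access 1: look up a single record by its key.
by_accession = conn.execute(
    "SELECT organism, length FROM entries WHERE accession = ?", ("X0002",)
).fetchone()

# Access 2: filter records by a property.
long_entries = [
    row[0]
    for row in conn.execute(
        "SELECT accession FROM entries WHERE length > ? ORDER BY accession", (1000,)
    )
]

# Access 3: summarize the whole collection.
per_organism = dict(
    conn.execute("SELECT organism, COUNT(*) FROM entries GROUP BY organism")
)

print(by_accession, long_entries, per_organism)
```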
WHAT ARE BIOLOGICAL DATABASES?
Biological databases serve a critical purpose in the collation and organization of data related to biological systems.
They provide computational support and a user-friendly interface that let a researcher carry out a meaningful analysis of biological data.
TYPES OF DATABASES
• Primary Databases
• Secondary Databases
PRIMARY DATABASES
• Contain bio-molecular data in its original form.
• Experimental results are submitted directly into the database by researchers, and the data are essentially archival in nature.
• Once given a database accession number, the data in primary databases become part of the scientific record: the accession number is permanent, and any later revision is normally made by the submitter and recorded as a new version of the entry.
• Examples :- GenBank, EMBL and DDBJ for DNA/RNA sequences, SWISS-PROT and PIR for protein sequences and PDB for molecular structures.
GenBank
Database from NCBI, includes sequences from publicly
available resources.
http://www.ncbi.nlm.nih.gov/genbank/
EMBL
• European Molecular Biology Laboratory
• Nucleic acid database from EBI (European Bioinformatics Institute)
• Produced in collaboration with DDBJ and GenBank
• Search engine – SRS (Sequence Retrieval System)
http://www.ebi.ac.uk/
DDBJ
• DNA Data Bank of Japan
• Started in 1986 in collaboration with GenBank
• Produced and maintained at NIG (National Institute of Genetics)
http://www.ddbj.nig.ac.jp/
SWISS PROT
• Annotated sequence database established in 1986
• Sequence entries are composed of different line types, each with its own format
• Similar format to EMBL
http://us.expasy.org/sprot/sprot-top.html
http://www.ebi.ac.uk/uniprot/
PIR
• Protein Information Resource
• A division of the National Biomedical Research Foundation (NBRF) in the U.S.
• One can search for entries or run a sequence similarity search at the PIR site.
http://pir.georgetown.edu/
TrEMBL
• Translated European Molecular Biology Laboratory
• Computer-annotated supplement of SWISS PROT.
• Contains all the translations of EMBL nucleotide sequence entries not yet integrated in SWISS PROT.
http://www.ebi.ac.uk/trembl/
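To make "translations of EMBL nucleotide sequence entries" concrete, here is a minimal Python sketch that translates a coding sequence into protein using the standard genetic code. It only illustrates the idea of translation, not the actual TrEMBL production pipeline. The function name and the input sequence are invented for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 1, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Invented example sequence: ATG GCC AAG TAA
print(translate("ATGGCCAAGTAA"))
```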
COMPOSITE DATABASES
• Collections of sequences drawn from various primary databases
• Make sequence searching more efficient, since a single search covers multiple resources
• Examples :- NRDB (Non Redundant Database), OWL, MIPSX, SWISS PROT + TrEMBL
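The idea of a non-redundant composite can be sketched as merging records from several sources and keeping one entry per distinct sequence. The Python below is a simplification that treats only exact sequence identity as redundancy. Real composite databases such as NRDB or OWL apply their own criteria. The source names, identifiers, and sequences are hypothetical.

```python
def merge_non_redundant(*sources):
    """Merge (database, identifier, sequence) records from several sources.

    Returns a dict mapping each distinct sequence to the list of
    "database:identifier" labels under which it was found.
    """
    merged = {}
    for records in sources:
        for database, identifier, sequence in records:
            key = sequence.upper()  # treat case differences as the same sequence
            merged.setdefault(key, []).append(f"{database}:{identifier}")
    return merged


# Hypothetical records from two primary sources.
source_a = [("DB_A", "A1", "MKTAYIAK"), ("DB_A", "A2", "MSLLTEVE")]
source_b = [("DB_B", "B7", "mktayiak"), ("DB_B", "B9", "MGDVEKGK")]

nr = merge_non_redundant(source_a, source_b)
for sequence, labels in nr.items():
    print(sequence, labels)
```

A single search over `nr` then covers both sources while comparing each distinct sequence only once.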
SECONDARY DATABASES
• Contain data derived from the results of analysing primary data
• Manually created or automatically generated
• Contain more relevant and useful information structured to specific requirements
• Examples :- PROSITE, PRINTS, BLOCKS, Pfam
SECONDARY DATABASES
Secondary database | Primary source | Information stored
PROSITE | SWISS PROT | Regular expressions
BLOCKS | PROSITE/PRINTS | Aligned motifs (blocks)
PRINTS | OWL (composite DB) | Aligned motifs
Pfam | SWISS PROT | Hidden Markov Models
Profile | SWISS PROT | Weighted matrices (profiles)
PROSITE
Families of proteins
Can search using regular expressions
Similar to unix commands using wildcards, etc.
E.g., [AC]-x-V-x(4)-{ED}
Interpreted as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
Families exhibit these patterns
So we can search over families
http://ca.expasy.org/prosite/
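The pattern above can be turned into an ordinary regular expression. The Python sketch below handles only the elements used in the slide's example (single residues, `x`, `[...]`, `{...}`, and repeat counts such as `x(4)`). It does not cover the full PROSITE syntax, such as the `<` and `>` anchors. The function name and the test sequence are invented for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat such as "(4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."  # x means any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {ED} means any residue except E or D
        # "[AC]" and single letters such as "V" are already valid regex pieces.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)

# Invented sequence containing one occurrence of the pattern.
hit = re.search(regex, "MGAKVLLLLKQ")
print(hit.group() if hit else None)
```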
BLOCKS
• Motifs/blocks are created by automatically detecting the most conserved regions of each protein family.
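Once a block (an ungapped aligned motif) is available, its conservation can be summarized column by column. The Python sketch below computes per-column residue frequencies and a consensus for a small block. It does not reproduce how BLOCKS detects the conserved regions in the first place. The aligned segments are invented for illustration.

```python
from collections import Counter


def column_frequencies(block):
    """Return, for each column of an ungapped block, the relative residue frequencies."""
    width = len(block[0])
    if any(len(row) != width for row in block):
        raise ValueError("all rows of a block must have the same length")
    frequencies = []
    for column in zip(*block):  # iterate over alignment columns
        counts = Counter(column)
        frequencies.append({res: n / len(block) for res, n in counts.items()})
    return frequencies


def consensus(block):
    """Return the most frequent residue of each column as a string."""
    return "".join(max(col, key=col.get) for col in column_frequencies(block))


# Invented block: four aligned segments of equal length.
block = ["GDSGG", "GDSGA", "GDAGG", "GESGG"]

print(consensus(block))
print(column_frequencies(block)[1])
```

Frequency tables of this kind are also the starting point for the weighted matrices (profiles) listed in the secondary database table.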
PRINTS
• Most protein families are characterized not by one, but by several conserved motifs
• Fingerprints are groups of conserved motifs excised from sequence alignments
• Taken together, they provide diagnostic family signatures. They are the basis of the PRINTS database, and are stored in the form of aligned motifs.
• Information about protein families is entered manually
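The idea that several motifs taken together form a signature can be sketched as follows. This is a deliberate simplification: each motif is written as a regular expression and must occur in order along the sequence, whereas PRINTS itself stores aligned motifs and scores sequences against them. The function name, motifs, and sequences are invented for illustration.

```python
import re


def match_fingerprint(sequence, motifs):
    """Return the start positions of all motifs found in order, or None if one is missing."""
    positions = []
    start = 0
    for motif in motifs:
        hit = re.compile(motif).search(sequence, start)
        if hit is None:
            return None  # the fingerprint requires every motif
        positions.append(hit.start())
        start = hit.end()  # the next motif must lie downstream of this one
    return positions


# Invented fingerprint of three motifs and two invented sequences.
fingerprint = ["GDSG", "C.{2}C", "WKL"]

print(match_fingerprint("AAGDSGTTCAACPPWKLQ", fingerprint))
print(match_fingerprint("AAGDSGTTCAACPP", fingerprint))
```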
Pfam
Maintained by the Sanger Centre (Cambridge)
Protein families aligned using HMMs (Hidden Markov Models)
Given a new sequence, find families which the sequence might fit into
Sequence Coverage
11912 families
Split into Pfam-A (high quality) and Pfam-B (low quality)
http://pfam.sanger.ac.uk/
PRIMARY VS SECONDARY DATABASES
REFERENCES
• Class notes
• Jin Xiong, Essential Bioinformatics
• file:///C:/Users/student/Downloads/DATABASES%20IN%20BIOINFORMATICS.pdf
• https://www.ebi.ac.uk/training/online/course/bioinformatics-terrified/what-database/relational-databases/primary-and-secondary-databases
• http://www.diffen.com/difference/Data_vs_Information
• Google images
