Database in bioinformatics
Database
 A computerized archive used to store and organize data in such a way that information can be retrieved easily.
 A database is a repository of information with a specific structure that enables the entry and extraction of data.
 In general, this structure consists of files or tables, each containing numerous records and fields.
 A database system (DBS) is an integrated collection of related files, along with details about their definition, interpretation, manipulation and maintenance.
 A database system is built around the data and is run using software called a database management system (DBMS).
 A database system protects the data from unauthorized access.
 A DBMS is a collection of programs that enables users to create and maintain a database.
Database management systems
 Database management systems provide several functions in
addition to simple file management:
 control security
 maintain data integrity
 provide for backup and recovery
 control redundancy
 allow data independence
 provide non-procedural query language
 perform automatic query optimization
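Two of the functions listed above, maintaining data integrity and providing a non-procedural query language, can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name, column names and accession values are hypothetical and do not come from any real database.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The PRIMARY KEY and CHECK constraints let the DBMS enforce integrity
# and reject redundant entries (hypothetical schema).
conn.execute(
    "CREATE TABLE sequence ("
    " accession TEXT PRIMARY KEY,"
    " organism TEXT NOT NULL,"
    " length INTEGER CHECK (length > 0))"
)
conn.execute("INSERT INTO sequence VALUES ('ACC0001', 'Escherichia coli', 1200)")
conn.execute("INSERT INTO sequence VALUES ('ACC0002', 'Homo sapiens', 850)")

# A second record with the same accession is refused by the DBMS.
try:
    conn.execute("INSERT INTO sequence VALUES ('ACC0001', 'Mus musculus', 300)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Non-procedural query: we state WHAT we want, not HOW to find it.
long_entries = conn.execute(
    "SELECT accession FROM sequence WHERE length > 1000 ORDER BY accession"
).fetchall()
```

The query planner inside the DBMS decides how to execute the SELECT, which is the sense in which query optimization is automatic.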
Organisation
 Databases are organised in two main ways:
 flat files
 relational databases
Flat-file databases
 the simplest form of a database,
 where collections of data, such as nucleotide and amino acid sequences, are stored as either a large single text file
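A minimal sketch of reading a flat-file sequence collection is given here. It assumes a FASTA-style layout (a header line starting with ">" followed by sequence lines); that layout is an assumption made for illustration and is not specified in the slides, and the two entries are hypothetical.

```python
def read_flat_file(text):
    """Parse a FASTA-style flat file into {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; remember its header.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated onto the current record.
            records[header] += line
    return records


# Hypothetical flat file holding one nucleotide and one protein entry.
flat_file = """>seq1 hypothetical nucleotide entry
ATGGCC
TAA
>seq2 hypothetical protein entry
MKV
"""
records = read_flat_file(flat_file)
```

Every lookup has to scan the text from the top, which is why flat files become slow to search as they grow.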
Relational databases
 A relational database treats all of its data as a collection of relations.
 It stores the data within a number of tables.
 Each table consists of records and fields (rows and columns).
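The idea of tables made of records (rows) and fields (columns) can be sketched with Python's built-in sqlite3 module. The two tables, their column names and their contents are hypothetical and serve only to show how related tables are linked through a shared key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two related tables: each row is a record, each column a field.
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name TEXT
);
CREATE TABLE sequence (
    accession TEXT PRIMARY KEY,
    organism_id INTEGER REFERENCES organism(organism_id),
    residues TEXT
);
INSERT INTO organism VALUES (1, 'Escherichia coli');
INSERT INTO sequence VALUES ('ACC0001', 1, 'ATGGCCTAA');
INSERT INTO sequence VALUES ('ACC0002', 1, 'ATGAAATGA');
""")

# The organism name is stored once and joined to each sequence on demand.
rows = conn.execute(
    "SELECT s.accession, o.name"
    " FROM sequence AS s JOIN organism AS o"
    " ON o.organism_id = s.organism_id"
    " ORDER BY s.accession"
).fetchall()
```

Storing the organism name in its own table, rather than repeating it in every sequence record, is one way a relational layout keeps redundancy under control.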
Types of Database
 Databases can be classified into three categories on the basis of the information stored: primary, secondary and composite databases.
 Primary databases contain data that is derived experimentally.
 They usually store information related to the sequences or structures of biological components.
 They can be further divided into protein and nucleotide databases.
Primary Database
 These databases contain raw sequence or structure data produced and submitted by researchers worldwide.
 NCBI (National Center for Biotechnology Information)
 GenBank
 DDBJ (DNA Data Bank of Japan)
 Swiss-Prot
 PIR (Protein Information Resource)
 PDB (Protein Data Bank)
 TrEMBL (Translated EMBL)
Protein databases: PIR, MIPS, Swiss-Prot, TrEMBL
Secondary Databases
 contain information derived from primary databases.
 store information such as conserved sequences, active site residues, and signature sequences. Classifications derived from Protein Data Bank structures are also stored in secondary databases.
Examples include:
 Class, Architecture, Topology, Homologous superfamily (CATH),
 Kyoto Encyclopedia of Genes and Genomes (KEGG),
 Protein Families (Pfam)
 and Structural Classification of Proteins (SCOP)
Composite Databases
 are collections of several primary database resources.
 provide users with various tools and software for data analysis.
 NCBI, as a composite database, stores a large number of nucleotide and protein sequences on its servers and therefore suffers from high redundancy in the deposited data.
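The redundancy problem can be illustrated with a small sketch that merges entries from two sources and groups identifiers sharing an identical sequence. The function name, the source dictionaries and the identifiers are all hypothetical; real composite databases use their own, more elaborate procedures.

```python
def merge_non_redundant(*sources):
    """Merge {identifier: sequence} sources, grouping identical sequences."""
    merged = {}
    for source in sources:
        for identifier, sequence in source.items():
            # Compare sequences case-insensitively; identical sequences
            # collapse into one entry that remembers every identifier.
            key = sequence.upper()
            merged.setdefault(key, []).append(identifier)
    return merged


# Two hypothetical primary sources with one sequence in common.
db_a = {"A1": "MKVLA", "A2": "MSTNP"}
db_b = {"B7": "mkvla", "B9": "MQQRT"}
merged = merge_non_redundant(db_a, db_b)
```

Four deposited entries collapse to three distinct sequences, and the shared sequence keeps a cross-reference to both of its source identifiers.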
Biological databases
 Biological databases can be broadly classified into
 sequence databases,
 structure databases,
 and pathway databases.
 Sequence databases cover both nucleic acid and protein sequences, whereas structure databases mainly cover proteins.
Sequence databases
 Nucleotide and protein sequence databases are the most widely used and among the best established biological databases.
 They serve as repositories for wet-lab results and as the primary source of experimental results.
 The major public nucleotide data banks are
 GenBank in the USA,
 EMBL (European Molecular Biology Laboratory) in Europe,
 and DDBJ (DNA Data Bank of Japan) in Japan.
 Protein databases include
 ExPASy
 UniProt
 PIR
 PDB
 Swiss-Prot
 TrEMBL
NATIONAL CENTER FOR BIOTECHNOLOGY INFORMATION (NCBI)
 established at the National Institutes of Health (NIH) in 1988
 part of the National Library of Medicine at the National Institutes of Health
 provides access to a large amount of biomedical and genomic information (www.ncbi.nlm.nih.gov/home/about/mission.shtml).
 It maintains a large number of databases and bioinformatics tools as well as services.
 One of its most popular databases is GenBank.
Mission or role
 The aim is to find novel techniques and methodologies for dealing with huge and complex data
 and to provide better access to analytical and computational tools.
 Maintenance of biological databases, whether primary or secondary, including GenBank.
 NCBI provides data retrieval systems such as Entrez.
 It provides computational resources for the analysis of GenBank data and other biological data.
Resources
 The resources available on this site can be divided into two major categories:
 1) databases
 2) tools
 The major databases maintained at NCBI are GenBank and PubMed (a bibliographic database for biomedical literature).
 Other databases include
 Gene,
 Genome,
 Epigenomics,
 Gene Expression,
 RefSeq,
 Structure,
 Database of Short Genetic Variation (dbSNP),
 Taxonomy, etc.
TOOLS at NCBI
 NCBI also provides a variety of tools for database searching.
 Entrez is the search engine of NCBI.
 Other tools include
 Genomes Browser,
 BLAST,
 CDTree,
 Genetic Codes,
 Open Reading Frame Finder (ORF Finder),
 SNP Database Specialized Search Tools.
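Entrez can also be queried programmatically through NCBI's E-utilities web interface, which the slides do not cover. The sketch below only builds a request URL for the efetch endpoint and does not contact the network. The helper name is hypothetical, and the accession is used purely as an example value.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(database, identifier, rettype="gb", retmode="text"):
    """Build an efetch URL for one record (hypothetical helper)."""
    query = urlencode(
        {"db": database, "id": identifier, "rettype": rettype, "retmode": retmode}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Request a nucleotide record in GenBank flat-file format as plain text.
url = efetch_url("nuccore", "U49845")
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the record as text, subject to NCBI's usage policies.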
GenBank
 GenBank (Genetic Sequence Databank)
 GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).
 It was established in 1982 and is now maintained by NCBI.
 It contains publicly available nucleotide sequences.
 DNA sequences can be submitted to GenBank using several different methods:
 BankIt: a web-based form for submitting a small number of sequences
 Sequin: more appropriate for complicated submissions containing many sequences
Structure of GenBank
 A nucleotide sequence record in this database has the following structure:
 1. Locus: the title given by GenBank itself to name the sequence entry. It includes the following:
 a. Locus Name: similar to the accession number for the sequence.
 b. Sequence Length: the number of bases in the sequence.
 c. Molecule Type: identifies the type of nucleic acid sequence. The various types are mRNA (which is presented as cDNA), rRNA, snRNA, and DNA.
 d. GB Division: the class of the data according to the classification criteria of GenBank.
 e. Modification Date: the date on which the record was last modified.
 2. Definition: the name of the nucleotide sequence.
 3. Accession: covers the accession number, accession version, and GI number. The accession number is the unique identifier associated with each nucleotide sequence in the database.
 4. Version: the identification number assigned to a single, specific sequence in the database. This number has the format "accession.version".
 5. GI: also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned.
 6. Keywords: defined words used to index the entries.
 7. Source: describes the organism from which the sequence was obtained.
 8. Organism: the scientific name (usually genus and species) and phylogenetic lineage.
 9. Reference: citations of publications by the sequence authors, including the journal from which the sequence was derived.
 10. Features: information derived from the sequence, such as
 biological source,
 exon,
 intron,
 promoters,
 CDS,
 alternate splice,
 Base Count,
 Origin
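The Locus and Version fields described above can be pulled apart with a few lines of Python. This is a minimal sketch that assumes whitespace-separated tokens on the LOCUS line, with an optional topology token (linear or circular) before the division. The function names are hypothetical, and the example line is meant only to show the layout.

```python
def parse_locus_line(line):
    """Split a LOCUS line into the sub-fields listed under item 1."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line")
    rest = tokens[5:]
    return {
        "locus_name": tokens[1],
        "sequence_length": int(tokens[2]),  # tokens[3] is the unit, e.g. 'bp'
        "molecule_type": tokens[4],
        # Some records carry a topology token before the division.
        "topology": rest[0] if len(rest) == 3 else None,
        "division": rest[-2],
        "modification_date": rest[-1],
    }


def split_version(version):
    """Split 'accession.version' into (accession, version number)."""
    accession, _, number = version.rpartition(".")
    return accession, int(number)


locus = parse_locus_line(
    "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"
)
accession, version_number = split_version("U49845.1")
```

A production parser would read fixed column positions and the remaining record sections; established libraries already do this and are the better choice for real work.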
European Molecular Biology Laboratory (EMBL)
 The EMBL Nucleotide Sequence Database is maintained by the European Bioinformatics Institute (EBI), UK.
 EMBL was founded in 1974.
 It develops and maintains a large number of databases, and scientists can access the data free of cost.
 This database serves as the primary source of nucleotide sequences for Europe.
 Nucleotide sequence data generated by large-scale genome-sequencing projects, and data available from the European Patent Office, can be submitted to this database.
 Data collection is done in collaboration with GenBank (USA) and the DNA Data Bank of Japan (DDBJ).
 The other genomic databases held at EBI are
 Ensembl (a database of genome annotation)
 Genome Reviews.
 Daily releases of the database contain new submissions and updated sequence data, while the entire database is released every 3 months.
DDBJ
 DDBJ (DNA Data Bank of Japan) is a biological database that collects DNA sequences submitted by researchers.
 It is run by the National Institute of Genetics, Japan.
DDBJ Flat File Format
 Data submitted to DDBJ is managed and retrieved according to the DDBJ format (flat file).
 The flat file includes the sequence together with information about the submitter, references, source organism, features, etc.
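In this flat-file format each entry ends with a line containing only "//", so a file holding many entries can be split into individual records before any field is parsed. The sketch below shows that step; the function name and the two sample entries are hypothetical.

```python
def split_records(text):
    """Split a multi-entry flat file into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            # The terminator closes the record collected so far.
            if current:
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record even if its terminator is missing.
    if any(part.strip() for part in current):
        records.append("\n".join(current))
    return records


# Two hypothetical, heavily abbreviated entries.
sample = (
    "LOCUS       HYPO0001\n"
    "DEFINITION  hypothetical entry one.\n"
    "//\n"
    "LOCUS       HYPO0002\n"
    "DEFINITION  hypothetical entry two.\n"
    "//\n"
)
entries = split_records(sample)
```

Each returned string can then be handed to a field-level parser for the submitter, reference, source and feature sections.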
Ensembl Genome Database
 Ensembl is one of several well-known genome browsers for retrieving genomic information from organisms including humans, plants, bacteria and animals.
 It was created and is maintained by the EBI and the Sanger Centre (UK).
Databases for green plants
 There are three comparative genomic databases for green plants, namely
 GreenPhylDB,
 PLAZA,
 Phytozome.
 These databases aim to support genomic studies related to plant evolution and
 to provide comparative data on genomes and gene families, along with tools for their analysis.
 They provide information on
 the genomic context of plant genes,
 gene homologues and paralogues,
 RNA transcripts from the given genes,
 peptide sequences, and
 functions of gene families.
 They allow access to the complete genome sequences available in the database.
Protein Databases
Swiss-Prot
 Swiss-Prot is a protein sequence and knowledge database.
 It is well known for its high-quality annotation, use of standardized nomenclature, and links to specialized databases.
 Its repository contains the amino acid sequence, the protein name and description, taxonomic data, and citation information.
Pfam
 A database of protein families, Pfam contains annotations as well as multiple sequence alignments generated using hidden Markov models.
 TrEMBL: The European Bioinformatics Institute, collaborating with Swiss-Prot, introduced another database, TrEMBL (translation of the EMBL nucleotide sequence database).
 This database consists of computer-annotated entries obtained from the translation of all coding sequences in the nucleotide databases.
 PIR: The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies.
 PIR serves the scientific community through online access and by performing offline sequence identification services for researchers.
 It is a database of freely accessible protein sequences that contains high-quality data and functional information for the proteins.
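The translation step behind TrEMBL entries, turning a coding sequence into a protein sequence, can be sketched with the standard genetic code. This is a simplified illustration and not the pipeline used by the database: it assumes the sequence starts in frame, ignores alternative genetic codes, and marks unrecognised codons with "X".

```python
# Standard genetic code, laid out in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    cds = cds.upper().replace("U", "T")  # accept RNA as well as DNA letters
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


short_protein = translate("ATGGCCTAA")
longer_protein = translate("ATGAAATGGTGA")
```

Entries produced this way carry only automatic annotation, which is the main difference from the manually curated Swiss-Prot records.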
Structure databases
 There are many structural databases, including the Protein Data Bank (PDB).
 PDB is important in solving real problems in molecular biology.
 It was established in 1971 at Brookhaven National Laboratory (BNL).
 It contains structural information on macromolecules determined by X-ray crystallography and NMR methods.
 PDB is maintained by the Research Collaboratory for Structural Bioinformatics (RCSB).
 PROSITE is a database of protein domains and families.
 It contains biologically significant sites, patterns and profiles that help to reliably identify the known protein family to which a new sequence belongs.
 CATH: The CATH database (Class, Architecture, Topology, Homologous superfamily) is a hierarchical classification of protein domain structures, which clusters proteins at four major structural levels.
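PROSITE patterns are written in a compact syntax that maps closely onto regular expressions, which makes it easy to sketch how a pattern is matched against a new sequence. The converter below is a simplified illustration that handles only the common elements (x, [..], {..}, repetition counts and the anchors). The function name and the test sequence are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                       # any residue
        element = element.replace("{", "[^").replace("}", "]")    # excluded residues
        element = element.replace("(", "{").replace(")", "}")     # repetition counts
        element = element.replace("<", "^").replace(">", "$")     # sequence ends
        parts.append(element)
    return "".join(parts)


# N-glycosylation site pattern: N, anything but P, S or T, anything but P.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}")
spacing_regex = prosite_to_regex("C-x(2,4)-C")

# Scan a hypothetical protein sequence for the site.
match = re.search(glyco_regex, "AANGSAA")
```

Short patterns like this one match by chance quite often, which is one reason profiles are also used for family assignment.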
Pathway databases
 A pathway database (DB) describes biochemical pathways, reactions, and enzymes.
 Some examples of pathway databases are
 KEGG (Kyoto Encyclopedia of Genes and Genomes),
 BRENDA,
 BioCyc.
 KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is the primary resource of the Japanese GenomeNet service.
 It is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals.
 KEGG contains three databases: PATHWAY, GENES, and LIGAND.
 The PATHWAY database stores computerized knowledge on molecular interaction networks.
 The GENES database contains data concerning sequences of genes and proteins generated by the genome projects.
 The LIGAND database holds information about the chemical compounds and chemical reactions that are relevant to cellular processes.
 BioCyc: The BioCyc Database Collection is a compilation of pathway and genome information for different organisms.
 It includes two other databases:
 EcoCyc, which describes Escherichia coli K-12;
 MetaCyc, which describes pathways for more than 300 organisms.
