Introduction To Biological Databases
INTRODUCTION
• One of the hallmarks of modern genomic research is the
  generation of enormous amounts of raw sequence data.
• Thus, the very first challenge in the genomics era is to store and
  handle the staggering volume of information through the
  establishment and use of computer databases.
WHAT IS A DATABASE?
• A database is a computerized archive used to store and organize
  data in such a way that information can be retrieved easily via a
  variety of search criteria.
• Databases are composed of computer hardware and software for
  data management.
• The chief objective of the development of a database is to organize
  data in a set of structured records to enable easy retrieval of
  information.
    – Each record, also called an entry, should contain a number of fields that hold
      the actual data items, for example, fields for names, phone numbers,
      addresses, dates.
    – To retrieve a particular record from the database, a user can specify a
      particular piece of information, called a value, to be found in a particular field
      and expect the computer to retrieve the whole data record. This process is
      called making a query.
• Although data retrieval is the main purpose of all databases,
  biological databases often have a higher level of requirement, known
  as knowledge discovery, which refers to the identification of
  connections between pieces of information that were not known
  when the information was first entered.
    – For example, databases containing raw sequence information can perform
      extra computational tasks to identify sequence homology or conserved motifs.
• These features facilitate the discovery of new biological insights
  from raw data.
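The ideas of records, fields, values, and queries, together with the extra computation behind knowledge discovery, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the accession names, and the sequences are hypothetical toy data, not entries from any real database.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One database entry; each attribute is a field."""
    accession: str
    organism: str
    sequence: str


# Hypothetical toy records, invented for illustration only.
RECORDS = [
    Record("SEQ001", "Homo sapiens", "ATGGCGTATAAAGGC"),
    Record("SEQ002", "Mus musculus", "ATGCCGTTTGACCTA"),
    Record("SEQ003", "Homo sapiens", "GGCTATAAAGCATGA"),
]


def query(records, field, value):
    """Making a query: return every whole record whose field holds the value."""
    return [r for r in records if getattr(r, field) == value]


def find_motif(records, motif):
    """A simple extra computational task: map each accession to the
    0-based positions at which the motif occurs in its sequence."""
    hits = {}
    for r in records:
        positions = [
            i
            for i in range(len(r.sequence) - len(motif) + 1)
            if r.sequence.startswith(motif, i)
        ]
        if positions:
            hits[r.accession] = positions
    return hits
```

Calling `query(RECORDS, "organism", "Homo sapiens")` returns the two complete human records, while `find_motif(RECORDS, "TATAAA")` reports which entries share that motif, a connection that was not stored explicitly when the records were entered.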
2/13/2012
RELATIONAL DATABASES
• Instead of using a single table as in a flat file database, relational
  databases use a set of tables to organize data.
    – Each table, also called a relation, is made up of columns and rows.
    – Columns represent individual fields. Rows represent values in the fields of
      records.
    – The columns in a table are indexed according to a common feature called an
      attribute, so they can be cross-referenced in other tables.
    – To execute a query in a relational database, the system selects linked data
      items from different tables and combines the information into one report.
• Therefore, specific information can be found more quickly from a
  relational database than from a flat file database.
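A minimal sketch of such a cross-table query, using Python's built-in sqlite3 module. The table names, column names, and rows are assumptions made up for this example; the point is only that the shared organism_id column lets the system link rows from two tables and combine them into one report.

```python
import sqlite3


def build_db():
    """Create an in-memory database with two linked tables (toy data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE organism (
            organism_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE sequence (
            accession   TEXT PRIMARY KEY,
            organism_id INTEGER REFERENCES organism(organism_id),
            length      INTEGER
        );
        INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
        INSERT INTO sequence VALUES
            ('SEQ001', 1, 1500), ('SEQ002', 2, 900), ('SEQ003', 1, 2100);
        """
    )
    return conn


def sequences_for(conn, organism_name):
    """Select linked rows from both tables and combine them into one report."""
    rows = conn.execute(
        """
        SELECT s.accession, o.name, s.length
        FROM sequence AS s
        JOIN organism AS o ON o.organism_id = s.organism_id
        WHERE o.name = ?
        ORDER BY s.accession
        """,
        (organism_name,),
    )
    return rows.fetchall()
```

For example, `sequences_for(build_db(), "Homo sapiens")` returns one row per human sequence, each combining the accession and length from one table with the organism name from the other.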
OBJECT-ORIENTED DATABASES
• In an object-oriented programming language, an object can be
  considered as a unit that combines data and mathematical routines
  that act on the data.
    – The database is structured such that the objects are linked by a set of
      pointers defining predetermined relationships between the objects.
    – Searching the database involves navigating through the objects with the aid
      of the pointers linking different objects.
    – Programming languages like C++ are used to create object-oriented
      databases.
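The pointer-based navigation described above can be illustrated with a small sketch. It uses Python object references in place of pointers, and the class names, attribute names, and identifiers are hypothetical; a real object-oriented database would add persistence and indexing on top of this idea.

```python
class Structure:
    """Object holding a structure identifier (toy data)."""

    def __init__(self, structure_id):
        self.structure_id = structure_id


class Protein:
    """Object holding a protein name and a pointer to its structure."""

    def __init__(self, name, structure=None):
        self.name = name
        self.structure = structure  # predetermined link to a Structure


class Gene:
    """Object holding a gene symbol and a pointer to its protein product."""

    def __init__(self, symbol, protein=None):
        self.symbol = symbol
        self.protein = protein  # predetermined link to a Protein


def structure_for_gene(gene):
    """Search by navigation: follow gene -> protein -> structure.

    Returns the structure identifier, or None if a link is missing.
    """
    protein = gene.protein
    if protein is None or protein.structure is None:
        return None
    return protein.structure.structure_id


# Hypothetical linked objects.
gene_a = Gene("geneA", Protein("proteinA", Structure("STRUCT1")))
gene_b = Gene("geneB", Protein("proteinB"))  # no structure linked
```

Here `structure_for_gene(gene_a)` walks two links to reach "STRUCT1", while `structure_for_gene(gene_b)` returns None because the second link was never defined.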
BIOLOGICAL DATABASES
• Current biological databases use all three types of database
  structures:
    – Flat file, relational, and object-oriented.
• Based on their contents, biological databases can be roughly
  divided into three categories:
    – Primary databases,
    – Secondary databases,
    – Specialized databases.
• Primary databases contain original biological data.
    – They are archives of raw sequence or structural data submitted by the
      scientific community.
    – GenBank and Protein Data Bank (PDB) are examples of primary databases.
• Secondary databases contain computationally processed or
  manually curated information, based on original information
  from primary databases.
    – Translated protein sequence databases containing functional annotation
      belong to this category.
    – Examples are SWISS-PROT and Protein Information Resource (PIR)
      (successor of Margaret Dayhoff’s Atlas of Protein Sequence and
      Structure).
• Specialized databases are those that cater to a particular
  research interest.
    – For example, FlyBase, the HIV sequence database, and the Ribosomal Database
      Project are databases that specialize in a particular organism or a
      particular type of data.
• The data record also provides cross-referencing links to other
  online resources of interest.
• Other features such as very low redundancy and high level of
  integration with other primary and secondary databases make
  SWISS-PROT very popular among biologists.
• A recent effort to combine SWISS-PROT, TrEMBL, and PIR led
  to the creation of the UniProt database, which has larger coverage
  than any one of the three databases while at the same time
  maintaining the original SWISS-PROT features of low redundancy,
  cross-references, and a high quality of annotation.
SPECIALIZED DATABASES
• Specialized databases normally serve a specific research
  community or focus on a particular organism.
    – The content of these databases may be sequences or other types of information.
    – The sequences in these databases may overlap with a primary database, but may also
      include new data submitted directly by authors.
• Because they are often curated by experts in the field, they may
  have unique organizations and additional annotations associated
  with the sequences.
    – Many taxon-specific genome databases fall within this category.
    – Examples include FlyBase, WormBase, AceDB, and TAIR.
• In addition, there are also specialized databases that contain
  original data derived from functional analysis.
    – For example, the GenBank EST database and the Microarray Gene Expression Database
      at the European Bioinformatics Institute (EBI) are among the gene expression
      databases available.
• Redundancy is another major problem affecting primary databases.
    – The causes include repeated submission of identical or overlapping sequences by the same or
      different authors, revision of annotations, dumping of expressed sequence tag (EST) data, and
      poor database management that fails to detect the redundancy.
• Errors in annotation can be particularly damaging.
    – The large majority of new sequences are assigned functions based on similarity with
      sequences in the databases that are already annotated.
    – Therefore, a wrong annotation can be easily transferred to all similar genes in the entire
      database.
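As a rough illustration of how the simplest kind of redundancy might be detected, the sketch below drops entries whose sequence is identical to, or entirely contained in, a sequence already kept. The function name and the toy records are assumptions for this example; partially overlapping sequences, which would need alignment, are deliberately not handled.

```python
def remove_redundant(records):
    """Drop exact duplicates and fully contained fragments.

    `records` is a list of (identifier, sequence) pairs. Sequences are
    compared case-insensitively. Longer sequences are examined first, so a
    fragment is always checked against the longer sequences that might
    contain it; among equal-length duplicates the earliest entry is kept.
    """
    kept = []
    for ident, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        seq = seq.upper()
        if any(seq in kept_seq for _, kept_seq in kept):
            continue  # identical to, or contained in, a kept sequence
        kept.append((ident, seq))
    return kept


# Hypothetical submissions: B repeats A, C is a fragment of A.
SUBMISSIONS = [
    ("A", "ATGCATGC"),
    ("B", "atgcatgc"),
    ("C", "GCAT"),
    ("D", "TTTT"),
]
```

With these toy submissions, `remove_redundant(SUBMISSIONS)` keeps only entries A and D.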
INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES
INTRODUCTION
• A major goal in developing databases is to provide efficient and
  user-friendly access to the data stored.
• There are a number of retrieval systems for biological data. The
  most popular retrieval systems for biological databases are Entrez
  and the Sequence Retrieval System (SRS), which provide access to
  multiple databases for retrieval of integrated search results.
ENTREZ
• The NCBI developed and maintains Entrez, a biological database
  retrieval system.
• It is a gateway that allows text-based searches for a wide variety of
  data, including:
   – Annotated genetic sequence information, structural information, as well as citations and abstracts,
     full papers, and taxonomic data.
• The key feature of Entrez is its ability to integrate information,
  which comes from cross-referencing between NCBI databases based
  on preexisting and logical relationships between individual entries.
   – Users do not have to visit multiple databases located in disparate places.
   – For example, on a nucleotide sequence page, one may find cross-referencing links to the
     translated protein sequence, genome mapping data, the related PubMed literature,
     and protein structures if available.
• Effective use of Entrez requires an understanding of the main
  features of the search engine.
PUBMED
• One of the databases accessible from Entrez is a biomedical
  literature database known as PubMed, which contains abstracts
  and in some cases the full text articles from nearly 4,000 journals.
• An important feature of PubMed is the retrieval of information
  based on medical subject headings (MeSH) terms.
   – The MeSH system consists of a collection of more than 20,000 controlled and standardized
     vocabulary terms used for indexing articles.
• Another way to broaden the retrieval is by using the “Related
  Articles” option.
   – PubMed uses a word weight algorithm to identify related articles with similar words in the
     titles, abstracts, and MeSH.
   – By using this feature, articles on the same topic that were missed in the original search can be
     retrieved.
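A MeSH-based PubMed search can also be expressed as a URL. The sketch below only builds the request and does not send it. It assumes NCBI's E-utilities ESearch endpoint and the `[MeSH Terms]` field tag, neither of which is described in these slides; the function name and the example term are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint: the ESearch service of NCBI's E-utilities.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_mesh_search_url(mesh_term, max_results=20):
    """Build (but do not send) a PubMed search restricted to a MeSH term."""
    params = {
        "db": "pubmed",                      # which Entrez database to search
        "term": f"{mesh_term}[MeSH Terms]",  # limit the match to MeSH indexing
        "retmax": max_results,               # cap on the number of IDs returned
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"
```

For instance, `pubmed_mesh_search_url("Neoplasms")` yields a URL whose query string asks PubMed for articles indexed under that heading; the square brackets and the space are percent-encoded by `urlencode`.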
OMIM
• Another unique database accessible from Entrez is Online
  Mendelian Inheritance in Man (OMIM).
    – It is a non-sequence-based database of human disease genes and human genetic disorders.
    – Each entry in OMIM contains summary information about a particular disease as well as genes
      related to the disease.
    – The text contains numerous hyperlinks to literature citations, primary sequence records, as
      well as chromosome loci of the disease genes.
• The database can serve as an excellent starting point to study
  genes related to a disease.
GENBANK
• GenBank is the most complete collection of annotated nucleic acid
  sequence data for almost every organism.
• The content includes genomic DNA, mRNA, cDNA, ESTs,
  high-throughput raw sequence data, and sequence polymorphisms.
• There is also a GenPept database for protein sequences, the
  majority of which are conceptual translations from DNA sequences,
  although a small number of the amino acid sequences are derived
  using peptide sequencing techniques.
• The third section of the flat file is the sequence itself, starting with
  the label “ORIGIN.”
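A minimal sketch of pulling the sequence out of that third section. It assumes the usual flat file layout, in which each sequence line after ORIGIN starts with a base count followed by space-separated blocks of bases and the record ends with a line containing "//". The sample record is an invented, abbreviated example, not a real GenBank entry.

```python
def extract_origin_sequence(flat_file_text):
    """Return the sequence found between the ORIGIN label and the '//' line."""
    in_origin = False
    parts = []
    for line in flat_file_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True  # sequence lines start after this label
            continue
        if line.startswith("//"):
            break  # assumed end-of-record marker
        if in_origin:
            # Keep letters only, dropping the leading base count and spaces.
            parts.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(parts).upper()


# Hypothetical, heavily abbreviated record for illustration.
SAMPLE_RECORD = """LOCUS       EXAMPLE1       24 bp    DNA
ORIGIN
        1 atggcgtata aaggcatgcc
       21 gttt
//
"""
```

Calling `extract_origin_sequence(SAMPLE_RECORD)` joins the numbered lines into a single uppercase string of 24 bases; text without an ORIGIN label gives an empty string.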