"Biological Database" under the following five headings:
1. Concept
2. Types of Databases (from all possible viewpoints)
3. Primary, Secondary, and Specialized Databases
4. Interconnection between Databases
5. Information Retrieval from Biological Databases
1. Concept of Biological Database
A biological database is a structured collection of biological information that is stored,
organized, and made accessible for easy retrieval, updating, and management. These
databases are an essential component of bioinformatics and are used for storing
everything from DNA and protein sequences to gene expression profiles, molecular
structures, biochemical pathways, and even disease-related information.
Biological research generates a huge volume of data from genome sequencing, protein
structure determination, gene expression studies, and other experiments. Managing and
interpreting this data manually is impossible. This is where biological databases come in.
They serve as repositories where biological data can be stored, curated, and accessed by
researchers across the globe.
Why Do We Need Biological Databases?
Modern biological studies rely on high-throughput technologies that produce massive
data volumes. For example:
A single genome sequencing project can generate billions of base pairs.
Transcriptomic studies using RNA-Seq produce vast gene expression data.
Proteomics experiments identify and quantify thousands of proteins.
Without databases, managing such data becomes inefficient and error-prone. Biological
databases:
Ensure data standardization and uniformity
Enable data sharing among the global scientific community
Support data mining for hypothesis generation
Facilitate data integration from various experiments
Aid in experimental reproducibility and validation
Features of Biological Databases
Most biological databases have the following features:
Accessibility: Available online through public or institutional platforms
Searchability: Equipped with user-friendly search tools (e.g., keyword, accession
number, sequence search)
Annotation: Provide metadata like gene function, structure, and literature
references
Curation: Maintained by scientists or automated systems to ensure data accuracy
Interoperability: Linked to other databases for cross-reference (e.g., linking
UniProt protein entries to PDB structures)
How Biological Databases Are Used
Biological databases are used by:
Molecular biologists: For retrieving gene sequences and annotations
Medical researchers: For studying disease-associated mutations
Evolutionary biologists: For comparing genomes and building phylogenetic trees
Pharmacologists: For finding drug targets
Students and educators: For learning and teaching biological concepts
Examples of Common Biological Databases
Some widely used biological databases include:
GenBank: A comprehensive nucleotide sequence database
EMBL-EBI: Offers many databases like Ensembl (genome), InterPro (protein
domains), and Reactome (pathways)
UniProt: A protein sequence and function database
PDB (Protein Data Bank): Contains 3D structures of proteins and nucleic acids
NCBI: An umbrella platform that includes GenBank, PubMed, RefSeq, and more
Evolution of Biological Databases
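The earliest biological databases date back to the 1960s and 1970s: Margaret Dayhoff's
Atlas of Protein Sequence and Structure appeared in 1965, and the Protein Data Bank was
founded in 1971. The nucleotide sequence archives followed in the early 1980s, when the
EMBL Data Library and GenBank were established. With the rise of computational biology
and the Human Genome Project (1990–2003), the volume of biological data exploded,
necessitating more advanced and integrated databases.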
Today’s databases are not just static repositories—they are interactive platforms that
allow users to:
Compare sequences using BLAST
Visualize genes on chromosomes using genome browsers
Explore protein networks and pathways
Download large datasets for local analysis
Categories of Data in Biological Databases
Biological databases may store different types of data:
Nucleotide sequences (e.g., DNA, RNA)
Protein sequences and structures
Gene expression profiles
Biological pathways and interactions
Genetic variation and mutation data
Taxonomy and evolutionary information
Literature and clinical data
Challenges in Managing Biological Databases
Despite their usefulness, biological databases face several challenges:
Data redundancy: Same sequences or entries may exist in multiple formats or
databases
Data inconsistency: Conflicting annotations between sources
Storage: Storing and managing petabytes of data
Curation: Keeping data up-to-date manually or through automation
Integration: Linking across different database formats and platforms
Efforts are ongoing to overcome these challenges using cloud computing, machine
learning, and data-sharing frameworks such as the FAIR principles (Findable,
Accessible, Interoperable, Reusable).
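The data redundancy challenge listed above can be illustrated with a small sketch. It assumes that two entries count as redundant when their sequences are identical after removing whitespace and case differences; real databases use more elaborate criteria. The entry IDs and sequences are invented for illustration, and the function names are hypothetical.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Return a SHA-256 digest of a sequence, ignoring case and whitespace."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


def group_redundant(entries: dict[str, str]) -> list[list[str]]:
    """Group entry IDs whose sequences are identical after normalization."""
    groups: dict[str, list[str]] = {}
    for entry_id, sequence in entries.items():
        groups.setdefault(sequence_checksum(sequence), []).append(entry_id)
    # Only groups with more than one member represent redundant entries.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical entries: IDs and sequences are made up for illustration.
entries = {
    "entryA": "ATGGCC TTA",
    "entryB": "atggccttA",
    "entryC": "ATGGCCTTG",
}
print(group_redundant(entries))  # [['entryA', 'entryB']]
```

Comparing checksums instead of full sequences keeps the comparison cheap when entries are long.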
2. Types of Databases in All Possible Viewpoints
Biological databases can be classified in multiple ways, depending on the nature of the
data, how the data is curated, what functions the database serves, and the scope of its
content. Understanding these types is important for using the right database for a specific
purpose and for exam-based discussions.
Let’s break down the types of biological databases from all possible viewpoints:
A. Based on Data Type
This is the most common classification and is based on the kind of biological data stored:
1. Nucleotide Sequence Databases:
Store DNA and RNA sequences. Examples:
o GenBank (NCBI, USA)
o ENA (European Nucleotide Archive, formerly the EMBL Nucleotide Sequence Database; Europe)
o DDBJ (Japan)
2. Protein Sequence Databases:
Store amino acid sequences of proteins. Examples:
o UniProt (includes Swiss-Prot and TrEMBL)
o PIR (Protein Information Resource)
3. Protein Structure Databases:
Contain 3D structures of proteins. Examples:
o Protein Data Bank (PDB)
o SCOP (Structural Classification of Proteins)
o CATH (Class, Architecture, Topology, Homologous superfamily)
4. Genome Databases:
Contain complete genome sequences and annotations. Examples:
o Ensembl
o NCBI Genome
o UCSC Genome Browser
5. Gene Expression Databases:
Store data from microarrays, RNA-Seq, etc. Examples:
o GEO (Gene Expression Omnibus)
o ArrayExpress
6. Pathway and Interaction Databases:
Store molecular pathways and protein interactions. Examples:
o KEGG (Kyoto Encyclopedia of Genes and Genomes)
o Reactome
o BioGRID
7. Mutation and Variation Databases:
Track polymorphisms, SNPs, and disease-related mutations. Examples:
o dbSNP
o COSMIC (Catalogue of Somatic Mutations in Cancer)
o ClinVar
8. Taxonomy and Classification Databases:
Organize species by evolutionary relationship. Examples:
o NCBI Taxonomy Database
o TreeBASE
B. Based on Curation Method
Databases can also be classified based on whether they are manually curated or
automatically generated:
1. Manually Curated Databases:
Reviewed and annotated by experts. They are high-quality and reliable but take
more time to update.
o Examples: Swiss-Prot (part of UniProt), Reactome
2. Automated or Computational Databases:
Populated using algorithms and scripts, often with minimal human supervision.
o Example: TrEMBL (automatically annotated protein database)
C. Based on Function or Purpose
This classification is based on how the database is used:
1. Archival Databases (Also called Primary):
Store raw experimental data. No interpretation, just deposition.
o Examples: GenBank, DDBJ, PDB
2. Analytical Databases:
Store processed, interpreted data often derived from primary databases.
o Examples: InterPro, STRING (protein-protein interactions)
3. Bibliographic Databases:
Focus on literature and references rather than raw sequences or structures.
o Examples: PubMed, OMIM (a literature-derived catalog of human genes and genetic
disorders)
D. Based on Accessibility
Databases can be either:
1. Open Access / Public:
Free to use and share. Most biological databases fall in this category.
o Examples: GenBank, UniProt, KEGG
2. Restricted / Commercial:
Require subscriptions, payment, or institutional access.
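o Examples: HGMD Professional and TRANSFAC (licensed databases); GeneSpring and
Pathway Studio (commercial analysis software with bundled data)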
E. Based on Organism or Domain Specificity
Some databases are general, while others are tailored for specific organisms or domains:
1. General Databases:
Store data for many organisms.
o Examples: NCBI, EMBL, UniProt
2. Organism-Specific Databases:
Store data for one organism.
o Examples:
TAIR (The Arabidopsis Information Resource)
FlyBase (Drosophila)
WormBase (C. elegans)
MGI (Mouse Genome Informatics)
3. Disease-Specific Databases: Focus on a particular disease or medical condition.
o Examples:
HIV Sequence Database
The Cancer Genome Atlas (TCGA)
F. Based on Data Source
1. Experimental Databases:
Contain data from laboratory experiments.
o Examples: GEO, ArrayExpress
2. Computational/Predicted Databases:
Store data predicted using algorithms, such as predicted protein domains or
secondary structures.
o Examples: Pfam (protein families), InterPro
Summary Table for Revision
Classification       | Example Types                     | Sample Databases
Data Type            | Nucleotide, Protein, Structure    | GenBank, UniProt, PDB
Curation Method      | Manual, Automated                 | Swiss-Prot, TrEMBL
Function             | Primary, Secondary, Literature    | GenBank, InterPro, PubMed
Accessibility        | Open, Restricted                  | NCBI, GeneSpring
Organism Specificity | General, Species/Disease-specific | FlyBase, TCGA
Data Source          | Experimental, Predicted           | GEO, Pfam
Importance of These Classifications
Understanding these different viewpoints is helpful because:
It helps researchers choose the right tool or database for analysis.
It improves efficiency in data mining and retrieval.
It supports better experimental planning, especially for omics studies.
It allows students to categorize and memorize databases for exams.
3. Primary, Secondary, and Specialized Databases
Biological databases can be broadly classified into three main categories based on the
nature of the data and the level of processing involved: Primary, Secondary, and
Specialized databases. Understanding these classifications helps researchers and students
determine which database to use based on their scientific needs—whether for raw data,
processed data, or domain-specific information.
A. Primary Databases (Archival Databases)
Definition:
Primary databases are repositories of raw, unprocessed experimental data directly
submitted by researchers. These databases do not interpret or analyze the data but serve
as official records of experimental results.
Characteristics:
Data is usually submitted by scientists after experiments.
No deep analysis or annotation is added.
Mostly serve as reference archives.
Entries are usually assigned accession numbers for tracking.
Often linked to publication requirements (journals require deposition).
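Because accession numbers follow database-specific formats, the format itself often hints at where an entry came from. The sketch below guesses the source of an accession using simplified regular expressions. The patterns are assumptions that cover only common cases, and they overlap (a UniProt accession such as P04637 also fits the one-letter-plus-five-digits GenBank form), so the result is a guess and not a validation. The names `ACCESSION_PATTERNS` and `guess_source` are hypothetical.

```python
import re

# Simplified patterns, checked in order; real databases accept more variants.
ACCESSION_PATTERNS = [
    ("UniProt", re.compile(
        r"^([OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("RefSeq", re.compile(r"^[A-Z]{2}_[0-9]+(\.[0-9]+)?$")),
    ("GenBank nucleotide", re.compile(
        r"^([A-Z][0-9]{5}|[A-Z]{2}([0-9]{6}|[0-9]{8}))(\.[0-9]+)?$")),
    ("PDB", re.compile(r"^[0-9][A-Za-z0-9]{3}$")),
]


def guess_source(accession: str) -> str:
    """Return the first database whose pattern matches, or 'unknown'."""
    candidate = accession.strip()
    for name, pattern in ACCESSION_PATTERNS:
        if pattern.match(candidate):
            return name
    return "unknown"


for acc in ["P04637", "NM_007294.4", "U49845", "1TUP", "not-an-id"]:
    print(acc, "->", guess_source(acc))
```

The UniProt pattern is tested first so that ambiguous accessions resolve to it; a different order would give a different answer for those cases.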
Examples:
1. GenBank (USA) – Maintained by NCBI; stores nucleotide sequences.
2. ENA (European Nucleotide Archive, Europe) – Maintained by EMBL-EBI; formerly the
EMBL Nucleotide Sequence Database.
3. DDBJ (Japan) – DNA Data Bank of Japan.
4. PDB (Protein Data Bank) – Stores 3D structures of biomolecules.
5. ArrayExpress / GEO – Raw gene expression data from microarrays and RNA-Seq.
Advantages:
Serve as a permanent record.
Ensure transparency and reproducibility.
Enable others to reuse the data for new insights.
Limitations:
Data may be incomplete or contain errors.
Lack of biological interpretation.
Annotation is often minimal or inconsistent.
B. Secondary Databases (Derived Databases)
Definition:
Secondary databases contain processed, curated, and annotated data derived from
primary databases. These databases add scientific meaning by interpreting, validating,
and organizing raw data into useful formats.
Characteristics:
Include functional annotation, structure prediction, domain identification, etc.
Often integrate data from multiple primary databases.
Curated either manually (by experts) or computationally.
Enable hypothesis generation and advanced analysis.
Examples:
1. UniProtKB (Swiss-Prot section) – Annotated protein sequences with functions
and pathways.
2. InterPro – Classifies proteins into families and domains.
3. Pfam – Protein families and domains using hidden Markov models.
4. PROSITE – Functional motifs and domains in proteins.
5. Ensembl – Annotated genomes and gene predictions.
Advantages:
Offer higher quality and more reliable information.
Enable comparative studies (e.g., cross-species gene functions).
Aid in education, research, and computational analysis.
Limitations:
May lag behind current data due to time-consuming curation.
Computational annotations can introduce biases or errors.
C. Specialized Databases
Definition:
Specialized databases focus on specific organisms, diseases, biological processes, or
data types. They are designed to address niche research questions and often integrate
highly detailed data and tools for a particular area of biology or medicine.
Characteristics:
Provide in-depth, focused content.
May contain unique datasets not found in general databases.
Include custom tools tailored to the subject domain.
Frequently updated by communities or consortia.
Examples:
1. Organism-Specific Databases:
o TAIR (The Arabidopsis Information Resource) – Arabidopsis genome and
function.
o FlyBase – Genetics and molecular biology of Drosophila melanogaster.
o WormBase – Data on Caenorhabditis elegans.
o MGI – Mouse Genome Informatics.
2. Disease-Specific Databases:
o TCGA (The Cancer Genome Atlas) – Cancer mutations and expression.
o OMIM (Online Mendelian Inheritance in Man) – Human genes and
genetic disorders.
o ClinVar – Variants and their relationship to human health.
3. Process-Specific Databases:
o KEGG – Pathways, enzyme reactions, and gene functions.
o Reactome – Human biological pathways.
o BioGRID – Protein–protein interactions.
Advantages:
Deep coverage of subject area.
Provide advanced tools for analysis (e.g., pathway viewers).
Encourage community participation and expert contributions.
Limitations:
May not be suitable for general-purpose analysis.
Can be under-maintained if funding or contributors decline.
Some may not be freely accessible.
Comparison Table
Feature       | Primary Database        | Secondary Database        | Specialized Database
Data Type     | Raw experimental data   | Curated, analyzed data    | Focused domain-specific data
Curation      | Minimal or none         | Manual or computational   | Often manual and domain-driven
Examples      | GenBank, PDB, GEO       | UniProt, Pfam, InterPro   | TAIR, KEGG, TCGA
Use Case      | Archiving and reference | Functional interpretation | In-depth research on a topic
Accessibility | Mostly open access      | Mostly open access        | May be open or restricted
Conclusion
Understanding the distinction between primary, secondary, and specialized databases is
essential for biological data analysis. Primary databases serve as the foundation,
secondary databases offer interpreted knowledge, and specialized databases provide
focused tools and data. For students, this classification is also critical for exams and
practical research design.
4. Interconnection Between Biological Databases
In modern bioinformatics, biological databases do not function in isolation. Instead, they
form a vast, interconnected network that allows users to move seamlessly from one type
of data to another—such as from a DNA sequence to its protein product, its 3D structure,
biological pathway, or related literature. This integration improves the efficiency,
completeness, and biological relevance of data retrieval and analysis.
A. Why Interconnection is Necessary
Biological systems are complex and multi-dimensional. No single database can capture
the full biological context. For example:
A gene’s DNA sequence might be stored in GenBank.
The protein it encodes may be found in UniProt.
Its 3D structure may be in PDB.
Its function and pathway may be mapped in KEGG or Reactome.
Related mutations may be in ClinVar or COSMIC.
Scientific publications about it may be in PubMed.
Without interconnection, users would have to visit each database manually, wasting
time and risking errors. Interconnected databases solve this problem by linking relevant
data across platforms.
B. How Databases Are Interconnected
Interconnections between databases are established through:
1. Cross-Referencing
One database provides links to entries in other databases.
For example, a UniProt protein entry may link to:
o The corresponding gene in Ensembl or NCBI Gene.
o The structure in PDB.
o Functional pathways in KEGG.
o Expression data in GEO.
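Cross-references of this kind are usually stored as plain fields inside each entry. As a minimal sketch, the code below collects them from the DR (database cross-reference) lines of a UniProt-style flat-file entry. The sample text is abbreviated and illustrative and may not match the current UniProt release; `parse_cross_references` is a hypothetical helper, not part of any UniProt tool.

```python
from collections import defaultdict


def parse_cross_references(flat_file_text: str) -> dict[str, list[str]]:
    """Map each referenced database to the primary IDs found in DR lines."""
    xrefs: dict[str, list[str]] = defaultdict(list)
    for line in flat_file_text.splitlines():
        if not line.startswith("DR "):
            continue
        # Drop the two-letter line code, then split "Database; ID; ..." fields.
        body = line[2:].strip().rstrip(".")
        fields = [field.strip() for field in body.split(";")]
        if len(fields) >= 2:
            xrefs[fields[0]].append(fields[1])
    return dict(xrefs)


# Abbreviated, illustrative entry text (assumed layout of a flat-file record).
sample = """\
ID   P53_HUMAN               Reviewed;         393 AA.
AC   P04637;
DR   PDB; 1TUP; X-ray; 2.20 A; A/B/C=94-312.
DR   KEGG; hsa:7157; -.
DR   GeneID; 7157; -.
"""

print(parse_cross_references(sample))
# {'PDB': ['1TUP'], 'KEGG': ['hsa:7157'], 'GeneID': ['7157']}
```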
2. Shared Identifiers
Common accession numbers, gene symbols, protein IDs, or taxonomic IDs are
used across databases.
Example: NCBI assigns each gene a stable numeric Gene ID (e.g., 7157 for human TP53)
that resources such as GEO, OMIM, ClinVar, and others can reference.
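A shared identifier makes it possible to combine records from separate sources with a simple join. The sketch below merges two toy record sets on the NCBI Gene ID. The Gene IDs are real, but the field values are placeholders and the function name is hypothetical.

```python
# Toy records keyed by NCBI Gene ID; field values are placeholders, not real data.
gene_records = {
    672: {"symbol": "BRCA1"},
    7157: {"symbol": "TP53"},
}
variant_counts = {
    672: {"variants": 3},
    1956: {"variants": 5},
}


def join_on_gene_id(left: dict, right: dict) -> dict:
    """Inner-join two record sets on the Gene IDs they have in common."""
    shared_ids = sorted(left.keys() & right.keys())
    return {gene_id: {**left[gene_id], **right[gene_id]} for gene_id in shared_ids}


print(join_on_gene_id(gene_records, variant_counts))
# {672: {'symbol': 'BRCA1', 'variants': 3}}
```

Records without a matching ID in the other source (7157 and 1956 here) are dropped, which is the usual symptom of incomplete cross-referencing.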
3. Web APIs and Services
Databases offer Application Programming Interfaces (APIs) for real-time data
exchange.
Tools like NCBI Entrez, EBI RESTful APIs, and BioMart allow custom
queries and data mining.
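As a rough illustration of programmatic access, the sketch below builds request URLs for the NCBI Entrez E-utilities (`esearch` to find record IDs, `efetch` to download records). It only constructs the URLs and sends no request. The helper names are hypothetical, and parameter choices such as `retmax=20` are assumptions; NCBI's usage guidelines (rate limits, API keys) should be checked before sending real queries.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL that returns matching record IDs as JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db: str, ids: list[str], rettype: str = "fasta") -> str:
    """Build an EFetch URL that downloads the given records as text."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


print(esearch_url("nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]"))
print(efetch_url("nucleotide", ["NM_007294"]))
```

The resulting URLs can be opened in a browser or passed to any HTTP client.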
4. Data Warehousing
Integrated platforms store and synchronize data from multiple sources.
Example: Ensembl integrates gene, protein, variation, and regulatory data from
NCBI, UniProt, dbSNP, and more.
5. Federated Search Engines
Tools like EB-eye (EBI Search) and NCBI Entrez allow a single query to search
across multiple databases.
C. Examples of Database Interconnection
Let’s explore how some databases are linked in real-life scenarios:
Example 1: From Gene to Protein to Pathway
Gene entry in NCBI Gene contains:
o Nucleotide sequence → GenBank.
o Protein product → UniProt link.
o Related publications → PubMed.
o Pathway info → KEGG or Reactome.
Example 2: From Mutation to Disease
Variant entry in ClinVar:
o Links to the gene in NCBI Gene.
o Associated protein in UniProt.
o Literature in PubMed.
o OMIM for inherited disease data.
Example 3: From Structure to Function
A structure entry in PDB:
o Links to UniProt for sequence and function.
o Links to KEGG or Reactome for metabolic roles.
o Literature references via PubMed.
D. Integrated Platforms and Portals
Several platforms act as data integration hubs, combining multiple databases in one
interface:
Platform | Description                               | Linked Databases
NCBI     | Central bioinformatics hub in the US      | GenBank, PubMed, GEO, ClinVar, etc.
EMBL-EBI | Europe-based bioinformatics portal        | Ensembl, ArrayExpress, InterPro, etc.
UniProt  | Central protein knowledgebase             | Swiss-Prot, TrEMBL, PDB, GO, KEGG
Ensembl  | Genome browser with functional annotation | NCBI, UniProt, OMIM, Reactome
KEGG     | Pathway-based integration                 | Enzyme, Gene, Compound, Drug databases
E. Benefits of Interconnection
1. Ease of Use: One query can retrieve multiple data types.
2. Data Validation: Cross-referencing reduces errors.
3. Efficient Research: Saves time and enhances hypothesis building.
4. Multi-Omics Analysis: Allows integration of genomics, transcriptomics,
proteomics, and metabolomics.
5. Dynamic Updating: Many databases auto-update links based on new research.
F. Challenges in Interconnection
Despite the benefits, interconnecting databases comes with challenges:
Challenge          | Description
Data Inconsistency | Different databases may have conflicting information.
Format Differences | Lack of standard formats makes integration difficult.
Version Conflicts  | Updates in one database may not reflect in others.
Redundancy         | Duplicate records may confuse users.
Dependency         | Failure in one database affects many linked tools.
G. Future Trends in Integration
1. Semantic Web Technologies: Using ontologies (like Gene Ontology) to link data
meaningfully.
2. AI-Powered Data Linking: Machine learning tools to suggest database
connections.
3. Cloud-Based Integration: Real-time data sharing using cloud platforms.
4. Community-Driven Curation: Expert communities maintaining links and
updating records.
Conclusion
The interconnection of biological databases transforms them from isolated data stores
into a rich, collaborative network. This ecosystem helps researchers move from gene
sequences to medical applications efficiently and accurately. For students,
understanding how databases interrelate is crucial for both academic exams and practical
bioinformatics work.
5. Information Retrieval from Biological Databases
Information retrieval is the process of searching, accessing, and extracting relevant
biological data from databases. For students, researchers, and bioinformaticians,
mastering this process is essential to analyze genetic sequences, predict protein functions,
identify mutations, study gene expression, or link disease-related data. In this section, we
explain how to retrieve data, tools used, strategies, challenges, and real-world
applications of biological data retrieval.
A. Importance of Information Retrieval
Biological research produces an enormous volume of data daily, from genome sequences
to protein structures. Without efficient retrieval methods:
Data remains unused.
Research becomes time-consuming.
Scientific conclusions may be incomplete.
Thus, efficient and accurate retrieval ensures meaningful use of biological databases
and facilitates new discoveries.
B. Basic Steps of Information Retrieval
1. Formulate the Query
o Start by identifying keywords or identifiers (e.g., gene name, accession
number, disease name, organism).
o Use Boolean operators like AND, OR, and NOT to refine search.
2. Select the Appropriate Database
o Use GenBank for nucleotide sequences.
o Use UniProt for protein information.
o Use GEO for gene expression data.
o Use KEGG for pathways.
3. Use Database-Specific Search Tools
o BLAST for sequence similarity searches.
o Entrez for integrated searches across NCBI databases.
o EB-eye for EBI databases.
4. Filter and Refine Results
o Apply filters: species, date, data type, relevance.
o Sort by relevance, publication date, or data quality.
5. Download and Analyze
o Export data in formats like FASTA, CSV, GFF, or XML.
o Use bioinformatics tools or software (e.g., MEGA, BioEdit, RStudio).
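The query-formulation step can be sketched as a few helpers that combine search terms with Boolean operators. The field tags shown (`[Gene Name]`, `[Organism]`, `[Title]`) follow the Entrez style used by NCBI; other databases use different syntax. The helper names are hypothetical.

```python
def field(term: str, tag: str) -> str:
    """Attach an Entrez-style field tag, quoting multi-word terms."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{tag}]"


def all_of(*clauses: str) -> str:
    """Require every clause (Boolean AND)."""
    return "(" + " AND ".join(clauses) + ")"


def any_of(*clauses: str) -> str:
    """Accept any clause (Boolean OR)."""
    return "(" + " OR ".join(clauses) + ")"


def exclude(query: str, clause: str) -> str:
    """Remove records matching the clause (Boolean NOT)."""
    return f"{query} NOT {clause}"


query = exclude(
    all_of(field("BRCA1", "Gene Name"), field("Homo sapiens", "Organism")),
    field("predicted", "Title"),
)
print(query)
# (BRCA1[Gene Name] AND "Homo sapiens"[Organism]) NOT predicted[Title]
```

Building the query from parts makes the grouping explicit, which matters because AND, OR, and NOT do not always bind the way a reader expects.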
C. Tools and Interfaces for Retrieval
Tool / Interface                | Purpose                                            | Example Database
BLAST                           | Find similar sequences                             | NCBI, Ensembl
Entrez                          | Search across all NCBI databases                   | NCBI
EB-eye                          | Text search across EMBL-EBI databases              | EMBL-EBI
UniProt Search                  | Protein keyword/ID search                          | UniProt
BioMart                         | Complex queries using filters                      | Ensembl
SRS (Sequence Retrieval System) | Query multiple databases with shared fields        | Used in older systems
PubMed                          | Search scientific literature                       | NCBI
GO Term Search                  | Retrieve genes or proteins with specific functions | Gene Ontology databases
D. Example Search Scenarios
Example 1: Find a gene sequence from NCBI
Go to NCBI Nucleotide Database.
Search: BRCA1 Homo sapiens.
Filter results by RefSeq.
Click on entry → View full sequence in FASTA.
Example 2: Retrieve protein function from UniProt
Go to UniProt.
Search: TP53 human.
View entry → Protein function, pathways, interactions, disease relevance.
Example 3: Compare sequence with BLAST
Paste sequence into BLASTn.
Set database to "nt" (the nucleotide collection) or "refseq_rna"; "nr" is the protein
collection used with BLASTp and BLASTx.
Run BLAST → Get matched sequences and alignment.
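Once BLAST has run, the hits usually need filtering. The sketch below parses BLAST+ tabular output (the 12 default columns of `-outfmt 6`) and keeps hits below an E-value cutoff, sorted by bit score. The sample rows and the cutoff of 1e-5 are invented for illustration, and `parse_hits` is a hypothetical helper.

```python
from dataclasses import dataclass

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


@dataclass
class Hit:
    subject: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bitscore: float


def parse_hits(tabular_text: str, max_evalue: float = 1e-5) -> list[Hit]:
    """Parse tab-separated BLAST hits and keep those at or below max_evalue."""
    hits = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(COLUMNS, line.split("\t")))
        hit = Hit(
            subject=row["sseqid"],
            percent_identity=float(row["pident"]),
            alignment_length=int(row["length"]),
            evalue=float(row["evalue"]),
            bitscore=float(row["bitscore"]),
        )
        if hit.evalue <= max_evalue:
            hits.append(hit)
    # Best-scoring hits first.
    return sorted(hits, key=lambda h: h.bitscore, reverse=True)


# Made-up rows for illustration only.
sample_output = "\n".join([
    "query1\tsubjB\t85.00\t120\t18\t0\t5\t124\t40\t159\t1e-30\t150",
    "query1\tsubjA\t98.50\t200\t3\t0\t1\t200\t11\t210\t2e-100\t360",
    "query1\tsubjC\t70.00\t40\t12\t0\t10\t49\t3\t42\t0.5\t30.1",
])

for hit in parse_hits(sample_output):
    print(hit.subject, hit.percent_identity, hit.evalue)
```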
Example 4: Explore pathway of a gene
Use KEGG Pathway Search.
Input gene: EGFR human.
View metabolic/signaling pathways involving EGFR.
E. Filtering and Data Exporting
Most databases allow export options such as:
FASTA: For sequences.
CSV or TSV: For tables and results.
XML or JSON: For structured data (programmatic use).
Graphs: Some offer visual outputs for download (e.g., KEGG maps).
You can also filter data by:
Organism (e.g., Homo sapiens, E. coli).
Data type (genomic, transcriptomic, proteomic).
Experimental method (e.g., RNA-Seq, X-ray crystallography).
Publication date or impact score.
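Exported files usually need to be parsed before analysis. As a minimal sketch, the function below reads FASTA text into (header, sequence) pairs using only the standard library. The sample records are invented; established libraries such as Biopython handle many more edge cases.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
sample_fasta = """\
>seq1 example record one
ATGGCC
TTAGGA
>seq2 example record two
ATGCCC
"""

for header, sequence in parse_fasta(sample_fasta):
    print(header.split()[0], len(sequence))
# seq1 12
# seq2 6
```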
F. Challenges in Retrieval
Challenge                | Description
Too much data            | Queries return thousands of hits, overwhelming users.
Incorrect search terms   | Using vague or wrong keywords yields irrelevant results.
Database-specific syntax | Each system has unique search rules.
Version inconsistency    | Different databases may use different genome assemblies or IDs.
Data formatting issues   | Incompatible file formats or metadata loss during downloads.
G. Best Practices for Effective Retrieval
1. Use official gene/protein symbols from sources like HGNC.
2. Utilize advanced search options (filters, date ranges, specific fields).
3. Understand the database scope before querying.
4. Record accession numbers or IDs for future reference.
5. Use controlled vocabularies (e.g., Gene Ontology terms).
H. Role of Information Retrieval in Bioinformatics Research
Comparative Genomics: Retrieve sequences for cross-species alignment.
Functional Annotation: Get data for unknown gene functions.
Mutation Analysis: Retrieve SNP or mutation data from ClinVar/dbSNP.
Expression Studies: Access datasets from GEO or ArrayExpress.
Drug Target Discovery: Use pathway and protein databases.
I. Future Directions in Retrieval Systems
AI and Natural Language Processing (NLP): Help interpret queries written in
plain language.
Voice-Activated Search: Under development in user-friendly bioinformatics
tools.
Integration with Wearables and Clinical Devices: For real-time medical
genomics.
Enhanced Visualization Tools: Dynamic graphs, networks, and 3D models.
Conclusion
Information retrieval is the gateway to modern biological research. For students,
understanding how to formulate queries, use retrieval tools like BLAST or Entrez,
interpret results, and handle database interconnections is crucial for both academic
success and professional research. With practice, bioinformatics databases become
powerful allies in exploring genes, proteins, diseases, and beyond.