Prof. Naglaa Abdallah
Course contents
• 12 weeks
• Lectures, exercises, discussion.
• Materials (presentations, links, books, etc.)
Class Structure
2 hours lecture
1 hour tutorial
Grading
• Final exam 60%
• Practical exam 20%
• Quizzes (Homework assignments), midterm 10%
• Oral 10%
Course description
An introduction to the theory and practice of bioinformatics and
computational biology.
Goals for the course:
The course will familiarize the students with the tools and principles of
contemporary bioinformatics. By the end of the course, students will
have a working knowledge of a variety of publicly available data and
computational tools important in bioinformatics and a grasp of the
underlying principles that is adequate for them to evaluate and use
novel techniques as they arise in the future.
What is Bioinformatics
• Bioinformatics: Collection and storage of biological
information
• It is the field of science in which biology, computer
science, and information technology merge to form a
single discipline
• Computational Biology: Development of statistical models
to analyze biological data
• Ultimate goal: to enable the discovery of new biological
insights as well as to create a global perspective from
which unifying principles in biology can be discerned
• Bioinformatics: any use of computers to handle biological information.
• Bioinformatics (Oxford English Dictionary): The branch of science
concerned with information and information flow in biological systems,
esp. the use of computational methods in genetics and genomics.
• Molecular Bioinformatics: involves the use of computational tools to
discover new information in complex data sets (from the one-dimensional
information of DNA through the two-dimensional information of RNA
and the three-dimensional information of proteins, to the four-dimensional
information of evolving living systems).
The field of science in which biology, computer science and
information technology merge into a single discipline
Biologists
collect molecular data:
DNA & Protein sequences,
gene expression, etc.
Bioinformaticians
Study biological questions
by analyzing molecular
data
Computer scientists
(+Mathematicians, Statisticians, etc.)
Develop tools, software, algorithms
to store and analyze the data.
• Store data
• Create interfaces to the data
• Build tools to analyze data
The objective of biological experimentation is not just to generate
biological data, but also to analyze the data and extract information and
knowledge from it. The high complexity and volume of these data
require the use of computers for their storage and analysis,
and make bioinformatics an integral part of modern molecular
science.
Origin of Bioinformatics
• Bioinformatics started in the 1960s; the term ‘bioinformatics’
appears in 1989.
• Bioinformatics was created to serve molecular biology, the
study of life at the molecular level.
• It was made possible by the development of fast computers and good algorithmic techniques.
• Bioinformatics is the application of many sciences (applied
mathematics, statistics, physics, biology, genetics and
biochemistry).
History of Bioinformatics
1953: DNA structure discovered.
1956: First protein sequenced (insulin).
1960: Assembly of protein sequence databases.
1972: Protein Data Bank (PDB).
1977: Sanger sequencing technique developed.
1979: First DNA Data Bank (GenBank).
1987: Multiple sequence alignment.
1988: National Center for Biotechnology Information (NCBI) is
established.
1990: Human Genome Project started.
1990: BLAST program introduced by Altschul and colleagues.
1993: The first genome database (C. elegans).
1995: Haemophilus influenzae genome sequenced (1.8 Mb).
2000: Drosophila genome sequenced (180 Mb).
2001: The human genome (3,000 Mbp) is published.
The evolution of bioinformatics as seen in the 90’s
The requirements of bioinformatics
• Data collection techniques (DNA sequencing, protein
sequencing, microarrays)
• Theoretical concepts (concepts of DNA structure, protein
structure, evolution)
• Programs (BLAST, FASTA)
• Databases
• Institutions
• Complex genomic and high throughput data
The importance of Bioinformatics
• Application areas include
• Medicine
• Pharmaceutical drug design
• Toxicology
• Molecular evolution
• Biosensors
• Biomaterials
• Biological computing models
• DNA computing
What could Bioinformatics offer?
Analyze and interpret biological data: genomic sequences,
transcriptomic sequences, RNA structures, and protein sequences and
structures
Develop new algorithms and tools to: assess biological
information, handle large datasets, find relationships between data
sources, etc.
Basic science:
- Understand the living cell
- Find the function of a new protein
- Find the genes/proteins that are unique to human
Medical applications: identify the mutations (SNPs) that cause
genetic diseases, diagnose disease, and find and develop new and better
drugs
Agricultural applications: higher-yield crops, increased shelf life, etc.
• Structural genomics is a field of genomics that involves the
characterization of genome structures.
• This knowledge can be useful in the practice of manipulating the
genes and DNA segments of a species.
• Functional genomics is a field of molecular biology that
attempts to describe gene functions and interactions.
• Functional genomics makes use of the large volumes of data generated by
genomic and transcriptomic projects.
Bioinformatics could be used to:
• Sequence complete genome
• Identify protein coding regions
• Identify unique genes
• Gene knockout
• Functional analysis (phenotype,
detailed functional characterization..)
• Structural studies, drug development
• The most logical way to look at how bioinformatics assists
molecular biology is to follow the central dogma.
• Bioinformatics plays a role at each stage of the central dogma of
molecular biology.
First, there is DNA
• DNA is the most basic data gathered from molecular experiments,
and data types associated with DNA are genomes, genes, and gene
features.
Central Paradigm in Molecular Biology
Gene (DNA) → mRNA → Protein
21st century:
Genome → Transcriptome → Proteome
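The flow from gene to mRNA to protein can be sketched in a few lines of Python. This is only an illustration, assuming the input is the coding strand of the DNA and using the standard genetic code; the function names `transcribe` and `translate` are chosen for this example and do not come from any particular library.

```python
# Standard genetic code, built from the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # "*" marks a stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    gene = "ATGGCCAAATGA"          # hypothetical toy gene
    mrna = transcribe(gene)        # AUGGCCAAAUGA
    print(mrna, translate(mrna))   # protein as one-letter amino acid codes
```

Real transcripts also undergo processing (for example splicing in eukaryotes), which this sketch deliberately ignores.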
The second part of the dogma is mRNA.
• Data are generated in many areas of experimentation involving
mRNA, including the expression levels of the mRNA.
• Typically these would be microarray experiments.
• There are also data associated with the structure of RNA, and data associated
with other types of RNA.
• These include ribosomal RNA, transfer RNA and studies involving
RNAi.
Next are data associated with proteins.
• Especially in modern molecular biology and biochemistry,
proteomics is a growing field with many resources being allocated
to the molecular study of proteins.
• The data types most associated with proteins are sequence data,
structural data and phylogenetic data.
• Today, the field of systems biology is very high on the list of important research
topics. It studies the complex interactions of genes, proteins and
other cellular elements, and is very important for the advancement of
knowledge regarding how life functions. Metabolic pathway
determination and modeling is a very important aspect of systems biology.
• The highest level at which molecular biologists work is the
phenotypic level, where they try to explain phenotypes from the genetic and
proteomic makeup of cells.
• Human disease manifestation as a result of genetic flaws is a big topic of study,
and bioinformatics plays a major role in determining the causes of genetic
disease, as well as in the search for cures for these conditions.
• The last resource type, the genome browser, exists only in silico, inside computers. These
resources provide an integrated view of entire organisms and of how everything
known about their biology is interrelated.
Central Dogma and Genome Browsers
Integrated Resources
DNA        RNA                            Protein         Systems              Phenotype
Genomes    Expression                     Sequence        Metabolic pathways   Disease
Genes      Structure                      Structure
Features   Other RNA (tRNA, rRNA & RNAi)  Phylogenetics
DNA
Genomes
• Automated sequencing
• 200 million by 1998
• 1.5 billion by 2003 (Human Genome complete)
• Roche 454: >1Gb/day
• Cost drops too....
• ATCGATCGATCATGCTAGCTAGCTAGCTAGCTAGCG
CTATGCTAGCTCGTGCTAGCATGATCGATCATG.......
DNA
TCATCGGTCATGCATGC TCATCGGTCATGCAATCGA
TCATCGGTCA ACCTGTGTTCATCGGTCATGC
TCATCGGTCATGC TCATCGGTCA
TCATCGGTCATGCACGGTTA
TCATCGGTCATGC
• Huge number of sequences from a library (1,000,000,000 clones)
• Sequence must be determined
• Must be assembled
• Must be stored
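To make the assembly step concrete, here is a minimal sketch of greedy overlap assembly. It assumes error-free reads from a single strand and uses made-up reads; real assemblers also deal with sequencing errors, repeats and reverse complements. The function names are chosen for this example only.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    reads = ["ATCGGTCA", "GTCATGCA", "TGCAATCGA"]  # hypothetical reads
    print(greedy_assemble(reads))
```

The greedy strategy is easy to follow but is not guaranteed to find the shortest possible assembly; it is shown here only to illustrate why overlaps between fragments matter.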
Genome storage: what do we need?
Computational tools are needed to distill pathways of interest
from large molecular interaction databases
• Need modern computer systems
• Files
• Specific formats
• Databases
• Manage data
• Searchable
• Fast, reliable, available
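As a small illustration of files, formats and searchable databases working together, the sketch below parses FASTA-formatted text and loads it into an in-memory SQLite database using Python's standard `sqlite3` module. The table layout and function names are assumptions made for this example; production sequence databases are organized very differently.

```python
import sqlite3


def parse_fasta(text):
    """Yield (identifier, sequence) pairs from FASTA-formatted text."""
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            fields = line[1:].split()
            identifier, chunks = (fields[0] if fields else ""), []
        elif identifier is not None:
            chunks.append(line.upper())
    if identifier is not None:
        yield identifier, "".join(chunks)


def build_database(fasta_text):
    """Load FASTA records into an in-memory SQLite table."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE sequences (id TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    connection.executemany(
        "INSERT INTO sequences (id, residues) VALUES (?, ?)",
        parse_fasta(fasta_text),
    )
    connection.commit()
    return connection


def find_containing(connection, motif):
    """Return identifiers of stored sequences that contain `motif`."""
    rows = connection.execute(
        "SELECT id FROM sequences WHERE instr(residues, ?) > 0 ORDER BY id",
        (motif,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    example = ">seq1 demo record\nATCGAT\nCGATCA\n>seq2\nTTTATATAGG\n"
    db = build_database(example)
    print(find_containing(db, "TATA"))
```

A plain substring search like this does not scale to genome-sized data; dedicated indexes and tools such as BLAST exist for that reason.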
DNA
• Genes (Eukaryotic cell)
DNA
• Genome annotation
• Finding genes and the features associated with the genes
• Must be done on computers
TTTATATAGGAGGAGATTCTCGCGATATCGGATCATCGAGCGCGCGCTATATATAGCTAGCTAGCTAGCTACTAGATCATCGATCGATCGCGCGATTACGATGCTAGCTACTACGGAAT
TACGATCGATCGATGCATCGATGCATGCATGCATGCATGCTAGCTAGCTAGCTAGCTAGCATGCTAGCTAGCTGATCGATCGGCGATCGATGCATCGATGCATCGATCGTAGCTAGCT
AAAGAGAGAGAGATCTCTCTTATAATTATAGCGCGATATATATGCGCATATATATATGCGATCATCGACTGCGCTATATACGATCGATCTAGCATCTAGCGCTATATACGATCATCTAA
TATGCTAGCTACTACTATCATCGATGCTATCAGCTCGGCGCAATTATATATATTTTCTCTTATATAACTCGATAGCTACTACTACCTACTAGTAGCTAGCTAGCTAGCTAGCATCGTACG
Annotated gene sequence (figure labels: promoter, TF binding site, transcription start site, ribosome binding site):
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG
CGTGCAAATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA
CTTTGTTACGCGTTTTTGTCATGGCTTTGGTCCCGCTTTGTTC
AGAATGCTTTTAATAAGCGGGGTTACCGGTTTGGTTAGCGAGA
AGAGCCAGTAAAAGACGCAGTGACGGAGATGTCTGATG CAA
TAT GGA CAA TTG GTT TCT TCT CTG AAT
.................................
.............. TGAAAAACGTA
ORF = Open Reading Frame
CDS = Coding Sequence
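One of the simplest annotation tasks, locating open reading frames, can be sketched as follows. This is a deliberately naive version: it scans only the forward strand, treats every ATG as a possible start, and ignores introns, so it is closer to prokaryotic gene finding than to eukaryotic annotation. The function name and the `min_codons` parameter are assumptions for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) tuples for ORFs on the forward strand.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `start` is 0-based, `end` is exclusive and includes the stop codon.
    `min_codons` counts the codons before the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(dna):
            if dna[pos:pos + 3] == "ATG":
                for stop in range(pos + 3, len(dna) - 2, 3):
                    if dna[stop:stop + 3] in STOP_CODONS:
                        if (stop - pos) // 3 >= min_codons:
                            orfs.append((pos, stop + 3, dna[pos:stop + 3]))
                        pos = stop  # continue scanning after this ORF
                        break
            pos += 3
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGGG"))  # hypothetical toy sequence
```

Practical gene finders add statistical models of codon usage, promoters and splice sites on top of this basic idea.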
RNA
• Polynucleotide molecules
• mRNA is the best known
• Carries gene information for translation
• Expression analysis for genes
• Comparison between different environmental stimuli or different
cellular states
• Level of protein production is assumed to be roughly proportional to the RNA level
• Gives a good indication of gene expression
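Comparing expression between two cellular states is often summarized as a log2 fold change per gene. The sketch below assumes expression levels are already available as simple numbers per gene; the gene names, the values and the pseudocount of 1 are all hypothetical choices for illustration, and no normalization or statistical testing is attempted.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return {gene: log2(treated / control)} for genes measured in both samples."""
    changes = {}
    for gene in control.keys() & treated.keys():
        ratio = (treated[gene] + pseudocount) / (control[gene] + pseudocount)
        changes[gene] = math.log2(ratio)
    return changes


def differentially_expressed(changes, threshold=1.0):
    """Return genes whose absolute log2 fold change meets the threshold, sorted by name."""
    return sorted(gene for gene, change in changes.items() if abs(change) >= threshold)


if __name__ == "__main__":
    control = {"geneA": 99, "geneB": 49, "geneC": 10}   # hypothetical expression levels
    treated = {"geneA": 399, "geneB": 49, "geneC": 4}
    changes = log2_fold_changes(control, treated)
    for gene in sorted(changes):
        print(gene, round(changes[gene], 2))
    print(differentially_expressed(changes))
```

A threshold of 1.0 corresponds to a two-fold change in either direction; the pseudocount avoids division by zero for genes with no measured expression.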
RNA
Structure-function relationship is important
Levels of structure: primary, secondary, tertiary
5’AACUCGAGC
UACUAGCUAG
GCGCGUUAAU
UAUCGUACUA
UAGCUACUAC
UUCGCGUAAU
UAUUACGAUG
UUCGGCUAGA
UUAGCGAUAU
UAUUACGAUA
UAUAUGCGCA
UAUCAGAUU3’
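The step from primary to secondary structure can be illustrated with a classic dynamic-programming idea, often called the Nussinov algorithm, which maximizes the number of base pairs in a sequence. The version below is a simplified sketch: it only counts pairs (it does not return the structure), allows G-U wobble pairs, and requires at least `min_loop` unpaired bases inside a hairpin. Real structure prediction tools use energy models rather than pair counts.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=3):
    """Return the maximum number of non-crossing base pairs in `rna`."""
    rna = rna.upper()
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] holds the maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):
                if (rna[k], rna[j]) in PAIRS:  # case 2: base j pairs with base k
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_base_pairs("GGGAAAUCC"))  # hypothetical hairpin-forming sequence
```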
Protein
• Relatively easy to find DNA sequence
• More than 200 organisms sequenced
• Putative proteomes are abundant
• Bioinformatics: Computational tools to determine protein
structure and function from sequence
• Proteomics → analyze protein sequences
• Translation of RNA into proteins. The process of genome
sequencing is relatively simple, and to date several organisms have had
their entire genomes sequenced.
• Through various annotation processes using computers, it is possible
to predict the genes, and the proteins they will encode.
• This creates the need for bioinformatics tools to computationally
determine protein structure and function from sequence, with the end
goal being that the accuracy of predicting the structure and
function of proteins would be extremely high.
Protein function
• Classify proteins (Database of protein
motifs)
• Choose and express representative
proteins from all families
• Determine structure by X-ray crystallography
• Predict the rest by homology modeling
• When comparing protein sequences, it can be assumed that for
proteins with similar sequences, the proteins should have similar
functions.
• This assumption also holds for subsections of a particular sequence, and not
only for entire sequences.
• In Proteomics, many discoveries as to the specific functionality of a
particular stretch of sequence (motif) have been made, and these
discoveries were stored, and are being used today in sequence
function assignment.
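A simple way to quantify how similar two protein sequences are is percent identity over an existing alignment. The sketch below assumes the two sequences have already been aligned (equal length, with "-" marking gaps) and skips gapped columns; the sequences in the example are made up, and other conventions for the denominator exist.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned, non-gap columns in which both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Hypothetical aligned protein fragments.
    print(percent_identity("MKT-AYIAK", "MKTQAYLAK"))
```

High identity suggests, but does not prove, similar function; that inference is the assumption stated in the bullets above.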
• Protein structures have been available in the public domain for
longer than any other type of biological data, with the first public
repository being created in the early 1970s.
• More than 33,000 structures are available for download via the
internet.
• These structures can be downloaded for the purpose of homology
modeling.
• If two sequences are homologous and functionally similar,
then their three-dimensional structures should be similar as well.
This approach can be useful during drug discovery, or in studies
of the functioning of a particular protein.
Protein
• Similar sequence → Similar function
• Smaller stretches of sequence carry similar function
• Motifs or signature sequences
• DNA binding motifs
Protein
• Motifs and signatures → Identify unknown proteins
• Search protein database for proteins with probable functionality
• Databases of protein signatures and motifs
• Pfam, Prosite, Prints, BLOCKS
• Various methods of representation
• HMMs – Pfam
• Regular expressions - Prosite
• PSSMs - BLOCKS
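To show what a regular-expression representation of a motif looks like in practice, the sketch below converts a PROSITE-style pattern into a Python regular expression and searches a protein sequence with it. It handles only a subset of the pattern syntax (single residues, `[..]` sets, `{..}` exclusions, `x`, and repeat counts) and assumes a well-formed pattern; the converter is written for this example and is not PROSITE's own software. The example pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern; the protein sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern such as N-{P}-[ST]-{P} to a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as x(2) or x(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            token = core                       # a single residue or a [..] set
        parts.append(token + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_motif(pattern, protein):
    """Return 0-based start positions of non-overlapping motif matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(protein.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}"))
    print(find_motif("N-{P}-[ST]-{P}", "MANVSAQNPTL"))  # hypothetical sequence
```

HMMs (Pfam) and PSSMs (BLOCKS) score matches probabilistically instead of giving a yes/no answer, which makes them more tolerant of variation than a regular expression.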
Systems
• Knowing the function of one protein is not enough. If we are to
understand the way life functions, we must know and understand the way
in which all proteins in the cell function together.
• We must understand the complex relationships between the
molecules in the cell that make it function.
• The first step is to understand the metabolic
pathways.
• These are networks into which functional groups of proteins are divided.
• We must then combine all the different pathways and their
effects.
• The field of study concerned with doing this is systems biology, in
which the biological system is studied as a whole, with the end goal
of understanding the combined effects of all the pathways.
• Bioinformatics is an essential component of systems biology.
• Many software packages exist for the study of metabolic pathways
and the greater system as a whole.
• One such package is CellDesigner (http://www.celldesigner.org/).
• It is a structured diagram editor for drawing gene-regulatory and
biochemical networks.
• It is a package used to create in silico metabolic systems.
• These models can then be assigned metabolic characteristics, for example
the effect of the upregulation of one component on another. This can
eventually lead to the modeling of entire cascading effects in the system.
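The effect of upregulating one component on another can be illustrated with a very small in silico model written in plain Python (this is not CellDesigner code, and all rate constants are invented for the example). A substrate S is converted into a product P by an enzyme, and P is slowly degraded; raising the enzyme level should raise the peak amount of P.

```python
def simulate_pathway(enzyme_level, steps=1000, dt=0.01,
                     k_convert=1.0, k_degrade=0.5, substrate=10.0):
    """Euler-integrate S -> P (catalysed by an enzyme), with P degraded.

    dS/dt = -k_convert * enzyme_level * S
    dP/dt =  k_convert * enzyme_level * S - k_degrade * P
    Returns a list of (time, S, P) tuples.
    """
    s, p = substrate, 0.0
    trajectory = [(0.0, s, p)]
    for step in range(1, steps + 1):
        conversion = k_convert * enzyme_level * s
        ds = -conversion
        dp = conversion - k_degrade * p
        s += ds * dt
        p += dp * dt
        trajectory.append((step * dt, s, p))
    return trajectory


def peak_product(trajectory):
    """Return the highest product level reached during the simulation."""
    return max(p for _, _, p in trajectory)


if __name__ == "__main__":
    baseline = simulate_pathway(enzyme_level=1.0)
    upregulated = simulate_pathway(enzyme_level=2.0)
    print("peak product, baseline:   ", round(peak_product(baseline), 2))
    print("peak product, upregulated:", round(peak_product(upregulated), 2))
```

Simple Euler integration is used here for readability; dedicated pathway simulators use more robust numerical solvers and standard model formats.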
CellDesigner (http://www.celldesigner.org/)
Phenotype
• All individual systems together → phenotype
• Phenotype studies → big money required
• Human Health → Phenotype studies
• Agriculture → Phenotype studies