Joint BecA-ILRI Hub, SLU and UNESCO Advanced
Genomics and Bioinformatics
Mark Wamalwa
7th - 17th October 2013
BecA-ILRI Hub, Nairobi, Kenya
http://hub.africabiosciences.org/
http://www.Ilri.org/
m.wamalwa@cgiar.org
Plan for the Week
Day 1: Introduction to Linux; Introduction to Perl
Day 2: Shell programming; Perl programming
Day 3: Perl programming contd; Nucleotide and protein sequence manipulation
Day 4: Regulatory sequence analysis; CLC Genomics; Cocktail
Day 5: CLC Genomics contd
What is Bioinformatics/Computational Biology?
Bioinformatics: seeks to analyze large sets of biological data in order
to solve biological questions, to formulate hypotheses and to build
models of the underlying biological processes.
Bioinformatics: collection and storage of biological information
Bulk Data analysis
Bulk Data storage
Bulk Data mining
Computational biology: development of algorithms and statistical
models to analyze biological data
Scope of bioinformatics
- Storage and retrieval of biological data
- Molecular structures: visualization and analysis, classification, prediction
- Sequence analysis: sequence alignments, database searches, motif detection
- Genomics: annotation, comparative genomics
- Phylogeny
- Functional genomics: transcriptome, proteome, interactome
- Analysis of biochemical networks: metabolic networks, regulatory networks
- Systems biology: modelling and simulation of dynamical systems
Multidisciplinarity
[Figure: bioinformatics at the intersection of molecular biology, genetics, biochemistry, biophysics, evolution, genomics, mathematics, statistics, numerical analysis, algorithmics, image analysis and data management]
Multidisciplinary
- Scientists cannot be experts in all of these domains
- Problems:
  - Biologists (generally) hate statistics and computers
  - Computer scientists (generally) ignore statistics and biology
  - Statisticians and mathematicians (generally) spend their time writing formulas everywhere
  - Complexity of the biological domain: each time you try to formulate a rule, there is a possible counter-example
- Solution: multidisciplinary teams/multi-lab projects
Applications
- Research in biology
  - Molecular organization of the cell/organism
  - Development
  - Mechanisms of evolution
- Medicine
  - Diagnosis of cancers
  - Detecting genes involved in cancer
- Pharmaceutical research
  - Mechanisms of drug action
  - Drug target identification
- Biotechnology
  - Gene therapy
  - Bioengineering
From wet science to bioinformatics
- Progress in biology stimulated the incorporation of new methods in bioinformatics
  - Structure analysis (since the 50s)
    - Structure comparison
    - Structure prediction
  - Sequencing (since the 70s)
    - Sequence alignment
    - Sequence search in databases
  - Genomes (since the 90s)
    - Genome annotation
    - Comparative genomics
    - Functional classifications (ontologies)
  - Transcriptome (since 1997)
    - Multivariate analysis
  - Proteome (~2000)
    - Graph analysis
High throughput technologies
- Genome projects stimulated drastic improvement of sequencing technology
- Post-genomic era
  - Genome sequence is not sufficient to predict gene function
  - This stimulated the development of new experimental methods
    - transcriptomics (microarrays)
    - proteomics (yeast two-hybrid, mass spectrometry, ...)
- The "omics" trend:
  - High-throughput methods started a fashion for "omics".
  - Some of the "omics" are not associated with any new/high-throughput approach; they are just a new name for a previous method or for an abstract concept.
Large-scale analyses
- The availability of massive amounts of data makes it possible to address questions that could not even be imagined a few years ago
  - genome-scale measurement of transcriptional regulation
  - comparative genomics
- Downstream analyses require a good understanding of statistics
- Warning: the global trends
  - the capability to analyze large amounts of data carries a risk of remaining at a superficial level, or of being fooled by forgetting to check the pertinence of the results (with some in-depth examples)
  - good news: this does not prevent the authors from publishing in highly cited journals
Bioinformatics is a science of inference
- The risks of inference
  - Any analysis of massive data will unavoidably generate a certain rate of errors (false positives and false negatives).
  - Good research and development will include an evaluation of the error rates.
  - Good methods will minimize the error rate.
  - Trade-off between specificity and sensitivity.
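
The trade-off between specificity and sensitivity can be illustrated with a minimal sketch. It assumes each prediction carries a score and that a validated reference set is available. The scores, labels and thresholds below are made up for illustration, and the names `ConfusionCounts` and `count_outcomes` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # true positives: real features that were predicted
    fp: int  # false positives: predictions that are not real
    tn: int  # true negatives: non-features correctly left out
    fn: int  # false negatives: real features that were missed


def count_outcomes(scores, labels, threshold):
    """Call every item with score >= threshold a positive prediction."""
    counts = ConfusionCounts(tp=0, fp=0, tn=0, fn=0)
    for score, is_real in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_real:
            counts.tp += 1
        elif predicted:
            counts.fp += 1
        elif is_real:
            counts.fn += 1
        else:
            counts.tn += 1
    return counts


def sensitivity(c):
    """Fraction of real features that were recovered: TP / (TP + FN)."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Fraction of non-features correctly rejected: TN / (TN + FP)."""
    return c.tn / (c.tn + c.fp)


# Hypothetical prediction scores and validated labels (illustration only).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.5, 0.85):
    c = count_outcomes(scores, labels, threshold)
    print(threshold, sensitivity(c), specificity(c))
```

With these made-up numbers, raising the threshold from 0.5 to 0.85 lowers sensitivity from 0.75 to 0.5 and raises specificity from 0.5 to 1.0. The sketch does not guard against empty classes (division by zero).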
Why bioinformatics then?
- In most cases, wet biology will be required afterwards to validate the predictions
- Bioinformatics can
  - reduce data to a small set of testable predictions
  - assign a degree of confidence to each prediction
- The biologist will often have to choose the appropriate degree of confidence, depending on the trade-off between
  - cost for validating predictions
  - benefit expected from the right predictions
- Bioinformatics as in silico biology
  - Allows exploring domains that cannot be addressed experimentally, e.g., the study of past evolutionary events
    - Phylogenetic inference and comparative genomics give us insights into the mechanisms of evolution and into past evolutionary events
    - The time scale of these events is however so large (billions of years) that one cannot conceive of reproducing the inferred events with experimental methods.
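
One possible way to make the choice of a degree of confidence concrete is a simple expected-value rule: validate a prediction only if its confidence times the expected benefit exceeds the validation cost. This rule is an assumption introduced for illustration, not a method stated in the slides. The function names, gene names, costs and benefits below are all hypothetical.

```python
def min_confidence(cost, benefit):
    """Break-even confidence under the assumed rule: confidence * benefit > cost."""
    if benefit <= 0:
        raise ValueError("benefit must be positive")
    return cost / benefit


def select_for_validation(predictions, cost, benefit):
    """Keep predictions whose expected benefit exceeds the validation cost.

    predictions: list of (name, confidence) pairs, confidence in [0, 1].
    """
    return [name for name, confidence in predictions
            if confidence * benefit > cost]


# Hypothetical predictions with made-up confidence values.
predictions = [("geneA", 0.9), ("geneB", 0.5), ("geneC", 0.1)]

# Assumed units: validating one prediction costs 200, a correct one is worth 1000.
print(min_confidence(200, 1000))
print(select_for_validation(predictions, cost=200, benefit=1000))
```

Expensive validation experiments push the break-even confidence up, so fewer predictions are worth testing. A high expected benefit pushes it down.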
Goals of Bioinformatics
Molecular Biology as an Information Science.
What is the Information?
Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype -> DNA
Molecules: Sequence, Structure, Function
Processes: Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
Genomic Sequence Information -> mRNA (level) -> Protein Sequence -> Protein Structure -> Protein Function -> Phenotype
Large Amounts of Information: Standardized, Statistical
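
The DNA -> RNA -> Protein flow of information can be sketched in a few lines. This is a minimal illustration that assumes the input is the coding strand. It uses only a small excerpt of the standard genetic code, so unknown codons are reported as "X". The function names and the example sequence are made up.

```python
# Excerpt of the standard genetic code (illustration only, not the full table).
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCC": "A",  # alanine
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(dna):
    """DNA coding strand -> mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """mRNA -> protein: read codons from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" = not in excerpt
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "ATGGCCTTTTAA"  # made-up example sequence
mrna = transcribe(dna)
print(mrna)             # AUGGCCUUUUAA
print(translate(mrna))  # MAF
```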
Proteins: most cellular functions are performed or facilitated by proteins.
- Primary biocatalyst
- Cofactor transport/storage
- Mechanical motion/support
- Immune protection
- Control of growth/differentiation
RNA
- Information transfer (mRNA)
- Protein synthesis (tRNA/mRNA)
- Some catalytic activity
DNA
- Genetic material
(idea from D Brutlag, Stanford, graphics from S Strobel)
Scope of Bioinformatics
- Development of computational tools
  - Writing software
  - Creating databases
- Application of these tools to generate biological knowledge
  - Molecular sequence analysis
  - Molecular structural analysis
  - Molecular functional analysis
The Bioinformatics Platform
High-performance computing server:
- 32 total processing cores
- 128 GB of memory (RAM)
- 8 TB of disk space
- 25 TB LTO4 tape backup library
Linux cluster:
- 32 CPUs (AMD 64-bit)
- 128 gigabytes RAM
- >10 terabytes disk storage
Grid computing
Parallel applications:
- Genome assembly (Newbler, MIRA, Celera, Velvet, CAP3, ...)
- Genome annotation (Glimmer, ...)
- Phylogenetic analysis (BEAST, MrBayes)
- Other sequence analysis tools (BLAST, ClustalW, HMMER, R)
BecA-ILRI Genomics Platform
Opportunities for genomics and metagenomics research
Capillary sequencing:
- ABI 3130-xl
- ABI 3730-xl
- ABI 3500-xl
Next generation sequencing: 454 GS pyrosequencer
- 1 sample = 1 library = 1 plate
- 500 Mb/run
- 1/2 cassava genome
- 1/8 human genome
Applications:
- Genomics
- Viral genomics
- Functional genomics
- Metagenomics
Bioinformatics Core Activities
Statistical support
- Experimental design
Primary data analysis
- NGS QC, spatial defect removal
- 454 GA pipeline
Secondary/downstream analysis
- Differential expression
- ChIP-seq peak calling
- Structural variation, genomic rearrangements
- SNP and CN analysis
- microRNA profiling
- GO enrichment
Training/Capacity Building
- Motif finding
- Functional/network analysis
- Microarray analysis
Data management
- NGS data storage and manipulation
- Data warehouse facilities: databases
Software development
- Bioconductor packages: NGS annotation packages
- Automated NGS analysis packages
Bioinformatics tools
- Ensembl, Galaxy, Cytoscape
From Sequence (genomics/metagenomics) to impact
[Figure: flow from (meta)genome sequencing to impact]
- Databases: compilation of complete genomes, metagenomes, annotation and curation of metadata
- Extraction of important biological information
- Analyses: phylogenetic analysis; geographical mapping; protein modeling; sequence variation analysis; primer, microarray selection; discovery of new micro-organisms and pathways
- Impact: diagnostics; global diseases surveillance; vaccine development; drug development; improved drug; environmental sustainability; improved public health intervention
Books
- Zvelebil, M. & Baum, J.O. (2007). Understanding Bioinformatics. pp. 772.
- Pevsner, J. (2003). Bioinformatics and Functional Genomics. Wiley.
  - All the slides available at: http://www.bioinfbook.org/
- Mount, D.W. (2004). Bioinformatics: Sequence and Genome Analysis. pp. 692.
  - http://www.bioinformaticsonline.org/
- Westhead, D.R., Parish, J.H. & Twyman, R.M. (2002). Bioinformatics. BIOS Scientific Publishers, Oxford.
- Branden et al. (1998). Introduction to Protein Structure. pp. 410.
The BecA Hub team
8 countries, 17 females, 19 males
Australia, Benin, Cameroon, England, Ethiopia, Italy, Kenya, USA
Thank you!!!