Topic 7
Data-mining in transcriptomics databases
• Genome-wide expression profiling
• The technology
• Organization and classification of data-sets
• Data-mining
ORGANIZATION OF BIOLOGICAL DATA
Gene → Genomics
mRNA → Transcriptomics
Protein → Proteomics; protein sequence / 3-D structural database
Function (enzyme, hormone, etc.)
The Flow of Genetic Information
DNA  5’ ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG 3’ (sequence same as RNA)
     3’ TGACGTGGTACCCCGAGTCGCTGCCCCTTACCGTGAACCAC 5’ (sequence complementary to RNA)
mRNA 5’ ACUGCACCAUGGGGCUCAGCGACGGGGAAUGGCACUUGGUG 3’
Initiation codon signal: AUG
Protein: Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val
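The flow from DNA to mRNA to protein can be checked with a short script. This is a minimal sketch, assuming the standard genetic code and translation starting at the first AUG; the function names `transcribe` and `translate` are illustrative, not from the slides. Note that the codon GAA encodes glutamate (Glu, E).

```python
# Standard genetic code, built from the conventional UCAG ordering.
BASES = "UCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding DNA strand, with U for T."""
    return coding_strand.replace("T", "U")

def translate(mrna):
    """Translate from the first AUG until a stop codon or the end (one-letter codes)."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ACTGCACCATGGGGCTCAGCGACGGGGAATGGCACTTGGTG"
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```

The one-letter output corresponds to Met-Gly-Leu-Ser-Asp-Gly-Glu-Trp-His-Leu-Val.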
DESCRIPTION OF A LIVING CELL / VIRUS
Genome / General Capability
Genomics of the Cell
Transcriptomics Readiness of the Cell
Proteomics / Physiological state
Protein Map of the cell
Network genomics
Metabolites
DNA RNA Protein
Growth rate
Expression
stem cells
cancer cells
microbes
Some useful signals on Genes
• Upstream activating sequences (UAS)
• TATA box
• mRNA expression start & end
• Ribosomal binding site
• Protein synthesis start and stop signals
(Figure: signals marked along DNA, mRNA, and protein.)
A typical gene in higher organisms
• Transcription start site
• Translation start site
• Exon (coding region)
• Intron (non-coding region)
• Donor model
• Acceptor model
• Stop codon
Alternative splicing leads to diversity
Transcription start site
Gene: E1 I1 E2 I2 E3 (E = exon, I = intron)
Alternative transcripts: E1 E2 E3; E1 I1 E2 E3
Human RNA-splice junctions sequence matrix
Genetic Regulation of Processes
(Regulation of Transcriptional Activity)
A Typical Genetic Regulatory Circuit
McAdams and Arkin, Proc. Natl. Acad. Sci., 1997, vol 94, 814-819
Newly identified members of Gal4 Regulatory Circuit
Ren et al, Science, 22 Dec 2000, vol 290, 2306-2309
8 cross-checks for regulon quantitation
• In vitro selection (Selex)
• In vivo selection (one-hybrid)
• Protein fusions (A-B)
• Phylogenetic profiles (proteins P1 to P7 across EC, SC, BS, HI)
• Microarray data: coregulated sets of genes
• Metabolic pathways (e.g. TCA cycle)
• Conserved operons (B. subtilis: purM purN purH purD; E. coli: purM purN, purH purD)
• Known regulons in other organisms
Data mining in transcriptomics
databases
47 articles on RNA array data
13 databases (3 Sybase, 2 Oracle, 8 Other)
60 articles on RNA array data mining
108 companies, 23 for software
Current Gene Expression Databases
• Axeldb: www.dkfz-heidelberg.de/abt0135/axeldb.htm
  Gene expression in Xenopus
• BodyMap: bodymap.ims.u-tokyo.ac.jp/
  Human & mouse gene expression
• FlyView: pbio07.uni-muenster.de/
  Drosophila
• Interferon Stimulated Gene Database: www.lerner.ccf.org/labs/williams/xchi-html.cgi
  Genes induced by treatment with interferon
• Stanford Microarray Database: genome-www.stanford.edu/microarray
  Raw & normalized data from various sources
RNA quantitation database integration
Microarrays [1]: experiment vs. control hybridization to ~1000 bp ORF probes
• R/G ratios
• R, G values
• quality indicators
Affymetrix [2]: 25-bp hybridization with PM (perfect match) and MM (mismatch) probes per ORF
• Averaged PM-MM
• “presence”
• feature statistics
• 25-mers
SAGE [3]: sequence counting of SAGE tags
• Counts of SAGE 14-mer sequence tags for each ORF
• concatamers
[1] DeRisi et al., Science 278:680-686 (1997)
[2] Lockhart et al., Nat Biotech 14:1675-1680 (1996)
[3] Velculescu et al., Science 270:484-487 (1995)
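The three kinds of stored values can be illustrated with a small sketch. The function names and all numeric values and tag sequences below are hypothetical, chosen only to show the arithmetic; real pipelines add background correction and normalization that are not shown here.

```python
import math
from collections import Counter

def log2_ratio(red, green):
    """Two-colour microarray: log2 of the R/G intensity ratio for one spot."""
    return math.log2(red / green)

def average_difference(pm, mm):
    """Oligonucleotide array: mean of (PM - MM) over the probe pairs of one ORF."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def count_tags(tags):
    """SAGE: number of times each sequence tag was observed."""
    return Counter(tags)

# Hypothetical example values (not from any real experiment).
print(log2_ratio(400.0, 100.0))                       # 4-fold induction
print(average_difference([120, 200, 90], [100, 150, 100]))
print(count_tags(["CATGAAACGTTTGC", "CATGGGTCAAGTCA", "CATGAAACGTTTGC"]))
```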
GeneChip expression analysis probe array
• Biotinylated RNA from experiment
• Each probe cell contains millions of copies of a specific oligonucleotide probe
• Streptavidin-phycoerythrin conjugate
• Image of hybridized probe array
Error Model for Microarray Data
Fawcett et al, Proc. Natl. Acad. Sci. USA (2000) 97, 8063-68
Representation of expression data
Normalized expression data from microarrays: genes (Gene 1 ... Gene N) by time-points (T1, T2, T3)
(Figure: genes plotted as points on time-point axes; dij is the distance between two genes.)
Cluster analysis of mRNA expression data
By gene (rat spinal cord development, yeast cell cycle):
Wen et al., 1998; Tavazoie et al., 1999; Eisen et al., 1998;
Tamayo et al., 1999
By condition or cell-type or by gene&cell-type (human
cancer):
Golub, et al. 1999; Alon, et al. 1999; Perou, et al. 1999;
Weinstein, et al. 1997
Cluster Analysis
• To divide samples into homogeneous groups based on a set
of features.
• Clustering of genes based on similarity in expression
pattern over a range of conditions.
Protein/protein complex
Genes
DNA regulatory elements
Gene Expression Data Analysis
Gene Expression Data → Pairwise Measures → Distance/Similarity Matrix
→ Clustering → Gene Clusters → Motif Searching/...
→ Regulatory Elements / Gene Functions
Clusters of Two-Dimensional Data
Key Terms in Cluster Analysis
• Distance & Similarity measures
• Hierarchical & non-hierarchical
• Single/complete/average linkage
• Dendrograms & ordering
Distance Measures: Minkowski Metric
Suppose two objects x and y both have p features:
$$ x = (x_1, x_2, \ldots, x_p), \qquad y = (y_1, y_2, \ldots, y_p) $$
The Minkowski metric is defined by
$$ d(x, y) = \sqrt[r]{\sum_{i=1}^{p} |x_i - y_i|^r} $$
Most Common Minkowski Metrics
1. r = 2 (Euclidean distance):
$$ d(x, y) = \sqrt{\sum_{i=1}^{p} |x_i - y_i|^2} $$
2. r = 1 (Manhattan distance):
$$ d(x, y) = \sum_{i=1}^{p} |x_i - y_i| $$
3. r → ∞ ("sup" distance):
$$ d(x, y) = \max_{1 \le i \le p} |x_i - y_i| $$
An Example
(Figure: points x and y separated by 4 along one feature and 3 along the other.)
1. Euclidean distance: $\sqrt{4^2 + 3^2} = 5$.
2. Manhattan distance: $4 + 3 = 7$.
3. "sup" distance: $\max\{4, 3\} = 4$.
Manhattan distance is called Hamming
distance when all features are binary.
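The three metrics can be computed with one function. This is a minimal sketch; the name `minkowski` and the choice of coordinates (0, 0) and (4, 3) for the example points are assumptions that reproduce feature differences of 4 and 3.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r; r = float('inf') gives the "sup" distance."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)

x, y = (0, 0), (4, 3)  # assumed coordinates with differences 4 and 3
print(minkowski(x, y, 2))             # Euclidean
print(minkowski(x, y, 1))             # Manhattan
print(minkowski(x, y, float("inf")))  # "sup"
```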
Gene Expression Levels Under 17 Conditions (1-High,0-Low)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
GeneA 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 0 1
GeneB 0 1 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1
Hamming distance: #(0,1) + #(1,0) = 4 + 1 = 5.
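The count can be reproduced directly from the two binary profiles in the table. A minimal sketch; the variable and function names are illustrative.

```python
gene_a = [0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1]
gene_b = [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1]

def hamming(x, y):
    """Number of positions at which two equal-length binary profiles differ."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return sum(a != b for a, b in zip(x, y))

# Split the mismatches by direction: GeneA low / GeneB high, and the reverse.
n01 = sum(a == 0 and b == 1 for a, b in zip(gene_a, gene_b))
n10 = sum(a == 1 and b == 0 for a, b in zip(gene_a, gene_b))
print(n01, n10, hamming(gene_a, gene_b))
```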
Similarity Measures: Correlation Coefficient
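$$ s(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \times \sum_{i=1}^{p} (y_i - \bar{y})^2}} $$
averages: $\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i$ and $\bar{y} = \frac{1}{p} \sum_{i=1}^{p} y_i$.
$$ |s(x, y)| \le 1 $$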
What kind of x and y give
(1) s(x,y)=1,
(2) s(x,y)=-1,
(3) s(x,y)=0 ?
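One way to explore the three cases is to evaluate the correlation coefficient on small hand-made profiles. A minimal sketch; the function name `pearson` and the example profiles are illustrative, and profiles with zero variance are not handled.

```python
import math

def pearson(x, y):
    """Correlation coefficient s(x, y) of two equal-length profiles."""
    p = len(x)
    mean_x = sum(x) / p
    mean_y = sum(y) / p
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

print(pearson([1, 2, 3, 4], [3, 5, 7, 9]))  # y rises linearly with x
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # y falls linearly with x
print(pearson([1, 2, 3], [1, 0, 1]))        # no linear relationship
```

Because the coefficient depends only on the shape of the profiles, two genes with very different absolute levels can still be perfectly correlated.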
Similarity Measures: Correlation Coefficient
(Figure: three panels of expression level vs. time comparing the profiles of Gene A and Gene B.)
Pattern recognition &
normalization
Singular Value Decomposition (SVD) =
Principal-Component Analysis (PCA)
Linear transformation of Genes by Conditions space
to “Eigen” space producing orthonormal superpositions.
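The transformation can be sketched with NumPy. This is an illustration under stated assumptions: rows are genes, columns are conditions, and each column is mean-centered before the decomposition (the step that makes the SVD equivalent to PCA). The function name `pca_via_svd` and the small matrix are hypothetical.

```python
import numpy as np

def pca_via_svd(expression):
    """PCA of a genes-by-conditions matrix through the SVD of the centered data."""
    X = np.asarray(expression, dtype=float)
    centered = X - X.mean(axis=0, keepdims=True)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = U * s                       # gene coordinates in "eigen" space
    explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per component
    return scores, Vt, explained

# Hypothetical normalized expression values: 4 genes x 3 conditions.
data = [[1.0, 0.2, -1.1],
        [0.9, 0.1, -1.0],
        [-1.2, 0.3, 0.8],
        [-0.7, -0.6, 1.3]]
scores, components, explained = pca_via_svd(data)
print(explained)
```

The rows of `components` are orthonormal, so each one is an orthonormal superposition of the original conditions.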
(Figure: normalized expression data with a dendrogram over a, b, c, d.)
Clustering methods
Hierarchical: a series of successive fusions or
splittings of the data until a final number of clusters is
obtained.
• A definite hierarchy between clusters & sub-clusters
Non-hierarchical: a number of clusters is assumed
at the start. Points are allocated among clusters so
that a criterion is minimized, e.g. the within-cluster
sum of variances.
• No hierarchy within clusters or between clusters.
• E.g. K-means, self-organizing maps, etc.
Hierarchical Clustering Techniques
At the beginning, each object (gene) is
a cluster. In each subsequent step,
the two closest clusters merge
into one cluster, until there is only one
cluster left.
The distance between two clusters is
defined as the distance between:
• their nearest neighbors (single-link method)
• their furthest neighbors (complete-link method)
• their centroids
• all cross-cluster pairs, averaged (average linkage)
Single-Link Method
Euclidean distance
Distance matrix:
        b   c   d
a       2   5   6
b           3   5
c               4
(1) Merge a and b (distance 2):
        c   d
a,b     3   5
c           4
(2) Merge {a,b} and c (distance 3):
        d
a,b,c   4
(3) Merge {a,b,c} and d (distance 4).
Complete-Link Method
Euclidean distance
Distance matrix:
        b   c   d
a       2   5   6
b           3   5
c               4
(1) Merge a and b (distance 2):
        c   d
a,b     5   6
c           4
(2) Merge c and d (distance 4):
        c,d
a,b     6
(3) Merge {a,b} and {c,d} (distance 6).
Compare Dendrograms
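Single-link vs. complete-link
(Figure: two dendrograms over a, b, c, d with a height axis from 0 to 6. Single-link merges at heights 2, 3, 4; complete-link merges at heights 2, 4, 6.)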
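Both linkage rules can be run on the four-object distance matrix (a-b 2, a-c 5, a-d 6, b-c 3, b-d 5, c-d 4) with a few lines of code. This is a minimal sketch of agglomerative clustering; the function name `agglomerate` is illustrative, and ties are broken by whichever pair is found first.

```python
from itertools import combinations

PAIR_DISTANCES = {("a", "b"): 2, ("a", "c"): 5, ("a", "d"): 6,
                  ("b", "c"): 3, ("b", "d"): 5, ("c", "d"): 4}
DIST = {frozenset(pair): d for pair, d in PAIR_DISTANCES.items()}

def agglomerate(labels, dist, linkage):
    """Merge the two closest clusters until one is left.

    linkage=min gives single-link, linkage=max gives complete-link.
    Returns a list of (cluster_1, cluster_2, merge_height).
    """
    clusters = [frozenset([label]) for label in labels]

    def cluster_distance(c1, c2):
        return linkage(dist[frozenset((p, q))] for p in c1 for q in c2)

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((sorted(c1), sorted(c2), cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges

for name, rule in (("single-link", min), ("complete-link", max)):
    print(name, agglomerate("abcd", DIST, rule))
```

The merge heights are what the vertical axis of a dendrogram shows, which is why the two trees differ even though the input distances are identical.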
Which clustering methods do you suggest
for the following two-dimensional data?
Problems of Hierarchical
Clustering
• It is more concerned with the complete tree
structure than with the optimal number of
clusters.
• There is no possibility of correcting for a
poor initial partition.
• Similarity and distance measures rarely
have strict numerical significance.
Non-hierarchical clustering
Normalized Expression Data
Interpreting Patterns of Gene Expression
with Self Organizing Maps
Tamayo et al, Proc. Natl. Acad. Sci. USA, 1999, Vol 96, 2907
SOM algorithm
• The initial mapping f_0 of the nodes is random.
• At each iteration i, a data point P is selected and the
node N_P that maps closest to P is identified.
• The mapping of the nodes is then adjusted by the
formula
$$ f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i) \, (P - f_i(N)) $$
where the learning rate is $\tau(x, i) = 0.02\,T / (T + 100\,i)$
and T is the maximum number of iterations.
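The update rule can be sketched as follows. This is an illustration, not the published implementation: the learning rate as stated above does not depend on the grid distance x, so this sketch assumes a fixed neighborhood `radius` (a hypothetical parameter) outside which nodes are not moved. The grid size, seed, and data are also illustrative.

```python
import random

def learning_rate(i, T):
    """tau = 0.02 T / (T + 100 i): decreases as the iteration number i grows."""
    return 0.02 * T / (T + 100 * i)

def train_som(data, grid_w, grid_h, T, radius=1.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Random initial mapping f_0 of every grid node into data space.
    nodes = {(gx, gy): [rng.random() for _ in range(dim)]
             for gx in range(grid_w) for gy in range(grid_h)}
    for i in range(T):
        P = rng.choice(data)
        # N_P: the node whose current position is closest to P.
        winner = min(nodes, key=lambda n: sum((a - b) ** 2
                                              for a, b in zip(nodes[n], P)))
        tau = learning_rate(i, T)
        for node, position in nodes.items():
            grid_dist = ((node[0] - winner[0]) ** 2
                         + (node[1] - winner[1]) ** 2) ** 0.5
            if grid_dist <= radius:  # assumed neighborhood cutoff
                for k in range(dim):
                    position[k] += tau * (P[k] - position[k])
    return nodes

# Hypothetical normalized profiles forming two groups.
profiles = [[0.1, 0.2, 0.9], [0.15, 0.25, 0.85], [0.9, 0.8, 0.1], [0.85, 0.75, 0.2]]
som = train_som(profiles, grid_w=2, grid_h=2, T=200)
print(som)
```

After training, each gene is assigned to the node that maps closest to it, and the genes sharing a node form one cluster.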
Clustering of genes with Self Organizing Maps
Clustering by K-means
• Given a set S of N p-dimensional vectors, with no prior
knowledge about the set, the K-means clustering algorithm
forms K disjoint nonempty subsets such that each subset
locally minimizes some measure of dissimilarity. The algorithm
converges to a local optimum of the overall dissimilarity, which
is not guaranteed to be the global optimum.
• Using the Euclidean distance between the coordinates of any two
genes in the space reflects ignorance of a more biologically
relevant measure of distance. K-means is an unsupervised,
iterative algorithm that minimizes the within-cluster sum of
squared distances from the cluster mean.
• The first cluster center is chosen as the centroid of the entire
data set, and subsequent centers are chosen by finding the
data point farthest from the centers already chosen. The
algorithm is run for 200-400 iterations.
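The procedure in the bullets above can be sketched in plain Python. Two points are assumptions of this sketch: "farthest from the centers already chosen" is read as the point whose distance to its nearest chosen center is largest, and iteration stops early once assignments no longer change. Function names and data are illustrative.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def initial_centers(data, k):
    """First center: centroid of all data; then repeatedly add the farthest point."""
    centers = [centroid(data)]
    while len(centers) < k:
        farthest = max(data, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(tuple(farthest))
    return centers

def kmeans(data, k, max_iter=400):
    centers = initial_centers(data, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                      for p in data]
        if new_labels == labels:  # converged (possibly to a local optimum)
            break
        labels = new_labels
        # Move each center to the mean of its members; keep it if empty.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = centroid(members)
    return labels, centers

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```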
Representation of expression data
(Figure: normalized expression data from microarrays, Gene 1 to Gene N by time-points T1, T2, T3; genes plotted on time-point axes with distance dij.)
Identifying prevalent expression patterns
(gene clusters)
(Figure: genes plotted on time-point axes, with three cluster profiles shown as normalized expression vs. time-point 1 to 3.)
Evaluate Cluster contents
Genes                                                            MIPS functional category
gpm1                                                             Glycolysis
HTB1                                                             Nuclear organization
RPL11A, RPL12B, RPL13A, RPL14A, RPL15A, RPL17A, RPL23A           Ribosome
TEF2                                                             Translation
YDL228c, YDR133C, YDR134C, YDR327W, YDR417C, YKL153W, YPL142C    Unknown
Representation and clustering of Gene Expression Data
Eisen et al, Proc. Natl. Acad. Sci. USA, 1998, Vol 95, 14863
Hierarchical Clustering of Genes from Expression Data
Red=up-regulated, green=down-regulated
Gene Disruption Studies in Yeast
(Figure: clustered expression data; axes: genes (horizontal), mutants (vertical).)
Hughes et al, Cell, 2000, vol 102, 109-126
Molecular Classification of Human Breast Tumors
Biclustering of Gene Expression Data
(Figure: clustered expression data; axes: breast tumor samples (horizontal), genes (vertical).)
Perou et al, Nature, 2000, vol 406, 747-752
Identification of marker genes in cancer by
expression profiling
Data-Management in Cancer Research
Weinstein et al, Science (1997) 275, 343-349
Obtaining correlation by integrating two data-sets
Database S: Molecular Structure Descriptors
460,000 compounds x 588 descriptors
Database A: Activity patterns (-log GI50)
60,000 compounds x 60 cell lines
Database T: molecular targets (abundance/expression)
100 targets x 60 cell lines
Database A.T’: Correlation between compounds & targets
A (60k compounds x 60 cell lines) . T’ (60 cell lines x 100 targets) = A.T’ (60k compounds x 100 targets)
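Integrating the two data sets amounts to correlating each compound's activity pattern with each target's expression pattern across the same cell lines. A minimal NumPy sketch, assuming Pearson correlation as the measure: standardizing every row and taking the matrix product A.T’ yields all correlations at once. The function names and the tiny matrices are hypothetical, and rows with zero variance are not handled.

```python
import numpy as np

def standardize_rows(M):
    """Center each row and scale it to unit length."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=1, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

def compound_target_correlation(A, T):
    """A: compounds x cell lines, T: targets x cell lines -> compounds x targets."""
    return standardize_rows(A) @ standardize_rows(T).T

# Hypothetical data: 2 compounds and 3 targets measured on 4 cell lines.
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [4.0, 3.0, 2.0, 1.0]])
T = np.array([[2.0, 4.0, 6.0, 8.0],
              [1.0, 0.0, 1.0, 0.0],
              [5.0, 1.0, 4.0, 2.0]])
AT = compound_target_correlation(A, T)
print(AT.shape)
print(AT)
```

Clustering the rows and columns of the resulting matrix gives the kind of clustered correlation map used to relate compounds to molecular targets.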
‘‘Clustered correlation’’ map of compounds & molecular targets
compounds
Targets
Gleaning information from the Cancer databases at NCI
• Clustering of cell lines based on A, T, & A.T’
databases
• Prediction of mechanism of action of drugs based on
A.T’ database
• Correlation of targets in terms of expression based on
T.T’ database.
• Correlation of targets in terms of activities based on
(A.T’)’.(A.T’) database.
• Correlation between structure descriptors and
molecular targets based on S’.(A.T’) database.
Target-target correlation using cancer data
• In terms of expression: T.T’
• In terms of activities: (A.T’)’.(A.T’)
(Figure: two target-by-target correlation maps, targets 1 to 113 on both axes.)
Correlation between structure descriptors and targets in the S’.(A.T’) database
Scherf et al, Nature Genetics (2000) 24, 236-44
Hierarchical clustering of human cancer cell lines
• Based on gene expression profiles
• Based on sensitivity to 1400 compounds tested
Clustered correlation for A.T’ database (Figure axes: drugs, genes)
Distinct Types of Diffuse Large B-Cell Lymphoma
Identified by Gene Expression Profiling
Alizadeh et al, Nature (2000) 403, 503-511
Gene expression signatures for cancer types
DLBCL gene expression subgroups define
prognostic categories
Class Discovery & Class Prediction in Cancer Research
by Gene Expression Monitoring
• General strategy, independent of previous
biological knowledge
• Class Discovery: New Cancer Classes
• Class Prediction: Assigning tumors to known
classes
• Based solely on gene expression monitoring
Golub et al, Science, 1999, vol 286, 531-537
Class Distinction Between
Acute Myeloid Leukemia (AML) &
Acute Lymphoblastic Leukemia (ALL)
Identify Distinguishing Features in a Dataset
Class Prediction Between AML & ALL
Assigning new tumor to known class
Class Discovery in Cancer with a 2-cluster SOM
Golub et al, Science, 1999, vol 286, 531-537
Class Discovery with a 4-cluster SOM
• Possibly, discovers a New Class of Cancer
• Can be applied to cancer data irrespective of
biological background
Exon Microarrays for Human Genome
Shoemaker, et al, Nature (2001) 409, 922-927
15,511 probes for 8,183 predicted exons
69 experiments
Using Expression Data from multiple experiments to
validate exons & define Gene boundaries.
Characterization of novel transcripts using Tiling Arrays
Verification of predicted exons using tiling microarrays.
Whole genome scan for validating predicted exons.
Determination of Regulatory Network and Motifs
from Microarray Data
Tavazoie et al, Nature Genetics (1999) 22, 281-85
Application of Microarray Technology
• Classification of cancers, identification of marker
genes
• Validation of predicted exons / genes for higher
organisms.
• Identification of genetic regulatory networks.