KEMBAR78
SVMotif A Machine Learning Motif Algorithm | PDF | Support Vector Machine | Promoter (Genetics)
0% found this document useful (0 votes)
32 views8 pages

SVMotif A Machine Learning Motif Algorithm

This document describes a machine learning algorithm called SVMotif that is used to identify cellular DNA transcription factor motifs. SVMotif uses a support vector machine approach to learn motifs from known interactions between transcription factors and genes. It can utilize both positive examples of known targets as well as negative examples of likely non-targets. The authors apply SVMotif to yeast transcription factor data and compare its performance to existing Gibbs sampling methods like BioProspector. They find that SVMotif is able to imply statistical correlations between transcription factor binding and secondary motifs.

Uploaded by

Ardhendu Sarkar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
32 views8 pages

SVMotif A Machine Learning Motif Algorithm

This document describes a machine learning algorithm called SVMotif that is used to identify cellular DNA transcription factor motifs. SVMotif uses a support vector machine approach to learn motifs from known interactions between transcription factors and genes. It can utilize both positive examples of known targets as well as negative examples of likely non-targets. The authors apply SVMotif to yeast transcription factor data and compare its performance to existing Gibbs sampling methods like BioProspector. They find that SVMotif is able to imply statistical correlations between transcription factor binding and secondary motifs.

Uploaded by

Ardhendu Sarkar
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 8

Sixth International Conference on Machine Learning and Applications

SVMotif: A Machine Learning Motif Algorithm

Dustin Holloway Charles DeLisi


Mark Koni , Yue Fan
Molecular Biology, Cell Bioinformatics and Systems
Mathematics and Statistics
Biology, and Biochemistry Biology
Boston University
Boston University Boston University
Boston, MA 02215
Boston, MA 02215 Boston, MA 02215
mkon@bu.edu
dth128@bu.edu delisi@bu.edu
yue@bu.edu

There are many DNA components near a given gene in


Abstract a eukaryotic cell which influence its transcription and
resulting activation for protein production (expression).
We describe SVMotif, a support vector machine-based These DNA regions can bind activators, co-activators, and
learning algorithm for identification of cellular DNA inhibitors. Often these act to reinforce or inhibit each
transcription factor (TF) motifs extrapolated from known other. In general they can act combinatorially, determining
TF-gene interactions. An important aspect of this the gene's expression depending on circumstances.
procedure is its ability to utilize negative target Primary among such binding proteins are TF's, the main
information (examples of likely non-targets) as well as activators of gene transcription.
positive information. Applications involve situations The DNA binding components are largely determined
where clusters of genes are distinguished in experiments by binding site motifs, characteristic signatures of length 5-
with known transcription factors without known binding 15 DNA base pairs (bp) which bind to transcription factors
locations. We apply this to yeast TF data with target (TF's). The number of copies of a given TF's binding
identifications from ChIP-chip and other sources, and motif in a gene's promoter region (a region known to
compare performance with Gibbs sampling methods such contain most TF binding sites) is a strong predictor of the
as BioProspector. We verify that in yeast this method gene's response to the TF, and thus an indicator for the
implies well-defined and cross-validated statistical gene's regulation and function. For this reason the
correlations between TF binding and secondary motifs characterization of these DNA motifs has become an
whose binding properties (either with the primary TF or important biological problem.
other possible promoters) are not certain, and discuss The determination of a given TF's binding motifs often
some implications of this. SVMotif can be a useful follows from identification of common DNA patterns in
standalone method or a complement to existing promoter regions (regions known to contain most TF
techniques, and it will be made publicly available. binding sites) adjacent to genes which are known
experimentally to bind the TF, or shown to express their
RNA or protein in the presence of the TF in experiments.
1. Introduction Methodologies for obtaining this information include
microarray experiment correlations ([18], [25], [3]),
We will describe a machine-based method which can phylogeny [2], and gene ontology [6].
augment and improve on existing ones for discovering A common method has been based on Gibbs sampling
motifs from experimental information on transcription to optimize alignment among such gene clusters, in order
factor (TF)-gene interactions. This is a high dimensional to determine the most common motif pattern there ([30],
problem which has been successfully approached using [21], [27]). In BioProspector, for example, an optimal
Gibbs sampling methods (e.g., [30], [21]) among others, alignment is achieved by initiating an arbitrary alignment
and is now also being studied using machine learning among positive promoter regions, and adjusting this
methods ([36], [17]). We will illustrate our machine alignment one promoter at a time until an optimal one is
learning approach and compare it to Gibbs sampling for an found. The Gibbs sampling is done with a probabilistic
important data set - TF binding information for the simulated annealing algorithm in order to avoid local
budding yeast S. cerevisiae (baker's yeast). minima.

0-7695-3069-9/07 $25.00 © 2007 IEEE 573


DOI 10.1109/ICMLA.2007.105
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
These methods have more recently been augmented by produce an error margin that must be statistically
machine learning approaches. In [18], [19] for example, acceptable. Negatives can also be found from members of
boosting methods are used to accumulate weak rules the gene population with highest  values for binding the
related to presence of given -mers (consecutive strings of promoter of a given gene , as in the case of chromatin
length  appearing in the gene's promoter region) and immunoprecipitation chip (ChIP-chip) data [8], [29]) - we
given regulators (TF's) as predictors of gene expression. use such choices of negatives here.
This is done in experiments involving large numbers of In cases where only examples of (positive) targets are
regulators and various gene expression outcomes. From available (though likely non-targets could be generated
these statistical relationships between -mer presence, from the background gene distribution as above), negatives
regulator presence, and gene expression, there is produced can be artificially generated from permutation of positive
an 'important set' of -mers statistically associated with examples, with conservation of either zeroth or higher
expression via a regulator. These are then agglomerated order Markov properties. This has been done by us for the
into position weight matrices (PWM's) representing data described below with 0th order Markov preservation
probabilities of DNA bases in given locations of likely (i.e., only of overall nucleotide probabilities), and has been
binding site motifs. In [36] it is shown how incorporating shown to decrease prediction accuracy only by about 10%
hierarchical kernel information into analysis of promoter to 15%.
regions can result in accurate predictions of binding The table below shows differences in sensitivity for a
between TF's and gene promoters. few typical TF's, here chosen from the subset of
We present here a machine learning method, SVMotif, YeastGenome ([14], http://www.yeastgenome.org)
which finds motifs by statistical association of -mers with transcription factors which also have motifs listed in
known interactions of TF's with genes, learning them from Transfac [24], a curated database.
these examples. In its attempt to optimize based on Name #pos YeastGenome (Transfac) BioProspector SVMotif
similarities between binding promoters, this method acts CGGGTRR CGGGTAA AAGAAGARG
YBR049C 186 TTACCCG TACCCGG CYTCTTCTT
similarly to BioProspector and other methods; however, it GGCGGGTAAC
also has the advantage of using negative (i.e., genes whose GGCGGGTAA CCGGTGGCRG
YDL020C 134 GGTGGCAAA GTTTCCCCG TSGCCACCSG
promoters are expected not to bind) as well as positive GTTTCCCCG AAGAAGAGG
TACATA ARACGCGTYT
examples in learning a TF binding rule (see below for YDL056W 207 ACGCGT GCGACT GYYTTCTTS
some other approaches which also use negative GGTTGG SAAGAARRC
ATCCTCGAGTT CSCCACGTGGG
information). We expect that it will also be possible to YDL106C 69 SGTGCGSYGYG GACTCACAATC CCGCTGCAGCG
find binding motifs using other machine learning GCACTTACAAC CCCGGG

techniques (e.g., random forests, least angle regression [7])


Table 1: A sample of yeast transcription factors
which determine important variables out of a large class;
which have been analyzed in this paper . # pos
such additional approaches will be examined in later work.
represents the number of positive examples for
This method uses -mer importance determination
TF. The standard motifs (middle column) appear
from the largest components of a modification of the SVM
in Transfac [20] but are taken from YeastGenome
gradient vector w to statistically determine those -mers
[14] (which takes biological input from Transfac
which are most associated with a TF. SVM is a kernel-
[24]). Motifs to the right are in order of priority.
based method [34], [35], [4], [31] which incorporates prior
information via mappings ) of genomic feature vectors x Other related work: This approach identifies statistical
which effectively change the geometry of the feature space. properties of -mers occurring more frequently in one
SVM have been used in genomics in a number of different group of genes than another. As mentioned above,
ways [26], [20], [32]. boosting [18], [19] and support vector machine (SVM)
SVMotif extracts motifs from data sets involving approaches [36] have been used to demonstrate machine
promoters of genes known to bind a given TF (with a learning as a viable methodology in the area of
margin of error which can be allowed in input data; see determining binding genes and binding sites of TF's. In
below); it will be made publicly available. As mentioned, particular kernel methods have been used successfully in
a general advantage of learning methods is use of negative the work of Noble and Vert [36] to determine genes-TF
and positive examplesÁ which is not available in Gibbs associations.
sampling. Negative examples can be chosen in a few The central aspect here is learning-based ranking of -
ways, including random selection of promoter regions from mers from correlation with promoters known to bind a
the general population of genes (since being negative is a given TF !. Among other places, this is also done in [36],
statistical likelihood for a random promoter), which can which uses kernels on feature vectors of 5-mer counts. Let

574

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
x²³ be the sequence of bases in the promoter x of . The ChIP-chip experiments [9], [33]), as long as statistical
feature vector )²x³ – )spect ²x³ of the promoter region is a patterns are not masked by very small .
vector of length  , with each position ) ²x³ representing For  a gene in S. cerevisiae, x²³ is the FASTA
the count of the ! 5-mer in the indexing sequence. An sequence for its upstream region, up to 800 bp long. For
interesting addition in [36] is use of feature vectors taking fixed ! an SVM with a linear string kernel 2²xÁ y³ defined
phylogenetic conservation into account. They consider a in (1) takes data : and forms a discriminator
given upstream region x  7 of S. cerevisiae, and an  ²x³ ~ w h x b  which separates positive examples

alignment c  7 ! , consisting of an aligned array of five (²x Á & ³ for which & ~ ) from negative ones. Using the
(matching) upstream regions of length  from 5 related R-SVM feature reduction procedure [38], an importance
yeast species, to determine functional regions. Above 7 is vector w × ~ w d (b c c ³ is formed. Here, b
the alphabet ¸A, C, G, T¹À represents the center (vector mean) of the positive
There have been some other methods using negative examples in the feature space - , while c represents the
examples (likely nonbinding genes for a TF) for center of the negative examples. The vector a d b
determining binding motifs. An example is the Ann-Spec represents a componentwise multiplication of a and b.
algorithm [37]. Some boosting motif finding methods use From this a permutation  of indices of the vector w × is
negative information [13] as well. formed so the permuted vector w × k –w ×  is arranged
from largest to smallest component. The R-SVM program
2. SVM feature space approach takes the components of w ×  and reduces them to 150. This
feature selection is repeated 20 times with different choices
We assume a fixed TF !, and a dataset from experiments of presumed negatives (in this case genes with high 
which identify promoter regions of genes activated by ! (in values in ChIP-chip experiments) out of about 600
the present case these are ChIP-chip experiments [28], negative examples available. In each SVM run the
[8]). Our methods can be adapted to other data sets, for numbers of positives and negatives are chosen to be equal.
example those in [18] which map activation of a gene  as From the top 600 -mers obtained in these 20 runs, the
a function of TF presence together with presence of given best 50 are chosen using a single iteration of R-SVM. For
motifs in the promoter region of . typical transcription factors here, there are approximately
We have for the promoter sequence x²³ of  a 50-300 positive examples (i.e., genes binding !), and a
classifier & ~ f  indicating (with allowed error) whether larger number (e.g. 600) of negative examples, from which
or not ! binds the promoter of . different samples are taken in multiple runs to match
The SVM gives a function on promoters, defined by numbers of positives.
Though division into training and test sets is not
 ²x³ ~  2²x Á x³ b  – w h x b Á necessary in the motif finding procedure above, it is still
 valuable to know how good the SVM algorithm is in
with 2²xÁ y³ a string kernel ([17]) given by predicting whether a gene will bind the given TF, i.e., how
good w is at segregating the positive from the negative
2²xÁ y³ ~ )²x³ h )²y³, (1) examples. Typical test rate accuracy on the above data is
around 80%; see [11] for more details on the predictive
with ) a feature map taking x into the -mer feature vector accuracy of this algorithm for TF target prediction.
above. Software implementing this algorithm includes:
Positive Examples: Known positive examples of genes 
• SVMLight: http://svmlight.joachims.org with promoters binding ! include binding targets taken
•SVMTorch: http://www.idiap.ch/learning/SVMTorch.html from ChIP-chip experiments [29], [8], Transfac 6.0 Public
• LIBSVM: http://wws.csie.ntu.edu.tw/~cjlin/libsvm [24], and a list curated by Young et al., from which we
have excluded indirect evidence such as sequence analysis
A Matlab package which implements most SVM and expression correlation [29]. Of the ChIP-chip
algorithms with a C-based back end is SPIDER: interactions, only those with p-values 10c (i.e., high
http://www.kyb.mpg.de/bs/people/spider/whatisit.html confidence level) are considered positives. Selected
The latter implementation is used here. negatives are a randomly chosen subset of those genes
We illustrate the algorithm (again for ! fixed), given found not to be bound by a TF in ChIP-chip experiments
an example data set : ~ ¸²x Á & ³¹~ of promoter regions (typically these are the genes with highest -values and
x and outcomes & (binding/no binding). : is allowed to thus least significant binding).
have large errors (as are sometimes assumed to exist e.g. in

575

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
300. Negatives can be chosen randomly (since there are a
relatively small percentage of positives in the data), or
from genes in ChIP-chip experiments which have large -
values. They can also be formulated as randomized
versions of true positives in cases where known non-
positives are not available.

R-SVM: For fixed !, once the optimized SVM classifier


! ²x³ –  ²x³ is determined, the weighted direction vector
× (see above) contains the information about important -
w
mers. We choose the top 150 components as the first 150
of the vector w×  , output from w
× using R-SVM (see above).
The corresponding positions ¸c ²³¹1~ represent -mers
whose entries are the largest in w × , and are the features
which most differentiate positives and negatives. The final
choice of 50 -mers is obtained after 20 iterations and a
second R-SVM selection.

Agglomeration: Typically there are many similarities in


the -mers among the final 50, as would be expected. An
Figure 1: Workflow of the algorithm for a fixed agglomeration scheme is then employed to cluster those -
TF !: (a) Feature vectors of genes consist of mers which are similar or have significant overlap. Then a
counts of 4-, 5-, and 6-mers, with certain very PWM (position weight matrix) is formed which reflects the
common  -mers excluded. Negative examples relative frequencies of the bases in 7 in each position.
are chosen from ChIP-chip experiments [8] with This is similar to one used in [18], where in addition higher
high  -values. (b) Approximately 600 negative weights are given to -mers which correspond to earlier
examples are undersampled to match the positions in w×  . Starting with the first -mer in position
number of positive examples (typically 100 to c ²³ (which can be of length 4, 5, or 6) the second -
200 per promoter region). This provides mer at c ²³ is matched with it, and all overlaps are
balanced sets of positive and negative feature tested. If the match meets a certain threshold, then the
vectors with which to perform SVM overlap is kept, the -mers are placed in the same cluster,
discrimination. (c, d) The SVM classifier forming the first PWM. Every time a new -mer is added
provides a weighted direction vector w × , whose in this way, it either adds to an existing cluster or (if there
largest components are iteratively recalculated are no matching clusters) forms a new one.
20 times, from which 50 largest components
(with corresponding  -mer positions in the Comment on choices of  -mers: The performance of any
current w× vector) are determined at each run, clustering method depends on the quality of top ranked -
yielding up to 600 candidate  -mers. These are mers reported by SVM feature selection. We want to
reduced using R-SVM [38] to 50. (e) These  - eliminate noisy -mers before we run -mer count.
mers are extracted for clustering in the a. Typically the reported top ranked k-mers contain
formation of motif PWM's. (f) The resulting a large number of irrelevant noisy ones - typical of these
candidate PWM's are evaluated by (i) re- are, e.g., 'AAAAAN', 'ANAAAA', 'TATATA', 'ATATA',
scanning the positive promoters and 'ACACAC'…
determining individual matrices' scores (ii) b. Further, if negatives must be fabricated due to
examining entropy (purity of columns) and lack of information about true negatives, they can be
weights (numbers of  -mers clustered to form formed as permutations of positives (preserving their
the PWM). (g) This information is combined statistical properties). In this case SVM will often pick out
into a score, from which the top PWM's are regular sequences such as the above (which are not typical
chosen. in the permuted sequences) and their ranking will be
artificially higher.
For each gene  in S. cerevisiae, the total number of For these reasons we have made the following
positive examples ²³ typically is in the range of 50 to changes:

576

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
1. Delete 'AAAAA+', 'ACAC+', 'TATA+'… from Each column's importance is stored as its column sum
the original sequence. divided by the total weight (.23 b À ³ ¢
2. Set counting resistance to 2. Thus if a sequence is
" À      À #. (2)
…ACACAC…, the occurrence count for 'ACAC' is not 2
but 1. Similarly for …AAAAAAAA…, the occurrence The final PWM is normalized. Assuming 4 were the
count for 'AAAAA' is 1 rather than 4. final adjustment to this cluster, we normalize each column:
These two operations are assumed not to affect the
x 0 0 0 0 0 .58 1 {
motif components, but they do reduce such noisy -mers. z 1 0 0 0 0 .42 0 }
In the case of artificial negative generation, another 4 ~z }
0 1 1 1 0 0 0
method for elimination of noise would involve fitting a y 0 0 0 0 1 0 0 |
higher order Markov background model to the whole
genome and randomly generating pseudo-negative Next log odds scores with respect to the background
examples based on this model. model (in which each entry is .25) are taken with
pseudocounts of .1 added to both numerator and
More about Clustering: A greedy method with 2 denominator, yielding
thresholds is used. Once the score of a -mer matched to a x cÀ  cÀ  cÀ  cÀ  cÀ  À À {
cluster is above the in-threshold, the -mer is added into z À
z
cÀ  cÀ  cÀ  cÀ  À cÀ  }
}
this cluster. cÀ  À À À cÀ  cÀ  cÀ 
y cÀ  cÀ  cÀ  cÀ  À cÀ  cÀ  |
The algorithm starts the PWM with a weighted count.
For example, if the current -mer is GGGTAA with weight with each entry a log ratio (base 2). Here a 0 represents no
(in the final w vector) of 0.230, the weighted PWM is additional information (random entry), while a positive
entry represents positive information. Now multiplying by
x 0 0 0 0 .23 .23 {
z 0 0 0 0 0 0 } the weight vector (2) yields the final PWM,
4 ~ z }
.23 .23 .23 0 0 0
y 0 0 |
0 0 .23 0 x cÀ cÀ  cÀ  cÀ  cÀ  À À {
z À  cÀ  cÀ  cÀ  cÀ  À cÀ }
> ~z }
with the first row representing A, and the remaining rows cÀ À À À cÀ  cÀ  cÀ ,
representing C, G, and T. y cÀ cÀ  cÀ  cÀ  À cÀ  cÀ |
After comparing all possible alignments, an incoming
-mer with the next highest weight in w × (i.e., the second used to compute hitting scores by scanning a promoter
entry in the ordered vector w ×  ) will be added into the best region with > and counting the number of scores above a
fitting existing cluster with an offset determined by its best given threshold ; .
fit, and the cluster's PWM is updated to incorporate the Intuitively, a match in an 'important' position of >
new -merÁ by addition of the weighted PWM should gain more of a score and a mismatch in an
corresponding to it important position should lose more. Note that a match or
Thus if an incoming -mer is now CGGGTC with mismatch at the 'ends' will count less here, since
weight 0.17, it is clear that its position must be adjusted 1 importance multiples each column.
position to the left for the best match, so the PWM is
updated with alignment Scoring: How do we pick out the true cluster among all
C G G G T C clusters (each now a PWM) for !? The first part of
G G G T A A selecting the best PWM involves the relative entropy score
Thus we add 3 

, ~ Total_Weight h   log Á
x       { ~ ~

z À     À }
4 ~ z }
 À À À  
y     À  | to pick out overrepresented clusters. Here  is the
probability (frequency) of letter  in the ! PWM position,
transposed to the left, yielding the jogged sum  is the probability of  in the background model (here
0 0 0 0 0 .23 .23 .25);  is the importance score (see (2)) of the ! column.
x {
z À 0 0 0 0 .17 0 } Based on their entropy scores, we choose the top 5 clusters
: ~ z }
0 . . .4 0 0 0 (and their PWM's) as candidates and use them to scan the
y 0 0 0 0 .4 0 0 |
promoter sequences which are positive for !.

577

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
The hitting ratio score for a PWM is defined as The 85 TF's were tested using AlignAce [30],
BioProspector [21], and SVMotif. The following table
# hits on positive genes
HR ~ compares the performances these three methods on the full
# hits on negative genes MacIsaac dataset.
where negative genes are randomly undersampled to match AlignAce BioProspector SVMotif
the current number of positives. This is used to assess the Top 1 to 3 Top 1 to 3 Top 1 to 3
quality of these 5 candidate clusters. Appearing Ungapped (74) 19 34 27 33 29 43
significantly more often in positives than in negatives is Gapped (11) - - 3 3 6 9
assumed to be a property of a true motif. If we use
fabricated negatives instead of true ones (see above) then, Table 2: Motif finding performance on
as suggested earlier, regular and generally overrepresented MacIsaac [23] TF's with known binding
-mers like 'AAAAGA' and 'CTTCTTCTT', can get high specificities among AlignAce, BioProspector
hitting scores in promoter regions. For this reason we have and SVMotif. The 'top' PWM is the one with
chosen to use a product of entropy and hitting ratio scores the highest score for each TF; the number in
as the final criterion; the cluster with highest product is that column indicates how many TF's have
output as the best prediction. top PWM's which coincide with the reported
For computation of the hitting ratio we must set a motif in supplemental file 1 of MacIsaac [23].
threshold ; on a local PWM score to count a hit. 1 to 3 above represents the top 3 PWM for a
However, it is difficult to produce common thresholds for single TF. Each score represents the number
all TFs. We have fixed the value ; ~ 6.4 for all TFs, of PWM hits of the 'top' motif as reported by
based on the heuristic fact that for a normalized PWM, a MacIsaac among the union of the top 3
single strong match can add 1.65 PWM's for all TF's in the indicated row. Each
( ~ log ´  b À!°²À b À³µÁ the maximal log odds ratio³ PWM is scored either 0 or 1 total depending
to the PWM score, while a moderate match gains 1 and a on whether or not it coincides with reported
weak match gains 0.5. Mismatches in corresponding motifs. The above indicates there were a total
different levels will give losses of 1.8, 1 and 0.5. A hit of 74 TF's without gapped motifs, and 11 with
should have at least 4 strong matches and some matches or gapped motifs. For gapped motifs each
mismatches which on the average do not affect the hitting program was given a range of 3 gap sizes to
score. try.

Gapped motifs: The procedure for gapped motifs simply Below are the results for the subgroup of the above
tries a number of allowed gap sizes, and considers the TF's which also appear in Transfac [24].
regions adjacent to the gap to be contiguous for the
purpose of the SVM algorithm, similarly to what is done in AlignAce BioProspector SVMotif
Top 1 to 3 Top 1 to 3 Top 1 to 3
BioProspector.
Ungapped (25) 9 15 14 15 12 17
Gapped (3) - - 1 1 2 3
3. Experimental results
Table 3: Motif finding performance on
We chose 85 TF's from supplemental file 1 of binding data MacIsaac TF's (Transfac subgroup).
from MacIsaac [23], consisting of TF's with binding
specificities known from various sources. These 85 are Finally, we have taken the 102 S. cerevisiae TF's from the
chosen from a total of 88 used as a benchmark in [23]. UCSC Genome Browser [16]. In this database there are
Three TF's (MATA1, CRZ1 and ECM22) are omitted published PWM's for 102 TF's, complied largely from [8].
because positive examples for them did not exist in our These have been converted by us to consensus sequences.
original database. Of these, 74 are ungapped and 11 are Again after elimination of two TF's because of a lack of
gapped. Of the full group of 85 TF's, 28 of these (25 positive/negative examples, the corresponding results are:
ungapped and 3 gapped) also appear in the Transfac [24]
database, and so are analyzed separately given their AlignAce BioProspector SVMotif
presumably more accurate resulting known binding Top 1 to 3 Top 1 to 3 Top 1 to 3
specificities. Ungapped (90) 27 37 33 39 34 49
Gapped (10) - - 3 3 6 8

578

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
4. Signals indicating multiple switches A remark on methodology: We remark on the effective
simplicity of this process. The two basic components are
We have verified, as have others [17], [8] that there are identification of a statistical correlation between -mers
functional and statistical reasons for the multiple hits and gene activation (in sufficiently controlled
which occur in all methods that correlate motifs with TF experiments), and the inference of motif PWM's from
binding. For a given TF there are cofactors which can agglomeration of -mers which are significantly correlated
modulate the TF itself or reinforce the activity of the TF, with TF activation. The algorithms for both parts
leading to a number of the 'additional' motifs discovered by contribute to usefulness of the result.
programs such as BioProspector and SVMotif. This is This method could be further improved using ideas
added to in the multiple TF's in overlapping genomic introduced in [32], in which -mer selection is improved
transcription modules. Thus there are consistent using phylogenetic information from orthologous species.
appearances of 'additional' PWM's statistically associated As is shown there, this can be accomplished by
with TF's but apparently not acting as their binding sites. improvement of the kernel. As suggested above, such
We have tested some of these apparently false functional incorporation of prior knowledge can also be
relationships (in that the TF of interest does not bind the accomplished, in use of random forests through proper
additional sites), and verified them statistically. We have weighting of -mers in feature vectors using phylogenetic
divided several TF data sets in half, and showed that considerations. This will be considered in later work.
'additional' PWM's were consistently duplicated on cross- Other promising feature vector-based methods include
validation. This cross-functionality of motifs and the random projection methods [5], which are capable of
regulators which may bind them is not completely handling large numbers of variables and determining
understood, though it has been examined in specific important ones. Indeed, random projection and random
situations. variable selection methods (such as RF) can be combined
We note that prediction success rate for TF motifs with any number of other machine learning algorithms to
increases significantly if we consider the two or three handle feature spaces which would otherwise be
highest scoring PWM candidates for !, as opposed to just prohibitively large.
the highest scoring one. This may imply that there are Another important variation in the approach involves
typically three or less likely motifs which have a statistical the choice of positives and negatives in the machine
(and possibly functional) significance correlated with gene learning method. In the case of SVM we can undersample,
expression. With this hypothesis, scoring methods by the choosing only more reliably classified upstream regions x.
number of hits in the top 3 is a reasonable measure of A balance may need to be drawn between the choice of
accuracy of our methods in finding validated motifs from threshold and the lower number of examples which result.
Transfac or YeastGenome. It is shown in [8] and Finally, we note that the greedy agglomeration
elsewhere that multiple binding sites in a promoter are algorithm described here can be replaced by other existing
typical in yeast, and they often can act in concert; [18] agglomeration procedures, though their effectiveness has
offers more verification in this direction. not been studied in this application. Other procedures
include that used by [25]. The step of statistical
5. Discussion identification of -mers associated with TF activation is
what here requires machine learning methodologies.
Kernel methods such this one as are based on assigning
feature vectors to genes. Such feature space-based 6. References
methods can be emulated by other feature vector
[1] T. Bailey and C. Elkan (1994), "Unsupervised learning of
classification methods such as random forests or LARS
multiple motifs in biopolymers using EM," Machine
(least angle regression) [7] in ways which incorporate prior Learning 21, pp. 51-80.
knowledge into feature spaces in an identical way. The [2] P. Cliften, M. Johnston, et al., "Surveying Saccharomyces
choice of kernel in SVM is equivalent to choosing a genomes to identify functional elements by comparative
feature map ) into a space with Euclidean geometry. Thus DNA sequence analysis," Genome Research 11, pp. 1175-
any prior knowledge (e.g., phylogenetic information [36]) 1186, 2001.
which can be incorporated into a kernel can also be [3] E. Conlon, X. Liu, J. Lieb, and J. Liu, "Integrating regulatory
incorporated into alternative methods such as those above, motif discovery and genome-wide expression analysis,"
or any number of neural network-based techniques such as PNAS 100, pp. 3339-3344, 2003.
ART [10], by a proper adaptation of the feature map.

579

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
[4] N. Cristianini, and J. Shawe-Taylor, An Introduction to co-expressed genes," Pac. Symp. Biocomput. 6, pp 127-
Support Vector Machines, Cambridge University Press, 138, 2001.
Cambridge, 2000. [22] X. Liu, D. Brutlag, and J. Liu, "An algorithm for finding
[5] Q. Ding, Dimensionality reduction and its application in protein-DNA interaction sites with applications to
machine learning, Technical Report, Boston University, chromatin immunoprecipitation microarray experiments."
2006. Nature Biotechnology 20, pp. 835-39, 2002.
[6] S. Dwight, M. Harris, K. Dolinski, et al., "Saccharomyces [23] K. MacIsaac, et al., "An improved map of conserved
Genome Database (SGD) provides secondary gene regulatory sites for Saccharomyces cerevisiae." BMC
annotation using the Gene Ontology," Nucleic Acid Bioinformatics 7, pp. 113-127, 2006.
Research 30, pp. 69-72, 2002. [24] V. Matys, O. Kel-Margoulis, et al. "TRANSFAC(R) and its
[7] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, "Least module TRANSCompel(R): transcriptional gene regulation
angle regression," Annals of Statistics, pp. 407-499, 2003; in eukaryotes." Nucl. Acids Res. 34, pp. 108-110, 2006.
see also http://www-stat.stanford.edu/hastie/Papers/LARS [25] M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, and C.
/LeastAngle_2002.ps Leslie, "Predicting genetic regulatory response using
[8] C. Harbison, et al., "Transcriptional regulatory code of a classification," Twelfth International Conference on
eukaryotic genome," Nature 431, pp. 99-104, 2004. Intelligent Systems for Molecular Biology, Bioinformatics
[9] F. Gao, B. Foat, et al., "Defining transcriptional networks 20 Suppl 1, 2004.
through integrative modeling of mRNA expression and [26] P. Pavlidis and W. Noble, "Gene functional classification
transcription factor binding data," BMC Bioinformatics from heterogeneous data." RECOMB Conference
5(1), p. 31, 2004. Proceedings, pp. 249-255, 2001.
[10] S. Grossberg, Studies of Mind and Brain: Neural Principles [27] T. Reddy, B. Shakhnovich, and C. DeLisi, "Binding site
of Learning, Perception, Development, Cognition, and graphs: A new graph theoretical framework for prediction of
Motor Control, Reidel Press, Boston, 1982. transcription factor binding sites," PLOS Computational
[11] D. Holloway, M. Kon and C. DeLisi, "Machine learning for Biology 3, pp. 844-854, 2007.
regulatory analysis and transcription factor target prediction [28] B. Ren, R. Young, et al., "Genome-wide location and
in yeast." Systems and Synthetic Biology 1, 2007. function of DNA binding proteins," Science 290, pp. 2306-
[12] D. Holloway, M. Kon and C. DeLisi, "In silico regulatory 2309, 2000.
analysis for exploring human disease progression," preprint, [29] B. Ren, F. Robert, J. Wyrick, et al., "Genome-wide location
2007. and function of DNA-binding proteins," Science 290, pp.
[13] P. Hong, et al., "A boosting approach for motif modeling 2306-2309, 2000.
using ChIP-chip data." Bioinformatics 21, pp. 2636-2643, [30] F. Roth, J. Hughes, P. Estep, and G. Church, "Finding DNA
2005 regulatory motifs within unaligned non-coding sequences
[14] E. Hong, R. Balakrishnan, KR Christie, MC Costanzo, SS clustered by whole-genome mRNA Quantitation", Nature
Dwight, SR Engel, DG Fisk, et al., "Saccharomyces Biotechnology 16(10), pp. 939-45, 1998.
Genome Database"; http://www.YeastGenome.org [31] B. Schölkopf, and A. Smola, Learning with Kernels -
[15] T. Joachims, "Making large-scale SVM learning practical." Support Vector Machines, Regularization, Optimization and
Advances in Kernel Methods - Support Vector Learning, B. Beyond, MIT Press, Cambridge, MA, 2002.
Schölkopf and C. Burges and A. Smola (eds.), MIT Press, [32] B. Schölkopf, K. Tsuda, and J. Vert, Kernel Methods in
Cambridge, MA, 1999. Computational Biology, MIT, Cambridge, MA, 2004.
[16] D. Karolchik, et al., "The UCSC Genome Browser [33] N. Simonis, S. J. Wodak, et al., "Combining pattern
Database." Nucl. Acids Res 31, pp. 51-54, 2003. discovery and discriminant analysis to predict gene co-
[17] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, regulation," Bioinformatics 20(15), pp. 2370-2379, 2004.
C. Leslie, "Profile-based string kernels for remote [34] V. Vapnik, Statistical Learning Theory, Wiley, New York,
homology detection and motif extraction," Journal of 2000.
Bioinformatics and Computational Biology 3, No. 3, pp. [35] V. Vapnik, The Nature of Statistical Learning Theory,
527-550, 2005. Springer, New York, 1998.
[18] A. Kundaje, M. Middendorf, F. Gao, C. Wiggins, and C. [36] J-P Vert, R. Thurman and W. S. Noble, "Kernels for gene
Leslie, "Combining sequence and time series expression regulatory regions," Proceedings of the 19th Annual
data to learn transcriptional modules," IEEE/ACM Trans Conference on Neural Information Processing Systems,
Comput Biol Bioinfo. 2, pp. 194-202, 2005 2005.
[19] A. Kundaje, et al., "A classification-based framework for [37] C. Workman and G. Stormo, "ANN-Spec: a method for
predicting and analyzing gene regulatory response." BMC discovering transcription factor binding sites with improved
Bioinformatics 7. specificity," Pac Symp Biocomput., pp. 467-78, 2000.
[20] G. Lanckriet, N. Cristianini, et al., "A statistical framework [38] X. Zhang, et al. "Recursive SVM feature selection and
for genomic data fusion," Bioinformatics 20, pp. 2626- sample classification for mass-spectrometry and microarray
2635, 2004. data," Bioinformatics 7, p. 197, 2006.
[21] X. Liu, D. Brutlag, and J. Liu, "BioProspector: discovering
conserved DNA motifs in upstream regulatory regions of

580

Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.

You might also like