SVMotif A Machine Learning Motif Algorithm
SVMotif A Machine Learning Motif Algorithm
574
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
x²³ be the sequence of bases in the promoter x of . The ChIP-chip experiments [9], [33]), as long as statistical
feature vector )²x³ )spect ²x³ of the promoter region is a patterns are not masked by very small .
vector of length , with each position ) ²x³ representing For a gene in S. cerevisiae, x²³ is the FASTA
the count of the ! 5-mer in the indexing sequence. An sequence for its upstream region, up to 800 bp long. For
interesting addition in [36] is use of feature vectors taking fixed ! an SVM with a linear string kernel 2²xÁ y³ defined
phylogenetic conservation into account. They consider a in (1) takes data : and forms a discriminator
given upstream region x 7 of S. cerevisiae, and an ²x³ ~ w h x b which separates positive examples
alignment c 7 ! , consisting of an aligned array of five (²x Á & ³ for which & ~ ) from negative ones. Using the
(matching) upstream regions of length from 5 related R-SVM feature reduction procedure [38], an importance
yeast species, to determine functional regions. Above 7 is vector w × ~ w d (b c c ³ is formed. Here, b
the alphabet ¸A, C, G, T¹À represents the center (vector mean) of the positive
There have been some other methods using negative examples in the feature space - , while c represents the
examples (likely nonbinding genes for a TF) for center of the negative examples. The vector a d b
determining binding motifs. An example is the Ann-Spec represents a componentwise multiplication of a and b.
algorithm [37]. Some boosting motif finding methods use From this a permutation of indices of the vector w × is
negative information [13] as well. formed so the permuted vector w × k w × is arranged
from largest to smallest component. The R-SVM program
2. SVM feature space approach takes the components of w × and reduces them to 150. This
feature selection is repeated 20 times with different choices
We assume a fixed TF !, and a dataset from experiments of presumed negatives (in this case genes with high
which identify promoter regions of genes activated by ! (in values in ChIP-chip experiments) out of about 600
the present case these are ChIP-chip experiments [28], negative examples available. In each SVM run the
[8]). Our methods can be adapted to other data sets, for numbers of positives and negatives are chosen to be equal.
example those in [18] which map activation of a gene as From the top 600 -mers obtained in these 20 runs, the
a function of TF presence together with presence of given best 50 are chosen using a single iteration of R-SVM. For
motifs in the promoter region of . typical transcription factors here, there are approximately
We have for the promoter sequence x²³ of a 50-300 positive examples (i.e., genes binding !), and a
classifier & ~ f indicating (with allowed error) whether larger number (e.g. 600) of negative examples, from which
or not ! binds the promoter of . different samples are taken in multiple runs to match
The SVM gives a function on promoters, defined by numbers of positives.
Though division into training and test sets is not
²x³ ~ 2²x Á x³ b w h x b Á necessary in the motif finding procedure above, it is still
valuable to know how good the SVM algorithm is in
with 2²xÁ y³ a string kernel ([17]) given by predicting whether a gene will bind the given TF, i.e., how
good w is at segregating the positive from the negative
2²xÁ y³ ~ )²x³ h )²y³, (1) examples. Typical test rate accuracy on the above data is
around 80%; see [11] for more details on the predictive
with ) a feature map taking x into the -mer feature vector accuracy of this algorithm for TF target prediction.
above. Software implementing this algorithm includes:
Positive Examples: Known positive examples of genes
• SVMLight: http://svmlight.joachims.org with promoters binding ! include binding targets taken
•SVMTorch: http://www.idiap.ch/learning/SVMTorch.html from ChIP-chip experiments [29], [8], Transfac 6.0 Public
• LIBSVM: http://wws.csie.ntu.edu.tw/~cjlin/libsvm [24], and a list curated by Young et al., from which we
have excluded indirect evidence such as sequence analysis
A Matlab package which implements most SVM and expression correlation [29]. Of the ChIP-chip
algorithms with a C-based back end is SPIDER: interactions, only those with p-values 10c (i.e., high
http://www.kyb.mpg.de/bs/people/spider/whatisit.html confidence level) are considered positives. Selected
The latter implementation is used here. negatives are a randomly chosen subset of those genes
We illustrate the algorithm (again for ! fixed), given found not to be bound by a TF in ChIP-chip experiments
an example data set : ~ ¸²x Á & ³¹~ of promoter regions (typically these are the genes with highest -values and
x and outcomes & (binding/no binding). : is allowed to thus least significant binding).
have large errors (as are sometimes assumed to exist e.g. in
575
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
300. Negatives can be chosen randomly (since there are a
relatively small percentage of positives in the data), or
from genes in ChIP-chip experiments which have large -
values. They can also be formulated as randomized
versions of true positives in cases where known non-
positives are not available.
576
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
1. Delete 'AAAAA+', 'ACAC+', 'TATA+'… from Each column's importance is stored as its column sum
the original sequence. divided by the total weight (.23 b À ³ ¢
2. Set counting resistance to 2. Thus if a sequence is
" À À #. (2)
…ACACAC…, the occurrence count for 'ACAC' is not 2
but 1. Similarly for …AAAAAAAA…, the occurrence The final PWM is normalized. Assuming 4 were the
count for 'AAAAA' is 1 rather than 4. final adjustment to this cluster, we normalize each column:
These two operations are assumed not to affect the
x 0 0 0 0 0 .58 1 {
motif components, but they do reduce such noisy -mers. z 1 0 0 0 0 .42 0 }
In the case of artificial negative generation, another 4 ~z }
0 1 1 1 0 0 0
method for elimination of noise would involve fitting a y 0 0 0 0 1 0 0 |
higher order Markov background model to the whole
genome and randomly generating pseudo-negative Next log odds scores with respect to the background
examples based on this model. model (in which each entry is .25) are taken with
pseudocounts of .1 added to both numerator and
More about Clustering: A greedy method with 2 denominator, yielding
thresholds is used. Once the score of a -mer matched to a x cÀ cÀ cÀ cÀ cÀ À À {
cluster is above the in-threshold, the -mer is added into z À
z
cÀ cÀ cÀ cÀ À cÀ }
}
this cluster. cÀ À À À cÀ cÀ cÀ
y cÀ cÀ cÀ cÀ À cÀ cÀ |
The algorithm starts the PWM with a weighted count.
For example, if the current -mer is GGGTAA with weight with each entry a log ratio (base 2). Here a 0 represents no
(in the final w vector) of 0.230, the weighted PWM is additional information (random entry), while a positive
entry represents positive information. Now multiplying by
x 0 0 0 0 .23 .23 {
z 0 0 0 0 0 0 } the weight vector (2) yields the final PWM,
4 ~ z }
.23 .23 .23 0 0 0
y 0 0 |
0 0 .23 0 x cÀ cÀ cÀ cÀ cÀ À À {
z À cÀ cÀ cÀ cÀ À cÀ }
> ~z }
with the first row representing A, and the remaining rows cÀ À À À cÀ cÀ cÀ ,
representing C, G, and T. y cÀ cÀ cÀ cÀ À cÀ cÀ |
After comparing all possible alignments, an incoming
-mer with the next highest weight in w × (i.e., the second used to compute hitting scores by scanning a promoter
entry in the ordered vector w × ) will be added into the best region with > and counting the number of scores above a
fitting existing cluster with an offset determined by its best given threshold ; .
fit, and the cluster's PWM is updated to incorporate the Intuitively, a match in an 'important' position of >
new -merÁ by addition of the weighted PWM should gain more of a score and a mismatch in an
corresponding to it important position should lose more. Note that a match or
Thus if an incoming -mer is now CGGGTC with mismatch at the 'ends' will count less here, since
weight 0.17, it is clear that its position must be adjusted 1 importance multiples each column.
position to the left for the best match, so the PWM is
updated with alignment Scoring: How do we pick out the true cluster among all
C G G G T C clusters (each now a PWM) for !? The first part of
G G G T A A selecting the best PWM involves the relative entropy score
Thus we add 3
, ~ Total_Weight h log Á
x { ~ ~
z À À }
4 ~ z }
À À À
y À | to pick out overrepresented clusters. Here is the
probability (frequency) of letter in the ! PWM position,
transposed to the left, yielding the jogged sum is the probability of in the background model (here
0 0 0 0 0 .23 .23 .25); is the importance score (see (2)) of the ! column.
x {
z À 0 0 0 0 .17 0 } Based on their entropy scores, we choose the top 5 clusters
: ~ z }
0 . . .4 0 0 0 (and their PWM's) as candidates and use them to scan the
y 0 0 0 0 .4 0 0 |
promoter sequences which are positive for !.
577
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
The hitting ratio score for a PWM is defined as The 85 TF's were tested using AlignAce [30],
BioProspector [21], and SVMotif. The following table
# hits on positive genes
HR ~ compares the performances these three methods on the full
# hits on negative genes MacIsaac dataset.
where negative genes are randomly undersampled to match AlignAce BioProspector SVMotif
the current number of positives. This is used to assess the Top 1 to 3 Top 1 to 3 Top 1 to 3
quality of these 5 candidate clusters. Appearing Ungapped (74) 19 34 27 33 29 43
significantly more often in positives than in negatives is Gapped (11) - - 3 3 6 9
assumed to be a property of a true motif. If we use
fabricated negatives instead of true ones (see above) then, Table 2: Motif finding performance on
as suggested earlier, regular and generally overrepresented MacIsaac [23] TF's with known binding
-mers like 'AAAAGA' and 'CTTCTTCTT', can get high specificities among AlignAce, BioProspector
hitting scores in promoter regions. For this reason we have and SVMotif. The 'top' PWM is the one with
chosen to use a product of entropy and hitting ratio scores the highest score for each TF; the number in
as the final criterion; the cluster with highest product is that column indicates how many TF's have
output as the best prediction. top PWM's which coincide with the reported
For computation of the hitting ratio we must set a motif in supplemental file 1 of MacIsaac [23].
threshold ; on a local PWM score to count a hit. 1 to 3 above represents the top 3 PWM for a
However, it is difficult to produce common thresholds for single TF. Each score represents the number
all TFs. We have fixed the value ; ~ 6.4 for all TFs, of PWM hits of the 'top' motif as reported by
based on the heuristic fact that for a normalized PWM, a MacIsaac among the union of the top 3
single strong match can add 1.65 PWM's for all TF's in the indicated row. Each
( ~ log ´ b À!°²À b À³µÁ the maximal log odds ratio³ PWM is scored either 0 or 1 total depending
to the PWM score, while a moderate match gains 1 and a on whether or not it coincides with reported
weak match gains 0.5. Mismatches in corresponding motifs. The above indicates there were a total
different levels will give losses of 1.8, 1 and 0.5. A hit of 74 TF's without gapped motifs, and 11 with
should have at least 4 strong matches and some matches or gapped motifs. For gapped motifs each
mismatches which on the average do not affect the hitting program was given a range of 3 gap sizes to
score. try.
Gapped motifs: The procedure for gapped motifs simply Below are the results for the subgroup of the above
tries a number of allowed gap sizes, and considers the TF's which also appear in Transfac [24].
regions adjacent to the gap to be contiguous for the
purpose of the SVM algorithm, similarly to what is done in AlignAce BioProspector SVMotif
Top 1 to 3 Top 1 to 3 Top 1 to 3
BioProspector.
Ungapped (25) 9 15 14 15 12 17
Gapped (3) - - 1 1 2 3
3. Experimental results
Table 3: Motif finding performance on
We chose 85 TF's from supplemental file 1 of binding data MacIsaac TF's (Transfac subgroup).
from MacIsaac [23], consisting of TF's with binding
specificities known from various sources. These 85 are Finally, we have taken the 102 S. cerevisiae TF's from the
chosen from a total of 88 used as a benchmark in [23]. UCSC Genome Browser [16]. In this database there are
Three TF's (MATA1, CRZ1 and ECM22) are omitted published PWM's for 102 TF's, complied largely from [8].
because positive examples for them did not exist in our These have been converted by us to consensus sequences.
original database. Of these, 74 are ungapped and 11 are Again after elimination of two TF's because of a lack of
gapped. Of the full group of 85 TF's, 28 of these (25 positive/negative examples, the corresponding results are:
ungapped and 3 gapped) also appear in the Transfac [24]
database, and so are analyzed separately given their AlignAce BioProspector SVMotif
presumably more accurate resulting known binding Top 1 to 3 Top 1 to 3 Top 1 to 3
specificities. Ungapped (90) 27 37 33 39 34 49
Gapped (10) - - 3 3 6 8
578
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
4. Signals indicating multiple switches A remark on methodology: We remark on the effective
simplicity of this process. The two basic components are
We have verified, as have others [17], [8] that there are identification of a statistical correlation between -mers
functional and statistical reasons for the multiple hits and gene activation (in sufficiently controlled
which occur in all methods that correlate motifs with TF experiments), and the inference of motif PWM's from
binding. For a given TF there are cofactors which can agglomeration of -mers which are significantly correlated
modulate the TF itself or reinforce the activity of the TF, with TF activation. The algorithms for both parts
leading to a number of the 'additional' motifs discovered by contribute to usefulness of the result.
programs such as BioProspector and SVMotif. This is This method could be further improved using ideas
added to in the multiple TF's in overlapping genomic introduced in [32], in which -mer selection is improved
transcription modules. Thus there are consistent using phylogenetic information from orthologous species.
appearances of 'additional' PWM's statistically associated As is shown there, this can be accomplished by
with TF's but apparently not acting as their binding sites. improvement of the kernel. As suggested above, such
We have tested some of these apparently false functional incorporation of prior knowledge can also be
relationships (in that the TF of interest does not bind the accomplished, in use of random forests through proper
additional sites), and verified them statistically. We have weighting of -mers in feature vectors using phylogenetic
divided several TF data sets in half, and showed that considerations. This will be considered in later work.
'additional' PWM's were consistently duplicated on cross- Other promising feature vector-based methods include
validation. This cross-functionality of motifs and the random projection methods [5], which are capable of
regulators which may bind them is not completely handling large numbers of variables and determining
understood, though it has been examined in specific important ones. Indeed, random projection and random
situations. variable selection methods (such as RF) can be combined
We note that prediction success rate for TF motifs with any number of other machine learning algorithms to
increases significantly if we consider the two or three handle feature spaces which would otherwise be
highest scoring PWM candidates for !, as opposed to just prohibitively large.
the highest scoring one. This may imply that there are Another important variation in the approach involves
typically three or less likely motifs which have a statistical the choice of positives and negatives in the machine
(and possibly functional) significance correlated with gene learning method. In the case of SVM we can undersample,
expression. With this hypothesis, scoring methods by the choosing only more reliably classified upstream regions x.
number of hits in the top 3 is a reasonable measure of A balance may need to be drawn between the choice of
accuracy of our methods in finding validated motifs from threshold and the lower number of examples which result.
Transfac or YeastGenome. It is shown in [8] and Finally, we note that the greedy agglomeration
elsewhere that multiple binding sites in a promoter are algorithm described here can be replaced by other existing
typical in yeast, and they often can act in concert; [18] agglomeration procedures, though their effectiveness has
offers more verification in this direction. not been studied in this application. Other procedures
include that used by [25]. The step of statistical
5. Discussion identification of -mers associated with TF activation is
what here requires machine learning methodologies.
Kernel methods such this one as are based on assigning
feature vectors to genes. Such feature space-based 6. References
methods can be emulated by other feature vector
[1] T. Bailey and C. Elkan (1994), "Unsupervised learning of
classification methods such as random forests or LARS
multiple motifs in biopolymers using EM," Machine
(least angle regression) [7] in ways which incorporate prior Learning 21, pp. 51-80.
knowledge into feature spaces in an identical way. The [2] P. Cliften, M. Johnston, et al., "Surveying Saccharomyces
choice of kernel in SVM is equivalent to choosing a genomes to identify functional elements by comparative
feature map ) into a space with Euclidean geometry. Thus DNA sequence analysis," Genome Research 11, pp. 1175-
any prior knowledge (e.g., phylogenetic information [36]) 1186, 2001.
which can be incorporated into a kernel can also be [3] E. Conlon, X. Liu, J. Lieb, and J. Liu, "Integrating regulatory
incorporated into alternative methods such as those above, motif discovery and genome-wide expression analysis,"
or any number of neural network-based techniques such as PNAS 100, pp. 3339-3344, 2003.
ART [10], by a proper adaptation of the feature map.
579
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.
[4] N. Cristianini, and J. Shawe-Taylor, An Introduction to co-expressed genes," Pac. Symp. Biocomput. 6, pp 127-
Support Vector Machines, Cambridge University Press, 138, 2001.
Cambridge, 2000. [22] X. Liu, D. Brutlag, and J. Liu, "An algorithm for finding
[5] Q. Ding, Dimensionality reduction and its application in protein-DNA interaction sites with applications to
machine learning, Technical Report, Boston University, chromatin immunoprecipitation microarray experiments."
2006. Nature Biotechnology 20, pp. 835-39, 2002.
[6] S. Dwight, M. Harris, K. Dolinski, et al., "Saccharomyces [23] K. MacIsaac, et al., "An improved map of conserved
Genome Database (SGD) provides secondary gene regulatory sites for Saccharomyces cerevisiae." BMC
annotation using the Gene Ontology," Nucleic Acid Bioinformatics 7, pp. 113-127, 2006.
Research 30, pp. 69-72, 2002. [24] V. Matys, O. Kel-Margoulis, et al. "TRANSFAC(R) and its
[7] B. Efron, T. Hastie, I. Johnstone and R. Tibshirani, "Least module TRANSCompel(R): transcriptional gene regulation
angle regression," Annals of Statistics, pp. 407-499, 2003; in eukaryotes." Nucl. Acids Res. 34, pp. 108-110, 2006.
see also http://www-stat.stanford.edu/hastie/Papers/LARS [25] M. Middendorf, A. Kundaje, C. Wiggins, Y. Freund, and C.
/LeastAngle_2002.ps Leslie, "Predicting genetic regulatory response using
[8] C. Harbison, et al., "Transcriptional regulatory code of a classification," Twelfth International Conference on
eukaryotic genome," Nature 431, pp. 99-104, 2004. Intelligent Systems for Molecular Biology, Bioinformatics
[9] F. Gao, B. Foat, et al., "Defining transcriptional networks 20 Suppl 1, 2004.
through integrative modeling of mRNA expression and [26] P. Pavlidis and W. Noble, "Gene functional classification
transcription factor binding data," BMC Bioinformatics from heterogeneous data." RECOMB Conference
5(1), p. 31, 2004. Proceedings, pp. 249-255, 2001.
[10] S. Grossberg, Studies of Mind and Brain: Neural Principles [27] T. Reddy, B. Shakhnovich, and C. DeLisi, "Binding site
of Learning, Perception, Development, Cognition, and graphs: A new graph theoretical framework for prediction of
Motor Control, Reidel Press, Boston, 1982. transcription factor binding sites," PLOS Computational
[11] D. Holloway, M. Kon and C. DeLisi, "Machine learning for Biology 3, pp. 844-854, 2007.
regulatory analysis and transcription factor target prediction [28] B. Ren, R. Young, et al., "Genome-wide location and
in yeast." Systems and Synthetic Biology 1, 2007. function of DNA binding proteins," Science 290, pp. 2306-
[12] D. Holloway, M. Kon and C. DeLisi, "In silico regulatory 2309, 2000.
analysis for exploring human disease progression," preprint, [29] B. Ren, F. Robert, J. Wyrick, et al., "Genome-wide location
2007. and function of DNA-binding proteins," Science 290, pp.
[13] P. Hong, et al., "A boosting approach for motif modeling 2306-2309, 2000.
using ChIP-chip data." Bioinformatics 21, pp. 2636-2643, [30] F. Roth, J. Hughes, P. Estep, and G. Church, "Finding DNA
2005 regulatory motifs within unaligned non-coding sequences
[14] E. Hong, R. Balakrishnan, KR Christie, MC Costanzo, SS clustered by whole-genome mRNA Quantitation", Nature
Dwight, SR Engel, DG Fisk, et al., "Saccharomyces Biotechnology 16(10), pp. 939-45, 1998.
Genome Database"; http://www.YeastGenome.org [31] B. Schölkopf, and A. Smola, Learning with Kernels -
[15] T. Joachims, "Making large-scale SVM learning practical." Support Vector Machines, Regularization, Optimization and
Advances in Kernel Methods - Support Vector Learning, B. Beyond, MIT Press, Cambridge, MA, 2002.
Schölkopf and C. Burges and A. Smola (eds.), MIT Press, [32] B. Schölkopf, K. Tsuda, and J. Vert, Kernel Methods in
Cambridge, MA, 1999. Computational Biology, MIT, Cambridge, MA, 2004.
[16] D. Karolchik, et al., "The UCSC Genome Browser [33] N. Simonis, S. J. Wodak, et al., "Combining pattern
Database." Nucl. Acids Res 31, pp. 51-54, 2003. discovery and discriminant analysis to predict gene co-
[17] R. Kuang, E. Ie, K. Wang, K. Wang, M. Siddiqi, Y. Freund, regulation," Bioinformatics 20(15), pp. 2370-2379, 2004.
C. Leslie, "Profile-based string kernels for remote [34] V. Vapnik, Statistical Learning Theory, Wiley, New York,
homology detection and motif extraction," Journal of 2000.
Bioinformatics and Computational Biology 3, No. 3, pp. [35] V. Vapnik, The Nature of Statistical Learning Theory,
527-550, 2005. Springer, New York, 1998.
[18] A. Kundaje, M. Middendorf, F. Gao, C. Wiggins, and C. [36] J-P Vert, R. Thurman and W. S. Noble, "Kernels for gene
Leslie, "Combining sequence and time series expression regulatory regions," Proceedings of the 19th Annual
data to learn transcriptional modules," IEEE/ACM Trans Conference on Neural Information Processing Systems,
Comput Biol Bioinfo. 2, pp. 194-202, 2005 2005.
[19] A. Kundaje, et al., "A classification-based framework for [37] C. Workman and G. Stormo, "ANN-Spec: a method for
predicting and analyzing gene regulatory response." BMC discovering transcription factor binding sites with improved
Bioinformatics 7. specificity," Pac Symp Biocomput., pp. 467-78, 2000.
[20] G. Lanckriet, N. Cristianini, et al., "A statistical framework [38] X. Zhang, et al. "Recursive SVM feature selection and
for genomic data fusion," Bioinformatics 20, pp. 2626- sample classification for mass-spectrometry and microarray
2635, 2004. data," Bioinformatics 7, p. 197, 2006.
[21] X. Liu, D. Brutlag, and J. Liu, "BioProspector: discovering
conserved DNA motifs in upstream regulatory regions of
580
Authorized licensed use limited to: Indian Instt of Engg Science & Tech- SHIBPUR. Downloaded on October 16,2023 at 06:01:45 UTC from IEEE Xplore. Restrictions apply.