Computational Approaches
Computational Approaches
A Comparative Survey
1 Introduction
The main focus of gene prediction methods is to find patterns in long DNA sequences
that indicate the presence of genes. The problem of gene prediction is an important
problem in the field of bioinformatics [5]. It presents many difficulties especially in
eukaryotes where exons (coding regions) are interrupted by introns (non-coding
regions).The gene prediction problem can be defined formally as:
Input: DNA sequence X x ,x ,….x Σ where Σ A, T, G, C
Output: correct labeling oof each element of x as coding, noncoding or intergenic
4.
Although researchers continue to devise new algorithms for gene prediction, the
satisfactory accuracy is still far from reach. According to [2], 90% of genes are
accurately predicted at the nucleotide level, 45% at the exon level and only 20% on
the gene level. This explains the reason why the number of genes in the human
genome cannot be precisely estimated (between 30,000 and 100,000). Current gene
prediction methods are based on exploiting two gene features: content sensors and
signal sensors [3]. Content sensors try to distinguish protein coding and non-protein
coding regions. On the other hand, signal sensors try to predict the presence of
functional sites. Many algorithms are used for gene prediction including: dynamic
programming, hidden Markov models (HMMs), artificial neural networks (ANNs)
and linear discriminate analysis.
A. Abd Manaf et al. (Eds.): ICIEIS 2011, Part II, CCIS 252, pp. 14–25, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Computational Approaches for Gene Prediction 15
2 Gene Prediction
Gene prediction problem can be divided into two sub-problems: prediction of protein
coding regions and prediction of functional sites [5, 20]. The development of gene
prediction programs have evolved in four generations [1]. In the first generation the
programs such as GRAIL [10] and TestCode [11], were designed to identify
approximate locations of protein coding regions. According to [1] such programs
were not able to produce precise location predictions. In the second generations,
programs such as SORFIND [12] and Xpound [13] combined splice signal and coding
region identification to predict exons. However, the programs did not attempt to
produce complete genes by assembling the predicted exons. The task of predicting
complete gene structures was addressed by the third generation programs such as
GeneID [14], GeneParser [15] ,GenLang [16] and FGENEH [17]. All those
programs were based on the unrealistic assumption that the input sequence contains
a single complete gene. The fourth generation programs came to solve this
shortcoming and improve the prediction accuracy. They include GENSCAN [18] and
AUGUSTUS[19].
Gene prediction methods rely on exploiting two types of gene information: content
and signal. Contents refer to variable-length features such as exons or introns. On the
other hand, signals are local sites such as splice sites, start and stop codons, branch
points, promoters and terminators polyadenylation sites, ribosomal binding sites and
transcription binding sites. Accordingly, Haussler [21] classified gene prediction
methods into three categories: content sensors, signal sensors and integrated methods.
Content sensors are methods that predict coding and non coding regions. Signal
sensors refer to methods that try to detect the presence of functional sites. Integrated
gene finding methods combine content and signal sensors in order to predict complete
gene structure.
Mathe et al. [3] divided gene finding approaches into finding the evidence for a
gene and combining the evidence in order to predict gene structure. Fig. 1 depicts
Gene Finding Approachs. Content sensors and signal sensors are used to look for the
evidence of a gene. Content sensors are either extrinsic or intrinsic [3]. In extrinsic
content sensors, local alignment methods such as Smith-Waterman, BLAST and
FASTA are used to detect similarity between a sequence and DNA or protein
sequence in a database. This approach is based on observing that exons are more
conserved that introns and intergenic regions. Three types of sequences may be used
in such case: protein sequences, ESTs, or genomic DNA. One limitation in ETS-based
similarity search is that only small parts of a gene correspond to ESTs which makes it
difficult to predict the complete gene structure [1]. The strength of this similarity-
based method lies in fact that prediction is based on accumulated annotated
sequences. The presence of a gene can be detected even with a single match.
However, this method is highly dependent on the content of the database. The
database may contain poor quality sequences or it may not have sequences similar to
the query sequence. Moreover, only half of the genes being discovered are
significantly homologous to database sequences.
16 I.M. Al-Turaiki et al.
donor site [6]. Fig. 2 depicts exon classifications. Each of the four exon types present
different challenges and the ability of gene prediction programs to predict each type is
different. Evidences can be combined to predict the complex structure of a gene. "In
theory, each consistent pair of detected signals defines a potential gene region" [3].
If one takes into consideration all possible gene regions to build a gene model,
the number of possible gene models grow exponentially with the number of
predicted exons.
According to [3] the method that try to predict the whole gene structure can be
classified into: extrinsic, intrinsic or integrated. The classification is based on the way
these methods use to asses contents. Programs that follow the extrinsic approach are
based on combining similarity information and signal information. This information is
used to refine region boundaries. Such programs inherit the strengths and weaknesses
of the sensors used. Intrinsic programs try to locate all gene elements in a genomic
sequence. Dynamic programming is used to identify the high scoring gene structures.
Integrated methods combine both extrinsic and intrinsic methods.
Prokaryotic organism, like bacteria and Archaea, are known to have small genomes
with size ranging from 0.5 to 10 Mbp. They have high gene density, where coding
sequences make more than 90% of the genome. DNA sequences in prokaryotes are
transcribed into mRNA and translated to protein without significant modifications.
The main characteristic of prokaryotic genes is the absence of introns. Several
conserved patterns are found in prokaryotic genes. Gene prediction can make use of
gene regulatory sequences. The start codon of most bacterial genes is ATG. But
sometimes, GTG and TTG are used as start codons. However the start codon alone is
not enough to identify a gene since there could be more than one start codon at the
beginning of a frame. The ribosomal binding site which is located directly after the
transcription initiation site and before the translation start codon is used to help in
locating the start codon. In many bacteria the ribosomal binding site has a consensus
motif of AGGAGGT. At the end of protein coding region a stop codon is found.
There are three possible stop codons and they are easy to identify.
Conventionally, a prokaryotic gene can be identified manually using the longest
open reading frame (ORF) and the major prokaryotic gene signals. One limitation of
this approach is the possibility of missing very small genes. Although prokaryotic
genomes have small intergenic regions and lack introns, genes may overlap and
18 I.M. Al-Turaiki et al.
makes it difficult to predict translation starts [3]. Early gene prediction algorithms
relied on examining the nucleotide distribution in DNA sequences. One method
observes the composition of the third position in a codon. In coding sequences it is
more likely that this position will be G or C. Statistical features of encoding regions
can be analyzed using Markov Models and Hidden Markov Models. This approach
may be statistically limited when there are no enough hexamers. The solution to such
problem is to use an alternative option called interpolated Markov models (IMM). A
program that finds potential genes in microbial DNA – Bacteria and Archaea is
Glimmer [9]. It is based on IMM. Other HMM/IMM programs include ECOPARSE
[7], GENMARK [8] and are able to predict genes with good specificity.
In order to deal with the extremely large number of possible gene models, dynamic
programming methods are used in gene finding applications. In terms of gene finding,
grammatical rules of gene structure are used. They provide constraints on the order on
which gene segments can appear. For example, an internal intron appears between
two exons. Giving a set of rules and scored introns and exons it is possible to scan the
sequence once and determine the highest scoring gene structure. This approach
although guaranteed to produce the highest scoring gene structure it does not always
produce correct predictions. However, researchers can focus on improving the
Computational Approaches for Gene Prediction 19
predicted gene structure and improve the score assigned to them. GeneParser [15] is a
gene prediction program that employs dynamic programming in order to compute
optimal combination of introns and exons.
Neural networks are capable of finding complex patterns and relationships between
sequence residues. In gene finding a neural network is composed of input, output and
hidden layers. The input to the network is a gene sequence. The output is the
probability of exon. A neural network is trained using known gene structures. Gene
structure is composed of several features: hexamer frequencies splice sites, GC
composition. The weights of the ANN are adjusted to recognize different patterns.
After training an artificial neural network (ANN) will be able to apply the rules
learned in the training phase to recognize unknown sequences. Fig. 3 depicts ANN for
Gene Prediction in Eukaryotes.
transcription start site (TSS). Then another system is used to predict the presence of
CpG islands. Then a four-layer ANN is used to predict the presence of a gene based
on the predicted TSS and CpG islands.
Fig. 4. Two discriminant analysis, LDA and QDA. ▲ coding features; ⊗ noncoding[24]
Hidden Markov models are designed to process sequences of data. They are widely
used in computational biology for analyzing DNA and protein sequences. They have
been already used in multiple sequence alignment, motif finding and protein structure
prediction. A Markov model is a one that assumes that the probability that a given
character will appear at a specific position depends on the previous k characters. k is
called the order of Markov model. The model is defined using conditional probability
⁄ where , , , for DNA sequences. In order
to build a Markov model a set of sequences is required to estimate the probabilities.
Using the built Markov model, it is easy to compute the likelihood that a given
sequence is generated by this model. A HMM is made of a number of states each of
which has two parameters: symbol-emission probabilities, and state-transition
probabilities. Symbol-emission probabilities indicate the probability of emitting each
possible symbol from a state. The state-transition probabilities determine the
Computational Approaches for Gene Prediction 21
probabilities of moving from one state to the next. Sequences can be generated by
beginning at some initial state and moving from state to another according to the
state-transition probabilities until an end state is reached. Each state then emits
symbols according to that state's emission probability distribution, creating a sequence
of residues.
There are three attributes of HMMs that led to their popularity in gene finding.
They have intuitive analogy to the gene structure. They have consistent mathematical
formalism on which rigorous analysis can be made. There are many algorithms that
allow for efficient determination of their properties [6]. According to [4] there is no
HMM system that can deal with the size and complexity required in gene finding. A
full gene finding HMM model is divided into three smaller models one for each class
of DNA nucleotides: exons, introns and integenic regions.
The HMM is first trained using the expectation maximization algorithm. It is used
to determine reasonable values for all HMM probabilities. Emission and state
transition probabilities are initialized to random values. Sometimes it is good to use
prior estimations of those probabilities if available. When it receives a DNA
sequences the EM algorithm re-estimate the probabilities. It runs each training
sequence through the model and calculates the posterior probability P s⁄λ for each
sequence s. The P s⁄λ are multiplied to produce P S⁄λ where S is the set of all
sequences. The probabilities are adjusted in order to maximize the value of P S⁄λ .
The procedure continues to run data through the model and adjust probabilities
according to P S⁄λ until the value is maximized. The EM algorithm is guaranteed to
produce the local optimal estimates of all probabilities. After training the HMM it
becomes ready to interpret sequences. For this purpose a dynamic programming
algorithm called Viterbi algorithm is often used. It can efficiently align any sequence
to a trained HMM. It finds the most likely sequence of states of the model for the
given sequence.
GENESCAN [18] is an online gene prediction program based on fifth-order HMM.
The program allows for predicting the locations and exon-intron structures of genes in
genomic sequences from a variety of organisms. It combines hexamer frequencies
with coding signals in prediction. Putative exons are assigned a score of being
true exons. If the score exceeds 0.5 the prediction is considered reliable. This
program has been used in annotating the human genome. It is available at
http://genes.mit.edu/GENSCAN.html.
HMMgene available at http://www.cbs.dtu.dk/services/HMMgene/ is another
online HMM based gene prediction program. It distinguishes coding regions from
non-coding regions using a unique measure called conditional maximum likelihood. If
a sequence contains sub-regions that have been already identified as coding regions
(by similarity search) the regions are locked as coding regions. Then the HMM
prediction is made with bias to locked regions. The program tries to extend locked
regions to predict the remaining coding parts of a gene.
4 Promoter Identification
Promoters are DNA sequences found at the 5 side of a gene. They regulated
transcription by binding regulatory proteins to specific binding sites. Protein
22 I.M. Al-Turaiki et al.
Using this type of algorithms, both eukaryotic and prokaryotic promoters can be
predicted. The prediction process is based on characteristic patterns for promoters and
regulatory elements. The idea of predicting promoters and regulatory sites is based in
match consensus sequence patterns. Ab initio algorithms require training and thus
trained algorithms are species specific. Using ab initio algorithms it is difficult to
predict unknown motifs.
In prokaryotes the most important aspect of promoter prediction is the prediction of
the operon structure. Within an operon linked genes share a common promoter
located upstream from the first gene. Knowing the operon structure, only prediction
of one promoter is required. The most accurate method for operon prediction was
Computational Approaches for Gene Prediction 23
developed in 2004 [30] but it is not yet available as a computer program. The most
widely used promoter prediction program for prokaryotes is BPROM. BPROM is a
bacterial sigma70 promoter recognition program with about 80% accuracy and
specificity. It uses LDA with signal and content information. First operon structure is
predicted using 100 base pairs as a distance for genes in an operon. Then promoter
prediction takes place. The program is available at http://linux1.softberry.com/
berry.phtml?topic=bprom&group=programs&subgroup=gfindb
For eukaryotes a unique feature of eukaryotic promoter, CpG islands, is used to
increase the prediction accuracy. By predicting the presence of CpG islands near the
promoter region overlapping the transcription start site promoters can be traced on the
upstream of the island. CpGProD is a program for predicting promoters containing
high density of GpC islands in mammalian sequences. CpGProD is available either
via a web server, useful for a small dataset, or as a standalone application for a larger
dataset. The online version of the program can be found at http://pbil.univ-
lyon1.fr/software/cpgprod_query.html .
Promoter and regulatory elements from closely related organisms are highly
conserved. The conservation can be observed at both sequence and element
organization levels. This makes it possible to predict promoters based on comparative
analysis. Phylogenetic footprinting refers to the identification of functionally
important noncoding DNA sequences. These methods can be applied to both
eukaryotic and prokaryotic sequences. In this type of comparative analysis, the choice
of organisms is very important. If the two organisms are closely related (human and
chimpanzee for example) it would be difficult to extract functional elements. On the
other hand, if the evolutionary distance between two organism is very long (for
example between human and fish) detecting promoters and other elements would be
very difficult. A good choice is usually using human and mouse sequences. Using
Phylogenetic footprinting algorithms no training of probabilistic models is required,
thus this approach is widely applicable. It is also possible to discover new regulatory
elements that are common to both organisms. However, this approach puts restrictions
on the evolutionary distance among organisms. ConSite is an online Phylogenetic
footpriniting program. It finds promoters by comparing two sequences using global
alignment algorithm. The program is available at http://asp.ii.uib.no:8090/cgi-
bin/CONSITE/consite/ . Similar programs include rVISTA at http://rvista.dcode.org/
and PromH(W) from www.softberry.com.
6 Conclusion
Gene prediction is an important problem in bioinformatics. Different approaches have
been used to tackle this problem including: dynamic programming, HMM, ANN and
GA. In prokaryotes gene prediction is much easier than in eukaryotes. On the
nucleotide level more than 90% of nucleotides can be classified as coding or non-
coding. But the exact exon boundaries are still very difficult to predict. Some
problems in gene finding are still open. According to [1], it is difficult to locate short
24 I.M. Al-Turaiki et al.
exons because discriminative characteristics are not apparent in short sequences. The
problem is worse when a coding exon is a multiple of three. Missing such exon will
not affect the gene assembly. Another problem that has not been solved effectively is
the alternative splicing which results in genes having more than one possible exon
assembly. Although programs like GENESCAN and MZEF tried to solve the problem
by identifying sub-optimal exons, the problem remains open [1]. Moreover, there is
no current method that can predict overlapping eukaryotic genes. There prediction of
multiple genes in a single sequence is still difficult. With the wide variety of gene
prediction programs there is a need for comprehensive criteria for assessing the
quality of gene prediction programs.
References
1. Wang, Z., Chen, Y., Li, Y.: A Brief Review of Computational Gene Prediction Methods.
Geno. Prot. Bioinfo. 2, 216–221 (2004)
2. Zhang, M.Q.: Computational Prediction of Eukaryotic Protein-Coding Genes. Nature
Reviews Genetics 3, 698–709 (2002)
3. Mathe, C., Sagot, M., Schiex, T., Rouze, P.: Current Methods for Gene Prediction, Their
Strengths and Weakness. Nucleic Acid Research 30, 4103–4117 (2002)
4. Bandyopadhyay, S., Maulik, U., Roy, D.: Gene Identification: Classical and
Computational Intelligence Approaches. IEEE Transactions On Systems, Man, And
Cybernetics—Part C: Applications And Reviews 38, 55–68 (2008)
5. Mount, D.W.: Bioinformatics: Genome and Sequence Analysis. Cold Spring Harbor
Laboratory Press, New York (2004)
6. Stormo, G.D.: Gene-Finding Approaches in Eukaryotes. Genome Research 10, 394–397
(2000)
7. Krogh, A., Mian, I.S., Haussler, D.: A hidden Markov model that finds genes in E. coli
DNA. Nucleic Acids Res. 22, 4768–4778 (1994)
8. Borodovsky, M., McIninch, J.: GENMARK: parallel gene recognition for both DNA
strands. Comput. Chem. 17, 123–133 (1993)
9. Salzberg, S.L., Delcher, A.L., Kasif, S., White, O.: Microbial gene identification using
interpolated Markov models. Nucleic Acids Res. 26, 544–548 (1998)
10. Uberbacher, E.C., Mural, R.J.: Locating protein-coding regions in human DNA sequences
by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88, 11261–
11265 (1991)
11. Fickett, J.W.: Recognition of protein coding regions in DNA sequences. Nucleic Acids
Res. 10, 5303–5318 (1982)
12. Hutchinson, G.B., Hayden, M.R.: The prediction of exons through an analysis of
spliceable open reading frames. Nucleic Acids Res. 20, 3453–3462 (1992)
13. Thomas, A., Skolnick, M.H.: A probabilistic model for detecting coding regions in DNA
sequences. IMA J. Math. Appl. Med. Biol. 11, 149–160 (1994)
14. Guigo, R., Knudsen, S., Drake, N., Smith, T.: Prediction of gene structure. J. Mol.
Biol. 226, 141–157 (1992)
15. Snyder, E.E., Stormo, G.D.: Identification of coding regions in genomic DNA sequences.
J. Mol. Biol. 248, 1–18 (1995)
16. Dong, S., Searls, D.B.: Gene structure pre-diction by linguistic methods. Genomics 23,
540–551 (1994)
Computational Approaches for Gene Prediction 25
17. Solovyev, V.V., Salamov, A.A., Lawrence, C.B.: Predicting internal exons by
oligonucleotide composition and discriminate analysis of spliceable open reading frames.
Nucleic Acids Res. 22, 5156–5163 (1994)
18. Burge, C., Karlin, S.: Prediction of Complete Gene Structure in Human Genomic DNA. J.
Mol. Biol. 268, 78–94 (1997)
19. Stanke, M., Waack, S.: Gene Prediction With A Hidden Markov Model and A New Intron
Submodel. Bioinformatics 19, 215–225 (2003)
20. Burge, C., Karlin, S.: Finding the Genes in Genomic DNA. Curr. Opin. Struct. Biol. 8,
346–354 (1998)
21. Haussler, D.: Computational Genefinding. Trends Biochem. Sci., 12–15 (1998)
22. Zhang, M.Q.: Identifcation of protein coding regions in the human genome by quadratic
discriminate analysis. Proc. Natl. Acad. Sci. USA 94, 565–568 (1997)
23. Milanesi, L., Kolchanov, N.A., Rogozin, I.B., Ischenko, I.V., Kel, A.E., Orlov, Y.L.,
Ponomarenko, M.P., Vezzoni, P.: GenView: a computing tool for protein-coding regions
prediction in nucleotide sequences. In: Second International Conference on
Bioinformatics, Supercomputing and Complex Genome Analysis, pp. 573–588. World
Scientific Publishing, Singapore (1993)
24. Xiong, J.: Essential Bioinformatics. Cambridge University Press, New York (2006)
25. Fogel, D.B., Chellapilla, K., Fogel, D.B.: Identification of Coding Regions in DNA
Sequences Using Evolved Nueral Networks. In: Fogel, G.B., Corne, D.W. (eds.)
Evolutionary Computation is Bioinformatics, pp. 195–218. Morgan Kaufmann, USA
(2003)
26. Bajic, V.B., Seah, S.H.: Dragon gene start finder: An advanced system for finding
approximate locations of the start of gene transcriptional units. Genome Res. 13, 1923–1929
(2003)
27. Bajic, V.B., Seah, S.H.: Dragon gene start finder identifies approximate locations of the 5‘
ends of genes. Nucleic Acids Res. 31, 3560–3563 (2003)
28. http://www.cshl.edu/OTT/html/mzef.html
29. http://www.geneprediction.org/book/overview.pdf
30. Wang, L., Trawick, J.D., Yamamoto, R., Zamudio, C.: Genome-Wide Operon Prediction
in Staphylococcus Aureus. Nucleic Acids Res. 32, 3689–3702 (2004)
31. Staden, R.: Graphic Methods to Determine the Function of Nucleic Acid Sequences.
Nucleic Acids Research, 521–538 (1984)
32. Fields, C.A., Soderlund, C.A.: Gm a practical tool for automating DNA Sequence
Analysis. Comput. Appl. Biosci. 6, 263–270 (1990)
33. Rogozin, I.B., Milanesi, L.: Analysis of Donor Splice Signals in Different Organisms. J.
Mol. Evol. 45, 50–59 (1997)
34. Kleffe, J., Hermann, K., Vahrson, W., Wittig, B., Brendel, V.: Logitlinear Models for the
Prediction of Splice Sites in Plant pre-mRNA Sequences. Nucleic Acid Research 24,
4709–4718 (1996)