KEMBAR78
Computational Approaches | PDF | Promoter (Genetics) | Gene
0% found this document useful (0 votes)
83 views12 pages

Computational Approaches

for gene

Uploaded by

Sandeep Kaushik
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
83 views12 pages

Computational Approaches

for gene

Uploaded by

Sandeep Kaushik
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Computational Approaches for Gene Prediction:

A Comparative Survey

Israa M. Al-Turaiki, Hassan Mathkour,


Ameur Touir, and Saleh Hammami

Department of Computer Science,


College of Computer and Information Sciences,
King Saud University, Riyadh, Saudi Arabia
{ialturaiki,mathkour,touir,shammami}@ksu.edu.sa

Abstract. Accurate gene structure prediction plays a fundamental role in


functional annotation of genes. The main focus of gene prediction methods is to
find patterns in long DNA sequences that indicate the presence of genes. The
problem of gene prediction is an important problem in the field of
bioinformatics. With the explosive growth of genomic information there is a
need for computational approaches that facilitate gene location, structure and
functional prediction. In this paper, we survey various computational
approaches for gene predictions.

Keywords: Bio-informatics, Gene Prediction, Bio-Computation, Promoter


Identification.

1 Introduction
The main focus of gene prediction methods is to find patterns in long DNA sequences
that indicate the presence of genes. The problem of gene prediction is an important
problem in the field of bioinformatics [5]. It presents many difficulties especially in
eukaryotes where exons (coding regions) are interrupted by introns (non-coding
regions).The gene prediction problem can be defined formally as:
Input: DNA sequence X x ,x ,….x Σ where Σ A, T, G, C
Output: correct labeling oof each element of x as coding, noncoding or intergenic
4.
Although researchers continue to devise new algorithms for gene prediction, the
satisfactory accuracy is still far from reach. According to [2], 90% of genes are
accurately predicted at the nucleotide level, 45% at the exon level and only 20% on
the gene level. This explains the reason why the number of genes in the human
genome cannot be precisely estimated (between 30,000 and 100,000). Current gene
prediction methods are based on exploiting two gene features: content sensors and
signal sensors [3]. Content sensors try to distinguish protein coding and non-protein
coding regions. On the other hand, signal sensors try to predict the presence of
functional sites. Many algorithms are used for gene prediction including: dynamic
programming, hidden Markov models (HMMs), artificial neural networks (ANNs)
and linear discriminate analysis.

A. Abd Manaf et al. (Eds.): ICIEIS 2011, Part II, CCIS 252, pp. 14–25, 2011.
© Springer-Verlag Berlin Heidelberg 2011
Computational Approaches for Gene Prediction 15

2 Gene Prediction
Gene prediction problem can be divided into two sub-problems: prediction of protein
coding regions and prediction of functional sites [5, 20]. The development of gene
prediction programs have evolved in four generations [1]. In the first generation the
programs such as GRAIL [10] and TestCode [11], were designed to identify
approximate locations of protein coding regions. According to [1] such programs
were not able to produce precise location predictions. In the second generations,
programs such as SORFIND [12] and Xpound [13] combined splice signal and coding
region identification to predict exons. However, the programs did not attempt to
produce complete genes by assembling the predicted exons. The task of predicting
complete gene structures was addressed by the third generation programs such as
GeneID [14], GeneParser [15] ,GenLang [16] and FGENEH [17]. All those
programs were based on the unrealistic assumption that the input sequence contains
a single complete gene. The fourth generation programs came to solve this
shortcoming and improve the prediction accuracy. They include GENSCAN [18] and
AUGUSTUS[19].
Gene prediction methods rely on exploiting two types of gene information: content
and signal. Contents refer to variable-length features such as exons or introns. On the
other hand, signals are local sites such as splice sites, start and stop codons, branch
points, promoters and terminators polyadenylation sites, ribosomal binding sites and
transcription binding sites. Accordingly, Haussler [21] classified gene prediction
methods into three categories: content sensors, signal sensors and integrated methods.
Content sensors are methods that predict coding and non coding regions. Signal
sensors refer to methods that try to detect the presence of functional sites. Integrated
gene finding methods combine content and signal sensors in order to predict complete
gene structure.
Mathe et al. [3] divided gene finding approaches into finding the evidence for a
gene and combining the evidence in order to predict gene structure. Fig. 1 depicts
Gene Finding Approachs. Content sensors and signal sensors are used to look for the
evidence of a gene. Content sensors are either extrinsic or intrinsic [3]. In extrinsic
content sensors, local alignment methods such as Smith-Waterman, BLAST and
FASTA are used to detect similarity between a sequence and DNA or protein
sequence in a database. This approach is based on observing that exons are more
conserved that introns and intergenic regions. Three types of sequences may be used
in such case: protein sequences, ESTs, or genomic DNA. One limitation in ETS-based
similarity search is that only small parts of a gene correspond to ESTs which makes it
difficult to predict the complete gene structure [1]. The strength of this similarity-
based method lies in fact that prediction is based on accumulated annotated
sequences. The presence of a gene can be detected even with a single match.
However, this method is highly dependent on the content of the database. The
database may contain poor quality sequences or it may not have sequences similar to
the query sequence. Moreover, only half of the genes being discovered are
significantly homologous to database sequences.
16 I.M. Al-Turaiki et al.

Fig. 1. Gene Finding Approachs [4]

Intrinsic content sensors use different measures in order to classify a region as


protein coding. In protein coding regions nucleotide composition and codon
composition is different from non-protein regions. Protein coding regions are (C+G)
rich. While non coding regions are more (A+T) rich. Hexamer frequency is higher in
protein coding regions than in non-protein coding regions and is the most
discriminative feature between coding and non-coding regions. It is one of the main
features used in SORFIND [12], Geneview2 [23] and MZEF[24]. Statistical models
such as HMM and ANN can be used to derive intrinsic features.
Signal sensors try to predict the presence of functional sites. The presence of
functional sites can be detected simply by searching using a consensus sequence
obtained by multiple sequence alignment. This simple signal prediction method is
implemented in SPLIECVIEW [33] and SplicePredictor [34]. Signals may also be
represented using positional weight matrices (PWMs) computed from a multiple
alignment of functionally related sequences. The PWM indicates the probability of a
given base appearing in specific position of the functional sites. Weight matrices
sensors return a score that is the sum of individual residue matches over the siteq
Algorithms which use PWM rely on previously known motif sequences and assume
independence between adjacent nucleotides. PMWs can be regarded as a perceptron
which a simple neural network. They can be also defined as one zero order Markov
model per position. Dependences between adjacent nucleotides can be captured using
higher order Markov models called Weight Array Models (WAMs). To take
insertions and deletions into account, one can use Hidden Markov Models (HMM).
In combining the evidence for gene prediction, rather than looking for independent
exons, gene prediction programs try to identify the complete gene structure [3]. Using
signal sensors, one can detect the presence of translation starts and stops and splice
sites. These signals are the most important signs of the presence of genes. According
to stop and start codons, exon can be classified as single exon gene which starts with a
start codon and ends with a stop codon, an initial exon that begins with a start codon
and ends with a donor site, a terminal exon that starts with an acceptor site and ends
with a stop codon and an internal exon that starts with an acceptor site and ends with a
Computational Approaches for Gene Prediction 17

donor site [6]. Fig. 2 depicts exon classifications. Each of the four exon types present
different challenges and the ability of gene prediction programs to predict each type is
different. Evidences can be combined to predict the complex structure of a gene. "In
theory, each consistent pair of detected signals defines a potential gene region" [3].
If one takes into consideration all possible gene regions to build a gene model,
the number of possible gene models grow exponentially with the number of
predicted exons.

Fig. 2. Exon Classifications [29]

According to [3] the method that try to predict the whole gene structure can be
classified into: extrinsic, intrinsic or integrated. The classification is based on the way
these methods use to asses contents. Programs that follow the extrinsic approach are
based on combining similarity information and signal information. This information is
used to refine region boundaries. Such programs inherit the strengths and weaknesses
of the sensors used. Intrinsic programs try to locate all gene elements in a genomic
sequence. Dynamic programming is used to identify the high scoring gene structures.
Integrated methods combine both extrinsic and intrinsic methods.

2.1 Prokaryotic Gene Prediction

Prokaryotic organism, like bacteria and Archaea, are known to have small genomes
with size ranging from 0.5 to 10 Mbp. They have high gene density, where coding
sequences make more than 90% of the genome. DNA sequences in prokaryotes are
transcribed into mRNA and translated to protein without significant modifications.
The main characteristic of prokaryotic genes is the absence of introns. Several
conserved patterns are found in prokaryotic genes. Gene prediction can make use of
gene regulatory sequences. The start codon of most bacterial genes is ATG. But
sometimes, GTG and TTG are used as start codons. However the start codon alone is
not enough to identify a gene since there could be more than one start codon at the
beginning of a frame. The ribosomal binding site which is located directly after the
transcription initiation site and before the translation start codon is used to help in
locating the start codon. In many bacteria the ribosomal binding site has a consensus
motif of AGGAGGT. At the end of protein coding region a stop codon is found.
There are three possible stop codons and they are easy to identify.
Conventionally, a prokaryotic gene can be identified manually using the longest
open reading frame (ORF) and the major prokaryotic gene signals. One limitation of
this approach is the possibility of missing very small genes. Although prokaryotic
genomes have small intergenic regions and lack introns, genes may overlap and
18 I.M. Al-Turaiki et al.

makes it difficult to predict translation starts [3]. Early gene prediction algorithms
relied on examining the nucleotide distribution in DNA sequences. One method
observes the composition of the third position in a codon. In coding sequences it is
more likely that this position will be G or C. Statistical features of encoding regions
can be analyzed using Markov Models and Hidden Markov Models. This approach
may be statistically limited when there are no enough hexamers. The solution to such
problem is to use an alternative option called interpolated Markov models (IMM). A
program that finds potential genes in microbial DNA – Bacteria and Archaea is
Glimmer [9]. It is based on IMM. Other HMM/IMM programs include ECOPARSE
[7], GENMARK [8] and are able to predict genes with good specificity.

2.2 Eukaryotic Gene Prediction

Gene prediction is more difficult in eukaryotes than in prokaryotes. Eukaryotic


organisms have large genomes with size falling in the range between 10Mbp and
670Gbp. They have low gene density. For example, only 3% of the human genome
actually codes for protein. The structure of eukaryotic genes is more complicated.
Many repetitive sequences appear between genes. A eukaryotic gene consists of
coding regions called exons and non-coding regions called introns. Introns are
removed by splicing mechanism that results in mRNA. Splice junctions have no
strong consensus and it is not clear how exons are identified in long DNA stretches.
The main issue in gene prediction is the identification of exons, introns and functional
sites. Gene prediction distinguishes between coding and non-coding regions by
exploiting many features such as nucleotide composition, codon bias , ployA site
signal and hexamer frequency.

3 Computational Approaches for Gene Prediction


In 1984, Staden [31] developed a program with graphic interface that displayed signal
information and content measures. From the user interface the user could make exon
combinations that likely make a gene. In 1999 a truly automated gene prediction was
developed by Fields and Soderlund [32]. GeneModeler combined many content and
signal measures. It was able to identify all exons that with a score exceeding a certain
threshold and generate all possible genes. Similarly another program, GeneID [14],
was developed to for vertebrate sequences. The main limitation of the programs was
the extremely large number of possible combinations.

3.1 Dynamic Programming

In order to deal with the extremely large number of possible gene models, dynamic
programming methods are used in gene finding applications. In terms of gene finding,
grammatical rules of gene structure are used. They provide constraints on the order on
which gene segments can appear. For example, an internal intron appears between
two exons. Giving a set of rules and scored introns and exons it is possible to scan the
sequence once and determine the highest scoring gene structure. This approach
although guaranteed to produce the highest scoring gene structure it does not always
produce correct predictions. However, researchers can focus on improving the
Computational Approaches for Gene Prediction 19

predicted gene structure and improve the score assigned to them. GeneParser [15] is a
gene prediction program that employs dynamic programming in order to compute
optimal combination of introns and exons.

3.2 Neural Networks

Neural networks are capable of finding complex patterns and relationships between
sequence residues. In gene finding a neural network is composed of input, output and
hidden layers. The input to the network is a gene sequence. The output is the
probability of exon. A neural network is trained using known gene structures. Gene
structure is composed of several features: hexamer frequencies splice sites, GC
composition. The weights of the ANN are adjusted to recognize different patterns.
After training an artificial neural network (ANN) will be able to apply the rules
learned in the training phase to recognize unknown sequences. Fig. 3 depicts ANN for
Gene Prediction in Eukaryotes.

Fig. 3. ANN for Gene Prediction in Eukaryotes [24]

GRAIL is an example of a gene prediction program based on artificial neural


networks. It is a web based program that provides analyses of protein coding regions,
ploy(A) sites, promoters and gene model construction. GRAIL is available at
http://compbio.ornl.gov/Grail-1.3/. GrailEXP is a newer version of GRAIL. It is a
software package that predicts exons, genes, promoters, poly(A)s, CpG islands, EST
similarities, and repetitive elements within DNA sequence.
Fogel et al. [25] developed an approach that uses genetic algorithms to evolve
artificial neural networks (ANNs) for identifying coding and non coding regions in a
DNA sequence. A DNA sequence was obtained using a window of 99 nucleotides.
The ANN was used to classify the nucleotide in the center of the window as coding or
non-coding. The ANN architecture consisted of 9 input nodes corresponding to 9
features, 14 hidden layers and one output node. The output ranged from -1 (non
coding) to 1 (coding). The interconnection weights and biases were evolved using
GA. This approach is superior to traditional ANN since evolving ANNs can adapt to
various inputs with less involvement of the operator.
In [26] [27], ANNs were employed in the prediction of regions that overlap with
the first exon of a gene or exists close to it. First, a promoter finder is used to predict
20 I.M. Al-Turaiki et al.

transcription start site (TSS). Then another system is used to predict the presence of
CpG islands. Then a four-layer ANN is used to predict the presence of a gene based
on the predicted TSS and CpG islands.

3.3 Pattern Discrimination Methods

Discrimination methods are statistical methods used to classify sequences based on


one or more observed patterns. Some gene prediction use linear discrimination
analysis (LDA) or quadratic discrimination analysis (QDA) to improve the accuracy
of prediction. Discrimination methods try to define boundaries between coding and no
coding sequences based on sequence features. Fig. 4 depicts two discriminant
analysis, LDA and QDA. FGENES and MZEF [22] [28] are two program for exon
prediction. The programs are based on LDA and QDA respectively. FGENES is
available at www.softberry.com.

Fig. 4. Two discriminant analysis, LDA and QDA. ▲ coding features; ⊗ noncoding[24]

3.4 Hidden Markov Models

Hidden Markov models are designed to process sequences of data. They are widely
used in computational biology for analyzing DNA and protein sequences. They have
been already used in multiple sequence alignment, motif finding and protein structure
prediction. A Markov model is a one that assumes that the probability that a given
character will appear at a specific position depends on the previous k characters. k is
called the order of Markov model. The model is defined using conditional probability
⁄ where , , , for DNA sequences. In order
to build a Markov model a set of sequences is required to estimate the probabilities.
Using the built Markov model, it is easy to compute the likelihood that a given
sequence is generated by this model. A HMM is made of a number of states each of
which has two parameters: symbol-emission probabilities, and state-transition
probabilities. Symbol-emission probabilities indicate the probability of emitting each
possible symbol from a state. The state-transition probabilities determine the
Computational Approaches for Gene Prediction 21

probabilities of moving from one state to the next. Sequences can be generated by
beginning at some initial state and moving from state to another according to the
state-transition probabilities until an end state is reached. Each state then emits
symbols according to that state's emission probability distribution, creating a sequence
of residues.
There are three attributes of HMMs that led to their popularity in gene finding.
They have intuitive analogy to the gene structure. They have consistent mathematical
formalism on which rigorous analysis can be made. There are many algorithms that
allow for efficient determination of their properties [6]. According to [4] there is no
HMM system that can deal with the size and complexity required in gene finding. A
full gene finding HMM model is divided into three smaller models one for each class
of DNA nucleotides: exons, introns and integenic regions.
The HMM is first trained using the expectation maximization algorithm. It is used
to determine reasonable values for all HMM probabilities. Emission and state
transition probabilities are initialized to random values. Sometimes it is good to use
prior estimations of those probabilities if available. When it receives a DNA
sequences the EM algorithm re-estimate the probabilities. It runs each training
sequence through the model and calculates the posterior probability P s⁄λ for each
sequence s. The P s⁄λ are multiplied to produce P S⁄λ where S is the set of all
sequences. The probabilities are adjusted in order to maximize the value of P S⁄λ .
The procedure continues to run data through the model and adjust probabilities
according to P S⁄λ until the value is maximized. The EM algorithm is guaranteed to
produce the local optimal estimates of all probabilities. After training the HMM it
becomes ready to interpret sequences. For this purpose a dynamic programming
algorithm called Viterbi algorithm is often used. It can efficiently align any sequence
to a trained HMM. It finds the most likely sequence of states of the model for the
given sequence.
GENESCAN [18] is an online gene prediction program based on fifth-order HMM.
The program allows for predicting the locations and exon-intron structures of genes in
genomic sequences from a variety of organisms. It combines hexamer frequencies
with coding signals in prediction. Putative exons are assigned a score of being
true exons. If the score exceeds 0.5 the prediction is considered reliable. This
program has been used in annotating the human genome. It is available at
http://genes.mit.edu/GENSCAN.html.
HMMgene available at http://www.cbs.dtu.dk/services/HMMgene/ is another
online HMM based gene prediction program. It distinguishes coding regions from
non-coding regions using a unique measure called conditional maximum likelihood. If
a sequence contains sub-regions that have been already identified as coding regions
(by similarity search) the regions are locked as coding regions. Then the HMM
prediction is made with bias to locked regions. The program tries to extend locked
regions to predict the remaining coding parts of a gene.

4 Promoter Identification

Promoters are DNA sequences found at the 5 side of a gene. They regulated
transcription by binding regulatory proteins to specific binding sites. Protein
22 I.M. Al-Turaiki et al.

interactions can either induce or repress the transcription process. Traditionally


promoters and regulatory proteins are determined by experimental analysis which is
time consuming and laborious. Computational promoter identification is required to
replace a great deal of experimental analysis. Identifying promoters is a difficult task.
This is mainly due to the shortness of binding sites and their variability. More over
these short sequences can be found in any sequence by chance.
In prokaryotes, transcription is initiated by RNA polymerase. The small unit of
RNA polymerase recognizes specific sequence upstream of a gene called, promoter,
and then allows the larger unit to bind. The promoter includes segments located 35
and 10 base pairs upstream from the transcription start site. For the E. coli the -35 box
has a consensus of TTGACA. The -10 box has a consensus of TATAAT. In addition
to the RNA polymerase, there are other DNA binding proteins that play a
fundamental role in the transcription process. These proteins are called transcription
factors. They bind to specific DNA sequences either to enhance RNA polymerase or
to inhibit its function. The specific DNA sequences to which transcription factors bind
are called regulatory elements.
In eukaryotes, transcription is more complex since there are three different types of
RNA polymerase( , ). Each polymerase transcribes different set of genes. RAN
polymerase II is responsible for transcribing protein coding genes. Unlike prokaryotes
where genes many share a common promoter, each eukaryotic gene has its own
promoter. In eukaryotes, more transcription factors are required to initiate the
transcription process. RNA polymerase requires a dozen or more transcription factors
to recognize and bind to the promoter in a specific order. After that it can bind itself
to the promoter. Many eukaryotic promoters contain a sequence called TATA box
having the consensus TATA(A/T)A. It is located 30 base pairs upstream from
transcription start site. Although not always present in promoters, the TATA box is
often used as an indicator of presence of a promoter.

5 Computational Approaches for Promoter Identification


According to [24] promoter identification algorithms can be either ab initio based or
homology based also called Phylogenetic footprinting. Most algorithms focus on the
prediction of RNA polymerase II promoter and its associated regulatory elements.

5.1 Ab Initio-Based Algorithms

Using this type of algorithms, both eukaryotic and prokaryotic promoters can be
predicted. The prediction process is based on characteristic patterns for promoters and
regulatory elements. The idea of predicting promoters and regulatory sites is based in
match consensus sequence patterns. Ab initio algorithms require training and thus
trained algorithms are species specific. Using ab initio algorithms it is difficult to
predict unknown motifs.
In prokaryotes the most important aspect of promoter prediction is the prediction of
the operon structure. Within an operon linked genes share a common promoter
located upstream from the first gene. Knowing the operon structure, only prediction
of one promoter is required. The most accurate method for operon prediction was
Computational Approaches for Gene Prediction 23

developed in 2004 [30] but it is not yet available as a computer program. The most
widely used promoter prediction program for prokaryotes is BPROM. BPROM is a
bacterial sigma70 promoter recognition program with about 80% accuracy and
specificity. It uses LDA with signal and content information. First operon structure is
predicted using 100 base pairs as a distance for genes in an operon. Then promoter
prediction takes place. The program is available at http://linux1.softberry.com/
berry.phtml?topic=bprom&group=programs&subgroup=gfindb
For eukaryotes a unique feature of eukaryotic promoter, CpG islands, is used to
increase the prediction accuracy. By predicting the presence of CpG islands near the
promoter region overlapping the transcription start site promoters can be traced on the
upstream of the island. CpGProD is a program for predicting promoters containing
high density of GpC islands in mammalian sequences. CpGProD is available either
via a web server, useful for a small dataset, or as a standalone application for a larger
dataset. The online version of the program can be found at http://pbil.univ-
lyon1.fr/software/cpgprod_query.html .

5.2 Phylogenetic Footprinting Algorithms

Promoter and regulatory elements from closely related organisms are highly
conserved. The conservation can be observed at both sequence and element
organization levels. This makes it possible to predict promoters based on comparative
analysis. Phylogenetic footprinting refers to the identification of functionally
important noncoding DNA sequences. These methods can be applied to both
eukaryotic and prokaryotic sequences. In this type of comparative analysis, the choice
of organisms is very important. If the two organisms are closely related (human and
chimpanzee for example) it would be difficult to extract functional elements. On the
other hand, if the evolutionary distance between two organism is very long (for
example between human and fish) detecting promoters and other elements would be
very difficult. A good choice is usually using human and mouse sequences. Using
Phylogenetic footprinting algorithms no training of probabilistic models is required,
thus this approach is widely applicable. It is also possible to discover new regulatory
elements that are common to both organisms. However, this approach puts restrictions
on the evolutionary distance among organisms. ConSite is an online Phylogenetic
footpriniting program. It finds promoters by comparing two sequences using global
alignment algorithm. The program is available at http://asp.ii.uib.no:8090/cgi-
bin/CONSITE/consite/ . Similar programs include rVISTA at http://rvista.dcode.org/
and PromH(W) from www.softberry.com.

6 Conclusion
Gene prediction is an important problem in bioinformatics. Different approaches have
been used to tackle this problem including: dynamic programming, HMM, ANN and
GA. In prokaryotes gene prediction is much easier than in eukaryotes. On the
nucleotide level more than 90% of nucleotides can be classified as coding or non-
coding. But the exact exon boundaries are still very difficult to predict. Some
problems in gene finding are still open. According to [1], it is difficult to locate short
24 I.M. Al-Turaiki et al.

exons because discriminative characteristics are not apparent in short sequences. The
problem is worse when a coding exon is a multiple of three. Missing such exon will
not affect the gene assembly. Another problem that has not been solved effectively is
the alternative splicing which results in genes having more than one possible exon
assembly. Although programs like GENESCAN and MZEF tried to solve the problem
by identifying sub-optimal exons, the problem remains open [1]. Moreover, there is
no current method that can predict overlapping eukaryotic genes. There prediction of
multiple genes in a single sequence is still difficult. With the wide variety of gene
prediction programs there is a need for comprehensive criteria for assessing the
quality of gene prediction programs.

References
1. Wang, Z., Chen, Y., Li, Y.: A Brief Review of Computational Gene Prediction Methods.
Geno. Prot. Bioinfo. 2, 216–221 (2004)
2. Zhang, M.Q.: Computational Prediction of Eukaryotic Protein-Coding Genes. Nature
Reviews Genetics 3, 698–709 (2002)
3. Mathe, C., Sagot, M., Schiex, T., Rouze, P.: Current Methods for Gene Prediction, Their
Strengths and Weakness. Nucleic Acid Research 30, 4103–4117 (2002)
4. Bandyopadhyay, S., Maulik, U., Roy, D.: Gene Identification: Classical and
Computational Intelligence Approaches. IEEE Transactions On Systems, Man, And
Cybernetics—Part C: Applications And Reviews 38, 55–68 (2008)
5. Mount, D.W.: Bioinformatics: Genome and Sequence Analysis. Cold Spring Harbor
Laboratory Press, New York (2004)
6. Stormo, G.D.: Gene-Finding Approaches in Eukaryotes. Genome Research 10, 394–397
(2000)
7. Krogh, A., Mian, I.S., Haussler, D.: A hidden Markov model that finds genes in E. coli
DNA. Nucleic Acids Res. 22, 4768–4778 (1994)
8. Borodovsky, M., McIninch, J.: GENMARK: parallel gene recognition for both DNA
strands. Comput. Chem. 17, 123–133 (1993)
9. Salzberg, S.L., Delcher, A.L., Kasif, S., White, O.: Microbial gene identification using
interpolated Markov models. Nucleic Acids Res. 26, 544–548 (1998)
10. Uberbacher, E.C., Mural, R.J.: Locating protein-coding regions in human DNA sequences
by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88, 11261–
11265 (1991)
11. Fickett, J.W.: Recognition of protein coding regions in DNA sequences. Nucleic Acids
Res. 10, 5303–5318 (1982)
12. Hutchinson, G.B., Hayden, M.R.: The prediction of exons through an analysis of
spliceable open reading frames. Nucleic Acids Res. 20, 3453–3462 (1992)
13. Thomas, A., Skolnick, M.H.: A probabilistic model for detecting coding regions in DNA
sequences. IMA J. Math. Appl. Med. Biol. 11, 149–160 (1994)
14. Guigo, R., Knudsen, S., Drake, N., Smith, T.: Prediction of gene structure. J. Mol.
Biol. 226, 141–157 (1992)
15. Snyder, E.E., Stormo, G.D.: Identification of coding regions in genomic DNA sequences.
J. Mol. Biol. 248, 1–18 (1995)
16. Dong, S., Searls, D.B.: Gene structure pre-diction by linguistic methods. Genomics 23,
540–551 (1994)
Computational Approaches for Gene Prediction 25

17. Solovyev, V.V., Salamov, A.A., Lawrence, C.B.: Predicting internal exons by
oligonucleotide composition and discriminate analysis of spliceable open reading frames.
Nucleic Acids Res. 22, 5156–5163 (1994)
18. Burge, C., Karlin, S.: Prediction of Complete Gene Structure in Human Genomic DNA. J.
Mol. Biol. 268, 78–94 (1997)
19. Stanke, M., Waack, S.: Gene Prediction With A Hidden Markov Model and A New Intron
Submodel. Bioinformatics 19, 215–225 (2003)
20. Burge, C., Karlin, S.: Finding the Genes in Genomic DNA. Curr. Opin. Struct. Biol. 8,
346–354 (1998)
21. Haussler, D.: Computational Genefinding. Trends Biochem. Sci., 12–15 (1998)
22. Zhang, M.Q.: Identifcation of protein coding regions in the human genome by quadratic
discriminate analysis. Proc. Natl. Acad. Sci. USA 94, 565–568 (1997)
23. Milanesi, L., Kolchanov, N.A., Rogozin, I.B., Ischenko, I.V., Kel, A.E., Orlov, Y.L.,
Ponomarenko, M.P., Vezzoni, P.: GenView: a computing tool for protein-coding regions
prediction in nucleotide sequences. In: Second International Conference on
Bioinformatics, Supercomputing and Complex Genome Analysis, pp. 573–588. World
Scientific Publishing, Singapore (1993)
24. Xiong, J.: Essential Bioinformatics. Cambridge University Press, New York (2006)
25. Fogel, D.B., Chellapilla, K., Fogel, D.B.: Identification of Coding Regions in DNA
Sequences Using Evolved Nueral Networks. In: Fogel, G.B., Corne, D.W. (eds.)
Evolutionary Computation is Bioinformatics, pp. 195–218. Morgan Kaufmann, USA
(2003)
26. Bajic, V.B., Seah, S.H.: Dragon gene start finder: An advanced system for finding
approximate locations of the start of gene transcriptional units. Genome Res. 13, 1923–1929
(2003)
27. Bajic, V.B., Seah, S.H.: Dragon gene start finder identifies approximate locations of the 5‘
ends of genes. Nucleic Acids Res. 31, 3560–3563 (2003)
28. http://www.cshl.edu/OTT/html/mzef.html
29. http://www.geneprediction.org/book/overview.pdf
30. Wang, L., Trawick, J.D., Yamamoto, R., Zamudio, C.: Genome-Wide Operon Prediction
in Staphylococcus Aureus. Nucleic Acids Res. 32, 3689–3702 (2004)
31. Staden, R.: Graphic Methods to Determine the Function of Nucleic Acid Sequences.
Nucleic Acids Research, 521–538 (1984)
32. Fields, C.A., Soderlund, C.A.: Gm a practical tool for automating DNA Sequence
Analysis. Comput. Appl. Biosci. 6, 263–270 (1990)
33. Rogozin, I.B., Milanesi, L.: Analysis of Donor Splice Signals in Different Organisms. J.
Mol. Evol. 45, 50–59 (1997)
34. Kleffe, J., Hermann, K., Vahrson, W., Wittig, B., Brendel, V.: Logitlinear Models for the
Prediction of Splice Sites in Plant pre-mRNA Sequences. Nucleic Acid Research 24,
4709–4718 (1996)

You might also like