DNA Sequencing and Applying a Classifier with ML
INTRODUCTION
In biomedical informatics research, genetic
sequences are widely used as the input for
classification tasks. One application area of ML is
bioinformatics, an interdisciplinary science that
uses computer and information science to
understand biological data. One of its most
difficult tasks is to distinguish between normal
genes and disease-causing genes.
The classification of gene sequences into
existing categories is used in genomic
research to discover the functions of novel
proteins. As a result, it is critical to identify
and categorize such genes. We employ ML
classification methods to distinguish
between disease-causing and normal genes.
I will apply a classification model that
can predict a gene's function from its
DNA coding sequence alone.
You will need a few libraries
such as numpy, pandas, and scikit-learn.
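A minimal import sketch for the steps that follow (matplotlib is assumed for the class-balance plot later):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt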
I will load the human data and read it,
so that we have human DNA coding-region
sequences together with a class label.
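For illustration, a sketch of reading the human data with pandas; the file name human_data.txt and the tab-separated layout with 'sequence' and 'class' columns are assumptions, not confirmed by the slides:

# Assumed file name; tab-separated file with 'sequence' and 'class' columns.
human_data = pd.read_table('human_data.txt')
print(human_data.head())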
I also load and read data
for chimpanzee and for a more
divergent species, the dog.
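A matching sketch for the other two species; the file names are again assumptions:

# Assumed file names; same layout as the human data.
chimp_data = pd.read_table('chimp_data.txt')
dog_data = pd.read_table('dog_data.txt')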
Here are the definitions for each of
the 7 classes and how many there are
in the human training data. They are
gene sequence function groups.
Since the sequences are not of equal length, we will
apply k-mer counting to the complete sequences
using a getKmers function, as sketched below.
Now our coding sequence data is
converted to lowercase and split into all
possible overlapping k-mer words of length 6.
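A sketch of what such a getKmers helper could look like and how it might be applied; the column names 'sequence' and 'words' are assumptions:

def getKmers(sequence, size=6):
    # Break a sequence into overlapping, lowercase k-mers of the given size.
    return [sequence[i:i + size].lower() for i in range(len(sequence) - size + 1)]

# Turn each coding sequence into a list of 6-mers and drop the raw sequence.
human_data['words'] = human_data['sequence'].apply(lambda seq: getKmers(seq))
human_data = human_data.drop('sequence', axis=1)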
Since we are going to use scikit-learn
natural language processing tools to
do the k-mer counting, we now need
to convert the lists of k-mers for
each gene into string sentences of
words that the count vectorizer can
use.
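A one-line sketch of that conversion, assuming the 'words' column from above:

# Join each gene's list of k-mers into one space-separated 'sentence'.
human_texts = [' '.join(words) for words in human_data['words']]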
We can also make a y variable
to hold the class labels.
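For example (the column name 'class' is an assumption):

# Class labels for the human data.
y_human = human_data['class'].values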
We will perform the same
steps for chimpanzee and dog
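A sketch of repeating the same preprocessing for the two other species, reusing the assumed names from above:

for df in (chimp_data, dog_data):
    # Same k-mer conversion as for the human data.
    df['words'] = df['sequence'].apply(lambda seq: getKmers(seq))
    df.drop('sequence', axis=1, inplace=True)

chimp_texts = [' '.join(words) for words in chimp_data['words']]
dog_texts = [' '.join(words) for words in dog_data['words']]
y_chimp = chimp_data['class'].values
y_dog = dog_data['class'].values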
We will apply the Bag of Words model
with CountVectorizer, an NLP tool.
This is equivalent to k-mer counting.
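A sketch of the Bag of Words step; fitting the vectorizer on the human sentences and only transforming the chimpanzee and dog sentences is an assumption about the workflow, and the 4-gram range reflects the tuning mentioned later:

from sklearn.feature_extraction.text import CountVectorizer

# Count 4-grams of k-mer 'words' (ngram size from the parameter tuning below).
cv = CountVectorizer(ngram_range=(4, 4))
X = cv.fit_transform(human_texts)
X_chimp = cv.transform(chimp_texts)
X_dog = cv.transform(dog_texts)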
If we have a look at the class balance, we can
see we have a relatively balanced dataset.
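One way to check this, assuming the 'class' column:

# How many samples fall into each of the 7 classes.
print(human_data['class'].value_counts().sort_index())
human_data['class'].value_counts().sort_index().plot.bar()
plt.show()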
Splitting the human dataset into the
training set and test set.
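A sketch of the split; the 80/20 ratio and the random seed are assumptions:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y_human, test_size=0.20, random_state=42)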
A multinomial naive Bayes classifier will be
created. I previously did some parameter
tuning and found that an n-gram size of 4
(reflected in the CountVectorizer() instance)
and a model alpha of 0.1 did the best.
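A sketch of fitting the classifier with the tuned alpha:

from sklearn.naive_bayes import MultinomialNB

# alpha=0.1 from the parameter tuning described above.
classifier = MultinomialNB(alpha=0.1)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)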
Let's look at some model
performance metrics such as the
confusion matrix, accuracy,
precision, recall, and F1 score.
We are getting really good
results on our unseen data.
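A sketch of computing those metrics on the held-out human test set; weighted averaging for the multiclass scores is an assumption:

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

print(confusion_matrix(y_test, y_pred))
print('accuracy  =', accuracy_score(y_test, y_pred))
print('precision =', precision_score(y_test, y_pred, average='weighted'))
print('recall    =', recall_score(y_test, y_pred, average='weighted'))
print('f1        =', f1_score(y_test, y_pred, average='weighted'))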
THANK YOU