PROJECT REPORT
On
Language detection with machine learning
Submitted to Centurion University of Technology & Management
in partial fulfillment of the requirement for award of the degree of
B. TECH
in
COMPUTER SCIENCE & ENGINEERING
Submitted By
M.Syamla
Holding university registration number
210101120103
Under the Guidance of
MS. ARYALOPA MALLA
DEPT. OF COMPUTER SCIENCE & ENGINEERING
SCHOOL OF ENGINEERING & TECHNOLOGY,
CUTM, Paralakhemundi-761211
CERTIFICATE
This is to certify that the project entitled “Language
detection with machine learning”, submitted for the
Bachelor of Technology in Computer Science Engineering
of School of Engineering & Technology, CUTM,
Paralakhemundi during the academic year 2022-2023, is a
persuasive piece of project work carried out by
“M.SYAMLA” towards the partial fulfillment for award of
the degree (B.Tech.) under the guidance of “MS.
ARYALOPA MALLA”, and that no part thereof has been
submitted by them for any degree, to the best of my
knowledge.
Signature of Candidate Signature of Project Guide
Name of the Candidate Name of the Guide
EVALUATION SHEET
1. Title of the Project: Language detection with
machine learning
2. Year of submission: 2023
3. Name of the degree: B. TECH (C.S.E)
4. Date of Examination / Viva:
5. Student Name: M.SYAMLA
6. Reg No: 210101120103
Name of the Guide: MS. ARYALOPA MALLA
[APPROVED/REJECTED]
Signature of Project Guide
CANDIDATE’S DECLARATION
I, “M.SYAMLA”, B. Tech CSE (Semester-IV) of School
of Engineering & Technology, CUTM, Paralakhemundi,
hereby declare that the Project Report entitled “L”
is an original work and that the data provided in
the study is authentic. This report has not been
submitted to any other Institute for the award of
any other degree by me.
Signature of Student
INDEX
SL.NO  CONTENT                                              PAGE.NO.
01     Abstract                                             06
02     Introduction                                         07
03     What is language detection                           08
04     Use-cases                                            09
05     Installation and importing of libraries              11-12
06     Dataset                                              12-13
07     Importing the dataset                                13-14
08     Differentiating independent from dependent features  15
09     Performing label encoding                            15
10     Text preparation                                     15-16
11     CountVectorizer                                      16-17
12     Model evaluation                                     18-19
13     Visualization                                        20-22
14     Conclusion                                           22
Abstract
Language detection is an essential task in natural
language processing (NLP) and has numerous
applications in various fields, such as text
classification, machine translation, and speech
recognition. This report presents a machine
learning approach for language detection that
represents text with a bag-of-words model and
classifies it with a Multinomial Naive Bayes
model. We train and evaluate the model on a
Kaggle dataset of 10,337 text samples in 17
languages. The model identifies the language of
the input text with an accuracy of about 97% on
the held-out test set. The method provides a
practical solution for language detection in
real-world applications and can be extended to
support additional languages.
Introduction
Recently, a wide range of sectors (e.g.,
engineering, education, healthcare, finance,
and media) have shown great interest in
machine learning (ML). ML’s appeal has
largely been attributed to its ability to make
decisions without human intervention. One
common area of ML is natural language
processing (NLP), and in this report we build a
model that takes a text input and predicts the
language it is written in. The task of
determining the language of a text or document
is known as language detection. It was
difficult to identify languages using machine
learning when little data was available about
them. Now that data is easily accessible, there
are a number of effective machine learning
models for language detection.
What is language detection?
Language identification is the initial stage in
any pipeline for text analysis or natural
language processing. If the language of a
document is determined incorrectly, all
subsequent language-specific models will yield
wrong results. For example, when an English
language analyzer is applied to a French
document, the errors made at this step
accumulate and lead to inaccurate conclusions.
The language of each document, and of any
elements written in another language, needs to
be identified. The language used in documents
varies widely depending on the nation and
culture.
Use-Cases
Monolingual chatbots: When a user
starts speaking in a particular language,
a bot must be able to recognize that
language even if it has not been trained
to carry on a conversation in it.
Spam filtering: Spam filtering systems
that support many languages must
identify the language in which emails,
online comments, and other input are
written before applying the actual spam
filtering algorithms. Without this
identification, internet platforms cannot
efficiently remove content from
countries, regions, or locations
suspected of producing spam.
Recognizing the language used in emails
and chats: Language detection
identifies the language of a text as well
as the words and sentences where the
language changes. It is frequently used
for business messages (chats, emails,
and so on), which may be written in a
variety of languages.
Linguistic blending: Some people are
used to holding bilingual conversations.
Hinglish, a blend of Hindi and English
used in India, is a good illustration. In
these situations, a language detection
model examines how many words of a
sentence belong to each language. The
language with the most words serves as
the primary language of the interaction,
while the secondary language is also
recognized and receives a high
confidence score in the ranking.
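The word-counting idea described for mixed-language text can be sketched in a few lines. This is only an illustration: the tiny word lists in LEXICONS and the function name rank_languages are hypothetical and are not part of the project, and a real detector would rely on a trained model rather than hand-written lexicons.

```python
from collections import Counter

# Hypothetical, hand-written word lists used only for illustration.
LEXICONS = {
    "English": {"i", "am", "going", "to", "the", "market", "today"},
    "Hindi": {"main", "aaj", "bazaar", "ja", "raha", "hoon", "kal"},
}


def rank_languages(sentence):
    """Return (language, share of recognized words) pairs, highest share first."""
    counts = Counter()
    for word in sentence.lower().split():
        for language, lexicon in LEXICONS.items():
            if word in lexicon:
                counts[language] += 1
    total = sum(counts.values())
    if total == 0:
        return []  # no word was recognized in any lexicon
    ranking = [(language, count / total) for language, count in counts.items()]
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)


ranking = rank_languages("main aaj market ja raha hoon")
print(ranking)
```

In this sentence five of the six recognized words are in the Hindi list and one is in the English list, so Hindi is ranked as the primary language and English as the secondary one.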
With this settled, let’s get our hands dirty by
building a model that can predict the language
of a given text.
Installation and importing of libraries
We will import all of the necessary libraries
first. If you don’t have them installed already,
install them before moving on.
import re
import warnings
warnings.simplefilter("ignore")
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
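Before running the imports, it can help to check which libraries are missing. The following is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are assumptions made for this example.

```python
import importlib.util

# Import names used in this report. Note that the import name "sklearn"
# is provided by the package called "scikit-learn".
REQUIRED = ["pandas", "numpy", "seaborn", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Any missing library can typically be installed with pip, for example: pip install pandas numpy seaborn matplotlib scikit-learn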
Dataset
We will make use of a small language
detection dataset from Kaggle. The dataset
contains text samples in 17 different
languages, and we will use it to build an NLP
model that predicts which of these languages a
text is written in.
Languages: English, Malayalam, Hindi, Tamil,
Kannada, French, Spanish, Portuguese,
Italian, Russian, Swedish, Dutch, Arabic,
Turkish, German, Danish, and Greek.
We must build a model that can predict the
language of a given text. This provides a
solution for many computational linguistics
and artificial intelligence applications.
Prediction models of this kind are frequently
used for machine translation on robots as well
as on electronic devices like mobile phones and
laptops. They also help in managing and
locating multilingual documents. NLP remains
an active field of research.
Importing the dataset
df = pd.read_csv("Language Detection.csv")
df.head()
This dataset has 10,337 rows and two columns
(Text and Language), covering 17 distinct
languages. We can quickly count the number of
rows for each language.
df["Language"].value_counts()
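To make clear what this count represents, here is a small standard-library sketch that computes the same kind of per-language row count. The three-row sample is made up for illustration and only mimics the two-column layout (Text, Language) of the dataset.

```python
import csv
import io
from collections import Counter

# A made-up three-row sample in the same two-column layout (Text, Language).
sample_csv = io.StringIO(
    "Text,Language\n"
    "Nature is beautiful,English\n"
    "La nature est belle,French\n"
    "The sky is blue,English\n"
)

rows = list(csv.DictReader(sample_csv))
language_counts = Counter(row["Language"] for row in rows)

# most_common() lists the labels from most to least frequent.
for language, count in language_counts.most_common():
    print(language, count)
```

A strongly uneven count per language would be worth noting, because a classifier tends to favor the languages it has seen most often.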
Differentiating Independent from
dependent features
The dependent variable, in this case, is the
name of the language (y), and the
independent variable is text data (X), which
we can now separate from each other.
X = df["Text"]
y = df["Language"]
Performing label encoding
Our output variable, the language name, is a
categorical variable. We need to convert it into
numerical form for training the model, so we
perform label encoding on it. We import
LabelEncoder from sklearn for this procedure.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
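The idea behind label encoding can be sketched in plain Python: each distinct label, taken in sorted order, is assigned an integer code, and the mapping can be reversed later to recover the language name. The sample labels below are made up, and this is an illustration of the principle rather than scikit-learn's actual implementation.

```python
labels = ["French", "English", "Hindi", "English", "French"]

# Sorted unique labels; the position of each label is its integer code.
classes = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(classes)}

# Encode the labels as integers, then decode them back to names.
encoded = [label_to_code[label] for label in labels]
decoded = [classes[code] for code in encoded]

print(classes)
print(encoded)
print(decoded)
```

The reverse mapping matters later in the report, where the numeric prediction of the model is turned back into a language name with le.inverse_transform.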
Text preparation
This dataset contains a lot of
irrelevant/unwanted symbols and numbers
that may degrade the performance of our
model, thus text preparation is required.
text_list = []
for text in X:
    # Replace symbols, newlines, and digits with a space
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # Replace square brackets with a space
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    text_list.append(text)
In the code above, we create an empty
list text_list to hold the preprocessed text.
We then iterate through all the text in X,
replace the symbols and numbers with spaces,
convert the text to lowercase, and finally
append it to text_list.
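The same cleaning steps can be wrapped in a small function and tried on a single string, which makes the effect easy to inspect. This is a sketch: the function name clean_text and the sample sentence are assumptions for illustration, and the patterns assume that the intent is to drop newlines and square brackets.

```python
import re

# Symbols, newlines, and digits to replace with a space.
SYMBOLS_AND_DIGITS = re.compile(r'[!@#$(),\n"%^*?:;~`0-9]')
# Opening and closing square brackets.
SQUARE_BRACKETS = re.compile(r'[\[\]]')


def clean_text(text):
    """Replace unwanted symbols and digits with spaces, then lowercase."""
    text = SYMBOLS_AND_DIGITS.sub(" ", text)
    text = SQUARE_BRACKETS.sub(" ", text)
    return text.lower()


cleaned = clean_text("Hello, World! [2023]")
print(cleaned.split())
```

Because each removed character is replaced by a space, the cleaned string can contain runs of spaces. This is harmless here, since the vectorizer used in the next step splits text into word tokens.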
CountVectorizer
Both the input and the output features must
take the form of numbers. We will use the
CountVectorizer’s Bag of Words model to
convert text into numerical form.
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(text_list).toarray()
X.shape
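The output should be a shape of the form
(10337, 39419): one row per text sample and one
column per distinct word in the vocabulary. The
exact number of columns depends on the
preprocessing.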
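To show what the bag of words representation contains, here is a simplified pure-Python sketch. It lowercases the text, keeps tokens of two or more word characters (similar to CountVectorizer's default token pattern), and counts each vocabulary word per text. The function names and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter

# Tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def bag_of_words(texts):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in text i."""
    vocabulary = sorted({token for text in texts for token in tokenize(text)})
    column = {token: j for j, token in enumerate(vocabulary)}
    matrix = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token, count in Counter(tokenize(text)).items():
            row[column[token]] = count
        matrix.append(row)
    return vocabulary, matrix


vocabulary, matrix = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)
print(matrix)
```

Most entries of such a matrix are zero, so with 10,337 rows and tens of thousands of columns the dense array produced by .toarray() is large. scikit-learn's MultinomialNB also accepts the sparse matrix returned by fit_transform directly, which can save a great deal of memory.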
Train Test split
Our input and output variables have been
preprocessed, so the next stage is to split our
dataset into training and test data. The
training set is used to train the model and the
test set is used to evaluate it. We will use
train_test_split for this procedure.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
The test set is 20% of the data; the remaining
80% is used for training.
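The principle of the split can be sketched without any library: shuffle the row indices, then hold out a fraction of the rows for testing. The function simple_train_test_split below is a hypothetical stand-in written for illustration, not scikit-learn's implementation, and the demo data is made up.

```python
import random


def simple_train_test_split(X, y, test_size=0.20, seed=42):
    """Shuffle the row indices, then hold out a fraction of rows for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [X[i] for i in train_idx]
    x_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


X_demo = [[i] for i in range(10)]
y_demo = list(range(10))
x_train, x_test, y_train, y_test = simple_train_test_split(X_demo, y_demo)
print(len(x_train), len(x_test))
```

The fixed seed makes the sketch reproducible. In scikit-learn, passing random_state to train_test_split serves the same purpose; without it, each run produces a different split.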
Training and prediction of models
The process of creating the model is almost
complete. We use the Multinomial Naive Bayes
algorithm to build our model, which is then
trained on the training set.
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(x_train, y_train)
We used the training set to train our model.
Next, we predict the results for the test set.
y_prediction = model.predict(x_test)
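For readers who want to see what Multinomial Naive Bayes computes, the following is a minimal from-scratch sketch with Laplace smoothing (alpha = 1.0). The function names and the toy word counts are assumptions made for illustration; this is not scikit-learn's code.

```python
import math


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    classes = sorted(set(y))
    n_features = len(X[0])
    log_prior, log_likelihood = {}, {}
    for c in classes:
        rows = [row for row, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # Total count of each word in class c, plus alpha (Laplace smoothing).
        word_totals = [sum(row[j] for row in rows) + alpha for j in range(n_features)]
        denominator = sum(word_totals)
        log_likelihood[c] = [math.log(total / denominator) for total in word_totals]
    return classes, log_prior, log_likelihood


def predict_multinomial_nb(fitted, row):
    """Return the class with the highest log posterior for one count vector."""
    classes, log_prior, log_likelihood = fitted
    scores = {
        c: log_prior[c] + sum(n * lp for n, lp in zip(row, log_likelihood[c]))
        for c in classes
    }
    return max(scores, key=scores.get)


# Toy word counts; the columns stand for the words ["the", "is", "le", "est"].
X_toy = [[2, 1, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]]
y_toy = ["English", "English", "French", "French"]
nb_fitted = fit_multinomial_nb(X_toy, y_toy)
print(predict_multinomial_nb(nb_fitted, [1, 0, 0, 0]))
print(predict_multinomial_nb(nb_fitted, [0, 0, 1, 1]))
```

The smoothing term keeps a word that never occurred in a class from driving that class's probability to zero, which is important when the vocabulary is large and most words appear in only one language.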
Model evaluation
After training and prediction are complete, the
next step is to evaluate the model.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
accuracy = accuracy_score(y_test, y_prediction)
confusion_m = confusion_matrix(y_test, y_prediction)
print("The accuracy is :", accuracy)
We got an accuracy of about 97%. The exact
value can vary slightly between runs because
the train/test split is random.
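Both metrics are simple to compute by hand, which helps when reading the results. The sketch below uses made-up labels for three classes; the function names are chosen for this example only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion(y_true, y_pred, n_classes):
    """matrix[i][j] = number of samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 1, 1, 1, 2, 0]
print(accuracy(y_true_demo, y_pred_demo))
print(confusion(y_true_demo, y_pred_demo, 3))
```

The diagonal of the confusion matrix holds the correct predictions, so the accuracy equals the sum of the diagonal divided by the number of samples. Off-diagonal entries show which languages are confused with each other.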
Visualization
Using the seaborn heatmap, let’s plot the
confusion matrix for the purpose of
visualization.
plt.figure(figsize=(15, 10))
sns.heatmap(confusion_m, annot=True)
plt.show()
Let’s try out the model prediction using text
from several languages. We will write a
function that will take in the text as input and
predict the language in which the text is
written.
def lang_predict(text):
    x = cv.transform([text]).toarray()
    lang = model.predict(x)
    lang = le.inverse_transform(lang)
    print("The language is in", lang[0])
Here, cv is the CountVectorizer that converts
the text into a bag of words vector, and the
variable lang stores the predicted language,
which we then print to the user.
To test this, we call the lang_predict()
function, pass any piece of text into it, and
let it predict the language.
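The following self-contained sketch shows such a call end to end. It assumes scikit-learn is installed and trains on a handful of made-up sentences instead of the Kaggle dataset, so that it runs on its own. Unlike the function above, this variant returns the language name instead of printing it, which makes the result easier to reuse.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Made-up training sentences, used only to keep the example self-contained.
texts = [
    "the cat is on the table",
    "where is the station",
    "le chat est sur la table",
    "la gare est ici",
    "el gato duerme en la mesa",
    "la casa es grande",
]
languages = ["English", "English", "French", "French", "Spanish", "Spanish"]

le = LabelEncoder()
y = le.fit_transform(languages)
cv = CountVectorizer()
X = cv.fit_transform(texts)  # sparse matrix; MultinomialNB accepts it directly
model = MultinomialNB()
model.fit(X, y)


def lang_predict(text):
    """Return the predicted language name for a piece of text."""
    x = cv.transform([text])
    return le.inverse_transform(model.predict(x))[0]


print(lang_predict("the cat is here"))
print(lang_predict("le chat est ici"))
```

Words that the vectorizer has never seen (such as "here" in the first call) are simply ignored, so the prediction rests on the known words only.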
Conclusion
We have come to the end of this report. I hope
you now have a better understanding of how
to predict language using machine learning.
The data has to be evaluated and then
preprocessed as necessary. The text data is
then represented using a bag of words model.
In order to make accurate predictions in NLP,
text extraction and vectorization are crucial
tasks. For this text classification problem,
Naive Bayes proved to be an effective model,
reaching an accuracy of about 97%.