PROJECT REPORT For Machine Learning

This document is a project report submitted by M. Syamla to Centurion University of Technology & Management for their B.Tech degree in partial fulfillment of the degree requirements. The project aims to build a machine learning model for language detection using text data from documents in 17 different languages. The document outlines the introduction, use cases, dataset, text preprocessing steps, and model evaluation process for a language detection model built using machine learning techniques.

Uploaded by

Swapna Ray

PROJECT REPORT
On
Language detection with machine learning
Submitted to Centurion University of Technology & Management
in partial fulfillment of the requirement for award of the degree of

B. TECH
in
COMPUTER SCIENCE & ENGINEERING
Submitted By

M.Syamla

Holding university registration number

210101120103

Under the Guidance of

MS. ARYALOPA MALLA

DEPT. OF COMPUTER SCIENCE & ENGINEERING

SCHOOL OF ENGINEERING & TECHNOLOGY,

CUTM, Paralakhemundi-761211

CERTIFICATE

This is to certify that the project entitled “Language detection with machine learning”, submitted for the Bachelor of Technology in Computer Science Engineering of the School of Engineering & Technology, CUTM, Paralakhemundi during the academic year 2022-2023, is a bona fide piece of project work carried out by “M.SYAMLA” towards the partial fulfillment for award of the degree (B.Tech.) under the guidance of “MS. ARYALOPA MALLA”, and that no part thereof has been submitted by them for any degree to the best of my knowledge.

Signature of Candidate Signature of Project Guide

Name of the Candidate Name of the Guide

EVALUATION SHEET
1. Title of the Project: Language detection with machine learning
2. Year of submission: 2023
3. Name of the degree: B. TECH (C.S.E)
4. Date of Examination / Viva:
5. Student Name: M.SYAMLA
6. Reg No: 210101120103

Name of the Guide: MS. ARYALOPA MALLA

[APPROVED/REJECTED]

Signature of Project Guide

CANDIDATE’S DECLARATION

I, “M.SYAMLA”, B. Tech CSE (Semester-IV) of School of Engineering & Technology, CUTM, Paralakhemundi, hereby declare that the Project Report entitled “L” is an original work and that the data provided in the study is authentic. This report has not been submitted to any other Institute for the award of any other degree by me.

Signature of Student

INDEX

SL.NO  CONTENT

01     Abstract
02     Introduction
03     What is language detection
04     Use-cases
05     Installation and importing of libraries
06     Dataset
07     Importing the dataset
08     Differentiating independent from dependent features
09     Performing label encoding
10     Text preparation
11     CountVectorizer
12     Model evaluation
13     Visualization
14     Conclusion

Abstract

Language detection is an essential task in natural language processing (NLP) and has numerous applications in various fields, such as text classification, machine translation, and speech recognition. This paper proposes a machine learning approach for language detection that utilizes character n-grams and word n-grams as features. We train and evaluate several models using a large dataset of text documents in multiple languages. Our results demonstrate that the proposed approach achieves high accuracy in identifying the language of the input text, outperforming previous state-of-the-art methods. The proposed method provides a practical solution for language detection in real-world applications and can be easily extended to support additional languages.

Introduction

Recently, a wide range of sectors (e.g., engineering, education, healthcare, finance, and media) have shown a lot of interest in machine learning (ML). ML’s attractiveness has largely been attributed to its ability to make decisions without human intervention. One common application area of ML is natural language processing (NLP), and in this report we create a model that takes a text input and predicts the language it is written in. The technique of determining the language of a text or document is known as language detection. Identifying languages with machine learning was difficult when little data was available about them. Now that data is so easily accessible, there are a number of effective machine learning models for language detection.

What is language detection?

Language identification is the initial stage in any pipeline for text analysis or natural language processing. If the language of a document is determined incorrectly, all subsequent language-specific models will yield wrong results. For example, when an English-language analyzer is applied to a French document, the errors made at this step propagate through the analysis and lead to inaccurate conclusions. Each document’s language, and any passages written in another language, need to be identified. The language used in documents varies widely depending on the nation and culture.

Use-Cases

• Monolingual chatbots: When a user starts speaking in a particular language, a bot must be able to recognize that language even if it has not been trained to carry on a conversation in it.

• Spam filtering: Spam filtering systems that support many languages must identify the language in which emails, online comments, and other input are written before applying the actual spam-filtering algorithms. Without this identification, internet platforms cannot efficiently remove content from countries or regions suspected of generating spam.

• Recognizing the language used in emails and chats: Language detection identifies the language of a text as well as the words and sentences where the language changes. It is frequently used because business messages (chats, emails, and so on) may be written in a variety of languages.

• Linguistic blending: Some people are used to holding bilingual conversations. Hinglish, a blend of Hindi and English used in India, is a good illustration of this. In these situations, a language detection model counts the words in a sentence that belong to each language. The language with the most words serves as the primary language for the interaction, while the secondary language is also recognized and receives a high confidence score in the ranking.
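
The word-count ranking described for mixed-language text can be sketched roughly as follows. This is only an illustration: the tiny word lists are hypothetical stand-ins for a real lexicon or trained model, and rank_languages is an assumed helper name that does not appear in the project.

```python
from collections import Counter

# Hypothetical toy lexicons; a real system would use a trained model or full dictionaries.
LEXICONS = {
    "English": {"i", "am", "going", "to", "the", "market", "today"},
    "Hindi": {"main", "aaj", "ja", "raha", "hoon", "bazaar"},
}

def rank_languages(sentence):
    """Return (language, share) pairs, sorted by the share of recognized words."""
    counts = Counter()
    for word in sentence.lower().split():
        for language, vocab in LEXICONS.items():
            if word in vocab:
                counts[language] += 1
    total = sum(counts.values())
    if total == 0:
        return []  # no word was recognized in any lexicon
    return [(language, n / total) for language, n in counts.most_common()]

# A Hinglish-style sentence: five Hindi words and one English word.
ranking = rank_languages("main aaj market ja raha hoon")
print(ranking)
```

The first entry of the ranking plays the role of the primary language, and the remaining entries are the secondary languages with their shares.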

With the use cases settled, let’s get our hands dirty by building a model that predicts the language of a given text.

Installation and importing of libraries

We first import all of the necessary libraries. If they are not already installed, install them before moving on.

import re                           # regular expressions for text cleaning
import warnings
warnings.simplefilter("ignore")     # silence library warnings
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Dataset

We use a small language detection dataset from Kaggle. It contains text samples in 17 different languages, and we use it to build an NLP model that predicts which of these languages a text is written in.

Languages: English, Malayalam, Hindi, Tamil, Kannada, French, Spanish, Portuguese, Italian, Russian, Swedish, Dutch, Arabic, Turkish, German, Danish, and Greek.

We must build a model that predicts the language of a given text. This is a building block for many computational linguistics and artificial intelligence applications. Prediction algorithms of this kind are frequently used for machine translation on robots as well as on electronic devices such as mobile phones and laptops. They also help in managing and locating multilingual documents. NLP remains an active research field.

Importing the dataset

# Load the Kaggle CSV into a DataFrame and preview the first rows.
df = pd.read_csv("Language Detection.csv")
df.head()

This dataset has 10,337 rows and two columns, with text samples in 17 distinct languages. We can quickly count the samples for each language.

df["Language"].value_counts()

Differentiating independent from dependent features

The dependent variable, in this case, is the name of the language (y), and the independent variable is the text data (X). We now separate them from each other.

X = df["Text"]        # independent feature: raw text from the DataFrame loaded above
y = df["Language"]    # dependent feature: language name

Performing label encoding

Our output variable consists of language names, so it is a categorical variable. Because the model needs numerical targets for training, we apply label encoding to it. We import LabelEncoder from sklearn for this procedure.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)    # each language name becomes an integer code
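
To make the encoding concrete, here is a minimal sketch on four toy labels (the label list is an illustrative assumption, not taken from the dataset). It shows that the encoder sorts the class names and that inverse_transform recovers the original names.

```python
from sklearn.preprocessing import LabelEncoder

languages = ["English", "French", "English", "Hindi"]  # toy labels for illustration

le = LabelEncoder()
encoded = le.fit_transform(languages)

# Classes are sorted alphabetically, so every language gets a stable integer code.
classes = list(le.classes_)
decoded = list(le.inverse_transform(encoded))

print(classes)
print(list(encoded))
print(decoded)
```

The same le object is needed later to turn predicted integers back into language names, so it should be kept alongside the trained model.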

Text preparation
This dataset contains many irrelevant symbols and numbers that may degrade the performance of our model, so text preparation is required.

text_list = []
for text in X:
    # Replace symbols, digits and newlines with spaces. The source shows a bare "n"
    # in this character class, which would also strip the letter n; "\n" is assumed.
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    # Replace square brackets with spaces (escaped so that both brackets match).
    text = re.sub(r'[\[\]]', ' ', text)
    text = text.lower()
    text_list.append(text)

In the code above, we create an empty list text_list to hold the preprocessed text. We then iterate through all the text in X, remove the symbols and numbers, convert the text to lowercase, and finally append it to text_list.
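
The cleaning step can also be wrapped in a small function and tried on a single string. The sketch below assumes that the character class is meant to contain a newline escape rather than the letter n, and that the second pattern is meant to remove square brackets; clean_text and the sample string are illustrative names, not part of the project.

```python
import re

def clean_text(text):
    """Replace symbols, digits, newlines and square brackets with spaces, then lowercase."""
    text = re.sub(r'[!@#$(),\n"%^*?:;~`0-9]', ' ', text)
    text = re.sub(r'[\[\]]', ' ', text)
    return text.lower()

sample = "Hello, World! [1] It's 2023"
cleaned = clean_text(sample)
print(cleaned.split())
```

Keeping the cleaning in one function makes it easy to apply exactly the same preprocessing to new text at prediction time.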

CountVectorizer
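Both the input and the output features must take the form of numbers. We use CountVectorizer, which implements the bag-of-words model, to convert the text into numerical form.

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
X = cv.fit_transform(text_list).toarray()    # one row per text, one column per vocabulary token
X.shape

The output reported is (10337, 39419): 10,337 texts and a vocabulary of 39,419 distinct tokens. The second number depends on the exact preprocessing applied.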
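
To see what CountVectorizer produces, here is a minimal sketch on a three-sentence toy corpus (the corpus and the variable names are illustrative assumptions, not taken from the dataset). Each column corresponds to one vocabulary token and each cell holds the number of times that token occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran", "le chat dort"]  # toy corpus

cv = CountVectorizer()
X_sparse = cv.fit_transform(docs)     # SciPy sparse matrix, one row per document

vocabulary = sorted(cv.vocabulary_)   # learned tokens in column order
dense = X_sparse.toarray()            # fine for a toy corpus, costly at scale

print(vocabulary)
print(dense)
```

fit_transform returns a sparse matrix, and MultinomialNB accepts sparse input directly, so the .toarray() call can be skipped. With the default int64 counts, a dense array of shape (10337, 39419) would need roughly 3 GB of memory.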

Train Test split

Our input and output variables have been preprocessed, so the next stage is to split the dataset into training and test data. The training set is used to train the model and the test set is used to evaluate it. We use train_test_split for this procedure.

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

The test set is 20% of the data; the remaining 80% is used for training.
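
As a possible refinement, the split can be made reproducible and class-balanced. The sketch below uses toy data; the seed value and the use of stratify are assumptions for illustration and are not part of the original run.

```python
from sklearn.model_selection import train_test_split

texts = [f"sample {i}" for i in range(20)]      # toy inputs
labels = ["English"] * 10 + ["French"] * 10     # toy, perfectly balanced labels

x_train, x_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.20,
    random_state=42,    # assumed seed; any fixed integer makes the split repeatable
    stratify=labels,    # keep the language proportions the same in both splits
)

print(len(x_train), len(x_test))
```

Without a fixed random_state, each run produces a different split, so the reported accuracy can vary slightly from run to run.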

Training and prediction of models

The process of creating the model is almost complete. We use the Naive Bayes algorithm, specifically MultinomialNB, and train it on the training set.

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(x_train, y_train)

With the model trained, the next step is to predict the results for the test set.

y_prediction = model.predict(x_test)
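
The vectorizing and training steps can also be chained into a single object. The following is a minimal sketch using scikit-learn's make_pipeline on four toy sentences; the sentences and variable names are illustrative assumptions, and this is an alternative arrangement rather than the code used in the project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "the cat sat on the mat",
    "the dog ran home",
    "le chat dort sur le lit",
    "le chien court vite",
]
train_labels = ["English", "English", "French", "French"]

# The pipeline vectorizes raw text and feeds the counts to Naive Bayes in one call.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(train_texts, train_labels)

predictions = list(pipeline.predict(["the dog sat", "le chat court"]))
print(predictions)
```

A pipeline guarantees that new text goes through exactly the same vectorizer that was fitted on the training data.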

Model evaluation
After training and prediction, the next step is to evaluate the model.

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

accuracy = accuracy_score(y_test, y_prediction)
confusion_m = confusion_matrix(y_test, y_prediction)
print("The accuracy is :", accuracy)

We got an accuracy of 97%.
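
Accuracy is a single number, so a per-language breakdown can help show where the errors are. The sketch below uses classification_report on toy labels; the label arrays and class names are made up for illustration, and in the project the names would come from le.classes_.

```python
from sklearn.metrics import accuracy_score, classification_report

class_names = ["English", "French", "Hindi"]   # stand-in for le.classes_
y_true = [0, 0, 1, 1, 2, 2]                    # toy ground-truth labels
y_pred = [0, 0, 1, 2, 2, 2]                    # toy predictions with one mistake

accuracy = accuracy_score(y_true, y_pred)

# Human-readable table of precision, recall and F1 per language.
print(classification_report(y_true, y_pred, target_names=class_names))

# The same numbers as a nested dictionary, convenient for further processing.
report = classification_report(
    y_true, y_pred, target_names=class_names, output_dict=True
)
french_recall = report["French"]["recall"]
```

In this toy case one of the two French samples is predicted as Hindi, which lowers the recall for French and the precision for Hindi while the other scores stay perfect.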

Visualization

Using the seaborn heatmap, let’s plot the confusion matrix to visualize it.

plt.figure(figsize=(15, 10))
sns.heatmap(confusion_m, annot=True)
plt.show()
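
Besides plotting, the confusion matrix can be inspected numerically, for example to find which pair of languages is confused most often. This is a minimal sketch on toy labels (the arrays and class names are illustrative assumptions, not results from the project).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

class_names = ["English", "French", "Hindi"]   # stand-in for le.classes_
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred)                  # rows: true class, columns: predicted class
row_normalized = cm / cm.sum(axis=1, keepdims=True)    # share of each true class per prediction

off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)                      # keep only the misclassifications
true_idx, pred_idx = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print("Most frequent confusion:", class_names[true_idx], "predicted as", class_names[pred_idx])
```

In the heatmap itself, passing xticklabels and yticklabels with the language names to sns.heatmap labels the axes, which makes the plot easier to read than integer codes.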

Let’s try out the model on text from several languages. We write a function that takes the text as input and predicts the language in which it is written.

def lang_predict(text):
    x = cv.transform([text]).toarray()    # vectorize with the fitted CountVectorizer
    lang = model.predict(x)               # predicted integer label
    lang = le.inverse_transform(lang)     # map the label back to the language name
    print("The language is in", lang[0])

Here cv is the fitted CountVectorizer, which converts the text to a bag-of-words vector. The variable lang stores the predicted language, which is then printed for the user.

To test this, we call the lang_predict() function with any piece of text and let it predict the language.
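
A variant of the prediction function can return the language together with a probability instead of printing it. The sketch below is self-contained and trains on four toy sentences, so the numbers are only illustrative; the training sentences are assumptions, and predict_proba is used here as one possible way to obtain a confidence value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

texts = [
    "good morning my friend",
    "thank you very much",
    "bonjour mon ami",
    "merci beaucoup mon ami",
]
labels = ["English", "English", "French", "French"]

le = LabelEncoder()
y = le.fit_transform(labels)
cv = CountVectorizer()
X = cv.fit_transform(texts)
model = MultinomialNB().fit(X, y)

def lang_predict(text):
    """Return the predicted language name and its estimated probability."""
    x = cv.transform([text])
    probabilities = model.predict_proba(x)[0]
    best = model.classes_[probabilities.argmax()]   # encoded label with the highest probability
    return le.inverse_transform([best])[0], float(probabilities.max())

language, confidence = lang_predict("merci mon ami")
print(language, round(confidence, 3))
```

Returning the values rather than printing them makes the function easier to reuse, for example inside a chatbot or a batch job.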

Conclusion
We have come to the end of this report. I hope you now have a better understanding of how to predict language using machine learning. The data has to be examined and then preprocessed as necessary. The text data is then represented using a bag-of-words model. To make accurate predictions in NLP, text extraction and vectorization are crucial tasks. Naive Bayes is a strong model for text classification problems of this kind, and here it reached an accuracy of 97% on the test set.
