INTERNSHIP REPORT ON DATA
SCIENCE
AND MACHINE LEARNING
BY:
Srishti Kashyap
(2135051238)
Under the guidance of:
Dr. Alok Yadav
Department of Computer
Science Engineering,
GURU TEGH BAHADUR POLYTECHNIC
INSTITUTE G-8 AREA , RAJOURI GARDEN,
NEW DELHI-110064
ACKNOWLEDGEMENT
I would like to express my gratitude to the people who contributed to this
report, directly or indirectly, and who gave unending support right from the
stage the idea was conceived. It gives me great pleasure to have the
opportunity to acknowledge and express my gratitude to those who were
associated with me during my internship at YBI Foundation.
I take this opportunity to thank the industrial training coordinator, H.O.D. of
the Computer Science and Engineering department. I am highly indebted to
my project guide, Dr. Alok Yadav (Training Instructor), for his guidance
and words of wisdom. He always showed me the right direction during
the course of this project work. I am duly thankful to him
for teaching me, referring me to various blocks, providing work, and
permitting me to undertake training for a duration of 6 weeks.
DECLARATION
I hereby declare that the project work done by me at YBI Foundation, based on
Data Science and Machine Learning and submitted by me, is a record of bona fide
project work completed during internship training. I further declare that the
work reported in this project has not been submitted anywhere else and is
not copied from anywhere.
Chapter-1
Introduction and Literature Survey
Introduction
Machine Learning is the science of getting computers to learn without
being explicitly programmed. It is closely related to computational
statistics, which focuses on making predictions using computers. In its
application across business problems, machine learning is also referred
to as predictive analytics. Machine Learning focuses on the development
of computer programs that can access data and use it to learn for
themselves. The process of learning begins with observations or data,
such as examples, direct experience, or instruction, in order to look for
patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to
learn automatically, without human intervention or assistance, and adjust
their actions accordingly.
History of Machine Learning
The name machine learning was coined in 1959 by Arthur Samuel. Tom M.
Mitchell provided a widely quoted, more formal definition of the algorithms
studied in the machine learning field: "A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P if
its performance at tasks in T, as measured by P, improves with experience E."
This follows Alan Turing's proposal in his paper "Computing Machinery and
Intelligence", in which the question "Can machines think?" is replaced with the
question "Can machines do what we (as thinking entities) can do?". In Turing's
proposal the characteristics that could be possessed by a thinking machine and the
various implications in constructing one are exposed.
Types of Machine Learning
The types of machine learning algorithms differ in their approach, the type of data
they input and output, and the type of task or problem that they are intended to
solve. Broadly, Machine Learning can be divided into four categories:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
4. Semi-Supervised Learning
Supervised Learning
Supervised learning is a type of machine learning in which machines are
trained using well "labelled" training data, and on the basis of that data, machines
predict the output. The labelled data means some input data is already tagged
with the correct output.
Supervised learning is a process of providing input data as well as correct
output data to the machine learning model. The aim of a supervised learning
algorithm is to find a mapping function to map the input variable(x) with the
output variable(y).
In the real-world, supervised learning can be used for Risk Assessment, Image
classification, Fraud Detection, spam filtering, etc.
Unsupervised Learning
Unsupervised learning is a machine learning technique in which models are not
supervised using a training dataset. Instead, the models themselves find the hidden patterns
and insights from the given data. It can be compared to learning which takes
place in the human brain while learning new things.
It cannot be directly applied to a regression or classification problem because
unlike supervised learning, we have the input data but no corresponding output
data. The goal of unsupervised learning is to find the underlying structure of
dataset, group that data according to similarities, and represent that dataset in
a compressed format.
Reinforcement Learning
Reinforcement learning is a learning method in which an agent interacts with its
environment by producing actions and discovering errors or rewards. Trial-and-error
search and delayed reward are the most relevant characteristics of reinforcement
learning. This method allows machines and software agents to automatically
determine the ideal behavior within a specific context in order to maximize their
performance. Simple reward feedback is required for the agent to learn which
action is best.
Semi-Supervised Learning
Semi-supervised learning falls somewhere in between supervised and
unsupervised learning, since it uses both labeled and unlabeled data for training,
typically a small amount of labeled data and a large amount of unlabeled data.
Systems that use this method are able to considerably improve learning
accuracy. Usually, semi-supervised learning is chosen when labeling the acquired
data requires skilled and relevant resources, whereas acquiring unlabeled data
generally doesn’t require additional resources.
Literature Survey
Need for Machine Learning
Human beings, at this moment, are the most intelligent and advanced species on
earth because they can think, evaluate and solve complex problems. On the other
hand, AI is still in its initial stage and has not surpassed human intelligence in
many aspects. The question, then, is why we need to make machines learn. The
most suitable reason for doing this is “to make decisions, based on data, with
efficiency and scale”.
Lately, organizations have been investing heavily in newer technologies like
Artificial Intelligence, Machine Learning and Deep Learning to extract key
information from data, perform several real-world tasks and solve problems. We
can call these data-driven decisions taken by machines, particularly to automate
a process. Such data-driven decisions can be used, instead of programming logic,
in problems that cannot inherently be programmed. The fact is that we can’t do
without human intelligence, but we also need to solve real-world problems
efficiently and at a huge scale. That is why the need for machine learning
arises.
Challenges in Machine Learning
While Machine Learning is rapidly evolving, making significant strides in
cybersecurity and autonomous cars, this segment of AI as a whole still has a long
way to go. The reason is that ML has not yet been able to overcome a number
of challenges. The challenges that ML currently faces are:
Quality of data: Having good-quality data for ML algorithms is one of the biggest
challenges. Use of low-quality data leads to problems in data preprocessing and
feature extraction.
Time-consuming task: Another challenge faced by ML models is the time
consumed, especially by data acquisition, feature extraction and retrieval.
Lack of specialists: As ML technology is still in its infancy, finding expert
resources is tough.
No clear objective for formulating business problems: Having no clear objective
and well-defined goal for business problems is another key challenge for ML
because this technology is not that mature yet.
Issue of overfitting & underfitting: If the model overfits or underfits, it
cannot represent the problem well.
Curse of dimensionality: Another challenge ML models face is too many features
per data point. This can be a real hindrance.
Difficulty in deployment: The complexity of an ML model makes it quite difficult
to deploy in real life.
Applications of Machine Learning
Machine Learning is one of the most rapidly growing technologies and, according
to researchers, we are in the golden years of AI and ML. It is used to solve many
complex real-world problems which cannot be solved with a traditional approach.
Following are some real-world applications of ML:
Emotion analysis
Sentiment analysis
Error detection and prevention
Weather forecasting and prediction
Stock market analysis and forecasting
Speech synthesis
Speech recognition
Customer segmentation
Object recognition
Fraud detection
Chapter-2
Technology Implemented
Python – The New Generation Language
Python is a widely used, general-purpose, high-level programming language. It was
initially designed by Guido van Rossum in 1991 and is developed by the Python
Software Foundation. It was designed with an emphasis on code
readability, and its syntax allows programmers to express concepts in fewer lines
of code. Python is dynamically typed and garbage-collected. It supports multiple
programming paradigms, including procedural, object-oriented, and functional
programming. Python is often described as a "batteries included" language due to
its comprehensive standard library.
Features
Interpreted
In Python there are no separate compilation and execution steps as in C/C++.
It runs the program directly from the source code. Internally, Python
converts the source code into an intermediate form called bytecode, which
is then executed by the Python virtual machine on the specific computer.
Platform Independent
Python programs can be developed and executed on multiple operating
system platforms. Python can be used on Linux, Windows, Macintosh,
Solaris and many more.
Multi- Paradigm
Python is a multi-paradigm programming language. Object-oriented
programming and structured programming are fully supported, and many
of its features support functional programming and aspect-oriented
programming .
Simple
Python is a very simple language. It is very easy to learn as it is close to
the English language. In Python, more emphasis is on the solution to the
problem rather than the syntax.
Rich Library Support
Python standard library is very vast. It can help to do various things
involving regular expressions, documentation generation, unit testing,
threading, databases, web browsers, CGI, email, XML, HTML, WAV files,
cryptography, GUI and many more.
Free and Open Source
Firstly, Python is freely available. Secondly, it is open-source. This means
that its source code is available to the public. We can download it, change it,
use it, and distribute it. This is called FLOSS (Free/Libre and Open Source
Software). The Python community is headed toward one goal: an ever-better
Python.
Why Python Is a Perfect Language for Machine Learning?
1. A great library ecosystem –
A great choice of libraries is one of the main reasons Python is the most
popular programming language used for AI. A library is a module or a group of
modules, published by different sources, which includes pre-written pieces of
code that allow users to reach some functionality or perform different
actions. Python libraries provide base-level items so developers don’t have to
code them from the very beginning every time. ML requires continuous
data processing, and Python’s libraries let us access, handle and transform
data. These are some of the most widespread libraries you can use for ML and
AI:
O Scikit-learn for handling basic ML algorithms like clustering, linear and logistic
regression, classification, and others.
O Pandas for high-level data structures and analysis. It allows merging and filtering
of data, as well as gathering it from other external sources like Excel, for instance.
O Keras for deep learning. It allows fast calculations and prototyping, as it uses the
GPU in addition to the CPU of the computer.
O TensorFlow for working with deep learning by setting up, training, and utilizing
artificial neural networks with massive datasets.
O Matplotlib for creating 2D plots, histograms, charts, and other forms of
visualization.
O NLTK for working with computational linguistics, natural language recognition,
and processing.
O Scikit-image for image processing.
O PyBrain for neural networks, unsupervised and reinforcement learning.
O Caffe for deep learning that allows switching between the CPU and the GPU and
processing 60+ million images a day using a single NVIDIA K40 GPU.
O StatsModels for statistical algorithms and data exploration.
In the PyPI repository, we can discover and compare more python libraries.
2. A low entry barrier –
Working in the ML and AI industry means dealing with a large amount of data that
we need to process in the most convenient and effective way. The low entry barrier
allows more data scientists to quickly pick up Python and start using it for AI
development without spending too much effort on learning the language. In
addition to this, there’s a lot of documentation available, and Python’s
community is always there to help out and give advice.
3. Flexibility-
Python is a great choice for machine learning, as this language is very
flexible:
It offers the option to use either OOP or scripting.
There’s no need to recompile the source code; developers can
implement any changes and quickly see the results.
Programmers can combine Python and other languages to
reach their goals.
4. Good Visualization Options-
For AI developers, it’s important to highlight that in artificial intelligence, deep
learning, and machine learning, it’s vital to be able to represent data in a human-
readable format. Libraries like Matplotlib allow data scientists to build charts,
histograms, and plots for better data comprehension, effective presentation, and
visualization. Different application programming interfaces also simplify the
visualization process and make it easier to create clear reports.
5. Community Support-
It’s always very helpful when there’s strong community support built around the
programming language. Python is an open-source language which means that
there’s a bunch of resources open for programmers starting from beginners
and ending with pros. A lot of Python documentation is available online as well as
in Python communities and forums, where programmers and machine learning
developers discuss errors, solve problems, and help each other out. Python
programming language is absolutely free as is the variety of useful libraries and
tools.
6. Growing Popularity-
As a result of the advantages discussed above, Python
is becoming more and more popular among data scientists. According to
Stack Overflow, the popularity of Python is predicted to grow until
2020, at least. This means it’s easier to search for developers and replace
team players if required. Also, the cost of their work may not be as high as
when using a less popular programming language.
Data Preprocessing, Analysis & Visualization
Machine Learning algorithms don’t work well on raw data. Before
we can feed such data to an ML algorithm, we must preprocess it by applying
some transformations to it. With data preprocessing, we convert raw data into a
clean data set. To perform this, there are 7 techniques –
1. Rescaling Data –
For data with attributes of varying scales, we can rescale the attributes so that
they share the same scale. Rescaling attributes into the range 0 to 1 is called
normalization. We use the MinMaxScaler class from scikit-learn. This gives us
values between 0 and 1.
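The rescaling described above can be sketched in plain Python. This is only a
minimal illustration of the min-max formula that a scaler such as MinMaxScaler
applies to each attribute; the function name and the sample values are
hypothetical, and mapping a constant column to 0.0 is a choice made for this sketch.

```python
def min_max_rescale(column):
    """Rescale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column has no spread; this sketch maps every value to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


ages = [18, 25, 32, 60]  # hypothetical attribute values
scaled_ages = min_max_rescale(ages)
```

The smallest value becomes 0, the largest becomes 1, and everything else falls
proportionally in between.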
2. Standardizing Data –
Standardization refers to shifting the distribution of each attribute to have
a mean of zero and a standard deviation of one (unit variance). It is useful
to standardize attributes for a model that relies on the distribution of
attributes such as Gaussian processes.
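A minimal sketch of standardization, assuming the population standard deviation
(dividing by n) is used; the function name and the sample heights are
hypothetical and only serve to illustrate the formula z = (x - mean) / std.

```python
import math


def standardize(column):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(variance)
    if std == 0:
        # A constant column cannot be scaled; return zeros in this sketch.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


heights = [150.0, 160.0, 170.0, 180.0]  # hypothetical attribute values
z_scores = standardize(heights)
```

After the transformation the values are centered on zero and their variance is one.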
3. Normalizing Data –
This is used to rescale each row of data to have a length of 1. It is mainly
useful for sparse datasets, where we have lots of zeros. We can rescale the
data with the help of the Normalizer class of the scikit-learn Python library.
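As a rough sketch of row normalization, the function below divides each row by
its Euclidean (L2) length so that the row has length 1. The helper name and the
sample rows are hypothetical, and leaving an all-zero row unchanged is an
assumption of this sketch.

```python
import math


def normalize_rows(rows):
    """Scale each row so that its Euclidean (L2) length is 1."""
    normalized = []
    for row in rows:
        length = math.sqrt(sum(v * v for v in row))
        if length == 0:
            # An all-zero row has no direction; keep it unchanged.
            normalized.append(list(row))
        else:
            normalized.append([v / length for v in row])
    return normalized


sparse_rows = [[3.0, 0.0, 4.0], [0.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
unit_rows = normalize_rows(sparse_rows)
```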
4. Binarizing Data –
This is the technique with the help of which we can make our data binary.
We can use a binary threshold for making our data binary. The values
above that threshold value will be converted to 1 and below that threshold
will be converted to 0. For example, if we choose threshold value = 0.5,
then the dataset value above it will become 1 and below this will become 0.
That is why we can call it binarizing the data or thresholding the data. This
technique is useful when we have probabilities in our dataset and want to
convert them into crisp values.
We can binarize the data with the help of Binarizer class of scikit-learn
Python library.
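The thresholding rule can be sketched in a few lines. In this illustration,
values strictly above the threshold become 1 and all others (including values
equal to the threshold) become 0; the function name and the sample
probabilities are hypothetical.

```python
def binarize(values, threshold=0.5):
    """Map values above the threshold to 1 and all other values to 0."""
    return [1 if v > threshold else 0 for v in values]


probabilities = [0.1, 0.5, 0.51, 0.9]  # hypothetical predicted probabilities
crisp = binarize(probabilities)
```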
5. Mean Removal-
We can remove the mean from each feature to center it on zero.
6. One Hot Encoding –
When a feature takes only a few, scattered values, we may not need to
store the values themselves. Instead, we can perform One Hot Encoding: for k
distinct values, we transform the feature into a k-dimensional vector with a
single 1 and 0 in all the remaining positions.
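A small sketch of one hot encoding in plain Python. The function name and the
sample grades are hypothetical, and ordering the categories alphabetically is
an assumption made only so that the result is reproducible.

```python
def one_hot_encode(values):
    """Return the sorted categories and one k-dimensional 0/1 vector per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        vector = [0] * len(categories)
        vector[index[value]] = 1  # exactly one position is set to 1
        encoded.append(vector)
    return categories, encoded


grades = ["B", "A", "C", "A"]  # hypothetical feature with k = 3 distinct values
categories, grade_vectors = one_hot_encode(grades)
```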
7. Label Encoding –
Some labels can be words or numbers. Usually, training data is labelled with
words to make it readable. Label encoding converts word labels into numbers
to let algorithms work on them.
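Label encoding can be sketched as a simple lookup table from each word label to
an integer code. The function names and the sample labels are hypothetical;
assigning codes in alphabetical order is an assumption of this sketch.

```python
def label_encode(labels):
    """Return the sorted classes and the integer code for each label."""
    classes = sorted(set(labels))
    to_code = {label: code for code, label in enumerate(classes)}
    return classes, [to_code[label] for label in labels]


def label_decode(classes, codes):
    """Map integer codes back to their original word labels."""
    return [classes[code] for code in codes]


results = ["pass", "fail", "pass", "distinction"]  # hypothetical labels
classes, codes = label_encode(results)
```

Keeping the list of classes makes it possible to translate the numeric
predictions of a model back into readable labels.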
Machine Learning Algorithms
There are many types of Machine Learning algorithms specific to different use
cases. As we work with datasets, a machine learning algorithm works in two
stages, training and testing. We usually split the data so that around 80% is
used for training and 20% for testing. Under supervised learning, we split a
dataset into training data and test data in Python ML. Following are the
algorithms of Python Machine Learning –
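The 80%-20% split can be sketched with the standard library alone. The function
name, the seed parameter, and the sample rows below are hypothetical; shuffling
before splitting is a common practice assumed here, not something the text
prescribes.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(10))  # hypothetical dataset of 10 rows
train_rows, test_rows = split_train_test(samples, test_fraction=0.2, seed=42)
```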
1. Linear Regression-
Linear regression may be defined as the statistical model that analyzes the
linear relationship between a dependent variable and a given set of
independent variables. A linear relationship between variables means that
when the value of one or more independent variables changes (increases
or decreases), the value of the dependent variable also changes accordingly
(increases or decreases).
Mathematically, the relationship can be represented with the help of the
following equation:
𝑌 = 𝑚𝑋 + 𝑏
Here,
𝑌 is the dependent variable we are trying to predict.
𝑋 is the independent variable we are using to make predictions.
𝑚 is the slope of the regression line, which represents the effect 𝑋 has on 𝑌.
𝑏 is a constant, known as the 𝑌-intercept.
If 𝑋 = 0, 𝑌 would be equal to 𝑏.
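As an illustration, the slope m and intercept b of the line Y = mX + b can be
estimated with ordinary least squares for a single independent variable. This is
a minimal sketch; the function name and the sample data (generated from
Y = 2X + 3) are hypothetical.

```python
def fit_line(xs, ys):
    """Estimate slope m and intercept b of Y = mX + b by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of X and Y divided by the variance of X.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


hours = [1.0, 2.0, 3.0, 4.0]   # hypothetical independent variable X
marks = [5.0, 7.0, 9.0, 11.0]  # hypothetical dependent variable Y = 2X + 3
m, b = fit_line(hours, marks)
predicted = m * 5.0 + b        # prediction for X = 5
```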
2. Logistic Regression –
Logistic regression is a supervised classification algorithm, one of the
Machine Learning algorithms in Python, that finds its use in estimating
discrete values like 0/1, yes/no, and true/false. This is based on a given set
of independent variables. We use a logistic function to predict the probability
of an event, and this gives us an output between 0 and 1. Although it says
‘regression’, this is actually a classification algorithm. Logistic regression
fits data to a logit function and is also called logit regression.
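The logistic function and its use for classification can be sketched as follows.
The weight, bias and threshold are hypothetical values chosen for illustration;
in practice they would be learned from training data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the range 0 to 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this equivalent form avoids overflow in exp().
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, weight, bias):
    """Probability of the positive class for a single feature value x."""
    return sigmoid(weight * x + bias)


def predict_class(x, weight, bias, threshold=0.5):
    """Turn the probability into a discrete 0/1 decision."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0
```

For example, with a hypothetical weight of 1.0 and bias of -3.0, an input of 5.0
gives a probability above 0.5 and is assigned class 1, while an input of 1.0 is
assigned class 0.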
3. Decision Tree –
A decision tree falls under supervised Machine Learning algorithms in
Python and is useful for both classification and regression, although
mostly for classification. This model takes an instance, traverses the tree,
and compares important features with a determined conditional
statement. Whether it descends to the left child branch or the right
depends on the result. Usually, more important features are closer to the
root. Decision Tree, a Machine Learning algorithm in Python can work on
both categorical and continuous dependent variables. Here, we split a
population into two or more homogeneous sets. Tree models where the
target variable can take a discrete set of values are called classification
trees; in these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels. Decision
trees where the target variable can take continuous values (typically real
numbers) are called regression trees.
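How a tree might choose the conditional statement at a node can be sketched with
a single numeric feature. The sketch below assumes Gini impurity as the
splitting criterion, which is one common choice and is not stated in the text
above; the function names and the sample data are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        p = labels.count(cls) / n
        impurity -= p * p
    return impurity


def best_threshold(values, labels):
    """Find the split 'value <= threshold' with the lowest weighted impurity."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


study_hours = [1, 2, 3, 6, 7, 8]                  # hypothetical feature
passed = ["no", "no", "no", "yes", "yes", "yes"]  # hypothetical class labels
threshold, impurity = best_threshold(study_hours, passed)
```

A full tree would repeat this search recursively on the left and right subsets
until the leaves are sufficiently homogeneous.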
4. Support Vector Machine (SVM)-
SVM is a supervised classification algorithm, and one of the most important
Machine Learning algorithms in Python, that plots a line dividing the different
categories of your data. In this ML algorithm, we calculate the vector that
optimizes the line, so that the closest points of each group lie as far
from each other as possible. While you will almost always find this to be a
linear vector, it can be other than that. An SVM model is a representation
of the examples as points in space, mapped so that the examples of the
separate categories are divided by a clear gap that is as wide as possible. In
addition to performing linear classification, SVMs can efficiently perform
non-linear classification using what is called the kernel trick, implicitly
mapping their inputs into high-dimensional feature spaces. When data are
unlabeled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find a natural clustering of
the data into groups and then map new data to these formed groups.
5. Naïve Bayes Algorithm –
Naive Bayes is a classification method which is based on Bayes’ theorem
and assumes independence between predictors. A Naive Bayes classifier
assumes that a feature in a class is unrelated to any other. Consider a
fruit: it is an apple if it is round, red, and 2.5 inches in diameter. A
Naive Bayes classifier will say these characteristics independently
contribute to the probability of the fruit being an apple, even if the
features depend on each other. For very large data sets, it is easy to build
a Naive Bayesian model. Not only is this model very simple, it can also
perform better than many highly sophisticated classification
methods. Naïve Bayes classifiers are highly scalable, requiring a number of
parameters linear in the number of variables (features/predictors) in a
learning problem. Maximum-likelihood training can be done by evaluating a
closed-form expression, which takes linear time, rather than by expensive
iterative approximation as used for many other types of classifiers.
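The independence assumption can be illustrated with a tiny categorical Naive
Bayes classifier: the score of a class is its prior multiplied by the
per-feature probabilities. This is only a sketch; the function names and the
fruit data are hypothetical, and add-one (Laplace) smoothing is an assumption
added here so that unseen feature values do not produce a zero probability.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # key: (feature index, class)
    feature_values = defaultdict(set)      # distinct values seen per feature
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values


def predict_naive_bayes(model, row):
    """Return the class with the highest prior * product of likelihoods."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior probability of the class
        for i, value in enumerate(row):
            # Add-one smoothing (assumption of this sketch).
            numerator = feature_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score *= numerator / denominator
        scores[label] = score
    return max(scores, key=scores.get)


fruits = [("round", "red"), ("round", "red"), ("long", "yellow"),
          ("long", "yellow"), ("round", "yellow")]
names = ["apple", "apple", "banana", "banana", "apple"]
model = train_naive_bayes(fruits, names)
```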
6. kNN Algorithm –
This is a Python Machine Learning algorithm for classification and regression,
mostly for classification. It is a supervised learning algorithm that stores
the training examples and uses a distance function, usually Euclidean, to
compare a new case with them. It classifies new cases using a majority vote of
their k nearest neighbors: the class assigned to a case is the one most
common among its k nearest neighbors.
k-NN is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until classification.
k-NN is a special case of a variable-bandwidth, kernel density "balloon"
estimator with a uniform kernel.
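A minimal sketch of k-NN classification with Euclidean distance and a majority
vote. The function name, the value of k, and the sample points are hypothetical;
`math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k=3):
    """Classify the query point by majority vote of its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
labels = ["A", "A", "A", "B", "B", "B"]
```

Note that no model is built in advance: all the work happens when a query
arrives, which is why the method is described as lazy learning.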
7. K-Means Algorithm –
k-Means is an unsupervised algorithm that solves the problem of clustering.
It groups data using a chosen number of clusters. The data points inside a
cluster are homogeneous, and heterogeneous with respect to other clusters.
k-means clustering is a method of vector quantization, originally from signal
processing, where it still finds use, that is popular for cluster analysis in
data mining. k-means clustering aims to
partition N observations into k clusters in which each observation belongs
to the cluster with the nearest mean, serving as a prototype of the cluster.
The problem is computationally difficult (NP-hard), but k-means clustering is
rather easy to apply even to large data sets, particularly when using
heuristics such as Lloyd's algorithm. It is often used
as a preprocessing step for other algorithms, for example to find a starting
configuration. k-means clustering has also been used as a
feature learning (or dictionary learning) step, in either (semi-)supervised
learning or unsupervised learning.
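Lloyd's algorithm, mentioned above, alternates between assigning each point to
its nearest mean and recomputing the means. The sketch below is a minimal
version under stated assumptions: the starting centroids are supplied by the
caller, an empty cluster keeps its old centroid, and the function names and
sample points are hypothetical.

```python
import math


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def k_means(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate assignment and mean-update steps."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == i]
            if members:
                # New centroid is the coordinate-wise mean of the members.
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, assign(points, centroids)


data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, cluster_ids = k_means(data, [(0.0, 0.0), (10.0, 10.0)])
```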
8. Random Forest –
A random forest is an ensemble of decision trees. In order to classify every
new object based on its attributes, the trees vote for a class: each tree
provides a classification, and the classification with the most votes wins.
Random forests or random decision forests are an ensemble learning
method for classification, regression and other tasks that operates by
constructing a multitude of decision trees at training time and outputting
the class that is the mode of the classes (classification) or mean prediction
(regression) of the individual trees.
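The voting step can be sketched on its own. In the illustration below, three
hand-written one-rule "trees" stand in for trained decision trees; they, the
feature names, and the sample student are all hypothetical, and the sketch shows
only how predictions are combined, not how the trees are built.

```python
from collections import Counter


# Hypothetical stand-ins for trained trees: each looks at one feature.
def tree_hours(sample):
    return "pass" if sample["hours"] > 4 else "fail"


def tree_attendance(sample):
    return "pass" if sample["attendance"] > 0.75 else "fail"


def tree_previous(sample):
    return "pass" if sample["previous_marks"] > 50 else "fail"


forest = [tree_hours, tree_attendance, tree_previous]


def forest_classify(trees, sample):
    """Classification: return the mode of the individual tree predictions."""
    votes = [tree(sample) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Regression: return the mean of the individual tree predictions."""
    return sum(tree_predictions) / len(tree_predictions)


student = {"hours": 6, "attendance": 0.6, "previous_marks": 70}
```

Here two of the three trees vote "pass", so the forest classifies the student
as "pass" even though one tree disagrees.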
Chapter 3
Result Discussion
Result
This training has introduced us to Machine Learning. Now, we know that Machine
Learning is a technique of training machines to perform the activities a human
brain can do, albeit a bit faster and better than an average human being. Today we
have seen that machines can beat human champions in games such as Chess and
Mahjong, which are considered very complex. We have seen that machines can
be trained to perform human activities in several areas and can aid humans in
living better lives. Machine learning is a quickly growing field in computer
science. It has applications in nearly every other field of study and is already
being implemented commercially because machine learning can solve problems too
difficult or time-consuming for humans to solve. To describe machine learning in
general terms, a variety of models are used to learn patterns in data and make
accurate predictions based on the patterns they observe.
Machine Learning can be Supervised or Unsupervised. If we have a
smaller amount of clearly labelled data for training, we opt for Supervised
Learning. Unsupervised Learning is generally better suited to large data sets
for which labels are not available. If we have a huge data set easily available,
we go for deep learning techniques. We have also learned about Reinforcement
Learning and Deep Reinforcement Learning. We now know what Neural Networks are,
along with their applications and limitations. Specifically, we have developed a
thought process for approaching the problems that machine learning is so good at
solving. We have learnt how machine learning is different from descriptive
statistics.
Finally, when it comes to the development of machine learning models of our
own, we looked at the choices of various development languages, IDEs and
platforms. The next thing that we need to do is start learning and practicing each
machine learning technique. The subject is vast, meaning that there is width, but
if we consider the depth, each topic can be learned in a few hours. The topics are
independent of each other. We need to take one topic at a
time, learn it, practice it and implement its algorithm(s) using a language
of our choice. This is the best way to start studying Machine Learning. Practicing
one topic at a time, very soon we can acquire the width that is eventually
required of a Machine Learning expert.
Chapter 4
Project Report
Objective-
Classification Model to predict students' marks
Dataset description-
Dataset Source –
https://www.kaggle.com/datasets/yasserh/student-marks-dataset
Code and output-
Link to this project-
https://www.kaggle.com/code/srishtiii22/student-marks-prediction/notebook