Machine Learning With Python

This report explores the fundamentals of Machine Learning (ML) with Python, emphasizing its key concepts, tools, and applications. It aims to provide a comprehensive understanding of how Python facilitates the development of ML models, along with practical implementations.

An

Industrial Training Report

On

“Machine Learning with Python”

Submitted in partial fulfilment for the award of degree of

Bachelor of Technology

in

Computer Science & Engineering

Submitted By: SHAGUN KUMARI MANGAL, 21EJCCS806
Submitted To: MR. SUNIL SHARMA, Assistant Professor

Department of Computer Science & Engineering


Jaipur Engineering College & Research Centre
Jaipur, Rajasthan (2022-23)
Jaipur Engineering College and Research Centre, Shri Ram ki Nangal, via Sitapura RIICO, Jaipur - 302022.
Academic Year: 2022-2023

CERTIFICATE

This is to certify that the industrial training entitled “Machine Learning with Python” is the
bona fide work carried out by Shagun Kumari Mangal, student of B. Tech in Computer
Science & Engineering at Jaipur Engineering College and Research Centre, during the year
2022-23 in partial fulfillment of the requirements for the award of the Degree of Bachelor of
Technology in Computer Science & Engineering under my guidance.

Name of Guide : Sanam Peeyush

Designation : Trainer

Place : Jaipur, Rajasthan

Date : 15 October, 2022

VISION OF INSTITUTE
To become a renowned centre of outcome-based learning and work towards the academic,
professional, cultural and social enrichment of the lives of individuals and communities.

MISSION OF INSTITUTE
1. Focus on evaluation of learning outcomes and motivate students to inculcate research
aptitude by project-based learning.
2. Identify areas of focus and provide platform to gain knowledge and solutions based on
informed perception of Indian, regional and global needs.
3. Offer opportunities for interaction between academia and industry.
4. Develop human potential to its fullest extent so that intellectually capable and imaginatively
gifted leaders can emerge in a range of professions.

VISION OF CSE DEPARTMENT


To become renowned Centre of excellence in computer science and engineering and make
competent engineers & professionals with high ethical values prepared for lifelong learning.

MISSION OF CSE DEPARTMENT

1. To impart outcome-based education for emerging technologies in the field of computer
science and engineering.
2. To provide opportunities for interaction between academia and industry.
3. To provide a platform for lifelong learning by accepting the change in technologies.
4. To develop an aptitude for fulfilling social responsibilities.

PROGRAM OUTCOMES (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and Computer Science & Engineering specialization to the solution of complex
Computer Science & Engineering problems.
2. Problem analysis: Identify, formulate, research literature, and analyse complex Computer
Science and Engineering problems reaching substantiated conclusions using first principles of
mathematics, natural sciences, and engineering sciences.
3. Design/development of solutions: Design solutions for complex Computer Science and
Engineering problems and design system components or processes that meet the specified
needs with appropriate consideration for the public health and safety, and the cultural, societal,
and environmental considerations.
4. Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of Computer Science and Engineering experiments, analysis and
interpretation of data, and synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modelling to complex Computer Science
Engineering activities with an understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional Computer Science and Engineering practice.
7. Environment and sustainability: Understand the impact of the professional Computer
Science and Engineering solutions in societal and environmental contexts, and demonstrate the
knowledge of, and need for sustainable development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the Computer Science and Engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings in Computer Science and Engineering.
10. Communication: Communicate effectively on complex Computer Science and
Engineering activities with the engineering community and with society at large, such as, being
able to comprehend and write effective reports and design documentation, make effective
presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the
Computer Science and Engineering and management principles and apply these to one’s own
work, as a member and leader in a team, to manage projects and in multi-disciplinary
environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological change in Computer
Science and Engineering.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

The PEOs of the B. Tech (CSE) program are:

1. To produce graduates who are able to apply computer engineering knowledge to
provide turn-key IT solutions to national and international organizations.
2. To produce graduates with the necessary background and technical skills to work
professionally in one or more of the areas like – IT solution design development and
implementation consisting of system design, network design, software design and
development, system implementation and management etc. Graduates would be able to
provide solutions through logical and analytical thinking.
3. To enable graduates to design embedded systems for industrial applications.
4. To inculcate in graduates effective communication skills and team work skills to enable
them to work in multidisciplinary environment.
5. To prepare graduates for personal and professional success with commitment to their
ethical and social responsibilities.

PROGRAM SPECIFIC OUTCOMES (PSOs)

PSO1 Ability to interpret and analyse network-specific and cyber security issues and automation in a
real-world environment.
PSO2 Ability to design and develop mobile and web-based applications under realistic constraints.

COURSE OUTCOMES (COs)


On completion of Industrial Training Graduates will be able to-
• CO1: Generate the report based on the projects carried out for demonstrating the ability
to apply the knowledge of engineering field during training
• CO2: Demonstrate Competency in relevant engineering fields through problem
identification, formulation and solution.

MAPPING: CO’s & PO’s

Program Outcomes (POs)

Subject Code: 3CS7-30 (Industrial Training)

COs    PO-1  PO-2  PO-3  PO-4  PO-5  PO-6  PO-7  PO-8  PO-9  PO-10  PO-11  PO-12
CO-1   3     3     2     2     2     1     1     2     2     3      3      3
CO-2   3     3     3     3     3     1     1     2     2     3      3      3

ACKNOWLEDGEMENT

It has been a great honour and privilege to undergo training at UPFLAIRS Pvt Ltd, Jaipur. I
am very grateful to MR. SANAM PEEYUSH for giving his valuable time and constructive
guidance in preparing this training report. It would not have been possible to complete this
report in a short period of time without his kind encouragement and valuable guidance.

I wish to express my deep sense of gratitude to my Industrial Training Guide MR. SUNIL
SHARMA, Assistant Professor, Jaipur Engineering College and Research Centre, Jaipur, for
guiding me from the inception to the completion of the industrial training. I sincerely
acknowledge his valuable guidance, support for the literature survey, and critical
reviews and comments on my industrial training.

I would also like to thank MR. ARPIT AGRAWAL, Director of JECRC,
for providing such great infrastructure and environment for our overall development.

I express sincere thanks to DR. V. K. CHANDNA, Principal of JECRC, for his kind
cooperation and extended support towards the completion of my industrial training.

Words are inadequate to thank DR. SANJAY GAUR, Head of the Department of
Computer Science Engineering, for his consistent encouragement and support in shaping my
industrial training into a presentable form.

Finally, my warm thanks to Jaipur Engineering College and Research Centre, which provided
me this opportunity to carry out this prestigious industrial training and enhance my learning in
various technical fields.

SHAGUN KUMARI MANGAL


21EJCCS806

ABSTRACT
Name : Shagun Kumari Mangal
RTU Roll no : 21EJCCS806
Branch : Computer Science and Engineering
Training Industry : Upflairs
Training Industry Address : JECRC Foundation
Trainer : SANAM PEEYUSH
Technology Name/Project Name : Machine Learning with python
Training mode : Offline
Training Duration : 12th September - 15th October
Training Status : Complete
Description: -
Python is one of the most preferred languages for scientific computing, data science, and
machine learning. It offers high productivity, and it achieves good performance by delegating
heavy computation to libraries implemented in low-level languages. In this training we also
learned about tkinter, a Python library for building graphical user interfaces (GUIs). With the
help of this library, we can give our machine learning model a simple user interface.

For machine learning we use several Python libraries that help us build
accurate models. These libraries are:
• numpy
• pandas
• sklearn
• seaborn
• matplotlib
All the above information is true and correct to the best of my knowledge.

SHAGUN KUMARI MANGAL

21EJCCS806

List of Figures

S. No. Fig. Description

1. Fig. 1.1 Data type in Python
2. Fig. 2.1 Machine learning
3. Fig. 2.2 Deep learning
4. Fig. 2.3 Working of machine learning
5. Fig. 3.1 NumPy
6. Fig. 3.2 Pandas
7. Fig. 3.3 Matplotlib
8. Fig. 3.4 Bar plot
9. Fig. 3.5 Line plot
10. Fig. 3.6 Histogram
11. Fig. 3.7 Box plot
12. Fig. 3.8 Scikit Learn
13. Fig. 4.1 Loading of dataset
14. Fig. 4.2 Project data analysis
15. Fig. 4.3 Data collection
16. Fig. 4.4 Statistical measures of the dataset
17. Fig. 4.5 Data visualization
18. Fig. 4.6 Data distribution
19. Fig. 4.7 Gender visualization
20. Fig. 4.8 Gender analysis
21. Fig. 4.9 BMI distribution
22. Fig. 4.10 Children column
23. Fig. 4.11 Smoker column analysis
24. Fig. 4.12 Analysis of region column
25. Fig. 4.13 Charge distribution
26. Fig. 5.1 Data processing
27. Fig. 5.2 Encoding of data
28. Fig. 5.3 Training and splitting of data
29. Fig. 5.4 Data splitting
30. Fig. 6.1 Linear regression
31. Fig. 6.2 Regression analysis
32. Fig. 6.3 Prediction on training data
33. Fig. 6.4 Prediction on testing data
34. Fig. 6.5 Training of data
35. Fig. 6.6 Splitting of data
36. Fig. 7.1 Classification in machine learning
37. Fig. 7.2 Email spam detector
38. Fig. 7.3 Logistic regression
39. Fig. 8.1 Classifier in machine learning
40. Fig. 8.2 Decision tree
41. Fig. 8.3 Decision tree implementation
42. Fig. 8.4 Random Forest tree
43. Fig. 8.5 Random Forest algorithm
44. Fig. 9.1 Methods of cross validation
45. Fig. 9.2 Cross validation

List of Tables

S. No. Table Name

1. Table 1: Data type in python
2. Table 2: Python keywords
3. Table 3: Arithmetic operator
4. Table 4: Assignment operator
5. Table 5: Comparison operator
6. Table 6: Logical operator
7. Table 7: Identity operator
8. Table 8: Membership operator
9. Table 9: Bitwise operator

TABLE OF CONTENTS

S. No. Topic Name

1. Certificate
2. Vision and Mission
3. Program Outcomes (POs)
4. Program Educational Objectives (PEOs)
5. Program Specific Outcomes (PSOs)
6. Course Outcome
7. Mapping: COs and POs
8. Acknowledgement
9. Abstract
10. Python overview in machine learning
11. History of python
12. Introduction of python
13. Data types in python
14. Keywords in python
15. Operators in python
16. Loops in python
17. Machine learning
18. Types of Machine Learning
19. Machine Learning Working Process
20. Application of Machine Learning
21. Various Libraries used in Machine Learning
22. Project (Medical Insurance Cost Prediction)
23. Data processing
24. Regression
25. Classification
26. Classifier
27. Cross validation concept
28. Future scope
29. Conclusion
30. References

PYTHON OVERVIEW IN MACHINE LEARNING:

HISTORY OF PYTHON:
Python was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde &
Informatica (CWI) in the Netherlands as a successor to the ABC programming language (itself
inspired by SETL), with the added ability to handle exceptions and interface with the Amoeba
operating system. Its implementation began in December 1989.

INTRODUCTION:
Python is a dynamic, interpreted (bytecode-compiled) language. Source code needs no type
declarations for variables, parameters, functions, or methods. This makes the code short and
flexible, but you lose the compile-time type checking of the source code. Python is a widely
used general-purpose, high-level programming language. It was created by Guido van
Rossum, first released in 1991, and is now managed by the Python Software Foundation. It was
designed with an emphasis on code readability, and its syntax allows programmers to express their
concepts in fewer lines of code. Python is a programming language that lets you work quickly
and integrate systems more efficiently.

Python as an object-oriented language:

In Python, object-oriented programming (OOP) is a programming paradigm that uses
objects and classes. It models real-world entities and supports concepts such as inheritance,
polymorphism, and encapsulation. The main idea of OOP is to
bind data and the functions that work on that data together as a single unit, so that other parts
of the code access the data only through that unit's interface. Python enforces this by
convention (for example, a leading underscore marks an attribute as internal) rather than by
strict access control.

Main Concepts of Object-Oriented Programming (OOP)

• Class
• Objects
• Polymorphism
• Encapsulation
• Inheritance
OOP in Python encourages reusable code, which supports the DRY (Don't Repeat
Yourself) principle.
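
The concepts listed above can be illustrated with a minimal sketch. The class names Animal and Dog and their methods are hypothetical, chosen only for illustration; they are not part of the training material.

```python
class Animal:
    """Class: bundles data (a name) with behaviour (speak)."""

    def __init__(self, name):
        # Encapsulation by convention: the leading underscore marks
        # the attribute as internal to the class.
        self._name = name

    @property
    def name(self):
        # Read-only access to the internal attribute.
        return self._name

    def speak(self):
        return f"{self._name} makes a sound"


class Dog(Animal):
    """Inheritance: Dog reuses everything defined in Animal."""

    def speak(self):
        # Polymorphism: same method name, different behaviour.
        return f"{self._name} barks"


# Objects: instances created from the classes.
animals = [Animal("Generic"), Dog("Rex")]
sounds = [animal.speak() for animal in animals]
print(sounds)
```

The same call, animal.speak(), produces different results depending on the class of the object, while the shared __init__ and name code is written only once.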

Multipurpose language:

Python is a high-level, general-purpose programming language that supports both
object-oriented and procedural programming. It is multi-paradigm: you can write programs or
libraries that are largely procedural, object-oriented, or functional. Python is not a purely
functional language, but it has functional features such as first-class functions, lambda
expressions, and built-ins like map and filter. Procedural programs can be written using
control structures such as if statements and for and while loops.

Why learn Python:

1. Python is one of the easiest programming languages to learn.
2. Learning Python can improve your job prospects.
3. It is free and open source.
4. It is portable across platforms.
5. Python is highly versatile.
6. Python is a very productive language.
7. Python skills can command a high salary.

Disadvantages:
• Poor Memory Efficiency. Python trades memory for developer convenience, so programs can
use a lot of memory; this can be problematic when developing apps where memory use
must be optimized.
• Slow Speed. ...
• Database Access. ...
• Weak in Mobile Computing. ...
• Runtime Error

Uses of Python:

Python is commonly used for developing websites and software, task automation, data
analysis, data visualization, and machine learning.
Games can be developed with pygame, a third-party library that is installed separately.
Google also uses Python for data analysis, machine
learning, and artificial intelligence.

Application of python:
• Web Development.
• Game Development.
• Machine Learning and Artificial Intelligence.
• Data Science and Data Visualization.
• Desktop GUI.
• Web Scraping Applications.
• Business Applications.
• Audio and Video Applications.

Why use Python in machine learning:

Python offers several specific benefits for ML and AI work: a rich set of frameworks and
libraries and strong community support. As a programming language, it is quick to develop
in and easy to learn, with readable code and good cross-platform compatibility.

DATA TYPE IN PYTHON

Data types are the classification or categorization of data items. A data type represents the
kind of value and determines what operations can be performed on a particular piece of data.
Since everything is an object in Python programming, data types are actually classes, and
variables are instances (objects) of these classes.
The following are the standard or built-in data types of Python:
• Numeric
• Sequence type
• Boolean
• Set
• Dictionary

Fig. 1.1 data type in python

Text Type:        str
Numeric Types:    int, float, complex
Sequence Types:   list, tuple, range
Mapping Type:     dict
Set Types:        set, frozenset
Boolean Type:     bool
Binary Types:     bytes, bytearray, memoryview
None Type:        NoneType

Table 1: Data type in python
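
As a quick check of the types in Table 1, the built-in type() function reports the class of any value. The sample values below are arbitrary and chosen only for illustration.

```python
# One sample value for each built-in type listed in Table 1.
samples = [
    "hello",             # str
    42,                  # int
    3.14,                # float
    2 + 3j,              # complex
    [1, 2, 3],           # list
    (1, 2, 3),           # tuple
    range(3),            # range
    {"a": 1},            # dict
    {1, 2, 3},           # set
    frozenset({1, 2}),   # frozenset
    True,                # bool
    b"abc",              # bytes
    bytearray(b"abc"),   # bytearray
    memoryview(b"abc"),  # memoryview
    None,                # NoneType
]

# type(value) returns the class of the value; __name__ is the class name.
type_names = [type(value).__name__ for value in samples]
for value, name in zip(samples, type_names):
    print(f"{name:<12} {value!r}")
```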

KEYWORD IN PYTHON:

Every programming language has special reserved words, or keywords, that have specific
meanings and restrictions around how they should be used. Python is no different. Python
keywords are the fundamental building blocks of any Python program.

Keyword     Description

and         A logical operator
as          To create an alias
assert      For debugging
break       To break out of a loop
class       To define a class
continue    To continue to the next iteration of a loop
def         To define a function
del         To delete an object
elif        Used in conditional statements, same as else if
else        Used in conditional statements
except      Used with exceptions, what to do when an exception occurs
False       Boolean value, result of comparison operations
finally     Used with exceptions, a block of code that will be executed
            no matter if there is an exception or not
for         To create a for loop

Table 2: python keywords
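
The full set of reserved words for the running interpreter can be inspected with the standard-library keyword module. The short sketch below assumes Python 3, where print is an ordinary function and not a keyword.

```python
import keyword

# kwlist holds every reserved word of the running Python version.
reserved = keyword.kwlist
print(len(reserved), "keywords in this Python version")

# iskeyword() tells whether a given string is reserved.
checks = {word: keyword.iskeyword(word) for word in ("for", "finally", "print")}
print(checks)
```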

OPERATOR IN PYTHON:

Python divides the operators in the following groups:

• Arithmetic operators
• Assignment operators
• Comparison operators
• Logical operators
• Identity operators
• Membership operators
• Bitwise operators

Arithmetic operators

Arithmetic operators are used with numeric values to perform common mathematical
operations. Some of them are:

Operator Name Example


+ Addition x+y
- Subtraction x-y
* Multiplication x*y
/ Division x/y
% Modulus x%y
** Power x**y
// Floor division x//y

Table 3: Arithmetic operator

Assignment operators

Operator Example Same as


= x=5 x=5
+= x+=3 x=x+3
-= x-=3 x=x-3
*= x*=3 x=x*3
/= x/=3 x=x/3
%= x%=3 x=x%3
//= x//=3 x=x//3
**= x**=3 x=x**3
&= x&=3 x=x&3
|= x|=3 x=x|3
^= x^=3 x=x^3
>>= x>>=3 x=x>>3
<<= x<<=3 x=x<<3

Table 4: Assignment operator
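
The arithmetic and assignment operators from Tables 3 and 4 can be tried out as follows. The values of x, y and total are arbitrary examples.

```python
x, y = 7, 2

# Arithmetic operators from Table 3.
results = {
    "x + y": x + y,    # addition        -> 9
    "x - y": x - y,    # subtraction     -> 5
    "x * y": x * y,    # multiplication  -> 14
    "x / y": x / y,    # division        -> 3.5 (always a float)
    "x % y": x % y,    # modulus         -> 1
    "x ** y": x ** y,  # power           -> 49
    "x // y": x // y,  # floor division  -> 3
}
for expression, value in results.items():
    print(expression, "=", value)

# Assignment operators from Table 4 update a variable in place.
total = 5
total += 3    # same as total = total + 3   -> 8
total *= 2    # same as total = total * 2   -> 16
total //= 5   # same as total = total // 5  -> 3
print(total)
```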

Comparison operators

Comparison operators are used to compare two values. These are listed below:

Operator Name Example


== Equal x==y
!= Not equal x!=y
> Greater than x>y
< Less than x<y
>= Greater than or equal to x>=y
<= Less than or equal to x<=y

Table 5: Comparison operator

Logical operators

Logical operators are used to combine conditional statements. These are listed below:

Operator   Description                               Example

and        True if both statements are true          x<5 and x<10
or         True if at least one statement is true    x<5 or x<10
not        Reverses the result                       not(x<5 or x<10)

Table 6: Logical operator

Identity operators

Identity operators are used to compare the objects, not if they are equal, but if they are
actually the same object, with the same memory location. These are listed below:

Operator   Description                                              Example

is         Returns True if both variables are the same object       x is y
is not     Returns True if both variables are not the same object   x is not y

Table 7: Identity operator
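
A short sketch of the comparison, logical and identity operators described above. The variable names and values are arbitrary; the point of the last part is the difference between equality (==) and identity (is).

```python
x, y = 3, 8

# Comparison operators return a Boolean value.
print(x == y, x != y, x < y, x >= y)

# Logical operators combine conditions.
both = x < 5 and x < 10           # True: both conditions hold
either = x < 5 or x > 10          # True: the first condition holds
negated = not (x < 5 or x < 10)   # False: reverses a True result

# Identity operators compare objects, not values.
a = [1, 2, 3]
b = a            # b refers to the same list object as a
c = [1, 2, 3]    # c is a new list with equal contents
same_value = a == c      # True: the contents are equal
same_object = a is c     # False: two separate objects
alias = a is b           # True: one object, two names
print(same_value, same_object, alias)
```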

Membership operators

Operator   Description                                                   Example

in         True if the specified value is present in the sequence        x in y
not in     True if the specified value is not present in the sequence    x not in y

Table 8: Membership operator

Bitwise operators

Bitwise operators are used to operate on the individual bits of (binary) numbers. These are listed below:

Operator   Name   Description

&          AND    Sets each bit to 1 if both bits are 1
|          OR     Sets each bit to 1 if at least one of the two bits is 1
^          XOR    Sets each bit to 1 only if exactly one of the two bits is 1
~          NOT    Inverts all bits

Table 9: Bitwise operator
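
The membership and bitwise operators from Tables 8 and 9 behave as sketched below. The list contents and the two numbers are arbitrary examples; binary literals are used so the individual bits are visible.

```python
# Membership operators test whether a value occurs in a container.
fruits = ["apple", "pear"]
has_apple = "apple" in fruits       # True
no_mango = "mango" not in fruits    # True

# Bitwise operators work bit by bit on integers.
a, b = 0b1100, 0b1010   # 12 and 10 in decimal
and_bits = a & b        # 0b1000 = 8
or_bits = a | b         # 0b1110 = 14
xor_bits = a ^ b        # 0b0110 = 6
not_bits = ~a           # -13, because ~n equals -(n + 1) for Python integers

print(has_apple, no_mango)
print(bin(and_bits), bin(or_bits), bin(xor_bits), not_bits)
```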

LOOPS IN PYTHON

Python provides loop statements to handle repeated execution. There are two of them, the
while loop and the for loop. Both repeat a block of statements, but they differ in syntax and
in how the repetition is controlled: a while loop tests a condition before each iteration, and
a for loop takes successive items from a sequence or other iterable.

Types of loops:
1. while loop
2. for loop

1.) While loop

Syntax:

while expression:
    statement(s)

2.) For loop

Syntax:

for iterator_var in sequence:
    statement(s)

• It can be used to iterate over a range and iterators.
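
A minimal runnable sketch of both loop forms shown above. The variable names and values are illustrative only.

```python
# while loop: repeats as long as the condition is true.
countdown = []
n = 3
while n > 0:
    countdown.append(n)
    n -= 1

# for loop over a range of numbers (1 up to, but not including, 5).
squares = []
for i in range(1, 5):
    squares.append(i * i)

# for loop over the items of a list.
for name in ["numpy", "pandas"]:
    print(name)

print(countdown, squares)
```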

Advantages of Python:
• Easy to learn, read and write. Python is a high-level programming language that has
English-like syntax.
• Interpreted language.
• Very simple syntax.
• User-friendly data structures.
• Supports both object-oriented and procedural programming.
• Dynamically typed language.

MACHINE LEARNING:

Fig.2.1 Machine learning

Machine learning is an automated process that enables machines to solve
problems with little or no human input, and take actions based on past observations.

While artificial intelligence and machine learning are often used interchangeably, they are two
different concepts. AI is the broader concept (machines making decisions, learning new skills,
and solving problems in a similar way to humans), whereas machine learning is a subset of AI
that enables intelligent systems to autonomously learn new things from data.

Instead of explicitly programming a machine to perform a task, you can feed a machine learning
algorithm examples of labelled data (known as training data), which helps it make calculations,
process data, and identify patterns automatically.

Put simply, Google’s Chief Decision Scientist describes machine learning as a fancy labelling
machine. After teaching machines to label things like apples and pears, by showing them
examples of fruit, eventually they will start labelling apples and pears without any help,
provided they have learned from appropriate and accurate training examples.

Machine learning can be put to work on massive amounts of data and, on some well-defined
tasks, can perform more accurately than humans. It can help you save time and money on tasks
and analyses such as resolving customer pain points to improve customer satisfaction,
automating support tickets, and mining data from internal sources and from across the internet.

TYPE OF MACHINE LEARNING:

1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
5. Deep learning

1.) Supervised learning

Supervised learning algorithms and supervised learning models make predictions based on
labelled training data. Each training sample includes an input and a desired output. A
supervised learning algorithm analyses this sample data and makes an inference – basically, an
educated guess when determining the labels for unseen data.
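
To make the idea of learning from labelled samples concrete, here is a minimal sketch of one simple supervised method, a nearest-neighbour classifier, written in plain Python. The fruit measurements (weight in grams, height in centimetres) and the function name predict are made up for illustration; this is not the method used in the project.

```python
import math

# Labelled training data: each sample is (features, label).
# Features are (weight in grams, height in cm); the values are invented.
training = [
    ((150.0, 7.0), "apple"),
    ((170.0, 7.5), "apple"),
    ((120.0, 9.5), "pear"),
    ((130.0, 10.0), "pear"),
]


def predict(features):
    """Return the label of the training sample closest to the given features."""
    nearest = min(training, key=lambda sample: math.dist(sample[0], features))
    return nearest[1]


# Unseen inputs: the model infers a label from the closest known example.
prediction_a = predict((160.0, 7.2))
prediction_b = predict((125.0, 9.8))
print(prediction_a, prediction_b)
```

Each prediction is an inference from the labelled examples: the closer an unseen fruit is to a known apple or pear, the more likely it gets that label.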

2.) Unsupervised learning

Unsupervised learning algorithms uncover insights and relationships in unlabelled data. In this
case, models are fed input data but the desired outcomes are unknown, so they have to make
inferences based on circumstantial evidence, without any guidance or training. The models are
not trained with the “right answer,” so they must find patterns on their own.
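
As a contrast with the supervised case, the sketch below groups unlabelled numbers into two clusters with a simplified one-dimensional k-means procedure. The function name kmeans_1d, the data values and the starting centers are assumptions chosen for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers (k-means in one dimension)."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to the nearest center.
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Unlabelled data: no "right answer" is supplied for any value.
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(groups)
```

No labels are given, yet the procedure separates the small values from the large ones purely from the structure of the data.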

3.) Semi supervised learning

In semi-supervised machine learning, the training data is split into two parts: a small amount
of labelled data and a larger set of unlabelled data.

In this case, the model uses the labelled data to make inferences about the unlabelled
data, which can give more accurate results than a supervised model trained on the small
labelled set alone.

4.) Reinforcement learning

Reinforcement learning is concerned with how a software agent (or computer program) ought
to act in a situation to maximize the reward. In short, reinforced machine learning models
attempt to determine the best possible path they should take in a given situation. They do this
through trial and error. Since there is no labelled training data, machines learn from their own
mistakes and choose the actions that lead to the best solution or maximum reward.

5.) Deep learning

Deep learning models can be supervised, semi-supervised, or unsupervised (or a combination
of any or all of the three). They are advanced machine learning algorithms used by tech giants
like Google, Microsoft, and Amazon to run entire systems and power things like self-driving
cars and smart assistants.

Fig.2.2 Deep learning

MACHINE LEARNING WORKING PROCESS:

In the machine learning process, data is given to the machine in numeric (ultimately binary)
form; the machine then processes it with a learning algorithm and produces a trained model as
output. Machine learning uses two main types of techniques: supervised learning, which trains
a model on known input and output data so that it can predict future outputs, and unsupervised
learning, which finds hidden patterns or intrinsic structures in input data. Machine learning
makes the computer learn by studying data and statistics, and it is a step in the direction of
artificial intelligence (AI). A machine learning program analyses data and learns to predict the
outcome.

Fig.2.3 working of machine learning

APPLICATION OF MACHINE LEARNING:


1. Machine learning helps increase your efficiency. ...
2. You can understand your customers better. ...
3. You can personalize your marketing campaigns. ...
4. Machine learning recommends products to your customers. ...
5. Machine learning helps to detect fraud.

VARIOUS LIBRARIES USED IN MACHINE LEARNING:

NumPy:

Fig. 3.1 NumPy

NumPy stands for ‘Numerical Python’. It is an open-source Python library used to perform
various mathematical and scientific tasks. It contains multi-dimensional arrays and matrices,
along with many high-level mathematical functions that operate on these arrays and matrices.
NumPy is the fundamental package for scientific computing in Python. It is a Python library
that provides a multidimensional array object, various derived objects (such as masked arrays
and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms,
basic linear algebra, basic statistical operations, random simulation and much more.

At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional
arrays of homogeneous data types, with many operations being performed in compiled code
for performance. There are several important differences between NumPy arrays and the
standard Python sequences:

• NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
• The elements in a NumPy array are all required to be of the same data type, and thus
will be the same size in memory. The exception: one can have arrays of (Python,
including NumPy) objects, thereby allowing for arrays of different sized elements.
• NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with less
code than is possible using Python’s built-in sequences.
• A growing plethora of scientific and mathematical Python-based packages are using
NumPy arrays; though these typically support Python-sequence input, they convert
such input to NumPy arrays prior to processing, and they often output NumPy arrays.
In other words, in order to efficiently use much (perhaps even most) of today’s
scientific/mathematical Python-based software, just knowing how to use Python’s
built-in sequence types is insufficient - one also needs to know how to use NumPy
arrays.
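
The differences listed above can be seen in a minimal sketch. The values below are made up for illustration and do not come from the project dataset.

```python
import numpy as np

# Hypothetical values, used only to illustrate the points above.
values = [1, 2, 3, 4]
arr = np.array(values)

# All elements share one data type (an integer type here).
print(arr.dtype.kind)

# Element-wise arithmetic needs no explicit Python loop.
doubled = arr * 2
print(doubled.tolist())

# "Resizing" returns a new array; the original keeps its size.
extended = np.append(arr, 5)
print(arr.shape, extended.shape)
```

With a plain Python list, the doubling step would need a loop or a list comprehension.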

Installing NumPy:

pip install numpy

How to import NumPy?

import numpy as np


Pandas:

Fig. 3.2 Pandas

Pandas is an open-source library built on top of the NumPy library. It is a Python
package that offers various data structures and operations for manipulating numerical data
and time series. It is popular mainly because it makes importing and analysing data much
easier. Pandas is fast and offers high performance and productivity for users.

Getting Started

The first step in working with pandas is to check whether it is installed. If it is not,
we need to install it using the pip command. Open a command prompt (type cmd in the
search box) and use the cd command to move to the folder where pip has been installed.
After locating it, type the command:

pip install pandas


After pandas has been installed, you need to import the library. This
module is generally imported as:

import pandas as pd

Matplotlib:

Fig. 3.3 matplotlib


Matplotlib is an easy-to-use visualization library for Python. It is built on
NumPy arrays, is designed to work with the broader SciPy stack, and supports several
kinds of plots such as line, bar, scatter and histogram.
We can import matplotlib simply as:

import matplotlib.pyplot as plt

Basic plots in Matplotlib :


Matplotlib comes with a wide variety of plots. Plots help us to understand trends and patterns
and to spot correlations. They are typically instruments for reasoning about quantitative
information. Some of the sample plots are covered here.
Bar plot:

Fig. 3.4 Bar plot

Line plot:

Fig. 3.5 Line plot

Histogram:

Fig. 3.6 Histogram

Box plot:

Fig. 3.7 Box plot
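
The four plot types named above could be produced with a sketch like the following. The data values and labels are hypothetical, and the non-interactive "Agg" backend is an assumption made so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed
import matplotlib.pyplot as plt

# Hypothetical values for illustration only.
data = [3, 7, 5, 9, 4]
labels = ["a", "b", "c", "d", "e"]

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].bar(labels, data)      # bar plot
axes[0, 0].set_title("Bar plot")

axes[0, 1].plot(data)             # line plot
axes[0, 1].set_title("Line plot")

axes[1, 0].hist(data, bins=3)     # histogram
axes[1, 0].set_title("Histogram")

axes[1, 1].boxplot(data)          # box plot
axes[1, 1].set_title("Box plot")

fig.tight_layout()
```

In an interactive session, plt.show() would display the figure.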


Scikit learn:

Fig. 3.8 scikit learn


Scikit learn is an open-source Python library that implements a range of machine learning,
pre-processing, cross-validation, and visualization algorithms using a unified interface.
Important features of scikit-learn:
• Simple and efficient tools for data mining and data analysis. It features various
classification, regression and clustering algorithms including support vector
machines, random forests, gradient boosting, k-means, etc.
• Accessible to everybody and reusable in various contexts.
• Built on the top of NumPy, SciPy, and matplotlib.
• Open source, commercially usable – BSD license.

Installation:

pip install -U scikit-learn

Seaborn:

Seaborn is a visualization library for statistical graphics plotting in Python. It
provides attractive default styles and colour palettes for statistical plots.
It is built on top of the matplotlib library and is closely integrated with the data
structures from pandas.
Seaborn aims to make visualization the central part of exploring and understanding data. It
provides dataset-oriented APIs, so that we can switch between different visual
representations of the same variables for a better understanding of the dataset.

Different categories of plot in Seaborn


Plots are basically used for visualizing the relationship between variables. Those variables
can be either completely numerical or a category like a group, class or division. Seaborn
divides plots into the categories below:

• Relational plots: These plots are used to understand the relation between two
variables.


• Categorical plots: These plots deal with categorical variables and how they can be
visualized.
• Distribution plots: These plots are used for examining univariate and bivariate
distributions.
• Regression plots: The regression plots in seaborn are primarily intended to add a
visual guide that helps to emphasize patterns in a dataset during exploratory data
analysis.
• Matrix plots: These plots show matrix-shaped data as colour-coded cells, as in
heatmaps and cluster maps.
• Multi-plot grids: A useful approach is to draw multiple instances of the
same plot on different subsets of the dataset.


Conclusion:
The growing popularity of machine learning calls for efficient tools, and scikit-learn (sklearn)
in Python serves beginners as well as practitioners solving supervised learning problems.
Its efficiency and versatility make scikit-learn one of the prime choices of academic and
industrial organizations for performing various machine learning operations.


Project (Medical Cost Insurance)

About the dataset: The dataset contains 1339 entries and 7 columns, such as age, sex, bmi,
children, smoker, region and charges.

Table: 11 Dataset
Loading the dataset:

Fig. 4.1 loading of dataset


Data Analysis:

Fig. 4.2 project data analysis


Collection Information:

Fig. 4.3 Data Collection

To use the describe method in pandas, type the statement shown below:

Fig. 4.4 Statistical measures of the dataset
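
As a minimal sketch of what a describe call returns, the example below uses a tiny hypothetical DataFrame. The rows are invented for illustration and are not taken from the insurance dataset; only the column names follow the report.

```python
import pandas as pd

# Hypothetical rows for illustration only; not the report's data.
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "bmi": [27.9, 22.7, 30.1, 25.3],
    "smoker": ["yes", "no", "no", "yes"],
})

# By default describe() summarises only the numeric columns
# (count, mean, std, min, quartiles and max).
summary = df.describe()
print(summary)
```

The text column smoker is left out of the summary unless include="all" is passed.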


Data visualisation:

Fig. 4.5 Data Visualisation

Fig. 4.6 Data Distribution


Gender visualisation:

Fig. 4.7 Gender visualisation


Fig. 4.8 Gender Analysis

BMI Distribution:

Fig. 4.9 BMI Distribution


Count plot for number of children:

Fig. 4.10 Children column


Count plot for smoker column:

Fig. 4.11 smoker column analysis


Count plot for region column:

Fig. 4.12 Analysis of region column

Charges Distribution:

Fig. 4.13 charge distribution


DATA PROCESSING

Data pre-processing is a process of preparing the raw data and making it suitable for a machine
learning model. It is the first and crucial step while creating a machine learning model.

Why do we need Data Pre-processing?

Real-world data generally contains noise and missing values, and may be in an unusable format
that cannot be used directly by machine learning models. Data pre-processing is the set of
tasks required to clean the data and make it suitable for a machine learning model, which also
increases the accuracy and efficiency of the model.

Steps of data pre-processing:

o Getting the dataset
o Importing libraries
o Importing datasets
o Finding Missing Data
o Encoding Categorical Data
o Splitting dataset into training and test set
o Feature scaling

Fig. 5.1 Data processing


After encoding of categorical values:

Fig. 5.2 Encoding of data
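
The encoding step could look like the sketch below. The rows and the integer codes are assumptions chosen for illustration; the report's figure may use a different mapping.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the report.
df = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "smoker": ["yes", "no", "no"],
    "region": ["southeast", "northwest", "southwest"],
})

encoded = df.copy()

# Two-valued columns: map each category to an integer.
# The specific codes (0/1) are an arbitrary assumption.
encoded["sex"] = encoded["sex"].map({"male": 0, "female": 1})
encoded["smoker"] = encoded["smoker"].map({"yes": 0, "no": 1})

# Multi-valued column: one-hot encoding avoids implying an order.
encoded = pd.get_dummies(encoded, columns=["region"])

print(encoded.columns.tolist())
```

One-hot encoding adds one column per region, so the three original columns become five.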


Training and splitting of data:
Removing unused column from dataset

Fig. 5.3 training and splitting of data


The train-test split procedure is used to estimate the performance of machine learning
algorithms when they are used to make predictions on data not used to train the model.

It is a fast and easy procedure to perform, the results of which allow you to compare the
performance of machine learning algorithms for your predictive modelling problem. Although
simple to use and interpret, there are times when the procedure should not be used, such as
when you have a small dataset and situations where additional configuration is required, such
as when it is used for classification and the dataset is not balanced.

Splitting of data:

Fig. 5.4 Data Splitting


Model training:

The process involved in training a linear regression model is similar in many ways to how
other machine learning models are trained. We need to work on a training data set and model
the relationship between its variables in a way that does not reduce the ability of the model
to predict new data samples. The model is trained to improve its prediction equation
continuously. This is done by iterating over the given dataset repeatedly. On every iteration,
the bias and weight values are updated together in the direction indicated by the gradient of
the cost function. Training is complete when an error threshold is reached or when further
training iterations no longer reduce the cost.
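
The training loop described above can be sketched with plain NumPy. This is an illustration under stated assumptions, not the code used in the project: the data, the learning rate and the tolerance are all hypothetical, and mean squared error is assumed as the cost function.

```python
import numpy as np

# Hypothetical data that follows y = 2x + 1 exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

weight, bias = 0.0, 0.0
learning_rate = 0.05   # assumed step size
tolerance = 1e-10      # stop when the cost no longer drops by this much
previous_cost = float("inf")

for step in range(10_000):
    error = weight * x + bias - y
    cost = np.mean(error ** 2)            # mean squared error
    if previous_cost - cost < tolerance:  # no meaningful reduction in cost
        break
    previous_cost = cost
    # Gradients of the cost with respect to weight and bias.
    grad_weight = 2.0 * np.mean(error * x)
    grad_bias = 2.0 * np.mean(error)
    # Update both values together, against the gradient.
    weight -= learning_rate * grad_weight
    bias -= learning_rate * grad_bias

print(round(float(weight), 2), round(float(bias), 2))
```

On this data the loop should stop with a weight close to 2 and a bias close to 1.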

REGRESSION:

Regression is a learning technique whose objective is to predict continuous output values. In
other words, it is used in situations in which we need to fit data to a numeric value. For
example, it is often used to estimate the price of different items; in this project it is used
to predict medical insurance charges.

Linear Regression:
It is one of the machine learning techniques that fall under supervised learning. The rise in the
demand and use of machine learning techniques is behind the sudden upsurge in the use of
linear regression in several areas.
When do we use linear regression:
Linear regression is appropriate when certain conditions hold. The most important of these
conditions is the existence of a linear relationship between the variables of your data set,
which allows them to be plotted along a straight line. The differences between the predicted
values and the actual values (the residuals) should have constant variance. The observations
should be independent of each other, and the correlation between predictors should not be
too high.


Fig. 6.1 linear regression

Uses of linear regression:


The ease with which its results can be interpreted is one of the biggest advantages of linear
regression. Linear regression can be applied to all those data sets where
variables have a linear relationship.
Linear regression can also be used at different stages of the sourcing and production of a
product. These models are widely used in academic, scientific, and medical fields. For instance,
farmers can model a system that allows them to use environmental conditions to their benefit.
This will help them in working with the elements in such a way that they cause the minimum
damage to their crop yield and profit.
In addition to these, it can be used in healthcare, archaeology, and labour amongst other areas.

Formula of calculating linear regression:


It performs a regression task. Regression models a target prediction value based on
independent variables. It is mostly used for finding out the relationship between variables
and forecasting.
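
For a single independent variable, the fitted line has the form y = intercept + slope * x. The sketch below computes the least-squares slope and intercept directly; the data points are hypothetical and chosen to lie on y = 2x + 1.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Least-squares estimates for simple linear regression.
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean

# Forecast for a new value of the independent variable.
forecast = intercept + slope * 6.0
print(slope, intercept, forecast)
```

With several independent variables the same idea applies, with one coefficient per variable.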

Fig .6.2 Regression analysis

Logistic Regression:

Logistic regression is used to explain the relationship between one dependent binary variable
and one or more nominal, ordinal, interval or ratio-level independent variables.


Evaluation and prediction of model:

Prediction on training data:

Fig. 6.3 Prediction on training data


Prediction on testing data:

Fig. 6.4 Prediction on testing data


Training data is the initial dataset you use to teach a machine learning application to
recognize patterns or perform to your criteria, while testing or validation data is used to
evaluate your model’s accuracy. You’ll need a new dataset to validate the model because it
already “knows” the training data.

Fig. 6.5 training of data

Splitting of dataset:

Splitting the dataset into train and test sets is one of the important parts of data
pre-processing, because it lets us measure how well the model predicts unseen data and so
improve it. Suppose we train our model with a training set and then test it with a test set
taken from a completely different dataset: the model will not have learned the
correlations between the features of that dataset.

Fig. 6.6 splitting of data


Therefore, if we train and test the model with two unrelated datasets, the measured
performance of the model will decrease. Hence it is important to split a single dataset into
two parts, i.e., a train set and a test set.

In this way, we can easily evaluate the performance of our model. For example, if it performs
well on the training data but does not perform well on the test dataset, the model may be
overfitted.

For splitting the dataset, we can use the train_test_split function of scikit-learn.

o x_train: It is used to represent features for the training data
o x_test: It is used to represent features for the testing data
o y_train: It is used to represent the dependent variable for the training data
o y_test: It is used to represent the dependent variable for the testing data
o In the train_test_split() function, we have passed four parameters, of which the first two
are the arrays of data, and test_size specifies the size of the test set.
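
A minimal sketch of such a call is given below. The arrays are hypothetical, and both test_size=0.2 and random_state=0 are assumptions; random_state is used here only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 reserves 20% of the samples for testing.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
```

Every sample ends up in exactly one of the two sets.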

CLASSIFICATION :
Classification is the process of categorizing a given set of data into classes. It can be
performed on both structured and unstructured data. The process starts with predicting the
class of given data points. The classes are often referred to as targets, labels or categories.

The classification predictive modelling is the task of approximating the mapping function from
input variables to discrete output variables. The main goal is to identify which class/category
the new data will fall into.

Fig. 7.1 Classification in machine learning


Classification Terminologies in Machine Learning

• Classifier – It is an algorithm that is used to map the input data to a specific category.
• Classification Model – The model draws a conclusion from the input data given for
training and predicts the class or category for new data.
• Feature – A feature is an individual measurable property of the phenomenon being
observed.
• Binary Classification – It is a type of classification with two outcomes, for e.g. – either
true or false.
• Multi-Class Classification – The classification with more than two classes, in multi-
class classification each sample is assigned to one and only one label or target.
• Multi-label Classification – This is a type of classification where each sample is
assigned to a set of labels or targets.
• Initialize – It is to assign the classifier to be used for the
• Train the Classifier – Each classifier in scikit-learn uses the fit(X, y) method to fit the
model to the training data X and the training labels y.
• Predict the Target – For an unlabelled observation X, the predict(X) method returns the
predicted label y.
• Evaluate – This means evaluating the model, e.g. with a classification report,
accuracy score, etc.
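
The initialize, train, predict and evaluate steps above can be sketched as follows. The iris dataset and the LogisticRegression classifier are illustrative choices made for this sketch, not the ones used in the project.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris dataset stands in for real project data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000)    # initialize the classifier
clf.fit(X_train, y_train)                  # train the classifier
y_pred = clf.predict(X_test)               # predict the target
accuracy = accuracy_score(y_test, y_pred)  # evaluate

print(accuracy)
print(classification_report(y_test, y_pred))
```

Any other scikit-learn classifier can be swapped in, because they share the same fit and predict methods.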

Example :
The best example of an ML classification algorithm is Email Spam Detector. The main goal
of the Classification algorithm is to identify the category of a given dataset, and these
algorithms are mainly used to predict the output for the categorical data. Classification
algorithms can be better understood using the below diagram.

Fig. 7.2 Email spam detector


LOGISTIC REGRESSION:

It is a classification algorithm in machine learning that uses one or more independent variables
to determine an outcome. The outcome is measured with a dichotomous variable meaning it
will have only two possible outcomes.

The goal of logistic regression is to find the best-fitting relationship between the dependent
variable and a set of independent variables. An advantage over some other binary
classification algorithms, such as nearest neighbour, is that it quantitatively explains the
factors leading to classification.
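
Logistic regression turns a weighted sum of the independent variables into a probability with the sigmoid function and then applies a threshold. The sketch below shows that prediction step only; the coefficients and feature values are hypothetical, not fitted.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and inputs, for illustration only.
weights = np.array([0.8, -0.5])
bias = -0.2
features = np.array([[2.0, 1.0],
                     [0.5, 3.0]])

# Probability of the positive class for each row.
probabilities = sigmoid(features @ weights + bias)

# A threshold of 0.5 gives the two possible outcomes.
labels = (probabilities >= 0.5).astype(int)
print(probabilities.round(3), labels)
```

The sign and size of each weight show how the corresponding variable pushes the outcome.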

Fig. 7.3 Logistic regression

Advantages and Disadvantages

Logistic regression is specifically meant for classification. It is useful in understanding
how a set of independent variables affects the outcome of the dependent variable.

The main disadvantages of the logistic regression algorithm are that it only works when the
predicted variable is binary, it assumes that the data is free of missing values, and it
assumes that the predictors are independent of each other.

Use Cases

• Identifying risk factors for diseases
• Word classification
• Weather Prediction


CLASSIFIER:

It is an algorithm that is used to map the input data to a specific category.

Fig. 8.1 classifier in machine learning


Decision tree classifier:

o Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes, which are the Decision Node and the Leaf
Node. Decision nodes are used to make a decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.


o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
o The diagram below explains the general structure of a decision tree:

Fig. 8.2 Decision tree
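
To choose which question to ask at a node, CART implementations for classification commonly use the Gini impurity and prefer the split with the lowest weighted impurity. The sketch below illustrates that measure only; the labels are hypothetical and the helper names gini and split_gini are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, higher for mixed nodes."""
    total = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

def split_gini(left, right):
    """Weighted Gini impurity of a two-way split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)

# Hypothetical labels at a parent node.
parent = ["yes", "yes", "yes", "no", "no", "no"]

# A split that separates the classes perfectly.
perfect = split_gini(["yes", "yes", "yes"], ["no", "no", "no"])

# A split that leaves both children mixed.
mixed = split_gini(["yes", "yes", "no"], ["yes", "no", "no"])

print(gini(parent), perfect, round(mixed, 3))
```

The perfect split would be chosen because its weighted impurity is lower than that of the mixed split.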


Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.


Decision Tree Terminologies

Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further after
reaching a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: A node that is split into sub-nodes is called a parent node, and its
sub-nodes are called child nodes. The root node is the parent of the first split.

Fig. 8.3 Decision tree implementation

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process that a human follows while
making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.


Disadvantages of the Decision Tree

o A decision tree may contain many layers, which makes it complex.
o It may have an overfitting issue, which can be reduced by using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Random forest classifier: Random Forest is a popular machine learning algorithm that
belongs to the supervised learning technique. It can be used for both Classification and
Regression problems in ML. It is based on the concept of ensemble learning, which is a
process of combining multiple classifiers to solve a complex problem and to improve the
performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision
trees on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and predicts the final output based on the majority vote of those
predictions.

A greater number of trees in the forest generally leads to higher accuracy and reduces the
risk of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Fig. 8.4 Random Forest tree


Assumptions for Random Forest:

Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
Forest classifier:

o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

o It takes less training time than many other algorithms.
o It predicts output with high accuracy, and it runs efficiently even on large datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N
decision trees, and the second is to make predictions with each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.
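
The steps above can be sketched with scikit-learn decision trees trained on random samples. This is a simplified illustration and not the full Random Forest algorithm: the iris dataset, the values of N and K, and the helper name predict_majority are all assumptions, and N is fixed before the loop rather than in Step 3.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled iris dataset stands in for the training set.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

n_trees = 5        # N: number of decision trees (assumed)
sample_size = 100  # K: data points drawn for each tree (assumed)

trees = []
for _ in range(n_trees):
    # Step 1: select K random data points (with replacement).
    idx = rng.integers(0, len(X), size=sample_size)
    # Step 2: build a decision tree on the selected points.
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

def predict_majority(sample):
    """Step 5: collect each tree's prediction and return the majority vote."""
    votes = [int(tree.predict(sample.reshape(1, -1))[0]) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

prediction = predict_majority(X[0])
print(prediction)
```

In practice, scikit-learn's RandomForestClassifier performs these steps internally and also randomises the features considered at each split.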

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random Forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision. Consider the below image:


Fig. 8.5 Random Forest algorithm

Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest

o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and reduces the overfitting issue.


Disadvantages of Random Forest

o Although random forest can be used for both classification and regression tasks, it is
less suitable for regression tasks.

CROSS VALIDATION CONCEPT:

Cross-validation is a technique for validating the model efficiency by training it on the subset
of input data and testing on previously unseen subset of the input data. We can also say that it
is a technique to check how a statistical model generalizes to an independent dataset.

In machine learning, there is always a need to test the stability of the model. This means
that we cannot judge our model based only on the training dataset on which it was fitted.
For this purpose, we reserve a particular sample of the dataset that is not part of the
training dataset. After that, we test our model on that sample before deployment, and this
complete process comes under cross-validation. This is somewhat different from the general
train-test split.

Hence the basic steps of cross-validations are:

o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well
with the validation set, perform the further step, else check for the issues.

Methods used for Cross-Validation

There are some common methods that are used for cross-validation. These methods are given
below:

1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation


Fig. 9.1 methods of cross validation

Comparison of Cross-validation to train/test split in Machine Learning

o Train/test split: The input data is divided into two parts, a training set and a test
set, in a ratio such as 70:30 or 80:20. The resulting performance estimate can have high
variance, which is one of its biggest disadvantages.
    o Training Data: The training data is used to train the model, and the dependent
    variable is known.
    o Test Data: The test data is used to make predictions from the model that is
    already trained on the training data. It has the same features as the training data
    but is not part of it.
o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split
by splitting the dataset into groups of train/test splits and averaging the result. It can be
used if we want to optimize a model that has been trained on the training dataset for
the best performance. It makes more efficient use of the data than a single train/test
split, as every observation is used for both training and testing.
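
The idea of splitting into several train/test groups and averaging the result can be sketched as a hand-written k-fold loop. The target values, the helper name k_fold_indices and the use of the training mean as a stand-in "model" are assumptions made to keep the example short.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k folds over shuffled indices."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical target values.
y = np.array([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 10.0, 12.0, 11.0])

scores = []
for train_idx, test_idx in k_fold_indices(len(y), k=5):
    # Stand-in model: predict the mean of the training targets.
    prediction = y[train_idx].mean()
    # Score the fold with mean squared error on the held-out part.
    scores.append(np.mean((y[test_idx] - prediction) ** 2))

# The cross-validation result is the average over all folds.
average_score = float(np.mean(scores))
print(len(scores), round(average_score, 3))
```

Each observation appears in the test part of exactly one fold and in the training part of the others.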

Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given below:

o For the ideal conditions, it provides the optimum output. But for the inconsistent data,
it may produce a drastic result. So, it is one of the big disadvantages of cross-validation,
as there is no certainty of the type of data in machine learning.


o In predictive modelling, the data evolves over time, which can create differences
between the training set and the validation set. For example, if we create a model
for the prediction of stock market values and train it on the previous 5 years of
stock values, the realistic future values for the next 5 years may be drastically different,
so it is difficult to expect the correct output in such situations.

Applications of Cross-Validation

o This technique can be used to compare the performance of different predictive
modelling methods.
o It has great scope in the medical research field.
o It can also be used for meta-analysis, as it is already being used by data scientists
in the field of medical statistics.

Fig. 9.2 cross validation


Future Scope

• Creating the model with additional parameters such as Work Experience, Technical
Papers Written, and Content of Letter of Recommendation etc.
• Creating a model based on the graph of admitted vs enrolled students of previous years
to predict the increase or decrease in cut-off scores among applicants.
• Comparing different universities based on applied vs admitted data.

• The scope of Machine Learning is not limited to the investment sector.

• It is expanding across all fields such as banking and finance, information technology,
media & entertainment, gaming, and the automotive industry.

• As the Machine Learning scope is very high, there are some areas where researchers
are working toward revolutionizing the world for the future.
• Self-driving cars are built using Machine Learning, IoT sensors, high-definition
cameras, voice recognition systems, etc.
• In robotics, inventions were possible with the help of Machine Learning and Artificial
Intelligence.

• The scope of Machine Learning in India, as well as in other parts of the world, is high
in comparison to other career fields when it comes to job opportunities.
• The progress in the field of Artificial Intelligence and Machine Learning has made it
possible to achieve the goal of computer vision faster.
• Machine Learning is expected to improve the speed and capability of the automation
systems used in various technologies.
• This helps organizations form effective business strategies based on the predictions
of ML algorithms.
• One of the most fascinating developments related to Machine Learning is quantum
computing. Experts believe that quantum computing has the potential to boost the
capabilities of Machine Learning manifold. As an interesting fact, in 2019 Google
reported that its quantum processor performed in about 200 seconds a task that it
estimated would take a classical supercomputer about 10,000 years.


Conclusion

• After the final submission of test data, the model's accuracy score was 87%.
• Graphical representation of the data provided useful insights and led to the choice
of a better model.
• Linear Regression worked best for this dataset because the data was linearly correlated.
• Machine learning is a powerful tool for making predictions from data.
• The aim of machine learning is to automate analytical model building and enable
computers to learn from data without being explicitly programmed to do so.
• It is important to remember that machine learning is only as good as the data used to
train the algorithms. To make accurate predictions, it is important to use high-quality
data that is representative of the real-world data on which the algorithm will be
used.
• Machine Learning is a technique of training machines to perform activities a human
brain can do, albeit somewhat faster and better than an average human being.
• Today we have seen that machines can beat human champions in games such as Chess
and Go (the latter with AlphaGo), which are considered very complex.
• Machines can be trained to perform human activities in several areas and can aid
humans in living better lives.
• When a smaller amount of clearly labelled data is available for training, opt for
Supervised Learning. Unsupervised Learning would generally give better performance and
results for large data sets.
• We also looked at the choices of various development languages, IDEs, and platforms.
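
The point above about Linear Regression suiting linearly correlated data can be checked
before choosing a model by computing the Pearson correlation coefficient between a
feature and the target. The sketch below is only an illustration under assumptions: the
function name pearson_r and all values are hypothetical and are not taken from the
dataset used in this report.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, not taken from the report's dataset.
feature = [300.0, 305.0, 310.0, 315.0, 320.0, 325.0, 330.0]
target_linear = [0.45, 0.52, 0.58, 0.66, 0.71, 0.80, 0.86]
# A symmetric, non-linear relation that a straight line cannot capture.
target_curved = [(x - 315.0) ** 2 for x in feature]

r_linear = pearson_r(feature, target_linear)
r_curved = pearson_r(feature, target_curved)
print("r (near-linear):", round(r_linear, 3))
print("r (curved):", round(r_curved, 3))
```

A coefficient close to 1 or -1 suggests that a linear model is a reasonable first choice,
while a value near 0 indicates that a straight line will not describe the relationship
well, even if the variables are strongly related in a non-linear way.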


