MCAD2226 DATA ANALYSIS USING R

SYLLABUS

UNIT 1: INTRODUCTION TO DATA SCIENCE

What is Data Science, Scenarios on Data Science, Data Science and Organization, Different
types of data, Structured data, Unstructured data, Machine generated data, Understanding on
Data Science Process, Explain on Research Goal, Data Processing on Data Science, Getting
Start With R, Overview of R, Why R for Data Science, Eclipse, Live-R, Project Workspace
Setup, Understanding on R Packages, Load Libraries and Installed Packages.

UNIT 2: WORKING WITH R PROGRAMMING

Data Types and Syntax, Processing on Variables, Data Items on Structure, Classes and
Manipulate Objects, Control statements IF, ELSE, SWITCH, Loop statements, FOR, WHILE,
REPEAT, Working with String and Date, Understanding on Vector, List, Data Frames,
Working with Arrays, Read and Write data from CSV, Tabular Data and Database.

UNIT 3: CLASSIFICATION IN R

Classification - Introduction, Types of Classification, Application of Classification, Overview
of Decision Tree (DT), Naïve Bayes, KNN, Random Forest, Introduction – DT, DT Algorithm,
Example of DT with R, Introduction – Naïve Bayes, Naïve Bayes Algorithm, Example of Naïve
Bayes with R, Introduction – KNN, KNN Algorithm, Example of KNN with R, Introduction –
Random Forest, Random Forest Algorithm, Example of Random Forest with R.

UNIT 4: CLUSTERING IN R

Clustering - Introduction, Types of Clustering, Application of Clustering, Overview of
K-means, Hierarchical, Medoids, DBSCAN, Packages, Introduction – K-means, K-means
Algorithm, Example of K-means with R, Introduction – Hierarchical, Hierarchical Algorithm,
Example of Hierarchical with R.

UNIT 5: DATA VISUALIZATION IN R

Overview of Data Visualization, Packages, Interactive Graphics, Plotting, Scatter plot, Box
plot, Bar plot, Pie chart, Histogram, XKCD-Style Plots, Heat Maps, Introduction to predictive
models, What is a Model and how to build a model.

TEXTBOOK:

1. R for Data Science by Hadley Wickham


2. Introduction to Data Science, R. Irizarry

REFERENCE:

1. R Programming for Data Science, Roger D Peng


2. Data Visualization: A practical introduction, by Kieran Healy
CONTENTS

MODULE -1

1.1 Introduction to Data Science


1.2 Components of Data Science
1.3 Applications of Data Science
1.4 Tools for Data Science
1.5 Data
1.6 Structured Data
1.7 Unstructured Data
1.8 Machine Generated Data

MODULE- 2

2.1 Data Science Process


2.2 Introduction to R
2.3 Applications of R
2.4 R Installation
2.5 Packages in R
2.6 Working With R Environment

MODULE- 3

3.1 Variables in R Programming


3.2 Data Types
3.3 Control Statements in R
3.4 R Switch Statement
3.5 For Loop
3.6 Repeat Loop
3.7 While Loop
3.8 R – Strings

3.9 Date & Time


MODULE-4

4.1 Data Structures in R Programming


4.1.1 R Vectors
4.1.2 R List
4.1.3 R Data Frame
4.1.4 R Arrays
4.1.5 R Factors
4.2 R CSV Files

MODULE -5

5.1 Classification Algorithm


5.2 Types of Classification Algorithms
5.3 Decision Tree Classification
5.4 Naïve Bayes Classifier

MODULE -6

6.1 K-Nearest Neighbor(KNN) Algorithm

6.2 Random Forest Algorithm

MODULE -7

7.1 Clustering In Machine Learning


7.2 K-Means Clustering

MODULE-8

8.1 DBScan Clustering


8.2 Hierarchical Clustering
MODULE 9

9.1 Data Visualization


9.2 R Visualization Packages
9.3 Interactive R Graphics
9.4 Data Visualization Graphs in R

MODULE-10

10.1 Introduction to Predictive Models


10.2 Process of Predictive Model
10.3 Types of Predictive Models
10.4 Applications of Predictive Modelling
MCAD2226 DATA ANALYSIS USING R

MODULE 1

1.1 Introduction to Data Science

1.2 Components of Data Science

1.3 Applications of Data Science

1.4 Tools for Data Science

1.5 Data

1.6 Structured Data

1.7 Unstructured Data

1.8 Machine Generated Data

1 SRMIST DDE Self Learning Material


1.1 INTRODUCTION TO DATA SCIENCE

Data science is the in-depth study of large amounts of data. It involves
extracting meaningful insights from raw, structured, and unstructured data
that is processed using the scientific method, different technologies, and
algorithms.

It is a multidisciplinary field that uses tools and techniques to manipulate
data so that you can find something new and meaningful.

1.2 COMPONENTS OF DATA SCIENCE

The main components of data science are given below:

1. Statistics: Statistics is one of the most important components of data
science. It is a way to collect and analyze large amounts of numerical data
and to find meaningful insights in it.

2. Domain expertise: Domain expertise binds data science together. It means
specialized knowledge or skills in a particular area. Data science is applied
in many areas, and each of them needs domain experts.

3. Data engineering: Data engineering is the part of data science that
involves acquiring, storing, retrieving, and transforming data. It also
includes adding metadata (data about data) to the data.

4. Visualization: Data visualization means representing data in a visual
context so that people can easily understand its significance. Visuals make
large amounts of data easy to take in.

5. Advanced computing: Advanced computing does the heavy lifting of data
science. It involves designing, writing, debugging, and maintaining the
source code of computer programs.

6. Mathematics: Mathematics is a critical part of data science. It involves
the study of quantity, structure, space, and change. A good knowledge of
mathematics is essential for a data scientist.

7. Machine learning: Machine learning is the backbone of data science. It is
about training a machine so that it can act like a human brain. In data
science, various machine learning algorithms are used to solve problems.

1.3 APPLICATIONS OF DATA SCIENCE

o Image recognition and speech recognition:
Data science is currently used for image and speech recognition. When
you upload an image on Facebook, you start getting suggestions to tag
your friends. This automatic tagging suggestion uses an image
recognition algorithm, which is part of data science. When you speak
to a voice assistant such as "Ok Google", Siri, or Cortana and the
device responds to your voice, this is made possible by a speech
recognition algorithm.

o Gaming world:
In the gaming world, the use of machine learning algorithms is
increasing day by day. EA Sports, Sony, and Nintendo widely use
data science to enhance the user experience.

o Internet Search:
When we search on the internet, we use different search engines such
as Google, Yahoo, Bing, Ask, etc. All these search engines use data
science technology to make the search experience better, so that you
get a search result within a fraction of a second.

o Automation in Transport Industry:
Transport industries are also using data science technology to create
self-driving cars. With self-driving cars, it will be easier to reduce
the number of road accidents.

o Healthcare:
In the healthcare sector, data science provides many benefits. It is
used for tumor detection, drug discovery, medical image analysis,
virtual medical bots, etc.

o Recommendation Systems:
Most companies, such as Amazon, Netflix, Google Play, etc., use data
science technology to create a better user experience with
personalized recommendations.

o Risk Detection:
Finance industries have always had problems with fraud and the risk of
losses, but with the help of data science these can be reduced. Most
finance companies look for data scientists to avoid risk and losses
while increasing customer satisfaction.

1.4 TOOLS FOR DATA SCIENCE

Following are some of the tools required for data science:

o Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio,
MATLAB, Excel, RapidMiner.

o Data warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.

o Data visualization tools: R, Jupyter, Tableau, Cognos.

o Machine learning tools: Spark, Mahout, Azure ML Studio.

1.5 DATA

Data is a collection of facts, such as numbers, words, measurements,
observations, or just descriptions of things.


1.5.1 Types of Data

Data is classified into:

1) Qualitative Data
2) Quantitative Data

Fig:1 Types of Data

1.5.1.1 Qualitative Data

Qualitative data represent characteristics or attributes. They are
descriptions that can be observed but cannot be computed or calculated.

Examples of qualitative data are:

• Language spoken
• Favourite holiday destination
• Opinion on something (agree, disagree, or neutral)
• Colors


Qualitative data are further classified into two parts:

1. Nominal Data
2. Ordinal Data

1. Nominal Data

Nominal data is used to label variables without any order or quantitative
value. These data have no meaningful order; their values are distributed
into distinct categories.

Examples of Nominal Data:

• Colour of hair (Blonde, Red, Brown, Black, etc.)
• Marital status (Single, Widowed, Married)
• Nationality (Indian, German, American)
• Gender (Male, Female, Others)
• Eye color (Black, Brown, etc.)

2. Ordinal Data

• Ordinal data have a natural ordering: each value has a position on a
scale.
• These data are used for observations such as customer satisfaction,
happiness, etc.
• Ordinal data is qualitative data whose values have a relative position.
It can be considered “in-between” qualitative and quantitative data.
• Ordinal data shows only the sequence of values; arithmetic on it is not
meaningful, which limits the statistical analysis that can be applied.
Compared to nominal data, ordinal data have an order that nominal data
lack.


Examples of Ordinal Data:

• When companies ask for feedback, experience, or satisfaction on a
scale of 1 to 10
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)
• Education Level (Higher, Secondary, Primary)

Difference between Nominal and Ordinal Data

Nominal Data | Ordinal Data
-------------|-------------
Nominal data can’t be quantified, nor do they have any intrinsic ordering | Ordinal data give some kind of sequential order by their position on the scale
Nominal data is qualitative or categorical data | Ordinal data is said to be “in-between” qualitative and quantitative data
They don’t provide any quantitative value, and no arithmetical operation can be performed on them | They provide a sequence, and numbers can be assigned to them, but arithmetical operations cannot be performed
Nominal data cannot be used to compare items with one another | Ordinal data can help to compare one item with another by ranking or ordering
Examples: Eye color, housing style, gender, hair color, religion, marital status, ethnicity, etc. | Examples: Economic status, customer satisfaction, education level, letter grades, etc.
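
The distinction between nominal and ordinal data can be sketched in R with factors. This is a minimal illustration only; the variable names and the category values are made up for the example.

```r
# Nominal data: categories with no inherent order.
eye_color <- factor(c("Brown", "Black", "Brown", "Blue"))

# Ordinal data: categories with a meaningful order (illustrative levels).
satisfaction <- factor(
  c("Low", "High", "Medium", "Low"),
  levels = c("Low", "Medium", "High"),
  ordered = TRUE
)

# Counting categories works for both kinds of data.
print(table(eye_color))

# Order comparisons are only meaningful for ordinal data.
print(satisfaction < "High")
print(max(satisfaction))

# Arithmetic is not defined for either kind: calling mean() on a factor
# returns NA with a warning.
```

An ordered factor supports ranking and comparison, while an unordered factor supports only counting and equality checks, which mirrors the table above.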



1.5.1.2 Quantitative Data

Quantitative data can be measured, not simply observed. They can be
represented numerically, and calculations can be performed on them.

Quantitative data answer questions like “how much,” “how many,” and “how
often.” For example, the price of a phone, a computer’s RAM, and the height
or weight of a person are all quantitative data.

Quantitative data can be used for statistical manipulation.

Examples of Quantitative Data:

• Height or weight of a person or object
• Room temperature
• Scores and marks (Ex: 59, 80, 60, etc.)
• Time

Quantitative data are further classified into two parts:

1. Discrete Data

2. Continuous Data

1. Discrete Data

• Discrete data contain values that are integers or whole numbers.
• These data can’t be broken into decimal or fraction values.
• Discrete data are countable and have finite values; they cannot be
subdivided.
• These data are represented mainly by a bar graph, number line, or
frequency table.

Examples of Discrete Data:

• Total number of students present in a class
• Cost of a cell phone
• Number of employees in a company
• The total number of players who participated in a competition
• Days in a week

2. Continuous Data

• Continuous data can take fractional (decimal) values.
• Continuous data represent information that can be divided into
smaller levels.
• A continuous variable can take any value within a range.

Examples of Continuous Data :

• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price

Difference between Discrete and Continuous Data

Discrete Data | Continuous Data
--------------|----------------
Discrete data are countable and finite; they are whole numbers or integers | Continuous data are measurable; they are in the form of fractions or decimals
Discrete data are represented mainly by bar graphs | Continuous data are represented in the form of a histogram
The values cannot be divided into smaller pieces | The values can be divided into smaller pieces
Discrete data have spaces between the values | Continuous data are in the form of a continuous sequence
Examples: Total students in a class, number of days in a week, size of a shoe, etc. | Examples: Temperature of a room, the weight of a person, length of an object, etc.
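
As a rough sketch of the difference in R, discrete values are usually counted while continuous values are grouped into intervals. The attendance figures, heights, and interval breaks below are invented for illustration.

```r
# Discrete data: whole-number counts (hypothetical class attendance).
students_present <- c(28L, 30L, 28L, 31L, 30L, 28L)

# Continuous data: measurements that can take any value in a range
# (hypothetical heights in centimetres).
heights_cm <- c(158.2, 171.5, 165.0, 180.3, 162.8, 175.1)

# Discrete values are summarised by counting each distinct value,
# which is what a bar graph displays.
attendance_counts <- table(students_present)
print(attendance_counts)

# Continuous values are grouped into intervals first,
# which is what a histogram displays.
height_bins <- cut(heights_cm, breaks = c(150, 160, 170, 180, 190))
print(table(height_bins))

# barplot(attendance_counts) and hist(heights_cm) would draw the charts.
```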

1.6 STRUCTURED DATA


Structured data refers to data that is organized and formatted in a specific
way to make it easily readable and understandable by both humans and
machines. This is typically achieved through the use of a well-defined schema
or data model, which provides a structure for the data.

1.6.1 Characteristics of Structured Data:


• Data conforms to a data model and has an easily identifiable structure
• Data is stored in the form of rows and columns (example: a database)
• Data is well organized, so the definition, format, and meaning of the
data are explicitly known
• Data resides in fixed fields within a record or file
• Similar entities are grouped together to form relations or classes
• Entities in the same group have the same attributes
• Data is easy to access and query, so it can be easily used by other
programs
• Data elements are addressable, so they are efficient to analyse and
process
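
In R, structured data of this kind is typically held in a data frame. The sketch below uses a small table of hypothetical student records to show fixed fields, an explicit schema, and a simple query.

```r
# A small table of hypothetical student records: each row is one entity,
# each column is a fixed field with a known type.
students <- data.frame(
  id    = c(101L, 102L, 103L),
  name  = c("Asha", "Ravi", "Meena"),
  marks = c(59, 80, 60),
  stringsAsFactors = FALSE
)

# The schema (field names and types) is explicit.
str(students)

# Because fields are addressable, querying is straightforward.
high_scorers <- students[students$marks >= 60, c("name", "marks")]
print(high_scorers)
```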

1.6.2 Sources of Structured Data:



• SQL Databases
• Spreadsheets such as Excel
• OLTP Systems
• Online forms
• Sensors such as GPS or RFID tags
• Network and Web server logs
• Medical devices

1.6.3 Advantages:

• Easy to understand and use: Structured data has a well-defined schema or
data model, making it easy to understand and use. This allows for easy
data retrieval, analysis, and reporting.
• Consistency: The well-defined structure of structured data ensures
consistency and accuracy in the data, making it easier to compare and
analyze data across different sources.
• Efficient storage and retrieval: Structured data is typically stored in
relational databases, which are designed to efficiently store and retrieve
large amounts of data. This makes it easy to access and process data
quickly.
• Enhanced data security: Structured data can be more easily secured than
unstructured or semi-structured data, as access to the data can be
controlled through database security protocols.
• Clear data lineage: Structured data typically has a clear lineage or history,
making it easy to track changes and ensure data quality.

1.6.4 Disadvantages:

1. Inflexibility: Structured data can be inflexible in terms of accommodating
new types of data, as any changes to the schema or data model require
significant changes to the database.
2. Limited complexity: Structured data is often limited in terms of the
complexity of relationships between data entities. This can make it
difficult to model complex real-world scenarios.
3. Limited context: Structured data often lacks the additional context and
information that unstructured or semi-structured data can provide,
making it more difficult to understand the meaning and significance of the
data.
4. Expensive: Structured data requires the use of relational databases and
related technologies, which can be expensive to implement and maintain.
5. Data quality: The structured nature of the data can sometimes lead to
missing or incomplete data, or data that does not fit cleanly into the
defined schema, leading to data quality issues.

1.7 UNSTRUCTURED DATA

Unstructured data is data that does not conform to a data model and has no
easily identifiable structure, so it cannot be used easily by a computer
program. It is not organised in a pre-defined manner and does not have a
pre-defined data model, so it is not a good fit for a mainstream relational
database.

1.7.1 Characteristics of Unstructured Data:

• Data neither conforms to a data model nor has any structure
• Data cannot be stored in the form of rows and columns as in databases
• Data does not follow any semantics or rules
• Data lacks any particular format or sequence
• Data has no easily identifiable structure
• Due to the lack of identifiable structure, it cannot be used by computer
programs easily

1.7.2 Sources of Unstructured Data:


• Web pages
• Images (JPEG, GIF, PNG, etc.)
• Videos
• Memos
• Reports
• Word documents and PowerPoint presentations

1.7.3 Advantages of Unstructured Data:

• It supports data that lacks a proper format or sequence
• The data is not constrained by a fixed schema
• It is very flexible due to the absence of a schema
• Data is portable
• It is very scalable
• It can deal easily with heterogeneous sources
• This type of data has a variety of business intelligence and analytics
applications

1.7.4 Disadvantages of Unstructured Data:

• It is difficult to store and manage unstructured data due to the lack of
schema and structure
• Indexing the data is difficult and error-prone because the structure is
unclear and there are no pre-defined attributes. As a result, search
results are not very accurate.
• Ensuring the security of the data is a difficult task

1.8 MACHINE GENERATED DATA


Machine generated data is information that is produced by mechanical or
digital devices without human intervention.

The term is often used to describe data generated by an organization’s
industrial control systems and by mechanical devices that are designed to
carry out a single function.

1.8.1 Types of Machine Data

The most common types of machine data are as follows:

1. Sensor Data

Sensors are typically installed in critical elements of machinery, such as
compressors, conveyors, and pumps, for monitoring and maintenance purposes.
Sensors are also frequently found in devices that require them to function
in the first place, such as smart home security systems and automatic
thermostats.

Sensors work together to continuously monitor, measure, and gather machine
data (e.g., movements, temperatures, pressures, and rotational speeds).
This data can then be reviewed and analysed, allowing insights to be
extracted and action plans to be implemented.

2. Computer or System Log Data

Computers generate log files that include information about the system's
operation. A log file is made up of a series of log lines that show various
system actions, such as saving or deleting a file, connecting to a Wi-Fi
network, installing new software, opening an application, attaching a
Bluetooth device, emptying a recycle bin, and more.

Some types of computer log data are shared with the manufacturers of
computers, operating systems, applications, and programs, while others are
kept locally and confidentially.

3. Geotag Data

Geotagging is the process of adding geographical metadata to a piece of
media based on the location of the device that created it. Geotags, which
can include timestamps and other contextual information, can be generated
automatically for photos, videos, text messages, and other types of media.

This form of machine data includes details such as latitude and longitude
coordinates, altitude, bearing, and more.


4. Call Log Data

The machine data connected with telephone calls is referred to as a call log
or call detail record. Call logging is the automated process of gathering,
recording, and evaluating data about phone calls.

The logs record the call duration, the start and finish times of the call,
the locations of the caller and recipient, and the network used.

5. Web Log Data

A web log is an automatic record of a user's online activity, as opposed to
computer log data, which records actions that occur during the functioning
of a system.

Clickstream data, IP addresses, timestamps, access requests, bytes
transferred, referral URLs, downloads, and submissions are all examples of
this sort of machine data.
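
To make the fields of a web log concrete, the sketch below splits one log line into its parts with base R. The log line itself is invented, and its layout (similar to the Common Log Format) is an assumption; real servers may write their logs differently.

```r
# A hypothetical web server log line (Common Log Format style).
log_line <- '192.168.1.10 - - [12/Mar/2024:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 5120'

# Capture the IP address, timestamp, request, status code and bytes sent.
pattern <- '^([^ ]+) [^ ]+ [^ ]+ \\[([^]]+)\\] "([^"]*)" ([0-9]{3}) ([0-9]+)$'
fields <- regmatches(log_line, regexec(pattern, log_line))[[1]]

# fields[1] holds the full match; the captured groups follow it.
entry <- list(
  ip        = fields[2],
  timestamp = fields[3],
  request   = fields[4],
  status    = as.integer(fields[5]),
  bytes     = as.integer(fields[6])
)
str(entry)
```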

6. Application Log Data

An application log is a file that keeps track of the activities that occur
within a software application. Although human users initiate the actions,
the machine data referred to here is generated automatically rather than
being entered manually.

This data includes the application used, timestamps, problems, downtimes,
access requests, user IDs, sizes of files uploaded or downloaded, and more.
These records can be used to assess errors and prevent their recurrence, as
well as to follow the activity of various users.


MODULE 2

2.1 Data Science Process

2.2 Introduction to R

2.3 Applications of R

2.4 R Installation
2.5 Packages in R
2.6 Working With R Environment



2.1 DATA SCIENCE PROCESS

• The data science process is a systematic process used by data scientists
to analyze, visualize, and model large amounts of data.
• It helps data scientists use tools to find unseen patterns, extract
data, and convert information into actionable insights that are
meaningful to the company.
• This helps companies and businesses make decisions that can improve
customer retention and profits.
• Further, a data science process helps in discovering hidden patterns in
structured and unstructured raw data.
• The process helps in turning a problem into a solution by treating the
business problem as a project.

2.1.1 Steps in Data Science Process


1. Frame the problem
2. Collect the raw data needed for your problem
3. Process the data for analysis
4. Explore the data
5. Perform in-depth analysis
6. Communicate results of the analysis

Step 1: Framing the Problem

Before solving a problem, the pragmatic thing to do is to know exactly what
the problem is. Data questions must first be translated into actionable
business questions. A good way to work through this step is to ask
questions like:
• Who are the customers?
• How can they be identified?
• What is the sales process right now?
• Why are they interested in your products?
• What products are they interested in?


Step 2: Collecting the Raw Data for the Problem

After defining the problem, collect the data required to derive insights and
turn the business problem into a probable solution. For example, many
companies store their sales data in customer relationship management (CRM)
systems. The CRM data can be easily analyzed by exporting it to more
advanced tools using data pipelines.

Step 3: Processing the Data to Analyze

After the first and second steps, the data needs to be processed before it
is analyzed. Data can be messy if it has not been appropriately maintained,
leading to errors that easily corrupt the analysis. These issues include
values set to null when they should be zero (or the exact opposite), missing
values, duplicate values, and many more.

The most common errors that you can encounter and should look out for are:
1. Missing values
2. Corrupted values like invalid entries
3. Time zone differences
4. Date range errors like a recorded sale before the sales even started
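
A few of these checks can be sketched in base R. The sales table, the launch date, and the rule that a negative amount counts as invalid are all assumptions made for this example; time zone handling is left out to keep it short.

```r
# Hypothetical raw sales records with typical problems.
sales <- data.frame(
  order_id  = c(1, 2, 2, 3, 4),
  sale_date = as.Date(c("2024-01-05", "2024-01-07", "2024-01-07",
                        "2023-12-20", "2024-01-09")),
  amount    = c(250, NA, NA, 120, -40)
)
launch_date <- as.Date("2024-01-01")  # assumed start of sales

# Missing values: count them per column.
print(colSums(is.na(sales)))

# Duplicate rows: keep the first copy of each.
sales <- sales[!duplicated(sales), ]

# Corrupted values: here, a negative amount is treated as invalid.
invalid_amount <- !is.na(sales$amount) & sales$amount < 0

# Date range errors: a sale recorded before sales started.
too_early <- sales$sale_date < launch_date

# Keep only rows that pass every check.
clean <- sales[!invalid_amount & !too_early & !is.na(sales$amount), ]
print(clean)
```

Whether bad rows should be dropped, corrected, or flagged depends on the project; dropping them here is only the simplest option to show.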

Step 4: Exploring the Data

In this step, develop ideas that can help identify hidden patterns and
insights. Look for interesting patterns in the data, such as why sales of a
particular product or service have gone up or down.

Step 5: Performing In-depth Analysis

In this step, data science tools are used to crunch the data and discover
the insights it holds. This helps to prepare a predictive model that can
help with marketing, most of which is done on social media nowadays.
Finally, combine the quantitative and qualitative findings and move them
into action.

Step 6: Communicating Results of the Analysis

After all these steps, it is vital to convey the insights and findings to
the sales head and make them understand their importance. Communicating
appropriately helps to solve the problem you have been given. Proper
communication will lead to action.



2.2 INTRODUCTION TO R

• R is a popular programming language used for statistical computing
and graphical presentation.
• R is used for data analysis and visualization.
• It is easy to draw graphs in R, such as pie charts, histograms, box
plots, and scatter plots.
• It works on different platforms (Windows, Mac, Linux).
• It is open-source and free.
• It has many packages (libraries of functions) that can be used to solve
different problems.

2.2.1 FEATURES OF R PROGRAMMING LANGUAGE


Statistical Features of R:
• Basic statistics: The most common basic statistics are the mean, mode,
and median. These are known as “Measures of Central Tendency,” and R
makes them easy to compute.
• Static graphics: R is rich with facilities for creating and developing
interesting static graphics. R contains functionality for many plot types
including graphic maps, mosaic plots, biplots, and the list goes on.
• Probability distributions: Probability distributions play a vital role in
statistics, and R can easily handle various types of probability
distribution such as the Binomial, Normal, and Chi-squared distributions
and many more.
• Data analysis: R provides a large, coherent, and integrated collection of
tools for data analysis.
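
The statistical features above can be tried with a few base R calls. The marks vector is sample data, and stat_mode is an illustrative helper written for this sketch, since base R has no built-in function for the statistical mode.

```r
marks <- c(59, 80, 60, 80, 72)

# Measures of central tendency.
print(mean(marks))     # arithmetic mean
print(median(marks))   # middle value

# Helper (illustrative name) that returns the most frequent value.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}
print(stat_mode(marks))

# Probability distributions follow a d/p/q/r naming pattern.
print(dbinom(3, size = 10, prob = 0.5))   # P(X = 3) for Binomial(10, 0.5)
print(pnorm(1.96))                        # P(Z <= 1.96), standard normal
print(qchisq(0.95, df = 2))               # 95th percentile, Chi-squared(2)
```

The prefixes d, p, q, and r give the density, cumulative probability, quantile, and random draws for each distribution, so the same pattern applies to the other distributions R supports.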

Programming Features of R:
• R packages: One of the major features of R is the wide availability of
libraries. R has CRAN (Comprehensive R Archive Network), a repository
holding more than 10,000 packages.
• Distributed computing: Distributed computing is a model in which
components of a software system are shared among multiple computers
to improve efficiency and performance. Two packages for distributed
programming in R, ddR and multidplyr, were released in November 2015.
Advantages of R:
• R is one of the most comprehensive statistical analysis packages, and new
technology and concepts often appear first in R.
• R is open source, so you can run it anywhere and at any time.
• R is suitable for GNU/Linux and Windows operating systems.
• R is cross-platform and runs on many operating systems.
• In R, everyone is welcome to contribute new packages, bug fixes, and code
enhancements.
Disadvantages of R:
• The quality of some R packages is less than perfect.
• R commands give little thought to memory management, so R may consume
all available memory.
• In R, there is basically nobody to complain to if something doesn’t work.
• R can be much slower than other programming languages such as Python
and MATLAB.

2.3 APPLICATIONS OF R:
• Tech giants like Google, Facebook, Bing, Twitter, Accenture, Wipro,
and many more use R nowadays.
• The Consumer Financial Protection Bureau uses R for data analysis.
• Bank of America uses R for reporting.
• R is part of the technology stack behind Foursquare’s famed
recommendation engine.
• ANZ, the fourth largest bank in Australia, uses R for credit risk
analysis.
• Google uses R to predict economic activity.

2.4 R INSTALLATION

To download and install R, follow the steps below:

Install R on Windows OS:

1. Go to the www.cran.r-project.org website.

2. Click on “Download R for Windows”.

3. Click on the "install R for the first time" link to download the R
executable (.exe) file. Run the R executable file to start the
installation, and allow the app to make changes to your device.

4. Select the installation language.

5. Follow the installation instructions.

6. Click on "Finish" to exit the installation setup.

R has now been successfully installed on your Windows OS. Open the R GUI
to start writing R code.



2.4.1 Installing RStudio Desktop:

To install RStudio Desktop on your computer, do the following:

1. Go to the www.posit.com website.

2. Click on "DOWNLOAD" in the top-right corner.

3. Click on "DOWNLOAD" under the "RStudio Open Source License".

4. Download the RStudio Desktop version recommended for your computer.

5. Run the RStudio executable file (.exe) for Windows OS.

6. Follow the installation instructions to complete the RStudio Desktop
installation.

RStudio is now successfully installed on your computer. The RStudio
Desktop IDE interface is shown in the figure below:



2.5 PACKAGES IN R

Packages in the R programming language are collections of R functions,
compiled code, and sample data. They are stored under a directory called
“library” within the R environment. By default, R installs a group of
packages during installation. When the R console starts, only the default
packages are available. Other packages that are already installed need to
be loaded explicitly before an R program can use them.

Repositories:
A repository is a place where packages are stored and from which they can
be installed. Organizations and developers may have a local repository, but
repositories are typically online and accessible to everyone. Some of the
most popular repositories for R packages are:

• CRAN: The Comprehensive R Archive Network (CRAN) is the official
repository. It is a network of FTP and web servers maintained by the R
community around the world. The R community coordinates it, and for a
package to be published on CRAN, it needs to pass several tests that
ensure the package follows CRAN policies.
• Bioconductor: Bioconductor is a topic-specific repository, intended for
open source software for bioinformatics. Similar to CRAN, it has its own
submission and review processes, and its community is very active,
holding several conferences and meetings per year in order to maintain
quality.
• GitHub: GitHub is the most popular repository for open source projects.
Its popularity comes from the unlimited space for open source, the
integration with git (version control software), and the ease of sharing
and collaborating with others.

Installing R Packages
There are multiple ways to install an R package. Some of them are:

• Installing packages from CRAN: To install a package from CRAN, we
need the name of the package and the following command:
install.packages("package name")

• Installing a package from CRAN is the most common and easiest way, as
it takes only one command. To install more than one
package at a time, pass their names as a character vector in the
first argument of the install.packages() function:

Example:
install.packages(c("ggplot2", "stringr"))

Update, Remove and Check Installed Packages In R


To check what packages are installed on the computer, type this command:
installed.packages()

To update all the packages, type this command:


update.packages()



To update a specific package, type this command:
install.packages("PACKAGE NAME")
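The heading above also mentions checking and removing packages. The following is a minimal sketch of both, assuming a hypothetical helper named is_installed and using the base package "stats" as the example; remove.packages() is the base R function for uninstalling and is shown commented out so that nothing is removed when the sketch is run.

```r
# Check whether a package is installed by looking it up in the
# row names of the matrix returned by installed.packages().
is_installed <- function(pkg) {
  pkg %in% rownames(installed.packages())
}

is_installed("stats")   # TRUE: stats ships with every R installation

# To remove a package that is no longer needed (not run here):
# remove.packages("PACKAGE NAME")
```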

Installing Packages Using RStudio UI


In RStudio, go to Tools -> Install Packages. A pop-up window opens in which
to type the name of the package to install:

Under Packages, type the name of the package to install and
then click the Install button.

Loading a Package

Once a package has been installed, it can be loaded
using the library() command.

Syntax:
library("package_name")

Example:
library("ggplot2")
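A common pattern is to install a package only when it is missing and then load it. The sketch below is one way to do this, not the only one; it uses the base package "tools" as a stand-in so that it runs without a network connection, and the variable name pkg is an assumption for illustration.

```r
# Load a package only if it is installed; "tools" ships with base R,
# so this sketch runs without a network connection.
pkg <- "tools"

if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)   # only reached if the package is missing
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```

Replacing "tools" with a CRAN package name such as "ggplot2" gives the same behaviour, except that the install step may then actually run.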



2.6 WORKING WITH R ENVIRONMENT
• The most common form of interaction with R is through the command
line in the console.
• The user types a command in the console, for example:
print('Testing')
• After the Enter key is pressed, the R interpreter executes the command and
returns the answer to the user.

Output:
[1] "Testing"


• It is also possible to store a sequence of commands in a file with a .R
extension and then ask R to execute all the commands in that file, for
example with the source() function.
• We may also use the console as a simple calculator.
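Running commands from a file can be sketched as follows. The example writes two commands to a temporary .R file and executes them with the base function source(); the temporary file and the variable name total are assumptions made only for illustration.

```r
# Write two commands to a temporary .R file (the file name is generated by R)
script <- tempfile(fileext = ".R")
writeLines(c("total <- 1 + 2", "print(total)"), script)

# Execute every command in the file, in order
source(script)
```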
Using R as a Calculator
• Users type expressions to the R interpreter.
• R responds by computing and printing the answers.
> 1+2
[1] 3
> 1/2
[1] 0.5
> 11^2
[1] 121


MODULE 3

3.1 Variables in R Programming
3.2 Data Types
3.3 Control Statements in R
3.4 R Switch Statement
3.5 For Loop
3.6 Repeat Loop
3.7 While Loop
3.8 R – Strings
3.9 Date & Time


3.1 VARIABLES IN R PROGRAMMING

• Variables are used to store the information to be manipulated and
referenced in an R program. An R variable can store an atomic
vector, a group of atomic vectors, or a combination of many R objects.
• R is dynamically typed, which means the type of a value is checked when
the statement is run.
• A valid variable name consists of letters, numbers, dots and underscore
characters.
• A variable name should start with a letter, or with a dot that is not
followed by a number.

Name of variable      Validity   Reason for valid and invalid
_var_name             Invalid    A variable name can't start with an underscore (_).
var_name, var.name    Valid      A variable name can contain letters, dots and underscores. It can also start with a dot, but the dot should not be followed by a number; in that case the variable name is invalid.
var_name%             Invalid    In R, we can't use any special character in the variable name except dot and underscore.
2var_name             Invalid    A variable name can't start with a numeric digit.
.2var_name            Invalid    A variable name cannot start with a dot which is followed by a digit.
var_name2             Valid      The variable name contains letters, a number and an underscore, and starts with a letter.
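The naming rules in the table can be checked programmatically. The sketch below relies on the base function make.names(), which turns any string into a syntactically valid name; a string that comes back unchanged was already valid. The helper name is_valid_name is hypothetical.

```r
# make.names() converts any string into a syntactically valid name,
# so a string that comes back unchanged was already valid.
is_valid_name <- function(x) {
  make.names(x) == x
}

candidates <- c("_var_name", "var_name", "var.name", "var_name%",
                "2var_name", ".2var_name", "var_name2")
result <- is_valid_name(candidates)
print(data.frame(name = candidates, valid = result))
```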

Assignment of variable

In R programming, three operators can be used to assign
values to a variable: the leftward (<-), rightward (->), and equal-to (=)
operators.

Two functions are used to print the value of a variable:
print() and cat(). The cat() function combines multiple values into a
continuous print output.

# Assignment using the equal operator.
variable.1 = 124
# Assignment using the leftward operator.
variable.2 <- "Learn R Programming"
# Assignment using the rightward operator.
133L -> variable.3
print(variable.1)
cat("variable.1 is", variable.1, "\n")
cat("variable.2 is", variable.2, "\n")
cat("variable.3 is", variable.3, "\n")

3.2 DATA TYPES

Variables can store data of different types, and different types can do
different things.

In R, variables do not need to be declared with any particular type, and can
even change type after they have been set:

my_var <- 50 # my_var is type of numeric

my_var <- "HELLO" # my_var is now of type character (aka string)

my_var # print my_var



3.2.1 Basic Data Types

Basic data types in R can be divided into the following types:

• numeric - (10.5, 55, 787)


• integer - (1L, 55L, 100L, where the letter "L" declares this as an
integer)
• complex - (9 + 3i, where "i" is the imaginary part)
• character (a.k.a. string) - ("k", "R is exciting", "FALSE", "11.5")
• logical (a.k.a. boolean) - (TRUE or FALSE)

The class() function is used to check the data type of a variable:

Example

# numeric
x <- 10.5
class(x)

# integer
x <- 1000L
class(x)

# complex
x <- 9i + 3
class(x)

# character/string
x <- "R is exciting"
class(x)

# logical/boolean
x <- TRUE
class(x)



Output

[1] "numeric"
[1] "integer"
[1] "complex"
[1] "character"
[1] "logical"

3.3 CONTROL STATEMENTS IN R

Control statements are expressions used to control the execution and flow
of the program based on the conditions provided in the statements.

R if Statement

The if statement consists of a Boolean expression followed by one or more
statements. It is the simplest decision-making statement,
which helps us to take a decision on the basis of a condition.

The if statement is a conditional programming statement that performs an
action, such as displaying information, only if its condition proves true.

The block of code inside the if statement is executed only when the
Boolean expression evaluates to TRUE. If the expression evaluates to FALSE, the
block is skipped and the code that follows the if statement runs.

The syntax of if statement in R is as follows:

if(boolean_expression) {
   # If the boolean expression is true, then the statement(s) will be executed.
}


Flow Chart

Let us see an example to understand how the if statement works and performs a
certain task in R.

Example 1

x <- 24L
y <- "shubham"
if(is.integer(x))
{
   print("x is an Integer")
}

Output:
[1] "x is an Integer"

If-else statement

It is similar to the if statement, but when the test expression of the if
condition fails, the statements in the else block are executed.

The basic syntax of the if-else statement is as follows:

if(boolean_expression) {
   # statement(s) will be executed if the boolean expression is true.
} else {
   # statement(s) will be executed if the boolean expression is false.
}

Flow Chart

Example 1
# local variable definition
a <- 100
# checking the boolean condition
if(a < 20){
   # if the condition is true then print the following
   cat("a is less than 20\n")
}else{
   # if the condition is false then print the following
   cat("a is not less than 20\n")
}
cat("The value of a is", a)

Output:
a is not less than 20
The value of a is 100


R if else if statement

This statement is also known as the nested if-else statement. The if statement is
followed by an optional else if ... else statement. It is used to test
several conditions in a single if ... else if statement. There are some key points
to keep in mind when using the if ... else if ... else
statement:

1. An if statement can have either zero or one else statement, and it must
come after any else if statements.

2. An if statement can have many else if statements, and they come before
the else statement.

3. Once an else if condition succeeds, none of the remaining else
if or else branches will be tested.

The basic syntax of the if-else if-else statement is as follows:

if(boolean_expression_1) {
   # This block executes when boolean expression 1 is true.
} else if(boolean_expression_2) {
   # This block executes when boolean expression 2 is true.
} else if(boolean_expression_3) {
   # This block executes when boolean expression 3 is true.
} else {
   # This block executes when none of the above conditions is true.
}


Flow Chart

Example 1
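age <- readline(prompt="Enter age: ")
age <- as.integer(age)
# Braces keep each else on the same line as the closing brace,
# which R requires when the code is run at the top level.
if(age < 18) {
   print("You are a child")
} else if(age > 30) {
   print("You are an old guy")
} else {
   print("You are an adult")
}

Output (when 25 is entered):
[1] "You are an adult"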
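Because readline() needs keyboard input, the same branching logic is easier to try out when it is wrapped in a function. The sketch below assumes a hypothetical helper named classify_age that returns the label instead of printing it, using the same thresholds as the example above.

```r
# Hypothetical helper: returns the label instead of printing it,
# so the branches can be checked without keyboard input.
classify_age <- function(age) {
  if (age < 18) {
    "child"
  } else if (age > 30) {
    "old"
  } else {
    "adult"
  }
}

print(classify_age(12))
print(classify_age(25))
print(classify_age(45))
```

Only the first condition that succeeds is used, so an age of 12 never reaches the second test.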



3.4 R SWITCH STATEMENT

A switch statement is a selection control mechanism that allows the value of
an expression to change the control flow of program execution via map and
search.

The switch statement is used in place of long if statements that compare a
variable with several values. It is a multi-way branch statement
which provides an easy way to dispatch execution to different parts of the code,
based on the value of the expression.

This statement allows a variable to be tested for equality against a list of
values. A switch statement is a little bit complicated. To understand it, keep
the following key points in mind:

o If the expression is a character string, the string is matched to the
names of the listed cases.

o If there is more than one match, the first matching element is used.

o There is no separate default keyword.

o If no case is matched, an unnamed case (if one is present) is used as the
default.

There are basically two ways in which one of the cases is selected:

1) Based on Index

If the cases are plain values, like the elements of a character vector, and the
expression evaluates to a number, then the expression's result is used as an
index to select the case.

2) Based on Matching Value

When the cases have both a case name and an output value, like
"case_1" = "value1", then the expression value is matched against the case
names. If there is a match with a case, the corresponding value is the
output.

The basic syntax of the switch statement is as follows:

switch(expression, case1, case2, case3....)

Flow Chart

Example 1
x <- switch(3, "Shubham", "Nishka", "Gunjan", "Sumit")
print(x)

Output:
[1] "Gunjan"
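The example above selects a case by index. Selection by matching value, including the unnamed fallback case described in the key points, can be sketched as follows. The function name describe_ext and the file extensions are assumptions chosen only for illustration.

```r
# Hypothetical helper that maps a file extension to a description.
describe_ext <- function(ext) {
  switch(ext,
         "csv" = "comma-separated values",
         "txt" = "plain text",
         "unknown format")   # unnamed last case acts as the default
}

print(describe_ext("csv"))
print(describe_ext("pdf"))
```

Here "pdf" matches none of the named cases, so the unnamed last value is returned.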



3.5 FOR LOOP

In R, a for loop is a way to repeat a sequence of instructions under certain
conditions. It allows us to automate parts of our code which need repetition.
In simple words, a for loop is a repetition control structure. It allows us to
efficiently write a loop that needs to execute a certain number of times.

SYNTAX:

for (value in vector)
{
   statements
}

Flowchart



Example 1:

# Create fruit vector
fruit <- c('Apple', 'Orange', 'Guava', 'Pineapple', 'Banana', 'Grapes')
# Create the for statement
for (i in fruit){
   print(i)
}

Output
[1] "Apple"
[1] "Orange"
[1] "Guava"
[1] "Pineapple"
[1] "Banana"
[1] "Grapes"
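When the position of each element is needed as well as its value, a for loop can run over index numbers instead. The following sketch uses the base function seq_along(); the shortened fruit vector and the variable name labels are assumptions for illustration.

```r
fruit <- c("Apple", "Orange", "Guava")

# seq_along(fruit) yields the index positions 1, 2, 3
labels <- character(length(fruit))
for (i in seq_along(fruit)) {
  labels[i] <- paste(i, fruit[i])
}
print(labels)
```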

3.6 REPEAT LOOP

A repeat loop is used to iterate a block of code. It is a special type of loop in
which there is no built-in condition to exit from the loop. To exit, we include a
break statement with a user-defined condition. This property
makes it different from the other loops.

A repeat loop is constructed with the repeat keyword in R. Without a break
statement it runs forever, so it is very easy to construct an infinite loop in R.

The basic syntax of the repeat loop is as follows:

repeat {
   commands
   if(condition) {
      break
   }
}


Flowchart

1. First, the variables are initialized, and then control enters the repeat loop.

2. The loop executes the group of statements inside it.

3. After that, an expression inside the loop is used as the exit condition.

4. The condition is checked. If it is true, a break statement is executed to exit
from the loop.

5. If the condition is false, the statements inside the repeat loop are
executed again.

Example 1:

v <- c("Hello","repeat","loop")
cnt <- 2
repeat {
   print(v)
   cnt <- cnt+1
   if(cnt > 5) {
      break
   }
}

Output
[1] "Hello"  "repeat" "loop"
[1] "Hello"  "repeat" "loop"
[1] "Hello"  "repeat" "loop"
[1] "Hello"  "repeat" "loop"

3.7 WHILE LOOP

A while loop is a control flow statement which is used to iterate a
block of code several times. The while loop terminates when the
value of the Boolean expression becomes false.

In a while loop, the condition is checked first, and only then is the body of
the loop executed. As a result, for a body that runs n times the condition is
checked n+1 times, rather than n times.

The basic syntax of the while loop is as follows:

while (test_expression) {
   statement
}


Flowchart

Example 1:

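v <- c("Hello","while loop","example")
cnt <- 2
while (cnt < 7) {
   print(v)
   cnt <- cnt + 1
}

Output
[1] "Hello"      "while loop" "example"
[1] "Hello"      "while loop" "example"
[1] "Hello"      "while loop" "example"
[1] "Hello"      "while loop" "example"
[1] "Hello"      "while loop" "example"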
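The statement that the condition is checked n+1 times can be verified by counting. The sketch below uses the same loop bounds as the example; the counters checks and iterations are assumptions added for illustration, and the condition is wrapped in braces only so that each evaluation can be counted.

```r
checks <- 0       # how many times the condition is evaluated
iterations <- 0   # how many times the body runs
cnt <- 2

# The braces let us count each evaluation before the test itself runs
while ({checks <- checks + 1; cnt < 7}) {
  iterations <- iterations + 1
  cnt <- cnt + 1
}

cat("Body executed:", iterations, "times\n")
cat("Condition checked:", checks, "times\n")
```

The body runs for cnt = 2, 3, 4, 5, 6, and the final check with cnt = 7 is the one that ends the loop.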

3.8 R – STRINGS
• A string is a sequence of characters; it can be thought of as a
one-dimensional array of characters.

• One or more characters enclosed in a pair of matching single or
double quotes can be considered a string in R.

• Strings represent textual content and can contain numbers, spaces,
and special characters.

• An empty string is represented by "".

• Strings are always stored as double-quoted values in R. A double-quoted
string can contain single quotes within it. A single-quoted string can't
contain unescaped single quotes. Similarly, a double-quoted string can't
contain unescaped double quotes.

Creation of String

Strings can be created by assigning character values to a variable. These
strings can be further concatenated by using various functions and methods
to form a bigger string.


Example:
# creating a string with double quotes
str1 <- "OK1"
cat("String 1 is:", str1, "\n")
# creating a string with single quotes
str2 <- 'OK2'
cat("String 2 is:", str2, "\n")
# single quotes inside a double-quoted string are allowed
str3 <- "This is 'acceptable and 'allowed' in R"
cat("String 3 is:", str3, "\n")
# double quotes inside a single-quoted string are allowed
str4 <- 'Hi, Wondering "if this "works"'
cat("String 4 is:", str4, "\n")
# an unescaped single quote inside a single-quoted string is an error
str5 <- 'hi, ' this is not allowed'
cat("String 5 is:", str5, "\n")

Output:
String 1 is: OK1
String 2 is: OK2
String 3 is: This is 'acceptable and 'allowed' in R
String 4 is: Hi, Wondering "if this "works"
Error: unexpected symbol in "str5 <- 'hi, ' this"
Execution halted
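The text above mentions that strings can be concatenated to form a bigger string. A minimal sketch with the base functions paste() and paste0() follows; the variable names are assumptions for illustration.

```r
first <- "Learn"
second <- "R"

# paste() joins its arguments with a space by default
s1 <- paste(first, second)
# sep changes the separator; paste0() uses no separator at all
s2 <- paste(first, second, sep = "-")
s3 <- paste0(first, second)

print(c(s1, s2, s3))
```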

Length of String

The length of a string is the number of characters present in it.
The str_length() function from the 'stringr' package
or the built-in nchar() function of R can be used to determine the length of
strings in R.
Example 1: Using the str_length() function

# Importing package
library(stringr)

# Calculating length of string
str_length("hello")

Output:
[1] 5

Example 2: Using nchar() function

# R program to find length of string
# Using nchar() function
nchar("hel'lo")

Output:
[1] 6

Accessing portions of a string

The individual characters of a string can be extracted by using
indexing functions. R has two built-in functions
for accessing both single characters and substrings of a string.

The substr() or substring() function in R extracts a substring out of a string,
beginning with the start index and ending with the end index. Used on the
left-hand side of an assignment, it also replaces
the specified substring with a new set of characters.
Syntax:
substr(x, start, stop)

or

substring(text, first, last)

Example 1: Using substr() function

# R program to access
# characters in a string
# Accessing characters
# using substr() function
substr("Learn Code Tech", 1, 1)

Output:
[1] "L"

If the starting index is equal to the ending index, the corresponding character
of the string is accessed. In this case, the first character, ‘L’ is printed.

Example 2: Using substring() function

# R program to access characters in string
str <- "Learn Code"
# counts the characters in the string
len <- nchar(str)
# Accessing character using
# substring() function
print (substring(str, len, len))
# Accessing elements out of index
print (substring(str, len+1, len+1))

Output:
[1] "e"
[1] ""

The number of characters in the string is 10. The first print statement prints
the last character of the string, "e", which is the 10th character. The second
print statement asks for the 11th character of the string, which doesn't exist,
but the code doesn't throw an error and prints "", that is, an empty string.

The following R code shows the mechanism of string slicing, wherein
substrings of a string are extracted:

# R program to access characters in string
str <- "Learn Code"
# counts the number of characters of str = 10
len <- nchar(str)
print(substr(str, 1, 4))
print(substr(str, len-2, len))

Output:
[1] "Lear"
[1] "ode"

The first print statement prints the first four characters of the string. The
second print statement prints the substring from the indexes 8 to 10, which
is “ode”.
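The section also states that substr() can replace part of a string. This works by placing substr() on the left-hand side of an assignment. A minimal sketch, assuming the same "Learn Code" string and a replacement of equal length:

```r
str <- "Learn Code"

# Assigning to substr() overwrites characters 1 to 5 in place;
# the replacement should have the same number of characters.
substr(str, 1, 5) <- "Write"
print(str)
```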

3.9 DATE & TIME

• The R programming language provides several functions that deal with
date and time.

• These functions are used to format and convert the date from one
form to another form.

• R provides a format() function that accepts date objects and a
format parameter that allows us to specify the format of the date
needed.

• R provides various format specifiers, which are listed in the
table below:

Specifier   Description
%a          Abbreviated weekday
%A          Full weekday
%b          Abbreviated month
%B          Full month
%C          Century
%y          Year without century
%Y          Year with century
%d          Day of month (01-31)
%j          Day in year (001-366)
%m          Month of year (01-12)
%D          Date in %m/%d/%y format
%u          Weekday (1-7), starting on Monday

Note: To get today's date, R provides a function called Sys.Date(), which
returns the current date.

Weekday:

Here we look at the %a, %A, and %u specifiers, which give the abbreviated
weekday, full weekday, and numbered weekday starting from Monday.

Example:

# today's date
date <- Sys.Date()
# abbreviated weekday
format(date, format="%a")
# full weekday
format(date, format="%A")
# weekday number
format(date, format="%u")

Output
[1] "Sat"
[1] "Saturday"
[1] "6"

Date:

Let’s look into the day, month, and year format specifiers to represent dates
in different formats.

Example:

# today date

date<-Sys.Date()

# default format yyyy-mm-dd

date

# day in month

format(date,format="%d")

# month in year

format(date,format="%m")



# abbreviated month

format(date,format="%b")

# full month

format(date,format="%B")

# Date

format(date,format="%D")

format(date,format="%d-%b-%y")

Output
[1] "2022-04-02"

[1] "02"

[1] "04"

[1] "Apr"

[1] "April"

[1] "04/02/22"

[1] "02-Apr-22"

Year:

The year can be formatted in different forms. %y, %Y, and %C are the format
specifiers that return the year without century, the year with century, and the
century of the given date respectively.

Example:



# today date

date<-Sys.Date()

# year without century

format(date,format="%y")

# year with century

format(date,format="%Y")

# century

format(date,format="%C")
Output
[1] "22"

[1] "2022"

[1] "20"
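The examples above all start from Sys.Date(), so their output changes from day to day. Converting "from one form to another" can also be sketched with a fixed date: the base function as.Date() parses a string using the same specifiers, and format() writes it back out. The input string below is an assumption chosen to match the date shown in the outputs above, and only numeric specifiers are used so that the result does not depend on the system language.

```r
# Parse a character string into a Date object; the format string
# describes how the input is laid out (day/month/year here).
d <- as.Date("02/04/2022", format = "%d/%m/%Y")

iso <- format(d, "%Y-%m-%d")     # year-month-day form
day_of_year <- format(d, "%j")   # day in year
print(c(iso, day_of_year))
```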



Module 4

4.1 Data Structures in R Programming


4.1.1 R Vectors
4.1.2 R List
4.1.3 R Data Frame
4.1.4 R Arrays

4.1.5 R Factors

4.2 R CSV Files



4.1 DATA STRUCTURES IN R PROGRAMMING
A data structure is a particular way of organizing data in a computer so that
it can be used effectively. The idea is to reduce the space and time
complexities of different tasks. Data structures in R programming are tools
for holding multiple values.

R's base data structures are often organized by their dimensionality (1D,
2D, or nD) and whether they're homogeneous (all elements must be of the
identical type) or heterogeneous (the elements can be of various types).
This gives rise to the six data structures which are most frequently utilized in
data analysis.

The most essential data structures used in R include:

1. Vector

2. List

3. Array

4. Matrices

5. Data Frame



6. Factors

4.1.1 R VECTORS

A vector is the basic data structure in R that stores data of the same type.

For example, to record the ages of 5 employees, instead of creating 5 separate
variables we can simply create a vector.

Create a Vector in R

In R, use the c() function to create a vector.

For example,
# create vector of string types
employees <- c("Max", "James", "Stacy")
print(employees)

Output
[1] "Max"   "James" "Stacy"

In the above example, we have created a vector named employees with the
elements Max, James, and Stacy.
Here, the c() function creates a vector by combining the three
elements together.

Access Vector Elements in R

In R, each element in a vector is associated with a number. The number is
known as a vector index. The elements of a vector are accessed using the index
number (1, 2, 3 …). For example,

# a vector of string type
languages <- c("Swift", "Java", "R")
# access first element of languages
print(languages[1]) # "Swift"
# access third element of languages
print(languages[3]) # "R"

In the above example, we have created a vector named languages. Each element
of the vector is associated with an integer number.

Vector Indexing in R
Here, we have used the vector index to access the vector elements:

• languages[1] - accesses the first element "Swift"
• languages[3] - accesses the third element "R"

Note: In R, the vector index always starts with 1. Hence, the first element of a
vector is present at index 1, the second element at index 2, and so on.


Modify Vector Element

To change a vector element, simply assign a new value to the specific
index. For example,

dailyActivities <- c("Eat","Repeat")
cat("Initial Vector:", dailyActivities)
# change element at index 2
dailyActivities[2] <- "Sleep"
cat("\nUpdated Vector:", dailyActivities)

Output
Initial Vector: Eat Repeat
Updated Vector: Eat Sleep

Here, we have changed the vector element at index 2 from "Repeat" to "Sleep"
by simply assigning a new value.

Numeric Vector in R

We can use the c() function to create a numeric vector. For example,

# a vector with number sequence from 1 to 5
num <- c(1, 2, 3, 4, 5)
print(num)

Output
[1] 1 2 3 4 5

Here, we have used the c() function to create a vector of a numeric sequence
called num.

However, there is a more efficient way to create a numeric sequence:
the : operator can be used instead of c().

Create a Sequence of Number in R

In R, we use the : operator to create a vector with numerical values in
sequence. For example,
# a vector with number sequence from 1 to 10
num <- 1:10
print(num)

Output
 [1]  1  2  3  4  5  6  7  8  9 10

Here, we have used the : operator to create the vector named num with
numerical values in sequence i.e. 1 to 10.

Repeat Vectors in R

In R, we use the rep() function to repeat the elements of a vector.

Syntax:
rep(numbers, times=2)
Here,

• numbers - the vector whose elements are to be repeated
• times = 2 - repeat the vector two times

Let's see an example

# repeat sequence of vector 2 times
numbers <- rep(c(1,5,10), times = 2)
cat("Using times argument:", numbers)

Output
Using times argument: 1 5 10 1 5 10

In the above example, we have created a numeric vector with the elements 1, 5, 10
and repeated the whole vector two times with times = 2.
The each argument works differently: each = 2 repeats each element of the
vector two times.
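The difference between the times and each arguments of rep() is easiest to see side by side. A minimal sketch, with the variable names by_times and by_each assumed for illustration:

```r
by_times <- rep(c(1, 5, 10), times = 2)  # whole vector repeated
by_each  <- rep(c(1, 5, 10), each = 2)   # each element repeated

cat("times = 2:", by_times, "\n")
cat("each = 2: ", by_each, "\n")
```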

Loop Over a R Vector

We can access the elements of a vector by using a for loop. For example,

numbers <- c(1, 2, 3)
# iterate through each element of numbers
for (number in numbers) {
   print(number)
}

Output
[1] 1
[1] 2
[1] 3


Length of Vector in R

We can use the length() function to find the number of elements present inside a
vector. For example,
lang <- c("R", "Swift", "Python", "Java")
# find total elements in lang using length()
cat("Total Elements:", length(lang))

Output
Total Elements: 4

Here, we have used length() to find the length of the lang vector.

4.1.2 R LIST
A list is a collection of similar or different types of data. In R,
we use the list() function to create a list. For example,

# list with similar type of data
list1 <- list(25, 22, 35, 38)
# list with different type of data
list2 <- list("Ranjy", 38, TRUE)

Here,

• list1 - list of numbers
• list2 - list containing a string, a number, and a boolean value


Access List Elements in R

In R, each element in a list is associated with a number. The number is known
as a list index.

We can access elements of a list using the index number (1, 2, 3 …). For example,

list1 <- list(26, "James", 5.4, "Nepal")
# access 1st item
print(list1[1]) # 26
# access 4th item
print(list1[4]) # Nepal

In the above example, we have created a list named list1.
Here, we have used the list index to access the list elements:

• list1[1] - accesses the first element 26
• list1[4] - accesses the fourth element "Nepal"

Note: In R, the list index always starts with 1. Hence, the first element of a list
is present at index 1, second element at index 2 and so on.
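One detail worth knowing when indexing lists: single brackets return a smaller list, while double brackets return the stored value itself. The sketch below illustrates the difference on the same list1; the names sub_list and element are assumptions for illustration.

```r
list1 <- list(26, "James", 5.4, "Nepal")

sub_list <- list1[1]     # single brackets: a list of length 1
element  <- list1[[1]]   # double brackets: the stored value itself

print(class(sub_list))
print(class(element))
print(element + 1)       # arithmetic works on the extracted value
```

Use double brackets whenever the value is needed for further computation.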

Modify a List Element in R

To change a list element, simply reassign a new value to the specific index.
For example,

list1 <- list(24, "Sabby", 5.4, "Nepal")
# change element at index 2
list1[2] <- "Cathy"
# print updated list
print(list1)

Output
[[1]]
[1] 24

[[2]]
[1] "Cathy"

[[3]]
[1] 5.4

[[4]]
[1] "Nepal"

Add Items to R List

We use the append() function to add an item at the end of a list. For example,
list1 <- list(26, "Sam")
# using append() function
append(list1, 3.14)

Output
[[1]]
[1] 26

[[2]]
[1] "Sam"

[[3]]
[1] 3.14

In the above example, we have created a list named list1. Notice the line,

append(list1, 3.14)

Here, append() adds 3.14 at the end of the list.
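Note that append() returns a new list and leaves the original unchanged, so the result has to be assigned back if the change should persist. The sketch below also shows the after argument of append(), which inserts at a position other than the end; the value "first" is an assumption for illustration.

```r
list1 <- list(26, "Sam")

# append() returns a new list; list1 changes only if the result is assigned back
list1 <- append(list1, 3.14)

# The after argument inserts at a position other than the end
list1 <- append(list1, "first", after = 0)

print(length(list1))
```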

Remove Items From a List in R

R allows us to remove items from a list. We access the elements using a list
index and add a negative sign (-) to indicate that the item should be dropped.
For example,
• [-1] - removes 1st item
• [-2] - removes 2nd item and so on.

list1 <- list(23, "Sabby", 5.4, "Bhutan")
# remove 4th item ("Bhutan")
print(list1[-4])

Output
[[1]]
[1] 23

[[2]]
[1] "Sabby"

[[3]]
[1] 5.4

Here, list1[-4] returns list1 without its 4th item.

Length of R List

In R, we use the length() function to find the number of elements present inside
a list. For example,

list1 <- list(24, "Sabby", "Nepal")
# find total elements in list1 using length()
cat("Total Elements:", length(list1))

Output
Total Elements: 3

Here, we have used the length() function to find the length of list1. Since there
are 3 elements in list1, length() returns 3.

Loop Over a List

In R, we can use a for loop to access all the elements in a list. For example,
items <- list(24, "Nova", 5.4, "SriLanka")
# iterate through each element of items
for (i in items) {
   print(i)
}
Output
[1] 24
[1] "Nova"
[1] 5.4
[1] "SriLanka"

Check if Element Exists in R List

In R, we use the %in% operator to check whether a specified element is present
in a list. It returns a boolean value:
• TRUE - if the specified element is present in the list
• FALSE - if the specified element is not present in the list
For example,

list1 <- list(24, "Sabby", 5.4, "Nepal")
"Sabby" %in% list1 # TRUE
"Kinsley" %in% list1 # FALSE

Output
[1] TRUE
[1] FALSE

Here,

• "Sabby" is present in list1, so the operator returns TRUE
• "Kinsley" is not present in list1, so the operator returns FALSE

4.1.3 R DATA FRAME

A data frame is a two-dimensional data structure which can store data
in tabular format.
Data frames have rows and columns, and each column is a
vector. Different columns can be of different data types.

Create a Data Frame in R

In R, we use the data.frame() function to create a data frame.
The syntax of the data.frame() function is

dataframe1 <- data.frame(
   first_col = c(val1, val2, ...),
   second_col = c(val1, val2, ...),
   ...
)

Here,

• first_col - a vector with values val1, val2, ... of the same data type
• second_col - another vector with values val1, val2, ... of the same data type,
and so on
Let's see an example,


dataframe1 <- data.frame (
Name = c("James", "Harry", "Steve"),
Age = c(25, 10, 34),
Vote = c(TRUE, FALSE, TRUE)
)
print(dataframe1)
Output
   Name Age  Vote
1 James  25  TRUE
2 Harry  10 FALSE
3 Steve  34  TRUE

In the above example, we have used the data.frame() function to create a data
frame named dataframe1. Notice the arguments passed inside data.frame(),

data.frame (
Name = c("James", "Harry", "Steve"),
Age = c(25, 10, 34),
Vote = c(TRUE, FALSE, TRUE)
)

Here, Name, Age, and Vote are column names for vectors of string, numeric,
and boolean type respectively.
Finally, the data is printed in tabular format.

Access Data Frame Columns

There are different ways to extract columns from a data frame. Use [ ], [[ ]],
or $ to access specific column of a data frame in R. For example,

# Create a data frame
dataframe1 <- data.frame (
Name = c("James", "Harry", "Steve"),
Age = c(25, 10, 34),
Vote = c(TRUE, FALSE, TRUE)
)
# pass index number inside [ ]
print(dataframe1[1])
# pass column name inside [[ ]]
print(dataframe1[["Name"]])
# use $ operator and column name
print(dataframe1$Name)

Output
   Name
1 James
2 Harry
3 Steve
[1] "James" "Harry" "Steve"
[1] "James" "Harry" "Steve"

In the above example, we have created a data frame named dataframe1 with three
columns Name, Age, Vote.
Here, different operators are used to access the Name column of dataframe1.
Accessing with [[ ]] or $ is similar. However, [ ] differs: it returns
a data frame, whereas the other two return the column as a vector.
Combine Data Frames

In R, we use the rbind() and cbind() functions to combine two data frames
together.
• rbind() - combines two data frames vertically
• cbind() - combines two data frames horizontally


Combine Vertically Using rbind()

To combine two data frames vertically, the column names of the two
data frames must be the same. For example,

# create a data frame


dataframe1 <- data.frame (
Name = c("Jack", "Alex"),
Age = c(23, 18)
)

# create another data frame


dataframe2 <- data.frame (
Name = c("Yash", "Bush"),
Age = c(56, 78)
)

# combine two data frames vertically


updated <- rbind(dataframe1, dataframe2)
print(updated)

Output
  Name Age
1 Jack  23
2 Alex  18
3 Yash  56
4 Bush  78

Here, we have used the rbind() function to combine the two data
frames dataframe1 and dataframe2 vertically.


Combine Horizontally Using cbind()

The cbind() function combines two or more data frames horizontally. For
example,

# create a data frame


dataframe1 <- data.frame (
Name = c("Jack", "Aliz"),
Age = c(25, 19)
)
# create another data frame
dataframe2 <- data.frame (
Hobby = c("Cricket", "Squash")
)

# combine two data frames horizontally


updated <- cbind(dataframe1, dataframe2)
print(updated)

Output
  Name Age   Hobby
1 Jack  25 Cricket
2 Aliz  19  Squash

Here, we have used cbind() to combine two data frames horizontally.


Note: The data frames being combined with cbind() must have the same number of
rows; otherwise we will get an error saying that the arguments imply
differing number of rows.

Length of a Data Frame in R

In R, we use the length() function to find the number of columns in a data
frame. For example,

# Create a data frame


dataframe1 <- data.frame (
Name = c("James", "Harry", "Steve"),
Age = c(25, 10, 34),
Vote = c(TRUE, FALSE, TRUE)
)
cat("Total Elements:", length(dataframe1))

Output
Total Elements: 3

Here, we have used length() to find the total number of columns in dataframe1.
Since there are 3 columns, the length() function returns 3.
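Since length() reports only the number of columns, the base functions nrow(), ncol() and dim() are useful when the number of rows is needed as well. A minimal sketch on the same data frame, with the variable names rows and cols assumed for illustration:

```r
dataframe1 <- data.frame(
  Name = c("James", "Harry", "Steve"),
  Age = c(25, 10, 34),
  Vote = c(TRUE, FALSE, TRUE)
)

rows <- nrow(dataframe1)   # number of rows
cols <- ncol(dataframe1)   # number of columns, same as length()
cat("Rows:", rows, "Columns:", cols, "\n")
print(dim(dataframe1))     # both at once: rows, then columns
```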



4.1.4 R MATRICES

• A matrix is a two-dimensional data set with columns and rows.
• A column is a vertical representation of data, while a row is a
horizontal representation of data.
• A matrix can be created with the matrix() function. Specify
the nrow and ncol parameters to set the number of rows and columns:

Example

# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix

Output

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

A matrix with strings:

Example

thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)
thismatrix

Output

     [,1]     [,2]
[1,] "apple"  "cherry"
[2,] "banana" "orange"

Access Matrix Items

The items are accessed using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the
column-position:

Example

thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)
thismatrix[1, 2]

Output

[1] "cherry"

The whole row can be accessed by specifying a comma after the number in
the bracket; for example, thismatrix[2, ] returns the second row.
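
As a small illustration of row and column access, the sketch below reuses the
fruit matrix from the example above. The variable names second_row, second_col
and row_as_matrix are introduced here only for illustration:

```r
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
                     ncol = 2)

# A comma after the number selects the whole row
second_row <- thismatrix[2, ]
print(second_row)

# A comma before the number selects the whole column
second_col <- thismatrix[, 2]
print(second_col)

# drop = FALSE keeps the result as a one-row matrix instead of a vector
row_as_matrix <- thismatrix[2, , drop = FALSE]
print(dim(row_as_matrix))
```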

Access More Than One Row

More than one row can be accessed by using the c() function:



Example

thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear",
"melon", "fig"), nrow = 3, ncol = 3)
thismatrix[c(1,2),]

Output

     [,1]     [,2]     [,3]
[1,] "apple"  "orange" "pear"
[2,] "banana" "grape"  "melon"

Access More Than One Column

More than one column can be accessed if you use the c() function:

Example

thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear",
"melon", "fig"), nrow = 3, ncol = 3)
thismatrix[, c(1,2)]

Output

     [,1]     [,2]
[1,] "apple"  "orange"
[2,] "banana" "grape"
[3,] "cherry" "pineapple"

Add Rows and Columns

Use the cbind() function to add additional columns in a Matrix:

Example

thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear",
"melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix

Output

     [,1]     [,2]        [,3]    [,4]
[1,] "apple"  "orange"    "pear"  "strawberry"
[2,] "banana" "grape"     "melon" "blueberry"
[3,] "cherry" "pineapple" "fig"   "raspberry"

Note: The new column must contain as many cells as the existing matrix has
rows.

Use the rbind() function to add additional rows in a Matrix:

Example

thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "grape", "pineapple", "pear",
"melon", "fig"), nrow = 3, ncol = 3)
newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
newmatrix

Output

     [,1]         [,2]        [,3]
[1,] "apple"      "orange"    "pear"
[2,] "banana"     "grape"     "melon"
[3,] "cherry"     "pineapple" "fig"
[4,] "strawberry" "blueberry" "raspberry"

Note: The new row must contain as many cells as the existing matrix has
columns.

Remove Rows and Columns

Use negative indices, written here with the c() function, to remove rows and
columns from a Matrix:

Example

thismatrix <-
matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"),
nrow = 3, ncol = 2)
# Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
thismatrix

Output

[1] "mango" "pineapple"

Check if an Item Exists

To find out if a specified item is present in a matrix, use the %in% operator:

Example

Check if "apple" is present in the matrix:

thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)

"apple" %in% thismatrix


Output

[1] TRUE

Number of Rows and Columns

Use the dim() function to find the number of rows and columns in a Matrix:

Example

thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)

dim(thismatrix)

Output

[1] 2 2

Matrix Length

Use the length() function to find the total number of items (cells) in a
Matrix:

Example

thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)

length(thismatrix)

Output

[1] 4



The total number of cells in the matrix is the number of rows multiplied by
the number of columns.

In the example above: Length = 2*2 = 4.
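
The relationship between dim() and length() can be verified with a short
sketch. The numeric matrix m below is a hypothetical example chosen only to
make the arithmetic easy to follow:

```r
# A hypothetical 3 x 2 matrix
m <- matrix(1:6, nrow = 3, ncol = 2)

# dim() returns the number of rows and the number of columns
dims <- dim(m)
print(dims)

# length() returns the total number of cells
total_cells <- length(m)
print(total_cells)

# The total equals rows multiplied by columns
print(total_cells == nrow(m) * ncol(m))
```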

Loop Through a Matrix

A for loop can be used to go through a matrix. The loop below starts at the
first row and moves right through its columns before continuing with the next
row:

Example

Loop through the matrix items and print them:

thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2,
ncol = 2)

for (rows in 1:nrow(thismatrix)) {
  for (columns in 1:ncol(thismatrix)) {
    print(thismatrix[rows, columns])
  }
}

Output

[1] "apple"
[1] "cherry"
[1] "banana"
[1] "orange"

Combine two Matrices

Use the rbind() or cbind() function to combine two or more matrices together:

Example

# Combine matrices
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2, ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"),
nrow = 2, ncol = 2)

# Adding it as a rows
Matrix_Combined <- rbind(Matrix1, Matrix2)
Matrix_Combined

# Adding it as a columns
Matrix_Combined <- cbind(Matrix1, Matrix2)
Matrix_Combined

Output

     [,1]     [,2]
[1,] "apple"  "cherry"
[2,] "banana" "grape"
[3,] "orange" "pineapple"
[4,] "mango"  "watermelon"
     [,1]     [,2]     [,3]     [,4]
[1,] "apple"  "cherry" "orange" "pineapple"
[2,] "banana" "grape"  "mango"  "watermelon"

4.1.4 R ARRAYS

Arrays can have more than two dimensions. Use the array() function to create
an array, and the dim parameter to specify the dimensions.

Example



# A vector (one dimension) with values ranging from 1 to 24

thisarray <- c(1:24)

thisarray

# An array with more than one dimension

multiarray <- array(thisarray, dim = c(4, 3, 2))

multiarray

Output

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
, , 1

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

, , 2

     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24

Example Explained

In the example above we create an array with the values 1 to 24.

dim=c(4,3,2)
The first and second numbers in the bracket specify the number of rows and
columns.
The last number in the bracket specifies how many such matrices (the size of
the third dimension) we want.

Note: Arrays can only have one data type.

Access Array Items

The array elements can be accessed by referring to their index position. Use
the [] brackets to access the desired elements of an array.

The syntax is as follows: array[row position, column position, matrix level]

Example

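thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray[2, 3, 2]

Output

[1] 22

A whole row or column from a matrix in an array can be accessed by leaving
the other position empty, optionally with the c() function to select several
positions. For instance, multiarray[c(1), , 1] returns the first row of the
first matrix.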
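
Whole rows, columns and matrix levels of the same 4 x 3 x 2 array can be
selected by leaving index positions empty. The names first_row, second_col
and level_two below are used only for this sketch:

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

# First row of the first matrix: row index given, column index left empty
first_row <- multiarray[1, , 1]
print(first_row)

# Second column of the first matrix: row index left empty
second_col <- multiarray[, 2, 1]
print(second_col)

# The whole second matrix: row and column indices both left empty
level_two <- multiarray[, , 2]
print(dim(level_two))
```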

Check if an Item Exists



To find out if a specified item is present in an array, use the %in% operator:

Example

Check if the value "2" is present in the array:

thisarray <- c(1:24)


multiarray <- array(thisarray, dim = c(4, 3, 2))

2 %in% multiarray

Output

[1] TRUE

Number of Rows and Columns

Use the dim() function to find the number of rows, columns and matrix levels
in an array:

Example

thisarray <- c(1:24)


multiarray <- array(thisarray, dim = c(4, 3, 2))
dim(multiarray)

Output

[1] 4 3 2

Array Length

Use the length() function to find the total number of items in an array:

Example

thisarray <- c(1:24)


multiarray <- array(thisarray, dim = c(4, 3, 2))
length(multiarray)



Output

[1] 24

Loop Through an Array

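Loop through the array items by using a for loop. In the example below,
thisarray holds only five values, so array() recycles them to fill all 24
cells: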

Example

thisarray <- c(1:5)


multiarray <- array(thisarray, dim = c(4, 3, 2))
for(x in multiarray){
print(x)
}

Output

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 1
[1] 2
[1] 3
[1] 4

4.5 R FACTORS

Factors are used to categorize data. Examples of factors are:

• Demography: Male/Female
• Music: Rock, Pop, Classic, Jazz
• Training: Strength, Stamina

To create a factor, use the factor() function and add a vector as argument:

Example

# Create a factor
music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
# Print the factor
music_genre

Output

[1] Jazz Rock Classic Classic Pop Jazz Rock Jazz

Levels: Classic Jazz Pop Rock

The example above shows that the factor has four levels (categories):
Classic, Jazz, Pop and Rock.



To only print the levels, use the levels() function:

Example

music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
levels(music_genre)

Output

[1] "Classic" "Jazz" "Pop" "Rock"

Factor Length

Use the length() function to find out how many items there are in the factor:

Example

music_genre <-
factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

length(music_genre)

Output

[1] 8
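
Beyond levels() and length(), a factor can also be summarised per category.
The sketch below reuses the music_genre factor from the examples above; the
variable names counts, n_levels and codes are introduced here for
illustration:

```r
music_genre <-
  factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))

# table() counts how many items fall into each level
counts <- table(music_genre)
print(counts)

# nlevels() returns the number of distinct levels
n_levels <- nlevels(music_genre)
print(n_levels)

# Internally each item is stored as the integer position of its level
codes <- as.integer(music_genre)
print(codes)
```

The integer codes follow the alphabetical order of the levels (Classic = 1,
Jazz = 2, Pop = 3, Rock = 4), which is the default ordering used by factor().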

4.2 R CSV Files

A Comma-Separated Values (CSV) file is a plain text file which contains a
list of data. These files are often used for the exchange of data between
different applications. For example, databases and contact managers mostly
support CSV files.

These files are sometimes called character-separated values or
comma-delimited files. They often use the comma character to separate data,
but sometimes use other characters such as semicolons. The idea is that we can
export complex data from one application to a CSV file, and then import the
data in that CSV file into another application.

R allows us to read data from files which are stored outside the R
environment.
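
As a minimal sketch of reading delimited text that uses a separator other than
the comma, the example below passes a small semicolon-separated string to
read.csv() through its text argument, so no file on disk is needed. The sample
records in csv_text are an abbreviated, hypothetical excerpt:

```r
# Semicolon-separated text held in a string (hypothetical sample records)
csv_text <- "id;name;salary\n1;Shubham;613.3\n2;Arpita;525.2"

# sep = ";" tells read.csv() which character separates the values
records <- read.csv(text = csv_text, sep = ";")
print(records)
str(records)
```

The same sep argument applies when reading from a file name instead of a
string.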

Getting and setting the working directory

In R, getwd() and setwd() are two useful functions. The getwd() function
returns the directory to which the R workspace currently points, and the
setwd() function sets a new working directory, from which files are then read
and to which they are written.

Let's see an example to understand how getwd() and setwd() functions are
used.

Example

# Getting and printing current working directory.


print(getwd())
# Setting the current working directory.
setwd("C:/Users/ajeet")
# Getting and printing the current working directory.
print(getwd())

Output



Creating a CSV File

A text file in which commas separate the values of each record is known as a
CSV file. Let's start by creating a CSV file from the data given below,
saving it with the .csv extension using the Save As > All Files (*.*) option
in Notepad.

Example: record.csv

id,name,salary,start_date,dept
1,Shubham,613.3,2012-01-01,IT
2,Arpita,525.2,2013-09-23,Operations
3,Vaishali,63,2014-11-15,IT
4,Nishka,749,2014-05-11,HR
5,Gunjan,863.25,2015-03-27,Finance
6,Sumit,588,2013-05-21,IT
7,Anisha,932.8,2013-07-30,Operations
8,Akash,712.5,2014-06-17,Finance



Reading a CSV file

R provides the read.csv() function, which allows us to read a CSV file
available in our current working directory. This function takes the file name
as input and returns all the records present in it.

Let's use our record.csv file to read records from it using the read.csv()
function.

Example

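data <- read.csv("record.csv")
print(data)

When we execute the above code, it gives the following output:

Output

  id     name salary start_date       dept
1  1  Shubham 613.30 2012-01-01         IT
2  2   Arpita 525.20 2013-09-23 Operations
3  3 Vaishali  63.00 2014-11-15         IT
4  4   Nishka 749.00 2014-05-11         HR
5  5   Gunjan 863.25 2015-03-27    Finance
6  6    Sumit 588.00 2013-05-21         IT
7  7   Anisha 932.80 2013-07-30 Operations
8  8    Akash 712.50 2014-06-17    Finance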


Analyzing the CSV File

When we read data from the .csv file using read.csv() function, by default, it
gives the output as a data frame. Before analyzing data, let's start checking
the form of our output with the help of is.data.frame() function. After that,
we will check the number of rows and number of columns with the help
of nrow() and ncol() function.

Example

csv_data<- read.csv("record.csv")
print(is.data.frame(csv_data))
print(ncol(csv_data))
print(nrow(csv_data))

Output

[1] TRUE
[1] 5
[1] 8

From the above output, it is clear that our data is read in the form of a
data frame with 5 columns and 8 rows.

Example: Getting the maximum salary

# Creating a data frame.


csv_data<- read.csv("record.csv")
# Getting the maximum salary from data frame.
max_sal<- max(csv_data$salary)
print(max_sal)

Output

[1] 932.8

Example: Getting the details of the person who has the maximum salary

# Creating a data frame.
csv_data <- read.csv("record.csv")
# Getting the maximum salary from data frame.
max_sal <- max(csv_data$salary)
print(max_sal)
# Getting the details of the person who has the maximum salary
details <- subset(csv_data, salary == max(salary))
print(details)

Output

Example: Getting the details of all the persons who are working in the IT
department

# Creating a data frame.


csv_data<- read.csv("record.csv")
# Getting the details of all the persons who are working in the IT department
details <- subset(csv_data,dept=="IT")
print(details)

Output

Example: Getting the details of the persons whose salary is greater than
600 and working in the IT department.

# Creating a data frame.
csv_data <- read.csv("record.csv")
# Getting the details of the persons in the IT department whose salary is
# greater than 600
details <- subset(csv_data, dept == "IT" & salary > 600)
print(details)

Output

Example: Getting details of those people who joined in or after 2014.

# Creating a data frame.
csv_data <- read.csv("record.csv")
# Getting details of those people who joined in or after 2014
# (>= is used so that a start date of 2014-01-01 itself is included)
details <- subset(csv_data, as.Date(start_date) >= as.Date("2014-01-01"))
print(details)

Output

Writing into a CSV file

Besides reading and analyzing, R also allows us to write into a .csv file. For
this purpose, R provides the write.csv() function. This function creates a CSV
file from an existing data frame. Given only a file name, it creates the file
in the current working directory.
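
By default, write.csv() also writes the row names as an unnamed first column,
which read.csv() later reads back as a column called X. The sketch below shows
the difference made by row.names = FALSE. It assumes a small hypothetical data
frame and uses temporary files from tempfile(), so nothing in the working
directory is touched:

```r
# A small hypothetical data frame standing in for the filtered records
details <- data.frame(id = c(3, 4), name = c("Vaishali", "Nishka"))

# Default behaviour: row names are written and come back as column "X"
file_default <- tempfile(fileext = ".csv")
write.csv(details, file_default)
with_row_names <- read.csv(file_default)
print(names(with_row_names))

# row.names = FALSE writes only the columns of the data frame
file_clean <- tempfile(fileext = ".csv")
write.csv(details, file_clean, row.names = FALSE)
round_trip <- read.csv(file_clean)
print(names(round_trip))
```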

Example

csv_data <- read.csv("record.csv")
# Getting details of those people who joined in or after 2014
details <- subset(csv_data, as.Date(start_date) >= as.Date("2014-01-01"))
# Writing filtered data into a new file.
# (Row names are written too and are read back as a column named X.)
write.csv(details, "output.csv")
new_details <- read.csv("output.csv")
print(new_details)

Output



MODULE 5

5.1 Classification Algorithm


5.2 Types of Classification Algorithms
5.3 Decision Tree Classification
5.4 Naïve Bayes Classifier



5.1 CLASSIFICATION ALGORITHM

• The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of
training data.
• In Classification, a program learns from the given dataset or
observations and then classifies new observations into a number of
classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat
or dog, etc.
• Classes can also be called targets, labels or categories.
• In a classification algorithm, the input variable x is mapped to a
discrete output y:

y=f(x),
where y = categorical output

• A well-known example of an ML classification algorithm is an email spam
detector.
• The main goal of the Classification algorithm is to identify the category
of a given dataset, and these algorithms are mainly used to predict the
output for categorical data.
• Classification algorithms can be better understood using the below
diagram. In the below diagram, there are two classes, class A and class
B. The members of each class have features that are similar to each other
and dissimilar to those of the other class.



The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of classification:

o Binary Classifier: If the classification problem has only two possible
outcomes, then it is called a Binary Classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or
DOG, etc.

o Multi-class Classifier: If a classification problem has more than two
outcomes, then it is called a Multi-class Classifier.
Examples: Classification of types of crops, classification of types of
music.

5.1.1 Learners in Classification Problems:

In the classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and
waits until it receives the test dataset. Classification is then done on
the basis of the most related data stored in the training dataset. It
takes less time in training but more time for predictions.
Example: K-NN algorithm, Case-based reasoning

2. Eager Learners: Eager learners develop a classification model based
on a training dataset before receiving a test dataset. In contrast to lazy
learners, an eager learner takes more time in learning and less time in
prediction. Example: Decision Trees, Naïve Bayes, ANN.

5.2 TYPES OF CLASSIFICATION ALGORITHMS

Classification algorithms can be divided mainly into two categories:

o Linear Models

   o Logistic Regression

   o Support Vector Machines

o Non-linear Models

   o K-Nearest Neighbours

   o Kernel SVM

   o Naïve Bayes

   o Decision Tree Classification

   o Random Forest Classification

5.2.1 Evaluating a Classification model:

Once the model is built, it is necessary to evaluate its performance, whether
it is a Classification or a Regression model. A Classification model can be
evaluated in the following ways:

1. Confusion Matrix:

o The confusion matrix provides us with a matrix/table as output and
describes the performance of the model.

o It is also known as the error matrix.

o The matrix summarizes the prediction results, giving the total number of
correct predictions and incorrect predictions. The matrix looks like the
table below:

                      Actual Positive     Actual Negative

Predicted Positive    True Positive       False Positive

Predicted Negative    False Negative      True Negative
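
The four cells of the confusion matrix can be counted directly with table().
The sketch below uses made-up actual and predicted labels (hypothetical data,
not taken from any dataset in this material) and derives accuracy, sensitivity
and specificity from the counts:

```r
# Hypothetical actual and predicted labels for eight observations
actual    <- factor(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "Yes"),
                    levels = c("Yes", "No"))
predicted <- factor(c("Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"),
                    levels = c("Yes", "No"))

# Rows are predictions and columns are actual classes, as in the table above
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

TP <- cm["Yes", "Yes"]  # true positives
FP <- cm["Yes", "No"]   # false positives
FN <- cm["No", "Yes"]   # false negatives
TN <- cm["No", "No"]    # true negatives

accuracy    <- (TP + TN) / sum(cm)   # share of correct predictions
sensitivity <- TP / (TP + FN)        # true positive rate
specificity <- TN / (TN + FP)        # true negative rate
print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity))
```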



2. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic and AUC stands for Area
Under the Curve.

o It is a graph that shows the performance of the classification model at
different thresholds.

o To visualize the performance of the multi-class classification model,
we use the AUC-ROC Curve.

o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis and
FPR (False Positive Rate) on the X-axis.
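
How the points of a ROC curve arise from different thresholds can be sketched
in base R for a two-class problem. The scores and labels below are
hypothetical, the helper roc_point() is introduced only for this example, and
the area is approximated with the trapezoidal rule:

```r
# Hypothetical predicted probabilities and true classes (1 = positive)
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
labels <- c(1, 1, 0, 1, 0, 0)

# Use every distinct score as a threshold, from highest to lowest
thresholds <- sort(unique(scores), decreasing = TRUE)

# TPR and FPR when everything scoring at or above the threshold is positive
roc_point <- function(threshold) {
  predicted <- as.integer(scores >= threshold)
  tpr <- sum(predicted == 1 & labels == 1) / sum(labels == 1)
  fpr <- sum(predicted == 1 & labels == 0) / sum(labels == 0)
  c(threshold = threshold, FPR = fpr, TPR = tpr)
}
roc <- t(sapply(thresholds, roc_point))
print(roc)

# Area under the curve by the trapezoidal rule, starting from the point (0, 0)
fpr <- c(0, roc[, "FPR"])
tpr <- c(0, roc[, "TPR"])
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```

Plotting roc[, "FPR"] against roc[, "TPR"] with plot(..., type = "l") would
draw the curve itself.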

5.2.2 Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some
popular use cases of Classification Algorithms:

o Email Spam Detection

o Speech Recognition

o Identifications of Cancer tumor cells.

o Drugs Classification

o Biometric Identification, etc.

5.3 DECISION TREE CLASSIFICATION

Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal
nodes represent the features of a dataset, branches represent the
decision rules and each leaf node represents the outcome.

o In a Decision tree, there are two types of nodes: the Decision Node and
the Leaf Node. Decision nodes are used to make a decision and have multiple
branches, whereas Leaf nodes are the outputs of those decisions and do not
contain any further branches.

o The decisions or tests are performed on the basis of the features of the
given dataset.

o It is a graphical representation for getting all the possible
solutions to a problem/decision based on given conditions.

o It is called a decision tree because, similar to a tree, it starts with the
root node, which expands into further branches and constructs a tree-like
structure.

o In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.

o A decision tree simply asks a question and, based on the answer
(Yes/No), further splits the tree into subtrees.

o The diagram below explains the general structure of a decision tree:

Note: A decision tree can contain categorical data (YES/NO) as well as
numeric data.

5.3.1 Decision Tree Terminologies



• Root Node: Root node is from where the decision tree starts. It
represents the entire dataset, which further gets divided into two or
more homogeneous sets.
• Leaf Node: Leaf nodes are the final output node, and the tree cannot
be segregated further after getting a leaf node.
• Splitting: Splitting is the process of dividing the decision node/root
node into sub-nodes according to the given conditions.
• Branch/Sub Tree: A tree formed by splitting the tree.
• Pruning: Pruning is the process of removing the unwanted branches
from the tree.
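• Parent/Child node: A node that is split into sub-nodes is called the
parent node of those sub-nodes, and the sub-nodes are called its child
nodes. The root node is the parent of the nodes created by the first split.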

5.3.2 Steps for Decision Tree algorithm:

In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree. This algorithm compares the values of
root attribute with the record (real dataset) attribute and, based on the
comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues the process until it reaches
a leaf node of the tree. The complete process can be better understood
using the algorithm below:

o Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.

o Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).

o Step-3: Divide S into subsets that contain the possible values of the
best attribute.

o Step-4: Generate the decision tree node, which contains the best
attribute.

o Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is
reached where the nodes cannot be classified further; such a final node
is called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to
decide whether to accept the offer or not. To solve this problem, the
decision tree starts with the root node (the Salary attribute, chosen by
ASM). The root node splits further into the next decision node (distance from
the office) and one leaf node, based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and
one leaf node. Finally, that decision node splits into two leaf nodes
(Accepted offer and Declined offer). Consider the below diagram:

5.3.3 Attribute Selection Measures

While implementing a Decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve this problem,
there is a technique called the Attribute Selection Measure, or ASM. With
this measure, we can easily select the best attribute for the nodes of the
tree. There are two popular techniques for ASM, which are:

o Information Gain

o Gini Index

1. Information Gain:

o Information gain is the measurement of the change in entropy after the
segmentation of a dataset based on an attribute.

o It calculates how much information a feature provides us about a
class.

o According to the value of information gain, we split the node and build
the decision tree.

o A decision tree algorithm always tries to maximize the value of
information gain, and the node/attribute having the highest information
gain is split first. It can be calculated using the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It
specifies the randomness in the data. Entropy can be calculated as:

$Entropy(S) = -P(yes)\log_2 P(yes) - P(no)\log_2 P(no)$

Where,

o S = the complete set of samples

o P(yes) = probability of yes

o P(no) = probability of no
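
The entropy and information gain formulas above can be sketched in base R as
follows. The helper names entropy() and information_gain() and the tiny
outlook/windy/play vectors are illustrative assumptions; they are not part of
any package or dataset used elsewhere in this material:

```r
# Entropy of a vector of class labels: -sum(p * log2(p)) over the classes
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # avoid log2(0) for empty classes
  -sum(p * log2(p))
}

# Information gain = entropy before the split minus the weighted average
# entropy of the subsets created by splitting on the feature
information_gain <- function(feature, labels) {
  weights <- table(feature) / length(feature)
  child_entropy <- tapply(labels, feature, entropy)
  weighted <- sum(as.numeric(weights[names(child_entropy)]) *
                  as.numeric(child_entropy))
  entropy(labels) - weighted
}

# Hypothetical data: outlook separates the classes perfectly, windy does not
outlook <- c("Sunny", "Sunny", "Rainy", "Rainy")
windy   <- c("Yes", "No", "Yes", "No")
play    <- c("Yes", "Yes", "No", "No")

print(entropy(play))
print(information_gain(outlook, play))
print(information_gain(windy, play))
```

In this toy data the attribute with the higher information gain (outlook)
would be chosen for the split, which is the selection rule described above.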

2. Gini Index:

o Gini index is a measure of impurity or purity used while creating a
decision tree in the CART (Classification and Regression Tree)
algorithm.

o An attribute with a low Gini index should be preferred over one with a
high Gini index.

o It only creates binary splits, and the CART algorithm uses the Gini
index to create binary splits.

o Gini index can be calculated using the below formula, where $P_j$ is the
proportion of samples that belong to class j:

$Gini\ Index = 1 - \sum_j P_j^2$
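
As a quick numerical check of the Gini formula, the sketch below computes the
index for a vector of class labels. The function name gini_index() and the
sample labels are assumptions made for this illustration:

```r
# Gini index of a vector of class labels: 1 - sum of squared class shares
gini_index <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# A perfectly mixed two-class node has the highest two-class impurity
mixed_node <- c("yes", "yes", "no", "no")
print(gini_index(mixed_node))

# A pure node (one class only) has no impurity
pure_node <- c("yes", "yes", "yes", "yes")
print(gini_index(pure_node))
```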

5.3.4 Pruning: Getting an Optimal Decision tree

Pruning is the process of deleting unnecessary nodes from a tree in order to
get the optimal decision tree.

A tree that is too large increases the risk of overfitting, while a small
tree may not capture all the important features of the dataset. A technique
that decreases the size of the learned tree without reducing accuracy is
therefore known as Pruning. There are mainly two types of tree pruning
techniques:

o Cost Complexity Pruning

o Reduced Error Pruning.
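
Cost complexity pruning can be tried with the rpart package, which is also
used for the decision tree example in this module. This is a minimal sketch on
the built-in iris data; growing the tree with cp = 0 and picking the cp value
with the lowest cross-validated error are common conventions assumed here, not
the only possible choices:

```r
library(rpart)

set.seed(123)  # cross-validation inside rpart() is random

# Grow a deliberately large tree by switching off the complexity penalty
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 2))

# The cptable lists, for each tree size, the complexity parameter (CP)
# and the cross-validated error (xerror)
cp_table <- full_tree$cptable
print(cp_table)

# Choose the CP with the lowest cross-validated error and prune to it
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned_tree <- prune(full_tree, cp = best_cp)

# Compare the number of nodes before and after pruning
print(c(full = nrow(full_tree$frame), pruned = nrow(pruned_tree$frame)))
```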

5.3.5 Advantages of the Decision Tree

o It is simple to understand, as it follows the same process that a
human follows while making a decision in real life.

o It can be very useful for solving decision-related problems.

o It helps to think about all the possible outcomes for a problem.

o It requires less data cleaning than other algorithms.

5.3.6 Disadvantages of the Decision Tree

o A decision tree may contain many layers, which makes it complex.

o It may have an overfitting issue, which can be reduced using
the Random Forest algorithm.

o For more class labels, the computational complexity of the decision
tree may increase.

EXAMPLE:

DIABETES PREDICTION USING DECISION TREE IN R

ABOUT THE DATASET

The data used for this analysis is from the National Institute of Diabetes and
Digestive and Kidney Diseases and is made available on Kaggle.

The dataset contains entries only from women of at least 21 years of age with
Pima Indian heritage, with the following features:

• Pregnancies: Number of times pregnant.

• Glucose: Plasma glucose concentration at 2 hours in an oral glucose
tolerance test.

• BloodPressure: Diastolic blood pressure (mm Hg).

• SkinThickness: Triceps skin fold thickness (mm).

• Insulin: 2-Hour serum insulin (mu U/ml).

• BMI: Body mass index (weight in kg/(height in m)^2).

• DiabetesPedigreeFunction: Diabetes pedigree function.

• Age: Age (years).

• Outcome: Class variable (0 or 1) with the class value 1 representing
those who tested positive for diabetes.


> df <- read.csv(file.choose())
> head(df)
  Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
1           6     148            72            35       0 33.6
2           1      85            66            29       0 26.6
3           8     183            64             0       0 23.3
4           1      89            66            23      94 28.1
5           0     137            40            35     168 43.1
6           5     116            74             0       0 25.6
  DiabetesPedigreeFunction Age Outcome
1                    0.627  50       1
2                    0.351  31       0
3                    0.672  32       1
4                    0.167  21       0
5                    2.288  33       1
6                    0.201  30       0
> str(df)
'data.frame': 768 obs. of 9 variables:
$ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
$ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
$ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
$ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
$ Age : int 50 31 32 21 33 30 26 29 53 54 ...
$ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
> df$Outcome<-factor(df$Outcome, levels=c(0,1), labels=c("No", "Yes"))
> summary(df)
Pregnancies Glucose BloodPressure
Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00
Median : 3.000 Median :117.0 Median : 72.00
Mean : 3.845 Mean :120.9 Mean : 69.11
3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00
Max. :17.000 Max. :199.0 Max. :122.00
> sapply(df,function(x) sum(is.na(x)))
Pregnancies Glucose BloodPressure
0 0 0
SkinThickness Insulin BMI
0 0 0
DiabetesPedigreeFunction Age Outcome
0 0 0
> install.packages("corrplot")
> library(corrplot)
> corrplot(cor(df[, -9]), type = "lower", method = "number")

> set.seed(123)
> n <- nrow(df)
> train <- sample(n, trunc(0.70*n))
> df_train <- df[train, ]
> df_test <- df[-train, ]
> install.packages("rpart")
> library(rpart)
> model<-rpart(Outcome ~ .,data=df_train)



> plot(model, margin = 0.01)
> text(model,use.n = TRUE, pretty = TRUE, cex=0.8)

> p<-predict(model,df_test,type="class")
> library(caret)
> confusionMatrix(p,df_test$Outcome)
Confusion Matrix and Statistics

Reference
Prediction No Yes
No 129 46
Yes 21 35

Accuracy : 0.71
95% CI : (0.6468, 0.7676)
No Information Rate : 0.6494
P-Value [Acc > NIR] : 0.029993

Kappa : 0.3144

Mcnemar's Test P-Value : 0.003367



Sensitivity : 0.8600
Specificity : 0.4321
Pos Pred Value : 0.7371
Neg Pred Value : 0.6250
Prevalence : 0.6494
Detection Rate : 0.5584
Detection Prevalence : 0.7576
Balanced Accuracy : 0.6460

'Positive' Class : No

5.4 NAÏVE BAYES CLASSIFIER ALGORITHM

o The Naïve Bayes algorithm is a supervised learning algorithm, which is
based on Bayes' theorem and used for solving classification problems.

o It is mainly used in text classification, which involves a
high-dimensional training dataset.

o The Naïve Bayes Classifier is one of the simplest and most effective
classification algorithms, which helps in building fast machine
learning models that can make quick predictions.

o It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.

o Some popular applications of the Naïve Bayes algorithm are spam
filtering, sentiment analysis, and classifying articles.

The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be
described as:

o Naïve: It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other features.
For example, if a fruit is identified on the basis of color, shape, and
taste, then a red, spherical, and sweet fruit is recognized as an apple.
Hence each feature individually contributes to identifying it as an apple,
without depending on the others.

o Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem.

Bayes' Theorem:

o Bayes' theorem is also known as Bayes' Rule or Bayes' law. It is
used to determine the probability of a hypothesis with prior
knowledge. It depends on the conditional probability.

o The formula for Bayes' theorem is given as:

$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)}$

Where,

P(A|B) is the Posterior probability: Probability of hypothesis A given the
observed event B.

P(B|A) is the Likelihood probability: Probability of the evidence given that
the hypothesis is true.

P(A) is the Prior Probability: Probability of the hypothesis before observing
the evidence.

P(B) is the Marginal Probability: Probability of the evidence.

5.4.1 Steps for Naïve Bayes' Classifier:

The working of the Naïve Bayes' Classifier can be understood with the help of
the below example:

Suppose we have a dataset of weather conditions and a corresponding target
variable "Play". Using this dataset, we need to decide whether or not the
player should play on a particular day according to the weather conditions.
To solve this problem, follow the below steps:

1. Convert the given dataset into frequency tables.

2. Generate a Likelihood table by finding the probabilities of the given
features.

3. Now, use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the player play or not?

Solution: To solve this, first consider the below dataset:

      Outlook     Play

0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes

Frequency table for the Weather Conditions:

Weather     Yes   No

Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4

Likelihood table for the weather conditions:

Weather     No              Yes

Overcast    0               5               5/14 = 0.35
Rainy       2               2               4/14 = 0.29
Sunny       2               3               5/14 = 0.35
All         4/14 = 0.29     10/14 = 0.71

Applying Bayes' theorem:

P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)

P(Sunny|Yes) = 3/10 = 0.3

P(Sunny) = 5/14 = 0.36

P(Yes) = 10/14 = 0.71

So P(Yes|Sunny) = (3/10) * (10/14) / (5/14) = 3/5 = 0.60

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)

P(Sunny|No) = 2/4 = 0.5

P(No) = 4/14 = 0.29

P(Sunny) = 5/14 = 0.36

So P(No|Sunny) = (2/4) * (4/14) / (5/14) = 2/5 = 0.40

So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
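
The calculation above can be reproduced in R. The following is a minimal sketch using base R only; the vectors are transcribed from the 14-row weather dataset, and the variable names are chosen here for illustration.

```r
# Weather dataset transcribed from the table in this section (rows 0 to 13)
outlook <- c("Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
             "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast")
play <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
          "Yes", "No", "No", "Yes", "No", "Yes", "Yes")

# Step 1: frequency table
table(outlook, play)

# Step 2: likelihoods, priors and evidence as simple proportions
p_sunny_given_yes <- mean(outlook[play == "Yes"] == "Sunny")  # 3/10
p_sunny_given_no  <- mean(outlook[play == "No"] == "Sunny")   # 2/4
p_yes   <- mean(play == "Yes")                                # 10/14
p_no    <- mean(play == "No")                                 # 4/14
p_sunny <- mean(outlook == "Sunny")                           # 5/14

# Step 3: Bayes' theorem
p_yes_given_sunny <- p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny  <- p_sunny_given_no * p_no / p_sunny

p_yes_given_sunny
p_no_given_sunny
```

Because the fractions are carried through without intermediate rounding, the two posteriors sum to exactly 1.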

5.4.2 Advantages of Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.

o It can be used for binary as well as multi-class classification.

o It performs well in multi-class predictions compared to many other algorithms.

o It is a popular choice for text classification problems.

5.4.3 Disadvantages of Naïve Bayes Classifier:

o Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn the relationships between features.


5.4.4 Applications of Naïve Bayes Classifier:

o It is used for Credit Scoring.

o It is used in medical data classification.

o It can be used in real-time predictions because the Naïve Bayes classifier is an eager learner.

o It is used in text classification, such as spam filtering and sentiment analysis.

EXAMPLE:
NAÏVE BAYES CLASSIFIER FOR IRIS DATASET
ABOUT DATASET
The Iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica, Iris versicolor).

# Installing packages (run once)
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading packages
library(e1071)    # naiveBayes()
library(caTools)  # sample.split()
library(caret)    # confusionMatrix()

# Loading data
data(iris)

# Structure
str(iris)

# Splitting data into train and test data.
# sample.split() expects the label vector, so the split is stratified by Species.
# The exact counts in the output depend on the random split and the seed.
set.seed(120)  # Setting seed before the random split so that it is reproducible
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_cl <- subset(iris, split == TRUE)
test_cl <- subset(iris, split == FALSE)

# Feature scaling (shown for reference; the model below is fitted on the unscaled columns)
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])

# Fitting Naive Bayes model to the training dataset
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl

# Predicting on test data
y_pred <- predict(classifier_cl, newdata = test_cl)

# Confusion matrix (rows = actual species, columns = predicted species)
cm <- table(test_cl$Species, y_pred)
cm

# Model evaluation
confusionMatrix(cm)
Output:
• Model classifier_cl:

• The model estimates the conditional probabilities for each feature or variable separately. The a priori probabilities, which indicate the class distribution of the training data, are also calculated.
• Confusion Matrix:

• So, 20 setosa are correctly classified as setosa. Out of 16 versicolor, 15 are correctly classified as versicolor and 1 is classified as virginica. Out of 24 virginica, 19 are correctly classified as virginica and 5 are classified as versicolor.


• Model Evaluation:

• The model achieved 90% accuracy (54 of the 60 test samples were classified correctly). Judging by the sensitivity, specificity, and balanced accuracy reported by confusionMatrix(), the model built is good.


MODULE -6

6.1 K-Nearest Neighbor (KNN) Algorithm

6.2 Random Forest Algorithm

6.1 K-NEAREST NEIGHBOR(KNN) ALGORITHM

o K-Nearest Neighbor is one of the simplest machine learning algorithms based on the supervised learning technique.

o The K-NN algorithm assumes similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.

o The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category by using the K-NN algorithm.

o The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.

o K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.

o It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs the computation on it at the time of classification.

o At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies that data into the category that is most similar to the new data.

o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will compare the features of the new image with those of the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.

6.1.1 Need for a K-NN Algorithm

Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories does this data point lie? To solve this type of problem, we need the K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point. Consider the below diagram:


6.1.2 Steps for K-NN:

The K-NN working can be explained on the basis of the below algorithm:

o Step-1: Select the number K of neighbors.

o Step-2: Calculate the Euclidean distance from the new data point to each data point in the training set.

o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.

o Step-4: Among these K neighbors, count the number of data points in each category.

o Step-5: Assign the new data point to the category for which the number of neighbors is maximum.

o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required
category. Consider the below image:



o Firstly, we choose the number of neighbors; here we choose k = 5.

o Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2), it can be calculated as:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

o By calculating the Euclidean distance we get the nearest neighbors: three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

o As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data point must belong to category A.
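
The K-NN steps can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used later in this module: the training points, the new point, and all variable names are made-up assumptions, chosen so that three of the five nearest neighbors fall in category A and two in category B.

```r
# Hypothetical training data: two features (x, y) and a category label
train <- data.frame(
  x = c(1, 1, 2, 2, 3, 5.0, 5.5, 6, 6),
  y = c(1, 2, 2, 3, 3, 4.5, 5.5, 5, 6),
  category = c("A", "A", "A", "A", "A", "B", "B", "B", "B")
)
new_point <- c(x = 3.4, y = 3.6)  # hypothetical new data point

# Step 1: select the number K of neighbors
k <- 5

# Step 2: Euclidean distance from the new point to every training point
distances <- sqrt((train$x - new_point["x"])^2 + (train$y - new_point["y"])^2)

# Step 3: indices of the K nearest neighbors
nearest <- order(distances)[1:k]

# Step 4: count the neighbors in each category
votes <- table(train$category[nearest])
votes

# Step 5: assign the category with the most neighbors
predicted <- names(votes)[which.max(votes)]
predicted
```

Choosing an odd k for a two-class problem, as here, avoids tied votes.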

Advantages of KNN Algorithm:

o It is simple to implement.

o It is robust to the noisy training data

o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which can sometimes be complex.

o The computation cost is high because the distance between the new data point and all the training samples must be calculated.

EXAMPLE:

DIABETES PREDICTION USING KNN IN R

ABOUT THE DATASET

The data used for this analysis is from the National Institute of Diabetes and Digestive and Kidney Diseases and is made available on Kaggle.

The dataset contains entries only from women of at least 21 years of age with Pima Indian heritage, with the following features:

• Pregnancies: Number of times pregnant.

• Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test.

• BloodPressure: Diastolic blood pressure (mm Hg).

• SkinThickness: Triceps skin fold thickness (mm).

• Insulin: 2-Hour serum insulin (mu U/ml).

• BMI: Body mass index (weight in kg/(height in m)^2).

• DiabetesPedigreeFunction: Diabetes pedigree function.

• Age: Age (years).

• Outcome: Class variable (0 or 1), with the class value 1 representing those who tested positive for diabetes.

Importing and Exploring the dataset

Loading the packages

library(dplyr)

library(tidyr)

library(forcats)

library(ggplot2)

library(janitor)

library(gmodels)

library(class)

library(corrplot)

Loading the dataset



diabetes_df <- read.csv("diabetes.csv")

Exploring the structure of the dataset

str(diabetes_df)

## 'data.frame': 768 obs. of 9 variables:

## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...

## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...

## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...

## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...

## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...

## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...

## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...

## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...

## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...

The dataset contains data entries from 768 patients, with all the
features/attributes being numeric values.

Checking for duplicate entries in the dataset

get_dupes(diabetes_df)

## No variable names specified - using all columns.

## No duplicate combinations found of: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome

## [1] Pregnancies Glucose BloodPressure

## [4] SkinThickness Insulin BMI

## [7] DiabetesPedigreeFunction Age Outcome

## [10] dupe_count



## <0 rows> (or 0-length row.names)

There are no duplicate entries in the dataset

Cleaning and Transforming the dataset

diabetes_df$Outcome <- as.character(diabetes_df$Outcome)

diabetes_df <- diabetes_df %>%

mutate(Outcome = fct_recode(Outcome, "Diabetic" = "1", "Non Diabetic" = "0"))

diabetes_df$Outcome <- factor(diabetes_df$Outcome,

levels = c("Diabetic", "Non Diabetic"),

labels = c("Diabetic", "Non Diabetic"))

Here, the values of the Outcome attribute are transformed from 1's and 0's to "Diabetic" and "Non Diabetic" to make the outcomes/diagnoses clearer to understand.

Checking for correlation in the dataset

diabetes_correlation_df <- diabetes_df[-9]

diabetes_correlation_df <- cor(diabetes_correlation_df)

corrplot(diabetes_correlation_df, method = "color", type = "lower", addCoef.col = "black", col = COL2("RdYlBu"), number.cex = 0.8, tl.cex = 0.8)

There are moderate positive correlations between Age and Pregnancies, and between Insulin and Skin Thickness. This indicates that older patients tended to have had more pregnancies, and that patients with higher serum insulin levels tended to have greater skin fold thickness.

Weak positive correlations can also be observed between the following attributes of the dataset: Insulin & Glucose, BMI & Skin Thickness, Blood Pressure & BMI, Age & Blood Pressure, etc.

Age, BMI and Blood Pressure

ggplot(data = diabetes_df, aes(x = Age)) +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue") +
  facet_wrap(~Outcome) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "Age(s) of Patients")

The ages of the patients are skewed to the right with most of the patients
being between the ages of 20 to 40.

ggplot(data = diabetes_df, aes(x = BMI)) +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue") +
  facet_wrap(~Outcome) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "BMI of Patients")

From the histogram above, the BMI attribute is roughly symmetric, but outliers are clearly visible: some records have a BMI of 0. A BMI of zero is impossible, indicating that there might be an error in this field.

ggplot(data = diabetes_df, aes(x = BloodPressure)) +
  geom_histogram(bins = 30, color = "blue", fill = "lightblue") +
  facet_wrap(~Outcome) +
  theme_dark() +
  ylab("Number of Patients") +
  labs(title = "Patient Blood Pressure")


As in the previous chart, outliers are also present in the blood pressure attribute. Since the outliers have a value of 0, there must be an error, as human blood pressure cannot drop to absolute zero.

Preparing the data

Creating a normalize function

normalize <- function(x){

(x-min(x))/(max(x)-min(x))

}

diabetes_df_n <- as.data.frame(lapply(diabetes_df[1:8],normalize))

k-Nearest Neighbors uses the Euclidean distance (the distance you would measure if you could use a ruler to connect two points) to classify, so we normalize the dataset to re-scale the values of the features and ensure that each feature contributes equally to the distance formula.
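
As a minimal sketch of what this re-scaling does, the snippet below applies the same min-max formula to two feature vectors on very different scales. The values are illustrative assumptions, not a subset selected from the dataset.

```r
# Same min-max formula as the normalize function in this section
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical values for two features on very different scales
insulin  <- c(0, 94, 168, 543, 846)
pedigree <- c(0.167, 0.351, 0.627, 0.672, 2.288)

insulin_n  <- normalize(insulin)
pedigree_n <- normalize(pedigree)

# After normalization both features span the interval from 0 to 1,
# so neither dominates the Euclidean distance
range(insulin_n)
range(pedigree_n)
```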

Separating the dataset into the Train and Test Data

diabetes_df_train <- diabetes_df_n[1:668, ]

diabetes_df_test <- diabetes_df_n[669:768, ]

diabetes_train_labels <- diabetes_df[1:668, 9]

diabetes_test_labels <- diabetes_df[669:768, 9]

Finally, the dataset is split into two parts: the larger part (the first 668 rows) is used to train the model, and the remaining 100 rows are used to test the accuracy of the model.


Training the Model

diabetes_prediction <- knn(train = diabetes_df_train, test = diabetes_df_test, cl = diabetes_train_labels, k = 27)

The knn() function is used above to classify the test data; the value used for k is approximately the square root of the total sample size used for the analysis (768).

Evaluating the Model Performance

CrossTable(y = diabetes_prediction, x = diabetes_test_labels, prop.chisq = FALSE)
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  100
##
##                      | diabetes_prediction
## diabetes_test_labels |     Diabetic | Non Diabetic |    Row Total |
## ---------------------|--------------|--------------|--------------|
##             Diabetic |           22 |           15 |           37 |
##                      |        0.595 |        0.405 |        0.370 |
##                      |        0.815 |        0.205 |              |
##                      |        0.220 |        0.150 |              |
## ---------------------|--------------|--------------|--------------|
##         Non Diabetic |            5 |           58 |           63 |
##                      |        0.079 |        0.921 |        0.630 |
##                      |        0.185 |        0.795 |              |
##                      |        0.050 |        0.580 |              |
## ---------------------|--------------|--------------|--------------|
##         Column Total |           27 |           73 |          100 |
##                      |        0.270 |        0.730 |              |
## ---------------------|--------------|--------------|--------------|

The CrossTable function is used above to evaluate the model by comparing the known values with the values predicted by the model. There were 37 diabetic patients and 63 non-diabetic patients in the test set; the model correctly predicted 22 of the diabetic patients and 58 of the non-diabetic patients, leading to an accuracy of 80%.
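
The accuracy figure can be checked directly from the four cell counts of the cross table. The sketch below enters those counts by hand (the matrix and its names are introduced here for illustration) and divides the correctly classified cases on the diagonal by the total.

```r
# Counts taken from the CrossTable output (rows = actual, columns = predicted)
cm <- matrix(c(22, 15,
                5, 58),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Diabetic", "Non Diabetic"),
                             predicted = c("Diabetic", "Non Diabetic")))

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```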

Improving Model Performance

Removing Outliers

diabetes_df <- diabetes_df %>% filter(BMI > 0) %>% filter(BloodPressure > 0)

diabetes_df_n <- as.data.frame(lapply(diabetes_df[1:8],normalize))

diabetes_df_train <- diabetes_df_n[1:629, ]

diabetes_df_test <- diabetes_df_n[630:729, ]

diabetes_train_labels <- diabetes_df[1:629, 9]

diabetes_test_labels <- diabetes_df[630:729, 9]

diabetes_prediction <- knn(train = diabetes_df_train, test = diabetes_df_test, cl = diabetes_train_labels, k = 27)

CrossTable(y = diabetes_prediction, x = diabetes_test_labels, prop.chisq = FALSE)
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
##
## Total Observations in Table:  100
##
##                      | diabetes_prediction
## diabetes_test_labels |     Diabetic | Non Diabetic |    Row Total |
## ---------------------|--------------|--------------|--------------|
##             Diabetic |           19 |           20 |           39 |
##                      |        0.487 |        0.513 |        0.390 |
##                      |        0.826 |        0.260 |              |
##                      |        0.190 |        0.200 |              |
## ---------------------|--------------|--------------|--------------|
##         Non Diabetic |            4 |           57 |           61 |
##                      |        0.066 |        0.934 |        0.610 |
##                      |        0.174 |        0.740 |              |
##                      |        0.040 |        0.570 |              |
## ---------------------|--------------|--------------|--------------|
##         Column Total |           23 |           77 |          100 |
##                      |        0.230 |        0.770 |              |
## ---------------------|--------------|--------------|--------------|

The accuracy of the model was not hampered by the presence of the outliers: after their removal, the accuracy dropped to 76% (19 + 57 correct predictions out of 100).


6.2 RANDOM FOREST ALGORITHM

• Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique.
• It can be used for both classification and regression problems in ML.
• It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
• As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and predicts the final output based on the majority vote of those predictions.

A greater number of trees in the forest generally leads to more stable accuracy and reduces the problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:


6.2.1 Steps for Random Forest algorithm:

The Working process can be explained in the below steps and diagram:

Step-1: Select K random data points from the training set.

Step-2: Build the decision tree associated with the selected data points (subset).

Step-3: Choose the number N of decision trees that you want to build.

Step-4: Repeat Steps 1 and 2 until N trees have been built.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority of votes.
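
The sampling in Step 1 and the voting in Step 5 can be sketched in base R as follows. This is only an illustration under stated assumptions: the sample is drawn with replacement (a bootstrap sample, which is one common way to select the random subset), and the per-tree predictions are made-up values standing in for the output of N = 5 trained trees.

```r
set.seed(1)

# Step 1 (assumption: sampling with replacement from a training set of 10 rows)
n_train <- 10
boot_idx <- sample(n_train, size = n_train, replace = TRUE)
boot_idx

# Step 5: hypothetical predictions of 5 trees (columns) for 3 new data points (rows)
tree_votes <- matrix(
  c("apple",  "apple",  "banana", "apple",  "banana",
    "banana", "banana", "banana", "apple",  "banana",
    "apple",  "banana", "apple",  "apple",  "apple"),
  nrow = 3, byrow = TRUE
)

# The category with the most votes across the trees wins
majority_vote <- function(votes) names(which.max(table(votes)))
final_prediction <- apply(tree_votes, 1, majority_vote)
final_prediction
```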

The working of the algorithm can be better understood through the example below:

Example: Suppose there is a dataset that contains multiple fruit images, and this dataset is given to the Random Forest classifier. The dataset is divided into subsets, and each subset is given to a decision tree. During the training phase, each decision tree learns from its subset; when a new data point occurs, each tree produces a prediction result and, based on the majority of results, the Random Forest classifier predicts the final decision. Consider the below image:


6.2.3 Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.

2. Medicine: With the help of this algorithm, disease trends and risks of
the disease can be identified.

3. Land Use: We can identify the areas of similar land use by this
algorithm.

4. Marketing: Marketing trends can be identified using this algorithm.

6.2.4 Advantages of Random Forest

o Random Forest is capable of performing both classification and regression tasks.

o It takes less training time as compared to other algorithms.

o It predicts output with high accuracy, and it runs efficiently even for large datasets.

o It can also maintain accuracy when a large proportion of data is missing.

o It is capable of handling large datasets with high dimensionality.

o It enhances the accuracy of the model and prevents the overfitting issue.

6.2.5 Disadvantages of Random Forest

• Although random forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

DIABETES PREDICTION USING RANDOM FOREST IN R

ABOUT THE DATASET

The Pima Indians Diabetes database is used to predict the onset of diabetes based on diagnostic measures: https://www.kaggle.com/hconner2/diabetes/data

About Dataset

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure: Diastolic blood pressure (mm Hg)

SkinThickness: Triceps skin fold thickness (mm)

Insulin: 2-Hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction: Diabetes pedigree function


Age: Age (years)

Outcome: Class variable (0 or 1)

Step 1. Collecting data. Exploring and preparing the data.

diabetes = read.csv("C:/ diabetes.csv", sep = "," , dec = ".", header = TRUE)

head(diabetes)

## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI

## 1 6 148 72 35 0 33.6

## 2 1 85 66 29 0 26.6

## 3 8 183 64 0 0 23.3

## 4 1 89 66 23 94 28.1

## 5 0 137 40 35 168 43.1

## 6 5 116 74 0 0 25.6

## DiabetesPedigreeFunction Age Outcome

## 1 0.627 50 1

## 2 0.351 31 0

## 3 0.672 32 1

## 4 0.167 21 0

## 5 2.288 33 1

## 6 0.201 30 0

summary(diabetes)

## Pregnancies Glucose BloodPressure SkinThickness

## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00

## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00



## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00

## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54

## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00

## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00

## Insulin BMI DiabetesPedigreeFunction Age

## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00

## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00

## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00

## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24

## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00

## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00

## Outcome

## Min. :0.000

## 1st Qu.:0.000

## Median :0.000

## Mean :0.349

## 3rd Qu.:1.000

## Max. :1.000

str(diabetes)

## 'data.frame': 768 obs. of 9 variables:

## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...

## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...

## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...

## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...



## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...

## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...

## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...

## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...

## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...

Here, we have to be careful about how the data has been coded. First, we
see that the Outcome is numeric while it should be categorical:

diabetes$Outcome = factor(diabetes$Outcome)

summary(diabetes$Outcome)

## 0 1

## 500 268

levels(diabetes$Outcome) = c('negative', 'positive')

Secondly, many variables contain zeros that do not make sense, for example BloodPressure, BMI, etc. We assume that these zeros represent missing (NA) values and recode them accordingly:

diabetes$Glucose[diabetes$Glucose == 0 ] = NA

diabetes$BloodPressure[diabetes$BloodPressure == 0 ] = NA

diabetes$SkinThickness[diabetes$SkinThickness == 0 ] = NA

diabetes$Insulin[diabetes$Insulin == 0 ] = NA

diabetes$BMI[diabetes$BMI == 0 ] = NA

diabetes = na.omit(diabetes)

summary(diabetes)

## Pregnancies Glucose BloodPressure SkinThickness

## Min. : 0.000 Min. : 56.0 Min. : 24.00 Min. : 7.00



## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.:21.00

## Median : 2.000 Median :119.0 Median : 70.00 Median :29.00

## Mean : 3.301 Mean :122.6 Mean : 70.66 Mean :29.15

## 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00 3rd Qu.:37.00

## Max. :17.000 Max. :198.0 Max. :110.00 Max. :63.00

## Insulin BMI DiabetesPedigreeFunction Age

## Min. : 14.00 Min. :18.20 Min. :0.0850 Min. :21.00

## 1st Qu.: 76.75 1st Qu.:28.40 1st Qu.:0.2697 1st Qu.:23.00

## Median :125.50 Median :33.20 Median :0.4495 Median :27.00

## Mean :156.06 Mean :33.09 Mean :0.5230 Mean :30.86

## 3rd Qu.:190.00 3rd Qu.:37.10 3rd Qu.:0.6870 3rd Qu.:36.00

## Max. :846.00 Max. :67.10 Max. :2.4200 Max. :81.00

## Outcome

## negative:262

## positive:130

Dimension of the data: 392 rows and 9 columns.

Step 2. Creating training and testing datasets

Divide the data into two different sets: a training dataset that will be used
to build the model and a test dataset that will be used to estimate the
predictive accuracy of the model.

The dataset will be divided into training (70%) and testing (30%) sets. We create the datasets using the caret package:

library(caret)

set.seed(123)

train_ind = createDataPartition(y = diabetes$Outcome, p = 0.7, list = FALSE)

train = diabetes[train_ind,]

head(train)

## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI

## 4 1 89 66 23 94 28.1

## 5 0 137 40 35 168 43.1

## 14 1 189 60 23 846 30.1

## 15 5 166 72 19 175 25.8

## 19 1 103 30 38 83 43.3

## 20 1 115 70 30 96 34.6

## DiabetesPedigreeFunction Age Outcome

## 4 0.167 21 negative

## 5 2.288 33 positive

## 14 0.398 59 positive

## 15 0.587 51 positive

## 19 0.183 33 negative

## 20 0.529 32 positive

test = diabetes[-train_ind,]

head(test)

## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI

## 7 3 78 50 32 88 31.0

## 9 2 197 70 45 543 30.5

## 17 0 118 84 47 230 45.8

## 21 3 126 88 41 235 39.3



## 25 11 143 94 33 146 36.6

## 26 10 125 70 26 115 31.1

## DiabetesPedigreeFunction Age Outcome

## 7 0.248 26 positive

## 9 0.158 53 positive

## 17 0.551 31 positive

## 21 0.704 27 negative

## 25 0.254 51 positive

## 26 0.205 41 positive

The training set has 275 samples, and the testing set has 117 samples.

Step 3. Training a model on the data

Use the function randomForest in the package randomForest.

The function randomForest has the following important parameters:

-ntree: Number of trees.

-mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
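
As a small worked example of these defaults, the sketch below evaluates both rules for p = 8, the number of predictor columns in the diabetes data (all columns except Outcome). It assumes the package rounds down in both cases and uses at least one variable for regression; the variable names are introduced here for illustration.

```r
p <- 8  # number of predictors in the diabetes data (excluding Outcome)

# Default for classification: square root of p, rounded down
mtry_classification <- floor(sqrt(p))

# Default for regression: one third of p, rounded down, but at least 1
mtry_regression <- max(floor(p / 3), 1)

mtry_classification
mtry_regression
```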

#install.packages('randomForest')

library(randomForest)

rf =randomForest(Outcome~., data = train, ntree= 50, mtry=sqrt(7))

rf

## Call:

## randomForest(formula = Outcome ~ ., data = train, ntree = 50, mtry = sqrt(7))


## Type of random forest: classification

## Number of trees: 50

## No. of variables tried at each split: 3

## OOB estimate of error rate: 22.91%

## Confusion matrix:

## negative positive class.error

## negative 159 25 0.1358696

## positive 38 53 0.4175824

plot(rf)



Step 4. Evaluating model performance

predictions = predict(rf, test, type= "response")

head(predictions)

## 7 9 17 21 25 26

## negative positive positive negative positive negative

## Levels: negative positive

library(caret)

confu1 = confusionMatrix(predictions, test$Outcome, positive = 'positive')

confu1

## Confusion Matrix and Statistics

## Reference

## Prediction negative positive

## negative 68 15

## positive 10 24

## Accuracy : 0.7863

## 95% CI : (0.7009, 0.8567)

## No Information Rate : 0.6667

## P-Value [Acc > NIR] : 0.003138

##

## Kappa : 0.5033

## Mcnemar's Test P-Value : 0.423711

##



## Sensitivity : 0.6154

## Specificity : 0.8718

## Pos Pred Value : 0.7059

## Neg Pred Value : 0.8193

## Prevalence : 0.3333

## Detection Rate : 0.2051

## Detection Prevalence : 0.2906

## Balanced Accuracy : 0.7436

## 'Positive' Class : positive

The accuracy of the model is 78.63%, with an error rate of 21.37%.

The kappa statistic of the model is 0.50331.

The sensitivity of the model is 0.61538, and the specificity of the model is 0.87179.

The precision of the model is 0.70588, and the recall of the model is 0.61538.

The value of the F-measure of the model is 0.6575.
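
These figures follow directly from the four cells of the confusion matrix reported for the first model (ntree = 50). A minimal sketch of the arithmetic, with the counts entered by hand and 'positive' treated as the positive class:

```r
# Cell counts from the confusion matrix of the first model
tp <- 24  # predicted positive, actually positive
fn <- 15  # predicted negative, actually positive
fp <- 10  # predicted positive, actually negative
tn <- 68  # predicted negative, actually negative

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # also called recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)   # also called positive predictive value
f_measure   <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        f_measure = f_measure), 4)
```

The same formulas applied to the counts of the second model (70, 13, 8, 26) give the values quoted for ntree = 1000.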

Step 5. Improving model performance (ntree= 1000)

rf =randomForest(Outcome~., data = train, ntree= 1000, mtry=sqrt(7))

rf

## Call:

## randomForest(formula = Outcome ~ ., data = train, ntree = 1000, mtry = sqrt(7))


## Type of random forest: classification

## Number of trees: 1000

## No. of variables tried at each split: 3

## OOB estimate of error rate: 23.27%

## Confusion matrix:

## negative positive class.error

## negative 157 27 0.1467391

## positive 37 54 0.4065934

plot(rf)

predictions =predict(rf, test, type= "response")

head(predictions)

## 7 9 17 21 25 26



## negative positive positive negative positive negative

## Levels: negative positive

library(caret)

confu2 = confusionMatrix(predictions, test$Outcome, positive = 'positive')

confu2

## Confusion Matrix and Statistics

## Reference

## Prediction negative positive

## negative 70 13

## positive 8 26

## Accuracy : 0.8205

## 95% CI : (0.7388, 0.8853)

## No Information Rate : 0.6667

## P-Value [Acc > NIR] : 0.0001599

## Kappa : 0.5828

## Mcnemar's Test P-Value : 0.3827331

## Sensitivity : 0.6667

## Specificity : 0.8974

## Pos Pred Value : 0.7647

## Neg Pred Value : 0.8434

## Prevalence : 0.3333

## Detection Rate : 0.2222

## Detection Prevalence : 0.2906

## Balanced Accuracy : 0.7821



## 'Positive' Class : positive

The accuracy of the model is 82.05%, with an error rate of 17.95%.

The kappa statistic of the model is 0.58278.

The sensitivity of the model is 0.66667, and the specificity of the model is 0.89744.

The precision of the model is 0.76471, and the recall of the model is 0.66667.

The value of the F-measure of the model is 0.7123.

MODULE-7

7.1 Clustering In Machine Learning

7.2 K-Means Clustering



7.1 CLUSTERING IN MACHINE LEARNING

• Clustering or cluster analysis is a machine learning technique that groups an unlabeled dataset.
• It can be defined as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."
• It does this by finding similar patterns in the unlabeled dataset, such as shape, size, color, behavior, etc., and divides the data points according to the presence or absence of those patterns.
• It is an unsupervised learning method, hence no supervision is provided to the algorithm, and it deals with the unlabeled dataset.
• The clustering technique is commonly used for statistical data analysis.

The below diagram explains the working of the clustering algorithm. We can
see the different fruits are divided into several groups with similar
properties.

Application of clustering technique:

o Market Segmentation

o Statistical data analysis

o Social network analysis

o Image segmentation

o Anomaly detection, etc.



7.2 K-MEANS CLUSTERING

• K-Means Clustering is an unsupervised learning algorithm that groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, whose members have similar properties.

• It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in the unlabeled dataset on its own, without the need for any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
• The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

• Determines the best value for the K center points or centroids by an iterative process.
• Assigns each data point to its closest k-center. The data points that are near a particular k-center form a cluster.

Hence each cluster has data points with some commonalities, and it is away from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:


Steps for K-Means Algorithm:

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as the initial centroids. (They need not be
points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the mean of the points in each cluster and place a new
centroid there.

Step-5: Repeat the third step, which means reassigning each data point to
the new closest centroid.

Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.

Step-7: The model is ready.
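
The steps above can be sketched directly in R. The function name
simple_kmeans, the max_iter argument, and the two synthetic blobs of points
are assumptions made only for illustration; in practice the built-in
kmeans() function from the stats package should be preferred.

```r
# Minimal k-means sketch (illustrative only; stats::kmeans is the practical choice)
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step-2: pick k distinct rows as the starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step-3: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")
    # Step-6: stop when no point changes cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step-4: move each centroid to the mean of its points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids)
}

# Two synthetic blobs of 20 points each (hypothetical data)
set.seed(1)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2))
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)   # cluster sizes
fit$centers          # final centroid positions
```

Because the starting centroids are random, different seeds can give different
results; this is why kmeans() offers the nstart argument to try several
random starts.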

Let's understand the above steps by considering the visual plots:

Suppose we have two variables, M1 and M2. The x-y scatter plot of these two
variables is given below:

o Let's take the number of clusters K=2, so the data points will be
grouped into two different clusters.

o Choose K random points as initial centroids. These can be points from
the dataset or any other points. Here we select the below two points,
which are not part of our dataset. Consider the below image:

o Now assign each data point of the scatter plot to its closest K-point or
centroid. The line equidistant from the two centroids (the median line)
separates the two groups. Consider the below image:

From the above image, it is clear that the points to the left of the line
are nearer to K1, the blue centroid, and the points to the right of the line
are closer to the yellow centroid. Let's color them blue and yellow for
clear visualization.

o To improve the clusters, repeat the process with new centroids. To
choose the new centroids, compute the center of gravity (mean) of the
points currently assigned to each cluster, and place the new centroids
there, as below:

o Next, reassign each data point to the new closest centroid. For this,
repeat the same process of finding a median line. The median line will
be as in the below image:

From the above image, we can see that one yellow point is on the left side
of the line and two blue points are to the right of the line. So, these
three points will be assigned to the other centroid.

As reassignment has taken place, go back to Step-4 and find new centroids.

o Repeat the process by finding the center of gravity of the points in each
cluster, so the new centroids will be as shown in the below image:

o With the new centroids, draw the median line again and reassign the
data points. So, the image will be:

o In the above image, there are no points on the wrong side of the line,
so no reassignment occurs and the model is formed. Consider the below
image:

The model is ready, so remove the assumed centroids, and the two final
clusters will be as shown in the below image:



EXAMPLE
THE DATASET
The Iris dataset consists of 50 samples from each of 3 species of Iris
(Iris setosa, Iris virginica, Iris versicolor).

# Loading data
data(iris)

# Structure
str(iris)

# Installing packages
install.packages("ClusterR")
install.packages("cluster")

# Loading packages
library(ClusterR)
library(cluster)

# Removing the Species label from the original dataset
iris_1 <- iris[, -5]

# Fitting the K-Means clustering model to the dataset
set.seed(240) # Setting seed
kmeans.re <- kmeans(iris_1, centers = 3, nstart = 20)
kmeans.re

# Cluster identification for each observation
kmeans.re$cluster

# Confusion matrix
cm <- table(iris$Species, kmeans.re$cluster)
cm

# Model evaluation and visualization
plot(iris_1[c("Sepal.Length", "Sepal.Width")])
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster)
plot(iris_1[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.re$cluster, main = "K-means with 3 clusters")

## Plotting cluster centers
kmeans.re$centers
kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")]

# cex is symbol size, pch is symbol type
points(kmeans.re$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 3)

## Visualizing clusters
y_kmeans <- kmeans.re$cluster
clusplot(iris_1[, c("Sepal.Length", "Sepal.Width")],
         y_kmeans, lines = 0, shade = TRUE, color = TRUE, labels = 2,
         plotchar = FALSE, span = TRUE, main = paste("Cluster iris"),
         xlab = 'Sepal.Length', ylab = 'Sepal.Width')

Output:
• Model kmeans.re:

Three clusters are formed, of sizes 50, 62, and 38 respectively. The ratio
of the between-cluster sum of squares to the total sum of squares is 88.4%.

• Cluster identification:

The cluster number (1, 2, or 3) assigned to each of the 150 observations is
listed. K-Means is unsupervised, so it reports no accuracy or p-value;
agreement with the species labels is read from the confusion matrix.

• Confusion Matrix:

All 50 setosa fall into a single cluster. The cluster of 62 contains 48
versicolor and 14 virginica, and the cluster of 38 contains 36 virginica
and 2 versicolor.

• K-means with 3 clusters plot:

The plot shows the 3 clusters in three different colors, with Sepal.Length
on the x-axis and Sepal.Width on the y-axis.

• Plotting cluster centers:

In the plot, the centers of the clusters are marked with star symbols
(pch = 8) in the same color as their cluster.

• Plot of clusters:

So, 3 clusters are formed with varying sepal length and sepal width.
K-Means is one of the most widely used clustering algorithms in industry.


MODULE-8

8.1 DBScan Clustering

8.2 Hierarchical Clustering



8.1 DBSCAN CLUSTERING

• Density-Based Spatial Clustering of Applications with Noise (DBScan) is
an unsupervised, non-linear learning algorithm.
• It uses the ideas of density reachability and density connectivity.
• The data is partitioned into groups with similar characteristics, or
clusters, but the number of those groups does not need to be specified
in advance.
• A cluster is defined as a maximal set of density-connected points. The
algorithm discovers clusters of arbitrary shapes in spatial databases
with noise.

DBScan Algorithm:

1. Randomly select an unvisited point p.
2. Retrieve all the points that are density-reachable from p with regard to
the maximum radius of the neighbourhood (eps) and the minimum number of
points within the eps-neighbourhood (MinPts).
3. If the number of points in the eps-neighbourhood of p is at least MinPts,
then p is a core point.
4. If p is a core point, a cluster is formed from p and the points
density-reachable from it. If p is not a core point, mark it as a
noise/outlier for now (it may later be absorbed into a cluster as a border
point) and move to the next point.
5. Continue the process until all the points have been processed.

DBScan clustering is largely insensitive to the order in which points are
processed; only border points that lie within eps of core points from more
than one cluster can change assignment.
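
The core-point test in step 3 can be sketched with base R alone. The data
here are synthetic, the names eps and min_pts are chosen only for this
illustration, and the neighbourhood count includes the point itself, which
is one common convention and is assumed here. The sketch stops at
identifying core points and does not expand clusters.

```r
# Identify core points for a given eps and MinPts (sketch; no cluster expansion)
set.seed(42)
pts <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),  # one dense blob
             c(5, 5))                                          # one isolated point
eps <- 0.5
min_pts <- 5

d <- as.matrix(dist(pts))            # pairwise Euclidean distances
n_neighbours <- rowSums(d <= eps)    # neighbourhood size, counting the point itself
is_core <- n_neighbours >= min_pts

table(is_core)
is_core[nrow(pts)]   # the isolated point at (5, 5) is not a core point
```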

EXAMPLE :
THE DATASET
The Iris dataset consists of 50 samples from each of 3 species of Iris
(Iris setosa, Iris virginica, Iris versicolor).

# Loading data
data(iris)

# Structure
str(iris)

# Installing packages
install.packages("fpc")

# Loading package
library(fpc)

# Remove the label from the dataset
iris_1 <- iris[-5]

# Fitting the DBScan clustering model to the dataset
set.seed(220) # Setting seed
Dbscan_cl <- dbscan(iris_1, eps = 0.45, MinPts = 5)
Dbscan_cl

# Checking cluster
Dbscan_cl$cluster

# Table
table(Dbscan_cl$cluster, iris$Species)

# Plotting clusters for every pair of variables
plot(Dbscan_cl, iris_1, main = "DBScan")

# Plotting clusters for Sepal.Length and Petal.Width only
plot(Dbscan_cl, iris_1[, c("Sepal.Length", "Petal.Width")],
     main = "Petal Width vs Sepal Length")

Output:
• Model Dbscan_cl:

In the model, there are 150 points, with MinPts = 5 and eps = 0.45.

• Cluster identification:

The cluster assigned to each observation is shown; points labelled 0 are
noise.

• Plotting Cluster:

The DBScan clusters are plotted for every pair of Sepal.Length,
Sepal.Width, Petal.Length, and Petal.Width.

The second plot shows Petal.Width against Sepal.Length.

So, the DBScan clustering algorithm can find clusters of arbitrary,
non-linear shape, which makes it useful in practice.

8.2 HIERARCHICAL CLUSTERING IN MACHINE LEARNING

Hierarchical clustering is another unsupervised machine learning algorithm,
which is used to group unlabeled data points into clusters. It is also known
as hierarchical cluster analysis or HCA.

This algorithm develops the hierarchy of clusters in the form of a tree, and
this tree-shaped structure is known as the dendrogram.

Sometimes the results of K-Means clustering and hierarchical clustering may
look similar, but the two differ in how they work. In particular,
hierarchical clustering does not require the number of clusters to be fixed
in advance, as K-Means does.

The hierarchical clustering technique has two approaches:

1. Agglomerative: Agglomerative is a bottom-up approach, in which the
algorithm starts with taking all data points as single clusters and
merging them until one cluster is left.

2. Divisive: The divisive algorithm is the reverse of the agglomerative
algorithm, as it is a top-down approach.

Agglomerative Hierarchical clustering

The agglomerative hierarchical clustering algorithm is a popular example of
HCA. To group the data into clusters, it follows the bottom-up approach: the
algorithm treats each data point as a single cluster at the beginning, and
then starts combining the closest pair of clusters. It does this until all
the clusters are merged into a single cluster that contains all the data
points.

This hierarchy of clusters is represented in the form of the dendrogram.

How Does Agglomerative Hierarchical Clustering Work?

The working of the AHC algorithm can be explained using the below steps:


o Step-1: Treat each data point as a single cluster. Let's say there are N
data points, so the number of clusters will also be N.

o Step-2: Take the two closest data points or clusters and merge them to
form one cluster. So, there will now be N-1 clusters.

o Step-3: Again, take the two closest clusters and merge them together
to form one cluster. There will be N-2 clusters.

o Step-4: Repeat Step-3 until only one cluster is left. So, we will get the
following clusters. Consider the below images:

o Step-5: Once all the clusters are combined into one big cluster, use the
dendrogram to divide the clusters as required by the problem.
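
The merge sequence described in these steps can be inspected on a tiny
example. The five values below are made up for illustration, and single
linkage is chosen only to keep the arithmetic easy to follow by hand.

```r
# Five points on a line; hclust() records each merge in $merge and its height in $height
x <- c(a = 1, b = 2, c = 6, d = 7, e = 15)
hc <- hclust(dist(x), method = "single")

hc$merge         # negative numbers are single points, positive numbers are earlier merges
hc$height        # distance at which each merge happened
nrow(hc$merge)   # N - 1 merges for N points
```

Each row of hc$merge is one pass through Step-2 or Step-3, so a dataset of N
points always produces N - 1 rows.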

Measure for the distance between two clusters

As we have seen, the closest distance between the two clusters is crucial for
the hierarchical clustering. There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage methods
are given below:

1. Single Linkage: It is the shortest distance between the closest points
of the two clusters. Consider the below image:

2. Complete Linkage: It is the farthest distance between two points of two
different clusters. It is one of the popular linkage methods as it forms
tighter clusters than single linkage.

3. Average Linkage: It is the linkage method in which the distances between
every pair of points, one from each cluster, are added up and then divided
by the number of pairs to give the average distance between the two
clusters. It is also one of the most popular linkage methods.

4. Centroid Linkage: It is the linkage method in which the distance between
the centroids of the clusters is calculated. Consider the below image:

From the above-given approaches, we can apply any of them according to the
type of problem or business requirement.
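
As a small worked example, the four linkage measures can be computed by hand
for two tiny clusters. The clusters A and B below are hypothetical and are
placed on a line so the results are easy to check.

```r
# Two small clusters in the plane
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(4, 0), c(6, 0))

# All pairwise distances between a point in A and a point in B
cross <- as.matrix(dist(rbind(A, B)))[1:nrow(A), nrow(A) + 1:nrow(B)]

single_link   <- min(cross)    # closest pair
complete_link <- max(cross)    # farthest pair
average_link  <- mean(cross)   # mean over all cross-cluster pairs
centroid_link <- sqrt(sum((colMeans(A) - colMeans(B))^2))  # distance between centroids

c(single = single_link, complete = complete_link,
  average = average_link, centroid = centroid_link)
```

Here the cross-cluster distances are 4, 6, 3, and 5, so single linkage gives
3, complete linkage gives 6, and average linkage gives 4.5. The centroids
sit at 0.5 and 5 on the x-axis, so centroid linkage also gives 4.5.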

Working of Dendrogram in Hierarchical Clustering

The dendrogram is a tree-like structure that records each merge step that
the HC algorithm performs. In the dendrogram plot, the y-axis shows the
Euclidean distance at which clusters are merged, and the x-axis shows all
the data points of the given dataset.

The working of the dendrogram can be explained using the below diagram:


In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.

o First, the data points P2 and P3 combine and form a cluster, and
correspondingly a dendrogram is created that connects P2 and P3 with a
rectangular shape. The height is decided according to the Euclidean
distance between the data points.

o In the next step, P5 and P6 form a cluster, and the corresponding
dendrogram is created. It is higher than the previous one, as the
Euclidean distance between P5 and P6 is a little greater than that
between P2 and P3.

o Again, two new dendrograms are created that combine P1, P2, and P3
in one dendrogram, and P4, P5, and P6 in another dendrogram.

o At last, the final dendrogram is created that combines all the data
points together.

EXAMPLE :

THE DATASET

mtcars (Motor Trend Car Road Tests) comprises fuel consumption and 10
aspects of automobile design and performance for 32 automobiles. It ships
with R in the built-in datasets package, so no additional package is needed
to use it.

# Installing the package (optional: mtcars itself needs no extra package)
install.packages("dplyr")

# Loading package
library(dplyr)

# First rows of the dataset
head(mtcars)

Output:

Performing Hierarchical clustering on Dataset

Hierarchical clustering is applied to the dataset using hclust(), which is
part of the stats package that is installed with R.

# Finding distance matrix
distance_mat <- dist(mtcars, method = 'euclidean')
distance_mat

# Fitting the hierarchical clustering model to the dataset
set.seed(240) # Setting seed
Hierar_cl <- hclust(distance_mat, method = "average")
Hierar_cl

# Plotting dendrogram
plot(Hierar_cl)

# Choosing no. of clusters
# Cutting tree by height
abline(h = 110, col = "green")

# Cutting tree by no. of clusters
fit <- cutree(Hierar_cl, k = 3)
fit
table(fit)
rect.hclust(Hierar_cl, k = 3, border = "green")

Output:
• Distance matrix:

The values of the distance matrix are shown, calculated with the euclidean
method.

• Model Hierar_cl:

In the model, the cluster method is average, the distance is euclidean, and
the number of objects is 32.

• Plot dendrogram:

The dendrogram is plotted with the 32 cars along the x-axis and the merge
height (distance) on the y-axis.

• Cut tree:

The tree is cut at k = 3, and the table shows the number of cars in each of
the three clusters.

• Plotting dendrogram after cutting:

The plot shows the dendrogram after being cut. The green horizontal line
marks the cut height, and the green rectangles outline the three clusters.

MODULE 9
9.1 Data Visualization

9.2 R Visualization Packages

9.3 Interactive R Graphics


9.4 Data Visualization Graphs in R



9.1 DATA VISUALIZATION

Data visualization is a graphical representation of quantitative
information and data by using visual elements like graphs, charts, and maps.

Data visualization converts large and small data sets into visuals, which
are easy for humans to understand and process.

Data visualization tools provide accessible ways to understand outliers,
patterns, and trends in the data.

Importance of Data Visualization

⚫ Data visualization is important because of the way the human brain
processes information. Using graphs and charts to visualize large, complex
data sets is more comfortable than studying spreadsheets and reports.

⚫ Data visualization is an easy and quick way to convey concepts
universally. You can experiment with a different outline by making a
slight adjustment.

Uses of Data Visualization

1. To make data easier to understand and remember.

2. To discover unknown facts, outliers, and trends.

3. To visualize relationships and patterns quickly.

4. To ask better questions and make better decisions.

5. To carry out competitive analysis.

6. To improve insights.

Features of Data Visualization in R

1. Understanding

Graphics and charts are more attractive to look at and easier to understand
than a written document of text and numbers. Thus, visualization can reach a
wider audience and promotes the widespread use of business insights to make
better decisions.

2. Efficiency

Visualization allows us to display a lot of information in a small space.
Although the decision-making process in business is inherently complex and
multifunctional, displaying evaluation findings in a graph allows companies
to organize a lot of interrelated information in useful ways.

3. Location

Visualizations that use features such as geographic maps and GIS can be
particularly relevant to a business when location is an important factor.
Maps can show business insights from various locations, along with the
seriousness of the issues, the reasons behind them, and the working groups
that address them.

9.2 R VISUALIZATION PACKAGES

R provides a series of packages for data visualization. These packages are
as follows:

1) plotly

The plotly package provides online interactive and quality graphs.

2) ggplot2

R allows us to create graphics declaratively, and the ggplot2 package is
provided for this purpose. This package is famous for its elegant and
quality graphs, which sets it apart from other visualization packages.

3) tidyquant

tidyquant is a financial package used for carrying out quantitative
financial analysis. It fits into the tidyverse universe and is used for
importing, analyzing, and visualizing financial data.

4) taucharts

Data plays an important role in taucharts. The library provides a
declarative interface for rapid mapping of data fields to visual properties.

5) ggiraph

It is a tool that allows us to create dynamic ggplot graphs. This package
allows us to add tooltips, JavaScript actions, and animations to the
graphics.

6) geofacet

This package provides geofaceting functionality for 'ggplot2'. Geofaceting
arranges a sequence of plots for different geographical entities into a grid
that preserves some of the geographical orientation.

7) googleVis

googleVis provides an interface between R and Google's charts tools. With the
help of this package, we can create web pages with interactive charts based
on R data frames.

8) RColorBrewer

This package provides color schemes for maps and other graphics, which are
designed by Cynthia Brewer.

9) dygraphs

The dygraphs package is an R interface to the dygraphs JavaScript charting
library. It provides rich features for charting time-series data in R.

10) shiny

R allows us to develop interactive and aesthetically pleasing web apps
through the shiny package. Shiny apps can be extended with HTML widgets,
CSS, and JavaScript.

9.3 INTERACTIVE R GRAPHICS

Graphics play an important role in bringing out the important features of
the data. Graphics are used to examine marginal distributions, relationships
between variables, and summaries of very large data. They are a very
important complement to many statistical and computational techniques.

Standard Graphics

R standard graphics are available through the graphics package, which
includes several functions that provide statistical plots, like:


o Scatterplots

o Piecharts

o Boxplots

o Barplots etc.

9.4 DATA VISUALIZATION GRAPHS IN R

9.4.1 R BOXPLOT

A boxplot shows how the values in a data set are distributed. The three
quartiles divide the data set into four parts, and the graph represents the
minimum, maximum, median, first quartile, and third quartile of the data
set. Boxplots are also useful for comparing the distribution of data across
groups by drawing a boxplot for each of them.
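
The numbers behind a boxplot can be inspected without drawing anything. The
sketch below uses the built-in mtcars data and groups mpg by cyl purely as an
illustration. Note that boxplot() uses hinges, which are close to but not
always identical to the quartiles, and that its whiskers stop at the most
extreme value within 1.5 times the box length, so they can differ from the
minimum and maximum when outliers are present.

```r
# Five-number summary of mpg for each cylinder group in mtcars
five_num <- sapply(split(mtcars$mpg, mtcars$cyl), fivenum)
rownames(five_num) <- c("min", "lower hinge", "median", "upper hinge", "max")
five_num

# boxplot() can return the statistics it would draw without plotting anything
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)
bp$stats   # whisker ends, hinges, and median per group
bp$n       # number of observations per group
```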

R provides a boxplot() function to create a boxplot. There is the following
syntax of the boxplot() function:

boxplot(x, data, notch, varwidth, names, main)

Here,

S.No  Parameter  Description

1.    x          It is a vector or a formula.

2.    data       It is the data frame.

3.    notch      It is a logical value; set it to TRUE to draw a notch.

4.    varwidth   It is a logical value; set it to TRUE to draw box widths
                 proportional to the square root of the sample size.

5.    names      It is the group of labels that will be printed under each
                 boxplot.

6.    main       It is used to give a title to the graph.

Example
# Giving a name to the chart file.
png(file = "boxplot.png")
# Plotting the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Quantity of Cylinders",
        ylab = "Miles Per Gallon", main = "R Boxplot Example")
# Saving the file.
dev.off()

Output:


9.4.2 R BAR CHARTS

A bar chart is a pictorial representation in which numerical values of
variables are represented by the length or height of lines or rectangles of
equal width. A bar chart is used for summarizing a set of categorical data:
the data is shown through rectangular bars, with the length of each bar
proportional to the value of the variable.

In R, we can create a bar chart to visualize the data in an efficient
manner. For this purpose, R provides the barplot() function, which has the
following syntax:

barplot(H, xlab, ylab, main, names.arg, col)

S.No  Parameter  Description

1.    H          A vector or matrix which contains numeric values used in
                 the bar chart.

2.    xlab       A label for the x-axis.

3.    ylab       A label for the y-axis.

4.    main       A title of the bar chart.

5.    names.arg  A vector of names that appear under each bar.

6.    col        It is used to give colors to the bars in the graph.


Example
library(RColorBrewer)
months <- c("Jan", "Feb", "Mar", "Apr", "May")
regions <- c("West", "North", "South")
# Creating the matrix of the values.
Values <- matrix(c(21, 32, 33, 14, 95, 46, 67, 78, 39, 11, 22, 23, 94, 15, 16),
                 nrow = 3, ncol = 5, byrow = TRUE)
# Giving the chart file a name
png(file = "stacked_chart.png")
# Creating the bar chart
barplot(Values, main = "Total Revenue", names.arg = months, xlab = "Month",
        ylab = "Revenue", col = c("cadetblue3", "deeppink2", "goldenrod1"))
# Adding the legend to the chart
legend("topleft", regions, cex = 1.3,
       fill = c("cadetblue3", "deeppink2", "goldenrod1"))
# Saving the file
dev.off()

Output:


9.4.3 R PIE CHARTS

• A pie chart is a representation of values in the form of slices of a
circle with different colors. Slices are labeled with a description, and
the numbers corresponding to each slice are also shown in the chart.
• Pie charts are created with the help of the pie() function, which takes
a vector of non-negative numbers as input. Additional parameters are used
to control labels, colors, titles, etc.

There is the following syntax of the pie() function:

pie(x, labels, radius, main, col, clockwise)

Here,

1. x is a vector that contains the numeric values used in the pie chart.

2. labels are used to give the description to the slices.

3. radius describes the radius of the pie chart.

4. main describes the title of the chart.

5. col defines the color palette.

6. clockwise is a logical value that indicates whether slices are drawn
clockwise or anti-clockwise.

Slice Percentage & Chart Legend

Two additional properties can be added to a pie chart: slice percentages and
a chart legend. The data can be shown as percentages by passing them as
slice labels, and a legend can be added to plots in R by using the legend()
function.

There is the following syntax of the legend() function:

legend(x, y = NULL, legend, fill, col, bg)

Here,

o x and y are the coordinates to be used to position the legend.

o legend is the text of legend

o fill is the color to use for filling the boxes beside the legend text.

o col defines the color of line and points besides the legend text.

o bg is the background color for the legend box.

Example
# Creating data for the graph.
x <- c(20, 65, 15, 50)
labels <- c("India", "America", "Sri Lanka", "Nepal")
pie_percent <- round(100 * x / sum(x), 1)
# Giving the chart file a name.
png(file = "per_pie.png")
# Plotting the chart.
pie(x, labels = pie_percent, main = "Country Pie Chart",
    col = rainbow(length(x)))
legend("topright", labels, cex = 0.8, fill = rainbow(length(x)))
# Saving the file.
dev.off()

Output:


9.4.5 R Histogram

A histogram is a type of bar chart that shows how many values fall into each
of a set of value ranges (bins).

The histogram is used to show a distribution, whereas a bar chart is used
for comparing different entities. In the histogram, the height of each bar
represents the number of values present in the given range.

For creating a histogram, R provides the hist() function, which takes a
vector as input and accepts further parameters for more control. There is
the following syntax of the hist() function:

hist(v, main, xlab, ylab, xlim, ylim, breaks, col, border)

Here,

S.No  Parameter  Description

1.    v          It is a vector that contains numeric values.

2.    main       It indicates the title of the chart.

3.    col        It is used to set the color of the bars.

4.    border     It is used to set the border color of each bar.

5.    xlab       It is used to describe the x-axis.

6.    ylab       It is used to describe the y-axis.

7.    xlim       It is used to specify the range of values on the x-axis.

8.    ylim       It is used to specify the range of values on the y-axis.

9.    breaks     It is used to control the bins: either a number of bins or
                 a vector of break points between them.

Example
# Creating data for the graph.
v <- c(12,24,16,38,21,13,55,17,39,10,60)
# Giving a name to the chart file.
png(file = "histogram_chart.png")
# Creating the histogram.
hist(v,xlab = "Weight",ylab="Frequency",col = "green",border = "red")

# Saving the file.
dev.off()
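
To see how the breaks parameter determines the bars, the bin boundaries and
counts can be computed without drawing. This sketch reuses the vector v from
the example; the explicit break points in the second call are chosen only
for illustration.

```r
v <- c(12, 24, 16, 38, 21, 13, 55, 17, 39, 10, 60)

h <- hist(v, plot = FALSE)   # compute the bins without drawing
h$breaks                     # bin boundaries chosen by the default rule
h$counts                     # number of values in each bin

# breaks can also be given explicitly as a vector of boundaries
h2 <- hist(v, breaks = seq(0, 60, by = 20), plot = FALSE)
h2$counts                    # counts for 0-20, 20-40, and 40-60
```

With the explicit breaks, five values fall in 0-20, four in 20-40, and two in
40-60, which together account for all 11 values of v.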

Output:



Heatmap in R :

The heatmap() function in R is used to plot a heatmap. A heatmap is a
graphical representation of data that uses colors to visualize the values
of a matrix. Typically, brighter colors (often reddish) represent more
common values or higher activity, and darker colors represent less common
values or lower activity. A heatmap is also known as a shading matrix.

R – heatmap() Function

Syntax: heatmap(data)
Parameters:
• data: It represents matrix data, i.e., values arranged in rows and columns
Return: This function draws a heatmap.

Example :-

# Set seed for reproducibility
set.seed(110)

# Create example data
data <- matrix(rnorm(100, 0, 5), nrow = 10, ncol = 10)

# Column and row names
colnames(data) <- paste0("col", 1:10)
rownames(data) <- paste0("row", 1:10)

# Draw a heatmap
heatmap(data)

In the above example, the numbers of rows and columns are specified when
creating the matrix from which the heatmap is drawn.



A heatmap can also be drawn with a custom color scheme by using
colorRampPalette to blend two different colors.
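
A minimal sketch of such a call is given here. The two colors and the name
my_colors are assumptions made for illustration, since the original listing
for this plot is not reproduced in the text.

```r
set.seed(110)
data <- matrix(rnorm(100, 0, 5), nrow = 10, ncol = 10)
colnames(data) <- paste0("col", 1:10)
rownames(data) <- paste0("row", 1:10)

# colorRampPalette() returns a function; calling it with n gives n interpolated colors
my_colors <- colorRampPalette(c("cyan", "darkgreen"))
heatmap(data, col = my_colors(100))
```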



MODULE-10
10.1 Introduction to Predictive Models

10.2 Process of Predictive Model

10.3 Types of Predictive Models

10.4 Applications of Predictive Modelling



10.1 INTRODUCTION TO PREDICTIVE MODELS

• Predictive analytics means making predictions about unknown future
events with the help of techniques from data mining, statistics, machine
learning, mathematical modeling, and artificial intelligence. It makes
its predictions with the help of past data.
• In predictive analytics, we find the factors responsible, gather data,
and apply techniques from machine learning, data mining, predictive
modeling, and other analytical methods to predict the future.
• The insights from the data include patterns and relationships among
different factors that might be previously unknown. Unraveling those
hidden insights is what makes the analysis valuable.
• Businesses use predictive analytics to enhance their process and to
achieve their targets. Insights obtained from both structured and
unstructured data can be used for predictive analytics.
• Predictive modelling predicts which event is most likely to happen in
the future based on past events. Once data has been gathered, the
analyst uses the historical data to train and select statistical models.
Predictive modelling is a tool used within predictive analytics.
• To define predictive modelling: it is the process of using known
results to create, process, and validate a model that is used to
forecast future events and outcomes.

Features of predictive modelling:

• Data analysis & manipulation: Create new data sets, tools for data
analysis, categorize, club, merge and filter data sets.
• Visualization: This includes interactive graphics and reports.
• Statistics: To confirm and create relationships between variables in
the data.
• Hypothesis testing: Creating models, evaluating and choosing the
right models.



10.2 PROCESS OF PREDICTIVE MODEL

The following steps show how to build a predictive model:

• Data Collection- The process of data collection is acquiring the
information needed for analysis, and it entails obtaining historical
data from a reliable source to implement predictive analysis.
• Data Mining- You cleanse your data sets through data mining or data
cleaning. During data cleansing you delete incorrect data, and during
data mining you remove identical and redundant data from your data
collections.
• Exploratory Data Analysis (EDA)- Data exploration is essential for
the predictive modeling process. You gather critical data and
summarize it by recognizing patterns or trends. EDA is the final step
in your data preparation phase.
• Predictive Model Development- You will utilize various techniques
to create predictive analytics models based on the patterns you've
discovered. Use Python, R, MATLAB, other programming languages,
and standard statistical models to test your hypothesis.
• Model Evaluation- Validation is a crucial phase in predictive
analytics. You run a series of tests on sample data or input sets to see
how effectively your model can predict outcomes, and so assess the
model's accuracy.
• Predictive Model Deployment- Deployment allows you to test your
model in a real-world scenario, which helps in practical decision
making and makes it ready for implementation.
• Model Tracking- Check the performance of your models constantly to
ensure that you are receiving the best future outcomes possible. It
involves comparing model predictions to actual data sets.
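
The development and evaluation steps above can be sketched in a few lines of
R. The choice of the built-in mtcars data, the 70/30 split, the linear model,
and the predictors wt and hp are all assumptions made for illustration; a
real project would choose these to suit the problem.

```r
set.seed(123)
# Data collection: use the built-in mtcars data
# Split into training (about 70%) and test rows
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Exploratory step: quick look at the relationship of interest
cor(train$mpg, train$wt)

# Model development: predict mpg from weight and horsepower
model <- lm(mpg ~ wt + hp, data = train)

# Model evaluation: root mean squared error on unseen rows
pred <- predict(model, newdata = test)
rmse <- sqrt(mean((test$mpg - pred)^2))
rmse
```

Evaluating on rows the model has not seen gives a fairer estimate of how it
will behave after deployment than the error on the training rows.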



10.3 TYPES OF PREDICTIVE MODELS

Various predictive models help make forecasts using machine learning and
data mining approaches.

1. Classification Model
The classification model is one of the most popular predictive analytics
models. These models perform categorical analysis on historical data.
Various industries adopt classification models because they can retrain these
models with current data and as a result, they obtain useful and detailed
insights that help them build appropriate solutions. Classification models are
customizable and are helpful across industries, including banking and retail.

2. Clustering Model
The clustering model gathers data and divides it into groups based on
common characteristics. Hard clustering facilitates data classification,
determining if each data point belongs to a cluster, and soft clustering
allocates a probability to each data point.

In some applications, such as marketing, the ability to partition data into
distinct datasets depending on specific features is highly beneficial. A
clustering model can help businesses plan marketing campaigns for certain
groups of customers.

3. Outliers Model
Unlike the classification and forecast models, the outlier model deals with
anomalous data items within a dataset. It works by detecting anomalous data,
either in isolation or in combination with other categories and numbers.
Outlier models are essential in industries like retail and finance, where
detecting abnormalities can save businesses millions of dollars. Because
outlier models identify anomalies quickly, they are well suited to fraud
detection.
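
One common and simple way to flag anomalous values is the interquartile range (IQR) rule. The text does not name a specific method, so the 1.5 x IQR rule and the transaction amounts below are assumptions chosen for illustration.

```r
# Flag unusual transaction amounts with the 1.5 * IQR rule (illustrative sketch).

# Hypothetical daily transaction amounts containing two unusually large values.
amounts <- c(52, 48, 50, 47, 55, 51, 49, 53, 500, 46, 54, 50, 720, 48)

q     <- unname(quantile(amounts, probs = c(0.25, 0.75)))  # first and third quartiles
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr                                  # lower fence
upper <- q[2] + 1.5 * iqr                                  # upper fence

# A value is treated as an outlier when it falls outside the fences.
is_outlier <- amounts < lower | amounts > upper
outliers   <- amounts[is_outlier]

print(outliers)
```

In a fraud-detection setting, the flagged values would normally be passed on for review rather than removed automatically.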

4. Forecast Model
One of the most prominent predictive analytics models is the forecast model.
It predicts metric values by estimating new data values from
historical data. Forecast models can also estimate numerical values where
the historical data has none. One of the most powerful features of
forecast models is that they can manage multiple parameters at a time. As a
result, they're one of the most popular predictive models in the market.

Various industries can use a forecast model for different business purposes.
For example, a call center can use forecast analytics to predict how many
support calls it will receive in a day, and a retail store can forecast inventory
for the upcoming holiday sales periods.

5. Time Series Model
Time series predictive models analyze datasets in which the input is a
sequence of observations ordered in time. The time series model develops a
numerical value that predicts trends within a specific period by combining
multiple data points (from the previous year's data). A Time Series model can
outperform traditional ways of calculating a variable's progress because it
can forecast for numerous regions or projects at once or focus on a single
area or task, depending on the organization's needs.

Time Series predictive models are helpful if organizations need to know how
a specific variable changes over time. For example, if a small business owner
wishes to track sales over the last four quarters, they will need to use a Time
Series model. It can also account for external factors such as seasons or
periodic variations that could influence future trends.
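
The idea of tracking a variable over quarters and separating the seasonal pattern from the trend can be sketched with base R's ts objects. The sales figures below are hypothetical, and the centered moving average is just one simple smoothing choice, not the only way to model a time series.

```r
# Quarterly sales as a time series: seasonal pattern and trend (illustrative sketch).

# Hypothetical quarterly sales for three years (units are arbitrary).
sales <- ts(c(120, 135, 160, 190,
              130, 148, 172, 205,
              142, 160, 185, 221),
            start = c(2021, 1), frequency = 4)

# Average sales for each quarter of the year reveal the seasonal pattern.
quarterly_mean <- tapply(as.numeric(sales), as.integer(cycle(sales)), mean)

# A centered moving average over four quarters smooths out the seasonality
# and leaves the underlying trend (the first and last two values are NA).
trend <- stats::filter(sales, filter = c(0.5, 1, 1, 1, 0.5) / 4, sides = 2)

print(quarterly_mean)
print(trend)
```

In this made-up series the fourth quarter is consistently the strongest, and the smoothed trend rises from year to year.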

6. Linear Regression
One of the simplest machine learning techniques is linear regression. A
linear regression model describes the relationship between one or more
independent variables and the target response (dependent variable). Linear
regression is a statistical approach that helps organizations get insights into
customer behavior, business operations, and profitability. Ordinary linear
regression can assess trends and generate estimates or forecasts in
business.

For example, suppose a company's sales have increased gradually every
month for the past several years. In that case, the company might estimate
sales in the coming months by fitting a straight line to its monthly sales
data.
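
The sales example can be sketched with lm() in base R. The monthly figures are simulated, and the starting level, growth rate, and noise level are assumptions made only so that the example runs.

```r
# Fit a linear trend to monthly sales and forecast ahead (illustrative sketch).
set.seed(7)

# Hypothetical history: 36 months of sales that grow steadily with some noise.
month   <- 1:36
sales   <- 1000 + 25 * month + rnorm(36, sd = 30)
history <- data.frame(month = month, sales = sales)

# Model sales as a straight-line function of the month index.
fit <- lm(sales ~ month, data = history)
print(coef(fit))            # intercept and estimated monthly growth

# Estimate sales for the next three months.
future   <- data.frame(month = 37:39)
forecast <- predict(fit, newdata = future)
print(forecast)
```

The coefficient on month is the estimated average increase in sales per month, which is what drives the forecast for the coming months.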

7. Logistic Regression
Logistic regression is a statistical technique for describing and explaining
the relationship between a binary dependent variable and one or more nominal,
interval, or ratio-level independent variables. Logistic regression allows you
to predict the unknown values of a discrete target variable based on the
known values of other variables.

In marketing, logistic regression is used to build probability models that
forecast a customer's likelihood of making a purchase from customer data.
This gives marketers a more detailed view of customers' choices and the
knowledge they need to generate more effective and relevant outreach.
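
A purchase-likelihood model of this kind can be sketched with glm() and a binomial family. The customer data is simulated, and the single predictor visits and the coefficients used to generate it are assumptions for the example.

```r
# Estimate purchase probability from customer data (illustrative sketch).
set.seed(11)

# Hypothetical customers: number of site visits and whether they purchased.
n         <- 200
visits    <- rpois(n, lambda = 5)
p_true    <- plogis(-3 + 0.6 * visits)      # simulated: more visits, higher chance
purchased <- rbinom(n, size = 1, prob = p_true)
customers <- data.frame(visits = visits, purchased = purchased)

# Logistic regression: binary outcome modeled through the logit link.
fit <- glm(purchased ~ visits, data = customers, family = binomial)

# Predicted purchase probabilities for customers with 1, 5, and 10 visits.
new_customers <- data.frame(visits = c(1, 5, 10))
prob <- predict(fit, newdata = new_customers, type = "response")
print(round(prob, 3))
```

Using type = "response" returns probabilities between 0 and 1 rather than values on the log-odds scale.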

8. Decision Trees
A decision tree is an algorithm that displays the likely outcomes of various
actions by arranging data into a tree-like structure. Decision trees divide
different decisions into branches and then list alternative outcomes beneath
each one. At each branch, the algorithm examines the training data and
chooses the independent variable that best separates it into distinct
categories. The popularity of decision trees stems from the fact that
they are simple to understand and interpret.

Decision trees also work well with incomplete datasets and are helpful in
selecting relevant input variables. Businesses generally leverage decision
trees to identify the variables in a dataset that matter most for the target.
They may also employ them because the model can generate potential outcomes
from incomplete datasets.
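
How a tree chooses the variable for its first branch can be sketched in base R. The text does not say which splitting criterion is used, so Gini impurity is an assumption here, and the helper functions gini, split_impurity, and best_split are hypothetical names written for this example rather than part of any package.

```r
# Choose the first split of a decision tree by Gini impurity (illustrative sketch).

# Gini impurity of a set of class labels: 0 means all labels are the same.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two groups produced by the rule x <= threshold.
split_impurity <- function(x, y, threshold) {
  left  <- y[x <= threshold]
  right <- y[x >  threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
}

# Try every numeric feature and every midpoint between distinct values,
# and keep the split with the lowest weighted impurity.
best_split <- function(data, target) {
  best <- list(feature = NA, threshold = NA, impurity = Inf)
  for (f in setdiff(names(data), target)) {
    values <- sort(unique(data[[f]]))
    if (length(values) < 2) next
    candidates <- (head(values, -1) + tail(values, -1)) / 2
    for (thr in candidates) {
      imp <- split_impurity(data[[f]], data[[target]], thr)
      if (imp < best$impurity) {
        best <- list(feature = f, threshold = thr, impurity = imp)
      }
    }
  }
  best
}

root <- best_split(iris, "Species")
print(root)
```

A full tree repeats the same search inside each resulting group until the groups are pure or too small to split further.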

9. Gradient Boosted Model
A gradient boosted model employs a series of related decision trees to create
rankings. It builds one tree at a time, with each new tree correcting the
errors left by the trees before it, and combines the trees' outputs into a
weighted sum to produce the final prediction. These models allow certain
businesses to predict possible search engine results. Gradient boosting
often fits data sets more accurately than other techniques, although no
single technique is the most accurate for every problem.
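
The idea of building trees one at a time, each correcting the errors of the previous ones, can be sketched in base R with one-split trees (stumps) and a squared-error loss. This is a simplified illustration, not the algorithm of any particular package; the helpers fit_stump and predict_stump, the learning rate of 0.3, the 50 rounds, and the simulated data are all assumptions.

```r
# Gradient boosting with one-split trees on squared error (illustrative sketch).

# Fit a regression stump: the single threshold on x that minimizes squared error.
fit_stump <- function(x, r) {
  values     <- sort(unique(x))
  candidates <- (head(values, -1) + tail(values, -1)) / 2
  best <- list(sse = Inf)
  for (thr in candidates) {
    left  <- mean(r[x <= thr])
    right <- mean(r[x >  thr])
    sse   <- sum((r - ifelse(x <= thr, left, right))^2)
    if (sse < best$sse) {
      best <- list(threshold = thr, left = left, right = right, sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

# Simulated data: a noisy curve that a single stump cannot capture.
set.seed(5)
x <- seq(0, 10, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)

learning_rate <- 0.3
prediction    <- rep(mean(y), length(y))      # start from a constant model
mse_history   <- mean((y - prediction)^2)

for (i in 1:50) {
  residual    <- y - prediction               # errors left by the trees so far
  stump       <- fit_stump(x, residual)       # the next tree targets those errors
  prediction  <- prediction + learning_rate * predict_stump(stump, x)
  mse_history <- c(mse_history, mean((y - prediction)^2))
}

cat("MSE before boosting:", round(mse_history[1], 4), "\n")
cat("MSE after 50 trees: ", round(tail(mse_history, 1), 4), "\n")
```

The learning rate shrinks each tree's contribution, so the final prediction is a weighted sum of many small corrections.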

10. Neural Networks
Neural networks are complex algorithms that can recognize patterns in a
given dataset. A neural network is helpful for clustering data and defining
categories for various datasets. A basic neural network has three types of
layers. The input layer passes data to the hidden layer. The hidden layer,
as the name suggests, contains the internal functions that build predictors.
The output layer gathers the results from those predictors and generates the
final outcome. You can use neural networks with other predictive models like
time series or clustering.

11. Random Forest
A random forest is a large collection of decision trees, each making its own
prediction. Random forests can perform both classification and regression.
The shape of each tree is determined by the values of a random vector that is
sampled independently and with the same distribution for all trees in the
forest. The power of this model comes from the ability to create several
trees with various subsets of the features. Random forest uses the bagging
approach, i.e., it trains each tree on a subset of the training samples
drawn at random with replacement.
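
The bagging step, drawing training rows at random with replacement, can be sketched in base R. Only the resampling is shown, not a full random forest, and the 150 rows and 5 trees are assumptions chosen for illustration.

```r
# Bootstrap samples for bagging: rows drawn with replacement (illustrative sketch).
set.seed(3)

n_rows  <- 150     # hypothetical number of training rows
n_trees <- 5       # hypothetical number of trees in the forest

# One bootstrap sample of row indices per tree, each the size of the data.
bootstrap_rows <- lapply(seq_len(n_trees), function(i) {
  sample(n_rows, size = n_rows, replace = TRUE)
})

# Because rows can repeat, each tree sees only part of the distinct rows
# (about 63% on average); the rest are "out of bag" for that tree.
unique_fraction <- sapply(bootstrap_rows, function(rows) {
  length(unique(rows)) / n_rows
})

print(round(unique_fraction, 2))
```

Each tree would then be trained on its own sample, and the forest would combine the trees' predictions by voting (classification) or averaging (regression).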

10.4 APPLICATIONS OF PREDICTIVE MODELLING

1. Retail- Predictive analytics helps retailers in multiple regions with
inventory planning and dynamic pricing, evaluating the performance
of promotional campaigns, and deciding which personalized retail
offers are best for customers.

2. Healthcare- The healthcare industry employs predictive analytics and
modeling to analyze and forecast future population healthcare needs
by leveraging healthcare data. Predictive models in the healthcare
industry help identify activities that improve patient satisfaction,
resource usage, and budget control. Predictive modeling also enables
the healthcare industry to improve financial management and optimize
patient outcomes.
The Centre for Addiction and Mental Health (CAMH), Canada's leading
mental health teaching center, uses predictive modeling to streamline
treatment for ALC patients and maximize bed space.
3. Banking- The banking industry benefits from predictive analytics by
creating a credit risk-aware mindset, managing capital and liquidity,
and satisfying regulatory obligations. Predictive analytics models
provide stronger detection and protection, along with better control
and compliance. Predictive models allow banks and other financial
organizations to tailor each client interaction, reduce customer churn,
earn customer trust, and generate remarkable customer experiences.
OTP Bank Romania, part of the OTP Bank Group, implements
predictive analytics to govern the quality of loan issuances, yield more
precise business and risk forecasts, and meet profit goals for the
bank's credit portfolios.

4. Manufacturing- Manufacturing companies use predictive modeling to
forecast maintenance risks and reduce the cost of sudden breakdowns.
Predictive analytics models help businesses improve their
performance and overall equipment efficiency, and also allow
companies to enhance product quality and boost consumer
experience.
SPG Dry Cooling, a prominent manufacturer of air-cooled condensers,
uses predictive modeling to acquire better insights into performance
and optimize maintenance, resulting in higher dependability and cost
reductions.
