Core Module 5 - PPT

Module 5 covers Business Data Analytics, focusing on the concepts and applications of business analytics, including its types, tools, and trends. It emphasizes the importance of data-driven decision-making and the role of Big Data in analytics. Additionally, it introduces descriptive and inferential statistics as essential methods for analyzing data and making predictions.

Module 5 - Business Data Analytics

Understand business analytics and develop business intelligence.
(17 hours)
In this section, we will discuss:

● Introduction to business analytics and concepts of business analytics.


● Trends in business analytics.
● Introduction to Big Data Analytics
Introduction to business analytics and Concepts of business analytics

What is Business Analytics?

● Business analytics (BA) is the iterative, methodical exploration of an organization's data, with an emphasis on statistical analysis.
● Business analytics is used by companies that are committed to making data-driven decisions.
● Business analytics is "the study of data through statistical and operations analysis, the formation of predictive models, application of optimization techniques, and the communication of these results to customers, business partners, and corporate executives".
● It applies quantitative methods and evidence-based data to build models for businesses and to make profitable decisions. Thus, business analytics depends heavily on Big Data (large volumes of data).

Understanding Business Analytics

● Business analytics is the procedure through which information is dissected, after studying past performances and issues, to devise a successful plan for the future.
● Big Data, or large amounts of data, is used to derive solutions.
● This way of going about a business, or this outlook towards building and sustaining a business, is vital to the economy and to the industries that thrive in it.



Components of Business Analytics

● Define Objective
● Data Aggregation
● Data Cleaning
● Analytical Methodology
● Evaluation and Validation
● Reporting and Data Visualisation
Types of Business Analytics Methods

● Descriptive Analytics
● Diagnostic Analytics
● Predictive Analytics
● Prescriptive Analytics

Uses and Benefits of Business Analytics

● To carry out data mining, exploring new data to find new patterns and relationships.
● To carry out statistical and quantitative analysis to provide explanations for certain occurrences.
● To test previous decisions with the help of A/B testing and multivariate testing.
● To deploy predictive modelling to predict future outcomes.

Business Analytics Tools


● SQL
● Tableau / QlikView / Power BI
● BIRT
● Python
● R
● MS Excel
● Sisense
● Clear Analytics
● Pentaho BI
● MicroStrategy

Applications of Business Analytics

● Marketing
● Finance
● Human Resources
● Manufacturing

Trends in Business Analytics

Business Analytics Trends For 2021

● Data Quality Management
● Data Discovery / Visualization
● Artificial Intelligence
● Predictive and Prescriptive Analytics Tools
● Collaborative Business Intelligence
● Data-driven Culture

● Augmented Analytics
● Mobile BI
● Data Automation
● Embedded Analytics
● Natural language processing



Descriptive analytics

What is Descriptive Analytics?

● Descriptive analytics is a statistical method that is used to search and summarize historical data in order to identify patterns or meaning.
● Descriptive analytics is based on standard aggregate functions in databases.

● For example, in an online learning course with a discussion board, descriptive analytics could determine how many students participated in the discussion, or how many times a particular student posted in the discussion forum.
How does descriptive analytics work?

● Data aggregation and data mining are two techniques used in descriptive analytics to discover historical data.
● Data is first gathered and sorted by data aggregation, in order to make the datasets more manageable for analysts.

● Data mining describes the next step of the analysis, and involves a search of the data to identify patterns and meaning.
● Identified patterns are analyzed to discover the specific ways that learners interacted with the learning content and within the learning environment.
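As a concrete sketch of the aggregation step described above, the discussion-board counts can be produced with a simple counter. The student names and log format here are invented for illustration:

```python
from collections import Counter

# Hypothetical discussion-board log: one entry per post, by student name.
posts = ["alice", "bob", "alice", "carol", "alice", "bob"]

# Data aggregation: summarize the raw log into posts-per-student counts.
posts_per_student = Counter(posts)

print(posts_per_student["alice"])   # how many times alice posted: 3
print(len(posts_per_student))       # how many students participated: 3
```

The same summary would come from a `GROUP BY ... COUNT(*)` query, which is what "standard aggregate functions in databases" refers to.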
Examples of descriptive analytics

● Tracking course enrolments and course compliance rates
● Recording which learning resources are accessed, and how often
● Summarizing the number of times a learner posts in a discussion board
● Tracking assignment and assessment grades

● Comparing pre-test and post-test assessments
● Analyzing course completion rates, by learner or by course
● Collating course survey results
● Identifying the length of time that learners took to complete a course

Advantages of descriptive analytics

● Quickly and easily report on Return on Investment (ROI), by showing how performance achieved business or target goals.
● Identify gaps and performance issues early, before they become problems.

● Identify specific learners who require additional support, regardless of how many students or employees there are.
● Identify successful learners, in order to offer positive feedback or additional resources.
● Analyze the value and impact of course design and learning resources.
Introduction to Big Data Analytics

What is Data?

● Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
What is Big Data?

● Big Data is a collection of data that is huge in volume, yet grows exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, Big Data is just data, but of a huge size.
Example of Big Data

● The New York Stock Exchange is an example of Big Data: it generates about one terabyte of new trade data per day.
● Social media: statistics show that 500+ terabytes of new data are ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
● A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousand flights per day, data generation reaches many petabytes.
Types Of Big Data

Following are the types of Big Data:

● Structured
● Unstructured
● Semi-structured
Structured Big Data

● Any data that can be stored, accessed, and processed in the form of a fixed format is termed 'structured' data.
Unstructured Big Data

● Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in processing it to derive value from it.
Semi-structured Big Data

● Semi-structured data can contain both forms of data. Semi-structured data may appear structured in form, but it is not actually defined by, e.g., a table definition in a relational DBMS.
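The structured/semi-structured distinction can be illustrated with a short sketch (the table rows and JSON records below are made up for the example):

```python
import csv
import io
import json

# Structured: fixed-format rows, e.g. a relational table exported as CSV.
# Every row has exactly the columns declared in the header.
table = io.StringIO("id,name\n1,Asha\n2,Ravi\n")
rows = list(csv.DictReader(table))

# Semi-structured: JSON looks structured, but fields can vary per record,
# and no schema (like a table definition) enforces them.
records = [
    json.loads('{"id": 3, "name": "Mia"}'),
    json.loads('{"id": 4, "tags": ["new", "vip"]}'),  # no "name" field
]

print(rows[0]["name"])       # every CSV row has a "name" column: Asha
print("name" in records[1])  # a JSON record may omit it: False
```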
Data Growth over the years


Characteristics Of Big Data

Big Data can be described by the following characteristics:

● Volume
● Variety
● Velocity
● Variability
Statistics
(16 hours)
In this section, we will discuss:

● Inferential statistics: introduction, advantages, and examples.
● Descriptive statistics: introduction and its types.
● Measures of central tendency: arithmetic, geometric, and harmonic means; median and mode in raw and grouped data.
● Measures of dispersion: standard deviation and variance.
Inferential Statistics

Introduction to Inferential statistics

● Inferential statistics is a scientific discipline that uses mathematical tools to make forecasts and projections by analyzing the given data.
● It is of use to people employed in fields such as engineering, economics, biology, the social sciences, business, agriculture, and communications.
Advantages of Inferential statistics

● A precise tool for estimating a population.
● Highly structured analytical methods.
Inferential Statistics Examples

● Regression analysis is one of the most popular analysis tools.
● Regression analysis is used to predict the relationship between independent variables and the dependent variable.
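A minimal least-squares sketch of simple linear regression, fitting y = a + b·x; the sample points are illustrative, not from the slides:

```python
def fit_line(xs, ys):
    """Fit y = a + b*x by ordinary least squares; return (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x  # intercept: the line passes through the means
    return a, b

a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(a, b)   # intercept 0.0, slope 2.0
```

Libraries such as NumPy or scikit-learn are normally used for this in practice; the point here is only the underlying calculation.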
● Hypothesis testing is a statistical test where we want to know the truth of an assumption or opinion that is common in society.
● A confidence interval, or confidence level, is a statistical test used to estimate a population by using samples.
● Time series analysis is one type of statistical analysis that tries to predict an event in the future based on pre-existing data.
● With this method, we can estimate how a value or event is likely to appear in the future.
Descriptive Statistics

Introduction to descriptive statistics

● Descriptive statistics is used to describe the basic features of the data in a study.
● Descriptive statistics deals with the processing of data without attempting to draw any inferences from it.
● The data are presented in the form of tables and graphs.
● The characteristics of the data are described in simple terms.
● Events that are dealt with include everyday happenings such as accidents, prices of goods, business, incomes, epidemics, sports data, and population data.
Types of descriptive statistics

● All descriptive statistics are either measures of central tendency or measures of variability, also known as measures of dispersion.


Measure of Central Tendency

● Measures of central tendency focus on the average or middle values of data sets.
● These measures indicate where most values in a distribution fall, and are also referred to as the central location of a distribution.


● We can think of it as the tendency of data to cluster around a middle value.
● In statistics, the three most common measures of central tendency are the mean, median, and mode.


● Each of these measures calculates the location of the central point using a different method.
● Choosing the best measure of central tendency depends on the type of data we have.


Measure of Central Tendency

Mean

● The mean is the arithmetic average, and it is probably the measure of central tendency that you are most familiar with.
● Calculating the mean is very simple.


● We just add up all of the values and divide by the number of observations in the dataset:

  mean = (x1 + x2 + x3 + ... + xn) / n


● The calculation of the mean incorporates all values in the data.
● If you change any value, the mean changes.
● However, the mean doesn't always locate the center of the data accurately.


● In a symmetric distribution, the mean locates the center accurately.


● However, in a skewed distribution, the mean can miss the mark.
● This problem occurs because outliers have a substantial impact on the mean: extreme values in an extended tail pull the mean away from the center.
● As the distribution becomes more skewed, the mean is drawn further away from the center.
Median

● The median is the middle value: the value that splits the dataset in half.
● To find the median, order your data from smallest to largest, and then find the data point that has an equal number of values above it and below it.


● The method for locating the median varies slightly depending on whether your dataset has an even or odd number of values.


● In the example dataset with an odd number of observations, notice how the number 12 has six values above it and six below it.
● Therefore, 12 is the median of that dataset.


● When there is an even number of values, you count in to the two innermost values and then take their average.
● The average of 27 and 29 is 28; consequently, 28 is the median of that dataset.
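Both cases can be checked with Python's standard `statistics` module; the odd-length list below is a small example, and the even-length list is invented so its two innermost values are 27 and 29:

```python
import statistics

# Odd number of values: the single middle value is the median.
print(statistics.median([6, 10, 14, 15, 18]))   # 14

# Even number of values: average the two innermost values (27 and 29).
print(statistics.median([25, 27, 29, 31]))      # 28.0
```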


Mode

● The mode is the value that occurs most frequently in your data set.
● On a bar chart, the mode is the highest bar.
● If the data have multiple values that are tied for occurring most frequently, you have a multimodal distribution.
● If no value repeats, the data do not have a mode.
● In the example dataset, the value 5 occurs most frequently, which makes it the mode.
● These data might represent a 5-point Likert scale.


● Typically, you use the mode with categorical, ordinal, and discrete data.
● In fact, the mode is the only measure of central tendency that you can use with categorical data, such as the most preferred flavor of ice cream.
● However, with categorical data there isn't a central value, because you can't order the groups.
● With ordinal and discrete data, the mode can be a value that is not in the center.
● Again, the mode represents the most common value.
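The mode is also a one-liner with the `statistics` module; the Likert-style responses below are invented for the example:

```python
import statistics

# Hypothetical 5-point Likert responses; 5 occurs most often (4 times).
responses = [4, 5, 3, 5, 2, 5, 4, 5, 1]
print(statistics.mode(responses))                       # 5

# multimode reports ties, i.e. a multimodal distribution.
print(statistics.multimode(["a", "b", "a", "b", "c"]))  # ['a', 'b']
```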


Arithmetic mean

Definition

● The Arithmetic Mean is the most common and most easily understood measure of central tendency.
● We can define the mean as the value obtained by dividing the sum of the measurements by the number of measurements contained in the data set; it is denoted by the symbol x̄ (x bar).


Arithmetic Mean for three types of series

● Individual Data Series
● Discrete Data Series
● Continuous Data Series


Individual Data Series

● An individual series is when data is given on an individual basis. Following is an example of an individual series:
● Items: 5, 10, 20, 30, 40, 50, 60, 70
● For an individual series, the Arithmetic Mean is:

  x̄ = Σx / n


Individual Data Series continued…

● Alternatively, we can write the same formula as follows:

  x̄ = (x1 + x2 + ... + xn) / n


Individual Data Series continued…

Example / Problem Statement:
● Calculate the Arithmetic Mean for the following individual data:
● Items: 14, 36, 45, 70, 105


Solution:
● Based on the above formula, the Arithmetic Mean x̄ will be (14 + 36 + 45 + 70 + 105) / 5 = 270 / 5 = 54.
● The Arithmetic Mean of the given numbers is 54.
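The worked example can be checked in two lines with the standard `statistics` module:

```python
from statistics import mean

# The individual data series from the example above.
items = [14, 36, 45, 70, 105]
print(mean(items))   # 54
```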


Discrete Data Series

● A discrete series is when data is given along with frequencies. Following is an example of a discrete series:
● Items:     5  10  20  30  40  50  60  70
● Frequency: 2   5   1   3  12   0   5   7


Discrete Data Series continued…

● For a discrete series, the Arithmetic Mean can be calculated using the following formula:

  x̄ = Σ(f·x) / Σf


● Alternatively, the same mean can be written using an assumed mean A and deviations d = x − A (the standard shortcut method):

  x̄ = A + Σ(f·d) / Σf
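A quick sketch of the discrete-series calculation, using the example items and frequencies from the slide above:

```python
# Weighted (discrete-series) mean: sum(f*x) / sum(f).
items = [5, 10, 20, 30, 40, 50, 60, 70]
freqs = [2, 5, 1, 3, 12, 0, 5, 7]

mean = sum(f * x for f, x in zip(freqs, items)) / sum(freqs)
print(round(mean, 2))   # 1440 / 35 = 41.14
```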


Continuous Data Series

● A continuous series is when data is given as ranges (classes) along with their frequencies. Following is an example of a continuous series:
● Items:     0-5  5-10  10-20  20-30  30-40
● Frequency:   2     5      1      3     12


Continuous Data Series continued…

● For a continuous series, the mean is calculated using the class midpoints m:

  x̄ = Σ(f·m) / Σf


Geometric mean

● The geometric mean of n numbers is defined as the nth root of the product of the n numbers:

  GM = (x1 · x2 · ... · xn)^(1/n)
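Python 3.8+ ships a geometric mean in the standard library; the small datasets below are illustrative:

```python
from statistics import geometric_mean   # Python 3.8+

# nth root of the product of n numbers, e.g. (2 * 8) ** (1/2) = 4.
print(round(geometric_mean([2, 8]), 6))      # 4.0
print(round(geometric_mean([1, 3, 9]), 6))   # cube root of 27 = 3.0
```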




Harmonic mean

What is Harmonic Mean?

● The harmonic mean is a type of average that is calculated by dividing the number of values in a data series by the sum of the reciprocals (1/x_i) of each value in the data series.
● The harmonic mean is one of the three Pythagorean means (the other two are the arithmetic mean and the geometric mean).
● The harmonic mean always shows the lowest value among the Pythagorean means.
Formula of Harmonic Mean

● The general formula for calculating a harmonic mean is:
  Harmonic mean = n / (Σ 1/x_i)
  where n is the number of values in the dataset and x_i is a point in the dataset.
● The weighted harmonic mean can be calculated using the following formula:
  Weighted Harmonic Mean = (Σ w_i) / (Σ w_i / x_i)
  where w_i is the weight of data point x_i.
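The plain harmonic mean is available in Python's standard library. The speed example here is a classic illustration (not from the slides): averaging rates over a fixed distance calls for the harmonic, not arithmetic, mean:

```python
from statistics import harmonic_mean

# Driving a fixed distance at 40 km/h one way and 60 km/h back gives an
# average speed of 2 / (1/40 + 1/60) = 48 km/h, not 50 km/h.
print(harmonic_mean([40, 60]))   # 48.0
```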
Example of Harmonic Mean

● Consider an index containing two stocks: 40% of its weight is in Company A (market capitalization $1 billion, earnings $20 million) and 60% in Company B (market capitalization $20 billion, earnings $5 billion).
● Firstly, we need to find the P/E ratio of each company. Remember that the P/E ratio is essentially the market capitalization divided by the earnings.
● P/E (Company A) = ($1 billion) / ($20 million) = 50
● P/E (Company B) = ($20 billion) / ($5 billion) = 4


Example of Harmonic Mean continued…

● We must use the weighted harmonic mean to calculate the P/E ratio of the index. Using the formula for the weighted harmonic mean, the P/E ratio of the index is:
● P/E (Index) = (0.4 + 0.6) / (0.4/50 + 0.6/4) = 6.33
● Note that if we calculated the P/E ratio of the index using the weighted arithmetic mean instead, it would be significantly overstated:
● P/E (Index) = 0.4 × 50 + 0.6 × 4 = 22.4
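The index calculation above can be reproduced directly from the weighted harmonic mean formula:

```python
# Weighted harmonic mean for the index P/E example:
# weights 0.4 and 0.6, P/E ratios 50 and 4.
weights = [0.4, 0.6]
ratios = [50, 4]

# (Σ w_i) / (Σ w_i / x_i)
whm = sum(weights) / sum(w / x for w, x in zip(weights, ratios))
print(round(whm, 2))   # 6.33

# The weighted arithmetic mean overstates it.
wam = sum(w * x for w, x in zip(weights, ratios))
print(round(wam, 1))   # 22.4
```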
Median in Raw and Grouped Data

Median in Raw Data

● The median of raw data is the number which divides the observations, when arranged in order (ascending or descending), into two equal parts.


Method of finding median

● Take the following steps to find the median of raw data.
● Step I: Arrange the raw data in ascending or descending order.
● Step II: Observe the number of variates in the data. Let the number of variates in the data be n. Then find the median as follows.
● (i) If n is odd, then the ((n + 1)/2)th variate is the median.
● (ii) If n is even, then the mean of the (n/2)th and ((n/2) + 1)th variates is the median, i.e.,
● median = [(n/2)th variate + ((n/2) + 1)th variate] / 2.


Solved Examples on Median of Raw Data

● Find the median of the ungrouped data: 15, 18, 10, 6, 14.
● Solution: Arranging the variates in ascending order, we get 6, 10, 14, 15, 18.
● The number of variates n = 5, which is odd.
● Therefore, median = ((5 + 1)/2)th variate = 3rd variate = 14.


Finding Median for Grouped Data

● The median is the value which occupies the middle position when all the observations are arranged in ascending or descending order. It is a positional average.
● (i) Construct the cumulative frequency distribution.
● (ii) Find the (N/2)th term.
● (iii) The class that contains the cumulative frequency N/2 is called the median class.
Finding Median for Grouped Data continued…

● (iv) Find the median by using the formula:

  Median = l + ((N/2 − m) / f) × c

● where l = lower limit of the median class,
● f = frequency of the median class,
● c = width of the median class,
● N = the total frequency (Σf),
● m = cumulative frequency of the class preceding the median class.
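The grouped-median formula above can be sketched as a small function; the class intervals and frequencies in the example call are invented for illustration:

```python
def grouped_median(classes, freqs):
    """Median = l + ((N/2 - m) / f) * c, per the grouped-data formula.

    classes: list of (lower, upper) class bounds; freqs: matching frequencies.
    """
    n = sum(freqs)          # N = total frequency
    cum = 0                 # running cumulative frequency
    for (lower, upper), f in zip(classes, freqs):
        if cum + f >= n / 2:        # first class whose cum. freq reaches N/2
            l = lower               # lower limit of the median class
            c = upper - lower       # width of the median class
            m = cum                 # cum. freq of the preceding class
            return l + ((n / 2 - m) / f) * c
        cum += f

# N = 10, N/2 = 5; median class is 10-20 (cumulative frequency 7 >= 5),
# so median = 10 + ((5 - 2) / 5) * 10 = 16.0
print(grouped_median([(0, 10), (10, 20), (20, 30)], [2, 5, 3]))   # 16.0
```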


Solved Example on Median

● A researcher studying the behavior of mice has recorded the time (in seconds) taken by each of 13 different mice to locate its food: 31, 33, 63, 33, 28, 29, 33, 27, 27, 34, 35, 28, 32. Find the median time that the mice spent searching for food.
● Ascending order of the given data: 27, 27, 28, 28, 29, 31, 32, 33, 33, 33, 34, 35, 63.
● Since n = 13 is odd, the middle value is the 7th observation, which is 32 seconds.
Mode in Raw and Grouped Data

Finding the Mode in Raw Data

● To find the mode, or modal value, it is best to put the numbers in order, and then count how many there are of each number. The number that appears most often is the mode.
● Example: 3, 7, 5, 13, 20, 23, 39, 23, 40, 23, 14, 12, 56, 23, 29
● In order, these numbers are:


● 3, 5, 7, 12, 13, 14, 20, 23, 23, 23, 23, 29, 39, 40, 56
● This makes it easy to see which numbers appear most often.
● In this case the mode is 23.


Finding the Mode in Grouped Data

● In some cases (such as when all values appear the same number of times) the mode is not useful. But we can group the values to see if one group has more than the others.
● Example: {4, 7, 11, 16, 20, 22, 25, 26, 33}
● Each value occurs once, so let us try to group them.


● We can try groups of 10:
● 0-9: 2 values (4 and 7)
● 10-19: 2 values (11 and 16)
● 20-29: 4 values (20, 22, 25 and 26)
● 30-39: 1 value (33)
● In groups of 10, the "20s" appear most often, so we could choose 25 (the middle of the 20s group) as the mode.
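The grouping step above can be sketched with a counter keyed by decade:

```python
from collections import Counter

# The example values, grouped into tens: 4 -> 0, 11 -> 10, 22 -> 20, ...
values = [4, 7, 11, 16, 20, 22, 25, 26, 33]
groups = Counter(v // 10 * 10 for v in values)

modal_group = max(groups, key=groups.get)
print(modal_group, groups[modal_group])   # 20 4  -> the "20s" group wins
```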


Standard Deviation

Standard Deviation Formulas

● The Standard Deviation is a measure of how spread out numbers are.
● The symbol for Standard Deviation is σ (the Greek letter sigma).
● This is the formula for the (population) Standard Deviation, where μ is the mean and N is the number of values:

  σ = √( Σ(x − μ)² / N )


Standard Deviation
Steps for Standard Deviation

● Say we have a bunch of numbers like 9,


2, 5, 4, 12, 7, 8, 11.
● To calculate the standard deviation of
those numbers:
● 1. Work out the Mean (the simple
average of the numbers)
● 2. Then for each number: subtract the
Mean and square the result
● 3. Then work out the mean of those
squared differences.
● 4. Take the square root of that and we
are done! Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
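The four steps can be followed literally in Python; a short sketch using the numbers from the slide:

```python
from math import sqrt

data = [9, 2, 5, 4, 12, 7, 8, 11]

# Step 1: work out the mean
mean = sum(data) / len(data)

# Step 2: for each number, subtract the mean and square the result
squared_diffs = [(x - mean) ** 2 for x in data]

# Step 3: work out the mean of those squared differences (the variance)
variance = sum(squared_diffs) / len(data)

# Step 4: take the square root
std_dev = sqrt(variance)

print(mean, variance, round(std_dev, 2))  # 7.25 10.4375 3.23
```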
Variance

What is Variance?

● Variance is the expected value of the


squared deviation of a random variable
from its mean.
● In short, it is the measurement of the
distance of a set of random numbers
from their collective average value.
● Variance is used in statistics as a way of
better understanding a data set's
distribution.
Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
Variance

How does Variance work?

● Variance is calculated by finding the


square of the standard deviation of a
variable, and the covariance of the
variable with itself.
● In the formula above, µ represents the
mean of the data points, x is the value of
an individual data point, and N is the
total number of data points.
Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
Variance

How to Calculate Variance?


(Continued)

● Steps to Calculate Variance:

1. List elements of data set. The


following are ages of students
pursuing a Master’s degree:

2. Data set 1: 28,25,26,27,31,32,24


● Calculate the mean.
● (28 + 25 +26 +27 +31 +32 + 24) / 7 =
27.57 Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
Variance

How to Calculate Variance?


(Continued)

● Find the deviation from the mean for


each data point.

Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/


Variance
How to Calculate Variance?
(Continued)

● Square it

Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/


Variance
How to Calculate Variance?
(Continued)

● The average of all squared differences
is the variance. To find it, add all
squared deviations and divide the sum
by the number of elements in the data set (n).
⇒ (0.1849 + 6.6049 + 2.4649 + 0.3249 + 11.7649 + 19.6249 + 12.7449) / 7
⇒ 53.7143 / 7 ≈ 7.6735
⇒ Variance ≈ 7.67
● To find the standard deviation in ages of
students pursuing a Master’s, we calculate
the square root of the variance.
⇒ Standard Deviation = √Variance ≈ 2.77

Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
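The same calculation can be done with the statistics module, which uses the exact mean internally (so the result differs slightly from a hand calculation with the mean rounded to 27.57):

```python
from statistics import pvariance, pstdev

ages = [28, 25, 26, 27, 31, 32, 24]

variance = pvariance(ages)  # population variance: mean of squared deviations
std_dev = pstdev(ages)      # population standard deviation

print(round(variance, 2), round(std_dev, 2))  # 7.67 2.77
```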


Variance
Applications of Variance

● Variance plays a major role in


interpreting data in statistics.
● The most common application of
variance is in polls.
● For opinion polls, the data gathering
agencies cannot invest in collecting data
from the entire population.
● They set criteria for sampling the
population based on ethnicity, income
group, regions, education level, salary
and religion, so that the population is
completely represented by the samples. Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
Properties of Variance and
standard deviation

Properties of Variance
● Variance is a numerical value that
describes the variability of observations
from its arithmetic mean.
● Variance is nothing but an average of
squared deviations.
● Variance is denoted by sigma-squared
(σ²)
● Variance is expressed in square units
which are usually larger than the values
in the given dataset.
● Variance measures how far individual
values spread out from their mean.
Image Source: https://www.educba.com/variance-vs-standard-deviation/
Properties of Variance and
standard deviation

Properties of Variance
● In statistics variance is defined as the
measure of variability that represents
how far members of a group are spread
out.
● It finds out the average degree to which
each observation varies from the mean.
● When the variance of a data set is small,
it shows the closeness of the data points
to the mean whereas a greater value of
variance represents that the
observations are widely dispersed around
the mean.
Image Source: https://www.educba.com/variance-vs-standard-deviation/
Median in Raw and
Grouped Data
Finding Median for Grouped Data
● Median is the value which occupies the
middle position when all the
observations are arranged in an
ascending or descending order. It is a
positional average.
● (i) Construct the cumulative frequency
distribution.
● (ii) Find (N/2)th term
● (iii) The class that contains the
cumulative frequency N/2 is called the
median class.
Image Source: https://www.geeksforgeeks.org/difference-between-descriptive-and-inferential-statistics/
Properties of Variance and
standard deviation

Properties of Standard Deviation

● Standard deviation is a measure that


quantifies the amount of dispersion of
the observations in a dataset.
● A low standard deviation is an
indicator of the closeness of the scores
to the arithmetic mean, while a high
standard deviation indicates that the
scores are dispersed over a wider
range of values.
Image Source: https://www.educba.com/variance-vs-standard-deviation/
Properties of Variance and
standard deviation

Properties of Standard Deviation


● Standard deviation is a measure of the
dispersion of observations within a data
set relative to their mean.
● The standard deviation is the root mean
square deviation.
● Standard deviation is labelled as sigma
(σ).
● Standard deviation is expressed in
the same units as the values in the set
of data.
● Standard deviation measures how much
the observations deviate from the mean.
Image Source: https://www.educba.com/variance-vs-standard-deviation/
Properties of Variance and
standard deviation
Example : To find Standard Deviation
and Variance

● Marks scored by a student in five


subjects are 60, 75, 46, 58 and 80
respectively.
● You have to find out the standard
deviation and variance.
● First of all, you have to find out the
mean,

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Properties of Variance and
standard deviation
Example : To find Standard Deviation
and Variance

● Now calculate the variance


● Where, X = Observations
● A = Arithmetic Mean
● Both variance and standard deviation
are always positive.
● If all the observations in a data set are
identical, then the standard deviation
and variance will be zero.

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Properties of Variance and
standard deviation

Image Source: https://www.educba.com/variance-vs-standard-deviation/
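The worked example can be verified in Python; a sketch for the five subject marks:

```python
from statistics import mean, pvariance, pstdev

marks = [60, 75, 46, 58, 80]

avg = mean(marks)            # arithmetic mean of the marks
variance = pvariance(marks)  # always positive (zero only if all marks are equal)
std_dev = pstdev(marks)      # expressed in the same units as the marks

print(avg, variance, round(std_dev, 2))  # 63.8 150.56 12.27
```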


Types of Distributions

Data Distribution

● Data distribution is a function that


determines the possible values of a variable
and quantifies their relative frequency; it
transforms raw data into graphical
form to give valuable information.

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions
The graph of a uniform distribution curve looks like
Uniform Distribution

● Uniform distribution can either be


discrete or continuous where each event
is equally likely to occur. It has a
constant probability constructing a
rectangular distribution.
● A variable X is said to be uniformly
distributed if the density function is:

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions
A binomial distribution graph where the probability of
Binomial Distribution success does not equal the probability of failure looks
like
● A distribution where only two outcomes
are possible, such as success or failure,
gain or loss, win or lose and where the
probability of success and failure is
the same for all the trials is called a
Binomial Distribution.
● The mathematical representation of
binomial distribution is given by:

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions

Binomial Distribution continued..

● When probability of success =


probability of failure, in such a situation
the graph of binomial distribution looks
like

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions

Normal Distribution

● Being a continuous distribution, the


normal distribution is most commonly
used in data science.
● Many common quantities from day-to-
day life follow this distribution:
income distribution, average employee
reports, average weight of a population,
etc.

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions The graph of a random variable X ~ N (µ, σ) is shown
below.

Normal Distribution

● The PDF of a random variable X


following a normal distribution is given
by:

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions

Normal Distribution

● A standard normal distribution is defined


as the distribution with mean 0 and
standard deviation 1.

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Types of Distributions
The graph of a Poisson distribution is shown below:
Poisson Distribution
● A distribution is called Poisson
distribution when the following
assumptions are valid:
● Any successful event should not
influence the outcome of another
successful event.
● The probability of success in a short
interval is proportional to the length of
the interval.
● The probability of success in an interval
approaches zero as the interval
becomes smaller.
Image Source: https://www.educba.com/variance-vs-standard-deviation/
Types of Distributions
Poisson Distribution

● Some notations used in Poisson


distribution are:
● λ is the rate at which an event occurs,
● t is the length of a time interval,
● And X is the number of events in that
time interval.
● Here, X is called a Poisson Random
Variable and the probability distribution
of X is called Poisson distribution.
● Let µ denote the mean number of events
in an interval of length t. Then, µ = λ*t.
● The PMF of X following a Poisson
distribution is given by: Image Source: https://www.educba.com/variance-vs-standard-deviation/
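The PMF can be evaluated directly with the standard library; a sketch assuming a rate of λ = 2 events per unit time over an interval of length t = 1, so µ = λ·t = 2:

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """P(X = k) = e^(-mu) * mu**k / k! for a Poisson random variable."""
    return exp(-mu) * mu ** k / factorial(k)

lam, t = 2.0, 1.0        # assumed rate and interval length
mu = lam * t             # mean number of events in the interval

p0 = poisson_pmf(0, mu)  # probability of observing no events
print(round(p0, 4))      # 0.1353
```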
Types of Distributions
Exponential Distribution

● Like the Poisson distribution,


exponential distribution has the time
element; it gives the probability of a time
duration before an event takes place.
● A random variable X is said to have an
exponential distribution with PDF:
● f(x) = λe^(−λx) for x ≥ 0, and 0 otherwise

Image Source: https://www.educba.com/variance-vs-standard-deviation/


Sampling

What is sampling?

● Sampling is a technique of selecting


individual members or a subset of the
population to make statistical inferences
from them and estimate characteristics
of the whole population.

Image Source:
Sampling

Sampling Example

● If a drug manufacturer would like to


research the adverse side effects of a
drug on the country’s population, it is
almost impossible to conduct a research
study that involves everyone.
● In this case, the researcher decides a
sample of people from each
demographic and then researches them,
giving him/her indicative feedback on
the drug’s behavior.
Image Source:
Sampling

Types of sampling: sampling methods

● Probability sampling: Probability


sampling is a sampling technique where
a researcher sets a selection of a few
criteria and chooses members of a
population randomly.
● All the members have an equal
opportunity to be a part of the sample
with this selection parameter.

Image Source:
Sampling

Types of sampling: sampling methods

● Non-probability sampling: In non-


probability sampling, the researcher
chooses members for research based on
convenience or judgment rather than at
random.
● This sampling method is not a fixed or
predefined selection process. This
makes it difficult for all elements of a
population to have equal opportunities to
be included in a sample.

Image Source:
Sampling

Types of probability sampling

● Probability sampling is a sampling


technique in which researchers choose
samples from a larger population using
a method based on the theory of
probability.
● This sampling method considers every
member of the population and forms
samples based on a fixed process.

Image Source:
Sampling

Types of probability sampling

● Simple random sampling: It is a


reliable method of obtaining information
where every single member of a
population is chosen randomly, merely
by chance.
● Cluster sampling: It is a
method where the researchers divide
the entire population into sections or
clusters that represent a population.
Image Source:
Sampling

Types of probability sampling

● Systematic sampling: This method is used to


choose the sample members of a
population at regular intervals. It
requires the selection of a starting point
for the sample and sample size that can
be repeated at regular intervals.
● Stratified random sampling: It is a
method in which the researcher divides
the population into smaller groups that
don’t overlap but represent the entire
population.
Image Source:
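Simple random and systematic sampling can be sketched with the standard library; the population of 100 member IDs here is hypothetical, used only to illustrate the two selection processes:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
population = list(range(1, 101))  # hypothetical population of 100 member IDs

# Simple random sampling: every member is chosen merely by chance
simple_sample = random.sample(population, 10)

# Systematic sampling: pick a random starting point, then every k-th member
k = len(population) // 10        # sampling interval
start = random.randrange(k)      # starting point within the first interval
systematic_sample = population[start::k]

print(len(simple_sample), len(systematic_sample))  # 10 10
```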
Sampling

Types of non-probability sampling

● Convenience sampling: This method is


dependent on the ease of access to
subjects such as surveying customers at
a mall or passers-by on a busy street.
● Judgmental or purposive sampling:
Judgmental or purposive samples are
formed by the discretion of the
researcher. Researchers purely
consider the purpose of the study, along
with the understanding of the target
audience.
Image Source:
Sampling

Types of non-probability sampling

● Snowball sampling: It is a sampling


method that researchers apply when the
subjects are difficult to trace.
● Quota sampling: In Quota sampling,
the selection of members in this
sampling technique happens based on a
pre-set standard.

Image Source:
Business Analytics
(17 hours)
In this section, we will discuss:

● Probability Theories
● Bayes’ Theorem
● Maximum Likelihood
● Hypothesis Testing
● Central limit theorem
● Chi-square test
Probability Theories

What Is Probability Theory?

● Probability theory is a branch of


mathematics focusing on the analysis of
random phenomena. It is an important
skill for data scientists using data
affected by chance.

Image Source:
https://i.pinimg.com/originals/c7/44/33/c74433effae9a094ff059a6d016d547b.jpg
Probability Theories

Practical Uses for Probability Theory

● Data scientists use it to model situations


● Business world
● Clinical trials

Image Source:
https://i.pinimg.com/originals/c7/44/33/c74433effae9a094ff059a6d016d547b.jpg
Probability Theories

Types of Probability?

● Classical
● Relative Frequency
● Subjective Probability

Image Source:
https://slideplayer.com/slide/6379961/22/images/3/Types+of+Probability.jpg
Probability Theories

Probability Theory Examples

Probability theory is a tool employed by


researchers, businesses, investment
analysts and countless others for risk
management and scenario analysis.
● Epidemiology
● Insurance
● Small Business
● Meteorology Image Source:
https://i.pinimg.com/originals/c7/44/33/c74433effae9a094ff059a6d016d547b.jpg
Probability Theories

Advantages and Disadvantages of


Probability Theory

● Classical
● Relative Frequency
● Subjective

Image Source:
https://slideplayer.com/slide/6379961/22/images/3/Types+of+Probability.jpg
Probability Theories

What is the probability formula?

● P(A) = favorable outcomes/total


outcomes

Image Source:
https://d138zd1ktt9iqe.cloudfront.net/media/seo_landing_files/probability-formula-
1-1634878086.png
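The formula can be applied directly; a minimal sketch with a hypothetical die-rolling example:

```python
def probability(favorable, total):
    """P(A) = favorable outcomes / total outcomes."""
    return favorable / total

# Hypothetical example: probability of rolling an even number
# on a fair six-sided die (3 favorable outcomes out of 6)
p_even = probability(3, 6)
print(p_even)  # 0.5
```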
Probability Theories

How Data Scientists Use Probability


Theory

● Probability allows data scientists to


assess the certainty of outcomes of a
particular study or experiment.
● Today’s data scientists need to have an
understanding of the foundational
concepts of probability theory including
key concepts involving probability
distribution, statistical significance,
Image Source: https://luminousmen.com/media/data-science-probability_2.jpg
hypothesis testing and regression.
Bayes’ Theorem

What is the Bayes’ Theorem?

● The Bayes’ theorem is a mathematical


formula used to determine the
conditional probability of events.
Essentially, the Bayes’ theorem
describes the probability of an event
based on prior knowledge of the
conditions that might be relevant to the
event. Image Source: : https://cdn.corporatefinanceinstitute.com/assets/bayes-
theorem.png
Bayes’ Theorem

Formula for Bayes’ Theorem

Where:
● P(A|B) – the probability of event A
occurring, given event B has occurred
● P(B|A) – the probability of event B
occurring, given event A has occurred
● P(A) – the probability of event A
● P(B) – the probability of event B Image Source: : https://cdn.corporatefinanceinstitute.com/assets/bayes-
theorem1.png
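The formula translates directly into code; the probabilities below are hypothetical, chosen only to illustrate the calculation:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Hypothetical numbers: event A has prior probability 0.01,
# P(B|A) = 0.9, and B occurs with overall probability 0.1
p_a_given_b = bayes(p_b_given_a=0.9, p_a=0.01, p_b=0.1)
print(round(p_a_given_b, 2))  # 0.09
```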
Bayes’ Theorem

Formula for Bayes’ Theorem

A special case of the Bayes’ theorem is


when event A is a binary variable. In such a
case, the theorem is expressed in the
following way
Where:
•P(B|A–) – the probability of event B
occurring given that event A– has occurred
•P(B|A+) – the probability of event B
occurring given that event A+ has occurred Image Source: https://cdn.corporatefinanceinstitute.com/assets/bayes-theorem2-
600x97.png
Bayes’ Theorem

Example of Bayes’ Theorem

● Using the Bayes’ theorem, we can find


the required probability:

Image Source: https://cdn.corporatefinanceinstitute.com/assets/bayes-


theorem3.png
Maximum likelihood

What is Maximum likelihood

● Maximum likelihood is a widely used


technique for estimation with
applications in many areas including
time series modeling, panel data,
discrete data, and even machine
learning.
Image Source: https://4.bp.blogspot.com/-
FewCGHC23oU/WliHLS_APPI/AAAAAAAAAv8/gyvldpT8aG4VEPVu6DnqhIXoSa
Bg0F_YgCLcBGAs/s1600/Maximum-Likelihood-Estimation-in-Machine-
Learning.jpg
Maximum likelihood

What is Maximum Likelihood


Estimation?

● Maximum likelihood estimation is a


statistical method for estimating the
parameters of a model. In maximum
likelihood estimation, the parameters are
chosen to maximize the likelihood that
the assumed model results in the
observed data. Image Source: https://i.ytimg.com/vi/2vh98ful3_M/hqdefault.jpg
Maximum likelihood

What is Maximum Likelihood


Estimation?

In order to implement maximum likelihood


estimation we must:
● Assume a model, also known as a data
generating process, for our data.
● Be able to derive the likelihood function
for our data, given our assumed model
(we will discuss this more later). Image Source: https://i.ytimg.com/vi/2vh98ful3_M/hqdefault.jpg
Maximum likelihood

Advantages of Maximum Likelihood


Estimation

● If the model is correctly assumed, the


maximum likelihood estimator is the
most efficient estimator.
● It provides a consistent but flexible
approach which makes it suitable for a
wide variety of applications, including
cases where assumptions of other
models are violated.
● It results in unbiased estimates in larger
Image Source: https://images.slideplayer.com/35/10494584/slides/slide_7.jpg
samples.
Maximum likelihood

Disadvantages of Maximum
Likelihood Estimation

● It relies on the assumption of a model


and the derivation of the likelihood
function which is not always easy.
● Like other optimization problems,
maximum likelihood estimation can be
sensitive to the choice of starting values.

Image Source: https://images.slideplayer.com/35/10494584/slides/slide_7.jpg


Maximum likelihood

Disadvantages of Maximum
Likelihood Estimation

● Depending on the complexity of the


likelihood function, the numerical
estimation can be computationally
expensive.
● Estimates can be biased in small
samples.

Image Source: https://images.slideplayer.com/35/10494584/slides/slide_7.jpg


Maximum likelihood

What is the Likelihood Function?

● Maximum likelihood estimation hinges


on the derivation of the likelihood
function. For this reason, it is important
to have a good understanding of what
the likelihood function is and where it
comes from.
● Let's start with the very simple case
where we have one series with 10
independent observations: 5, 0, 1, 1, 0,
Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF
3, 2, 3, 4, 1.
What is the Likelihood
Function?

The Probability Density

● The first step in maximum likelihood


estimation is to assume a probability
distribution for the data. A probability
density function measures the
probability of observing the data given a
set of underlying model parameters.

Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF


What is the Likelihood
Function?

The Probability Density

● The Poisson probability density function


for an individual observation, is given by

● Because the observations in our sample


are independent, the probability density
of our observed sample can be found by
taking the product of the probability of
the individual observations:
Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF
What is the Likelihood
Function?

The Likelihood Function

The differences between the likelihood


function and the probability density function
are nuanced but important.
● A probability density function expresses
the probability of observing our data
given the underlying distribution
parameters. It assumes that the
parameters are known.
Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF
What is the Likelihood
Function?

The Likelihood Function

● The likelihood function expresses the


likelihood of parameter values occurring
given the observed data. It assumes that
the parameters are unknown.

Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF


What is the Likelihood
Function?

The Likelihood Function

● Mathematically the likelihood function


looks similar to the probability density:

● For our Poisson example, we can fairly


easily derive the likelihood function

Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF


What is the Likelihood
Function?

The Log-Likelihood Function

● In practice, the joint distribution function


can be difficult to work with, so the logarithm
of the likelihood function is used instead. In
the case of our Poisson dataset the log-
likelihood function is:

Image Source: http://ai.stanford.edu/~moises/tutorial/img032.GIF


What is the Likelihood
Function?

The Maximum Likelihood Estimator

● A graph of the likelihood and log-


likelihood for our dataset shows that the
maximum likelihood occurs when = 2.
This means that our maximum likelihood
estimator,

Image Source: https://www.aptech.com/wp-content/uploads/2020/09/poisson-


likelihood-function.jpeg
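The λ̂ = 2 result can be reproduced numerically; a sketch that evaluates the Poisson log-likelihood of the ten observations over a grid of candidate rates:

```python
from math import log, factorial

data = [5, 0, 1, 1, 0, 3, 2, 3, 4, 1]  # the ten observations from the example

def log_likelihood(lam, xs):
    """Poisson log-likelihood: sum of log f(x_i | lam) over the sample."""
    return sum(x * log(lam) - lam - log(factorial(x)) for x in xs)

# Search a grid of candidate rates for the maximiser
grid = [i / 100 for i in range(50, 401)]  # 0.50, 0.51, ..., 4.00
lam_hat = max(grid, key=lambda lam: log_likelihood(lam, data))

print(lam_hat)  # 2.0 (the sample mean, as theory predicts)
```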
What is the Likelihood
Function?

Maximum Likelihood Estimation
(MLE) vs Bayesian Estimation

● Predictions: In MLE we make predictions
utilizing the latent variables in the density
function to compute a probability, whereas in
Bayesian estimation we make predictions using
the posterior distribution, with the parameters
treated as random variables.
● Situations to work with: MLE suits data with
minimal values where knowledge of priors is
low; Bayesian estimation suits data with sparse
values where knowledge about the reliability of
priors is high.
● Complexity: MLE is less complex because we
only need to compute the likelihood function;
Bayesian estimation is more complex because
the computation requires the likelihood
function, evidence, and prior.
Hypothesis Testing

What Is Hypothesis Testing?

● Hypothesis testing is an act in statistics


whereby an analyst tests an assumption
regarding a population parameter. The
methodology employed by the analyst
depends on the nature of the data used
and the reason for the analysis.
Image Source: https://miro.medium.com/max/862/1*VXxdieFiYCgR6v7nUaq01g.jpeg
Hypothesis Testing

What Is Hypothesis Testing?

● Hypothesis testing is used to assess the


plausibility of a hypothesis by using
sample data. Such data may come from
a larger population, or from a data-
generating process. The word
"population" will be used for both of
these cases in the following
descriptions.
Image Source: https://miro.medium.com/max/862/1*VXxdieFiYCgR6v7nUaq01g.jpeg
Hypothesis Testing

What Is Hypothesis Testing?

● Hypothesis testing is used to assess the


plausibility of a hypothesis by using
sample data.
● The test provides evidence concerning
the plausibility of the hypothesis, given
the data.
● Statistical analysts test a hypothesis by
measuring and examining a random
sample of the population being Image Source: https://miro.medium.com/max/862/1*VXxdieFiYCgR6v7nUaq01g.jpeg

analyzed.
Hypothesis Testing

How Hypothesis Testing Works

● In hypothesis testing, an analyst tests a


statistical sample, with the goal of
providing evidence on the plausibility of
the null hypothesis.
● Statistical analysts test a hypothesis by
measuring and examining a random
sample of the population being
analyzed. Image Source: https://miro.medium.com/max/642/0*EsLPwer1RYCSczmw
Hypothesis Testing

Types of Hypothesis Testing

● There are two types of hypotheses

in hypothesis testing:
i. Null hypothesis
ii. Alternative hypothesis

Image Source:
https://www.analyticssteps.com/backend/media/thumbnail/6735922/4237247_1626434645_H
OTHESIS%20TESTINGArtboard%201.jpg
Hypothesis Testing

Types of Hypothesis Testing

● Null Hypothesis: It is denoted by H0. A


null hypothesis is the one in which
sample observations result purely from
chance. This means that the
observations are not influenced by some
non-random cause.
Image Source:
https://www.analyticssteps.com/backend/media/thumbnail/6735922/4237247_1626434645_H
OTHESIS%20TESTINGArtboard%201.jpg
Hypothesis Testing

Types of Hypothesis Testing

● Alternative Hypothesis: It is denoted


by Ha or H1. An alternative hypothesis
is the one in which sample observations
are influenced by some non-random
cause. A hypothesis test concludes
whether to reject the null hypothesis and
accept the alternative hypothesis or to
fail to reject the null hypothesis. The
decision is based on the test statistic (X) and
the region of acceptance (R).
Image Source: https://www.analyticssteps.com/backend/media/thumbnail/6735922/4237247_1626434645_HYPOTHESIS%20TESTINGArtboard%201.jpg
Hypothesis Testing

Steps in Hypothesis Testing

● Stating the Hypotheses


● Making Statistical Assumptions
● Formulating an Analysis Plan
● Investigating Sample Data
● Interpreting Results

Image Source: https://miro.medium.com/max/642/0*EsLPwer1RYCSczmw


Hypothesis Testing

Accepting or Rejecting Null


Hypothesis

● This is an extension of the last step -


interpreting results in the process of
hypothesis testing. A null hypothesis is
accepted or rejected basis P value and
the region of acceptance.
● P value – it is a function of the observed
sample results.
Image Source: https://i0.wp.com/www.iedunote.com/img/25143/hypothesis-
testing.png?resize=299%2C158&quality=100&ssl=1
Central limit theorem

What is the Central Limit Theorem


(CLT)?

● The Central Limit Theorem (CLT) is a


statistical concept that states that the
sample mean distribution of a random
variable will assume a near-normal or
normal distribution if the sample size is
large enough.
Image Source: https://cdn.corporatefinanceinstitute.com/assets/Central-Limit-Theorem-CLT-
Diagram-1200x734.png
Central limit theorem

How Does the Central Limit Theorem


Work?

● The central limit theorem forms the


basis of the probability distribution. It
makes it easy to understand how
population estimates behave when
subjected to repeated sampling. When
plotted on a graph, the theorem shows
the shape of the distribution formed by
means of repeated population samples. Image Source: https://cdn.corporatefinanceinstitute.com/assets/Central-Limit-Theorem-CLT-
How-it-works-and-how-it-arises.png
Central limit theorem

How Does the Central Limit Theorem


Work?

● From the figure above, we can deduce


that despite the fact that the original
shape of the distribution was uniform, it
tends towards a normal distribution as
the value of n (sample size) increases.

Image Source: https://cdn.corporatefinanceinstitute.com/assets/Central-Limit-Theorem-CLT-


How-it-works-and-how-it-arises.png
Central limit theorem

How Does the Central Limit Theorem


Work?

● Apart from showing the shape that the


sample means will take, the central limit
theorem also gives an overview of the
mean and variance of the distribution.
The sample mean of the distribution is
the actual population mean from which
the samples were taken.
Image Source: https://cdn.corporatefinanceinstitute.com/assets/Central-Limit-Theorem-CLT-
How-it-works-and-how-it-arises.png
Central limit theorem

How Does the Central Limit Theorem


Work?

● The variance of the sample distribution,


on the other hand, is the variance of the
population divided by n. Therefore, the
larger the sample size of the distribution,
the smaller the variance of the sample
mean.
Image Source: https://cdn.corporatefinanceinstitute.com/assets/Central-Limit-Theorem-CLT-
How-it-works-and-how-it-arises.png
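The behaviour can be simulated with the standard library; a sketch drawing repeated samples from a uniform (non-normal) population and examining the spread of the sample means:

```python
import random

random.seed(42)  # reproducible sketch

def sample_mean(n):
    """Mean of one sample of size n from a Uniform(0, 1) population."""
    return sum(random.uniform(0, 1) for _ in range(n)) / n

n = 30
means = [sample_mean(n) for _ in range(2000)]

grand_mean = sum(means) / len(means)
spread = (sum((m - grand_mean) ** 2 for m in means) / len(means)) ** 0.5

# The sample means cluster around the population mean 0.5, and their
# spread is close to sigma / sqrt(n) = 0.2887 / sqrt(30), about 0.053
print(round(grand_mean, 2), round(spread, 2))
```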
Chi-square test

What Is a Chi-Square Statistic?

● A chi-square (χ2) statistic is a test that


measures how a model compares to
actual observed data. The data used in
calculating a chi-square statistic must be
random, raw, mutually exclusive, drawn
from independent variables, and drawn
from a large enough sample.
Image Source: https://i.ytimg.com/vi/HKDqlYSLt68/maxresdefault.jpg
Chi-square test

What Is a Chi-Square Statistic?

● Chi-square tests are often used in


hypothesis testing.
● The chi-square statistic compares the
size of any discrepancies between the
expected results and the actual results,
given the size of the sample and the
number of variables in the relationship.
Image Source: https://i.ytimg.com/vi/HKDqlYSLt68/maxresdefault.jpg
Chi-square test

The Formula for Chi-Square Is

Image Source: https://i.ytimg.com/vi/HKDqlYSLt68/maxresdefault.jpg
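The χ² statistic sums (O − E)² / E over the categories; a sketch with hypothetical counts for a fair six-sided die:

```python
# Hypothetical observed counts from 60 rolls of a die,
# versus the expected count of 10 per face if the die is fair
observed = [8, 12, 9, 11, 10, 10]
expected = [10, 10, 10, 10, 10, 10]

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 1.0
```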


Chi-square test

What Does a Chi-Square Statistic Tell


You?

There are two main kinds of chi-square tests:


● The test of independence, which asks a
question of relationship, such as, "Is there
a relationship between student sex and
course choice?";

Image Source: https://www.statstest.com/wp-content/uploads/2020/10/Chi-Square-Test-of-


Independence-1.jpg
Chi-square test

What Does a Chi-Square Statistic Tell


You?

● The goodness-of-fit test, which asks


something like "How well does the coin in
my hand match a theoretically fair coin?"
Chi-square analysis is applied to categorical
variables and is especially useful when those
variables are nominal (where order doesn't
matter, like marital status or gender).
Image Source: https://www.statstest.com/wp-content/uploads/2020/09/Chi-Square-Goodness
Fit-Test-1024x622.jpg
Chi-square test

What is a chi-square test used for?

● Chi-square is a statistical test used to


examine the differences between
categorical variables from a random
sample in order to judge goodness of fit
between expected and observed results.

Image Source: https://encrypted-


tbn0.gstatic.com/images?q=tbn:ANd9GcR7Hd2uJDxTmKlfIWnu_U6Hqz4X8J5DRfRewA&usq
CAU
Chi-square test

Who uses chi-square analysis?

● Researchers analyzing categorical data
● Studies involving nominal or ordinal variables

Image Source: https://statswork.com/blog/wp-content/uploads/2019/10/chi-square-


infographics.png
Data Analytics using Python
(30 Hours)
In this section, we will discuss:

● Data mining, wrangling, data manipulation techniques


● Data cleaning and pre-processing techniques
● Data analytics project lifecycle
● Numerical Computing using NumPy Library
● Multidimensional data handling using Pandas Library
● Data Visualization using Matplotlib
● Advanced data visualization using seaborn
● Pandas profiling for report generation
● Need for data visualization
Data mining, wrangling,data
manipulation techniques

Data Mining

● Data Mining is defined as extracting


information from huge sets of data.
● Data mining is a process of extracting
and discovering patterns in large data
sets involving methods at the
intersection of machine learning,
statistics, and database systems.

Image Source: https://www.shutterstock.com/search/data+mining


Data mining, wrangling,data
manipulation techniques

Why is data mining important?

● Data mining tools predict behaviors and


future trends, allowing businesses to
make proactive, knowledge-driven
decisions. Data mining tools can answer
business questions that traditionally
were too time consuming to resolve.

Image Source:
https://slideplayer.com/slide/3900027/13/images/14/Why+is+data+mining+important.jpg
Data mining, wrangling, data
manipulation techniques

Data mining Steps

● Understand Business
● Understand the Data
● Prepare the Data
● Model the Data
● Evaluate the Data
● Deploy the Solution

Image Source: https://www.researchgate.net/figure/The-steps-for-data-mining-process_fig3_344166043


Data mining, wrangling, data
manipulation techniques

Benefits of Data Mining

● It helps companies gather reliable


information.
● It's an efficient, cost-effective solution
compared to other data applications.
● It helps businesses make profitable
production and operational adjustments.
● Data mining uses both new and legacy
systems.
● It helps businesses make informed
Image Source: https://www.includehelp.com/basics/data-mining-introduction-benefits-disadvantages-
decisions. and-applications.aspx
Data mining, wrangling, data
manipulation techniques

Data Mining Tools

● Artificial Intelligence
● Association Rule Learning
● Clustering
● Classification
● Data Analytics
● Data Cleansing and Preparation
● Data Warehousing
● Machine Learning
● Regression
Image Source: https://medium.com/@springboard_ind/top-10-data-mining-tools-b06171d476d0
Data mining, wrangling, data
manipulation techniques

Data Mining Applications

● Retail
● Financial services
● Insurance
● Manufacturing
● Entertainment
● Healthcare

Image Source: http://higssoftware.com/blog/data-mining-process-steps.php/


Data mining, wrangling, data
manipulation techniques

Data Wrangling

● Data wrangling is the process of


cleaning and unifying messy and
complex data sets for easy access and
analysis.
● Data Wrangling is also known as Data
Munging.

Image Source: https://towardsdatascience.com/data-wrangling-raw-to-clean-transformation-


b30a27bf4b3b
Data mining, wrangling, data
manipulation techniques
What is the Purpose of Data
Wrangling?

● The primary purpose of data wrangling


can be described as getting data into a
coherent, usable shape.

Image Source: https://pediaa.com/what-is-the-difference-between-data-wrangling-and-data-


cleaning/
Data mining, wrangling, data
manipulation techniques

Data Wrangling Steps

● Data Discovery
● Data Structuring
● Data Cleaning
● Data Enriching
● Data Validating
● Data Publishing

Image Source: https://favtutor.com/blogs/data-wrangling


Data mining, wrangling, data
manipulation techniques

Data Wrangling Tools

● Excel Power Query / Spreadsheets


● OpenRefine
● Google DataPrep
● Tabula
● DataWrangler
● CSVKit

Image Source: https://www.zohowebstatic.com/sites/default/files/dataprep/import-


sources.png
Data mining, wrangling, data
manipulation techniques

Data Wrangling in Python

● Numpy
● Pandas
● Matplotlib
● Plotly
● Theano

Image Source: https://pythongeeks.org/data-wrangling-in-python-with-examples/


Data mining, wrangling, data
manipulation techniques

Data Manipulation

● Pandas is an open-source python library


that is used for data manipulation and
analysis.
● It provides many functions and methods
to speed up the data analysis process.

Image Source: https://www.astera.com/type/blog/data-manipulation-tools/


Data mining, wrangling, data
manipulation techniques

What are Data Manipulations?

● Data manipulation with Python is a
process that enables users to organize
data so that reading or interpreting the
insights from the data becomes more
structured and better designed.

Image Source: https://etipsguruji.com/data-manipulation-benefits-advantage/


Data mining, wrangling, data
manipulation techniques

Pandas

● Pandas is an open source Python


package that is most widely used for
data science/data analysis and machine
learning tasks.
● It is built on top of another package
named Numpy, which provides support
for multi-dimensional arrays.

Image Source: https://towardsdatascience.com/manipulating-the-data-with-


pandas-using-python-be6c5dfabd47
Data mining, wrangling, data
manipulation techniques
9 Effective Pandas Techniques in
Python for Data Manipulation

● Pivot Table
● Boolean Indexing
● Apply Function
● Crosstab
● Merge DataFrames
● Sorting DataFrames
● Plotting
● Cut function
● Impute Missing Values
Image Source: https://media.geeksforgeeks.org/wp-
content/uploads/20210109131825/Annotation20210109131527-660x225.png
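Two of the techniques above — boolean indexing and pivot tables — can be sketched on a made-up sales table (the column names here are illustrative, not from the slides):

```python
import pandas as pd

# Hypothetical sales data.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "product": ["A", "A", "B", "B"],
    "sales": [100, 150, 200, 250],
})

# Boolean indexing: keep only rows where sales exceed 120.
high = df[df["sales"] > 120]

# Pivot table: mean sales per region/product combination.
pivot = df.pivot_table(values="sales", index="region",
                       columns="product", aggfunc="mean")
print(high)
print(pivot)
```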
Data cleaning and pre-
processing techniques

Data Cleaning

● Data cleaning is the process of fixing or


removing incorrect, corrupted,
incorrectly formatted, duplicate, or
incomplete data within a dataset.

Image Source: https://www.dataversity.net/what-is-data-cleansing/


Data cleaning and pre-
processing techniques
Why data cleaning is
essential?

● Error-Free Data
● Data Quality
● Accurate and Efficient
● Complete Data
● Maintains Data Consistency

Image Source: https://medium.com/@nicklauswinters/data-cleaning-


22b790eea834
Data cleaning and pre-
processing techniques

8 effective data cleaning techniques

● Remove duplicates
● Remove irrelevant data
● Standardize capitalization
● Convert data type
● Clear formatting
● Fix errors
● Language translation
● Handle missing values
Image Source: https://prwatech.in/blog/machine-learning/data-cleaning-in-
machine-learning/
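Several of these techniques can be sketched with pandas on a small made-up dataset — removing duplicates, standardizing capitalization, handling missing values, and converting a data type:

```python
import pandas as pd
import numpy as np

# Hypothetical messy dataset.
raw = pd.DataFrame({
    "name": ["Alice", "alice", "Bob", "Bob"],
    "age": ["25", "25", np.nan, "30"],
})

clean = raw.copy()
clean["name"] = clean["name"].str.title()   # standardize capitalization
clean = clean.drop_duplicates()             # remove duplicates
clean["age"] = clean["age"].fillna("0")     # handle missing values
clean["age"] = clean["age"].astype(int)     # convert data type
print(clean)
```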
Data cleaning and pre-
processing techniques
6 Steps to Manipulate and Cleanse
Data with Python

● Imputing Missing Values


● Outlier and Anomaly Detection
● X-Variable Cleaning Methods
● Y-Variable Cleaning Methods
● Merging DataFrames
● Parsing Dates

Image Source: https://brightdata.com/blog/how-tos/data-manipulation-cleaning-python


Data cleaning and pre-
processing techniques

Data Pre-processing

● Data pre-processing is a data mining


technique which is used to transform the
raw data in a useful and efficient format.

Image Source: https://medium.datadriveninvestor.com/data-preprocessing-3cd01eefd438


Data cleaning and pre-
processing techniques

Data Pre-processing Steps

● Data quality assessment


● Data cleaning
● Data transformation
● Data reduction

Image Source: https://www.techtarget.com/searchdatamanagement/definition/data-preprocessing


Data cleaning and pre-
processing techniques
Steps involved in data pre-
processing

● Importing the required Libraries


● Importing the data set
● Handling the Missing Data.
● Encoding Categorical Data.
● Splitting the data set into test set and
training set.
● Feature Scaling.

Image Source: https://brain-mentors.com/wp-content/uploads/2020/05/Untitled-picture-9-


1024x325.png
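A minimal sketch of these steps using only pandas/NumPy on a made-up dataset (real projects often use scikit-learn helpers such as train_test_split and StandardScaler for the last two steps):

```python
import pandas as pd
import numpy as np

# Hypothetical dataset.
df = pd.DataFrame({
    "age": [22, 25, np.nan, 38],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "purchased": [0, 1, 0, 1],
})

# Handle missing data: impute age with the column mean.
df["age"] = df["age"].fillna(df["age"].mean())

# Encode categorical data as dummy (one-hot) columns.
df = pd.get_dummies(df, columns=["city"])

# Split into training and test sets (here: last row held out).
train, test = df.iloc[:-1], df.iloc[-1:]

# Feature scaling: standardize age using training statistics.
mean, std = train["age"].mean(), train["age"].std()
train = train.assign(age=(train["age"] - mean) / std)
print(train.shape, test.shape)
```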
Data analytics project
lifecycle

Project lifecycle

● Domain/Business Understanding
● Data collection/Data Exploration
● Data Cleaning
● Feature Engineering
● Modelling

Image Source: https://pingax.com/understanding-data-analytics-project-life-cycle/


Data analytics project
lifecycle

Project lifecycle

● Evaluation
● Optimization and Tuning
● Deploy the Model to production
● Monitor the performance during
Production

Image Source: https://www.northeastern.edu/graduate/blog/data-analysis-project-lifecycle/


Numerical Computing using
NumPy Library

What is Numerical Computing?

● Numerical computing is an approach for


solving complex mathematical problems
using only simple arithmetic operations.
● Numerical Python has a fixed-size,
homogeneous (fixed-type), multi-
dimensional array type and lots of
functions for various array operations.

Image Source:
https://slideplayer.com/slide/4519120/15/images/4/What+is+numerical+computing.jpg
Numerical Computing using
NumPy Library

What is NumPy in Python?

● NumPy is an open-source library


available in Python, which helps in
mathematical, scientific, engineering,
and data science programming.
● It is a very useful library to perform
mathematical and statistical operations
in Python.

Image Source: https://favtutor.com/blogs/numpy-vs-pandas


Numerical Computing using
NumPy Library

Why use NumPy?

● NumPy is memory efficient, meaning it
can handle vast amounts of data more
easily than most other libraries.

Image Source: https://indianaiproduction.com/python-numpy-tutorial/


Numerical Computing using
NumPy Library

Creating Arrays

● The array object in NumPy is called


ndarray.
● We can create a NumPy ndarray object
by using the array() function.

Image Source: https://medium.com/analytics-vidhya/introduction-to-numpy-


16a6efaffdd7
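For example, the array() function can build one- and two-dimensional ndarray objects:

```python
import numpy as np

# Create ndarray objects with the array() function.
a = np.array([1, 2, 3, 4])          # 1-D array
m = np.array([[1, 2], [3, 4]])      # 2-D array

print(type(a))        # the NumPy array type is called ndarray
print(a.ndim, m.ndim) # number of dimensions
print(m.shape)        # rows x columns
```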
Numerical Computing using
NumPy Library

Random Numbers using NumPy

NumPy offers the random module to work


with random numbers.

● rand()
● randint()
● choice()

Image Source: https://www.w3resource.com/python-exercises/numpy/python-numpy-random-exercise-


2.php
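A short sketch of these three functions from the random module (seeding first so the run is reproducible):

```python
import numpy as np

np.random.seed(42)                      # seed for reproducibility

u = np.random.rand(3)                   # 3 floats uniform in [0, 1)
i = np.random.randint(1, 7, size=5)     # 5 dice rolls from 1..6
c = np.random.choice(["red", "green", "blue"])  # pick one element

print(u, i, c)
```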
Numerical Computing using
NumPy Library

Indexing and Slicing in Python

● Array indexing is the same as accessing


an array element.
● Slicing in python means taking elements
from one given index to another given
index.

Image Source: https://www.pythoninformer.com/python-libraries/numpy/index-


and-slice/
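For example, indexing accesses a single element, while slicing takes a range of elements (the end index is exclusive):

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

# Indexing: access single elements (0-based; -1 is the last element).
first, last = arr[0], arr[-1]

# Slicing: take elements from one index to another [start:end].
middle = arr[1:4]

print(first, last, middle)
```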
Numerical Computing using
NumPy Library

Statistical Functions in Python


● median: This will return the median
along the specified axis.
● average: This will return the weighted
average along the specified axis.
● mean: This will return the arithmetic
mean along the specified axis.
● std: This will return the standard
deviation along the specified axis.
● var: This will return the variance along
the specified axis.
Image Source:
https://d3mxt5v3yxgcsr.cloudfront.net/courses/3391/course_3391_image.jpg
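The five functions above can be tried on a small array (with unweighted data, average() equals mean(); std() and var() default to the population versions):

```python
import numpy as np

data = np.array([2, 4, 4, 4, 5, 5, 7, 9])

print(np.median(data))   # 4.5
print(np.average(data))  # 5.0 (unweighted average equals the mean)
print(np.mean(data))     # 5.0
print(np.std(data))      # 2.0 (population standard deviation)
print(np.var(data))      # 4.0
```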
Numerical Computing using
NumPy Library

Matrix Multiplication in Python

● The Numpy matmul() function is used to


return the matrix product of 2 arrays.

Image Source: https://www.javatpoint.com/numpy-matrix-multiplication
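For example, matmul() on two 2x2 arrays (equivalent to the @ operator):

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[5, 6],
              [7, 8]])

# matmul() returns the matrix product of the two arrays.
product = np.matmul(a, b)   # same as a @ b
print(product)
```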


Multidimensional data
handling using Pandas
Library
What is PANDAS

● PANDAS (PANel Data) is a high-level
data manipulation tool used for
analyzing data.
● It is very easy to import and export data
using the Pandas library, which has a
very rich set of functions.

Image Source:https://cdn.educba.com/academy/wp-
content/uploads/2019/04/What-is-Pandas-1.jpg
Multidimensional data
handling using Pandas
Library
What is PANDAS

● Pandas have three important data


structures, namely- Series, DataFrame,
and Panel to make the process of
analyzing data organized, effective and
efficient.

Image Source:https://cdn.educba.com/academy/wp-
content/uploads/2019/04/What-is-Pandas-1.jpg
Multidimensional data
handling using Pandas
Library
Data Structure in Pandas

Pandas deals with 3 data structure


● Series
● Data Frame
● Panel

Image Source: https://www.askpython.com/wp-content/uploads/2020/02/Python-


Pandas-Module.png
Multidimensional data
handling using Pandas
Library
Series

● Series is a one-dimensional array like


structure with homogeneous data, which
can be used to handle and manipulate
data.

Image Source:https://1.bp.blogspot.com/-
mYFbQb6dNWo/Xyq3MiVrWMI/AAAAAAAAsB8/ayRGqtckDEIyctuow7M65ezNtM
IdqeH7ACPcBGAYYCw/s437/Pandas%2Bseries.png
Multidimensional data
handling using Pandas
Library
Series

It has two parts


● Data part (An array of actual data)
● Associated index with data (associated
array of indexes or data labels)

Image Source:https://1.bp.blogspot.com/-
mYFbQb6dNWo/Xyq3MiVrWMI/AAAAAAAAsB8/ayRGqtckDEIyctuow7M65ezNtM
IdqeH7ACPcBGAYYCw/s437/Pandas%2Bseries.png
Multidimensional data
handling using Pandas
Library
Creation of Series

There are different ways in which a series


can be created in Pandas.
● Creation of Series from Scalar Values
● Creation of Series from NumPy Arrays
● Creation of Series from Dictionary

Image Source: https://www.datasciencemadesimple.com/wp-


content/uploads/2020/05/create-series-in-python-pandas-0.png
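The three creation methods above can be sketched as:

```python
import pandas as pd
import numpy as np

# From a scalar value (an index must be supplied to set the length).
s1 = pd.Series(7, index=["a", "b", "c"])

# From a NumPy array.
s2 = pd.Series(np.array([10, 20, 30]))

# From a dictionary (keys become the labelled index).
s3 = pd.Series({"x": 1, "y": 2})

print(s1, s2, s3, sep="\n")
```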
Multidimensional data
handling using Pandas
Library
Accessing Elements of a Series

There are two common ways for accessing


the elements of a series: Indexing and
Slicing.
● Indexing
● Slicing

Image Source: https://www.log2base2.com/images/general/python-list-index.png


Accessing Elements of a
Series

Indexing

● Indexing in Series is similar to that for


NumPy arrays, and is used to access
elements in a series.
● Indexes are of two types: positional
index and labelled index.
● Positional index takes an integer value
that corresponds to its position in the
series starting from 0, whereas labelled
index takes any user-defined label as
index. Image Source: https://www.log2base2.com/images/general/python-list-index.png
Accessing Elements of a
Series

Slicing

● Sometimes, we may need to extract a


part of a series.
● This can be done through slicing. This is
similar to slicing used with NumPy
arrays.
● We can define which part of the series is
to be sliced by specifying the start and
end parameters [start :end] with the
series name.
Image Source: https://www.log2base2.com/images/general/python-list-index.png
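Both access methods can be illustrated on a small labelled series (iloc is used here for the positional index):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Positional index vs labelled index.
by_position = s.iloc[0]   # first element
by_label = s["c"]         # element with label "c"

# Slicing with [start:end]; positional slices exclude the end position.
part = s.iloc[1:3]        # elements at positions 1 and 2

print(by_position, by_label, part.tolist())
```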
Data Visualization using
Matplotlib

Why are visualizations important?

● Visualizations are the easiest way to


analyze and absorb information.
● Visuals help to easily understand the
complex problem.
● They help in identifying patterns,
relationships, and outliers in data.

Image Source:https://i.ytimg.com/vi/VyhLRJVoIrI/maxresdefault.jpg
Data Visualization using
Matplotlib

Matplotlib

● Matplotlib is a 2-D plotting library that


helps in visualizing figures.
● Matplotlib emulates MATLAB-like
graphs and visualizations.
● MATLAB is not free, is difficult to scale,
and is tedious as a programming
language.

Image Source: https://matplotlib.org/3.1.1/_images/sphx_glr_anatomy_001.png


Data Visualization using
Matplotlib

Installing Matplotlib

● Type !pip install matplotlib in a Jupyter
Notebook cell; if that doesn't work, in
cmd type conda install -c conda-forge
matplotlib.
● This should work in most cases.

Image Source:https://i.ytimg.com/vi/Iq9f2bQJOPg/maxresdefault.jpg
Data Visualization using
Matplotlib

Things to follow

● Plotting of Matplotlib is quite easy.


Generally, while plotting they follow the
same steps in each and every plot.
● Matplotlib has a module called pyplot
which aids in plotting figure.
● The Jupyter notebook is used for
running the plots.

Image Source:https://i.ytimg.com/vi/Iq9f2bQJOPg/maxresdefault.jpg
Data Visualization using
Matplotlib

Histogram

● A histogram takes in a series of data


and divides the data into a number of
bins.
● It then plots the frequency data points in
each bin (i.e. the interval of points).
● It is useful in understanding the count of
data ranges.

Image Source:https://i.stack.imgur.com/TBlN1.png
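A minimal histogram sketch with made-up data (the Agg backend is selected so the script also runs without a display):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt

data = [1, 1, 2, 2, 2, 3, 5, 6, 8, 9]

# Divide the data into 3 bins and plot the frequency per bin.
counts, bins, _ = plt.hist(data, bins=3)
plt.xlabel("value")
plt.ylabel("frequency")
plt.savefig("histogram.png")
print(counts, bins)
```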
Data Visualization using
Matplotlib

Pie chart

● It is a circular plot which is divided into


slices to illustrate numerical proportion.
The slice of a pie chart is to show the
proportion of parts out of a whole.
● When to use: a pie chart should be
used seldom, as it is difficult to compare
sections of the chart. A bar plot is used
instead, as comparing sections is easy.

Image Source:https://media.geeksforgeeks.org/wp-
content/uploads/20200426195330/plot28.png
Data Visualization using
Matplotlib

Time Series by line plot

● Time series is a line plot and it is


basically connecting data points with a
straight line.
● It is useful in understanding the trend
over time.
● It can explain the correlation between
points by the trend.
● An upward trend means positive
correlation and downward trend means
a negative correlation.
Image Source: https://pandas.pydata.org/pandas-
docs/version/0.14.0/_images/frame_plot_basic.png
Data Visualization using
Matplotlib

Violin plot

● Violin plot is a better chart than boxplot


as it gives a much broader
understanding of the distribution.
● It resembles a violin and dense areas
point the more distribution of data
otherwise hidden by box plots

Image
Source:https://miro.medium.com/max/1204/1*J9OnuX8f5BjlB3XZiyHkVA.png
Boxplot and Violinplot

Boxplot

● Boxplot gives a nice summary of the


data.
● It helps in understanding our distribution
better.

Image
Source:https://miro.medium.com/max/482/1*fCE_5juz235c6cmaOP_PDQ.png
Data Visualization using
Matplotlib

TwinAxis

● Twin axes help in visualizing two plots
with separate y-axes that share the
same x-axis.

Image Source:
https://miro.medium.com/max/1400/1*dHwnU6ySte5aqNZYezQAgQ.png
Data Visualization using
Matplotlib

Stack Plot and Stem Plot

● Stack plot visualizes data in stacks and


shows the distribution of data over time.
● A stem plot also accepts negative
values, so the difference of the data can
be taken and plotted over time.

Image Source:https://media.geeksforgeeks.org/wp-
content/uploads/20190429193409/plt.jpg
Data Visualization using
Matplotlib

Bar Plot

● Bar Plot shows the distribution of data


over several groups.
● It is commonly confused with a
histogram which only takes numerical
data for plotting.
● It helps in comparing multiple numeric
values.

Image Source:https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-
content/uploads/2020/04/data-visualisation-bar-charts-in-python-pandas-1-
1024x441.png
Data Visualization using
Matplotlib

Scatter Plot

● Scatter plot helps in visualizing 2


numeric variables.
● It helps in identifying the relationship of
the data with each variable i.e
correlation or trend patterns.
● It also helps in detecting outliers in the
plot.

Image
Source:https://files.realpython.com/media/scatter_plot_drinks_2.715371f88408.p
Data Visualization using
Matplotlib

3D Scatterplot

● 3D Scatterplot helps in visualizing 3


numerical variables in a three-
dimensional plot.

Image Source:https://media.geeksforgeeks.org/wp-
content/uploads/20200504194233/plot119.png
Advanced data visualization
using seaborn

What is Seaborn?

● Matplotlib is the king of Python data


visualization libraries and makes it a
breeze to explore tabular data visually.
● Seaborn is another Python data
visualization library built on top of
Matplotlib that introduces some features
that weren’t previously available, and, in
this tutorial, we’ll use Seaborn.

Image Source: https://seaborn.pydata.org/_images/introduction_1_0.png


Advanced data visualization
using seaborn
Installing the libraries and loading the
data

We will start by installing the libraries and


importing our data. Running the below
command will install the Pandas, Matplotlib,
and Seaborn libraries for data visualization:
● pip install pandas matplotlib seaborn

Image Source:
https://jovian.ai/api/gist/0503237bdcde46c4bd46c58843edf06b/preview/c9e3b2b3
739a42c8b1298e3fb281d1af
Advanced data visualization
using seaborn
Installing the libraries and loading the
data

Now, let’s import the libraries under their


standard aliases:

● import matplotlib.pyplot as plt


● import pandas as pd
● import seaborn as sns

Image Source:
https://jovian.ai/api/gist/0503237bdcde46c4bd46c58843edf06b/preview/c9e3b2b3
739a42c8b1298e3fb281d1af
Advanced data visualization
using seaborn
Performing univariate analysis with
Seaborn

● The goal of EDA is simple — get to


know your dataset at the deepest level
possible.
● Becoming intimate with the data and
learning its relationships between its
variables is an absolute must.

Image
Source:https://miro.medium.com/max/460/1*3AATuqbjcp3xMiTkuWwNPw.png
Advanced data visualization
using seaborn

Creating histograms in Seaborn

Now, we create our first plot, which is a


histogram:
● sns.histplot(x=sample["price"])

Image Source:https://blog.logrocket.com/wp-content/uploads/2021/11/histogram-
count.png
Advanced data visualization
using seaborn

Creating count plots in Seaborn

The most common plot for categorical


features is a countplot. Passing the name of
a categorical feature in our dataset to
Seaborn’s countplot draws a bar chart, with
each bar height representing the number of
diamonds in each category.
● sns.countplot(sample["cut"])

Image Source:https://blog.logrocket.com/wp-content/uploads/2021/11/dataset-
color.png
Advanced data visualization
using seaborn
Performing bivariate analysis with
Seaborn

● Now, let’s look at the relationships


between two variables at a time.
● Let’s start with the connection between
diamond carats and price.

Image
Source:https://miro.medium.com/max/1400/1*cIOI1p56qlHaLAbb6MxrKg.png
Pandas profiling for report
generation

What is Pandas Profiling

● Pandas Profiling is an open-source


python library, which allows you to do
your EDA very quickly.
● By the way, it also generates an
interactive HTML report, which you can
show to anyone.

Image Source:http://www.leehonan.com/content/images/2018/10/Screen-Shot-
2018-10-28-at-5.43.26-pm.png
Pandas profiling for report
generation

What is Pandas Profiling

These are some of the things you get in your


report:
● Type inference
● Essentials
● Quantile statistics
● Descriptive statistics
● Most frequent values
● Histogram

Image Source:http://www.leehonan.com/content/images/2018/10/Screen-Shot-
2018-10-28-at-5.43.26-pm.png
Pandas profiling for report
generation

How to install Pandas Profiling

First of all, you need to install the package.


#installing Pandas Profiling
● !pip install https://github.com/pandas-
profiling/pandas-
profiling/archive/master.zip -q

Image
Source:https://miro.medium.com/max/1400/1*qoT2rIJyQ7qD_aCE9naqGw.jpeg
Pandas profiling for report
generation

How to install Pandas Profiling

● Then import both pandas and


panda_profiling.
● We will be using the Titanic dataset to
complete our analysis so import that as
well.
● After you import it, you should always
take a look at your dataset and then
merely link report to it:

Image
Source:https://miro.medium.com/max/1400/1*qoT2rIJyQ7qD_aCE9naqGw.jpeg
Need for data visualization

What is Data Visualization?

● Data visualization gives us a clear idea


of what the information means by giving
it visual context through maps or graphs.
● This makes the data more natural for the
human mind to comprehend and
therefore makes it easier to identify
trends, patterns, and outliers within large
data sets.

Image Source:https://venngage-
wordpress.s3.amazonaws.com/uploads/2020/06/What-is-Data-Visualization-Blog-
Need for data visualization

Why is Data Visualization Important?

● Data visualization uses visual data to


communicate information in a manner
that is universal, fast, and effective.
● This practice can help companies
identify which areas need to be
improved, which factors affect customer
satisfaction and dissatisfaction, and
what to do with specific products.

Image Source:https://i.ytimg.com/vi/VyhLRJVoIrI/maxresdefault.jpg
Need for data visualization

What Are The Benefits of Data


Visualization?

● Data visualization positively affects an


organization’s decision-making process
with interactive visual representations of
data.
● Businesses can now recognize patterns
more quickly because they can interpret
data in graphical or pictorial forms.

Image Source: https://cdn.slingshotapp.io/wp-content/uploads/2021/09/slingshot-


Benefits-of-Data-Visualization-768x505.png
Need for data visualization

Which Data Visualization Techniques


are Used?

● There are many different methods of


putting together information in a way
that the data can be visualized.
● Depending on the data being modeled,
and what its intended purpose is, a
variety of different graphs and tables
may be utilized to create an easy to
interpret dashboard.

Image Source:https://cdn.whatagraph.com/blog/data-visualisation-techniques.png
Need for data visualization

Who Uses Data Visualization?

● Data visualization is used across all


industries to increase sales with existing
customers and target new markets and
demographics for potential customers.

Image Source:https://neilpatel.com/wp-content/uploads/2021/03/Data-
Visualization_Featured-Image-1.png
Fundamentals of Predictive
Analytics using Machine
Learning techniques (40 hr)
In this section, we will discuss:

● Machine learning and its types & applications


● Supervised machine learning techniques
● Classification vs. regression
● Understanding Regression and types
● Linear regression using OLS
● Multi-Variate Linear Regression
● Correlation concepts
● Metrics- Loss function, MSE, RMSE, MAE, R2 Score
Machine learning and its types
& Applications
Machine learning

An AI approach that uses a system capable of
learning from experience without having to be
explicitly programmed.

Image Source:
https://static.javatpoint.com/tutorial/machine-learning/images/introduction-to-machine-
learning.png
Machine learning and its types
& Applications
How does Machine Learning Work?

Learns from historical data, builds the prediction


models, and whenever it receives new data,
predicts the output for it.

Image Source:
https://static.javatpoint.com/tutorial/machine-learning/images/introduction-to-machine-
learning2.png
Machine learning and its types
& Applications
Applications

• Image Recognition
• Speech Recognition
• Traffic prediction
• Product recommendations
• Self-driving cars
• Email Spam and Malware Filtering
• Virtual Personal Assistant
• Online Fraud Detection
• Stock Market trading
• Medical Diagnosis
• Automatic Language Translation

Image Source: https://www.javatpoint.com/machine-learning


Machine learning and its types
& Applications
Types

• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/machine-


learning-algorithms.png
Machine learning and its types
& Applications
Supervised Learning

• It is the process of an algorithm learning from a
training dataset.
• Using input and output variables, an algorithm
learns the mapping function from the input to the
output.

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/supervised-


machine-learning.png
Machine learning and its types
& Applications
Unsupervised Learning

• Modeling the underlying or hidden structure or


distribution in the data in order to learn more
about the data.
• Only have input data and no corresponding
output variables.

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/unsupervised-machine-


learning-1.png
Machine learning and its types
& Applications
Reinforcement Learning

• The model keeps on improving its performance
using reward feedback to learn the behavior or
pattern.
• Markov decision process, Bellman’s equation,
Q-learning, SARSA (state-Action-Reward-State-
Action), Deep Q-network

Image Source: https://static.javatpoint.com/tutorial/reinforcement-learning/images/what-is-


reinforcement-learning.png
Supervised machine learning
techniques
Classification

• Classification is a process of finding a function


which helps in dividing the dataset into classes
based on different parameters.
• The output is a discrete value.

Image Source: https://media.geeksforgeeks.org/wp-content/uploads/supervised-data.png


Supervised machine learning
techniques
Regression

• Regression analysis is a statistical method to
model the relationship between a dependent
(target) variable and one or more independent
(predictor) variables.
• The output is a continuous value.

Image Source: https://media.geeksforgeeks.org/wp-content/uploads/supervised-data.png


Classification vs. Regression

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/regression-vs-


classification-in-machine-learning.png
Understanding Regression and
Types
Types

• Linear Regression
•Logistic Regression
•Polynomial Regression
•Support Vector Regression
•Decision Tree Regression
•Random Forest Regression
•Ridge Regression
•Lasso Regression

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/types-of-


regression.png
Linear regression using OLS
Ordinary Least Squares

• Ordinary least squares, or linear least squares,


estimates the parameters in a regression model
by minimizing the sum of the squared residuals.
• This method draws a line through the data
points that minimizes the sum of the squared
differences between the observed values and the
corresponding fitted values.

Image Source: https://statisticsbyjim.com/glossary/ordinary-least-


squares/#:~:text=Ordinary%20least%20squares%2C%20or%20linear,and%20the%20corresponding%20fitted
%20values
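A minimal OLS sketch using NumPy's lstsq, which minimizes the sum of squared residuals; the toy data is generated from y = 2x + 1, so the fitted coefficients are known in advance:

```python
import numpy as np

# Fit y = b0 + b1*x by ordinary least squares.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes squared residuals

intercept, slope = beta
print(intercept, slope)
```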
Understanding Regression and
Types
Linear Regression

• Models the relationship between a dependent
(y) variable and one or more independent (x)
variables.

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/linear-


regression-in-machine-learning.png
Understanding Regression and
Types
Linear and Multiple Regression

•Multiple regression is a broader class of


regressions that encompasses linear and
nonlinear regressions with multiple explanatory
variables.
•Whereas linear regression has only one
independent variable impacting the slope of the
relationship, multiple regression incorporates
multiple independent variables.

Image Source: https://static.javatpoint.com/tutorial/machine-learning/images/linear-


regression-in-machine-learning.png
Metrics
Loss Function

A function that calculates the loss for a single
data point is called the loss function.

Image Source: https://miro.medium.com/max/1400/0*D88aae6e_CG35R1p.png


Metrics

Mean Squared Error (MSE) / Mean


Squared Deviation (MSD)

• It basically calculates the difference between


the estimated and the actual value, squares
these results and then computes their average.
• MSE can only assume non-negative values.
• ŷᵢ→ Predicted, yᵢ→ Actual

Image Source: https://miro.medium.com/max/682/0*omsdYpfRvm3ALVaO


Metrics

Root Mean Squared Error (RMSE) / Root


Mean Squared Deviation (RMSD)

• RMSE calculates the average of the squared


errors across all samples but, in addition, takes
the square root of the result.

Image Source: https://miro.medium.com/proxy/0*d3UTunsQ5djTl-4G


Metrics

Mean Absolute Error (MAE)

• It simply calculates the absolute value of the


errors and then takes the average of these
values.

Image Source: https://miro.medium.com/max/650/0*sTJz5_kSQua4iNE8


Metrics

R Squared (R²) / Coefficient of


Determination

• R Squared (R²) represents the proportion of the


variance for the dependent variable y that’s
explained by the independent variables X.
• R² explains to what extent the variance of one
variable explains the variance of the second
variable.
• If the R² of a model is 0.75, then approximately
75% of the observed variation can be explained by
the model's features.

Image Source: https://miro.medium.com/max/1400/0*5-k42pnh0PpJhw-E
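All four metrics can be computed directly from their definitions with NumPy on made-up predictions:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_pred
mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error
mae = np.mean(np.abs(errors))     # mean absolute error

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, rmse, mae, r2)
```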


In this section, we will discuss:

● Residuals in Regression
● Polynomial features
● Classification techniques
● Types of distance metrics
● KNN Classification
● Gradient Descent
Residuals in Regression

What is it?

● A residual is the vertical distance


between a data point and the regression
line.
● Each data point has one residual.

Image Source: nws.noaa.gov


Residuals in Regression

Types of residual

● Positive if they are above the regression


line,
● Negative if they are below the
regression line,
● Zero if the regression line actually
passes through the point

Image Source: nws.noaa.gov


Residuals in Regression

More about residual

● As residuals are the difference between


any data point and the regression line,
they are sometimes called “errors.”

● In other words, the residual is the error


that isn’t explained by the regression
line.

Image Source: nws.noaa.gov


Residuals in Regression

More about residual

● The residual (e) can also be expressed
with an equation:

● Residual = Observed value – Predicted value
e = y – ŷ

Image Source: nws.noaa.gov
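The residual equation e = y – ŷ translates directly to code (made-up observed and predicted values):

```python
import numpy as np

# Residual = observed value - predicted value (e = y - ŷ)
y_observed = np.array([2.0, 4.1, 5.9, 8.2])
y_predicted = np.array([2.1, 4.0, 6.0, 8.0])

residuals = y_observed - y_predicted
print(residuals)   # positive above the line, negative below, zero on it
```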


Polynomial features

● Polynomial features are those features created by raising existing features to


an exponent.

● Polynomial features are a type of feature engineering, i.e. the creation of new
input features based on the existing features.

● The “degree” of the polynomial is used to control the number of features


added, e.g. a degree of 3 will add two new variables for each input variable.
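For example, scikit-learn's `PolynomialFeatures` generates these terms automatically; with a single input variable and degree 3, it adds x² and x³ alongside x:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0], [3.0]])  # one input feature
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)  # columns: x, x^2, x^3

print(X_poly)
```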
Classification techniques

What Is Classification?

● Classification is the process of


recognizing, understanding, and
grouping ideas and objects into preset
categories or “sub-populations.”

● Classification algorithms in machine


learning use input training data to
predict the likelihood that subsequent
data will fall into one of the
predetermined categories.
Image Source: https://www.javatpoint.com/classification-algorithm-in-machine-learning
Classification techniques

Popular Classification Algorithms

● Logistic Regression
● Naive Bayes
● K-Nearest Neighbors
● Decision Tree
● Support Vector Machines

Image Source: https://datamahadev.com/classification-algorithms-explained-in-30-minutes/


Classification techniques
Logistic Regression

● Logistic Regression is a calculation used


to predict a binary outcome: either
something happens, or it does not.

● This can be exhibited as Yes/No,


Pass/Fail, Alive/Dead, etc.

Image Source: https://www.javatpoint.com/logistic-regression-in-machine-learning


Classification techniques
Naive Bayes

● Naive Bayes calculates the possibility of


whether a data point belongs within a
certain category or does not.

● In text analysis, it can be used to


categorize words or phrases as
belonging to a preset “tag”
(classification) or not.
Image Source: https://kdagiit.medium.com/naive-bayes-algorithm-4b8b990c7319
Classification techniques
K-nearest Neighbors

● K-nearest neighbors (k-NN) is a pattern


recognition algorithm that uses training
datasets to find the k closest relatives.

● When k-NN is used in classification, you


calculate to place data within the
category of its nearest neighbour.

Image Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-


learning
Classification techniques
Decision Tree
● A decision tree is a supervised learning
algorithm that is able to order classes on
a precise level.
● It works like a flow chart, separating
data points into two similar categories at
a time from the “tree trunk” to
“branches,” to “leaves,” where the
categories become more finitely similar.
● This creates categories within
categories, allowing for organic
classification with limited human
supervision. Image Source: https://www.javatpoint.com/machine-learning-decision-tree-classification-
algorithm
Classification techniques

Random Forest

● The random forest algorithm is an


expansion of decision tree

● You first construct a multitude of


decision trees with training data, then fit
your new data within one of the trees as
a “random forest.”

Image Source: https://www.javatpoint.com/machine-learning-random-forest-algorithm


Classification techniques

Support Vector Machines

● A support vector machine (SVM) uses


algorithms to train and classify data
within degrees of polarity

● The SVM then assigns a hyperplane


that best separates the tags.

● In two dimensions this is simply a line.

Image Source: https://www.javatpoint.com/machine-learning-support-vector-machine-


algorithm
Distance metrics

What is it?

● A distance measure is an objective


score that summarizes the relative
difference between two objects in a
problem domain.

● Different distance measures must be


chosen and used depending on the
types of the data.

Image Source: https://tuhinmukherjee74.medium.com/different-types-of-distances-used-in-


machine-learning-explained-550e2979752c
Distance metrics

Types

● Following are the 4 most commonly


used distance measures in machine
learning:

● Hamming Distance
● Euclidean Distance
● Manhattan Distance
● Minkowski Distance

Image Source: https://tuhinmukherjee74.medium.com/different-types-of-distances-used-in-


machine-learning-explained-550e2979752c
Distance metrics
Hamming Distance

● Hamming distance calculates the


distance between two binary vectors,
also referred to as binary strings or
bitstrings.

Image Source: https://www.youtube.com/watch?v=7SVSXiWc0-o


Distance metrics
Hamming Distance

● Example :
Suppose there are two strings 1101 1001
and 1001 1101.

11011001 ⊕ 10011101 = 01000100. Since,


this contains two 1s, the Hamming distance,
d(11011001, 10011101) = 2.

Image Source: https://www.youtube.com/watch?v=7SVSXiWc0-o
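The example above can be checked with a small sketch that counts differing positions in two equal-length bitstrings:

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length bitstrings differ."""
    assert len(a) == len(b)
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

print(hamming_distance("11011001", "10011101"))  # 2
```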


Distance metrics
Euclidean Distance
● Euclidean distance calculates the
distance between two real-valued
vectors.

● Euclidean distance is calculated as the


square root of the sum of the squared
differences between the two vectors.

● EuclideanDistance = sqrt(sum for i to N


(v1[i] – v2[i])^2)
Image Source: https://aigents.co/data-science-blog/publication/distance-metrics-for-machine-
learning
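The formula above translates directly into code; a minimal sketch:

```python
import math

def euclidean_distance(v1, v2):
    # square root of the sum of squared differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```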
Distance metrics
Manhattan Distance
● The Manhattan distance, also called the
Taxicab distance or the City Block
distance, calculates the distance
between two real-valued vectors.

● The taxicab name for the measure


refers to the intuition for what the
measure calculates: the shortest path
that a taxicab would take between city
blocks (coordinates on the grid).
Image Source: https://aigents.co/data-science-blog/publication/distance-metrics-for-machine-
learning
Distance metrics
Manhattan Distance

● Manhattan distance is calculated as the


sum of the absolute differences between
the two vectors.

● The Manhattan Distance between two


points (X1, Y1) and (X2, Y2) is given by
|X1 – X2| + |Y1 – Y2|.

Image Source: https://aigents.co/data-science-blog/publication/distance-metrics-for-machine-


learning
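A minimal sketch of the |X1 – X2| + |Y1 – Y2| formula, generalised to vectors of any length:

```python
def manhattan_distance(v1, v2):
    # sum of absolute differences between coordinates
    return sum(abs(a - b) for a, b in zip(v1, v2))

print(manhattan_distance((1, 2), (4, 6)))  # 7
```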
Distance metrics
Minkowski Distance

● Minkowski distance calculates the


distance between two real-valued
vectors.
● It is a generalization of the Euclidean
and Manhattan distance measures and
adds a parameter, called the “order” or
“p”, that allows different distance
measures to be calculated.

Image Source: https://slideplayer.com/slide/5139522/


Distance metrics
Minkowski Distance

● For example:

Given two vectors, vect1 as (4, 2, 6, 8) and
vect2 as (5, 1, 7, 9), their Minkowski
distance for p = 2 is given by
( |4 – 5|² + |2 – 1|² + |6 – 7|² + |8 – 9|² )^(1/2),
which is equal to 2.

Image Source: https://slideplayer.com/slide/5139522/
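A sketch of the general formula; p = 1 recovers the Manhattan distance and p = 2 the Euclidean distance:

```python
def minkowski_distance(v1, v2, p):
    # generalisation of Euclidean (p=2) and Manhattan (p=1) distances
    return sum(abs(a - b) ** p for a, b in zip(v1, v2)) ** (1 / p)

print(minkowski_distance((4, 2, 6, 8), (5, 1, 7, 9), p=2))  # 2.0
```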


KNN Classification
Concept
● K-Nearest Neighbour is one of the
simplest Machine Learning algorithms
based on Supervised Learning
technique.

● K-NN algorithm assumes the similarity


between the new case/data and
available cases and put the new case
into the category that is most similar to
the available categories. Image source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
KNN Classification
Concept
● K-NN algorithm can be used for
Regression as well as for Classification
but it is mostly used for
classification problems.

● It is also called a lazy learner algorithm
because it does not learn from the training set
immediately; instead, it stores the dataset and
performs the classification only when new
data arrives.

Image source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-learning
KNN Classification
Why do we need a K-NN Algorithm?

● With the help of K-NN, we can easily


identify the category or class of a
particular dataset.

● Suppose there are two categories, i.e.,


Category A and Category B, and we
have a new data point x1, so this data
point will lie in which of these categories.
To solve this type of problem, we need a
K-NN algorithm.
Image Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-
learning
KNN Classification
How does K-NN work?

● Step-1: Select the number K of the


neighbors

● Step-2: Calculate the Euclidean distance
from the new data point to the training points

Image Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-


learning
KNN Classification
How does K-NN work?

● Step-3: Take the K nearest neighbors as


per the calculated Euclidean distance.

● Step-4: Among these k neighbors, count


the number of the data points in each
category.

Image Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-


learning
KNN Classification
How does K-NN work?

● Step-5: Assign the new data point to the
category for which the number of
neighbors is maximum.

● Step-6: Our model is ready.

Image Source: https://www.javatpoint.com/k-nearest-neighbor-algorithm-for-machine-


learning
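The steps above are what scikit-learn's `KNeighborsClassifier` performs internally; a minimal sketch on the bundled Iris dataset (the split parameters are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

knn = KNeighborsClassifier(n_neighbors=5)  # Step 1: choose K
knn.fit(X_train, y_train)                  # lazy learner: just stores the data
accuracy = knn.score(X_test, y_test)       # Steps 2-5 happen at prediction time
print(accuracy)
```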
Gradient Descent
Concept

● Gradient descent (GD) is an iterative


first-order optimisation algorithm used to
find a local minimum/maximum of a
given function.

● This method is commonly used in


machine learning (ML) and deep
learning(DL) to minimise a cost/loss
function (e.g. in a linear regression).
Image Source: https://www.javatpoint.com/gradient-descent-in-machine-learning
Gradient Descent
How does Gradient Descent work?

● The equation for simple linear


regression is given as:

● Y=mX+c
● Where 'm' represents the slope of the
line, and 'c' represents the intercepts on
the y-axis.

Image Source: https://www.javatpoint.com/gradient-descent-in-machine-learning


Gradient Descent
What is Cost-function?
● The cost function is defined as the
measurement of difference or error
between actual values and expected
values at the current position and
present in the form of a single real
number.

● It helps to increase and improve


machine learning efficiency by providing
feedback to this model so that it can
minimize error and find the local or
Image Source: https://www.javatpoint.com/cost-function-in-machine-learning
global minimum.
Gradient Descent
The Gradient Descent Algorithm

● Gradient Descent method’s steps are:

1. Choose a starting point (initialisation)
2. Calculate the gradient at this point
3. Make a scaled step in the opposite
direction to the gradient (objective:
minimise)
4. Repeat steps 2 and 3 until one of the
criteria is met:
● Maximum number of iterations reached
● Step size is smaller than the tolerance

Image Source: https://www.javatpoint.com/gradient-descent-in-machine-learning
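The loop above can be sketched in a few lines; here it minimises the hypothetical function f(x) = (x − 3)², whose gradient is 2(x − 3):

```python
def gradient_descent(grad, start, lr=0.1, tol=1e-6, max_iter=1000):
    """Step against the gradient until the step size or iteration budget runs out."""
    x = start
    for _ in range(max_iter):   # stop criterion 1: maximum iterations reached
        step = lr * grad(x)
        if abs(step) < tol:     # stop criterion 2: step smaller than tolerance
            break
        x -= step               # move opposite to the gradient
    return x

minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)  # close to 3.0
```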
In this section, we will discuss:

● Logistic Regression
● Evaluation – Confusion Matrix, Precision, Recall, F1 Score, Accuracy
● Python Library – Sci-kit Learn
Logistic Regression

What is Logistic Regression?

● Logistic regression is a statistical


model that in its basic form uses a
logistic function to model a binary
dependent variable, although many
more complex extensions exist. In
regression analysis, logistic regression
(or logit regression) is estimating the
parameters of a logistic model (a form of
binary regression).
Image Source: https://helloacm.com/wp-content/uploads/2016/03/logistic-
regression-example.jpg
Logistic Regression

Logistic Regression

● Logistic Regression can be used to


classify the observations using different
types of data and can easily determine
the most effective variables used for the
classification. The below image is
showing the logistic function:

Image Source: https://www.javatpoint.com/logistic-regression-in-machine-learning


Logistic Regression

Logistic Function (Sigmoid Function):

• The sigmoid function is a mathematical


function used to map the predicted values
to probabilities.
• It maps any real value into another value
within a range of 0 and 1.

• The S-form curve is called the Sigmoid


function or the logistic function.

Image Source: https://www.javatpoint.com/logistic-regression-in-machine-learning
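A minimal sketch of the sigmoid function; note how extreme inputs are squashed toward 0 or 1:

```python
import math

def sigmoid(z):
    # maps any real value into the open interval (0, 1)
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```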


Logistic Regression

Logistic Regression Equation

● We know the equation of the straight
line can be written as:
y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
● In Logistic Regression, y can be between
0 and 1 only, so we divide the above
equation by (1 − y):
y / (1 − y)
● But we need a range between −∞ and
+∞, so we take the logarithm of the
equation, and it becomes:
log[y / (1 − y)] = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ

Image Source: https://www.javatpoint.com/logistic-regression-in-machine-learning


Logistic Regression

Types of Logistic Regression

● Binomial
● Multinomial
● Ordinal
Logistic Regression

Steps in Logistic Regression

• Data Pre-processing step


• Fitting Logistic Regression to the Training
set
• Predicting the test result
• Test accuracy of the result (Creation of
Confusion matrix)
• Visualizing the test set result.

Note:- Check out theory Manual for Example

Image Source: https://www.javatpoint.com/logistic-regression-in-machine-learning
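The steps above map directly onto scikit-learn; a sketch on the bundled breast-cancer dataset (the dataset, scaler, and split parameters are illustrative choices, not part of the original example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Data pre-processing: load, split, and scale
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
scaler = StandardScaler().fit(X_train)

# Fit Logistic Regression to the training set
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

# Predict the test result and evaluate with a confusion matrix
y_pred = model.predict(scaler.transform(X_test))
cm = confusion_matrix(y_test, y_pred)
accuracy = model.score(scaler.transform(X_test), y_test)
print(cm)
print(accuracy)
```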


Model Performance Metrics
Logistic Regression Evaluation
metrics

● Confusion matrix
● Precision
● Recall
● F1 Score
● Accuracy

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png


Model Performance Metrics

Confusion matrix

● A confusion matrix is a table that is often


used to describe the performance of a
classification model (or “classifier”) on a
set of test data for which the true values
are known.

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png


Model Performance Metrics

Confusion matrix

● TP (True-positives): positive cases that
the model correctly predicts as positive

● TN (True-negatives): negative cases that
the model correctly predicts as negative

● FP (False-positives): negative cases that
the model incorrectly predicts as positive

● FN (False-negatives): positive cases that
the model incorrectly predicts as negative

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png


Model Performance Metrics

Accuracy

● Accuracy is the proximity of measurement
results to the true value. It tells us how
accurately our classification model is able to
predict the class labels given in the
problem statement.

● Accuracy = (TP + TN) / (TP + TN + FP + FN)

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png


Model Performance Metrics

Recall

● Recall/ Sensitivity/ TPR (True Positive


Rate) attempts to answer the following
question:
● What proportion of actual positives was
identified correctly?
● Recall = TP / (TP + FN)
● Recall is generally used in use cases
where the truth-detection is of utmost
importance

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png


Model Performance Metrics

Precision

● Precision attempts to answer the


following question:

● What proportion of positive
identifications was actually correct?

● Precision = TP / (TP + FP)

● Precision is generally used in cases
where it’s of utmost importance not to
have a high number of False positives

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png


Model Performance Metrics

F1 score

● F1 score (also F-score or F-measure)
is a measure of a test’s accuracy. It is the
harmonic mean of the precision p and the
recall r of the test:
F1 = 2 · (p · r) / (p + r)

Image Source: https://miro.medium.com/max/1400/1*tlrYPZgfX9cc1_RCPHPoJg.png
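The four metrics above can be computed either from the confusion-matrix counts or directly with scikit-learn; a sketch with made-up labels:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# hypothetical true and predicted labels, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(tp, tn, fp, fn, accuracy, precision, recall, f1)
```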


Python Library: Sci-Kit Learn

Sci-kit learn

● Open-source ML library for Python. Built


on NumPy, SciPy, and Matplotlib.

● Scikit-learn is a library in Python that


provides many unsupervised and
supervised learning algorithms.

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation


Python Library: Sci-Kit Learn

Functionality of Sci-kit learn

● Regression

● Classification

● Clustering

● Model Selection

● Pre-processing

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation


Python Library: Sci-Kit Learn

Install Sci-kit learn

● Using pip

pip install -U scikit-learn

● Using conda

conda install scikit-learn

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation


Python Library: Sci-Kit Learn

Features Sci-kit learn

● Supervised Learning algorithms


● Unsupervised Learning algorithms
● Clustering
● Cross Validation
● Feature extraction
● Feature selection
● Open Source

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation


Python Library: Sci-Kit Learn

Sci-kit learn – Dataset Loading

A dataset is split into a features matrix and a response vector, each with its own names:

● Features Matrix
● Feature Names
● Response Vector
● Response Names

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation
Python Library: Sci-Kit Learn

Sci-kit learn – Dataset Loading

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data
y = iris.target

feature_names = iris.feature_names
target_names = iris.target_names

print("Feature names:", feature_names)
print("Target names:", target_names)
print("\nFirst 10 rows of X:\n", X[:10])

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation
Python Library: Sci-Kit Learn

Sci-kit learn – Splitting dataset

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation
Python Library: Sci-Kit Learn

Sci-kit learn – Linear Regression

● This supervised ML model is used when
the output variable is continuous and it
follows a linear relation with the dependent
variables. It can be used to forecast
sales in the coming months by analyzing
the sales data for previous months.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# x_train, y_train, x_test, y_test are assumed to come from an earlier train/test split
regression_model = LinearRegression()
regression_model.fit(x_train, y_train)
y_predicted = regression_model.predict(x_test)

# mean_squared_error returns the MSE; take the square root for RMSE
rmse = np.sqrt(mean_squared_error(y_test, y_predicted))
r2 = r2_score(y_test, y_predicted)

Image Source: scikit-learn: machine learning in Python — scikit-learn 1.0.2 documentation
