
Exploratory Data Analysis
Dr. Vinay Chopra
Data analysis
Data analysis is the process of extracting useful information from a
dataset by inspecting, cleansing, transforming, and modelling it.
Methodologies used to do so include:
a) Descriptive Analysis (which provides numerical insight into the
data),
b) Exploratory Analysis (which provides visual insight into the
data),
c) Predictive Analysis (which provides insight into the data based
on historical events), and
d) Inferential Analysis (which provides inferential insight into
the data: this involves getting insight about the population by
obtaining information from a sample).
Data Analysis Types

Data analysis may be separated into four stages depending on the
methodology used:
a) Descriptive Analysis
b) Exploratory Data Analysis
c) Predictive Analysis
d) Inferential Analysis
Descriptive Analysis

Descriptive analysis is a numerical method of extracting information from data. The values of the
numerical variables are summarized in the descriptive analysis.
 Assume you’re looking at sales data from a vehicle company. In a descriptive analysis,
you’ll look for answers to queries like:
a) what are the mean, mode, and median of a car type’s selling price,
b) what was the income generated by selling a specific model of automobile, and so on.
c) Using this form of analysis, we may determine the central tendency and dispersion of the
numerical variables in the data.
 A descriptive analysis can help you gain high-level knowledge of the data and become
acclimated to the data set in most practical data science use cases.
Descriptive Analysis Terminologies

The following are some key descriptive analysis terminologies (a short pandas sketch follows the list):
a) Mean: the average value of the numbers given in the list
b) Mode: the most frequent number in the given list of numbers
c) Median: the middle value of the given list of numbers
d) Standard deviation: a measure of the variation of the given set of values from the mean value
e) Variance: another measure of variation, equal to the square of the standard deviation
f) Interquartile Range (IQR): the range between the 25th and 75th percentiles of a list of numbers
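A minimal sketch of these quantities in pandas (the Series name and values below are made up purely for illustration):

import pandas as pd

prices = pd.Series([12, 15, 15, 18, 21, 30, 45])      # toy selling prices
print(prices.mean())      # mean: average of the values
print(prices.mode()[0])   # mode: most frequent value
print(prices.median())    # median: middle value
print(prices.std())       # standard deviation
print(prices.var())       # variance (square of the standard deviation)
print(prices.quantile(0.75) - prices.quantile(0.25))  # IQR = Q3 - Q1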
Importance of Descriptive
Analysis
 Data visualization is made simple with descriptive statistics.
 It enables data to be presented in a meaningful and intelligible
manner, allowing for a more straightforward understanding of the
data set.
 The analysis of raw data would be laborious, and determining
trends and patterns might be tough.
 Furthermore, raw data makes it difficult to visualize what is
being displayed.
Exploratory Data Analysis:

 In contrast to descriptive data analysis, which is a numerical
approach to data analysis, exploratory data analysis is a visual
approach to data analysis.
 We will turn to exploratory data analysis once we have a basic
comprehension of the data at hand through descriptive analysis.
Exploratory Data Analysis

The exploratory data analysis may alternatively be divided into two
parts:
a) Univariate analysis: analysis of a single variable (exploring the
characteristics of a single variable)
b) Multivariate analysis: analyses using many variables
(comparative analysis of multiple variables; if we compare the
correlation of two variables, it is called bivariate analysis)
Exploratory Data Analysis

We employ numerous types of plots and graphs to analyze data in
the visual style of data analysis.
a) A bar plot,
b) histograms,
c) a box plot with whiskers,
d) a violin plot, and other plots can be used to study a single variable
(univariate analysis).
e) For multivariate analysis, we employ scatter plots, contour plots,
multi-dimensional graphs, and other tools.
Typical data format and the types of
EDA
 The data from an experiment are generally collected into a
rectangular array (e.g., spreadsheet or database), most
commonly with one row per experimental subject and one
column for each subject identifier, outcome variable, and
explanatory variable.
 Each column contains the numeric values for a particular
quantitative variable or the levels for a categorical variable.
 People are not very good at looking at a column of numbers
or a whole spreadsheet and then determining important
characteristics of the data.
 They find looking at numbers to be tedious, boring, and/or
overwhelming.
 Exploratory data analysis is generally cross-classified in two
ways. First, each method is either non-graphical or graphical.
 And second, each method is either univariate or multivariate
(usually just bivariate).
 Non-graphical methods generally involve calculation of
summary statistics, while graphical methods obviously
summarize the data in a diagrammatic or pictorial way.
 Univariate methods look at one variable (data column) at a time,
while multivariate methods look at two or more variables at a time
to explore relationships.
 Usually our multivariate EDA will be bivariate (looking at exactly
two variables), but occasionally it will involve three or more
variables.
 It is almost always a good idea to perform univariate EDA on each
of the components of a multivariate EDA before performing the
multivariate EDA.
 Beyond the four categories created by the above cross-
classification, each of the categories of EDA has further divisions
based on the role (outcome or explanatory) and type (categorical or
quantitative) of the variable(s) being examined.
 Although there are guidelines about which EDA techniques are useful
in what circumstances, there is an important degree of looseness and
art to EDA.
 Competence and confidence come with practice, experience, and
close observation of others.
 Also, EDA need not be restricted to techniques you have seen before;
sometimes you need to invent a new way of looking at your data.
Need of Exploratory Data Analysis:

 Exploratory data analysis provides a visual representation of the
data, which aids in identifying the data’s features more clearly.
 It assists us in determining which characteristics are most
significant, which is very handy when dealing with data that has a
lot of dimensions (i.e., dimensionality reduction is aided by
approaches such as PCA and t-SNE).
 It’s a good technique to communicate the resulting outcome to
non-technical stakeholders and executives.
Exploratory data analysis:
Introduction
Exploratory data analysis or “EDA” is a critical first step in analyzing the data from an
experiment. Exploratory Data Analysis refers to the critical process of performing initial
investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to
check assumptions with the help of summary statistics and graphical representations. Here
are the main reasons we use EDA:
 detection of mistakes
 checking of assumptions
 preliminary selection of appropriate models
 determining relationships among the explanatory variables,
 and assessing the direction and rough size of relationships between explanatory and
outcome variables.
Loosely speaking, any method of looking at data that does not include formal statistical
modeling and inference falls under the term exploratory data analysis.
Meaning of Statistics

 For a layman, ‘Statistics’ means numerical information expressed in
quantitative terms. This information may relate to objects, subjects,
activities, phenomena, or regions of space.
 As a matter of fact, data have no limits as to their reference, coverage, and
scope.
 At the macro level, these are data on gross national product and shares of
agriculture, manufacturing, and services in GDP (Gross Domestic Product).
 At the micro level, individual firms, howsoever small or large, produce
extensive statistics on their operations.
 The annual reports of companies contain a variety of data on sales,
production, expenditure, inventories, capital employed, and other activities.
 These data are often field data, collected by employing scientific survey
techniques.
Meaning of Statistics

 Unless regularly updated, such data are the product of a one-time
effort and have limited use beyond the situation that may have called
for their collection.
 A student knows statistics more intimately as a subject of study like
economics, mathematics, chemistry, physics, and others. It is a
discipline, which scientifically deals with data, and is often described
as the science of data.
 In dealing with statistics as data, statistics has developed
appropriate methods of collecting, presenting, summarizing, and
analysing data, and thus consists of a body of these methods.
Basic Terminologies in Statistics

 Population: A collection or set of individuals, objects, or events whose
properties are to be analyzed.
 Sample: A subset of a population is called a sample. A well-chosen sample will
contain most of the information about a particular population parameter.
 A key question is how one can choose a sample that best represents the entire population.
 E.g., if we want to study the eating habits of teenagers in the U.S., note there are over 42
million teenagers in the U.S.
Exploratory data analysis:
Introduction
 Exploratory Data Analysis (EDA) is an approach to analyze the data using
visual techniques. It is used to discover trends, patterns, or to check
assumptions with the help of statistical summary and graphical representations.
 Dataset used: For simplicity, we will use a single dataset. We
will use the employee data for this.
 It contains 8 columns namely – First Name, Gender, Start Date, Last Login,
Salary, Bonus%, Senior Management, and Team
 Dataset Used: Employees.csv
Exploratory data analysis:
Introduction
 Let’s read the dataset using the
Pandas module and print the 1st five
rows. To print the first five rows we
will use the head() function.
import pandas as pd
import numpy as np
df = pd.read_csv('employees.csv')
df.head()
Exploratory data analysis:
Introduction
 Getting insights about the dataset: df.shape
a) Output: (1000, 8)
b) This means that this dataset has 1000 rows and 8 columns.
 Let’s get a quick summary of the dataset using the describe() method.
a) The describe() function applies basic statistical computations on the dataset,
like extreme values, count of data points, standard deviation, etc.
b) Any missing value or NaN value is automatically skipped. The describe()
function gives a good picture of the distribution of the data.
Exploratory data analysis:
Introduction
 df.describe()
Exploratory data analysis:
Introduction
 Now, let’s also look at the
columns and their
data types. For this,
we will use the info()
method.
 df.info()
Handling Missing Values

 You all must be wondering why a dataset would contain missing values. They
can occur when no information is provided for one or more items or for a
whole unit.
 For example, different users being surveyed may choose not to share
their income, and some users may choose not to share their address; in this way
many values in a dataset end up missing.
 Missing data is a very big problem in real-life scenarios. Missing data can
also be referred to as NA (Not Available) values in pandas.
 There are several useful functions for detecting, removing, and replacing null
values in a Pandas DataFrame:
Handling Missing Values

a) isnull(): The isnull() method returns a DataFrame object where all the values
are replaced with a Boolean value, True for NULL values and otherwise
False.
b) notnull():
i. This function detects existing (non-missing) values in the data frame. The function
returns a Boolean object having the same size as that of the object on which it
is applied, indicating whether each individual value is an NA value or not.
ii. All of the non-missing values get mapped to True and missing values get
mapped to False.
Handling Missing Values

a) dropna(): Sometimes a CSV file has null values, which are later displayed as NaN in a Data
Frame. The Pandas dropna() method allows the user to analyze and drop rows/columns with
null values in different ways.
b) fillna(): Manages and lets the user replace NaN values with some value of their own.
c) replace(): The replace() method returns a copy in which all occurrences of a
given value (for example NaN) are replaced with another value.
d) interpolate(): This is a very powerful function to fill the missing values. It uses various
interpolation techniques to fill the missing values rather than hard-coding the value.
A brief sketch of these functions follows.
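A brief illustration of these functions on a toy DataFrame (the column names and values below are made up for this sketch):

import pandas as pd
import numpy as np

toy = pd.DataFrame({'Salary': [50000, np.nan, 62000],
                    'Team': ['HR', 'IT', None]})
print(toy.isnull())                      # True where a value is missing
print(toy.notnull())                     # True where a value is present
print(toy.dropna())                      # drop rows containing any NaN
print(toy.fillna({'Team': 'No Team'}))   # fill NaN with a chosen value
print(toy['Salary'].replace(np.nan, 0))  # replace NaN with 0
print(toy['Salary'].interpolate())       # estimate the missing salary from its neighbours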
Handling Missing Values

 Now let’s check whether there are any missing
values in our dataset or not.
a) df.isnull().sum()
b) We can see that every column has a
different number of missing values. For example,
Gender has 145 missing values and Salary
has 0.
c) Now, for handling these missing values
there can be several cases, like dropping the
rows containing NaN or replacing NaN
with either the mean, median, mode, or some
other value.
Handling Missing Values

 Now, let’s try to fill the missing
values of gender with the string
“No Gender”.
a) df["Gender"].fillna("No Gender",
inplace = True)
b) df.isnull().sum()
Handling Missing Values

 We can see that now there is no null value for the
gender column. Now, let’s fill the Senior
Management column with the mode value.
a) mode = df['Senior Management'].mode().values[0]
b) df['Senior Management'] = df['Senior
Management'].replace(np.nan, mode)
c) df.isnull().sum()
Handling Missing Values

 Now for the first name and team, we cannot fill the
missing values with arbitrary data, so, let’s drop all
the rows containing these missing values.
a) df = df.dropna(axis = 0, how ='any')
b) print(df.isnull().sum())
c) df.shape
 We can see that our dataset is now free of all the
missing values, and after dropping the data the
number of rows has reduced from 1000 to 899.
Data visualization
After removing the missing data let’s visualize our data. Data Visualization is the process of
analyzing data in the form of graphs or maps, making it a lot easier to understand the trends or
patterns in the data. There are various types of visualizations –
a) Univariate analysis: This type of data consists of only one variable. The analysis of
univariate data is thus the simplest form of analysis since the information deals with only
one quantity that changes. It does not deal with causes or relationships and the main
purpose of the analysis is to describe the data and find patterns that exist within it.
b) Bi-Variate analysis: This type of data involves two different variables. The analysis of
this type of data deals with causes and relationships and the analysis is done to find out
the relationship among the two variables.
c) Multi-Variate analysis: When the data involves three or more variables, it is categorized
under multivariate.
Data visualization

 Let’s see some commonly used graphs –
Note: We will use Matplotlib and Seaborn library for the data
visualization. If you want to know about these modules refer to the
articles –
a) Matplotlib Tutorial
b) Python Seaborn Tutorial
 A histogram is basically used to represent data provided in the form of
some groups.
 It is an accurate method for the graphical representation of a numerical
data distribution.
 It is a type of bar plot where the X-axis represents the bin ranges while the Y-
axis gives information about frequency.
Creating a Histogram

 To create a histogram, the first step is to create bins of the ranges, then
distribute the whole range of the values into a series of intervals, and
count the values which fall into each of the intervals.
 Bins are clearly identified as consecutive, non-overlapping intervals of
variables.
 The matplotlib.pyplot.hist() function is used to compute and create
histogram of x.
Attributes (parameters) of matplotlib.pyplot.hist()

x: array or sequence of arrays
bins: optional parameter; an integer, a sequence, or a string
density: optional parameter; contains Boolean values
range: optional parameter; represents the upper and lower range of the bins
histtype: optional parameter; the type of histogram [bar, barstacked, step, stepfilled]; default is “bar”
align: optional parameter; controls the plotting of the histogram [left, right, mid]
weights: optional parameter; an array of weights having the same dimensions as x
bottom: location of the baseline of each bin
rwidth: optional parameter; the relative width of the bars with respect to the bin width
color: optional parameter; used to set a color or sequence of color specs
label: optional parameter; a string or sequence of strings to match with multiple datasets
log: optional parameter; used to set the histogram axis on a log scale
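A minimal matplotlib sketch using a few of these parameters (the data are random numbers, generated purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=500)   # illustrative data
plt.hist(values, bins=20, range=(20, 80), histtype='bar',
         color='steelblue', rwidth=0.9, label='values')
plt.xlabel('Bin ranges')
plt.ylabel('Frequency')
plt.legend()
plt.show()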


Histogram

 It can be used for both univariate and
bivariate analysis.
a) # importing packages
b) import seaborn as sns
c) import matplotlib.pyplot as plt
d) sns.histplot(x='Salary', data=df, )
e) plt.show()
Box plot

 A box plot, also known as a whisker plot, is created to display a summary of a
set of data values, showing properties like the minimum, first quartile, median,
third quartile and maximum.
 In the box plot, a box is created from the first quartile to the third quartile; a
vertical line through the box marks the median.
 Here the x-axis denotes the data to be plotted while the y-axis shows the frequency
distribution.
Boxplot

 It can also be used for univariate and bivariate analyses.
a) # importing packages
b) import seaborn as sns
c) import matplotlib.pyplot as plt
d) sns.boxplot( x="Salary", y='Team', data=df, )
e) plt.show()
Scatter Plot

 Scatter plots are used to observe relationships between
variables and use dots to represent the relationship
between them.
 The scatter() method in the matplotlib library is used to
draw a scatter plot.
 Scatterplots are widely used to represent relation
among variables and how change in one affects the
other.
The scatter() method takes in the
following parameters:
 x_axis_data- An array containing x-axis data
 y_axis_data- An array containing y-axis data
 s- marker size (can be scalar or array of size equal to size of x or y)
 c- color or sequence of colors for markers
 marker- marker style
 cmap- cmap name
 linewidths- width of marker border
 edgecolor- marker border color
 alpha- blending value, between 0 (transparent) and 1 (opaque)
 Except x_axis_data and y_axis_data all other parameters are optional and their default value is None.
Below are the scatter plot examples with various parameters.
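These parameters belong to matplotlib’s scatter(); a minimal sketch using a few of them on made-up data is shown first, before the seaborn example on the employee dataset:

import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(50)
y = np.random.rand(50)
sizes = 300 * np.random.rand(50)          # s: marker size per point
plt.scatter(x, y, s=sizes, c=y, cmap='viridis', marker='o',
            edgecolors='black', linewidths=0.5, alpha=0.7)
plt.show()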
Scatter Plot
# importing packages
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="Salary", y='Team', data=df,
                hue='Gender', size='Bonus %')

# Placing the legend outside the figure
plt.legend(bbox_to_anchor=(1, 1), loc=2)

plt.show()
For multivariate analysis, we can use the
pairplot() method of the seaborn module. We can
also use it for the multiple pairwise bivariate
distributions in a dataset.
a) # importing packages
b) import seaborn as sns
c) import matplotlib.pyplot as plt
d) sns.pairplot(df, hue='Gender', height=2)
Handling Outliers
 An outlier is a data item/object that deviates
significantly from the rest of the (so-called
normal) objects. Outliers can be caused by
measurement or execution errors. The analysis for
outlier detection is referred to as outlier mining.
There are many ways to detect the outliers, and
the removal process is the same as
removing a data item from the pandas DataFrame.
 Let’s consider the Iris dataset and plot the
boxplot for the SepalWidthCm column.
 # importing packages
 import pandas as pd
 import seaborn as sns
 import matplotlib.pyplot as plt
 # Load the dataset
 df = pd.read_csv('Iris.csv')
 sns.boxplot(x='SepalWidthCm', data=df)
 In the above graph, the values above 4 and below
2 are acting as outliers.
Removing Outliers

 For removing an outlier, one must follow the
same process as removing any other entry from the
dataset, using its exact position in the dataset,
because in all of the above methods of detecting
outliers the end result is a list of all those data
items that satisfy the outlier definition according
to the method used.
 Example: We will detect the outliers using the IQR
and then we will remove them, as in the sketch
below. We will also draw the boxplot to see if the
outliers are removed or not.
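A rough sketch of the IQR approach just described, using the common 1.5 × IQR rule on the SepalWidthCm column (it assumes the Iris DataFrame df loaded above):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('Iris.csv')
q1 = df['SepalWidthCm'].quantile(0.25)
q3 = df['SepalWidthCm'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# keep only the rows whose SepalWidthCm lies inside the fences
df = df[(df['SepalWidthCm'] >= lower) & (df['SepalWidthCm'] <= upper)]
sns.boxplot(x='SepalWidthCm', data=df)   # re-draw to confirm the outliers are gone
plt.show()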
Measures of Central Tendency

The Mean, Median and Mode
 When given a set of raw data one of the most useful ways of summarizing that
data is to find an average of that set of data.
 An average is a measure of the center of the data set. There are three common
ways of describing the center of a set of numbers.
 They are the mean, the median and the mode and are calculated as follows.
 The mean − add up all the numbers and divide by how many numbers there are.
 The median − is the middle number. It is found by putting the numbers in order
and taking the actual middle number if there is one, or the average of the two
middle numbers if not.
 The mode − is the most commonly occurring number
Measures of Central Tendency

 Central tendency describes the tendency of the observations to
bunch around a particular value, or category.
 The mean, median and mode are all measures of central
tendency.
 They are all measures of the ‘average’ of the distribution.
 The best one to use in a given situation depends on the type of
variable given.
 The mean has some advantages over the median as a measure of
central tendency of quantity variables.
 One of them is that all the observed values are used to calculate the
mean.
 However, to calculate the median, while all the observed values are
used in the ranking, only the middle or middle two values are used
in the calculation.
 Another is that the mean is fairly stable from sample to sample.
This means that if we take several samples from the same
population their means are less likely to vary than their medians.
 However, the median is used as a measure of central
tendency if there are a few extreme values observed.
 The mean is very sensitive to extreme values and it may
not be an appropriate measure of central tendency in
these cases.
 This is illustrated in the next example.
Standard Deviation

 Suppose we have a set of data where there is no variability in the
observed values. Each observation would have the same value, say
3, 3, 3, 3 and the mean would be that same value, 3.
 Each observation would not be different or deviate from the mean.
 Now suppose we have a set of observations where there is
variability. The observed values would deviate from the mean by
varying amounts.
 The standard deviation is a kind of average of these deviations from
the mean.
Formulae for Mean and Standard Deviation of a Population
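The formula image from the original slide is not reproduced here; the standard formulas for a population of $N$ values $x_1, \dots, x_N$ are

$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$

i.e., the population mean is the sum of the values divided by $N$, and the population standard deviation is the square root of the average squared deviation from that mean.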
Box Plot: Introduction

 The method to summarize a set of data that is measured using an
interval scale is called a box and whisker plot.
 These are widely used for data analysis. We use these types of
graphs or graphical representations to know:
a) Distribution Shape
b) Central Value of it
c) Variability of it
The Box-plot

 The box-plot is another way of representing a data set
graphically.
 It is constructed using the quartiles, and gives a good indication
of the spread of the data set and its symmetry (or lack of
symmetry).
 It is a very useful method for comparing two or more data sets.
 The box-plot consists of a scale, a box drawn between the first
and third quartile, the median placed within the box, whiskers on
both sides of the box and outliers (if any).
Box Plot

 A box plot is a chart that shows data from a five-number summary, including
one of the measures of central tendency.
 It is primarily used to indicate whether a distribution is skewed or not, and whether there
are potential unusual observations (also called outliers) present in the data set.
 Boxplots are also very beneficial when large numbers of data sets are involved
or compared.
 In simple words, we can define the box plot in terms of descriptive statistics
related concepts.
 That means box or whiskers plot is a method used for depicting groups of
numerical data through their quartiles graphically.
Box Plot

 These may also have some lines (whiskers) extending from the boxes, which
indicate the variability outside the lower and upper quartiles, hence the terms
box-and-whisker plot and box-and-whisker diagram.
 Outliers can be indicated as individual points.
 It helps to find out how much the data values vary or spread out with the help of
graphs.
 As we need more information than just knowing the measures of central
tendency, this is where the box plot helps.
 This also takes less space. It is also a type of pictorial representation of data.
 Since the centre, spread and overall range are immediately apparent, distributions
can be compared easily using these boxplots.
Parts of Box Plots

 Check the image below, which shows the:
a) minimum,
b) maximum,
c) first quartile,
d) third quartile,
e) median
f) and outliers.
Parts of Box Plot
a) Minimum: The minimum value in the given dataset
b) First Quartile (Q1): The first quartile is the median of the lower half of the data set.
c) Median: The median is the middle value of the dataset, which divides the given dataset into
two equal parts. The median is considered as the second quartile.
d) Third Quartile (Q3): The third quartile is the median of the upper half of the data.
e) Maximum: The maximum value in the given dataset.
f) Apart from these five terms, the other terms used in the box plot are:
g) Interquartile Range (IQR): The difference between the third quartile and first quartile is
known as the interquartile range. (i.e.) IQR = Q3-Q1
h) Outlier: The data that falls on the far left or right side of the ordered data is tested to be an
outlier. Generally, the outliers fall more than the specified distance from the first and third
quartiles. (i.e.) Outliers are greater than Q3 + (1.5 × IQR) or less than Q1 - (1.5 × IQR).
Boxplot Distribution

 The box plot distribution will
explain:
a) how tightly the data is grouped,
b) how the data is skewed,
c) and also about the symmetry of
data.
Boxplot Distribution

 Positively Skewed: If the distance from the median to the
maximum is greater than the distance from the median to the
minimum, then the box plot is positively skewed.
 Negatively Skewed: If the distance from the median to minimum
is greater than the distance from the median to the maximum,
then the box plot is negatively skewed.
 Symmetric: The box plot is said to be symmetric if the median is
equidistant from the maximum and minimum values.
Box Plot Chart

 In a box and whisker plot:
a) the ends of the box are the upper and lower quartiles so that the
box crosses the interquartile range
b) a vertical line inside the box marks the median
c) the two lines outside the box are the whiskers extending to the
highest and lowest observations.
Applications

 It is used to know:
a) The outliers and their values
b) Symmetry of Data
c) Tight grouping of data
d) Data skewness – if, in which direction and how
Box Plot Example

Find the maximum, minimum, median, first quartile, third quartile for the given
data set: 23, 42, 12, 10, 15, 14, 9.
 Solution:
Given: 23, 42, 12, 10, 15, 14, 9.
Arrange the given dataset in ascending order.
9, 10, 12, 14, 15, 23, 42
Hence,
Minimum = 9
Maximum = 42
Median = 14
First Quartile = 10 (Middle value of 9, 10, 12 is 10)
Third Quartile = 23 (Middle value of 15, 23, 42 is 23).
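A quick check of this example in Python, splitting the ordered data at the median and taking the median of each half, as the solution above does:

import numpy as np

data = np.sort(np.array([23, 42, 12, 10, 15, 14, 9]))
median = np.median(data)                        # 14.0
lower_half, upper_half = data[:3], data[4:]     # halves, excluding the middle value
q1, q3 = np.median(lower_half), np.median(upper_half)   # 10.0 and 23.0
print(data.min(), q1, median, q3, data.max())   # 9 10.0 14.0 23.0 42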
Constructing a Box-plot
Using Box-plots to Compare Data Sets
Skewness

 Skewness is a measurement of the distortion of a symmetrical
distribution, or asymmetry, in a data set.
 Skewness is demonstrated on a bell curve when data points are
not distributed symmetrically to the left and right sides of the
median.
 If the bell curve is shifted to the left or the right, it is said to be
skewed.
 Skewness can be quantified as a representation of the extent to
which a given distribution varies from a normal distribution.
SKEWNESS: MEANING AND
DEFINITIONS
 It may be repeated here that
frequency distributions differ in
three ways: Average value,
Variability or dispersion, and Shape.
 Generally, there are two
comparable characteristics called
skewness and kurtosis that help us
to understand a distribution.
 Two distributions may have the
same mean and standard deviation
but may differ widely in their overall
appearance as can be seen from the
following:
SKEWNESS: MEANING AND
DEFINITIONS
 In both these distributions the value of mean and
standard deviation is the same
 But it does not imply that the distributions are alike
in nature.
 The distribution on the left-hand side is a
symmetrical one whereas the distribution on the
right-hand side is asymmetrical or skewed.
 Measures of skewness help us to distinguish
between different types of distributions.
Some important definitions of
skewness are as follows:
 "When a series is not symmetrical it is said to be asymmetrical or skewed.“ :Croxton &
Cowden.
 "Skewness refers to the asymmetry or lack of symmetry in the shape of a frequency
distribution." :Morris Hamburg.
 "Measures of skewness tell us the direction and the extent of skewness. In symmetrical
distribution the mean, median and mode are identical. The more the mean moves away from
the mode, the larger the asymmetry or skewness.“:-Simpson & Kalka
 "A distribution is said to be 'skewed' when the mean and the median fall at different points in
the distribution, and the balance (or centre of gravity) is shifted to one side or the other-to
left or right.“:-Garrett
 The above definitions show that the term 'skewness' refers to lack of symmetry" i.e., when a
distribution is not symmetrical (or is asymmetrical) it is called a skewed distribution.
Symmetrical Distribution.

 Symmetrical Distribution. It is
clear from the diagram (a)
that in a symmetrical
distribution the values of
mean, median and mode
coincide.
 The spread of the frequencies
is the same on both sides of
the center point of the curve.
Asymmetrical Distribution.

A distribution, which is not
symmetrical, is called a
skewed distribution and
such a distribution could
either be positively skewed
or negatively skewed as
would be clear from the
diagrams (b) and (c).
Positively Skewed and
Negatively Skewed Distribution
 Positively Skewed Distribution: In a
positively skewed distribution the value of
the mean is the greatest and that of the mode
the least; the median lies in between the two,
as is clear from diagram (b).
 Negatively Skewed Distribution:
a) The following is the shape of a negatively
skewed distribution. In a negatively
skewed distribution the value of the mode is
the greatest and that of the mean the least; the
median lies in between the two.
b) In the positively skewed distribution the
frequencies are spread out over a greater
range of values on the high-value end of
the curve (the right-hand side) than they
are on the low-value end.
 In the negatively skewed distribution the position is reversed, i.e.
the excess tail is on the left-hand side.
 It should be noted that in moderately symmetrical distributions
the interval between the mean and the median is approximately
one-third of the interval between the mean and the mode.
 It is this relationship, which provides a means of measuring the
degree of skewness.
TESTS OF SKEWNESS

 In order to ascertain whether a distribution is skewed or not, the following tests
may be applied. Skewness is present if:
1) The values of mean, median and mode do not coincide.
2) When the data are plotted on a graph they do not give the normal bell shaped
form i.e. when cut along a vertical line through the center the two halves are not
equal.
3) The sum of the positive deviations from the median is not equal to the sum of the
negative deviations.
4) Quartiles are not equidistant from the median.
5) Frequencies are not equally distributed at points of equal deviation from the
mode.
TESTS OF SKEWNESS

 On the contrary, when skewness is absent, i.e. in the case of a symmetrical
distribution, the following conditions are satisfied:
a) The values of mean, median and mode coincide.
b) Data when plotted on a graph give the normal bell-shaped form.
c) Sum of the positive deviations from the median is equal to the sum of the
negative deviations.
d) Quartiles are equidistant from the median.
e) Frequencies are equally distributed at points of equal deviations from the
mode
MEASURES OF SKEWNESS

 There are four measures of skewness, each divided into absolute and relative measures.
 The relative measure is known as the coefficient of skewness and is more frequently
used than the absolute measure of skewness.
 Further, when a comparison between two or more distributions is involved, it is the
relative measure of skewness which is used.
 The measures of skewness are:
a) Karl Pearson's measure
b) Bowley’s measure
c) Kelly’s measure,
d) Moment’s measure.
Karl Pearson’s measure

 The formula for measuring skewness as given by Karl Pearson is as
follows:
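The slide’s formula image is not reproduced here; Karl Pearson’s (first) coefficient of skewness, which the description below refers to, is

$Sk_P = \dfrac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$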
 The direction of skewness is determined by ascertaining whether the mean is greater
than the mode or less than the mode
 If it is greater than the mode, then skewness is positive. But when the mean is less than
the mode, it is negative.
 The difference between the mean and mode indicates the extent of departure from
symmetry.
 It is measured in standard deviation units, which provide a measure independent of the
unit of measurement.
 It may be recalled that this observation was made in the preceding chapter while
discussing standard deviation.
 The value of the coefficient of skewness is zero when the distribution is symmetrical.
Normally, this coefficient of skewness lies between [-1, +1].
KURTOSIS
 Kurtosis is another measure of the
shape of a frequency curve.
 It is a Greek word which means
bulginess. While skewness signifies the
extent of asymmetry, kurtosis measures
the degree of peakedness of a frequency
distribution.
 Karl Pearson classified curves into
three types on the basis of the shape of
their peaks.
 These are mesokurtic, leptokurtic and
platykurtic. These three types of curves
are shown in figure below:
 It will be seen from Fig. 3.2 that mesokurtic curve is neither too
much flattened nor too much peaked.
 In fact, this is the frequency curve of a normal distribution.
 A leptokurtic curve is more peaked than the normal curve.
 In contrast, a platykurtic curve is relatively flat.
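As a minimal sketch, both measures can be computed directly in pandas on a numeric column, for example the Salary column of the employee data used earlier (assuming that DataFrame is loaded as df):

print(df['Salary'].skew())   # > 0 suggests positive (right) skew, < 0 negative (left) skew
print(df['Salary'].kurt())   # excess kurtosis: ~0 mesokurtic, > 0 leptokurtic, < 0 platykurtic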
Pivot Table Introduction
 A pivot table is a table of grouped values that aggregates the individual items of a more
extensive table (such as from a database, spreadsheet, or business intelligence program)
within one or more discrete categories.
 This summary might include averages, counts, percentages, standard deviations, and so
on which the pivot table groups together using a chosen aggregation function applied to
the grouped values.
 Microsoft introduced Pivot Tables into Excel with version 5. Pivot Tables replaced
Excel’s older cross-tabulation feature.
 A Pivot Table lets you display the data contained in a column of an Excel list (database)
by means of subtotals (or other calculations) that are defined by another column in the
same list.
Pivot Table Introduction

 Pivot tables are among the most useful and powerful features in
Excel.
 We use them in summarizing the data stored in a table.
 They organize and rearrange statistics (or "pivot") to draw
attention to the valuable facts.
 You can take an extremely large data set and see the relevant
information you need in a clean, concise, manageable way.
Why organize list data into a Pivot
Table?
Three key reasons for organizing data into a Pivot Table are:
a) To summarize the data contained in a lengthy list into a compact
format.
b) To find relationships within the data that are otherwise hard to
see because of the amount of detail.
c) To organize the data into a format that’s easy to chart.
Sample Data

 The sample data that we are
going to use contains 448
records with 8 fields of
information on the sale of
products across different
regions between 2013 and 2015.
 This data is perfect to
understand the pivot table.
Insert Pivot Tables

 To insert a pivot table in your sheet,
follow these steps:
a) Click on any cell in a data set.
b) On the Insert tab, in the Tables group,
click PivotTable.
A dialog box will appear.
Excel will auto-select
your dataset.
 It will also create a new
worksheet for your pivot
table.
 Click Ok.
 Then, it will create a pivot table
worksheet.
Drag Fields

 To get the total sales of each
salesperson, drag the following
fields to the following areas:
a) Salesperson field to Rows area.
b) Sales field to Values area.
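For readers following along in Python rather than Excel, a rough pandas equivalent of this summary is sketched below (the file name is hypothetical; the Salesperson and Sales column names follow the example above):

import pandas as pd

sales = pd.read_csv('sales.csv')   # hypothetical file holding the 448 sample records
totals = pd.pivot_table(sales, index='Salesperson', values='Sales', aggfunc='sum')
print(totals.sort_values('Sales', ascending=False))   # total sales per salesperson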
Value Field Settings

a) By default, Excel gives the summation of
the values that are put into the Values
section.
b) You can change that from the Value Field
Settings.
c) Click on the Sum of Sales in the Values
field.
Value Field Settings

 Choose the type of calculation
you want to use.
 Click OK
Sort By Value: Right-click any Sales value and
choose Sort > Sort Largest to Smallest.
Two-Dimensional Pivot Table

We can create a pivot table in various two-
dimensional arrangements. Drag the
following fields to the different areas:
a) Salesperson to Rows area.
b) Region to Columns area.
c) Sales to Values area.
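The corresponding pandas sketch, continuing the hypothetical sales DataFrame from earlier, places one field on each axis:

two_dim = pd.pivot_table(sales, index='Salesperson', columns='Region',
                         values='Sales', aggfunc='sum', fill_value=0)
print(two_dim)   # salespeople as rows, regions as columns, summed sales in the cells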
Applying Filters to a Pivot Table

 Let’s see how we can add a filter to
our pivot table.
 We will continue with the
previous example and add the Year
field to the Filters area
 You can see that it adds a filter on
the top of the worksheet.
Grouping Data in a Pivot Table

 Excel allows you to group pivot table
items. To create the groups, execute
the following steps:
a) In the pivot table, select the data you
want to group.
b) Right-click and click on Group.
Percentage Contribution in a Pivot
Table
 There are various ways to display the
values in a table. One way is to show
the value as a percentage of the total.
a) Add the sales field again to the values
section.
b) Right-click on the second instance and
select % of Grand Total.
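A rough pandas equivalent of the "% of Grand Total" view (again using the hypothetical sales DataFrame from the earlier sketch):

by_person = pd.pivot_table(sales, index='Salesperson', values='Sales', aggfunc='sum')
share = 100 * by_person / by_person['Sales'].sum()   # each total as a percentage of the grand total
print(share.round(1))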
Mechanics: Another Example

 For typical data entry and
storage, data usually
appear in flat tables,
meaning that they consist
of only columns and
rows, as in the following
portion of a sample
spreadsheet showing data
on shirt types:
Mechanics: Another Example
 While tables such as these can
contain many data items, it can be
difficult to get summarized
information from them.
 A pivot table can help quickly
summarize the data and highlight
the desired information.
 The usage of a pivot table is
extremely broad and depends on
the situation.
 The first question to ask is, "What
am I seeking?"
 In the example here, let us ask,
"How many Units did we sell in
each Region for every Ship Date?":
 Below is a simple example of how putting data in a Pivot Table can be
useful.
 At left in the spreadsheet illustration above is a simple list, or Excel
database.
 Even looking at this simple, short list it’s difficult to discern patterns
in the data.
 For example, it takes a bit of study to see that the number of Units
Sold in the Northeast region is much greater than the number of Units
Sold for the Southwest region.
 Or to find out that Gouda outsells Brie in the Northwest. Questions of
this type that you might have about the data can be answered, but
only with some effort.
 By contrast, the Pivot Table at the right simplifies and summarizes the
data to make relationships and patterns obvious.
 And, if you had much more data in the list at left (perhaps with many
additional entries for each region), you could still achieve a
condensed Pivot Table summary the same size as the one at right.
 The Pivot Table also allows you to include or exclude whatever list
data you like.
 You can easily chart the data organized into a Pivot Table, while to
chart the data in the list at left you’d first need to restructure the data
and obtain the sum for each region.
 The Pivot Table simplifies the process because it obtains subtotals
automatically and puts them in a range you can immediately use for
charting.
Using the Chart Wizard, it’s easy to get a
summary, graphical view of your data using the
Pivot Table as your tool to organize and
summarize the data.
What’s Required to Construct a
Pivot Table?
 To create a Pivot Table you need to identify these two elements in
your data:
a) a data field, where the data field is the variable you want to
summarize
b) a row and/or column field where the row and/or column fields are the
variables that will “control” the data summary
 To create the Pivot Table, invoke the Pivot
Table Wizard with the menu commands Data,
Pivot Table and Pivot Chart Report. The Pivot
Table Wizard leads you through three steps:
 Step 1: Allows you to specify where your data
is located and whether you want a chart as
well as a table. Most commonly you’ll get your
data from an Excel list that’s part of the
current worksheet.
 Step 2: Identifies the list range. If you have the insertion point
anywhere in the list when you start the Wizard, Excel defines the list
range automatically.
 Step 3: Creates the Pivot Table using its best guess as to layout.
 Users of the previous version of the Excel Pivot Table will recall a step
before this one that permits the user to determine the initial Pivot
Table layout.
 That step is still available if one chooses the Layout button on Step 3
before clicking Finish. (Of course, you can return to the layout view at
any time to make changes while working with the completed Pivot
Table.)
 If you choose the Layout button in Step 3 or invoke that view while
working with the Pivot Table, you get a dialog that looks like this:
Drag field buttons at right
(representing variables from
the list) to the layout diagram
at left. You need not use all the
field buttons.
 For example, I’ve used only four of the six possible field buttons in
the layout at right.
 You can include multiple field buttons in each of the layout locations.
You can choose to locate your new Pivot Table on the sheet with the
Excel list or on a new worksheet.
 Once you’ve created your Pivot Table, you can modify it by using the
buttons on the Pivot Table toolbar that automatically displays. Two of
the buttons are especially key:
 Although you may want to return to the Wizard layout view, you can
drag and drop items onto and off of the Pivot Table and move them
around in the layout by just using the mouse.
 Note that all the field buttons are visible on the toolbar to allow easy
dragging onto the Pivot Table, should you want to add more detail.
A drop-down menu (with some
redundant options) is available
from the first Pivot Table toolbar
button on the left.
Pivot Table Row and Column Fields

A row field
 A row field in a Pivot Table is a variable that takes on different values.
For example, a row field might be “Manufacturer” and its values
might be “Schwinn”, “Cannondale”, and “Omega”.
 The values a variable takes on are sometimes referred to as “items”.
In the example below, for each value of the variable “Manufacturer”,
the Pivot Table displays a summary of the chosen data field in an
adjoining column.
 The data field in this example is “Annual Sales” and the summary
function is sum.
 Notice that the Pivot Table uses the label “Sum of Annual Sales” to
identify not only the data field (Annual Sales) but also the default
summary operation (sum).
 A column field
 A Pivot Table column field works like a row field. A column field might
be the variable “Year” with values ranging from 1995 to 1998. Data
beneath each column in the Pivot Table is associated with the year at
the head of the column.
 The basic effect of row and column fields in a Pivot Table is that each
value or item that the field takes on defines a different row or column.
 So if a list has a row field that takes on three items (Schwinn,
Cannondale, Omega) and a column field that takes on four items
(1995, 1996, 1997, 1998), the Pivot Table has three rows and four
columns, and therefore twelve summary cells (exclusive of the cells
that hold Grand Totals and labels).
Introduction: ANOVA

 Buying a new product or testing a new technique but not sure how it stacks up against the
alternatives? It’s an all too familiar situation for most of us.
 Most of the options sound similar to each other so picking the best out of the lot is a
challenge.
 Consider a scenario where we have three medical treatments to apply on patients with
similar diseases.
 Once we have the test results, one approach is to assume that the treatment which took the
least time to cure the patients is the best among them.
 What if some of these patients had already been partially cured, or if any other medication
was already working on them?
 In order to make a confident and reliable decision, we will need evidence to support our
approach. This is where the concept of ANOVA comes into play.
 Different ANOVA techniques can be used for making the best decisions. We’ll take a few
cases and try to understand the techniques for getting the results.
Introduction to ANOVA

 A common approach to figure out a reliable treatment method would be to analyze the
days it took the patients to be cured.
 We can use a statistical technique which can compare these three treatment samples and
depict how different these samples are from one another.
 Such a technique, which compares the samples on the basis of their means, is called
ANOVA.
 Analysis of variance (ANOVA) is a statistical technique that is used to check if the means
of two or more groups are significantly different from each other.
 ANOVA checks the impact of one or more factors by comparing the means of different
samples.
 We can use ANOVA to prove/disprove if all the medication treatments were equally
effective or not.
 Another measure to compare the samples is called a t-test. When
we have only two samples, t-test and ANOVA give the same
results.
 However, using a t-test would not be reliable in cases where
there are more than 2 samples.
 If we conduct multiple t-tests for comparing more than two
samples, it will have a compounded effect on the error rate of the
result.
Terminologies related to ANOVA
you need to know
 Before we get started with the applications of ANOVA, I would
like to introduce some common terminologies used in the
technique.
1) Grand Mean
 Mean is a simple or arithmetic average of a range of values.
There are two kinds of means that we use in ANOVA
calculations: the separate sample means and the grand mean.
 The grand mean is the mean of the sample means, or the mean of all
observations combined, irrespective of the sample.
Hypothesis

 Considering our above medication example, we can assume that
there are 2 possible cases – either the medication will have an
effect on the patients or it won’t.
 These statements are called Hypothesis. A hypothesis is an
educated guess about something in the world around us.
 It should be testable either by experiment or observation.
 Just like any other kind of hypothesis that you might have studied
in statistics, ANOVA also uses a Null hypothesis and an Alternate
hypothesis.
Hypothesis
 The Null hypothesis in ANOVA is
valid when all the sample means are
equal, or they don’t have any
significant difference.
 Thus, they can be considered as a
part of a larger set of the population.
 On the other hand, the alternate
hypothesis is valid when at least
one of the sample means is different
from the rest of the sample means.
 In mathematical form, they can be
represented as:
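The slide’s formula image is not reproduced here; in the usual notation the two hypotheses are

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$ (all sample/group means are equal)
$H_1: \mu_l \neq \mu_m$ for at least one pair of samples $l, m$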
 Where μl and μm belong to any two sample means out of all the
samples considered for the test.
 In other words, the null hypothesis states that all the sample
means are equal or the factor did not have any significant effect
on the results.
 Whereas, the alternate hypothesis states that at least one of the
sample means is different from another.
 But we still can’t tell which one specifically.
Between Group Variability

 Consider the distributions of
the two samples below.
 As these samples overlap, their
individual means won’t differ
by a great margin.
 Hence the difference between
their individual means and
grand mean won’t be
significant enough.
Between Group Variability
 Now consider these two sample
distributions. As the samples differ from
each other by a big margin, their
individual means would also differ.
 The difference between the individual
means and grand mean would therefore
also be significant.
 Such variability between the
distributions is called between-group
variability.
 It refers to variations between the
distributions of individual groups (or
levels) as the values within each group
are different.
 Each sample is looked at, and the
difference between its mean and the
grand mean is used to
calculate the variability.
 If the distributions overlap or are
close, the grand mean will be
similar to the individual means
whereas if the distributions are far
apart, difference between means
and grand mean would be large.
 We will calculate Between
Group Variability just as we
calculate the standard
deviation.
 Given the sample means and
Grand mean, we can calculate
it as:
a) We also want to weigh each squared deviation by the size of the sample.
b) In other words, a deviation is given greater weight if it’s from a larger
sample.
c) Hence, we’ll multiply each squared deviation by each sample size and add
them up. This is called the sum-of-squares for between-group variability

There’s one more thing we have to do to derive a good measure of between-group
variability. Again, recall how we calculate the sample standard deviation.
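The slide’s formula image is not shown here; written out, the weighted sum of squared deviations described above is

$SS_{between} = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{x}_{grand})^2$

and, just as a sample standard deviation divides by its degrees of freedom, dividing by the between-group degrees of freedom gives the mean square $MS_{between} = SS_{between}/(k-1)$, where $k$ is the number of samples (groups) and $n_i$ is the size of sample $i$.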
Within Group Variability

 Consider the given
distributions of three
samples.
 As the spread (variability)
of each sample is
increased, their
distributions overlap and
they become part of a big
population.
 Now consider another distribution of the same three
samples but with less variability.
 Although the means of samples are similar to the
samples in the above image, they seem to belong to
different populations.
 Such variations within a sample are denoted by Within-
group variation. It refers to variations caused by
differences within individual groups (or levels) as not
all the values within each group are the same.
 Each sample is looked at on its own and variability
between the individual points in the sample is
calculated.
 In other words, no interactions between samples are
considered.
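For completeness, the within-group sum of squares is $SS_{within} = \sum_{i=1}^{k}\sum_{j=1}^{n_i}(x_{ij} - \bar{x}_i)^2$, with mean square $MS_{within} = SS_{within}/(N - k)$ for $N$ total observations, and the ANOVA F statistic compares $MS_{between}$ to $MS_{within}$. A minimal one-way ANOVA sketch in Python (the three treatment groups below are made-up numbers, purely for illustration):

from scipy import stats

treatment_a = [12, 14, 11, 13, 15]   # hypothetical days to cure
treatment_b = [16, 18, 17, 15, 19]
treatment_c = [13, 12, 14, 15, 13]
f_stat, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print(f_stat, p_value)   # a small p-value leads us to reject H0 that all group means are equal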
Heat Map

 A heat map (or heatmap) is a data visualization technique that shows the
magnitude of a phenomenon as color in two dimensions.
 The variation in color may be by hue or intensity, giving obvious visual cues
to the reader about how the phenomenon is clustered or varies over space.
 There are two fundamentally different categories of heat maps: the cluster heat
map and the spatial heat map.
 In a cluster heat map, magnitudes are laid out into a matrix of fixed cell size
whose rows and columns are discrete phenomena and categories, and the
sorting of rows and columns is intentional and somewhat arbitrary, with the
goal of suggesting clusters or portraying them as discovered via statistical
analysis.
 The size of the cell is arbitrary but large enough to be clearly visible.
 By contrast, the position of a magnitude in a spatial heat map is forced by the
location of the magnitude in that space, and there is no notion of cells; the
phenomenon is considered to vary continuously.
Types
Spatial heat map
a) It displays the magnitude of a spatial
phenomenon as color, usually cast over a
map.
b) In the image labeled “Spatial Heat Map
Example,” temperature is displayed by
color range across a map of the world.
Color ranges from blue (cold) to red (hot).
Grid heat map
 It displays magnitude as color in a two-
dimensional matrix, with each dimension
representing a category of trait and the
color representing the magnitude of some
measurement on the combined traits from
each of the two categories.
 For example, one dimension might represent year, and the other dimension
might represent month, and the value measured might be temperature.
 This heat map would show how temperature changed over the years in each
month.
 Grid heat maps are further categorized into two different types of matrices:
clustered, and correlogram.
Clustered heat map:
 The example of the monthly temperature by year is a clustered heat
map.
Correlogram:
a) A correlogram is a clustered heat map that has the same trait for each
axis in order to display how the traits in the set of traits interact with
each other.
b) The correlogram is a triangle instead of a square because the
combination of A-B is the same as the combination of B-A and so
does not need to be expressed twice.
What is a HeatMap
a) Heatmaps visualize the data in a 2-dimensional format in the form of colored
maps.
b) The color maps use hue, saturation, or luminance to achieve color variation to
display various details.
c) This color variation gives visual cues to the readers about the magnitude of
numeric values.
d) HeatMaps is about replacing numbers with colors because the human brain
understands visuals better than numbers, text, or any written data.
e) Human beings are visual learners; therefore, visualizing the data in any form
makes more sense.
f) Heatmaps represent data in an easy-to-understand manner. Thus visualizing
methods like HeatMaps have become popular.
a) Heatmaps can describe the density or intensity of variables, visualize
patterns, variance, and even anomalies.
b) Heatmaps show relationships between variables.
c) These variables are plotted on both axes.
d) We look for patterns in the cell by noticing the color change.
e) It only accepts numeric data and plots it on the grid, displaying different
data values by varying color intensity.
Uses of HeatMap: Business Analytics:
 A heat map is used as a visual business analytics tool. A heat map gives quick
visual cues about the current results, performance, and scope for
improvements.
 Heatmaps can analyze the existing data and find areas of intensity that might
reflect where most customers reside, areas of risk of market saturation, or cold
sites and sites that need a boost.
 Heat maps can be continued to be updated to reflect the growth and efforts.
These maps can be integrated into a business’s workflow and become a part of
ongoing analytics.
 Heat maps present the data in a visual and easy to understand manner to
communicate to team members or clients.
Uses of HeatMap: Website:

a) Heatmaps are used on websites to visualize data on visitors’
behavior.
b) This visualization helps business owners and marketers to
identify the best & worst-performing sections of a webpage.
c) These insights help them with optimization.
Uses of HeatMap: Exploratory Data Analysis:
 EDA is a task performed by data scientists to get familiar with the data. All the
initial studies done to understand the data are known as EDA.
 Exploratory Data Analysis (EDA) is the process of analyzing datasets before the
modeling task.
 It is a tedious task to look at a spreadsheet filled with numbers and determine
essential characteristics in a dataset.
 Therefore EDA is done to summarize their main features, often with visual
methods, which includes Heatmaps.
 Heatmaps are a compelling way to visualize relationships between variables in
high dimensional space.
 It can be done using feature variables as row headers and column headers, and the
variable vs. itself on the diagonal.
 Molecular Biology: Heat maps are used to study disparity and
similarity patterns in DNA, RNA, etc.
 Marketing and Sales: The heatmap’s capability to detect warm and
cold spots is used to improve marketing response rates by targeted
marketing.
 Heatmaps allow the detection of areas that respond to campaigns,
under-served markets, customer residence, and high sale trends, which
helps optimize product lineups, capitalize on sales, create targeted
customer segments, and assess regional demographics
Types of HeatMaps
Typically, there are two types of Heatmaps:
 Grid Heatmap: The magnitudes of values shown through colors are laid out into
a matrix of rows and columns, mostly by a density-based function. Below are the
types of Grid Heatmaps.
 Clustered Heatmap:
a) The goal of Clustered Heatmap is to build associations between both the data
points and their features.
b) This type of heatmap implements clustering as part of the process of grouping
similar features.
c) Clustered Heatmaps are widely used in biological sciences for studying gene
similarities across individuals.
a) The order of the rows in
Clustered Heatmap is
determined by performing
hierarchical cluster analysis of
the rows.

b) Clustering positions similar
rows together on the map.
Similarly, the order of the
columns is determined.
Correlogram: A correlogram replaces each
of the variables on the two axes with
numeric variables in the dataset.

Each square depicts the relationship
between the two intersecting variables,
which helps to build descriptive or
predictive statistical models.
Spatial Heatmap:

 Each square in a Heatmap is assigned a color representation
according to the nearby cells’ value.
 The location of color is according to the magnitude of the value
in that particular space.
 These Heatmaps are data-driven “paint by numbers” canvas
overlaid on top of an image.
 The cells with higher values than other cells are given a hot color,
while cells with lower values are assigned a cold color.
Who Uses Heat Maps?

 Heat maps are used by any organization that collects and uses
data to improve their finances, sales, operations, customer
service, or marketing.
 Some of the industries that use Heatmaps:
1. Healthcare
2. Finance
3. Technology
4. Real estate
Color schemes

Many different color schemes can illustrate the heat map, with
perceptual advantages and disadvantages for each.
The color palette choices are more than just aesthetics because
the colors in the HeatMap reveal patterns in the data.
Good color schemes can enhance pattern discovery, and poor
color choices can hide it.
General principles for using colors in
Heatmaps are:
 Vary hue to distinguish categories:
a) Vary the elements’ color to represent multiple plot categories.
b) Most people can differentiate between a moderate number of shades. It is
best to represent categories using hues.
 Vary luminance to represent numbers:
a) Varying brightness helps you see structure in numeric data.
b) In Bivariate distribution, luminance variation enhances discrete or
continuous pattern visualization.
c) The luminance color scheme also makes apparent that there are two
prominent peaks.
Inputs for Heatmap

There are three input types for the heatmap:
 Wide-format:
a) Wide-format, also called
Untidy Format, is a matrix
where each row is an
individual, and each column
represents an observation.
b) In this case, a heatmap cell
color corresponds to the
observation value.
Correlation matrix:
a) Correlation Matrix is also called
square format formed by
performing the corr() function on
the dataset, and this matrix is
plotted on the heatmap.
b) Such Heatmaps helps discover
which variables are related to each
other.
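A minimal sketch of this input type (assuming a DataFrame df with numeric columns, such as the Iris data loaded earlier):

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)   # square correlation matrix (numeric_only needs a recent pandas)
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()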
Long format:
a) Long format, also called
tidy format, is when each
line represents an
observation.
b) You have three columns:
individual, variable name,
and value (x, y, and z).
c) You can plot a heatmap
from this kind of data as
follows, as in the sketch below:
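The code the slide refers to is not reproduced here; a hedged sketch of the usual approach is to pivot the long data into a wide matrix first (the x, y, z column names follow the description above; the values are made up):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

long_df = pd.DataFrame({'x': ['A', 'A', 'B', 'B'],
                        'y': ['v1', 'v2', 'v1', 'v2'],
                        'z': [1.0, 2.5, 0.5, 3.0]})       # toy long-format data
wide = long_df.pivot(index='y', columns='x', values='z')  # reshape to wide format
sns.heatmap(wide)
plt.show()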
