AD3301 DATA EXPLORATION AND VISUALIZATION
UNIT I
EXPLORATORY DATA ANALYSIS
EDA fundamentals – Understanding data science – Significance of EDA –
Making sense of data – Comparing EDA with classical and Bayesian
analysis – Software tools for EDA - Visual Aids for EDA- Data
transformation techniques-merging database, reshaping and pivoting,
Transformation techniques - Grouping Datasets - data aggregation –
Pivot tables and cross-tabulations.
EDA Fundamentals
Introduction
➢ Data encompasses a collection of discrete objects, numbers,
words, events, facts, measurements, observations, or even
descriptions of things.
➢ Such data is collected and stored by every event or process
occurring in several disciplines, including biology, economics,
engineering, marketing, and others.
➢ Processing such data elicits useful information and processing
such information generates useful knowledge.
➢ Exploratory Data Analysis enables generating meaningful and
useful information from such data.
➢ Exploratory Data Analysis (EDA) is a process of examining the
available dataset to discover patterns, spot anomalies, test
hypotheses, and check assumptions using statistical measures.
➢ Primary aim of EDA is to examine what data can tell us before
actually going through formal modeling or hypothesis formulation.
Understanding data science
➢ Data science involves cross-disciplinary knowledge from
computer science, data, statistics, and mathematics.
➢ There are several phases of data analysis, including
1. Data requirements
2. Data collection
3. Data processing
4. Data cleaning
5. Exploratory data analysis
6. Modeling and algorithms
7. Data product
8. Communication
➢ These phases are similar to the CRoss-Industry Standard
Process for Data Mining (CRISP-DM) framework.
1. Data requirements
• There can be various sources of data for an organization.
• It is important to comprehend what type of data is required
for the organization to be collected, curated, and stored.
• For example, an application tracking the sleeping pattern of
patients suffering from dementia requires several types of
sensors' data storage, such as sleep data, heart rate from the
patient, electro-dermal activities, and user activities
patterns.
• All of these data points are required to correctly diagnose the
mental state of the person.
• Hence, these are mandatory requirements for the application.
• It is also necessary to categorize the data as numerical or
categorical, and to decide the format of storage and dissemination.
2. Data collection
• Data collected from several sources must be stored in the
correct format and transferred to the right information
technology personnel within a company.
• Data can be collected from several objects during several
events using different types of sensors and storage tools.
3. Data processing
• Preprocessing involves the process of pre-curating
(selecting and organizing) the dataset before actual analysis.
• Common tasks involve correctly exporting the dataset,
placing it under the right tables, structuring it, and
exporting it in the correct format.
4. Data cleaning
• Preprocessed data is still not ready for detailed analysis.
• It must be correctly transformed for an incompleteness
check, duplicates check, error check, and missing value
check.
• This stage involves responsibilities such as matching the
correct record, finding inaccuracies in the dataset,
understanding the overall data quality, removing duplicate
items, and filling in the missing values.
• Data cleaning is dependent on the types of data under study.
• Hence, it is essential for data scientists or EDA experts to
comprehend different types of datasets.
• An example of data cleaning is using outlier detection
methods for quantitative data cleaning.
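The cleaning checks described above can be sketched with pandas. This is a minimal illustration, not a prescribed procedure: the `readings` table, its column names, its values, the median fill, and the 1.5 * IQR outlier rule are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings (illustrative values only)
readings = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5],
    "heart_rate": [72.0, 80.0, 80.0, np.nan, 76.0, 190.0],
})

# Duplicates check: drop rows that repeat an earlier row exactly
deduped = readings.drop_duplicates().reset_index(drop=True)

# Missing value check: count the gaps, then fill them with the column median
n_missing = int(deduped["heart_rate"].isna().sum())
cleaned = deduped.fillna({"heart_rate": deduped["heart_rate"].median()})

# Outlier detection with the 1.5 * IQR rule (one common choice among many)
q1, q3 = cleaned["heart_rate"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (cleaned["heart_rate"] < q1 - 1.5 * iqr) | (
    cleaned["heart_rate"] > q3 + 1.5 * iqr
)

print(n_missing)
print(cleaned[is_outlier])
```

Whether a flagged value is an error or a genuine extreme reading depends on the data under study, which is why the flagged rows are printed rather than deleted here.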
5. EDA
• Exploratory data analysis is the stage where the message
contained in the data is actually understood.
• Several types of data transformation techniques might be
required during the process of exploration.
6. Modeling and algorithm
• Generalized models or mathematical formulas represent or
exhibit relationships among different variables, such as
correlation or causation.
• These models or equations involve one or more variables that
depend on other variables to cause an event.
• For example, when buying pens, the total price of pens (Total)
= price for one pen (UnitPrice) * the number of pens bought
(Quantity). Hence, our model would be Total = UnitPrice *
Quantity. Here, the total price is dependent on the unit price.
Hence, the total price is referred to as the dependent variable
and the unit price is referred to as an independent variable.
• In general, a model always describes the relationship
between independent and dependent variables.
• Inferential statistics deals with quantifying relationships
between particular variables.
• The Judd model for describing the relationship between data,
model, and the error still holds true: Data = Model + Error.
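As a small illustration of Data = Model + Error applied to the pen example, the sketch below compares observed totals with the model Total = UnitPrice * Quantity. The arrays and their values are hypothetical and exist only to make the arithmetic concrete.

```python
import numpy as np

# Hypothetical purchase records (illustrative values only)
unit_price = np.array([2.0, 2.0, 1.5, 3.0])
quantity = np.array([10, 4, 6, 2])
observed_total = np.array([20.0, 8.5, 9.0, 6.0])

# Model: Total = UnitPrice * Quantity
model_total = unit_price * quantity

# Error is whatever the model leaves unexplained: Data = Model + Error
error = observed_total - model_total

print(model_total)
print(error)
```

A non-zero entry in `error` marks an observation that the model does not fully explain.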
7. Data Product
• Any computer software that uses data as inputs, produces
outputs, and provides feedback based on the output to control
the environment is referred to as a data product.
• A data product is generally based on a model developed
during data analysis
• Example: a recommendation model that inputs user purchase
history and recommends a related item that the user is highly
likely to buy.
8. Communication
• This stage deals with disseminating the results to end
stakeholders to use the result for business intelligence.
• One of the most notable steps in this stage is data
visualization.
• Visualization deals with information relay techniques such as
tables, charts, summary diagrams, and bar charts to show
the analyzed result.
The significance of EDA
➢ Different fields of science, economics, engineering, and
marketing accumulate and store data primarily in electronic
databases.
➢ Appropriate and well-established decisions should be made
using the data collected.
➢ It is practically impossible to make sense of datasets containing
more than a handful of data points without the help of computer
programs.
➢ To be confident in the insights provided by the collected data
and to make further decisions, data mining is performed, which
includes distinct analysis processes.
➢ Exploratory data analysis is the key and first exercise in data
mining.
➢ It allows us to visualize data to understand it as well as to create
hypotheses (ideas) for further analysis.
➢ The exploratory analysis centers around creating a synopsis of
data or insights for the next steps in a data mining project.
➢ EDA actually reveals the ground truth about the content without
making any underlying assumptions.
➢ Hence, data scientists use this process to actually understand
what type of modeling and hypotheses can be created.
➢ Key components of exploratory data analysis include
summarizing data, statistical analysis, and visualization of data.
➢ Python provides expert tools for exploratory analysis
• pandas for summarizing
• scipy, along with others, for statistical analysis
• matplotlib and plotly for visualizations
Steps in EDA
The four different steps involved in exploratory data analysis are,
1. Problem Definition
2. Data Preparation
3. Data Analysis
4. Development and Representation of the Results
1. Problem Definition
• It is essential to define the business problem to be solved before
trying to extract useful insight from the data.
• The problem definition works as the driving force for a data
analysis plan execution
• The main tasks involved in problem definition are
o defining the main objective of the analysis
o defining the main deliverables
o outlining the main roles and responsibilities
o obtaining the current status of the data
o defining the timetable, and
o performing cost/benefit analysis
• Based on the problem definition, an execution plan can be created.
2. Data Preparation
• This step involves methods for preparing the dataset before actual
analysis.
• This step involves
o defining the sources of data
o defining data schemas and tables
o understanding the main characteristics of the data
o cleaning the dataset
o deleting non-relevant datasets
o transforming the data
o dividing the data into required chunks for analysis
3. Data analysis
o This is one of the most crucial steps that deals with
descriptive statistics and analysis of the data
o The main tasks involve
o summarizing the data
o finding the hidden correlation and relationships among the data
o developing predictive models
o evaluating the models
o calculating the accuracies
➢ Some of the techniques used for data summarization are
o summary tables
o graphs
o descriptive statistics
o inferential statistics
o correlation statistics
o searching
o grouping
o mathematical models
4. Development and representation of the results
• This step involves presenting the dataset to the target
audience in the form of graphs, summary tables, maps, and
diagrams.
• This is also an essential step as the result analyzed from the
dataset should be interpretable by the business
stakeholders, which is one of the major goals of EDA.
• Most of the graphical analysis techniques include
o scatter plots
o character plots
o histograms
o box plots
o residual plots
o mean plots
Making Sense of Data
➢ It is crucial to identify the type of data under analysis.
➢ Different disciplines store different kinds of data for different
purposes.
➢ Example: medical researchers store patients' data, universities store
students' and teachers' data, and real estate industries store house
and building datasets.
➢ A dataset contains many observations about a particular object.
➢ For instance, a dataset about patients in a hospital can contain many
observations.
➢ A patient can be described by a
o patient identifier (ID)
o name
o address
o weight
o date of birth
o email
o gender
➢ Each of these features that describes a patient is a variable.
➢ Each observation can have a specific value for each of these
variables.
➢ For example, a patient can have the following:
PATIENT_ID = 1001
Name = Yoshmi Mukhiya
Address = Mannsverk 61, 5094, Bergen, Norway
Date of birth = 10th July 2018
Email = yoshmimukhiya@gmail.com
Weight = 10
Gender = Female
➢ These datasets are stored in hospitals and are presented for
analysis.
➢ Most of this data is stored in some sort of database management
system in tables/schema.
Table for storing patient information
➢ The table contains five observations (001, 002, 003, 004, 005).
➢ Each observation describes variables
(PatientID, name, address, dob, email, gender, and weight).
Types of datasets
➢ Most datasets broadly fall into two groups—numerical data and
categorical data.
Numerical data
➢ This data has a sense of measurement involved in it
➢ For example, a person's age, height, weight, blood pressure, heart
rate, temperature, number of teeth, number of bones, and the
number of family members.
➢ This data is often referred to as quantitative data in statistics.
➢ The numerical dataset can be either discrete or continuous types.
Discrete data
➢ This is data that is countable and its values can be listed.
➢ For example, if we flip a coin, the number of heads in 200 coin flips
can take any value from 0 to 200, which is a finite set of cases.
➢ A variable that represents a discrete dataset is referred to as a
discrete variable.
➢ The discrete variable takes a fixed number of distinct values.
➢ Example:
o The Country variable can have values such as Nepal, India, Norway,
and Japan.
o The Rank variable of a student in a classroom can take values from
1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values
within a specific range is classified as continuous data.
➢ A variable describing continuous data is a continuous variable.
➢ Continuous data can follow an interval measure of scale or ratio
measure of scale
➢ Example:
o The temperature of a city
o The weight variable is a continuous variable
Example table:
Categorical data
➢ This type of data represents the characteristics of an object
➢ Examples: gender, marital status, type of address, or categories of
the movies.
➢ This data is often referred to as qualitative datasets in statistics.
➢ Examples of categorical data
o Gender (Male, Female, Other, or Unknown)
o Marital Status (Annulled, Divorced, Interlocutory, Legally
Separated, Married, Polygamous, Never Married, Domestic
Partner, Unmarried, Widowed, or Unknown)
o Movie genres (Action, Adventure, Comedy, Crime, Drama,
Fantasy, Historical, Horror, Mystery, Philosophical, Political,
Romance, Saga, Satire, Science Fiction, Social, Thriller, Urban,
or Western)
o Blood type (A, B, AB, or O)
o Types of drugs (Stimulants, Depressants, Hallucinogens,
Dissociatives, Opioids, Inhalants, or Cannabis)
➢ A variable describing categorical data is referred to as
a categorical variable.
➢ These types of variables can have a limited number of values.
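The distinction between numerical and categorical variables can be made explicit in pandas through column data types. The sketch below is illustrative only: the `patients` table and its values are hypothetical, and the blood type categories are the ones listed above.

```python
import pandas as pd

# Hypothetical patient records (illustrative values only)
patients = pd.DataFrame({
    "weight": [10.0, 62.5, 80.1],      # continuous numerical variable
    "family_members": [3, 4, 2],       # discrete numerical variable
    "blood_type": pd.Categorical(      # categorical variable with a limited set of values
        ["A", "O", "AB"], categories=["A", "B", "AB", "O"]
    ),
})

print(patients.dtypes)
print(patients["blood_type"].cat.categories.tolist())
```

Declaring the allowed categories up front records the limited set of values the variable can take, even when some of them do not occur in the sample.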
Types of categorical variables
Binary categorical variable
➢ This type of variable can take exactly two values
➢ Also referred to as a dichotomous variable.
➢ Example: while creating an experiment, the result is either success
or failure.
Polytomous variables
➢ This type can take more than two possible values.
➢ Example: marital status can have several values, such as divorced,
legally separated, married, never married, unmarried, widowed,
etc.
➢ Most categorical datasets follow either nominal or ordinal
measurement scales.
Measurement scales
➢ There are four different types of measurement scales in statistics:
nominal, ordinal, interval, and ratio.
➢ These scales are used more in academic industries.
➢ Understanding the type of data is required to understand
o what type of computation could be performed
o what type of model should fit the dataset
o what type of visualization can be generated
➢ Need for classifying data as nominal or ordinal: While analyzing
datasets, the decision of generating pie chart, bar chart, or
histogram is taken based on whether it is nominal or ordinal.
Nominal
➢ These are used for labeling variables without any quantitative
value. The scales are generally referred to as labels.
➢ These scales are mutually exclusive and do not carry any
numerical importance.
➢ Examples:
1. What is your gender?
o Male
o Female
o Third gender/Non-binary
o I prefer not to answer
o Other
2. The languages that are spoken in a particular country
3. Biological species
4. Parts of speech in grammar (noun, pronoun, adjective, and so
on)
5. Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)
➢ Nominal scales are considered qualitative scales and the
measurements that are taken using qualitative scales are
considered qualitative data.
➢ Numbers used as labels have no concrete numerical value or
meaning.
➢ No form of arithmetic calculation can be made on nominal
measures.
➢ Example: The following can be measured in the case of a nominal
dataset,
• Frequency is the rate at which a label occurs over a period of time
within the dataset.
• Proportion can be calculated by dividing the frequency by the total
number of events.
• Then, the percentage of each proportion is computed.
• To visualize the nominal dataset, either a pie chart or a bar chart
can be used.
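The frequency, proportion, and percentage computations described above can be sketched with pandas. The `blood_type` series and its values are hypothetical, chosen only to illustrate the calculation.

```python
import pandas as pd

# Hypothetical nominal observations (illustrative values only)
blood_type = pd.Series(["A", "O", "B", "O", "AB", "O", "A", "O"])

frequency = blood_type.value_counts()       # how often each label occurs
proportion = frequency / len(blood_type)    # frequency divided by the total count
percentage = proportion * 100

summary = pd.DataFrame({
    "frequency": frequency,
    "proportion": proportion,
    "percentage": percentage,
})
print(summary)
```

Only counting is involved here; no arithmetic is performed on the labels themselves.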
Ordinal
➢ The main difference between the ordinal and nominal scales is the order.
➢ In ordinal scales, the order of the values is a significant factor.
➢ The Likert scale uses a variation of an ordinal scale.
➢ Example of ordinal scale using the Likert scale:
WordPress is making content managers' lives easier. How do you
feel about this statement?
Likert scale:
➢ The answer to the question is scaled down to five different ordinal
values, Strongly Agree, Agree, Neutral, Disagree, and Strongly
Disagree.
➢ These scales are referred to as the Likert scale.
More examples of the Likert scale:
➢ To make it easier, consider ordinal scales as an order of ranking
(1st, 2nd, 3rd, 4th, and so on).
➢ The median item is allowed as the measure of central tendency;
however, the average is not permitted.
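A minimal sketch of finding the median item on an ordinal scale, assuming pandas and its ordered categorical type. The responses are hypothetical; the five levels are the Likert values named above.

```python
import pandas as pd

# Likert levels, listed from lowest to highest rank
levels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# Hypothetical survey responses (illustrative values only)
responses = pd.Series(pd.Categorical(
    ["Agree", "Neutral", "Strongly Agree", "Agree", "Disagree"],
    categories=levels,
    ordered=True,
))

# Sort by rank and take the middle item (the number of responses is odd)
ranked = responses.sort_values().reset_index(drop=True)
median_response = ranked.iloc[len(ranked) // 2]

print(median_response)
print(responses.min(), responses.max())
```

The order of the levels is used, but the distance between them is not, which is why an average is not computed.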
Interval
➢ Both the order and exact differences between the values are
significant.
➢ Interval scales are widely used in statistics.
➢ Measures of central tendency (mean, median, and mode) and
standard deviation can be computed on interval scales.
➢ Examples: location in Cartesian coordinates and direction
measured in degrees from magnetic north.
Ratio
➢ Ratio scales contain order, exact values, and absolute zero.
➢ They are used in descriptive and inferential statistics.
➢ These scales provide numerous possibilities for statistical
analysis.
➢ Mathematical operations, the measure of central tendencies, and
the measure of dispersion and coefficient of variation can also be
computed from such scales.
➢ Examples: the measure of energy, mass, length, duration, electrical
energy, plane angle, and volume.
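The statistics available on a ratio scale can be sketched with NumPy. The `mass_kg` values are hypothetical, and the population standard deviation (NumPy's default) is an assumption made for the example.

```python
import numpy as np

# Hypothetical masses in kilograms (illustrative values only)
mass_kg = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = mass_kg.mean()                   # measure of central tendency
std = mass_kg.std()                     # measure of dispersion (population form, ddof=0)
cv = std / mean                         # coefficient of variation
ratio = mass_kg.max() / mass_kg.min()   # ratios are meaningful because zero is absolute

print(mean, std, cv, ratio)
```

The coefficient of variation and the ratio of two values rely on the absolute zero, so they are not meaningful on an interval scale.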
Summary of the data types and scale measures:
Comparing EDA with classical and Bayesian analysis
Several approaches to data analysis
➢ Classical data analysis
➢ Exploratory data analysis approach
➢ Bayesian data analysis approach
Classical data analysis
➢ This approach includes the problem definition and data collection
step followed by model development, which is followed by analysis
and result communication.
Exploratory data analysis approach
➢ This approach follows the same steps as classical data
analysis, except that the model imposition and the data analysis
steps are swapped.
➢ The main focus is on the data, its structure, outliers, models, and
visualizations.
➢ EDA does not impose any deterministic or probabilistic models on
the data.
Bayesian data analysis approach
➢ This approach incorporates prior probability distribution
knowledge into the analysis steps.
➢ Prior probability distribution of any quantity expresses the belief
about that particular quantity before considering some evidence.
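As a hedged illustration of how a prior belief is combined with evidence, the sketch below uses a coin's probability of heads. The Beta prior, its parameters, and the observed counts are assumptions made for the example; the notes do not prescribe a particular prior.

```python
# Prior belief about the probability of heads, expressed as Beta(a, b)
prior_a, prior_b = 2, 2   # assumed prior: mild belief that the coin is fair
prior_mean = prior_a / (prior_a + prior_b)

# Hypothetical evidence: 8 heads and 2 tails
heads, tails = 8, 2

# Conjugate update: the posterior is Beta(a + heads, b + tails)
post_a, post_b = prior_a + heads, prior_b + tails
posterior_mean = post_a / (post_a + post_b)

print(prior_mean)
print(posterior_mean)
```

The posterior mean lies between the prior mean and the observed share of heads, which shows the prior tempering the evidence.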
Three different approaches for data analysis
➢ It is difficult to estimate which model is best for data analysis.
➢ All of them have their paradigms and are suitable for different
types of data analysis
Software tools available for EDA
➢ Python
• an open-source programming language widely used in data
analysis, data mining, and data science
➢ R programming language
• an open-source programming language that is widely utilized
in statistical computation and graphical data analysis
➢ Weka
• an open-source data mining package that involves several
EDA tools and algorithms
➢ KNIME
• an open-source tool for data analysis that is based on Eclipse
Python tools and packages
NumPy
➢ NumPy is a Python library.
➢ NumPy is short for "Numerical Python".
➢ NumPy is used for working with arrays.
➢ It also has functions for working in the domains of linear algebra,
Fourier transforms, and matrices.
Why use NumPy?
➢ In Python, lists serve the purpose of arrays, but they are slow to
process.
➢ NumPy provides an array object that is up to 50x faster than
traditional Python lists.
➢ The array object in NumPy is called ndarray, and it provides a lot
of support functions.
➢ Arrays are very frequently used in data science
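A minimal sketch contrasting element-wise work on a Python list with the same operation on an ndarray. It shows only the difference in style; no timing is measured here, and the variable names are illustrative.

```python
import numpy as np

values = list(range(1, 6))

# Plain Python: visit each element explicitly
squares_list = [v * v for v in values]

# NumPy: one vectorized expression over the whole ndarray
arr = np.array(values)
squares_arr = arr ** 2

print(squares_list)
print(squares_arr)
print(type(arr).__name__)
```

Both approaches give the same values; the vectorized form leaves the looping to NumPy's compiled code.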
Basic operations of EDA using the NumPy library
#Importing numpy
import numpy as np
#Creating different types of numpy arrays
# Defining 1D array
my1DArray = np.array([1, 8, 27, 64])
print(my1DArray)
# Defining and printing 2D array
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
print(my2DArray)
#Defining and printing 3D array
my3Darray = np.array([[[ 1, 2 , 3 , 4],[ 5 , 6 , 7 ,8]], [[ 1, 2,
3, 4],[ 9, 10, 11, 12]]])
print(my3Darray)
#Displaying basic information, such as the data type, shape, size, and
#strides of a NumPy array
# Print the buffer object that points to the start of the array's data
print(my2DArray.data)
# Print the shape of array
print(my2DArray.shape)
# Print out the data type of the array
print(my2DArray.dtype)
# Print the stride of the array.
print(my2DArray.strides)
#Creating an array using built-in NumPy functions
# Array of ones
ones = np.ones((3,4))
print(ones)
# Array of zeros
zeros = np.zeros((2,3,4),dtype=np.int16)
print(zeros)
# Empty array
emptyArray = np.empty((3,2))
print(emptyArray)
# Full array
fullArray = np.full((2,2),7)
print(fullArray)
# Array of evenly-spaced values
evenSpacedArray = np.arange(10,25,5)
print(evenSpacedArray)
#NumPy arrays and file operations
# Save a numpy array into file
x = np.arange(0.0,50.0,1.0)
np.savetxt('data.out', x, delimiter=',')
# Loading numpy array from text
z = np.loadtxt('data.out', unpack=True)
print(z)
# Loading numpy array using genfromtxt method
my_array2 = np.genfromtxt('data.out',
skip_header=1,
filling_values=-999)
print(my_array2)
#Inspecting NumPy arrays
import numpy as np
my2DArray = np.array([[1, 2, 3, 4], [2, 4, 9, 16], [4, 8, 18, 32]])
# Print the number of `my2DArray`'s dimensions
print(my2DArray.ndim)
# Print the number of `my2DArray`'s elements
print(my2DArray.size)
# Print information about `my2DArray`'s memory layout
print(my2DArray.flags)
# Print the length of one array element in bytes
print(my2DArray.itemsize)
# Print the total consumed bytes by `my2DArray`'s elements
print(my2DArray.nbytes)
Broadcasting
Broadcasting is a mechanism that permits NumPy to operate with
arrays of different shapes when performing arithmetic operations.
# Rule 1: Two dimensions are compatible if they are equal
# Create an array of two dimension
A =np.ones((6, 8))
# Shape of A
print(A.shape)
# Create another array
B = np.random.random((6,8))
# Shape of B
print(B.shape)
# Sum of A and B, here the shape of both the matrix is same.
print(A + B)
# Rule 2: Two dimensions are also compatible when one of the
# dimensions of the array is 1
# Initialize `x`
x = np.ones((3,4))
print(x)
# Check shape of `x`
print(x.shape)
# Initialize `y`
y = np.arange(4)
print(y)
# Check shape of `y`
print(y.shape)
# Subtract `x` and `y`
print(x - y)
# Rule 3: Arrays can be broadcast together if they are compatible in all
# dimensions
x = np.ones((6,8))
y = np.random.random((2, 1, 8))
print(x + y)
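The three rules above can be condensed into a small compatibility check. `can_broadcast` is a hypothetical helper written for this illustration and is not part of NumPy. It compares the shapes from the trailing dimension backwards and treats a missing leading dimension as 1.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible under the broadcasting rules."""
    # Walk both shapes from the last dimension towards the first
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        # A pair of dimensions is compatible if equal or if either one is 1
        if a != b and a != 1 and b != 1:
            return False
    return True

print(can_broadcast((6, 8), (6, 8)))      # Rule 1: equal dimensions
print(can_broadcast((3, 4), (4,)))        # Rule 2: (4,) is treated as (1, 4)
print(can_broadcast((6, 8), (2, 1, 8)))   # Rule 3: compatible in all dimensions
print(can_broadcast((3, 4), (3,)))        # trailing dimensions 4 and 3 conflict

# The broadcast result takes the larger size along each dimension
result = np.ones((6, 8)) + np.ones((2, 1, 8))
print(result.shape)
```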
#NumPy mathematics
# Basic operations (+, -, *, /, %)
x = np.array([[1, 2, 3], [2, 3, 4]])
y = np.array([[1, 4, 9], [2, 3, -2]])
# Add two array
add = np.add(x, y)
print(add)
# Subtract two array
sub = np.subtract(x, y)
print(sub)
# Multiply two array
mul = np.multiply(x, y)
print(mul)
# Divide x, y
div = np.divide(x,y)
print(div)
# Calculated the remainder of x and y
rem = np.remainder(x, y)
print(rem)
#Creating a subset and slice an array using an index
x = np.array([10, 20, 30, 40, 50])
# Select items at index 0 and 1
print(x[0:2])
# Select item at row 0 and 1 and column 1 from 2D array
y = np.array([[ 1, 2, 3, 4], [ 9, 10, 11 ,12]])
print(y[0:2, 1])
# Specifying conditions: select elements greater than or equal to 2
atLeast2 = (y >= 2)
print(y[atLeast2])
Pandas
➢ Pandas is a Python library used for working with data sets.
➢ It has functions for analyzing, cleaning, exploring, and manipulating
data.
Why use Pandas?
➢ Pandas allows us to analyze big data and draw conclusions based
on statistical theories.
➢ Pandas can clean messy data sets, and make them readable and
relevant.
➢ Relevant data is very important in data science.
What can Pandas do?
➢ Pandas gives answers about the data.
• Is there a correlation between two or more columns?
• What is average value?
• Max value?
• Min value?
➢ Pandas is also able to delete rows that are not relevant or that
contain wrong values, such as empty or NULL values. This is
called cleaning the data.
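The questions above can be answered with a few pandas calls. This is a sketch under stated assumptions: the `log` table, its column names, and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical workout log (illustrative values only)
log = pd.DataFrame({
    "duration": [30, 45, 60, np.nan, 90],
    "calories": [150, 225, 300, 200, 450],
})

# Cleaning: drop rows that contain empty (NaN) values
clean = log.dropna()

# Is there a correlation between two columns? (Pearson correlation by default)
correlation = clean["duration"].corr(clean["calories"])

# What are the average, maximum, and minimum values?
average = clean["calories"].mean()
maximum = clean["calories"].max()
minimum = clean["calories"].min()

print(correlation, average, maximum, minimum)
```

Dropping rows is only one way to clean; filling missing values is an alternative when the rows carry other useful information.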
#Creating dataframe from Dictionary
import pandas as pd
mydataset = {
'cars': ["BMW", "Volvo", "Ford"],
'passings': [3, 7, 2]
}
myvar = pd.DataFrame(mydataset)
print(myvar)
# Creating dataframe from a list of dictionaries
import pandas as pd
dict_df = [{'A': 'Apple', 'B': 'Ball'},{'A': 'Aeroplane', 'B':
'Bat', 'C': 'Cat'}]
dict_df = pd.DataFrame(dict_df)
print(dict_df)
# Creating dataframe from Series
import pandas as pd
import numpy as np
series_df = pd.DataFrame({
'A': range(1, 5),
'B': pd.Timestamp('20190526'),
'C': pd.Series(5, index=list(range(4)), dtype='float64'),
'D': np.array([3] * 4, dtype='int64'),
'E': pd.Categorical(["Depression", "Social Anxiety", "Bipolar Disorder",
"Eating Disorder"]),
'F': 'Mental health',
'G': 'is challenging'
})
print(series_df)
# Creating a dataframe from ndarrays
import pandas as pd
import numpy as np
sdf = {
'County':['Østfold', 'Hordaland', 'Oslo', 'Hedmark', 'Oppland',
'Buskerud'],
'ISO-Code':[1,2,3,4,5,6],
'Area': [4180.69, 4917.94, 454.07, 27397.76, 25192.10,
14910.94],
'Administrative centre': ["Sarpsborg", "Oslo", "City of Oslo",
"Hamar", "Lillehammer", "Drammen"]
}
sdf = pd.DataFrame(sdf)
print(sdf)
#Load a dataset from an external source into a pandas DataFrame
import pandas as pd
import numpy as np
columns = ['age', 'workclass', 'fnlwgt', 'education',
           'education_num', 'marital_status', 'occupation', 'relationship',
           'ethnicity', 'gender', 'capital_gain', 'capital_loss',
           'hours_per_week', 'country_of_origin', 'income']
df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=columns)
df.head(10)
#Select rows and columns in any dataframe
# Selects a row
df.iloc[10]
# Selects 10 rows
df.iloc[0:10]
# Selects a range of rows
df.iloc[10:15]
# Selects the last 2 rows
df.iloc[-2:]
# Selects every other row in the columns at positions 3 and 4 (the end index 5 is excluded)
df.iloc[::2, 3:5].head()
#Combine NumPy and pandas to create a dataframe
import pandas as pd
import numpy as np
np.random.seed(24)
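dFrame = pd.DataFrame({'F': np.linspace(1, 10, 10)})
# Append five columns of random values to dFrame
dFrame = pd.concat([dFrame, pd.DataFrame(np.random.randn(10, 5),
                                        columns=list('EDCBA'))],
                   axis=1)
# Introduce a missing value in the first row, third column
dFrame.iloc[0, 2] = np.nan
dFrame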
Output dataframe table
SciPy
➢ SciPy is a scientific computation library that
uses NumPy underneath.
➢ SciPy stands for Scientific Python.
➢ It provides more utility functions for optimization, stats and signal
processing.
➢ Like NumPy, SciPy is open source so we can use it freely.
➢ SciPy has optimized and added functions that are frequently used
in NumPy and Data Science.
Matplotlib
➢ Matplotlib is a low-level graph plotting library in Python that serves
as a visualization utility.
➢ It provides a huge library of customizable plots, along with a
comprehensive set of backends.
➢ It can be utilized to create professional reporting applications,
interactive analytical applications, complex dashboard
applications, web/GUI applications, embedded views, and many
more.