UNIT I - EXPLORATORY DATA ANALYSIS
EDA Fundamentals-Definition and importance of EDA in data
science. Understanding and Making Sense of Data, Software Tools for
EDA-Introduction to popular EDA tools (Pandas, NumPy, Matplotlib,
Seaborn, and Tableau), Visual Aids for EDA-Common plots and charts
used in EDA (histograms, box plots, scatter plots), Data Transformation
Techniques, Grouping and Aggregation
1. EDA Fundamentals
EDA is the process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures. In this unit, we discuss the steps involved in performing sound exploratory data analysis and get our hands dirty using some open source datasets. As mentioned here and in several studies, the primary aim of EDA is to examine what the data can tell us before going through formal modeling or hypothesis formulation.
Data Science
• Data science is at the peak of its hype, and the skills required of data scientists are changing. Data scientists are now expected not only to build a performant model but also to explain the results obtained and use them for business intelligence.
• Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics.
• There are several phases of data analysis, including data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, and data product and communication.
• These phases are similar to the CRoss-Industry Standard Process for Data Mining (CRISP-DM) framework used in data mining.
• The main takeaway here is the stages of EDA, as it
is an important aspect of data analysis and data
mining. Let's understand in brief what these stages
are:
i. Data requirements: There can be various sources of data for an organization. It is important to comprehend what type of data is required to be collected, curated, and stored by the organization.
• For example, an application tracking the sleeping pattern of patients suffering from dementia requires several types of sensor data, such as sleep data, heart rate, electrodermal activity, and user activity patterns.
• All of these data points are required to correctly diagnose the mental state of the person. Hence, they are mandatory requirements for the application.
• In addition to this, it is necessary to categorize the data as numerical or categorical and to define the format of storage and dissemination.
ii. Data collection: Data collected from several sources
must be stored in the correct format and transferred to the
right information technology personnel within a company.
As mentioned previously, data can be collected from
several objects on several events using different types of
sensors and storage tools.
iii. Data processing: Preprocessing involves pre-curating the dataset before the actual analysis.
• Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.
iv. Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be correctly transformed and checked for incompleteness, duplicates, errors, and missing values.
• These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct records, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in missing values.
• However, how do we identify these anomalies in a dataset? Finding such data issues requires us to perform some analytical techniques.
• In brief, data cleaning depends on the type of data under study.
• Hence, it is essential for data scientists or EDA experts to comprehend the different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning, as in the sketch below.
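As a concrete illustration, here is a minimal sketch (with made-up weights) of one common quantitative cleaning technique, the interquartile range (IQR) rule for flagging outliers:

import pandas as pd

# Hypothetical weights with one obvious outlier
weights = pd.Series([62, 65, 70, 68, 250, 72, 66])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]
print(outliers)  # the value 250 is flagged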
v. EDA: Exploratory data analysis is the stage where we
actually start to understand the message contained in the
data. It should be noted that several types of data
transformation techniques might be required during the
process of exploration.
vi. Modeling and algorithm: From a data science
perspective, generalized models or mathematical formulas
can represent or exhibit relationships among different
variables, such as correlation or causation.
• These models or equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying pens, the total price of the pens (Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity), as illustrated in the sketch after this list.
• Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price and the quantity.
• Hence, the total price is referred to as the dependent variable, while the unit price and the quantity are referred to as independent variables.
• In general, a model always describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.
• The Judd model for describing the relationship
between data, model, and error still holds true: Data
= Model + Error.
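A minimal sketch of the pen-pricing model, with illustrative values, shows how the dependent variable is computed from the independent variables and how an error term captures the gap between data and model:

# Illustrative values only
unit_price = 20                  # independent variable: price of one pen
quantity = 5                     # independent variable: number of pens bought
total = unit_price * quantity    # dependent variable: Total = UnitPrice * Quantity
print(total)                     # 100

# Data = Model + Error: a hypothetical observed total may differ from the prediction
observed_total = 103
error = observed_total - total
print(error)                     # 3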
vii. Data Product:
Any computer software that uses data as inputs,
produces outputs, and provides feedback based on the
output to control the environment is referred to as a data
product.
• A data product is generally based on a model
developed during data analysis, for example, a
recommendation model that inputs user purchase
history and recommends a related item that the user
is highly likely to buy.
viii. Communication:
This stage deals with disseminating the results to the end stakeholders so that they can use them for business intelligence.
• One of the most notable steps in this stage is data
visualization. Visualization deals with information
relay techniques such as tables, charts, summary
diagrams, and bar charts to show the analyzed result.
2. Importance of EDA in Data Science
• Different fields of science, economics, engineering,
and marketing accumulate and store data primarily
in electronic databases. Appropriate and well-
established decisions should be made using the data
collected.
• It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides, and to make further decisions, data mining is performed, where we go through distinctive analysis processes.
• Exploratory data analysis is key, and usually the first
exercise in data mining. It allows us to visualize data
to understand it as well as to create hypotheses for
further analysis. The exploratory analysis centers
around creating a synopsis of data or insights for
the next steps in a data mining
project.
• EDA reveals the ground truth about the data without making any underlying assumptions. This is why data scientists use this process to understand what types of modeling and hypotheses can be created.
• Key components of exploratory data analysis
include summarizing data, statistical analysis, and
visualization of data.
• Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations (a minimal sketch using these tools follows this list).
• After understanding the significance of EDA, let's discover the most generic steps involved in EDA.
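The following is a minimal sketch, on a small made-up dataset, of how these tools divide the work of summarizing, statistical analysis, and visualization:

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical dataset
df = pd.DataFrame({'height': [150, 160, 165, 170, 180, 175, 168]})

print(df.describe())                    # pandas: quick numerical summary
print(stats.skew(df['height']))         # scipy: a simple statistical measure
df['height'].plot(kind='hist', bins=5)  # matplotlib (via pandas): visualization
plt.show()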
❖ Steps in EDA
• Having understood what EDA is, and its
significance, let's understand the various steps
involved in data analysis. Basically, it involves four
different steps. Let's go through each of them to get
a brief understanding of each step:
1. Problem definition: Before trying to extract useful
insight from the data, it is essential to define the
business problem to be solved.
• The problem definition works as the driving force
for a data analysis plan execution.
• The main tasks involved in problem definition
are defining the main objective of the analysis,
defining the main deliverables, outlining the main
roles and responsibilities, obtaining the current
status of the data, defining the timetable, and
performing cost/benefit analysis. Based on such a
problem definition, an execution plan can be
created.
2. Data preparation: This step involves methods for
preparing the dataset before actual analysis.
• In this step, we define the sources of data, define
data schemas and tables, understand the main
characteristics of the data, clean the dataset,
delete non-relevant datasets, transform the data, and
divide the data into required chunks for analysis.
3. Data analysis: This is one of the most crucial steps
that deals with descriptive statistics and analysis of the
data.
• The main tasks involve summarizing the data,
finding the hidden correlation and relationships
among the data, developing predictive models,
evaluating the models, and calculating the
accuracies.
• Some of the techniques used for data summarization
are summary tables, graphs, descriptive statistics,
inferential statistics, correlation statistics,
searching, grouping, and mathematical models.
4. Development and representation of the results: This
step involves presenting the dataset to the target audience in
the form of graphs, summary tables, maps, and diagrams.
• This is also an essential step as the result analyzed
from the dataset should be interpretable by the
business stakeholders, which is one of the major
goals of EDA.
• Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
Understanding and making sense of Data
• It is crucial to identify the type of data under
analysis. We will learn about different types of
data that is encountered during analysis.
• Different disciplines store different kinds of data for different purposes. For example, medical researchers store patients' data, universities store students' and teachers' data, and real estate industries store house and building datasets.
• A dataset contains many observations about a
particular object. For instance, a dataset about
patients in a hospital can contain many observations.
• A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable, and each observation has a specific value for each of these variables.
• These datasets are stored in hospitals and are presented
for analysis. Most of this data is stored in some sort of
database management system in tables/schema. An
example of a table for storing patient information is
shown here:
• To summarize the preceding table, there are five observations (001, 002, 003, 004, 005). Each observation is described by the variables PatientID, name, address, dob, email, gender, and weight.
• Most datasets broadly fall into two groups: numerical data and categorical data.
1. Numerical data
• This data has a sense of measurement involved in it;
for example, a person's age, height, weight, blood
pressure, heart rate, temperature, number of teeth,
number of bones, and the number of family
members. This data is often referred to as
quantitative data in statistics.
• A numerical dataset can be either discrete or continuous.
1. Discrete data
• This is data that is countable and its values can be
listed out. For example, if we flip a coin, the number
of heads in 200 coin flips can take values from 0 to
200 (finite) cases.
• A variable that represents a discrete dataset is referred to as a discrete variable. A discrete variable takes a fixed number of distinct values.
• For example, the Country variable can have values
such as Nepal, India, Norway, and Japan. It is
fixed. The Rank variable of a student in a
classroom can take values from 1, 2, 3, 4, 5, and so
on.
2. Continuous data
• A variable that can have an infinite number of numerical values within a specific range is classified as continuous data.
• A variable describing continuous data is a continuous variable. For example, what is the temperature of your city today? Can the possible values be listed out? No; temperature can take any value within a range, so it is continuous.
• Similarly, the weight variable is also a continuous variable.
3. Categorical data
➢ This type of data represents the characteristics of an
object; for example, gender, marital status, type of
address, or categories of the movies. This data is
often referred to as qualitative datasets in statistics.
➢ Some of the most common types of categorical data you can find in a dataset are:
• Gender (Male, Female, Other, or Unknown)
• Marital Status (Divorced, Legally Separated,
Married, Never Married, Unmarried, Widowed, or
Unknown)
• Movie genres (Action, Adventure, Comedy, Crime,
Drama, Fantasy,
Historical, Horror, Mystery, Philosophical,
Political, Saga, Satire, Science Fiction, Social,
Thriller, Urban, or Western)
• Blood type (A, B, AB, or O)
• Types of drugs (Stimulants, Depressants, Hallucinogens,
Dissociatives,
Opioids, Inhalants, or Cannabis)
A variable describing categorical data is referred
to as a categorical variable. These types of
variables can have one of a limited number of
values.
There are different types of categorical variables:
• A binary categorical variable can take exactly
two values and is also referred to as a
dichotomous variable. For example, when we
create an experiment, the result is either success or
failure. Hence, results can be
understood as a binary categorical variable.
• Polytomous variables are categorical variables that
can take more than two possible values. For
example, marital status can have several values,
such as divorced, legally separated, married, never
married, unmarried, widowed and unknown. Since
marital status can take more than two possible
values, it is a polytomous variable.
• Most categorical datasets follow either nominal or ordinal measurement scales; a short pandas sketch of both appears below.
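The sketch below (using hypothetical survey responses) shows how pandas can represent both kinds of categorical variables; the ordered=True flag marks an ordinal variable, while nominal labels carry no order:

import pandas as pd

# Hypothetical Likert-style responses: an ordinal categorical variable
responses = pd.Series(['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree'])
likert = pd.Series(pd.Categorical(
    responses,
    categories=['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree'],
    ordered=True))
print(likert.min(), likert.max())   # order-aware comparisons are allowed

# Blood type: a nominal categorical variable with no meaningful order
blood_type = pd.Categorical(['A', 'O', 'B', 'AB'], ordered=False)
print(blood_type.categories)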
1. Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal, interval, and ratio. These scales are used more in academia and research. Let's understand each of them with some examples.
1. Nominal
Nominal scales are used for labeling variables without any quantitative value. The scales are generally referred to as labels; they are mutually exclusive and do not carry any numerical importance. Let's see some examples:
• What is your gender?
• Male
• Female
• I prefer not to answer
• Other
Other examples include the following:
• The languages that are spoken in a particular country
• Biological species
• Parts of speech in grammar (noun, pronoun,
adjective, and so on)
• Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)
• Nominal scales are considered qualitative scales
and the measurements that are taken using
qualitative scales are considered qualitative data.
2. Ordinal
• The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a significant factor.
• Let's check an example of an ordinal scale using the Likert scale: "WordPress is making content managers' lives easier." How do you feel about this statement?
• The answer to this question is scaled down to five different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree.
• Scales like these are referred to as Likert scales; similar five-point scales are commonly used for questions about satisfaction, importance, and frequency.
To make it easier, consider ordinal scales as an order of ranking (1st,
2nd, 3rd, 4th, and so on). The median item is allowed as the measure
of central tendency; however, the average is not permitted.
3. Interval
• In interval scales, both the order and the exact differences between the values are significant.
• Interval scales are widely used in statistics, for
example, in the measure of central tendencies—
mean, median, mode, and standard deviations.
• Examples include location in Cartesian coordinates
and direction measured in degrees from magnetic
north. The mean, median, and mode are allowed on
interval data.
4. Ratio
• Ratio scales contain order, exact values, and absolute zero,
which makes it
possible to be used in descriptive and inferential statistics.
• These scales provide numerous possibilities for statistical
analysis.
• Mathematical operations, the measure of central
tendencies, and the
measure of dispersion and coefficient of variation
can also be computed from such scales.
• Examples include a measure of energy, mass,
length, duration, electrical energy, plan angle, and
volume. The following table gives a summary of
the data types and scale measures:
Software tools for EDA
There are several software tools available for Exploratory Data Analysis
(EDA) that help in analyzing and visualizing data. Here are some popular
ones:
1. Python Libraries
Pandas: Used for data manipulation and analysis, providing data
structures like DataFrames.
Matplotlib: A plotting library that provides static, animated, and
interactive visualizations.
Seaborn: Built on Matplotlib, it offers a high-level interface for drawing
attractive statistical graphics.
Plotly: An interactive graphing library that makes it easy to create plots,
including line charts, scatter plots, and histograms.
NumPy: Provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions.
Scikit-learn: Offers tools for data preprocessing, as well as simple
visualizations like feature importance plots.
2. R Libraries
ggplot2: A powerful data visualization package for creating complex
plots.
dplyr: A grammar of data manipulation, providing a consistent set of
verbs for data manipulation.
tidyverse: A collection of R packages designed for data science,
including ggplot2, dplyr, tidyr, and others.
Shiny: Allows you to build interactive web apps straight from R.
3. Standalone Software
Tableau: A leading data visualization tool that allows users to create a
wide variety of charts and dashboards.
Power BI: A business analytics tool by Microsoft that provides
interactive visualizations and business intelligence capabilities.
Excel: A widely-used spreadsheet tool that offers basic to advanced data
analysis and visualization capabilities.
RapidMiner: An advanced data science platform that includes tools for
data prep, machine learning, and EDA.
KNIME: A data analytics, reporting, and integration platform that
supports data blending, EDA, and more.
Orange: A data visualization and analysis tool for both novices and
experts in data science.
4. Big Data Tools
Apache Spark: A big data processing framework that can be used for
large-scale EDA.
Hadoop: A framework that allows for distributed processing of large data
sets across clusters of computers.
5. Interactive Notebooks
Jupyter Notebook: An open-source web application that allows you to
create and share documents that contain live code, equations,
visualizations, and narrative text.
Google Colab: A free Jupyter notebook environment that runs in the
cloud and supports Python code execution.
These tools enable data scientists and analysts to explore data, identify
patterns, and extract meaningful insights, often through visualizations
and statistical summaries.
Introduction to the popular EDA tools (Pandas, NumPy, Matplotlib, Seaborn, and Tableau)
1. Pandas
Pandas is a Python library that makes it easy to work with data,
especially in tables (like spreadsheets).
Advantage:
It helps you clean, organize, and analyze data quickly. You can
filter data, calculate statistics, and handle missing values with just
a few commands.
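A minimal sketch (with a made-up table) of the kind of one-line operations Pandas offers for summarizing, filtering, and handling missing values:

import pandas as pd
import numpy as np

# Hypothetical table with a missing score
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'score': [85, np.nan, 92]})

print(df.describe())                                   # quick summary statistics
print(df[df['score'] > 80])                            # filter rows
df['score'] = df['score'].fillna(df['score'].mean())   # fill the missing value
print(df)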
2. NumPy
NumPy is a Python library that focuses on numerical operations,
especially with large datasets.
Advantage:
It provides support for working with arrays (lists of numbers) and
matrices, making calculations faster and more efficient. It's the
foundation for many other data science tools.
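A minimal sketch (with made-up numbers) of NumPy's vectorized array operations:

import numpy as np

# Hypothetical daily temperatures in Celsius
temps = np.array([21.5, 23.0, 19.8, 22.4, 24.1])

print(temps.mean())       # average temperature
print(temps.max())        # hottest day
print(temps * 1.8 + 32)   # vectorized conversion to Fahrenheit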
3. Matplotlib
Matplotlib is a plotting library for creating static, 2D graphs and
charts in Python.
Advantage:
It allows you to create a wide variety of visualizations, like line
charts, bar graphs, and scatter plots, to help you understand and
present your data.
4. Seaborn
Seaborn is built on top of Matplotlib and provides a higher-level
interface for making attractive and informative statistical graphics.
Advantage:
It simplifies the process of creating complex visualizations, like
heatmaps and violin plots, making it easier to explore
relationships between different parts of your data.
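A minimal sketch (with a made-up numeric table) of one such Seaborn visualization, a correlation heatmap:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric dataset
df = pd.DataFrame({'height': [150, 160, 170, 180, 190],
                   'weight': [50, 60, 65, 80, 90],
                   'age': [20, 25, 30, 35, 40]})

# Heatmap of pairwise correlations between the numeric columns
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation heatmap')
plt.show()

# A violin plot is just as short, for example: sns.violinplot(y=df['weight'])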
5. Tableau
Tableau is a powerful data visualization tool that allows you to
create interactive and shareable dashboards.
Advantage:
It is user-friendly and doesn't require coding. You can connect to
different data sources, drag and drop to create visuals, and explore
your data through interactive charts and graphs.
Visual Aids for EDA
1. Line chart
• We will use the matplotlib library and the stock price data to
plot time series lines. First of all, let's understand the dataset.
• We have created a function using the radar Python library to generate random dates and prices for the dataset.
• It is the simplest possible dataset, with just two columns. The first column is Date and the second column is Price, indicating the stock price on that date.
• Let's generate the dataset by calling the helper method and save it as a CSV file. We can optionally load the CSV file using the pandas read_csv function and proceed with visualization (a sketch of these steps follows the explanation below). The generateData function is defined here:
Example Source code:
import datetime
import random
import pandas as pd
import radar

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    for _ in range(n):
        # Generate a random date within the specified range
        date = radar.random_datetime(start=start, stop=end).strftime("%Y-%m-%d")
        # Generate a random price between 900 and 1000
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    # Create a DataFrame from the list of data
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    # Convert the 'Date' column to datetime format
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Group the data by date and take the mean of the prices
    df = df.groupby(by='Date').mean()
    return df
Explanation:
Imports: The necessary libraries are imported at the beginning.
generateData function:
o The function takes an integer n as input, which determines the number
of data points to generate.
o It initializes a list listdata to store the generated data.
o A random date is generated using radar.random_datetime, and a
random price is generated between 900 and 1000.
o The data is appended to the listdata list.
o The data is converted to a Pandas DataFrame.
o The 'Date' column is converted to a datetime object.
o The data is grouped by date, and the average price for each date is
computed.
Output: The function returns a DataFrame indexed by date, with the mean price for each date.
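As a sketch of the remaining steps, the code below calls generateData, saves the result to a CSV file (the filename is arbitrary), and draws the stock price as a time-series line chart:

import matplotlib.pyplot as plt

df = generateData(50)            # generate 50 random stock-price records
df.to_csv('stock_prices.csv')    # save the dataset for later use
# Optionally reload it before plotting:
# df = pd.read_csv('stock_prices.csv', parse_dates=['Date'], index_col='Date')

plt.plot(df.index, df['Price'])
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.title('Stock price over time')
plt.show()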
Steps involved in plotting a simple line chart:
Sample Source Code:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create the plot
plt.plot(x, y, marker='o')
# Add title and labels
plt.title("Simple Line Graph")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
# Show the plot
plt.show()
EXPLANATION:
matplotlib.pyplot: This library is used to create static, animated, and
interactive visualizations in Python.
Data: The x and y lists represent the data points to be plotted on the
graph.
plt.plot(): This function plots the line graph. The marker='o' argument
adds markers at each data point.
plt.title(), plt.xlabel(), plt.ylabel(): These functions add a title and
labels to the x and y axes.
plt.show(): This function displays the graph.
OUTPUT: (a simple line graph of the five sample points, with circular markers)
2. Bar charts
• This is one of the most common types of visualization that
almost everyone must have encountered. Bars can be drawn
horizontally or vertically to represent categorical variables.
• Bar charts are frequently used to distinguish objects between
distinct collections in order to track variations over time.
• In most cases, bar charts are very convenient when the changes are
large.
• In order to learn about bar charts, we can use the calendar Python library to keep track of the months of the year (1 to 12), corresponding to January to December:
Example Source Code:
import numpy as np
import calendar
import matplotlib.pyplot as plt
import random

# Step 2: Set up the data
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

# Step 3: Specify the layout of the figure and allocate space
figure, axis = plt.subplots()

# Step 4: Display the names of the months on the x-axis
plt.xticks(months, calendar.month_name[1:13], rotation=20)

# Step 5: Plot the graph
plot = axis.bar(months, sold_quantity)

# Step 6: Display data values on the head of the bars (optional)
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')

# Step 7: Display the graph on the screen
plt.show()
EXPLANATION
1. Imports:
o numpy, calendar, and matplotlib.pyplot are imported along with
random.
2. Set up Data:
o months is a list of integers from 1 to 12.
o sold_quantity generates random integers between 100 and 200 for
each month.
3. Figure and Axis Layout:
o figure, axis = plt.subplots() creates a figure and an axis for plotting.
4. Customize X-axis:
o plt.xticks() sets custom tick labels on the x-axis to display month
names.
5. Plot the Graph:
o axis.bar() creates a bar chart using the data.
6. Annotate Bars (Optional):
o A loop iterates through the bars and places the height value on top of
each bar for clarity.
7. Display the Graph:
o plt.show() renders the graph.
Running this code will display a bar chart showing the number of items sold for
each month, with the month names on the x-axis and the sold quantity on the y-
axis. The numbers on top of the bars represent the exact quantities sold.
OUTPUT: (a bar chart of the sold quantity per month, with the exact quantity printed above each bar)
3. Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter diagrams. They use a Cartesian coordinate system to display values of typically two variables for a set of data.
When should we use a scatter plot? Scatter plots can be
constructed in the following two situations:
• When one continuous variable is dependent on another
variable, which is under the control of the observer
• When both continuous variables are independent
There are two important concepts—independent variable and
dependent variable.
• In statistical modeling or mathematical modeling, the values of
dependent variables rely on the values of independent
variables.
• The dependent variable is the outcome variable being studied.
• The independent variables are also referred to as regressors.
• Scatter plots are used when we need to show the relationship between two variables, and hence they are sometimes referred to as correlation plots.
Example Source Code:
import matplotlib.pyplot as plt
import pandas as pd

# Example data creation
data = {
    # Age in months
    'age': [0, 6, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180],
    # Minimum recommended hours of sleep
    'min_recommended': [14, 14, 14, 13, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 7, 7],
    # Maximum recommended hours of sleep
    'max_recommended': [17, 17, 17, 15, 14, 14, 13, 13, 12, 12, 12, 11, 11, 10, 10, 9, 9]
}

# Creating a DataFrame
sleepDf = pd.DataFrame(data)

# Scatter plot for minimum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
            color='green', label='Min Recommended')

# Scatter plot for maximum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
            color='red', label='Max Recommended')

# Labeling the x-axis (age in years)
plt.xlabel('Age of person in Years')

# Labeling the y-axis (total hours of sleep required)
plt.ylabel('Total hours of sleep required')

# Adding a title to the plot
plt.title('Recommended Sleep Hours by Age')

# Adding a legend to distinguish between the points
plt.legend()

# Display the plot
plt.show()
Explanation:
1. Data Creation:
The data dictionary holds age in months and the corresponding min_recommended and max_recommended sleep hours.
2. Creating the Scatter Plots:
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
color='green', label='Min Recommended'):
Creates a scatter plot of minimum recommended sleep hours (dependent
variable) against age in years (independent variable).
The points are colored green.
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
color='red', label='Max Recommended'):
Creates a scatter plot of maximum recommended sleep hours
(dependent variable) against age in years (independent variable).
The points are colored red.
3. Labeling and Titles:
The x-axis is labeled "Age of person in Years" and the y-axis is
labeled "Total hours of sleep required."
The plot is titled "Recommended Sleep Hours by Age."
4. Legend:
plt.legend() adds a legend to differentiate between the minimum and
maximum recommended sleep hours.
5. Display:
plt.show() renders the scatter plot.
Result:
This code will generate a scatter plot with two sets of points:
Green points represent the minimum recommended sleep hours for
each age.
Red points represent the maximum recommended sleep hours for
each age.
The x-axis represents the age in years, and the y-axis represents the
total hours of sleep required.
This type of scatter plot is useful for observing how the recommended
sleep hours vary as a person ages, making it easy to compare the minimum
and maximum recommendations.
OUTPUT: (a scatter plot of minimum and maximum recommended sleep hours versus age in years)
Data Transformation Techniques
Data Transformation
Data transformation is a set of techniques used to convert data from
one format or structure to another format or structure. The following are
some examples of transformation activities:
Data deduplication involves the identification of duplicates and their removal.
Key restructuring involves transforming any keys with built-in meanings into generic keys.
Data cleansing involves deleting out-of-date, inaccurate, and incomplete information from the source data without changing its meaning, in order to enhance its accuracy.
Data validation is a process of formulating rules or algorithms that help in validating different types of data against some known issues.
Format revisioning involves converting from one format to another.
Data derivation consists of creating a set of rules to generate more information from the data source.
Data aggregation involves searching, extracting, summarizing, and preserving important information in different types of reporting systems.
Data integration involves converting different data types and merging them into a common structure or schema.
Data filtering involves identifying information relevant to any particular user.
Data joining involves establishing a relationship between two or more tables.
The main reason for transforming the data is to get a better representation
such that the transformed data is compatible with other data.
In addition to this, interoperability in a system can be achieved by following
a common data structure and format.
1. Data Deduplication
Explanation:
Data deduplication involves identifying and removing duplicate entries in a dataset.
Sample Source Code:
import pandas as pd
# Create a DataFrame with duplicate rows
data = {'ID': [1, 2, 2, 4, 5],
'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 30, 35, 40]}
df = pd.DataFrame(data)
# Drop duplicate rows
df_dedup = df.drop_duplicates()
print("DataFrame after deduplication:")
print(df_dedup)
OUTPUT:
DataFrame after deduplication:
ID Name Age
0 1 Alice 25
1 2 Bob 30
3 4 Charlie 35
4 5 David 40
2. Key Restructuring
Explanation:
Transforming keys with specific meanings to generic keys.
Example Python Program:
import pandas as pd
# Create a DataFrame with meaningful keys
data = {'EmployeeID': [1, 2, 3],
'EmployeeName': ['Alice', 'Bob', 'Charlie'],
'EmployeeAge': [25, 30, 35]}
df = pd.DataFrame(data)
# Rename columns to generic keys
df_restructured = df.rename(columns={'EmployeeID': 'ID',
'EmployeeName': 'Name', 'EmployeeAge': 'Age'})
print("DataFrame after key restructuring:")
print(df_restructured)
OUTPUT:
DataFrame after key restructuring:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
3. Data Cleansing
Explanation: Removing inaccurate or incomplete information.
Example Python Program:
import pandas as pd
# Create a DataFrame with some incorrect data
data = {'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 35, 40]}
df = pd.DataFrame(data)
# Drop rows with missing values
df_cleaned = df.dropna()
print("DataFrame after data cleansing:")
print(df_cleaned)
OUTPUT:
DataFrame after data cleansing:
Name Age
0 Alice 25.0
2 Charlie 35.0
4. Data Validation
Explanation: Validating data against certain rules.
Example Python Program:
import pandas as pd
# Create a DataFrame
data = {'Age': [25, 30, -5, 35]} # -5 is an invalid age
df = pd.DataFrame(data)
# Validate ages (must be non-negative)
df_validated = df[df['Age'] >= 0]
print("DataFrame after data validation:")
print(df_validated)
OUTPUT:
DataFrame after data validation:
Age
0 25
1 30
3 35
5. Format Revisioning
Explanation: Converting data from one format to another.
Example Source Code:
import pandas as pd
# Create a DataFrame
data = {'Date': ['2024-01-01', '2024-02-01']}
df = pd.DataFrame(data)
# Convert Date column from string to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("DataFrame after format revisioning:")
print(df)
OUTPUT:
DataFrame after format revisioning:
Date
0 2024-01-01
1 2024-02-01
6. Data Derivation
Explanation: Creating new information from existing data.
Example Source code:
import pandas as pd
# Create a DataFrame
data = {'Sales': [100, 200, 300]}
df = pd.DataFrame(data)
# Derive a new column with a 10% increase in sales
df['SalesIncrease'] = df['Sales'] * 1.10
print("DataFrame after data derivation:")
print(df)
OUTPUT:
DataFrame after data derivation:
Sales SalesIncrease
0 100 110.0
1 200 220.0
2 300 330.0
7. Data Aggregation
Explanation: Summarizing data, e.g., calculating total or average.
Example Python Program:
import pandas as pd
# Create a DataFrame
data = {'Category': ['A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 250]}
df = pd.DataFrame(data)
# Aggregate data by category
df_aggregated = df.groupby('Category').sum()
print("DataFrame after data aggregation:")
print(df_aggregated)
Output:
DataFrame after data aggregation:
Sales
Category
A 250
B 450
8. Data Integration
Explanation: Combining different datasets into one.
Example Python Program:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Age': [25, 30, 35]})
# Merge DataFrames on 'ID'
df_integrated = pd.merge(df1, df2, on='ID')
print("DataFrame after data integration:")
print(df_integrated)
Output:
DataFrame after data integration:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
9. Data Filtering
Explanation: Extracting specific information based on conditions.
Example Python Program:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
# Filter data for people older than 30
df_filtered = df[df['Age'] > 30]
print("DataFrame after data filtering:")
print(df_filtered)
Output:
DataFrame after data filtering:
Name Age
2 Charlie 35
3 David 40
10. Data Joining
Explanation: Establishing relationships between tables based on common
keys.
Example Python Program:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 45]})
# Perform a join operation (left join)
df_joined = pd.merge(df1, df2, on='ID', how='left')
print("DataFrame after data joining:")
print(df_joined)
Output:
DataFrame after data joining:
ID Name Age
0 1 Alice 25.0
1 2 Bob 30.0
2 3 Charlie NaN
In this example, the left join includes all records from df1 and matches
records from df2 based on the ID column. Charlie does not have a
corresponding ID in df2, so Age is NaN.
These examples demonstrate various data transformation techniques and
how they can be implemented using Python and pandas.
Benefits of data transformation
• Data transformation promotes interoperability between several
applications. The main reason for creating a similar format and
structure in the dataset is that it becomes compatible with other
systems.
• Comprehensibility for both humans and computers is improved
when using better-organized data compared to messier data.
• Data transformation ensures a higher degree of data quality and protects applications from several computational challenges such as null values, unexpected duplicates, and incorrect indexing, as well as incompatible structures or formats.
• Data transformation ensures higher performance and scalability for modern analytical databases and dataframes.
Challenges
The process of data transformation can be challenging for several reasons:
• It requires a qualified team of experts and state-of-the-art infrastructure. The cost of acquiring such experts and infrastructure can increase the cost of the operation.
• Data cleaning is required before data transformation and data migration, and this cleansing process can be expensive and time-consuming.
• Generally, the activities of data transformation involve batch processing. This means that sometimes we might have to wait for a day before the next batch of data is ready for cleansing. This can be very slow.
Grouping and aggregation
Grouping and aggregation are powerful techniques for summarizing and visualizing data.
These techniques help you analyze data by grouping it into categories and computing
aggregate statistics for each category. In Python, the pandas library is commonly used for
these operations, and matplotlib or seaborn is used for visualization.
Here's a detailed guide on how to perform grouping and aggregation in data visualization
using Python:
Example Dataset
Let's start with a sample dataset to demonstrate these techniques. We'll use a DataFrame
with sales data across different regions and products.
import pandas as pd
# Sample data
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [200, 150, 300, 250, 180, 210, 320, 270],
    'Quantity': [20, 15, 30, 25, 18, 21, 32, 27]
}
df = pd.DataFrame(data)
1. Grouping and Aggregation
Group by Region and Aggregate Sales
# Group by 'Region' and aggregate sales (sum and average)
grouped_df = df.groupby('Region').agg({
'Sales': ['sum', 'mean'],
'Quantity': ['sum', 'mean']
})
print("Grouped and Aggregated DataFrame:")
print(grouped_df)
Output:
Grouped and Aggregated DataFrame:
Sales Quantity
sum mean sum mean
Region
East 620 310.0 62 31.0
North 380 190.0 38 19.0
South 360 180.0 36 18.0
West 520 260.0 52 26.0
2. Visualization with Matplotlib
Bar Plot for Total Sales by Region
import matplotlib.pyplot as plt
# Group by 'Region' and aggregate sales (sum)
sales_by_region = df.groupby('Region')['Sales'].sum()
# Plot
sales_by_region.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.show()
Explanation: This creates a bar plot showing the total sales for each region.
Bar Plot for Average Sales and Quantity by Product
# Group by 'Product' and aggregate sales and quantity (mean)
agg_by_product = df.groupby('Product').agg({
'Sales': 'mean',
'Quantity': 'mean'
})
# Plot
agg_by_product.plot(kind='bar')
plt.title('Average Sales and Quantity by Product')
plt.xlabel('Product')
plt.ylabel('Average Value')
plt.show()
Explanation: This creates a bar plot for the average sales and quantity of each product.
3. Visualization with Seaborn
Seaborn provides a high-level interface for creating attractive and informative statistical
graphics.
Sample Source code:
#Box Plot for Sales Distribution by Region
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the sales distribution for each region (uses the df created above)
sns.boxplot(x='Region', y='Sales', data=df)
plt.title('Sales Distribution by Region')
plt.xlabel('Region')
plt.ylabel('Sales')
plt.show()
Explanation: This creates a box plot that shows the distribution of sales across different
regions, highlighting the spread and outliers.
Grouping and Aggregation: Use pandas to group data and compute aggregate
statistics like sum, mean, etc.
Visualization with Matplotlib and Seaborn: Use matplotlib and seaborn to create
visualizations that make the aggregated data easier to interpret.
These techniques are essential for understanding patterns and trends in your data,
enabling better decision-making based on summarized information.