UNIT I - EXPLORATORY DATA ANALYSIS
EDA Fundamentals-Definition and importance of EDA in data
science. Understanding and Making Sense of Data, Software Tools for
EDA-Introduction to popular EDA tools (Pandas, NumPy, Matplotlib,
Seaborn, and Tableau), Visual Aids for EDA-Common plots and charts
used in EDA (histograms, box plots, scatter plots), Data Transformation
Techniques, Grouping and Aggregation
1. EDA Fundamentals
EDA is the process of examining the available dataset to discover patterns, spot anomalies, test hypotheses, and check assumptions using statistical measures. In this unit, we discuss the steps involved in performing sound exploratory data analysis and get our hands dirty using some open source datasets. As mentioned here and in several studies, the primary aim of EDA is to examine what the data can tell us before going through formal modeling or hypothesis formulation.
Data Science
• Data science is at the peak of its hype, and the skills required of data scientists are changing. Data scientists are now expected not only to build a performant model but also to explain the results obtained and use them for business intelligence.
• Data science involves cross-disciplinary knowledge from computer science, data, statistics, and mathematics.
• There are several phases of data analysis, including data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, and data product and communication.
• These phases are similar to the CRoss-Industry Standard Process for Data Mining (CRISP-DM) framework used in data mining.
• The main takeaway here is the stages of EDA, as it
is an important aspect of data analysis and data
mining. Let's understand in brief what these stages
are:
i. Data requirements: There can be various sources of data for an organization. It is important to comprehend what type of data is required to be collected, curated, and stored by the organization.
• For example, an application tracking the sleeping pattern of patients suffering from dementia requires several types of sensor data, such as sleep data, heart rate, electrodermal activity, and user activity patterns.
• All of these data points are required to correctly diagnose the mental state of the person. Hence, they are mandatory requirements for the application.
• In addition to this, it is necessary to categorize the data as numerical or categorical and to define the format of storage and dissemination.
ii. Data collection: Data collected from several sources
must be stored in the correct format and transferred to the
right information technology personnel within a company.
As mentioned previously, data can be collected from
several objects on several events using different types of
sensors and storage tools.
iii. Data processing: Preprocessing involves pre-curating the dataset before the actual analysis.
• Common tasks involve correctly exporting the dataset, placing it under the right tables, structuring it, and exporting it in the correct format.
iv. Data cleaning: Preprocessed data is still not ready for detailed analysis. It must be correctly transformed and checked for incompleteness, duplicates, errors, and missing values.
• These tasks are performed in the data cleaning stage, which involves responsibilities such as matching the correct records, finding inaccuracies in the dataset, understanding the overall data quality, removing duplicate items, and filling in missing values.
• However, how do we identify these anomalies in a dataset? Finding such data issues requires us to perform some analytical techniques.
• In brief, data cleaning depends on the type of data under study.
• Hence, it is essential for data scientists or EDA experts to comprehend the different types of datasets. An example of data cleaning would be using outlier detection methods for quantitative data cleaning, as in the sketch below.
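As a concrete illustration, here is a minimal sketch (with made-up weights) of one common quantitative cleaning technique, the interquartile range (IQR) rule for flagging outliers:

import pandas as pd

# Hypothetical weights with one obvious outlier
weights = pd.Series([62, 65, 70, 68, 250, 72, 66])

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]
print(outliers)  # the value 250 is flagged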
v. EDA: Exploratory data analysis is the stage where we
actually start to understand the message contained in the
data. It should be noted that several types of data
transformation techniques might be required during the
process of exploration.
vi. Modeling and algorithm: From a data science
perspective, generalized models or mathematical formulas
can represent or exhibit relationships among different
variables, such as correlation or causation.
• These models or equations involve one or more variables that depend on other variables to cause an event.
• For example, when buying pens, the total price of the pens (Total) = price for one pen (UnitPrice) * the number of pens bought (Quantity), as illustrated in the sketch after this list.
• Hence, our model would be Total = UnitPrice * Quantity. Here, the total price depends on the unit price and the quantity.
• Hence, the total price is referred to as the dependent variable, while the unit price and the quantity are referred to as independent variables.
• In general, a model always describes the relationship between independent and dependent variables. Inferential statistics deals with quantifying relationships between particular variables.
• The Judd model for describing the relationship
between data, model, and error still holds true: Data
= Model + Error.
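A minimal sketch of the pen-pricing model, with illustrative values, shows how the dependent variable is computed from the independent variables and how an error term captures the gap between data and model:

# Illustrative values only
unit_price = 20                  # independent variable: price of one pen
quantity = 5                     # independent variable: number of pens bought
total = unit_price * quantity    # dependent variable: Total = UnitPrice * Quantity
print(total)                     # 100

# Data = Model + Error: a hypothetical observed total may differ from the prediction
observed_total = 103
error = observed_total - total
print(error)                     # 3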
vii. Data Product:
Any computer software that uses data as inputs,
produces outputs, and provides feedback based on the
output to control the environment is referred to as a data
product.
• A data product is generally based on a model
developed during data analysis, for example, a
recommendation model that inputs user purchase
history and recommends a related item that the user
is highly likely to buy.
viii. Communication:
This stage deals with disseminating the results to the end stakeholders so that they can use them for business intelligence.
• One of the most notable steps in this stage is data
visualization. Visualization deals with information
relay techniques such as tables, charts, summary
diagrams, and bar charts to show the analyzed result.
2. Importance of EDA in Data Science
• Different fields of science, economics, engineering,
and marketing accumulate and store data primarily
in electronic databases. Appropriate and well-
established decisions should be made using the data
collected.
• It is practically impossible to make sense of datasets containing more than a handful of data points without the help of computer programs. To be certain of the insights that the collected data provides, and to make further decisions, data mining is performed, where we go through distinctive analysis processes.
• Exploratory data analysis is key, and usually the first
exercise in data mining. It allows us to visualize data
to understand it as well as to create hypotheses for
further analysis. The exploratory analysis centers
around creating a synopsis of data or insights for
the next steps in a data mining
project.
• EDA reveals the ground truth about the data without making any underlying assumptions. This is why data scientists use this process to understand what types of modeling and hypotheses can be created.
• Key components of exploratory data analysis
include summarizing data, statistical analysis, and
visualization of data.
• Python provides expert tools for exploratory analysis, with pandas for summarizing; scipy, along with others, for statistical analysis; and matplotlib and plotly for visualizations (a minimal sketch using these tools follows this list).
• After understanding the significance of EDA, let's discover the most generic steps involved in EDA.
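The following is a minimal sketch, on a small made-up dataset, of how these tools divide the work of summarizing, statistical analysis, and visualization:

import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical dataset
df = pd.DataFrame({'height': [150, 160, 165, 170, 180, 175, 168]})

print(df.describe())                    # pandas: quick numerical summary
print(stats.skew(df['height']))         # scipy: a simple statistical measure
df['height'].plot(kind='hist', bins=5)  # matplotlib (via pandas): visualization
plt.show()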
❖ Steps in EDA
• Having understood what EDA is, and its
significance, let's understand the various steps
involved in data analysis. Basically, it involves four
different steps. Let's go through each of them to get
a brief understanding of each step:
1. Problem definition: Before trying to extract useful
insight from the data, it is essential to define the
business problem to be solved.
• The problem definition works as the driving force
for a data analysis plan execution.
• The main tasks involved in problem definition
are defining the main objective of the analysis,
defining the main deliverables, outlining the main
roles and responsibilities, obtaining the current
status of the data, defining the timetable, and
performing cost/benefit analysis. Based on such a
problem definition, an execution plan can be
created.
2. Data preparation: This step involves methods for
preparing the dataset before actual analysis.
• In this step, we define the sources of data, define
data schemas and tables, understand the main
characteristics of the data, clean the dataset,
delete non-relevant datasets, transform the data, and
divide the data into required chunks for analysis.
3. Data analysis: This is one of the most crucial steps
that deals with descriptive statistics and analysis of the
data.
• The main tasks involve summarizing the data,
finding the hidden correlation and relationships
among the data, developing predictive models,
evaluating the models, and calculating the
accuracies.
• Some of the techniques used for data summarization
are summary tables, graphs, descriptive statistics,
inferential statistics, correlation statistics,
searching, grouping, and mathematical models.
4. Development and representation of the results: This
step involves presenting the dataset to the target audience in
the form of graphs, summary tables, maps, and diagrams.
• This is also an essential step as the result analyzed
from the dataset should be interpretable by the
business stakeholders, which is one of the major
goals of EDA.
• Most of the graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
Understanding and making sense of Data
• It is crucial to identify the type of data under
analysis. We will learn about different types of
data that is encountered during analysis.
• Different disciplines store different kinds of data for different purposes. For example, medical researchers store patients' data, universities store students' and teachers' data, and real estate industries store house and building datasets.
• A dataset contains many observations about a
particular object. For instance, a dataset about
patients in a hospital can contain many observations.
• A patient can be described by a patient identifier (ID), name, address, weight, date of birth, email, and gender. Each of these features that describes a patient is a variable, and each observation has a specific value for each of these variables.
• These datasets are stored in hospitals and are presented
for analysis. Most of this data is stored in some sort of
database management system in tables/schema. An
example of a table for storing patient information is
shown here:
• To summarize the preceding table, there are five observations (001, 002, 003, 004, 005). Each observation is described by the variables PatientID, name, address, dob, email, gender, and weight.
• Most datasets broadly fall into two groups: numerical data and categorical data.
1. Numerical data
• This data has a sense of measurement involved in it;
for example, a person's age, height, weight, blood
pressure, heart rate, temperature, number of teeth,
number of bones, and the number of family
members. This data is often referred to as
quantitative data in statistics.
• A numerical dataset can be either discrete or continuous.
1. Discrete data
• This is data that is countable and its values can be
listed out. For example, if we flip a coin, the number
of heads in 200 coin flips can take values from 0 to
200 (finite) cases.
• A variable that represents a discrete dataset is referred to as a discrete variable. A discrete variable takes a fixed number of distinct values.
• For example, the Country variable can have values
such as Nepal, India, Norway, and Japan. It is
fixed. The Rank variable of a student in a
classroom can take values from 1, 2, 3, 4, 5, and so
on.
2. Continuous data
• A variable that can have an infinite number of numerical values within a specific range is classified as continuous data.
• A variable describing continuous data is a continuous variable. For example, what is the temperature of your city today? Can the possible values be listed out? No; temperature can take any value within a range, so it is continuous.
• Similarly, the weight variable is also a continuous variable.
3. Categorical data
➢ This type of data represents the characteristics of an
object; for example, gender, marital status, type of
address, or categories of the movies. This data is
often referred to as qualitative datasets in statistics.
➢ Some of the most common types of categorical data you can find in a dataset are:
• Gender (Male, Female, Other, or Unknown)
• Marital Status (Divorced, Legally Separated,
Married, Never Married, Unmarried, Widowed, or
Unknown)
• Movie genres (Action, Adventure, Comedy, Crime,
Drama, Fantasy,
Historical, Horror, Mystery, Philosophical,
Political, Saga, Satire, Science Fiction, Social,
Thriller, Urban, or Western)
• Blood type (A, B, AB, or O)
• Types of drugs (Stimulants, Depressants, Hallucinogens,
Dissociatives,
Opioids, Inhalants, or Cannabis)
A variable describing categorical data is referred
to as a categorical variable. These types of
variables can have one of a limited number of
values.
There are different types of categorical variables:
• A binary categorical variable can take exactly
two values and is also referred to as a
dichotomous variable. For example, when we
create an experiment, the result is either success or
failure. Hence, results can be
understood as a binary categorical variable.
• Polytomous variables are categorical variables that
can take more than two possible values. For
example, marital status can have several values,
such as divorced, legally separated, married, never
married, unmarried, widowed and unknown. Since
marital status can take more than two possible
values, it is a polytomous variable.
• Most categorical datasets follow either nominal or ordinal measurement scales; a short pandas sketch of both appears below.
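The sketch below (using hypothetical survey responses) shows how pandas can represent both kinds of categorical variables; the ordered=True flag marks an ordinal variable, while nominal labels carry no order:

import pandas as pd

# Hypothetical Likert-style responses: an ordinal categorical variable
responses = pd.Series(['Agree', 'Neutral', 'Strongly Agree', 'Disagree', 'Agree'])
likert = pd.Series(pd.Categorical(
    responses,
    categories=['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree'],
    ordered=True))
print(likert.min(), likert.max())   # order-aware comparisons are allowed

# Blood type: a nominal categorical variable with no meaningful order
blood_type = pd.Categorical(['A', 'O', 'B', 'AB'], ordered=False)
print(blood_type.categories)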
1. Measurement scales
There are four different types of measurement scales described in statistics: nominal, ordinal, interval, and ratio. These scales are used more in academia and research. Let's understand each of them with some examples.
1. Nominal
Nominal scales are used for labeling variables without any quantitative value. The scales are generally referred to as labels; they are mutually exclusive and do not carry any numerical importance. Let's see some examples:
• What is your gender?
• Male
• Female
• I prefer not to answer
• Other
Other examples include the following:
• The languages that are spoken in a particular country
• Biological species
• Parts of speech in grammar (noun, pronoun,
adjective, and so on)
• Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)
• Nominal scales are considered qualitative scales
and the measurements that are taken using
qualitative scales are considered qualitative data.
2. Ordinal
• The main difference between the ordinal and nominal scales is the order. In ordinal scales, the order of the values is a significant factor.
• Let's check an example of an ordinal scale using the Likert scale: "WordPress is making content managers' lives easier." How do you feel about this statement?
• The answer to this question is scaled down to five different ordinal values: Strongly Agree, Agree, Neutral, Disagree, and Strongly Disagree.
• Scales like these are referred to as Likert scales; similar five-point scales are commonly used for questions about satisfaction, importance, and frequency.
To make it easier, consider ordinal scales as an order of ranking (1st,
2nd, 3rd, 4th, and so on). The median item is allowed as the measure
of central tendency; however, the average is not permitted.
3. Interval
• In interval scales, both the order and the exact differences between the values are significant.
• Interval scales are widely used in statistics, for
example, in the measure of central tendencies—
mean, median, mode, and standard deviations.
• Examples include location in Cartesian coordinates
and direction measured in degrees from magnetic
north. The mean, median, and mode are allowed on
interval data.
4. Ratio
• Ratio scales contain order, exact values, and absolute zero,
which makes it
possible to be used in descriptive and inferential statistics.
• These scales provide numerous possibilities for statistical
analysis.
• Mathematical operations, the measure of central
tendencies, and the
measure of dispersion and coefficient of variation
can also be computed from such scales.
• Examples include a measure of energy, mass,
length, duration, electrical energy, plan angle, and
volume. The following table gives a summary of
the data types and scale measures:
Software tools for EDA
There are several software tools available for Exploratory Data Analysis
(EDA) that help in analyzing and visualizing data. Here are some popular
ones:
1. Python Libraries
Pandas: Used for data manipulation and analysis, providing data
structures like DataFrames.
Matplotlib: A plotting library that provides static, animated, and
interactive visualizations.
Seaborn: Built on Matplotlib, it offers a high-level interface for drawing
attractive statistical graphics.
Plotly: An interactive graphing library that makes it easy to create plots,
including line charts, scatter plots, and histograms.
NumPy: Provides support for large, multi-dimensional arrays and
matrices, along with a collection of mathematical functions.
Scikit-learn: Offers tools for data preprocessing, as well as simple
visualizations like feature importance plots.
2. R Libraries
ggplot2: A powerful data visualization package for creating complex
plots.
dplyr: A grammar of data manipulation, providing a consistent set of
verbs for data manipulation.
tidyverse: A collection of R packages designed for data science,
including ggplot2, dplyr, tidyr, and others.
Shiny: Allows you to build interactive web apps straight from R.
3. Standalone Software
Tableau: A leading data visualization tool that allows users to create a
wide variety of charts and dashboards.
Power BI: A business analytics tool by Microsoft that provides
interactive visualizations and business intelligence capabilities.
Excel: A widely-used spreadsheet tool that offers basic to advanced data
analysis and visualization capabilities.
RapidMiner: An advanced data science platform that includes tools for
data prep, machine learning, and EDA.
KNIME: A data analytics, reporting, and integration platform that
supports data blending, EDA, and more.
Orange: A data visualization and analysis tool for both novices and
experts in data science.
4. Big Data Tools
Apache Spark: A big data processing framework that can be used for
large-scale EDA.
Hadoop: A framework that allows for distributed processing of large data
sets across clusters of computers.
5. Interactive Notebooks
Jupyter Notebook: An open-source web application that allows you to
create and share documents that contain live code, equations,
visualizations, and narrative text.
Google Colab: A free Jupyter notebook environment that runs in the
cloud and supports Python code execution.
These tools enable data scientists and analysts to explore data, identify
patterns, and extract meaningful insights, often through visualizations
and statistical summaries.
Introduction to the popular EDA tools (Pandas, NumPy, Matplotlib, Seaborn, and Tableau)
1. Pandas
Pandas is a Python library that makes it easy to work with data,
especially in tables (like spreadsheets).
Advantage:
It helps you clean, organize, and analyze data quickly. You can
filter data, calculate statistics, and handle missing values with just
a few commands.
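A minimal sketch (with a made-up table) of the kind of one-line operations Pandas offers for summarizing, filtering, and handling missing values:

import pandas as pd
import numpy as np

# Hypothetical table with a missing score
df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie'],
                   'score': [85, np.nan, 92]})

print(df.describe())                                   # quick summary statistics
print(df[df['score'] > 80])                            # filter rows
df['score'] = df['score'].fillna(df['score'].mean())   # fill the missing value
print(df)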
2. NumPy
NumPy is a Python library that focuses on numerical operations,
especially with large datasets.
Advantage:
It provides support for working with arrays (lists of numbers) and
matrices, making calculations faster and more efficient. It's the
foundation for many other data science tools.
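A minimal sketch (with made-up numbers) of NumPy's vectorized array operations:

import numpy as np

# Hypothetical daily temperatures in Celsius
temps = np.array([21.5, 23.0, 19.8, 22.4, 24.1])

print(temps.mean())       # average temperature
print(temps.max())        # hottest day
print(temps * 1.8 + 32)   # vectorized conversion to Fahrenheit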
3. Matplotlib
Matplotlib is a plotting library for creating static, 2D graphs and
charts in Python.
Advantage:
It allows you to create a wide variety of visualizations, like line
charts, bar graphs, and scatter plots, to help you understand and
present your data.
4. Seaborn
Seaborn is built on top of Matplotlib and provides a higher-level
interface for making attractive and informative statistical graphics.
Advantage:
It simplifies the process of creating complex visualizations, like
heatmaps and violin plots, making it easier to explore
relationships between different parts of your data.
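A minimal sketch (with a made-up numeric table) of one such Seaborn visualization, a correlation heatmap:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric dataset
df = pd.DataFrame({'height': [150, 160, 170, 180, 190],
                   'weight': [50, 60, 65, 80, 90],
                   'age': [20, 25, 30, 35, 40]})

# Heatmap of pairwise correlations between the numeric columns
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation heatmap')
plt.show()

# A violin plot is just as short, for example: sns.violinplot(y=df['weight'])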
5. Tableau
Tableau is a powerful data visualization tool that allows you to
create interactive and shareable dashboards.
Advantage:
It is user-friendly and doesn't require coding. You can connect to
different data sources, drag and drop to create visuals, and explore
your data through interactive charts and graphs.
Visual Aids for EDA
1. Line chart
• We will use the matplotlib library and the stock price data to
plot time series lines. First of all, let's understand the dataset.
• We have created a function using the radar Python library to generate random dates and prices for the dataset.
• It is the simplest possible dataset, with just two columns. The first column is Date and the second column is Price, indicating the stock price on that date.
• Let's generate the dataset by calling the helper method and save it as a CSV file. We can optionally load the CSV file using the pandas read_csv function and proceed with visualization (a sketch of these steps follows the explanation below). The generateData function is defined here:
Example Source code:
import datetime
import random
import pandas as pd
import radar

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    for _ in range(n):
        # Generate a random date within the specified range
        date = radar.random_datetime(start=start, stop=end).strftime("%Y-%m-%d")
        # Generate a random price between 900 and 1000
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    # Create a DataFrame from the list of data
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    # Convert the 'Date' column to datetime format
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Group the data by date and take the mean of the prices
    df = df.groupby(by='Date').mean()
    return df
Explanation:
Imports: The necessary libraries are imported at the beginning.
generateData function:
o The function takes an integer n as input, which determines the number
of data points to generate.
o It initializes a list listdata to store the generated data.
o A random date is generated using radar.random_datetime, and a
random price is generated between 900 and 1000.
o The data is appended to the listdata list.
o The data is converted to a Pandas DataFrame.
o The 'Date' column is converted to a datetime object.
o The data is grouped by date, and the average price for each date is
computed.
Output: The function returns a DataFrame indexed by date, with the mean price for each date.
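As a sketch of the remaining steps, the code below calls generateData, saves the result to a CSV file (the filename is arbitrary), and draws the stock price as a time-series line chart:

import matplotlib.pyplot as plt

df = generateData(50)            # generate 50 random stock-price records
df.to_csv('stock_prices.csv')    # save the dataset for later use
# Optionally reload it before plotting:
# df = pd.read_csv('stock_prices.csv', parse_dates=['Date'], index_col='Date')

plt.plot(df.index, df['Price'])
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.title('Stock price over time')
plt.show()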
Steps involved in plotting a simple line chart:
Sample Source Code:
import matplotlib.pyplot as plt
# Sample data
x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 7, 11]
# Create the plot
plt.plot(x, y, marker='o')
# Add title and labels
plt.title("Simple Line Graph")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
# Show the plot
plt.show()
EXPLANATION:
matplotlib.pyplot: This library is used to create static, animated, and
interactive visualizations in Python.
Data: The x and y lists represent the data points to be plotted on the
graph.
plt.plot(): This function plots the line graph. The marker='o' argument
adds markers at each data point.
plt.title(), plt.xlabel(), plt.ylabel(): These functions add a title and
labels to the x and y axes.
plt.show(): This function displays the graph.
OUTPUT: (a simple line graph of the five sample points, with circular markers)
2. Bar charts
• This is one of the most common types of visualization that
almost everyone must have encountered. Bars can be drawn
horizontally or vertically to represent categorical variables.
• Bar charts are frequently used to distinguish objects between
distinct collections in order to track variations over time.
• In most cases, bar charts are very convenient when the changes are
large.
• In order to learn about bar charts, we can use the calendar Python library to keep track of the months of the year (1 to 12), corresponding to January to December:
Example Source Code:
import numpy as np
import calendar
import matplotlib.pyplot as plt
import random

# Step 2: Set up the data
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

# Step 3: Specify the layout of the figure and allocate space
figure, axis = plt.subplots()

# Step 4: Display the names of the months on the x-axis
plt.xticks(months, calendar.month_name[1:13], rotation=20)

# Step 5: Plot the graph
plot = axis.bar(months, sold_quantity)

# Step 6: Display data values on the head of the bars (optional)
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')

# Step 7: Display the graph on the screen
plt.show()
EXPLANATION
1. Imports:
o numpy, calendar, and matplotlib.pyplot are imported along with
random.
2. Set up Data:
o months is a list of integers from 1 to 12.
o sold_quantity generates random integers between 100 and 200 for
each month.
3. Figure and Axis Layout:
o figure, axis = plt.subplots() creates a figure and an axis for plotting.
4. Customize X-axis:
o plt.xticks() sets custom tick labels on the x-axis to display month
names.
5. Plot the Graph:
o axis.bar() creates a bar chart using the data.
6. Annotate Bars (Optional):
o A loop iterates through the bars and places the height value on top of
each bar for clarity.
7. Display the Graph:
o plt.show() renders the graph.
Running this code will display a bar chart showing the number of items sold for
each month, with the month names on the x-axis and the sold quantity on the y-
axis. The numbers on top of the bars represent the exact quantities sold.
OUTPUT: (a bar chart of the sold quantity per month, with the exact quantity printed above each bar)
3. Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter diagrams. They use a Cartesian coordinate system to display values of typically two variables for a set of data.
When should we use a scatter plot? Scatter plots can be
constructed in the following two situations:
• When one continuous variable is dependent on another
variable, which is under the control of the observer
• When both continuous variables are independent
There are two important concepts—independent variable and
dependent variable.
• In statistical modeling or mathematical modeling, the values of
dependent variables rely on the values of independent
variables.
• The dependent variable is the outcome variable being studied.
• The independent variables are also referred to as regressors.
• Scatter plots are used when we need to show the relationship between two variables, and hence they are sometimes referred to as correlation plots.
Example Source Code:
import matplotlib.pyplot as plt
import pandas as pd

# Example data creation
data = {
    # Age in months
    'age': [0, 6, 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168, 180],
    # Minimum recommended hours of sleep
    'min_recommended': [14, 14, 14, 13, 12, 12, 11, 11, 10, 10, 10, 9, 9, 8, 8, 7, 7],
    # Maximum recommended hours of sleep
    'max_recommended': [17, 17, 17, 15, 14, 14, 13, 13, 12, 12, 12, 11, 11, 10, 10, 9, 9]
}

# Creating a DataFrame
sleepDf = pd.DataFrame(data)

# Scatter plot for minimum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
            color='green', label='Min Recommended')

# Scatter plot for maximum recommended sleep hours vs age
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
            color='red', label='Max Recommended')

# Labeling the x-axis (age in years)
plt.xlabel('Age of person in Years')

# Labeling the y-axis (total hours of sleep required)
plt.ylabel('Total hours of sleep required')

# Adding a title to the plot
plt.title('Recommended Sleep Hours by Age')

# Adding a legend to distinguish between the points
plt.legend()

# Display the plot
plt.show()
Explanation:
1. Data Creation:
The data dictionary holds age in months and the corresponding min_recommended and max_recommended sleep hours.
2. Creating the Scatter Plots:
plt.scatter(sleepDf['age'] / 12., sleepDf['min_recommended'],
color='green', label='Min Recommended'):
Creates a scatter plot of minimum recommended sleep hours (dependent
variable) against age in years (independent variable).
The points are colored green.
plt.scatter(sleepDf['age'] / 12., sleepDf['max_recommended'],
color='red', label='Max Recommended'):
Creates a scatter plot of maximum recommended sleep hours
(dependent variable) against age in years (independent variable).
The points are colored red.
3. Labeling and Titles:
The x-axis is labeled "Age of person in Years" and the y-axis is
labeled "Total hours of sleep required."
The plot is titled "Recommended Sleep Hours by Age."
4. Legend:
plt.legend() adds a legend to differentiate between the minimum and
maximum recommended sleep hours.
5. Display:
plt.show() renders the scatter plot.
Result:
This code will generate a scatter plot with two sets of points:
Green points represent the minimum recommended sleep hours for
each age.
Red points represent the maximum recommended sleep hours for
each age.
The x-axis represents the age in years, and the y-axis represents the
total hours of sleep required.
This type of scatter plot is useful for observing how the recommended
sleep hours vary as a person ages, making it easy to compare the minimum
and maximum recommendations.
OUTPUT: (a scatter plot of minimum and maximum recommended sleep hours versus age in years)
Data Transformation Techniques
Data Transformation
Data transformation is a set of techniques used to convert data from
one format or structure to another format or structure. The following are
some examples of transformation activities:
Data deduplication involves the identification of duplicates and their removal.
Key restructuring involves transforming any keys with built-in meanings into generic keys.
Data cleansing involves deleting out-of-date, inaccurate, and incomplete information from the source data without changing its meaning, in order to enhance its accuracy.
Data validation is a process of formulating rules or algorithms that help in validating different types of data against some known issues.
Format revisioning involves converting from one format to another.
Data derivation consists of creating a set of rules to generate more information from the data source.
Data aggregation involves searching, extracting, summarizing, and preserving important information in different types of reporting systems.
Data integration involves converting different data types and merging them into a common structure or schema.
Data filtering involves identifying information relevant to any particular user.
Data joining involves establishing a relationship between two or more tables.
The main reason for transforming the data is to get a better representation
such that the transformed data is compatible with other data.
In addition to this, interoperability in a system can be achieved by following
a common data structure and format.
1. Data Deduplication
Explanation:
Data deduplication involves identifying and removing duplicate entries in a dataset.
Sample Source Code:
import pandas as pd
# Create a DataFrame with duplicate rows
data = {'ID': [1, 2, 2, 4, 5],
'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 30, 35, 40]}
df = pd.DataFrame(data)
# Drop duplicate rows
df_dedup = df.drop_duplicates()
print("DataFrame after deduplication:")
print(df_dedup)
OUTPUT:
DataFrame after deduplication:
ID Name Age
0 1 Alice 25
1 2 Bob 30
3 4 Charlie 35
4 5 David 40
2. Key Restructuring
Explanation:
Transforming keys with specific meanings to generic keys.
Example Python Program:
import pandas as pd
# Create a DataFrame with meaningful keys
data = {'EmployeeID': [1, 2, 3],
'EmployeeName': ['Alice', 'Bob', 'Charlie'],
'EmployeeAge': [25, 30, 35]}
df = pd.DataFrame(data)
# Rename columns to generic keys
df_restructured = df.rename(columns={'EmployeeID': 'ID',
'EmployeeName': 'Name', 'EmployeeAge': 'Age'})
print("DataFrame after key restructuring:")
print(df_restructured)
OUTPUT:
DataFrame after key restructuring:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
3. Data Cleansing
Explanation: Removing inaccurate or incomplete information.
Example Python Program:
import pandas as pd
# Create a DataFrame with some incorrect data
data = {'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 35, 40]}
df = pd.DataFrame(data)
# Drop rows with missing values
df_cleaned = df.dropna()
print("DataFrame after data cleansing:")
print(df_cleaned)
OUTPUT:
DataFrame after data cleansing:
Name Age
0 Alice 25.0
2 Charlie 35.0
4. Data Validation
Explanation: Validating data against certain rules.
Example Python Program:
import pandas as pd
# Create a DataFrame
data = {'Age': [25, 30, -5, 35]} # -5 is an invalid age
df = pd.DataFrame(data)
# Validate ages (must be non-negative)
df_validated = df[df['Age'] >= 0]
print("DataFrame after data validation:")
print(df_validated)
OUTPUT:
DataFrame after data validation:
Age
0 25
1 30
3 35
5. Format Revisioning
Explanation: Converting data from one format to another.
Example Source Code:
import pandas as pd
# Create a DataFrame
data = {'Date': ['2024-01-01', '2024-02-01']}
df = pd.DataFrame(data)
# Convert Date column from string to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("DataFrame after format revisioning:")
print(df)
OUTPUT:
DataFrame after format revisioning:
Date
0 2024-01-01
1 2024-02-01
6. Data Derivation
Explanation: Creating new information from existing data.
Example Source code:
import pandas as pd
# Create a DataFrame
data = {'Sales': [100, 200, 300]}
df = pd.DataFrame(data)
# Derive a new column with a 10% increase in sales
df['SalesIncrease'] = df['Sales'] * 1.10
print("DataFrame after data derivation:")
print(df)
OUTPUT:
DataFrame after data derivation:
Sales SalesIncrease
0 100 110.0
1 200 220.0
2 300 330.0
7. Data Aggregation
Explanation: Summarizing data, e.g., calculating total or average.
Example Python Program:
import pandas as pd
# Create a DataFrame
data = {'Category': ['A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 250]}
df = pd.DataFrame(data)
# Aggregate data by category
df_aggregated = df.groupby('Category').sum()
print("DataFrame after data aggregation:")
print(df_aggregated)
Output:
DataFrame after data aggregation:
Sales
Category
A 250
B 450
8. Data Integration
Explanation: Combining different datasets into one.
Example Python Program:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 3], 'Age': [25, 30, 35]})
# Merge DataFrames on 'ID'
df_integrated = pd.merge(df1, df2, on='ID')
print("DataFrame after data integration:")
print(df_integrated)
Output:
DataFrame after data integration:
ID Name Age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35
9. Data Filtering
Explanation: Extracting specific information based on conditions.
Example Python Program:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
# Filter data for people older than 30
df_filtered = df[df['Age'] > 30]
print("DataFrame after data filtering:")
print(df_filtered)
Output:
DataFrame after data filtering:
Name Age
2 Charlie 35
3 David 40
10. Data Joining
Explanation: Establishing relationships between tables based on common
keys.
Example Python Program:
import pandas as pd
# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 45]})
# Perform a join operation (left join)
df_joined = pd.merge(df1, df2, on='ID', how='left')
print("DataFrame after data joining:")
print(df_joined)
Output:
DataFrame after data joining:
ID Name Age
0 1 Alice 25.0
1 2 Bob 30.0
2 3 Charlie NaN
In this example, the left join includes all records from df1 and matches
records from df2 based on the ID column. Charlie does not have a
corresponding ID in df2, so Age is NaN.
These examples demonstrate various data transformation techniques and
how they can be implemented using Python and pandas.
Benefits of data transformation
• Data transformation promotes interoperability between several
applications. The main reason for creating a similar format and
structure in the dataset is that it becomes compatible with other
systems.
• Comprehensibility for both humans and computers is improved
when using better-organized data compared to messier data.
• Data transformation ensures a higher degree of data quality and protects applications from several computational challenges such as null values, unexpected duplicates, and incorrect indexing, as well as incompatible structures or formats.
• Data transformation ensures higher performance and scalability for modern analytical databases and dataframes.
Challenges
The process of data transformation can be challenging for several reasons:
• It requires a qualified team of experts and state-of-the-art infrastructure. The cost of acquiring such experts and infrastructure can increase the cost of the operation.
• Data cleaning is required before data transformation and data migration, and this cleansing process can be expensive and time-consuming.
• Generally, the activities of data transformation involve batch processing. This means that sometimes we might have to wait for a day before the next batch of data is ready for cleansing. This can be very slow.
Grouping and aggregation
Grouping and aggregation are powerful techniques for summarizing and visualizing data.
These techniques help you analyze data by grouping it into categories and computing
aggregate statistics for each category. In Python, the pandas library is commonly used for
these operations, and matplotlib or seaborn is used for visualization.
Here's a detailed guide on how to perform grouping and aggregation in data visualization
using Python:
Example Dataset
Let's start with a sample dataset to demonstrate these techniques. We'll use a DataFrame
with sales data across different regions and products.
import pandas as pd
# Sample data
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [200, 150, 300, 250, 180, 210, 320, 270],
    'Quantity': [20, 15, 30, 25, 18, 21, 32, 27]
}
df = pd.DataFrame(data)
1. Grouping and Aggregation
Group by Region and Aggregate Sales
# Group by 'Region' and aggregate sales (sum and average)
grouped_df = df.groupby('Region').agg({
'Sales': ['sum', 'mean'],
'Quantity': ['sum', 'mean']
})
print("Grouped and Aggregated DataFrame:")
print(grouped_df)
Output:
Grouped and Aggregated DataFrame:
Sales Quantity
sum mean sum mean
Region
East 620 310.0 62 31.0
North 380 190.0 38 19.0
South 360 180.0 36 18.0
West 520 260.0 52 26.0
2. Visualization with Matplotlib
Bar Plot for Total Sales by Region
import matplotlib.pyplot as plt
# Group by 'Region' and aggregate sales (sum)
sales_by_region = df.groupby('Region')['Sales'].sum()
# Plot
sales_by_region.plot(kind='bar', color='skyblue')
plt.title('Total Sales by Region')
plt.xlabel('Region')
plt.ylabel('Total Sales')
plt.show()
Explanation: This creates a bar plot showing the total sales for each region.
Bar Plot for Average Sales and Quantity by Product
# Group by 'Product' and aggregate sales and quantity (mean)
agg_by_product = df.groupby('Product').agg({
'Sales': 'mean',
'Quantity': 'mean'
})
# Plot
agg_by_product.plot(kind='bar')
plt.title('Average Sales and Quantity by Product')
plt.xlabel('Product')
plt.ylabel('Average Value')
plt.show()
Explanation: This creates a bar plot for the average sales and quantity of each product.
3. Visualization with Seaborn
Seaborn provides a high-level interface for creating attractive and informative statistical
graphics.
Sample Source code:
#Box Plot for Sales Distribution by Region
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot of the sales distribution for each region (uses the df created above)
sns.boxplot(x='Region', y='Sales', data=df)
plt.title('Sales Distribution by Region')
plt.xlabel('Region')
plt.ylabel('Sales')
plt.show()
Explanation: This creates a box plot that shows the distribution of sales across different
regions, highlighting the spread and outliers.
Grouping and Aggregation: Use pandas to group data and compute aggregate
statistics like sum, mean, etc.
Visualization with Matplotlib and Seaborn: Use matplotlib and seaborn to create
visualizations that make the aggregated data easier to interpret.
These techniques are essential for understanding patterns and trends in your data,
enabling better decision-making based on summarized information.