CCS346 - EXPLORATORY DATA ANALYSIS
UNIT II EDA USING PYTHON
Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating on
Data – Handling Missing Data – Hierarchical Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping – Pivot Tables – Vectorized String Operations.
COURSE OBJECTIVE:
To implement data manipulation using Pandas.
COURSE OUTCOME:
CO2: Implement data manipulation using Pandas
Data Manipulation using Pandas
Pandas - package built on top of NumPy
- provides an efficient implementation of a DataFrame
DataFrames - multidimensional arrays with attached row and column labels
- supports heterogeneous types and/or missing data
Pandas objects - enhanced versions of NumPy structured arrays
- rows and columns are identified with labels rather than simple integer indices
Three fundamental Pandas data structures: Series, DataFrame, and Index
Pandas Objects
1. The Pandas Series Object
Pandas Series - one-dimensional array of indexed data
- can be created from a list or array
- can be accessed with the values and index attributes
- can also be accessed using index via square-bracket notation
Series as generalized NumPy array
Numpy Array - has an implicitly defined integer index used to access the values
Pandas Series - has an explicitly defined index associated with the values
index need not be an integer
non-contiguous or non-sequential indices
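Eg (a minimal sketch; the values and index labels below are illustrative placeholders):
import pandas as pd

# Series from a list: values plus an implicit integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data.values)   # underlying NumPy array
print(data.index)    # RangeIndex(start=0, stop=4, step=1)
print(data[1])       # square-bracket access via the index

# Explicitly defined, non-integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['b'])

# Non-contiguous / non-sequential indices also work
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data[5])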
Series as specialized dictionary
dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series
is a structure which maps typed keys to a set of typed values.
Series will be created where the index is drawn from the keys (sorted in older Pandas versions; newer versions preserve the dictionary's insertion order)
Series also supports array-style operations such as slicing
Constructing Series objects
Constructing a Pandas Series from scratch
Index is an optional argument
Data can be a list or NumPy array - index defaults to an integer sequence
Data can be a scalar, which is repeated to fill the specified index
Data can be a dictionary - index defaults to the dictionary keys (sorted in older Pandas versions)
Index can be explicitly set
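Eg (a brief sketch of the construction variants listed above; the data values are placeholders):
import pandas as pd

# Data as a list or array: index defaults to an integer sequence
pd.Series([2, 4, 6])

# Data as a scalar: repeated to fill the specified index
pd.Series(5, index=[100, 200, 300])

# Data as a dictionary: index is drawn from the keys
pd.Series({2: 'a', 1: 'b', 3: 'c'})

# Index set explicitly: only the listed keys are kept, in the given order
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])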
2. The Pandas DataFrame Object
DataFrame as a generalized NumPy array
DataFrame - is a two-dimensional array with both flexible row indices and flexible column
names
columns attribute - column labels
index attribute - gives access to the index labels
DataFrame as specialized dictionary
DataFrame maps a column name to a Series of column data
Constructing DataFrame objects
From a single Series object
From a list of dicts
From a dictionary of Series objects
From a two-dimensional NumPy array
From a NumPy structured array
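Eg (a sketch of the listed construction routes; the names and numbers are made-up placeholders):
import numpy as np
import pandas as pd

population = pd.Series({'Ohio': 100, 'Texas': 200})   # placeholder values
area = pd.Series({'Ohio': 10, 'Texas': 20})

# From a single Series object
pd.DataFrame(population, columns=['population'])

# From a list of dicts (missing keys are filled with NaN)
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# From a dictionary of Series objects
states = pd.DataFrame({'population': population, 'area': area})
print(states.columns)   # column labels
print(states.index)     # row (index) labels

# From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

# From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)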
3. The Pandas Index Object
- can be thought of as an immutable array or as an ordered set
One difference between Index objects and NumPy arrays is that indices are immutable–that
is, they cannot be modified via the normal means
Index as ordered set
- unions, intersections, differences, and other combinations can be computed
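Eg (a small sketch of Index immutability and set operations):
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))          # elements common to both
print(indA.union(indB))                 # all elements from either index
print(indA.symmetric_difference(indB))  # elements in exactly one of the two

# Index objects are immutable: uncommenting the next line raises a TypeError
# indA[1] = 0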
Data Indexing and Selection
Indexing, slicing, masking, fancy indexing and combinations
Data Selection in Series
Series as dictionary
Series as one-dimensional array
Indexers: loc, iloc, and ix
loc attribute - indexing and slicing that always refer to the explicit index
iloc attribute - indexing and slicing that always refer to the implicit Python-style integer index
ix - a hybrid of the two (deprecated and removed in newer Pandas versions)
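Eg (a sketch contrasting loc and iloc on a Series with an explicit integer index):
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: always refers to the explicit index labels
print(data.loc[1])     # 'a'
print(data.loc[1:3])   # labels 1 through 3 (inclusive)

# iloc: always refers to the implicit integer positions
print(data.iloc[1])    # 'b'
print(data.iloc[1:3])  # positions 1 and 2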
Data Selection in DataFrame
attribute-style access with column names
If column names are not strings, or if the column names conflict with methods of the DataFrame
– attribute-style access is not possible. Eg: the DataFrame pop() method shadows a column named 'pop'
DataFrame as two-dimensional array
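Eg (a sketch of DataFrame selection; the column names area/pop/density are placeholders, with 'pop' chosen to show the clash with the pop() method):
import pandas as pd

df = pd.DataFrame({'area': [100, 200, 300],
                   'pop': [10, 40, 90]},
                  index=['x', 'y', 'z'])

print(df['area'])   # dictionary-style column access
print(df.area)      # attribute-style access (string column names only)
print(df.pop)       # NOT the 'pop' column: it resolves to the pop() method

df['density'] = df['pop'] / df['area']   # add a derived column

# DataFrame as a two-dimensional array
print(df.values)              # raw underlying array
print(df.iloc[:2, :2])        # implicit integer indexing
print(df.loc[:'y', :'pop'])   # explicit label indexing
print(df.loc[df.density > 0.2, ['pop', 'density']])   # masking + fancy indexing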
Operating on Data
Performing element-wise operations
basic arithmetic - addition, subtraction, multiplication
sophisticated operations - trigonometric functions, exponential and logarithmic functions
Pandas inherits much of this functionality from NumPy via its ufuncs
Ufuncs: Index Preservation
all NumPy ufuncs work on Pandas Series and DataFrame objects and preserve index labels in the output
Index alignment in Series
For binary operations - Pandas will align indices
Index alignment in DataFrame
Ufuncs: Operations Between DataFrame and Series
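Eg (a sketch of index preservation and alignment; all values are placeholders):
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(np.exp(ser))   # ufuncs preserve index labels in the output

# Index alignment in Series: the result index is the union of the inputs,
# with NaN at labels that appear in only one of them
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)
print(A.add(B, fill_value=0))   # fill missing entries before adding

# Index alignment in DataFrame: aligned on both columns and rows
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [10, 20, 30], 'C': [40, 50, 60]})
print(df1 + df2)

# Operations between DataFrame and Series align on columns (row-wise)
print(df1 - df1.iloc[0])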
Handling Missing Data
• real-world data is rarely clean and homogeneous
• different data sources may indicate missing data in different ways - null, NaN, or NA
Trade-Offs in Missing Data Conventions
➢ number of schemes have been developed
➢ two strategies:
❑ mask
o Globally indicate missing values
o A separate Boolean array, or one bit in the data representation, marks each null value
o Requires allocation of an additional Boolean array - adds overhead in
both storage and computation
❑ sentinel value
o Indicates a missing entry
o Data-specific convention, such as -9999 or some rare bit pattern
o Reduces the range of valid values that can be represented
➢ Missing Data in Pandas
• use sentinels for missing data
• special floating-point NaN value
• Python None object
➢ None: Pythonic missing data
• Used only in arrays with data type 'object'
• Performing aggregations such as sum() or min() on an array with a None value results in an error
NaN: Missing numerical data
it is a special floating-point value
result of arithmetic with NaN will be another NaN
Aggregates over the values are well defined (i.e., they don't result in an error) but not always
useful
NaN and None in Pandas
Pandas automatically converts the None to a NaN value
Operating on Null Values
Detecting null values
Dropping null values
We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
By default, dropna() will drop all rows in which any null value is present
axis=1 drops all columns containing a null value
Filling null values
fill NA entries with a single value, such as zero:
forward-fill to propagate the previous value forward
if a previous value is not available during a forward fill, the NA value remains.
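Eg (a sketch of detecting, dropping, and filling nulls on placeholder data):
import numpy as np
import pandas as pd

# None is converted to NaN when placed in a numeric Series
data = pd.Series([1, np.nan, 2, None])

# Detecting null values
print(data.isnull())          # Boolean mask of missing entries
print(data[data.notnull()])   # keep only the non-null values

# Dropping null values (full rows or full columns only)
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
print(df.dropna())         # drop every row containing a null value
print(df.dropna(axis=1))   # drop every column containing a null value

# Filling null values
print(data.fillna(0))              # fill NA entries with a single value
print(df.fillna(method='ffill'))   # forward-fill (df.ffill() in newer Pandas);
                                   # a leading NA with no previous value remains NA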
Hierarchical Indexing
❑ Multi-indexing
❑ Store higher-dimensional data – data indexed by more than one or two keys
❑ Incorporate multiple index levels within a single index
❑ Higher-dimensional data can be represented within the 1D Series and 2D DataFrame
objects
A Multiply Indexed Series
❑ represent 2D data within a 1D Series
The bad way
❑ use Python tuples as keys to index or slice the series
❑ selecting all values from, say, 2010 then requires messy (and potentially slow) munging
The Better Way: Pandas MultiIndex
create a multi-index from the tuples
MultiIndex contains multiple levels of indexing
❑ Some entries are missing in the first column
❑ Blank entry indicates the same value as the line above it
Access all data for which the second index is 2010
MultiIndex as extra dimension
Each extra level in a multi-index represents an extra dimension of data
unstack() method - convert a multiply indexed Series into a DataFrame
stack() method provides the opposite operation:
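Eg (a sketch of a multiply indexed Series; the state/year labels and values are placeholders):
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010)])
pop = pd.Series([100, 110, 200, 210], index=index)

# Access all data for which the second index is 2010
print(pop[:, 2010])

# unstack(): multiply indexed Series -> DataFrame
pop_df = pop.unstack()
print(pop_df)

# stack(): the opposite operation, recovering the original Series
print(pop_df.stack())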
Methods of MultiIndex Creation
1. Pass a list of two or more index arrays to the constructor
2. Pass a dictionary with appropriate tuples as keys
Explicit MultiIndex constructors
from a simple list of arrays
from a list of tuples
from a Cartesian product of single indices
directly from its internal encoding, by passing levels and labels (called codes in newer Pandas versions)
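Eg (a sketch of the explicit constructors; newer Pandas versions name the second argument of the direct constructor codes rather than labels):
import pandas as pd

# From a simple list of arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# From a list of tuples
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# From a Cartesian product of single indices
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Directly from the internal encoding (levels plus codes/labels)
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])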
MultiIndex level names
MultiIndex for columns
four-dimensional data, where the dimensions are the subject, the measurement type, the year,
and the visit number
Indexing and Slicing a MultiIndex
Multiply indexed Series
access single elements by indexing with multiple terms
partial indexing, or indexing just one of the levels in the index
Partial slicing is available as well, as long as the MultiIndex is sorted
With sorted indices, partial indexing can be performed on lower
levels by passing an empty slice in the first index
Selection based on Boolean masks
Selection based on fancy indexing
Multiply indexed DataFrames
Recover Guido's heart rate
Using loc, iloc, and ix indexers
Each individual index in loc or iloc can be passed as a
tuple of multiple indices
creating a slice within a tuple leads to a syntax error; use Pandas' IndexSlice object instead
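Eg (a sketch of slicing a multiply indexed DataFrame, loosely following the subject/type/year/visit layout described above; the data is random placeholder values):
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.random.rand(4, 4), index=index, columns=columns)

# Recover Guido's heart rate data
print(health_data['Guido', 'HR'])

# Each index in loc/iloc can be a tuple of multiple indices
print(health_data.loc[(2013, 1), ('Bob', 'HR')])

# A slice inside a tuple is a syntax error; use IndexSlice instead
idx = pd.IndexSlice
print(health_data.loc[idx[:, 1], idx[:, 'HR']])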
Rearranging Multi-Indices
1. Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted
a partial slice of an unsorted index results in an error
With the index sorted - partial slicing will work as expected
2. Stacking and unstacking indices
Convert a dataset from a stacked multi-index to a simple two-dimensional representation
The opposite of unstack() is stack() - used to recover the original series
3. Index setting and resetting
Turn the index labels into columns - reset_index method
Build a MultiIndex from the column values – set_index method
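Eg (a sketch of reset_index and set_index on a small multiply indexed Series with placeholder values):
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['char', 'int'])
data = pd.Series([0.1, 0.2, 0.3, 0.4], index=index, name='value')

# reset_index: turn the index labels into ordinary columns
flat = data.reset_index()
print(flat)

# set_index: rebuild a MultiIndex from column values
print(flat.set_index(['char', 'int']))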
Data Aggregations on Multi-Indices
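Eg (a sketch of aggregating over one level of a hierarchical index; the older mean(level=...) shortcut is deprecated in newer Pandas in favor of groupby(level=...)):
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
data = pd.Series([10, 20, 30, 40], index=index)

# Aggregate over one level of the hierarchy
print(data.groupby(level='year').mean())
print(data.groupby(level='visit').sum())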
Combining datasets
a helper function that creates a DataFrame of a particular form is used in the examples that follow
Concat, Append, Merge and Join
Concatenation of NumPy Arrays
1. Simple Concatenation with pd.concat
Concatenate higher-dimensional objects
❑ By default, the concatenation takes place row-wise within the DataFrame
❑ pd.concat allows specification of an axis
Duplicate indices
Pandas concatenation preserves indices, even if the result will have duplicate indices
Ignoring the index
Adding MultiIndex keys
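Eg (a sketch of pd.concat behavior on placeholder frames):
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Row-wise concatenation is the default; indices are preserved,
# so the result has duplicate index values 0 and 1
print(pd.concat([df1, df2]))

# Concatenate along the other axis
print(pd.concat([df1, df2], axis=1))

# Ignoring the index: a fresh integer index is created for the result
print(pd.concat([df1, df2], ignore_index=True))

# Adding MultiIndex keys to label the source of each block
print(pd.concat([df1, df2], keys=['x', 'y']))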
2. Concatenation with joins
data from different sources might have different sets of column names
❑ By default, entries for which no data is available are filled with NA values
❑ To change this - specify join and join_axes parameters
❑ By default, the join is a union of the input columns (join='outer'), can be changed to
intersection - using join='inner'
Use the join_axes argument to directly specify the index (removed in newer Pandas versions; reindex the result instead)
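Eg (a sketch of concatenation with joins on placeholder frames; join_axes appears only as a comment because it has been removed from current Pandas):
import pandas as pd

df5 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df6 = pd.DataFrame({'B': [7, 8], 'C': [9, 10], 'D': [11, 12]})

# Default: join='outer' -> union of columns, missing entries become NaN
print(pd.concat([df5, df6]))

# join='inner' -> intersection of columns (A and D are dropped)
print(pd.concat([df5, df6], join='inner'))

# Older Pandas: pd.concat([df5, df6], join_axes=[df5.columns])
# Current Pandas: reindex the result to keep only df5's columns
print(pd.concat([df5, df6]).reindex(columns=df5.columns))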
3. The append() method
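append() was a shortcut around pd.concat; it was deprecated and removed in Pandas 2.x, so the sketch below shows the pd.concat equivalent:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Older Pandas: df1.append(df2)
# append() did not modify df1 in place; it built a new object on every call,
# so pd.concat over a list of frames is both the current API and more efficient
print(pd.concat([df1, df2], ignore_index=True))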
II. Merge and Join
high-performance, in-memory join and merge operations
Categories of Joins
1. One-to-one joins
❑ simplest type of merge
❑ very similar to the column-wise concatenation
❑ pd.merge() – use common column as a key
❑ The result of the merge is a new DataFrame
❑ Order of entries in each column is not maintained
❑ Merge in general discards the index
2. Many-to-one joins preserve duplicate entries as appropriate
3. Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a many-to-
many merge
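Eg (a sketch of the three join categories; the employee/group/supervisor data is made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                    'hire_date': [2004, 2008, 2012]})

# One-to-one join on the common 'employee' column
df3 = pd.merge(df1, df2)
print(df3)

# Many-to-one: 'group' repeats in df3, so supervisor entries are duplicated as needed
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                    'supervisor': ['Carly', 'Guido']})
print(pd.merge(df3, df4))

# Many-to-many: the key column has duplicates on both sides
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering'],
                    'skills': ['math', 'spreadsheets', 'coding']})
print(pd.merge(df1, df5))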
Specification of the Merge Key
1. The on keyword
This option works only if both the left and right DataFrames have the specified column name.
2. The left_on and right_on keywords
merge two datasets with different column names
The result has a redundant column that we can drop if desired
3. The left_index and right_index keywords
merge on an index
DataFrames implement the join() method - merge on indices
mix indices and columns
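Eg (a sketch of the merge-key options on placeholder data; the salary figures are arbitrary):
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake'], 'group': ['Accounting', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Bob', 'Jake'], 'hire_date': [2008, 2012]})
df3 = pd.DataFrame({'name': ['Bob', 'Jake'], 'salary': [70000, 80000]})

# The on keyword: both DataFrames must have the named column
print(pd.merge(df1, df2, on='employee'))

# left_on / right_on: key columns have different names;
# the redundant 'name' column can then be dropped
print(pd.merge(df1, df3, left_on='employee', right_on='name').drop('name', axis=1))

# left_index / right_index: merge on the index instead of a column
df1a, df2a = df1.set_index('employee'), df2.set_index('employee')
print(pd.merge(df1a, df2a, left_index=True, right_index=True))

# join(): merges on indices by default; indices and columns can also be mixed
print(df1a.join(df2a))
print(pd.merge(df1a, df3, left_index=True, right_on='name'))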
Specifying Set Arithmetic for Joins
inner join - the how keyword defaults to 'inner'
result contains the intersection of the two sets of inputs
❑ other options for how are 'outer', 'left', and 'right'
❑ outer join - returns a join over the union of the input columns, and fills in all missing
values with NAs
The left join and right join return joins over the left entries and right entries, respectively
Overlapping Column Names: The suffixes Keyword
two input DataFrames have conflicting column names
❑ merge function automatically appends a suffix _x or _y to make the output columns
unique.
❑ It is possible to specify a custom suffix using the suffixes keyword
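Eg (a sketch of the how and suffixes keywords on placeholder data):
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'], 'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'], 'drink': ['wine', 'beer']})

print(pd.merge(df6, df7, how='inner'))   # intersection of keys (default)
print(pd.merge(df6, df7, how='outer'))   # union of keys, NaN where data is missing
print(pd.merge(df6, df7, how='left'))    # keep every left-hand entry
print(pd.merge(df6, df7, how='right'))   # keep every right-hand entry

# Overlapping column names: _x / _y suffixes are added automatically
df8 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [3, 1]})
print(pd.merge(df8, df9, on='name'))                          # rank_x, rank_y
print(pd.merge(df8, df9, on='name', suffixes=['_L', '_R']))   # custom suffixes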
Aggregation and grouping
Efficient summarization: computing aggregations like sum(), mean(), median(), min(), and
max(), in which a single number gives insight into the nature of a potentially large dataset
Simple Aggregation in Pandas
for a Pandas Series the aggregates return a single value
For a DataFrame, by default the aggregates return results within each column
describe() - computes several common aggregates for each column and returns the result.
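Eg (a sketch of simple aggregation on random placeholder data):
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
print(ser.sum(), ser.mean())     # a Series aggregate returns a single value

df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
print(df.mean())                 # DataFrame aggregates return one value per column
print(df.mean(axis='columns'))   # aggregate within each row instead
print(df.describe())             # several common aggregates for each column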
GroupBy: Split, Apply, Combine
aggregate conditionally on some label or index - implemented by groupby operation
1. Split, apply, combine
❑ The split step involves breaking up and grouping a DataFrame depending on the value of
the specified key.
❑ The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
❑ The combine step merges the results of these operations into an output array.
❑ Does not return DataFrames
❑ Returns a DataFrameGroupBy object
❑ Does no actual computation until the aggregation is applied - "lazy evaluation"
❑ To produce a result - apply an aggregate to the DataFrameGroupBy object
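Eg (a minimal sketch of split-apply-combine on a placeholder key/data frame):
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# groupby() returns a lazy DataFrameGroupBy object, not a DataFrame
grouped = df.groupby('key')
print(grouped)

# Split-apply-combine happens when an aggregate is applied
print(grouped.sum())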
2. The GroupBy object
Aggregate, filter, transform, and apply
pass a dictionary mapping column names to operations to be applied on that column
Aggregation
It can take a string, a function, or a list, and compute all the aggregates at once
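Eg (a sketch of aggregate() with a list and a dictionary of operations, on placeholder data):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# A string, a function, or a list of these can be passed
print(df.groupby('key').aggregate(['min', 'median', 'max']))

# A dictionary mapping column names to operations
print(df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'}))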
Filtering
❑ Allows to drop data based on the group properties.
❑ Eg: All groups in which the standard deviation is larger than some critical value
❑ The filter function - return a Boolean value specifying whether the group passes the
filtering.
Here because group A does not have a standard deviation greater than 4, it is dropped from the
result
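Eg (a sketch of filter(); with the seed used here, group A's data2 standard deviation should fall below the threshold, so it is dropped):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# Keep only the groups whose data2 standard deviation exceeds 4;
# groups that fail the test are removed from the result
print(df.groupby('key').filter(lambda x: x['data2'].std() > 4))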
Transformation
transformation - return transformed version of the full data to recombine
Eg: Center the data by subtracting the group-wise mean
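Eg (a sketch of transform() centering each group, using the same placeholder frame):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# Center the data by subtracting the group-wise mean;
# the result has the same shape as the input
print(df.groupby('key').transform(lambda x: x - x.mean()))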
The apply() method
Lets you apply an arbitrary function to the group results
Eg: Normalizes the first column by the sum of the second
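Eg (a sketch of apply() normalizing the first data column by the group sum of the second; newer Pandas may warn that the grouping column is passed to the function):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

def norm_by_data2(x):
    # x is a DataFrame holding one group's rows
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

print(df.groupby('key').apply(norm_by_data2))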
Pivot Tables
❑ A pivot table is a similar operation that is commonly seen in spreadsheets and other
programs that operate on tabular data.
❑ The pivot table takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
Motivating Pivot Tables
Database of passengers on the Titanic, available through the Seaborn library
Pivot Tables by Hand
Group according to gender, survival status, or some combination
Look at survival by both sex and, say, class.
This type of two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, to handle this type of multidimensional aggregation.
Pivot Table Syntax
More readable than the groupby approach, and produces the same result.
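Eg (a sketch comparing the by-hand GroupBy with pivot_table, assuming the Titanic dataset is loaded through Seaborn as mentioned above):
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset('titanic')

# Pivot table "by hand": a two-dimensional GroupBy
print(titanic.groupby(['sex', 'class'])['survived'].mean().unstack())

# Equivalent, more readable pivot_table call (aggfunc is mean by default)
print(titanic.pivot_table('survived', index='sex', columns='class'))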
1. Multi-level pivot tables
❑ Grouping in pivot tables can be specified with multiple levels, and via a number of
options.
❑ For example: age as a third dimension - bin the age using the pd.cut function
The same strategy can be applied to columns as well
Eg: add info on the fare paid using pd.qcut to automatically compute quantiles
The result is a four-dimensional aggregation with hierarchical indices
2. Additional pivot table options
fill_value and dropna - deal with missing data
aggfunc keyword - controls what type of aggregation is applied, which is a mean by default
compute totals along each grouping - margins keyword
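Eg (a sketch combining multi-level grouping with the aggfunc and margins options, again assuming the Seaborn Titanic data):
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset('titanic')

# Multi-level grouping: bin age as a third dimension with pd.cut
age = pd.cut(titanic['age'], [0, 18, 80])
print(titanic.pivot_table('survived', index=['sex', age], columns='class'))

# aggfunc controls the aggregation (mean by default);
# a dictionary maps column names to aggregation functions
print(titanic.pivot_table(index='sex', columns='class',
                          aggfunc={'survived': 'sum', 'fare': 'mean'}))

# margins=True adds totals along each grouping
print(titanic.pivot_table('survived', index='sex', columns='class', margins=True))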
Vectorized String Operations
• Pandas provides vectorized string operations through the str attribute of Series and Index objects
• ease in handling and manipulating string data
Introducing Pandas String Operations
Tables of Pandas String Methods
Methods using regular expressions
Miscellaneous methods
Vectorized item access and slicing
Indicator variables
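Eg (a sketch of vectorized string methods on a placeholder Series of names):
import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# Vectorized string methods via the str attribute; missing values stay missing
print(names.str.capitalize())
print(names.str.len())
print(names.str.startswith('p'))

# Methods using regular expressions
print(names.str.contains('[aeiou]', case=False))
print(names.str.extract('([A-Za-z]+)'))

# Vectorized item access and slicing
print(names.str[0:3])

# Indicator variables: split a delimited field into dummy columns
info = pd.Series(['B|C', 'A', 'A|C'])
print(info.str.get_dummies('|'))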