CCS346 - EXPLORATORY DATA ANALYSIS
UNIT II EDA USING PYTHON
Data Manipulation using Pandas – Pandas Objects – Data Indexing and Selection – Operating on
Data – Handling Missing Data – Hierarchical Indexing – Combining datasets – Concat, Append,
Merge and Join – Aggregation and grouping – Pivot Tables – Vectorized String Operations.
COURSE OBJECTIVE:
To implement data manipulation using Pandas.
COURSE OUTCOME:
CO2: Implement data manipulation using Pandas
Data Manipulation using Pandas
Pandas - package built on top of NumPy
- provides an efficient implementation of a DataFrame
DataFrames - multidimensional arrays with attached row and column labels
- supports heterogeneous types and/or missing data
Pandas objects - enhanced versions of NumPy structured arrays
- rows and columns are identified with labels rather than simple integer indices
Three fundamental Pandas data structures: Series, DataFrame, and Index
Pandas Objects
1. The Pandas Series Object
Pandas Series - one-dimensional array of indexed data
- can be created from a list or array
- can be accessed with the values and index attributes
- can also be accessed using index via square-bracket notation
Series as generalized NumPy array
Numpy Array - has an implicitly defined integer index used to access the values
Pandas Series - has an explicitly defined index associated with the values
index need not be an integer
non-contiguous or non-sequential indices
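Eg (a minimal sketch; the values and index labels below are illustrative placeholders):
import pandas as pd

# Series from a list: values plus an implicit integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data.values)   # underlying NumPy array
print(data.index)    # RangeIndex(start=0, stop=4, step=1)
print(data[1])       # square-bracket access via the index

# Explicitly defined, non-integer index
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data['b'])

# Non-contiguous / non-sequential indices also work
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
print(data[5])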
Series as specialized dictionary
dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series
is a structure which maps typed keys to a set of typed values.
Series will be created where the index is drawn from the keys (sorted in older Pandas versions; newer versions preserve the dictionary's insertion order)
Series also supports array-style operations such as slicing
Constructing Series objects
Constructing a Pandas Series from scratch
Index is an optional argument
Data can be a list or NumPy array - index defaults to an integer sequence
Data can be a scalar, which is repeated to fill the specified index
Data can be a dictionary - index defaults to the dictionary keys (sorted in older Pandas versions)
Index can be explicitly set
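Eg (a brief sketch of the construction variants listed above; the data values are placeholders):
import pandas as pd

# Data as a list or array: index defaults to an integer sequence
pd.Series([2, 4, 6])

# Data as a scalar: repeated to fill the specified index
pd.Series(5, index=[100, 200, 300])

# Data as a dictionary: index is drawn from the keys
pd.Series({2: 'a', 1: 'b', 3: 'c'})

# Index set explicitly: only the listed keys are kept, in the given order
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])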
2. The Pandas DataFrame Object
DataFrame as a generalized NumPy array
DataFrame - is a two-dimensional array with both flexible row indices and flexible column
names
columns attribute - column labels
index attribute - gives access to the index labels
DataFrame as specialized dictionary
DataFrame maps a column name to a Series of column data
Constructing DataFrame objects
From a single Series object
From a list of dicts
From a dictionary of Series objects
From a two-dimensional NumPy array
From a NumPy structured array
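Eg (a sketch of the listed construction routes; the names and numbers are made-up placeholders):
import numpy as np
import pandas as pd

population = pd.Series({'Ohio': 100, 'Texas': 200})   # placeholder values
area = pd.Series({'Ohio': 10, 'Texas': 20})

# From a single Series object
pd.DataFrame(population, columns=['population'])

# From a list of dicts (missing keys are filled with NaN)
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# From a dictionary of Series objects
states = pd.DataFrame({'population': population, 'area': area})
print(states.columns)   # column labels
print(states.index)     # row (index) labels

# From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2), columns=['foo', 'bar'], index=['a', 'b', 'c'])

# From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
pd.DataFrame(A)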
3. The Pandas Index Object
- can be thought of as an immutable array or as an ordered set
One difference between Index objects and NumPy arrays is that indices are immutable–that
is, they cannot be modified via the normal means
Index as ordered set
- unions, intersections, differences, and other combinations can be computed
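Eg (a small sketch of Index immutability and set operations):
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

print(indA.intersection(indB))          # elements common to both
print(indA.union(indB))                 # all elements from either index
print(indA.symmetric_difference(indB))  # elements in exactly one of the two

# Index objects are immutable: uncommenting the next line raises a TypeError
# indA[1] = 0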
Data Indexing and Selection
Indexing, slicing, masking, fancy indexing and combinations
Data Selection in Series
Series as dictionary
Series as one-dimensional array
Indexers: loc, iloc, and ix
loc attribute - indexing and slicing that always refer to the explicit index
iloc attribute - indexing and slicing that always refer to the implicit Python-style integer index
ix - a hybrid of the two (deprecated and removed in newer Pandas versions)
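Eg (a sketch contrasting loc and iloc on a Series with an explicit integer index):
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: always refers to the explicit index labels
print(data.loc[1])     # 'a'
print(data.loc[1:3])   # labels 1 through 3 (inclusive)

# iloc: always refers to the implicit integer positions
print(data.iloc[1])    # 'b'
print(data.iloc[1:3])  # positions 1 and 2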
Data Selection in DataFrame
attribute-style access with column names
If column names are not strings, or if the column names conflict with methods of the DataFrame
– attribute-style access is not possible. Eg: the DataFrame pop() method shadows a column named 'pop'
DataFrame as two-dimensional array
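Eg (a sketch of DataFrame selection; the column names area/pop/density are placeholders, with 'pop' chosen to show the clash with the pop() method):
import pandas as pd

df = pd.DataFrame({'area': [100, 200, 300],
                   'pop': [10, 40, 90]},
                  index=['x', 'y', 'z'])

print(df['area'])   # dictionary-style column access
print(df.area)      # attribute-style access (string column names only)
print(df.pop)       # NOT the 'pop' column: it resolves to the pop() method

df['density'] = df['pop'] / df['area']   # add a derived column

# DataFrame as a two-dimensional array
print(df.values)              # raw underlying array
print(df.iloc[:2, :2])        # implicit integer indexing
print(df.loc[:'y', :'pop'])   # explicit label indexing
print(df.loc[df.density > 0.2, ['pop', 'density']])   # masking + fancy indexing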
Operating on Data
Performing element-wise operations
basic arithmetic - addition, subtraction, multiplication
sophisticated operations - trigonometric functions, exponential and logarithmic functions
Pandas inherits much of this functionality from NumPy via its ufuncs
Ufuncs: Index Preservation
all NumPy ufuncs work on Pandas Series and DataFrame objects and preserve index labels in the output
Index alignment in Series
For binary operations - Pandas will align indices
Index alignment in DataFrame
Ufuncs: Operations Between DataFrame and Series
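Eg (a sketch of index preservation and alignment; all values are placeholders):
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(np.exp(ser))   # ufuncs preserve index labels in the output

# Index alignment in Series: the result index is the union of the inputs,
# with NaN at labels that appear in only one of them
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
print(A + B)
print(A.add(B, fill_value=0))   # fill missing entries before adding

# Index alignment in DataFrame: aligned on both columns and rows
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [10, 20, 30], 'C': [40, 50, 60]})
print(df1 + df2)

# Operations between DataFrame and Series align on columns (row-wise)
print(df1 - df1.iloc[0])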
Handling Missing Data
• real-world data is rarely clean and homogeneous
• different data sources may indicate missing data in different ways - null, NaN, or NA
Trade-Offs in Missing Data Conventions
➢ number of schemes have been developed
➢ two strategies:
❑ mask
o Globally indicate missing values
o A separate Boolean array, or one bit in the data representation, marks each null value
o Requires allocation of an additional Boolean array - adds overhead in
both storage and computation
❑ sentinel value
o Indicates a missing entry
o Data-specific convention, such as -9999 or some rare bit pattern
o Reduces the range of valid values that can be represented
➢ Missing Data in Pandas
• use sentinels for missing data
• special floating-point NaN value
• Python None object
➢ None: Pythonic missing data
• Used only in arrays with data type 'object'
• Performing aggregations such as sum() or min() on an array with a None value results in an error
NaN: Missing numerical data
it is a special floating-point value
result of arithmetic with NaN will be another NaN
Aggregates over the values are well defined (i.e., they don't result in an error) but not always
useful
NaN and None in Pandas
Pandas automatically converts the None to a NaN value
Operating on Null Values
Detecting null values
Dropping null values
We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
By default, dropna() will drop all rows in which any null value is present
axis=1 drops all columns containing a null value
Filling null values
fill NA entries with a single value, such as zero:
forward-fill to propagate the previous value forward
if a previous value is not available during a forward fill, the NA value remains.
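Eg (a sketch of detecting, dropping, and filling nulls on placeholder data):
import numpy as np
import pandas as pd

# None is converted to NaN when placed in a numeric Series
data = pd.Series([1, np.nan, 2, None])

# Detecting null values
print(data.isnull())          # Boolean mask of missing entries
print(data[data.notnull()])   # keep only the non-null values

# Dropping null values (full rows or full columns only)
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
print(df.dropna())         # drop every row containing a null value
print(df.dropna(axis=1))   # drop every column containing a null value

# Filling null values
print(data.fillna(0))              # fill NA entries with a single value
print(df.fillna(method='ffill'))   # forward-fill (df.ffill() in newer Pandas);
                                   # a leading NA with no previous value remains NA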
Hierarchical Indexing
❑ Multi-indexing
❑ Store higher-dimensional data – data indexed by more than one or two keys
❑ Incorporate multiple index levels within a single index
❑ Higher-dimensional data can be represented within the 1D Series and 2D DataFrame
objects
A Multiply Indexed Series
❑ represent 2D data within a 1D Series
The bad way
❑ use Python tuples as keys to index or slice the series
❑ selecting all values from, say, 2010 then requires messy (and potentially slow) munging
The Better Way: Pandas MultiIndex
create a multi-index from the tuples
MultiIndex contains multiple levels of indexing
❑ Some entries are missing in the first column
❑ Blank entry indicates the same value as the line above it
Access all data for which the second index is 2010
MultiIndex as extra dimension
Each extra level in a multi-index represents an extra dimension of data
unstack() method - convert a multiply indexed Series into a DataFrame
stack() method provides the opposite operation:
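Eg (a sketch of a multiply indexed Series; the state/year labels and values are placeholders):
import pandas as pd

index = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                   ('New York', 2000), ('New York', 2010)])
pop = pd.Series([100, 110, 200, 210], index=index)

# Access all data for which the second index is 2010
print(pop[:, 2010])

# unstack(): multiply indexed Series -> DataFrame
pop_df = pop.unstack()
print(pop_df)

# stack(): the opposite operation, recovering the original Series
print(pop_df.stack())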
Methods of MultiIndex Creation
1. Pass a list of two or more index arrays to the constructor
2. Pass a dictionary with appropriate tuples as keys
Explicit MultiIndex constructors
from a simple list of arrays
from a list of tuples
from a Cartesian product of single indices
directly from its internal encoding, by passing levels and labels (called codes in newer Pandas versions)
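Eg (a sketch of the explicit constructors; newer Pandas versions name the second argument of the direct constructor codes rather than labels):
import pandas as pd

# From a simple list of arrays
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

# From a list of tuples
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

# From a Cartesian product of single indices
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Directly from the internal encoding (levels plus codes/labels)
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])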
MultiIndex level names
MultiIndex for columns
four-dimensional data, where the dimensions are the subject, the measurement type, the year,
and the visit number
Indexing and Slicing a MultiIndex
Multiply indexed Series
access single elements by indexing with multiple terms
partial indexing, or indexing just one of the levels in the index
Partial slicing is available as well, as long as the MultiIndex is sorted
With sorted indices, partial indexing can be performed on lower
levels by passing an empty slice in the first index
Selection based on Boolean masks
Selection based on fancy indexing
Multiply indexed DataFrames
Recover Guido's heart rate
Using loc, iloc, and ix indexers
Each individual index in loc or iloc can be passed as a
tuple of multiple indices
creating a slice within a tuple leads to a syntax error; use Pandas' IndexSlice object instead
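Eg (a sketch of slicing a multiply indexed DataFrame, loosely following the subject/type/year/visit layout described above; the data is random placeholder values):
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
health_data = pd.DataFrame(np.random.rand(4, 4), index=index, columns=columns)

# Recover Guido's heart rate data
print(health_data['Guido', 'HR'])

# Each index in loc/iloc can be a tuple of multiple indices
print(health_data.loc[(2013, 1), ('Bob', 'HR')])

# A slice inside a tuple is a syntax error; use IndexSlice instead
idx = pd.IndexSlice
print(health_data.loc[idx[:, 1], idx[:, 'HR']])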
Rearranging Multi-Indices
1. Sorted and unsorted indices
Many of the MultiIndex slicing operations will fail if the index is not sorted
a partial slice of an unsorted index results in an error
With the index sorted - partial slicing will work as expected
2. Stacking and unstacking indices
Convert a dataset from a stacked multi-index to a simple two-dimensional representation
The opposite of unstack() is stack() - used to recover the original series
3. Index setting and resetting
Turn the index labels into columns - reset_index method
Build a MultiIndex from the column values – set_index method
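Eg (a sketch of reset_index and set_index on a small multiply indexed Series with placeholder values):
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]], names=['char', 'int'])
data = pd.Series([0.1, 0.2, 0.3, 0.4], index=index, name='value')

# reset_index: turn the index labels into ordinary columns
flat = data.reset_index()
print(flat)

# set_index: rebuild a MultiIndex from column values
print(flat.set_index(['char', 'int']))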
Data Aggregations on Multi-Indices
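Eg (a sketch of aggregating over one level of a hierarchical index; the older mean(level=...) shortcut is deprecated in newer Pandas in favor of groupby(level=...)):
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
data = pd.Series([10, 20, 30, 40], index=index)

# Aggregate over one level of the hierarchy
print(data.groupby(level='year').mean())
print(data.groupby(level='visit').sum())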
Combining datasets
a helper function that creates a DataFrame of a particular form is used in the examples that follow
Concat, Append, Merge and Join
Concatenation of NumPy Arrays
1. Simple Concatenation with pd.concat
Concatenate higher-dimensional objects
❑ By default, the concatenation takes place row-wise within the DataFrame
❑ pd.concat allows specification of an axis
Duplicate indices
Pandas concatenation preserves indices, even if the result will have duplicate indices
Ignoring the index
Adding MultiIndex keys
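Eg (a sketch of pd.concat behavior on placeholder frames):
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']})
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']})

# Row-wise concatenation is the default; indices are preserved,
# so the result has duplicate index values 0 and 1
print(pd.concat([df1, df2]))

# Concatenate along the other axis
print(pd.concat([df1, df2], axis=1))

# Ignoring the index: a fresh integer index is created for the result
print(pd.concat([df1, df2], ignore_index=True))

# Adding MultiIndex keys to label the source of each block
print(pd.concat([df1, df2], keys=['x', 'y']))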
2. Concatenation with joins
data from different sources might have different sets of column names
❑ By default, entries for which no data is available are filled with NA values
❑ To change this - specify join and join_axes parameters
❑ By default, the join is a union of the input columns (join='outer'), can be changed to
intersection - using join='inner'
Use the join_axes argument to directly specify the index (removed in newer Pandas versions; reindex the result instead)
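Eg (a sketch of concatenation with joins on placeholder frames; join_axes appears only as a comment because it has been removed from current Pandas):
import pandas as pd

df5 = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})
df6 = pd.DataFrame({'B': [7, 8], 'C': [9, 10], 'D': [11, 12]})

# Default: join='outer' -> union of columns, missing entries become NaN
print(pd.concat([df5, df6]))

# join='inner' -> intersection of columns (A and D are dropped)
print(pd.concat([df5, df6], join='inner'))

# Older Pandas: pd.concat([df5, df6], join_axes=[df5.columns])
# Current Pandas: reindex the result to keep only df5's columns
print(pd.concat([df5, df6]).reindex(columns=df5.columns))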
3. The append() method
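append() was a shortcut around pd.concat; it was deprecated and removed in Pandas 2.x, so the sketch below shows the pd.concat equivalent:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Older Pandas: df1.append(df2)
# append() did not modify df1 in place; it built a new object on every call,
# so pd.concat over a list of frames is both the current API and more efficient
print(pd.concat([df1, df2], ignore_index=True))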
II. Merge and Join
high-performance, in-memory join and merge operations
Categories of Joins
1. One-to-one joins
❑ simplest type of merge
❑ very similar to the column-wise concatenation
❑ pd.merge() – use common column as a key
❑ The result of the merge is a new DataFrame
❑ Order of entries in each column is not maintained
❑ Merge in general discards the index
2. Many-to-one joins preserve duplicate entries as appropriate
3. Many-to-many joins
If the key column in both the left and right array contains duplicates, then the result is a many-to-
many merge
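Eg (a sketch of the three join categories; the employee/group/supervisor data is made up for illustration):
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa'],
                    'group': ['Accounting', 'Engineering', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake'],
                    'hire_date': [2004, 2008, 2012]})

# One-to-one join on the common 'employee' column
df3 = pd.merge(df1, df2)
print(df3)

# Many-to-one: 'group' repeats in df3, so supervisor entries are duplicated as needed
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering'],
                    'supervisor': ['Carly', 'Guido']})
print(pd.merge(df3, df4))

# Many-to-many: the key column has duplicates on both sides
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting', 'Engineering'],
                    'skills': ['math', 'spreadsheets', 'coding']})
print(pd.merge(df1, df5))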
Specification of the Merge Key
1. The on keyword
This option works only if both the left and right DataFrames have the specified column name.
2. The left_on and right_on keywords
merge two datasets with different column names
The result has a redundant column that we can drop if desired
3. The left_index and right_index keywords
merge on an index
DataFrames implement the join() method - merge on indices
mix indices and columns
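Eg (a sketch of the merge-key options on placeholder data; the salary figures are arbitrary):
import pandas as pd

df1 = pd.DataFrame({'employee': ['Bob', 'Jake'], 'group': ['Accounting', 'Engineering']})
df2 = pd.DataFrame({'employee': ['Bob', 'Jake'], 'hire_date': [2008, 2012]})
df3 = pd.DataFrame({'name': ['Bob', 'Jake'], 'salary': [70000, 80000]})

# The on keyword: both DataFrames must have the named column
print(pd.merge(df1, df2, on='employee'))

# left_on / right_on: key columns have different names;
# the redundant 'name' column can then be dropped
print(pd.merge(df1, df3, left_on='employee', right_on='name').drop('name', axis=1))

# left_index / right_index: merge on the index instead of a column
df1a, df2a = df1.set_index('employee'), df2.set_index('employee')
print(pd.merge(df1a, df2a, left_index=True, right_index=True))

# join(): merges on indices by default; indices and columns can also be mixed
print(df1a.join(df2a))
print(pd.merge(df1a, df3, left_index=True, right_on='name'))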
Specifying Set Arithmetic for Joins
inner join - the how keyword defaults to 'inner'
result contains the intersection of the two sets of inputs
❑ other options for how are 'outer', 'left', and 'right'
❑ outer join - returns a join over the union of the input columns, and fills in all missing
values with NAs
The left join and right join return joins over the left entries and right entries, respectively
Overlapping Column Names: The suffixes Keyword
two input DataFrames have conflicting column names
❑ merge function automatically appends a suffix _x or _y to make the output columns
unique.
❑ It is possible to specify a custom suffix using the suffixes keyword
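Eg (a sketch of the how and suffixes keywords on placeholder data):
import pandas as pd

df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'], 'food': ['fish', 'beans', 'bread']})
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'], 'drink': ['wine', 'beer']})

print(pd.merge(df6, df7, how='inner'))   # intersection of keys (default)
print(pd.merge(df6, df7, how='outer'))   # union of keys, NaN where data is missing
print(pd.merge(df6, df7, how='left'))    # keep every left-hand entry
print(pd.merge(df6, df7, how='right'))   # keep every right-hand entry

# Overlapping column names: _x / _y suffixes are added automatically
df8 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [1, 2]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake'], 'rank': [3, 1]})
print(pd.merge(df8, df9, on='name'))                          # rank_x, rank_y
print(pd.merge(df8, df9, on='name', suffixes=['_L', '_R']))   # custom suffixes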
Aggregation and grouping
Efficient summarization: computing aggregations like sum(), mean(), median(), min(), and
max(), in which a single number gives insight into the nature of a potentially large dataset
Simple Aggregation in Pandas
for a Pandas Series the aggregates return a single value
For a DataFrame, by default the aggregates return results within each column
describe() - computes several common aggregates for each column and returns the result.
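Eg (a sketch of simple aggregation on random placeholder data):
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
print(ser.sum(), ser.mean())     # a Series aggregate returns a single value

df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)})
print(df.mean())                 # DataFrame aggregates return one value per column
print(df.mean(axis='columns'))   # aggregate within each row instead
print(df.describe())             # several common aggregates for each column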
GroupBy: Split, Apply, Combine
aggregate conditionally on some label or index - implemented by groupby operation
1. Split, apply, combine
❑ The split step involves breaking up and grouping a DataFrame depending on the value of
the specified key.
❑ The apply step involves computing some function, usually an aggregate, transformation,
or filtering, within the individual groups.
❑ The combine step merges the results of these operations into an output array.
❑ Does not return DataFrames
❑ Returns a DataFrameGroupBy object
❑ Does no actual computation until the aggregation is applied - "lazy evaluation"
❑ To produce a result - apply an aggregate to the DataFrameGroupBy object
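Eg (a minimal sketch of split-apply-combine on a placeholder key/data frame):
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

# groupby() returns a lazy DataFrameGroupBy object, not a DataFrame
grouped = df.groupby('key')
print(grouped)

# Split-apply-combine happens when an aggregate is applied
print(grouped.sum())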
2. The GroupBy object
Aggregate, filter, transform, and apply
pass a dictionary mapping column names to operations to be applied on that column
Aggregation
It can take a string, a function, or a list, and compute all the aggregates at once
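Eg (a sketch of aggregate() with a list and a dictionary of operations, on placeholder data):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# A string, a function, or a list of these can be passed
print(df.groupby('key').aggregate(['min', 'median', 'max']))

# A dictionary mapping column names to operations
print(df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'}))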
Filtering
❑ Allows to drop data based on the group properties.
❑ Eg: All groups in which the standard deviation is larger than some critical value
❑ The filter function - return a Boolean value specifying whether the group passes the
filtering.
Here because group A does not have a standard deviation greater than 4, it is dropped from the
result
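Eg (a sketch of filter(); with the seed used here, group A's data2 standard deviation should fall below the threshold, so it is dropped):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# Keep only the groups whose data2 standard deviation exceeds 4;
# groups that fail the test are removed from the result
print(df.groupby('key').filter(lambda x: x['data2'].std() > 4))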
Transformation
transformation - return transformed version of the full data to recombine
Eg: Center the data by subtracting the group-wise mean
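Eg (a sketch of transform() centering each group, using the same placeholder frame):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# Center the data by subtracting the group-wise mean;
# the result has the same shape as the input
print(df.groupby('key').transform(lambda x: x - x.mean()))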
The apply() method
Lets you apply an arbitrary function to the group results
Eg: Normalizes the first column by the sum of the second
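Eg (a sketch of apply() normalizing the first data column by the group sum of the second; newer Pandas may warn that the grouping column is passed to the function):
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

def norm_by_data2(x):
    # x is a DataFrame holding one group's rows
    x = x.copy()
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

print(df.groupby('key').apply(norm_by_data2))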
Pivot Tables
❑ A pivot table is a similar operation that is commonly seen in spreadsheets and other
programs that operate on tabular data.
❑ The pivot table takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
Motivating Pivot Tables
Database of passengers on the Titanic, available through the Seaborn library
Pivot Tables by Hand
Group according to gender, survival status, or some combination
Look at survival by both sex and, say, class.
This type of two-dimensional GroupBy is common enough that Pandas includes a convenience routine, pivot_table, to handle this type of multidimensional aggregation.
Pivot Table Syntax
More readable than the groupby approach, and produces the same result.
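Eg (a sketch comparing the by-hand GroupBy with pivot_table, assuming the Titanic dataset is loaded through Seaborn as mentioned above):
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset('titanic')

# Pivot table "by hand": a two-dimensional GroupBy
print(titanic.groupby(['sex', 'class'])['survived'].mean().unstack())

# Equivalent, more readable pivot_table call (aggfunc is mean by default)
print(titanic.pivot_table('survived', index='sex', columns='class'))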
1. Multi-level pivot tables
❑ Grouping in pivot tables can be specified with multiple levels, and via a number of
options.
❑ For example: age as a third dimension - bin the age using the pd.cut function
The same strategy can be applied to columns as well
Eg: add info on the fare paid using pd.qcut to automatically compute quantiles
The result is a four-dimensional aggregation with hierarchical indices
2. Additional pivot table options
fill_value and dropna - deal with missing data
aggfunc keyword - controls what type of aggregation is applied, which is a mean by default
compute totals along each grouping - margins keyword
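Eg (a sketch combining multi-level grouping with the aggfunc and margins options, again assuming the Seaborn Titanic data):
import seaborn as sns
import pandas as pd

titanic = sns.load_dataset('titanic')

# Multi-level grouping: bin age as a third dimension with pd.cut
age = pd.cut(titanic['age'], [0, 18, 80])
print(titanic.pivot_table('survived', index=['sex', age], columns='class'))

# aggfunc controls the aggregation (mean by default);
# a dictionary maps column names to aggregation functions
print(titanic.pivot_table(index='sex', columns='class',
                          aggfunc={'survived': 'sum', 'fare': 'mean'}))

# margins=True adds totals along each grouping
print(titanic.pivot_table('survived', index='sex', columns='class', margins=True))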
Vectorized String Operations
• Pandas provides vectorized string operations through the str attribute of Series and Index objects
• ease in handling and manipulating string data
Introducing Pandas String Operations
Tables of Pandas String Methods
Methods using regular expressions
Miscellaneous methods
Vectorized item access and slicing
Indicator variables
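Eg (a sketch of vectorized string methods on a placeholder Series of names):
import pandas as pd

names = pd.Series(['peter', 'Paul', None, 'MARY', 'gUIDO'])

# Vectorized string methods via the str attribute; missing values stay missing
print(names.str.capitalize())
print(names.str.len())
print(names.str.startswith('p'))

# Methods using regular expressions
print(names.str.contains('[aeiou]', case=False))
print(names.str.extract('([A-Za-z]+)'))

# Vectorized item access and slicing
print(names.str[0:3])

# Indicator variables: split a delimited field into dummy columns
info = pd.Series(['B|C', 'A', 'A|C'])
print(info.str.get_dummies('|'))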