1. Need and Overview of Pandas:
What is Pandas?
Pandas is a Python library for data manipulation and analysis. It provides data
structures like Series (1D) and DataFrame (2D), making it easy to work with
structured data.
Why is Pandas Needed?
● Efficiently handles large datasets.
● Simplifies data cleaning, transformation, and analysis.
● Integrates with libraries like NumPy and Matplotlib.
● Supports various file formats: CSV, Excel, JSON, SQL, etc.
2. Setup for Pandas:
Step 1: Install Pandas
Pandas can be installed using pip, Python's package manager. Run:
pip install pandas
For Jupyter Notebook/ Google Colab users, install Pandas using the following command
to ensure compatibility:
!pip install pandas
Step 2: Import Pandas
To use Pandas in your Python script or notebook, import it using the standard alias:
import pandas as pd
3. Pandas Data Structures: Series and DataFrame
Pandas provides two main data structures to handle and manipulate data efficiently: Series and
DataFrame.
i. Series
A Series is a one-dimensional labeled array that can hold data of any type (e.g.,
integers, floats, strings). It is similar to a column in a spreadsheet or a Python list with an
index.
Key Features
● Indexing: Each element has a unique label (index).
● Homogeneous: Holds data of a single type (e.g., all integers or all strings).
Code Example:
1. Creating a Series
2. Accessing Data in a Series
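The two examples above can be sketched as follows; the data values are illustrative.

```python
import pandas as pd

# 1. Creating a Series with a custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# 2. Accessing data in a Series
print(s['b'])      # access by label -> 20
print(s.iloc[0])   # access by integer position -> 10
```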
ii. DataFrame
A DataFrame is a two-dimensional labeled data structure, similar to a table, with rows and columns.
Key Features
● Labeled Rows and Columns: Each row and column has a unique label (index
and column names).
● Heterogeneous: Columns can hold data of different types.
Note : Difference Between DataFrames and 2D Arrays:
DataFrames have labeled rows and columns, whereas arrays rely solely on numerical
indices.
Code Example:
1. Creating a DataFrame
2. Accessing Data in a DataFrame
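A minimal sketch of the two examples above; the column names and values are illustrative.

```python
import pandas as pd

# 1. Creating a DataFrame from a dictionary of columns
df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [25, 30]
})

# 2. Accessing data in a DataFrame
print(df['Name'])        # a single column (returns a Series)
print(df.loc[0, 'Age'])  # a single value by row label and column name -> 25
```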
Common File Formats for Datasets:
Note:
Parquet and Feather are file formats optimized for fast reading and writing of large
datasets. They are commonly used in data engineering and analytics for efficient
storage and processing.
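Each supported format has a matching reader/writer pair in pandas; a small sketch using CSV (the file name is a placeholder, and the other pairs are listed as comments):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})

# Write and read CSV; 'data.csv' is an illustrative file name
df.to_csv('data.csv', index=False)
df2 = pd.read_csv('data.csv')

# Analogous pairs exist for the other formats:
# df.to_excel(...)   / pd.read_excel(...)    (needs openpyxl)
# df.to_json(...)    / pd.read_json(...)
# df.to_parquet(...) / pd.read_parquet(...)  (needs pyarrow or fastparquet)
# pd.read_sql(...) reads from a database connection
```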
Common Methods for Inspecting Data in Pandas:
These methods are particularly helpful for inspecting large datasets by viewing a small subset at
the beginning, end, or randomly.
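The methods referred to here are presumably head(), tail(), and sample(); a minimal sketch on dummy data:

```python
import pandas as pd

df = pd.DataFrame({'x': range(100)})

print(df.head())     # first 5 rows (head(n) for the first n)
print(df.tail(3))    # last 3 rows
print(df.sample(2))  # 2 randomly chosen rows
```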
Details of DataFrames:
Labels (Columns, Index ), Shape, Size, Info, and Describe:
Pandas provides several methods to quickly understand and summarize the structure and
content of a DataFrame.
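A quick sketch of these attributes and methods on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

print(df.columns)     # column labels
print(df.index)       # row labels (index)
print(df.shape)       # (rows, columns) -> (2, 2)
print(df.size)        # total number of elements -> 4
df.info()             # dtypes, non-null counts, memory usage
print(df.describe())  # summary statistics for numeric columns
```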
Accessing Data Using .loc[] and .iloc[]
In pandas, .loc[] and .iloc[] are powerful indexers used to access and manipulate data in
a DataFrame.
1. .loc[]
.loc[] is primarily label-based indexing. It is used to access rows and columns by their
labels (names).
● It can accept a row label and column label to return a specific value or subset
of data.
● You can use boolean conditions with .loc[] as well.
Code Example:
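A minimal sketch of label-based access with .loc[]; the data is illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 32, 41], 'City': ['NY', 'Chicago', 'Houston']},
    index=['a', 'b', 'c']
)

print(df.loc['b', 'Age'])         # single value by row and column label -> 32
print(df.loc['a':'b', ['City']])  # label slices include BOTH endpoints
print(df.loc[df['Age'] > 30])     # boolean condition selects matching rows
```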
2. .iloc[]
.iloc[] is primarily integer position-based indexing. It is used to access rows and columns
by their integer index positions.
● It works with integer-based indexing, so you can provide the position of the rows and
columns.
● It does not include the last index (like Python's usual behavior with slicing).
Code Example:
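A minimal sketch of position-based access with .iloc[], on the same illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 32, 41], 'City': ['NY', 'Chicago', 'Houston']},
    index=['a', 'b', 'c']
)

print(df.iloc[0, 1])   # row 0, column 1 -> 'NY'
print(df.iloc[0:2])    # rows 0 and 1; end position 2 is EXCLUDED
print(df.iloc[:, 0])   # all rows of the first column
```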
When to Use:
● .loc[] is useful when you need to access data by names (labels).
● .iloc[] is best when you need to access data by integer position (index numbers).
Accessing Single Values Using .at[] and .iat[]
.at[] is used to access a single value in a DataFrame by label.
Example:
df.at[row_label, column_label]
.iat[] is used to access a single value in a DataFrame by integer position.
Example:
df.iat[row_position, column_position]
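A short sketch of both single-value accessors; the labels are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30]}, index=['alice', 'bob'])

print(df.at['bob', 'Age'])  # by row and column label -> 30
print(df.iat[0, 0])         # by integer row and column position -> 25

# .at[] can also set a single value in place
df.at['alice', 'Age'] = 26
```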
Accessing Columns: Shorthand and Dot Notation
Shorthand Notation: Access a column in a DataFrame by label using square brackets.
Example:
df['column_name']
Dot Notation: Access a column in a DataFrame by label using dot notation.
Example:
df.column_name
Filtering Data Based on Conditions
You can filter data by applying conditions to one or more columns to return rows that meet the
specified criteria.
Syntax:
df[condition]
condition: Boolean condition applied to one or more columns.
Example:
1. Filter rows based on a single condition:
Condition: Age > 30
df[df['Age'] > 30]
2. Filter rows based on multiple conditions (AND):
Condition: Age > 30 and City is "Chicago"
df[(df['Age'] > 30) & (df['City'] == 'Chicago')]
3. Filter rows based on multiple conditions (OR):
Condition: Age > 30 or City is "New York"
df[(df['Age'] > 30) | (df['City'] == 'New York')]
4. Filter rows using isin() for multiple values:
Condition: City is either "Chicago" or "Houston"
df[df['City'].isin(['Chicago', 'Houston'])]
Note: You can apply conditions based on numerical comparisons, string matching, and more,
using & (AND) and | (OR) for combining multiple conditions.
Regular Expressions (Regex) in Pandas
Regular expressions allow you to filter, match, and manipulate string data in pandas columns
based on patterns.
Common Syntax:
1. Filter rows containing a pattern:
Syntax:
df[df['column_name'].str.contains('pattern', regex=True)]
2. Filter rows not containing a pattern:
Syntax:
df[~df['column_name'].str.contains('pattern', regex=True)]
3. Filter rows starting with a specific pattern:
Syntax:
df[df['column_name'].str.match('^pattern')]
4. Replace values using regex:
Syntax:
df['column_name'] = df['column_name'].str.replace('pattern',
'replacement', regex=True)
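A small worked example of the four operations above; the column data and patterns are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'City': ['New York', 'Newark', 'Chicago', 'Houston']})

# Rows containing 'New' anywhere -> New York, Newark
print(df[df['City'].str.contains('New', regex=True)])

# Rows starting with 'New' followed by whitespace -> New York only
print(df[df['City'].str.match(r'^New\s')])

# Replace a leading 'New ' with 'Old '
df['City'] = df['City'].str.replace(r'^New\s', 'Old ', regex=True)
print(df['City'].tolist())  # ['Old York', 'Newark', 'Chicago', 'Houston']
```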
Common patterns widely used with regex:
Transforming Data Using apply()
The apply() method in pandas is used to apply a custom function or a predefined operation
along the rows (axis=1) or columns (axis=0) of a DataFrame or on a Series.
Syntax:
For Series: Series.apply(func)
For DataFrame: DataFrame.apply(func, axis=0/1)
Example 1: Applying a Function to a Series
Example 2: Applying a Function to a Series
Example 3: Applying a Function Along DataFrame Rows
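The three examples above can be sketched as follows; whether Example 2 originally used a lambda is an assumption, and the data is illustrative:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# Example 1: a named function applied to a Series
def square(x):
    return x ** 2
print(s.apply(square))             # 1, 4, 9

# Example 2: a lambda applied to a Series
print(s.apply(lambda x: x + 100))  # 101, 102, 103

# Example 3: a function applied along DataFrame rows (axis=1)
print(df.apply(lambda row: row['a'] + row['b'], axis=1))  # 11, 22
```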
Transforming or Adding Data Using where()
The where() method in pandas is used to conditionally transform data. It retains values that
meet a given condition and replaces others with a specified value (default is NaN).
Syntax:
For Series: Series.where(cond, other=np.nan) # np -> numpy alias
For DataFrame: DataFrame.where(cond, other=np.nan, axis=0)
Example:
Let's use where() on a Series and a DataFrame:
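A minimal sketch of both forms; the data and thresholds are illustrative:

```python
import pandas as pd

s = pd.Series([10, 25, 40])
df = pd.DataFrame({'a': [1, 50], 'b': [100, 2]})

# Keep values > 20; replace the rest with NaN (the default 'other')
print(s.where(s > 20))            # NaN, 25.0, 40.0

# Keep values > 10; replace the rest with 0
print(df.where(df > 10, other=0))
```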
Inserting Columns:
Syntax : df.insert(position, new_column_name, column_data)
Example: df.insert(1, 'Gender', ['F', 'M'])
Dropping Columns
Syntax : df.drop(column_name, axis=1, inplace=True)
Example: df.drop('Gender', axis=1, inplace=True)
Renaming Columns
Syntax :
df.rename(columns={'old_column_name': 'new_column_name'},
inplace=True)
Example:
df.rename(columns={'name': 'FullName'}, inplace=True)
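The three operations above combined into one runnable sketch; the data is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Insert 'Gender' as the second column (position 1)
df.insert(1, 'Gender', ['F', 'M'])

# Rename 'name' to 'FullName'
df.rename(columns={'name': 'FullName'}, inplace=True)

# Drop the 'Gender' column again
df.drop('Gender', axis=1, inplace=True)

print(df.columns.tolist())  # ['FullName', 'Age']
```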
Merging DataFrames: Inner, Outer, Left, Right Joins
Merging combines two DataFrames using a common key (or keys). Joins control how the
DataFrames are merged based on the relationship of their keys.
Code Examples:
1. Inner Join
2. Outer Join
3. Left Join
4. Right Join
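The four join types above can be sketched on two small DataFrames sharing a 'key' column; the data is illustrative:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'L': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'R': [20, 30, 40]})

# 1. Inner join: only keys present in both -> b, c
inner = pd.merge(left, right, on='key', how='inner')

# 2. Outer join: union of all keys -> a, b, c, d (missing values become NaN)
outer = pd.merge(left, right, on='key', how='outer')

# 3. Left join: every key from 'left' -> a, b, c
left_j = pd.merge(left, right, on='key', how='left')

# 4. Right join: every key from 'right' -> b, c, d
right_j = pd.merge(left, right, on='key', how='right')
```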
Concatenating DataFrames
Concatenation in pandas refers to combining two or more DataFrames along a particular axis
(either rows or columns). The concat() function is used to join DataFrames either vertically
(stacking rows) or horizontally (joining columns).
Syntax:
pd.concat([df1, df2, ...], axis=0, join='outer', ignore_index=False)
Note:
axis: Determines whether to concatenate along rows (axis=0, default) or columns (axis=1).
join: Specifies how to handle columns that are not present in both DataFrames:
● 'outer' (default): Includes all columns (union of columns).
● 'inner': Includes only columns common to all DataFrames.
ignore_index: If True, the index is reset. If False, keeps the original index from each
DataFrame.
Code Example
1. Concatenate Vertically (Stacking Rows)
2. Concatenate Horizontally (Joining Columns)
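Both directions can be sketched as follows; the data is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# 1. Vertically: stack rows, resetting the index
rows = pd.concat([df1, df2], axis=0, ignore_index=True)  # 4 rows, 2 columns

# 2. Horizontally: place the DataFrames side by side
cols = pd.concat([df1, df2], axis=1)                     # 2 rows, 4 columns
```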
Handling Null (Missing) Values in Pandas
Null values are represented as NaN in pandas. Handling them efficiently is essential for data
cleaning and preparation. Pandas provides several methods to detect, fill, or drop missing data.
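The common detect/fill/drop methods can be sketched as follows; the data is illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6]})

print(df.isnull())        # boolean mask of missing values
print(df.isnull().sum())  # count of missing values per column
print(df.fillna(0))       # replace NaN with a fixed value
print(df.dropna())        # drop rows containing any NaN
```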
Grouping Data Using groupby()
groupby() in pandas is a powerful tool for grouping data based on one or more columns,
followed by applying aggregation or transformation operations to each group. It is commonly
used for summarizing, aggregating, and transforming data.
Syntax:
df.groupby(by, axis=0, level=None, as_index=True, sort=True,
group_keys=True)
by: Column(s) or index level(s) to group by.
axis: Axis to group along (default is 0 for rows).
level: Group by a particular level (useful for MultiIndex).
as_index: If True (default), the group labels become the index.
sort: If True (default), the groups are sorted.
group_keys: If True (default), it includes group keys in the result.
Common Operations with groupby()
1. Aggregation (e.g., sum, mean)
2. Transformation (e.g., normalization, filling missing values)
3. Iteration (e.g., iterating over groups)
Code Example:
1. Grouping and Aggregating with sum()
2. Grouping and Aggregating with Multiple Functions
3. Grouping and Iterating Over Groups
4. Grouping by Multiple Columns
5. Transforming Data Within Groups Using transform()
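The five examples above can be sketched on one small DataFrame; the column names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Dept': ['IT', 'IT', 'HR', 'HR'],
    'City': ['NY', 'NY', 'NY', 'LA'],
    'Salary': [100, 120, 80, 90],
})

# 1. Grouping and aggregating with sum()
print(df.groupby('Dept')['Salary'].sum())           # HR: 170, IT: 220

# 2. Multiple aggregation functions at once via agg()
print(df.groupby('Dept')['Salary'].agg(['mean', 'max']))

# 3. Iterating over groups
for name, group in df.groupby('Dept'):
    print(name, len(group))

# 4. Grouping by multiple columns
print(df.groupby(['Dept', 'City'])['Salary'].sum())

# 5. transform() returns a result aligned with the original rows
df['DeptMean'] = df.groupby('Dept')['Salary'].transform('mean')
print(df['DeptMean'].tolist())  # [110.0, 110.0, 85.0, 85.0]
```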