Pandas Guide

Uploaded by

gulnaaz.data.analyst

Pandas for Beginners

Pandas Overview
Pandas is a powerful and flexible open-source data analysis and manipulation library in Python. It is
widely used for data analysis tasks such as cleaning, transforming, and visualizing data. Here's a guide
to get you started with Pandas.

1. Installing Pandas

To install Pandas, you can use pip:

bash

pip install pandas

2. Importing Pandas

You start by importing the library:

python

import pandas as pd

3. Creating DataFrames

A DataFrame is a 2-dimensional labeled data structure (like a table with rows and columns).

From a Dictionary:

python

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

From a CSV File:

python

df = pd.read_csv('file.csv')

From an Excel File:

python

df = pd.read_excel('file.xlsx')

4. Viewing Data

Head & Tail: View the first or last few rows.

python

df.head() # Default shows the first 5 rows

df.tail(3) # Shows the last 3 rows

Shape: Get the dimensions (rows, columns).

python

print(df.shape)

Info: Get a summary of the DataFrame.

python

df.info() # info() prints the summary itself and returns None, so no print() is needed

Describe: Get statistical summary of numerical columns.

python

print(df.describe())

5. Selecting Data

Select Columns:

python

print(df['Name']) # Select single column

print(df[['Name', 'Age']]) # Select multiple columns

Select Rows:

Using `iloc` for position-based selection:

python

print(df.iloc[0]) # Select first row

print(df.iloc[0:2]) # Select first two rows

Using `loc` for label-based selection:

python

print(df.loc[0]) # Select the row whose index label is 0 (the first row with the default index)
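With the default integer index, `iloc[0]` and `loc[0]` return the same row, which hides the difference between them. The sketch below uses a hypothetical frame called `people` with string labels (the labels 'x', 'y', 'z' are assumptions for illustration) to show that `iloc` counts positions while `loc` looks up labels, and that the two treat slice endpoints differently.

```python
import pandas as pd

# Hypothetical frame with string labels instead of the default 0..n-1 index
people = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=['x', 'y', 'z'],
)

by_position = people.iloc[0]   # first row, whatever its label is
by_label = people.loc['x']     # row whose index label is 'x'
print(by_position.equals(by_label))

# Slices differ: iloc excludes the end position, loc includes the end label
position_slice = people.iloc[0:1]
label_slice = people.loc['x':'y']
print(len(position_slice), len(label_slice))
```

Calling `people.loc[0]` on this frame would raise a `KeyError`, because no row carries the label 0.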

Conditional Selection:

python

# Select rows where Age is greater than 30

print(df[df['Age'] > 30])

6. Data Cleaning

Handling Missing Values:

Check for missing values:

python

print(df.isnull().sum())

Drop missing values:

python

df.dropna(inplace=True)

Fill missing values:

python

df.fillna(0, inplace=True) # Fill with a specific value

Rename Columns:

python

df.rename(columns={'Name': 'Full Name'}, inplace=True)

7. Data Manipulation

Adding a New Column:

python

df['Salary'] = [50000, 60000, 70000]

Deleting Columns:

python

df.drop(columns=['Salary'], inplace=True)

Sorting Data:

python

df.sort_values(by='Age', ascending=False, inplace=True)

8. GroupBy and Aggregation

You can group data by certain columns and apply aggregation functions.

python

grouped = df.groupby('City').mean(numeric_only=True) # numeric_only skips text columns such as names
print(grouped)

For more complex operations, you can apply custom aggregation:

python

df.groupby('City').agg({'Age': 'mean', 'Salary': 'sum'}) # Assumes df has a 'Salary' column

9. Merging and Joining DataFrames

Merging:

You can merge DataFrames similarly to SQL joins.

python

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

merged_df = pd.merge(df1, df2, on='Name')

Concatenating:

Concatenate two DataFrames:

python

df_concat = pd.concat([df1, df2], axis=0) # axis=0 for rows, axis=1 for columns
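Because `df1` and `df2` share only the 'Name' column, the two axis settings give quite different results. The following sketch reuses the two frames from the merge example; the variable names `stacked` and `side_by_side` are assumptions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

# axis=0 stacks rows; columns missing from one frame are filled with NaN
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)
print(stacked.shape)
print(stacked.isna().sum())

# axis=1 places the frames next to each other, so 'Name' appears twice
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side.columns.tolist())
```

When the goal is to line up rows by a shared key such as 'Name', `pd.merge` is usually the better fit; `pd.concat` simply glues frames together.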

10. Saving DataFrames

To CSV:

python

df.to_csv('output.csv', index=False)

To Excel:

python

df.to_excel('output.xlsx', index=False)

11. Advanced Topics

Pivot Tables:

python

df.pivot_table(values='Salary', index='City', columns='Age', aggfunc='mean')

Apply Functions: You can apply functions to DataFrame elements.

python

df['Salary'] = df['Salary'].apply(lambda x: x + 1000)

Pandas is a highly flexible tool with a rich API that covers a wide range of data manipulation tasks. This
tutorial covers the most commonly used features to get you started.

The sections below revisit each Pandas topic with 10 examples, followed by 15 assessment
questions per topic for practice.

1. Creating DataFrames

Examples:

Example 1: Creating a DataFrame from a dictionary

python

data = {'Name': ['John', 'Sara', 'Tom'], 'Age': [28, 22, 35]}

df = pd.DataFrame(data)
print(df)

Example 2: Creating a DataFrame with specific indexes

python

data = {'Name': ['John', 'Sara', 'Tom'], 'Age': [28, 22, 35]}

df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)

Example 3: Creating a DataFrame from a list of lists

python

data = [['John', 28], ['Sara', 22], ['Tom', 35]]

df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Example 4: Creating a DataFrame from a list of dictionaries

python

data = [{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}, {'Name': 'Tom', 'Age': 35}]
df = pd.DataFrame(data)
print(df)

Example 5: Creating an empty DataFrame and adding data later

python

df = pd.DataFrame(columns=['Name', 'Age'])
df.loc[0] = ['John', 28]
df.loc[1] = ['Sara', 22]
print(df)

Example 6: Creating a DataFrame from NumPy arrays

python

import numpy as np
data = np.array([['John', 28], ['Sara', 22], ['Tom', 35]])
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
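One caveat with Example 6: a NumPy array holds a single data type, so mixing names and numbers turns every element into a string. A minimal sketch of checking for this and converting the column, assuming the ages are clean digit strings:

```python
import numpy as np
import pandas as pd

data = np.array([['John', 28], ['Sara', 22], ['Tom', 35]])
df = pd.DataFrame(data, columns=['Name', 'Age'])

# The array stores text only, so the ages come through as strings
print(df['Age'].tolist())

# Convert the column before doing arithmetic on it
df['Age'] = df['Age'].astype(int)
print(df['Age'].mean())
```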

Example 7: Creating a DataFrame from a CSV file

python

df = pd.read_csv('people.csv')
print(df)

Example 8: Creating a DataFrame from an Excel file

python

df = pd.read_excel('people.xlsx')
print(df)

Example 9: Creating a DataFrame from a list of tuples

python

data = [('John', 28), ('Sara', 22), ('Tom', 35)]

df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Example 10: Creating a DataFrame from another DataFrame

python

df1 = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})

df2 = pd.DataFrame(df1)
print(df2)
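When the new DataFrame is meant to be changed independently of the original, an explicit `copy()` makes that intent clear. The sketch below is one way to do it; the replacement value 99 is arbitrary.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})

# copy() returns a new DataFrame with its own data
df2 = df1.copy()
df2.loc[0, 'Age'] = 99   # arbitrary value, changes the copy only

print(df1['Age'].tolist())
print(df2['Age'].tolist())
```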

Assessments (15 Questions)

1. Create a DataFrame from a dictionary with the keys `['Name', 'Age']` and values `[['John',
'Tom'], [28, 35]]`.
2. Add a column `City` with values `['New York', 'Chicago']` to the DataFrame created in
Question 1.
3. Convert a list of lists `[['John', 28], ['Sara', 22], ['Tom', 35]]` into a DataFrame with column
names `['Name', 'Age']`.
4. Create an empty DataFrame and manually add a row with values `['John',
28]`.
5. Create a DataFrame from a CSV file `data.csv` where the first row is the header.
6. Use NumPy to create a DataFrame with an array `[['Sara', 22], ['Tom', 35]]` and assign
column names.
7. Convert a dictionary `{'Name': ['Sara', 'Tom'], 'Age': [22, 35]}` into a DataFrame with
specified row indexes `['row1', 'row2']`.
8. Use a list of tuples `('Sara', 22)` and `('Tom', 35)` to create a DataFrame with column names
`['Name', 'Age']`.
9. Create a DataFrame from a list of dictionaries `{'Name': 'Sara', 'Age': 22}` and `{'Name':
'Tom', 'Age': 35}`.
10. Add a new column called `Gender` with values `['Female', 'Male']` to an existing
DataFrame.
11. Save the DataFrame created in Question 10 to a CSV file.
12. Create a DataFrame using an Excel file named `people_data.xlsx`.
13. Create a DataFrame by merging two existing DataFrames.
14. Load a DataFrame from a JSON file.
15. Create a DataFrame from another DataFrame and modify one of its columns.

2. Viewing Data

This topic focuses on how to explore and examine a DataFrame, which is crucial in data analysis. The
examples below walk through common methods for viewing the contents of a DataFrame, followed by
assessment questions.

Examples

Example 1: Viewing the first few rows with `head()`

python

import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 22]}
df = pd.DataFrame(data)

# View the first 3 rows

print(df.head(3))

Example 2: Viewing the last few rows with `tail()`

python

# View the last 2 rows

print(df.tail(2))

Example 3: Checking the shape of the DataFrame

python

# Get the number of rows and columns

print(df.shape)

Example 4: Getting information about the DataFrame using `info()`

python

# Summary info about the DataFrame

df.info() # info() prints the summary itself and returns None, so no print() is needed

Example 5: Viewing basic statistics with `describe()`

python

# Get statistics of numeric columns

print(df.describe())

Example 6: Viewing the index of the DataFrame

python

# Get the index of the DataFrame

print(df.index)

Example 7: Viewing the column names of the DataFrame

python

# Get column names

print(df.columns)

Example 8: Viewing data types of each column

python

# Get data types of each column

print(df.dtypes)

Example 9: Viewing specific elements by position using `iloc[]`

python

# Select the element in the first row and first column

print(df.iloc[0, 0])

Example 10: Viewing specific rows by label using `loc[]`

python

# Select the row where the index is 0

print(df.loc[0])

Assessments (15 Questions)

1. Using the `head()` function, display the first 4 rows of a DataFrame.

2. Use the `tail()` function to display the last 3 rows of a DataFrame.
3. How can you find out how many rows and columns are in a DataFrame?
4. Which function would you use to get a summary of data types and memory usage of a
DataFrame?
5. Get a statistical summary of all numeric columns in a DataFrame.
6. Display the names of all columns in a DataFrame.
7. How would you access the index of a DataFrame?
8. Write a function to get the data type of each column in a DataFrame.
9. How can you extract the element from the second row and second column using `iloc[]`?
10. Use the `loc[]` function to get the row where the index is `2`.
11. How can you view the first row of a DataFrame by position?
12. How would you select the last row of a DataFrame without using `tail()`?
13. Write a function to check if there are any missing values in the DataFrame.
14. How can you view the first 5 rows of only one column (e.g., 'Age')?
15. Get the maximum and minimum values for a numeric column using a DataFrame method.

3. Selecting Data

In this section, we'll explore different ways to select data from a DataFrame, whether it's selecting
columns, rows, or subsets based on certain conditions.

Examples

Example 1: Selecting a single column

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],

'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Select the 'Name' column

print(df['Name'])

Example 2: Selecting multiple columns

python

# Select the 'Name' and 'City' columns

print(df[['Name', 'City']])

Example 3: Selecting rows by position using `iloc[]`

python

# Select the first row

print(df.iloc[0])

# Select the first two rows

print(df.iloc[0:2])

Example 4: Selecting rows by label using `loc[]`

python

# Select the row with index 1

print(df.loc[1])

Example 5: Selecting rows where a column’s value meets a condition

python

# Select rows where 'Age' is greater than 28

print(df[df['Age'] > 28])

Example 6: Selecting rows where multiple conditions are met

python

# Select rows where 'Age' > 28 and 'City' is 'Chicago'

print(df[(df['Age'] > 28) & (df['City'] == 'Chicago')])

Example 7: Selecting rows based on string conditions

python

# Select rows where 'Name' starts with 'A'

print(df[df['Name'].str.startswith('A')])

Example 8: Using `isin()` to filter rows

python

# Select rows where 'City' is either 'New York' or 'Chicago'

print(df[df['City'].isin(['New York', 'Chicago'])])

Example 9: Selecting a subset of the DataFrame based on rows and columns

python

# Select the first two rows and 'Name' and 'Age' columns
print(df.loc[0:1, ['Name', 'Age']])

Example 10: Selecting specific rows and columns using `iloc[]`

python

# Select the element in the second row and third column

print(df.iloc[1, 2])

# Select the first two rows and first two columns

print(df.iloc[0:2, 0:2])
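Row conditions and column selection can also be combined in a single `loc[]` call, and a few helper methods cover common filters. The following is a minimal sketch on the same sample data; the result names (`older`, `in_range`, `not_c`) are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Boolean mask for the rows, list of labels for the columns
older = df.loc[df['Age'] > 28, ['Name', 'City']]

# between() is inclusive on both ends by default
in_range = df[df['Age'].between(25, 30)]

# ~ negates a boolean mask
not_c = df[~df['Name'].str.startswith('C')]

print(older)
print(in_range)
print(not_c)
```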

Assessments (15 Questions)

1. Select the column 'Age' from a DataFrame.

2. How would you select multiple columns, say 'Age' and 'City', from a DataFrame?
3. Write code to select the third row from a DataFrame using `iloc[]`.
4. How can you select the first three rows using `iloc[]`?
5. Write a function to select the row with index 2 using `loc[]`.
6. How can you select rows where the 'Age' column is less than 30?
7. How would you select rows where 'City' is 'New York' and 'Age' is greater than 30?
8. Write code to select rows where 'Name' ends with the letter 'e'.
9. Use the `isin()` method to filter rows where 'City' is 'Chicago' or 'Los Angeles'.
10. How would you select specific rows (first and second) and specific columns ('Name' and 'Age')?
11. Select the element in the third row and second column using `iloc[]`.
12. How would you select rows where the 'Age' column is between 25 and 35?
13. Write code to select rows where 'Name' contains the substring 'Bob'.
14. How can you select the first two rows and all columns from a DataFrame?
15. Write a function to filter rows where 'Age' is greater than 25 and 'Name' does not start with 'C'.

4. Handling Missing Data

Dealing with missing or null data is essential when working with real-world datasets. Pandas provides
various tools to handle missing values, such as detecting them, filling them, or removing them.

Examples

Example 1: Detecting missing values with `isna()`

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', None],

'Age': [25, None, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Detect missing values

print(df.isna())

Example 2: Detecting missing values with `isnull()`

python

# Alternative way to detect missing values

print(df.isnull())

Example 3: Counting missing values in each column

python

# Count the number of missing values in each column

print(df.isna().sum())

Example 4: Dropping rows with missing values

python

# Drop rows where any value is missing

df_dropped = df.dropna()
print(df_dropped)

Example 5: Dropping columns with missing values

python

# Drop columns where any value is missing

df_dropped_col = df.dropna(axis=1)
print(df_dropped_col)
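By default `dropna()` removes a row or column as soon as it contains one missing value, which can discard more data than intended. The sketch below shows three gentler options on a small hypothetical frame (the data is made up for illustration): `how='all'`, `subset`, and `thresh`.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', None, None],
    'Age': [25, None, 35, None],
})

# Drop a row only when every value in it is missing
all_missing_removed = df.dropna(how='all')

# Drop a row only when 'Name' is missing
name_required = df.dropna(subset=['Name'])

# Keep rows that have at least 2 non-missing values
at_least_two = df.dropna(thresh=2)

print(len(all_missing_removed), len(name_required), len(at_least_two))
```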

Example 6: Filling missing values with a constant

python

# Fill missing values with a specific value

df_filled = df.fillna(0)
print(df_filled)

Example 7: Filling missing values in specific columns

python

# Fill missing values in the 'Age' column with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

Example 8: Forward filling missing values

python

# Fill missing values using the previous non-null value (forward fill)
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
df_ffill = df.ffill()
print(df_ffill)

Example 9: Backward filling missing values

python

# Fill missing values using the next non-null value (backward fill)
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions
df_bfill = df.bfill()
print(df_bfill)

Example 10: Interpolating missing values

python

# Interpolate missing numeric values

df['Age'] = df['Age'].interpolate()
print(df)
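If the examples are run in order, Example 7 has already filled 'Age', so the call above has nothing left to interpolate. To see the effect in isolation, here is a small sketch on a hypothetical Series named `ages`: with the default linear method, each gap is filled along a straight line between its neighbouring values.

```python
import pandas as pd

# Hypothetical values with one single gap and one double gap
ages = pd.Series([20.0, None, 30.0, None, None, 60.0])

# The default method is 'linear'
filled = ages.interpolate()
print(filled.tolist())
```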

Assessments (15 Questions)

1. How would you detect missing values in a DataFrame using Pandas?
2. Write a function to count missing values in each column of a DataFrame.
3. How can you drop all rows that have at least one missing value?
4. How would you drop columns that contain missing values?
5. Write code to fill missing values with the constant `0`.
6. How can you replace missing values in the 'Age' column with the mean of that column?
7. Write code to forward-fill missing values in a DataFrame.
8. How can you backward-fill missing values in a DataFrame?
9. Use the `interpolate()` function to fill missing values in a numeric column.
10. Write code to fill missing values in a column with the median of that column.
11. How would you fill missing values in the 'City' column with the string 'Unknown'?
12. Write a function to drop rows where all values are missing.
13. How can you fill missing values in a DataFrame using the previous valid value from the same
row?
14. How would you replace missing values in a column with the mode (most frequent value) of that
column?
15. Write code to detect and count missing values in a specific column (e.g., 'Name').

5. Data Aggregation and Grouping

Data aggregation and grouping are fundamental for summarizing and analyzing data in a structured
way. In Pandas, you can use the `groupby()` function to group data and then apply aggregate
functions like `sum()`, `mean()`, etc.

Examples

Example 1: Grouping data by a single column

python

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
'Age': [25, 30, 35, 28, 32],
'Score': [85, 90, 88, 92, 79]
}
df = pd.DataFrame(data)

# Group by 'Name' and calculate the mean of 'Age' and 'Score'

grouped = df.groupby('Name').mean()
print(grouped)

Example 2: Grouping by multiple columns

python

# Group by 'Name' and 'Age', and calculate the sum of 'Score'

grouped_multi = df.groupby(['Name', 'Age']).sum()
print(grouped_multi)

Example 3: Using aggregation functions with `agg()`

python

# Group by 'Name' and aggregate 'Score' using multiple functions

agg_result = df.groupby('Name').agg({'Score': ['mean', 'max', 'min']})
print(agg_result)

Example 4: Counting the number of occurrences

python

# Count the number of occurrences of each name

count_result = df['Name'].value_counts()
print(count_result)

Example 5: Grouping and applying custom functions

python

# Define a custom function to calculate range

def age_range(x):
    return x.max() - x.min()

# Group by 'Name' and apply the custom age_range function

age_range_result = df.groupby('Name')['Age'].agg(age_range)
print(age_range_result)

Example 6: Resetting index after grouping

python

# Group by 'Name' and get mean, then reset index

reset_result = df.groupby('Name').mean().reset_index()
print(reset_result)

Example 7: Grouping and filtering groups

python

# Group by 'Name' and filter groups where the mean Score is greater than 85
filtered = df.groupby('Name').filter(lambda x: x['Score'].mean() > 85)
print(filtered)

Example 8: Grouping with multiple aggregations

python

# Group by 'Name' and calculate the sum and mean of 'Score'

agg_multi = df.groupby('Name').agg({'Score': ['sum', 'mean']})
print(agg_multi)
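The dictionary form of `agg()` produces two-level column names such as ('Score', 'sum'). Named aggregation is an alternative that gives flat, readable column names. A minimal sketch on the same data, where the output names `total_score`, `mean_score`, and `oldest` are assumptions chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 28, 32],
    'Score': [85, 90, 88, 92, 79],
})

# Each keyword is an output column: new_name=(source_column, function)
summary = df.groupby('Name').agg(
    total_score=('Score', 'sum'),
    mean_score=('Score', 'mean'),
    oldest=('Age', 'max'),
).reset_index()
print(summary)
```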

Example 9: Using `size()` to count the number of entries per group

python

# Count the number of entries per name

size_result = df.groupby('Name').size()
print(size_result)

Example 10: Pivot tables as an alternative to grouping

python

# Create a pivot table for Score based on Name and Age

pivot_table = df.pivot_table(values='Score', index='Name', columns='Age', aggfunc='mean')
print(pivot_table)

Assessments (15 Questions)

1. How would you group a DataFrame by a single column and calculate the mean of another
column?
2. Write code to group data by two columns and calculate the sum of a third column.
3. How can you apply multiple aggregation functions to a grouped DataFrame using `agg()`?
4. Write a function to count the number of occurrences of each unique value in a column.
5. How would you group by a column and apply a custom aggregation function?
6. Write code to reset the index of a grouped DataFrame.
7. How can you filter groups based on the mean of a column after grouping?
8. Write code to create a pivot table that summarizes data based on two columns.
9. How would you group a DataFrame and count the size of each group?
10. Write a function to find the maximum and minimum values of a numeric column after grouping
by a categorical column.
11. How can you group by multiple columns and calculate the average of one column?
12. Write code to create a new DataFrame from the aggregation results of a grouped DataFrame.
13. How would you create a DataFrame that only includes groups where the total count is greater
than 1?
14. Write code to summarize data by calculating the sum and mean for multiple columns in a
grouped DataFrame.
15. How can you create a summary table showing the average score for each name in the original
DataFrame?

6. Merging and Joining DataFrames

Merging and joining DataFrames are essential skills in data manipulation, allowing you to combine
datasets based on common keys or indices. Pandas provides various functions for these operations,
including `merge()`, `join()`, and `concat()`.

Examples

Example 1: Merging DataFrames using `merge()`

python

import pandas as pd

# Creating two DataFrames

df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [25, 30, 35]
})

# Merge the DataFrames on 'ID'

merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Example 2: Merging with different types of joins

python

# Left join
left_join = pd.merge(df1, df2, on='ID', how='left')
print(left_join)

# Right join
right_join = pd.merge(df1, df2, on='ID', how='right')
print(right_join)

# Outer join
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join)

Example 3: Joining DataFrames using `join()`

python

# Creating two DataFrames with indices

df3 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])

df4 = pd.DataFrame({
'City': ['New York', 'Los Angeles']
}, index=[1, 3])

# Join the DataFrames

joined_df = df3.join(df4, how='outer')
print(joined_df)

Example 4: Concatenating DataFrames using `concat()`

python

# Creating two DataFrames for concatenation

df5 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})

df6 = pd.DataFrame({
'ID': [3, 4],
'Name': ['Charlie', 'David']
})

# Concatenate the DataFrames

concatenated_df = pd.concat([df5, df6], ignore_index=True)
print(concatenated_df)

Example 5: Concatenating along columns

python

# Concatenate DataFrames along columns

df7 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})

df8 = pd.DataFrame({
'Age': [25, 30]
})

concatenated_columns = pd.concat([df7, df8], axis=1)

print(concatenated_columns)

Example 6: Merging on multiple keys

python

# Creating two DataFrames

df9 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'City': ['NY', 'LA', 'SF']
})

df10 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})

# Merge on both 'ID' and 'Name'

merged_multi_key = pd.merge(df9, df10, on=['ID', 'Name'])
print(merged_multi_key)

Example 7: Using suffixes to differentiate columns

python

# Merging with suffixes for overlapping column names

df11 = pd.DataFrame({
'ID': [1, 2],
'Value': [100, 200]
})

df12 = pd.DataFrame({
'ID': [1, 2],
'Value': [300, 400]
})

merged_suffixes = pd.merge(df11, df12, on='ID', suffixes=('_left', '_right'))

print(merged_suffixes)

Example 8: Joining with different indices

python

# Create DataFrames with different indices

df13 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])
df14 = pd.DataFrame({
'City': ['New York', 'Los Angeles']
}, index=[2, 3])

# Join the DataFrames with different indices

joined_diff_index = df13.join(df14, how='outer')
print(joined_diff_index)

Example 9: Using `concat()` with keys

python

# Using concat with keys to create a hierarchical index

df15 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})

df16 = pd.DataFrame({
'ID': [3, 4],
'Name': ['Charlie', 'David']
})

concatenated_keys = pd.concat([df15, df16], keys=['Group1', 'Group2'])

print(concatenated_keys)

Example 10: Merging with indicator

python

# Merge with indicator to see the source of each row

merged_with_indicator = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(merged_with_indicator)
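The `_merge` column added by `indicator=True` takes the values 'both', 'left_only', or 'right_only', which makes it a convenient filter for rows that failed to match. A short sketch using the same `df1` and `df2`; the names `only_in_df1` and `only_in_df2` are assumptions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})

merged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

# Rows whose ID exists in only one of the two inputs
only_in_df1 = merged[merged['_merge'] == 'left_only']
only_in_df2 = merged[merged['_merge'] == 'right_only']

print(only_in_df1)
print(only_in_df2)
```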

Assessments (15 Questions)

1. Write code to merge two DataFrames on a single key.

2. How would you perform a left join between two DataFrames?
3. Write code to join two DataFrames using the `join()` method.
4. How can you concatenate two DataFrames vertically?
5. Write code to concatenate two DataFrames horizontally.
6. How can you merge two DataFrames on multiple keys?
7. Write a function to merge two DataFrames and include suffixes for overlapping column names.
8. How would you join two DataFrames with different indices?
9. Write code to use `concat()` with keys to create a hierarchical index.
10. How can you merge two DataFrames and see the source of each row?
11. Write code to perform an outer join on two DataFrames.
12. How would you concatenate a list of DataFrames into a single DataFrame?
13. Write a function to merge two DataFrames and drop any duplicates that arise.
14. How can you merge two DataFrames without any duplicate rows?
15. Write code to merge two DataFrames and fill missing values in the resulting DataFrame.

7. Data Visualization with Pandas

Data visualization is an essential part of data analysis, allowing you to see trends, patterns, and
insights that may not be immediately obvious from raw data. Pandas integrates well with Matplotlib
and Seaborn for plotting.

Examples

Example 1: Basic line plot

python

import pandas as pd
import matplotlib.pyplot as plt

# Create a simple DataFrame

data = {'Year': [2018, 2019, 2020, 2021, 2022],
'Sales': [150, 200, 250, 300, 350]}
df = pd.DataFrame(data)

# Basic line plot

df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 2: Bar plot

python

# Bar plot
df.plot(x='Year', y='Sales', kind='bar', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 3: Histogram

python

# Create a DataFrame with sample score data

data2 = {'Scores': [82, 85, 87, 90, 95, 70, 75, 80, 88, 92]}
df2 = pd.DataFrame(data2)

# Histogram of Scores
df2['Scores'].plot(kind='hist', bins=5, title='Score Distribution')
plt.xlabel('Scores')
plt.show()

Example 4: Scatter plot

python

# Create a DataFrame for scatter plot

data3 = {
'Height': [5.1, 5.5, 5.8, 6.0, 5.7],
'Weight': [100, 150, 130, 165, 155]
}
df3 = pd.DataFrame(data3)

# Scatter plot
df3.plot(kind='scatter', x='Height', y='Weight', title='Height vs Weight')
plt.xlabel('Height (in feet)')
plt.ylabel('Weight (in lbs)')
plt.show()

Example 5: Box plot

python

# Box plot
df2['Scores'].plot(kind='box', title='Box Plot of Scores')
plt.ylabel('Scores')
plt.show()

Example 6: Area plot

python

# Area plot
df.plot(x='Year', y='Sales', kind='area', alpha=0.4, title='Sales Area Plot')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 7: Pie chart

python

# Create a DataFrame for pie chart

data4 = {'Fruits': ['Apple', 'Banana', 'Cherry'],
'Quantity': [30, 25, 45]}
df4 = pd.DataFrame(data4)

# Pie chart
df4.plot.pie(y='Quantity', labels=df4['Fruits'], autopct='%1.1f%%', title='Fruit Distribution', legend=False)
plt.show()

Example 8: Customizing plots

python

# Line plot with customization

ax = df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', color='orange')
ax.set_xlabel('Year')
ax.set_ylabel('Sales')
ax.grid()
plt.show()

Example 9: Using Seaborn for enhanced visualization

python

import seaborn as sns

# Scatter plot using Seaborn

sns.scatterplot(data=df3, x='Height', y='Weight', hue='Weight', palette='deep',
size='Weight', sizes=(20, 200))
plt.title('Height vs Weight with Seaborn')
plt.show()

Example 10: Pairplot with Seaborn

python

# Creating a DataFrame with more data for pairplot

data5 = {
'Height': [5.1, 5.5, 5.8, 6.0, 5.7],
'Weight': [100, 150, 130, 165, 155],
'Age': [20, 22, 23, 25, 30]
}
df5 = pd.DataFrame(data5)

# Pairplot using Seaborn

sns.pairplot(df5)
plt.show()

Assessments (15 Questions)

1. Write code to create a line plot for sales data over years.
2. How would you create a bar plot for the same sales data?
3. Write code to plot a histogram of randomly generated scores.
4. How can you create a scatter plot to visualize the relationship between height and weight?
5. Write a function to create a box plot for a set of scores.
6. How would you create an area plot for sales data?
7. Write code to create a pie chart showing the distribution of fruit quantities.
8. How can you customize a line plot with specific colors and gridlines?
9. Write code to create a scatter plot using Seaborn, including a color palette.
10. How can you create a pairplot to visualize relationships in a DataFrame with multiple numeric
columns?
11. Write a function to plot a histogram with a specified number of bins.
12. How would you create a customized pie chart with percentages displayed?
13. Write code to create a scatter plot with varying point sizes based on another column.
14. How can you add labels to axes and a title to a plot in Matplotlib?
15. Write a function to visualize the distribution of a DataFrame column using a box plot.

8. Time Series Analysis with Pandas

Time series analysis involves analyzing data points collected or recorded at specific time intervals.
Pandas provides powerful tools for working with time series data, making it easy to manipulate,
visualize, and analyze trends over time.

Examples

Example 1: Creating a Time Series

python

import pandas as pd

# Create a date range

date_range = pd.date_range(start='2020-01-01', periods=5, freq='D')

# Create a DataFrame with a time series

data = {'Value': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data, index=date_range)
print(df)

Example 2: Plotting a Time Series

python

import matplotlib.pyplot as plt

# Plot the time series

df.plot(title='Simple Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

Example 3: Resampling Time Series Data

python

# Create a time series with sample data

date_range = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = {'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]}
df = pd.DataFrame(data, index=date_range)

# Resample to weekly frequency and calculate the mean

weekly_mean = df.resample('W').mean()
print(weekly_mean)
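It can help to know how the weekly bins are labelled. With `'W'`, pandas groups the rows into weeks ending on Sunday and labels each bin with that Sunday, so the ten days here fall into two bins. A self-contained sketch of the same computation:

```python
import pandas as pd

dates = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
df = pd.DataFrame({'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]}, index=dates)

# 2020-01-01 is a Wednesday, so the first bin covers Jan 1-5 and the second Jan 6-10
weekly = df.resample('W').mean()
print(weekly)
```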

Example 4: Shifting Time Series Data

python

# Shift the time series data by 1 day

df_shifted = df.shift(1)
print(df_shifted)
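A common use of `shift()` is comparing each value with the one before it. The sketch below computes the day-over-day change on the data from Example 1; the column names 'Previous' and 'Change' are assumptions for illustration.

```python
import pandas as pd

dates = pd.date_range(start='2020-01-01', periods=5, freq='D')
df = pd.DataFrame({'Value': [10, 20, 15, 25, 30]}, index=dates)

# shift(1) moves values down one row, so the first row has no predecessor (NaN)
df['Previous'] = df['Value'].shift(1)

# Difference between each day and the day before; df['Value'].diff() gives the same numbers
df['Change'] = df['Value'] - df['Previous']
print(df)
```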

Example 5: Rolling Window Calculations

python

# Calculate the rolling mean with a window of 3 days

rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)
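With `window=3`, the first two results are NaN because a full window is not yet available. If partial windows are acceptable, `min_periods` relaxes that requirement. A minimal sketch on a hypothetical Series named `values`:

```python
import pandas as pd

values = pd.Series([10, 15, 20, 25, 30])

# Requires 3 observations, so the first two results are NaN
strict = values.rolling(window=3).mean()

# Accepts windows with at least 1 observation
lenient = values.rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(lenient.tolist())
```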

Example 6: Time Series Indexing

python

# Select data for a specific date

specific_date = df.loc['2020-01-03']
print(specific_date)

Example 7: Handling Missing Values in a Time Series

python

# Create a time series with missing values

date_range = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = {'Value': [10, None, 20, 25, None, 30, 35, None, 50, 55]}
df_missing = pd.DataFrame(data, index=date_range)

# Fill missing values using forward fill
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
df_filled = df_missing.ffill()
print(df_filled)

Example 8: Time Series Decomposition

python

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series

result = seasonal_decompose(df['Value'], model='additive', period=2)
result.plot()
plt.show()

Example 9: Date Range Generation

python

# Create a date range with business days

business_days = pd.date_range(start='2020-01-01', end='2020-01-10', freq='B')
print(business_days)

Example 10: Converting a Column to DateTime

python

# Create a DataFrame with dates as strings

data2 = {'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
'Value': [10, 20, 30]}
df2 = pd.DataFrame(data2)

# Convert the 'Date' column to datetime

df2['Date'] = pd.to_datetime(df2['Date'])
print(df2)
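Once a column has a datetime type, the `.dt` accessor exposes its components. The sketch below pulls out the year, month, and weekday name; the sample dates and the new column names are assumptions for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({'Date': ['2020-01-01', '2020-02-15', '2021-03-03'],
                    'Value': [10, 20, 30]})
df2['Date'] = pd.to_datetime(df2['Date'])

# .dt works only on datetime columns, not on the original strings
df2['Year'] = df2['Date'].dt.year
df2['Month'] = df2['Date'].dt.month
df2['Weekday'] = df2['Date'].dt.day_name()
print(df2)
```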

Assessments (15 Questions)

1. How would you create a time series DataFrame with daily frequency?
2. Write code to plot a time series graph.
3. How can you resample time series data to a weekly frequency and calculate the mean?
4. Write a function to shift a time series by a specified number of periods.
5. How would you calculate the rolling mean over a time series with a window of 3 days?
6. Write code to select data for a specific date in a time series DataFrame.
7. How can you handle missing values in a time series using forward fill?
8. Write code to decompose a time series into its trend, seasonal, and residual components.
9. How would you generate a date range with business days only?
10. Write code to convert a column of strings into datetime format in a DataFrame.
11. How can you plot a time series with different colors for different time periods?
12. Write a function to calculate the cumulative sum of a time series.
13. How would you extract the month and year from a date column in a time series DataFrame?
14. Write code to resample a time series to a monthly frequency and calculate the sum.
15. How can you check for missing dates in a time series?

9. Data Cleaning and Preparation with Pandas

Data cleaning and preparation are essential steps in the data analysis process, ensuring that the data
is accurate, complete, and ready for analysis. Pandas offers various functions and methods for
cleaning, transforming, and preparing data.

Examples

Example 1: Removing Duplicates

python

import pandas as pd

# Create a DataFrame with duplicate rows

data = {
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
}
df = pd.DataFrame(data)

# Remove duplicate rows

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

Example 2: Filling Missing Values

python

# Create a DataFrame with missing values

data2 = {
'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 35, 40]
}
df2 = pd.DataFrame(data2)

# Fill missing values with a specified value

df_filled = df2.fillna({'Name': 'Unknown', 'Age': df2['Age'].mean()})
print(df_filled)

Example 3: Dropping Missing Values

python

# Drop rows with missing values

df_dropped = df2.dropna()
print(df_dropped)
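
Before filling or dropping, it usually helps to see how many values are missing and where. The sketch below counts missing values per column and then drops rows only when a chosen column is missing. The DataFrame `df_demo` is a stand-in with the same shape as the one used in the examples above.

```python
import pandas as pd

# Stand-in DataFrame with one missing name and one missing age
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, None, 35, 40]
})

# Count missing values in each column
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# Drop a row only when 'Age' is missing; a missing 'Name' is kept
df_age_known = df_demo.dropna(subset=['Age'])
print(df_age_known)
```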

Example 4: Renaming Columns

python

# Rename columns
df.rename(columns={'Name': 'Full Name', 'Age': 'Age in Years'}, inplace=True)
print(df)

Example 5: Changing Data Types

python

# Create a DataFrame with mixed data types

data3 = {
'ID': ['1', '2', '3'],
'Age': ['25', '30', '35']
}
df3 = pd.DataFrame(data3)

# Convert data types

df3['ID'] = df3['ID'].astype(int)
df3['Age'] = df3['Age'].astype(int)
print(df3.dtypes)
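
`astype(int)` raises an error as soon as one string cannot be parsed as a number. For messy input, `pd.to_numeric` with `errors='coerce'` is a common alternative that turns unparseable entries into NaN. A minimal sketch, with a made-up `raw` DataFrame:

```python
import pandas as pd

# Hypothetical column where one entry is not a valid number
raw = pd.DataFrame({'Age': ['25', 'thirty', '35']})

# Unparseable strings become NaN instead of raising an error
raw['Age'] = pd.to_numeric(raw['Age'], errors='coerce')
print(raw)
print(raw.dtypes)
```

Because NaN is a floating-point value, the column ends up as float rather than int.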

Example 6: String Manipulation

python

# Create a DataFrame with strings

data4 = {
'Names': [' Alice ', 'Bob ', ' Charlie ']
}
df4 = pd.DataFrame(data4)

# Strip whitespace from strings

df4['Names'] = df4['Names'].str.strip()
print(df4)

Example 7: Changing Indexes

python

# Create a DataFrame
data5 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df5 = pd.DataFrame(data5)

# Set 'Name' as the index

df5.set_index('Name', inplace=True)
print(df5)

Example 8: Filtering Data

python

# Filter data based on a condition

filtered_df = df5[df5['Age'] > 30]
print(filtered_df)

Example 9: Categorizing Data

python

# Create a DataFrame with categorical data

data6 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 70, 95]
}
df6 = pd.DataFrame(data6)

# Convert 'Score' to categorical

df6['Grade'] = pd.cut(df6['Score'], bins=[0, 60, 80, 100], labels=['F', 'C', 'A'])
print(df6)

Example 10: Merging and Concatenating DataFrames

python

# Create two DataFrames

df7 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df8 = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})

# Merge the DataFrames

merged_df = pd.merge(df7, df8, on='ID', how='outer')
print(merged_df)
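
Because the merge above joins on `ID` only, the shared `Name` column appears twice in the result, as `Name_x` and `Name_y`. If the goal is simply to stack the two tables on top of each other, concatenation may be a better fit. The sketch below uses the same data under the hypothetical names `left` and `right`.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
right = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})

# Stack the rows vertically and renumber the index
stacked = pd.concat([left, right], ignore_index=True)
print(stacked)

# Remove the row that appears in both tables
deduplicated = stacked.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```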

Assessments (15 Questions)

1. Write code to remove duplicate rows from a DataFrame.
2. How can you fill missing values in a DataFrame with a specific value?
3. Write code to drop rows with missing values from a DataFrame.
4. How would you rename columns in a DataFrame?
5. Write code to change the data types of specific columns in a DataFrame.
6. How can you strip whitespace from string entries in a DataFrame column?
7. Write code to change the index of a DataFrame to a specific column.
8. How would you filter a DataFrame to show rows that meet a specific condition?
9. Write code to categorize a numerical column into bins.
10. How can you concatenate two DataFrames vertically?
11. Write a function to check for missing values in a DataFrame.
12. How would you replace specific values in a DataFrame with new values?
13. Write code to sort a DataFrame by a specific column.
14. How can you check the data types of all columns in a DataFrame?
15. Write code to extract a specific column from a DataFrame as a Series.

10. Advanced Data Manipulation with Pandas

Advanced data manipulation techniques are essential for more complex data analysis tasks. This topic
covers advanced features in Pandas that allow you to transform, aggregate, and reshape your data.

Examples

Example 1: Pivot Tables

python

import pandas as pd

# Create a DataFrame for pivot table

data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Sales': [100, 200, 150, 250]
}
df = pd.DataFrame(data)

# Create a pivot table

pivot_table = df.pivot_table(values='Sales', index='Date', columns='Category',
aggfunc='sum')
print(pivot_table)

Example 2: Group By

python

# Group by Category and calculate the total sales

grouped = df.groupby('Category')['Sales'].sum()
print(grouped)

Example 3: Applying Functions

python

# Apply a custom function to calculate the square of sales

df['Sales Squared'] = df['Sales'].apply(lambda x: x ** 2)
print(df)

Example 4: Chaining Methods

python

# Chaining methods for data manipulation

result = (df[df['Sales'] > 100]
.groupby('Category')['Sales']
.sum()
.reset_index())
print(result)

Example 5: Merging with Different Join Types

python

# Create two DataFrames for merging

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 85, 80]})

# Merge DataFrames with inner join

inner_merge = pd.merge(df1, df2, on='ID', how='inner')
print(inner_merge)
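
The example above shows only the inner join. As a rough comparison, the sketch below runs a left join and an outer join on the same data (renamed here to `names` and `scores`, which are illustrative names). The `indicator=True` option adds a `_merge` column showing which table each row came from.

```python
import pandas as pd

names = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
scores = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 85, 80]})

# Left join: keep every row of 'names'; Score is NaN where no match exists
left_merge = pd.merge(names, scores, on='ID', how='left')
print(left_merge)

# Outer join: keep rows from both tables and record where each came from
outer_merge = pd.merge(names, scores, on='ID', how='outer', indicator=True)
print(outer_merge)
```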

Example 6: Concatenating DataFrames

python

# Create two DataFrames

df3 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df4 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Concatenate DataFrames
concatenated = pd.concat([df3, df4], ignore_index=True)
print(concatenated)

Example 7: Reshaping Data with Melt

python

# Create a DataFrame to reshape

data2 = {
'ID': [1, 2],
'Math': [90, 80],
'Science': [85, 75]
}
df2 = pd.DataFrame(data2)

# Reshape with melt

melted = df2.melt(id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject',
value_name='Score')
print(melted)

Example 8: Using `stack` and `unstack`

python

# Create a multi-index DataFrame

data3 = {
'Category': ['A', 'A', 'B', 'B'],
'Type': ['X', 'Y', 'X', 'Y'],
'Values': [10, 20, 15, 25]
}
df3 = pd.DataFrame(data3).set_index(['Category', 'Type'])

# Use stack
stacked = df3.stack()
print(stacked)

# Use unstack
unstacked = df3.unstack()
print(unstacked)

Example 9: Using `pivot` for Reshaping Data

python

# Create a DataFrame for pivot

data4 = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
'Product': ['A', 'B', 'A'],
'Sales': [100, 200, 150]
}
df4 = pd.DataFrame(data4)

# Create a pivot table

pivot_result = df4.pivot(index='Date', columns='Product', values='Sales')
print(pivot_result)
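
`pivot` only reshapes, so it fails when the same index and column pair occurs more than once. `pivot_table` aggregates such duplicates instead. The sketch below shows the difference with a made-up `sales` DataFrame in which product A is sold twice on the same date.

```python
import pandas as pd

# Hypothetical data: product 'A' appears twice on 2021-01-01
sales = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
    'Product': ['A', 'A', 'A'],
    'Sales': [100, 200, 150]
})

# pivot cannot decide which of the duplicate values to keep
try:
    sales.pivot(index='Date', columns='Product', values='Sales')
except ValueError as exc:
    print('pivot failed:', exc)

# pivot_table combines the duplicates with the chosen aggregation
summed = sales.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
print(summed)
```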

Example 10: Time Series Aggregation

python

# Create a time series DataFrame

date_range = pd.date_range(start='2021-01-01', periods=5, freq='D')
data5 = {'Sales': [10, 20, 15, 25, 30]}
df5 = pd.DataFrame(data5, index=date_range)

# Resample to weekly frequency and sum sales

weekly_sales = df5.resample('W').sum()
print(weekly_sales)

Assessments (15 Questions)

1. Write code to create a pivot table summarizing sales by date and category.
2. How can you group a DataFrame by a specific column and calculate the sum?
3. Write a function to apply a custom operation on a DataFrame column.
4. How can you chain multiple methods to filter and group data in one statement?
5. Write code to perform an outer join on two DataFrames based on a common column.
6. How would you concatenate two DataFrames and reset the index?
7. Write code to melt a DataFrame from wide to long format.
8. How can you use `stack` and `unstack` to reshape a multi-index DataFrame?
9. Write code to create a pivot table from a DataFrame with date and product data.
10. How would you aggregate time series data to a monthly frequency and calculate the mean?
11. Write a function to merge two DataFrames with a left join.
12. How can you extract unique values from a specific column in a DataFrame?
13. Write code to sort a DataFrame based on multiple columns.
14. How can you calculate cumulative sums for a DataFrame column?
15. Write code to filter a DataFrame based on a condition from multiple columns.

11. Data Visualization with Pandas

Data visualization is a crucial aspect of data analysis that helps in understanding patterns, trends, and
insights. Pandas provides built-in plotting methods, which are built on Matplotlib, for creating basic
visualizations. Libraries such as Seaborn can be used alongside Pandas for more advanced visualizations.

Examples

Example 1: Basic Line Plot

python

import pandas as pd
import matplotlib.pyplot as plt

# Create a simple DataFrame

data = {
'Year': [2020, 2021, 2022],
'Sales': [100, 150, 200]
}
df = pd.DataFrame(data)

# Plot a line graph

df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', marker='o')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 2: Bar Plot

python

# Create a bar plot

df.plot(x='Year', y='Sales', kind='bar', title='Sales by Year')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 3: Histogram

python

# Create a DataFrame with random data

data2 = {
'Scores': [88, 92, 85, 95, 70, 76, 89, 90, 83, 94]
}
df2 = pd.DataFrame(data2)

# Plot a histogram
df2['Scores'].plot(kind='hist', bins=5, title='Score Distribution')
plt.xlabel('Scores')
plt.show()

Example 4: Scatter Plot

python

# Create another DataFrame

data3 = {
'Height': [150, 160, 170, 180, 190],
'Weight': [50, 60, 70, 80, 90]
}
df3 = pd.DataFrame(data3)

# Create a scatter plot

df3.plot(kind='scatter', x='Height', y='Weight', title='Height vs. Weight')
plt.show()

Example 5: Box Plot

python

# Create a DataFrame with some data

data4 = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Values': [10, 15, 12, 18, 22, 25]
}
df4 = pd.DataFrame(data4)

# Create a box plot

df4.boxplot(column='Values', by='Category')
plt.title('Box Plot of Values by Category')
plt.suptitle('')
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()

Example 6: Area Plot

python

# Create a DataFrame with cumulative data

data5 = {
'Year': [2020, 2021, 2022],
'Sales': [100, 150, 200],
'Expenses': [80, 120, 160]
}
df5 = pd.DataFrame(data5)

# Plot an area plot

df5.plot(x='Year', kind='area', stacked=True, title='Sales and Expenses Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.show()

Example 7: Pie Chart

python

# Create a pie chart for category distribution

data6 = {
'Category': ['A', 'B', 'C'],
'Values': [30, 50, 20]
}
df6 = pd.DataFrame(data6)

# Plot a pie chart

df6.set_index('Category').plot(kind='pie', y='Values', autopct='%1.1f%%',
    title='Category Distribution')
plt.ylabel('')
plt.show()

Example 8: Heatmap with Seaborn

python

import seaborn as sns

# Create a correlation matrix

data7 = {
'A': [1, 2, 3, 4],
'B': [4, 5, 6, 7],
'C': [7, 8, 9, 10]
}
df7 = pd.DataFrame(data7)

# Create a heatmap
# sns.heatmap has no 'title' argument; the title is set with plt.title below
sns.heatmap(df7.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()

Example 9: Customizing Plots

python

# Create a line plot with customization

df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', marker='o',
color='red', linestyle='--')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid()
plt.show()

Example 10: Subplots

python

# Create multiple subplots

fig, axs = plt.subplots(2, 1, figsize=(8, 8))

df.plot(x='Year', y='Sales', kind='line', ax=axs[0], title='Sales Over Years', marker='o')

axs[0].set_ylabel('Sales')

df2['Scores'].plot(kind='hist', bins=5, ax=axs[1], title='Score Distribution')

axs[1].set_xlabel('Scores')

plt.tight_layout()
plt.show()

Assessments (15 Questions)

1. Write code to create a line plot for sales data over several years.
2. How can you create a bar plot from a DataFrame?
3. Write code to generate a histogram of a numerical column.
4. How can you create a scatter plot to analyze the relationship between two variables?
5. Write code to create a box plot for data categorized by another variable.
6. How can you create an area plot to visualize cumulative data over time?
7. Write code to generate a pie chart from categorical data.
8. How can you create a heatmap to visualize correlations between multiple variables?
9. Write code to customize a plot with colors, markers, and line styles.
10. How can you create multiple subplots in a single figure?
11. Write a function to save a plot as a PNG file.
12. How would you plot a rolling average on a time series data?
13. Write code to change the title and labels of a plot after creating it.
14. How can you adjust the size of a plot in Matplotlib?
15. Write code to display a legend in a plot.

12. Working with Time Zones in Pandas

Time zone handling is an essential part of time series analysis, especially when dealing with data from
multiple time zones. Pandas provides robust tools for converting, localizing, and performing
operations on time zone-aware datetime objects.

Examples

Example 1: Localizing a Time Series

python

import pandas as pd

# Create a time series

date_range = pd.date_range(start='2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 15, 20, 25, 30], index=date_range)

# Localize to UTC
ts_utc = ts.tz_localize('UTC')
print(ts_utc)

Example 2: Converting Time Zones

python

# Convert time series to a different time zone

ts_new_york = ts_utc.tz_convert('America/New_York')
print(ts_new_york)

Example 3: Creating Time Zone Aware Timestamps

python

# Create a timestamp with a specific time zone

timestamp = pd.Timestamp('2023-01-01 12:00:00', tz='Europe/London')
print(timestamp)

Example 4: Handling Daylight Saving Time (DST)

python

# Create a time series that includes a DST change

date_range = pd.date_range(start='2023-03-22', periods=10, freq='D')  # spans the UK clock change on 2023-03-26
ts_dst = pd.Series(range(10), index=date_range).tz_localize('Europe/London')
print(ts_dst)

Example 5: Using `pd.to_datetime` with Time Zones

python

# Create a time series from a string with a time zone

ts_from_string = pd.to_datetime('2023-01-01 12:00:00 UTC')
print(ts_from_string)

Example 6: Operations with Time Zone Aware Datetime

python

# Create two time series with different time zones

ts_utc1 = pd.Series([1, 2, 3], index=pd.date_range('2023-01-01', periods=3,
freq='H')).tz_localize('UTC')
ts_utc2 = pd.Series([4, 5, 6], index=pd.date_range('2023-01-01', periods=3,
freq='H')).tz_localize('America/New_York')

# Convert to UTC for operations

ts_utc2_converted = ts_utc2.tz_convert('UTC')
result = ts_utc1 + ts_utc2_converted
print(result)
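
One thing to watch in Example 6: pandas aligns series by the actual instant, not by the wall-clock time. Midnight in New York is 05:00 UTC, so the two series share no timestamps and the sum contains only NaN. The sketch below reproduces this with explicit timestamps (the names `a`, `b`, and `gap` are illustrative) and also shows that subtracting two time zone-aware timestamps gives the real elapsed time.

```python
import pandas as pd

# Same wall-clock times, localized to two different time zones
idx = pd.to_datetime(['2023-01-01 00:00', '2023-01-01 01:00', '2023-01-01 02:00'])
a = pd.Series([1, 2, 3], index=idx.tz_localize('UTC'))
b = pd.Series([4, 5, 6], index=idx.tz_localize('America/New_York'))

# After conversion, b sits at 05:00-07:00 UTC, so no index labels match
total = a + b.tz_convert('UTC')
print(total)

# Difference between two aware timestamps, measured in real elapsed time
gap = (pd.Timestamp('2023-01-01 12:00', tz='Europe/London')
       - pd.Timestamp('2023-01-01 12:00', tz='America/New_York'))
print(gap)
```

If the intent is to add values that share the same local clock time, both series would need to be put on a common index first.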

Example 7: Plotting Time Series with Time Zones

python

import matplotlib.pyplot as plt

# Plotting time series with time zones

ts_utc.plot(title='Time Series in UTC', marker='o')
plt.xlabel('Date')
plt.ylabel('Values')
plt.show()

Example 8: Resampling with Time Zones

python

# Resample time series with time zone

ts_resampled = ts_dst.resample('2D').mean()
print(ts_resampled)

Example 9: Converting Timestamps to Local Time

python

# Convert a timestamp to local time

local_time = timestamp.tz_convert('America/New_York')
print(local_time)

Example 10: Handling Multiple Time Zones

python

# Create a DataFrame with different time zones

data = {
'UTC': pd.date_range('2023-01-01', periods=3, freq='H').tz_localize('UTC'),
'New_York': pd.date_range('2023-01-01', periods=3,
freq='H').tz_localize('America/New_York')
}
df_timezones = pd.DataFrame(data)
print(df_timezones)

Assessments (15 Questions)

1. Write code to localize a naive time series to UTC.
2. How can you convert a time series from UTC to a different time zone?
3. Write code to create a timestamp that is aware of a specific time zone.
4. How would you handle daylight saving time when creating a time series?
5. Write code to create a time series from a string that includes a time zone.
6. How can you perform arithmetic operations on time zone-aware time series?
7. Write code to plot a time series that includes time zone information.
8. How would you resample a time series that includes time zone data?
9. Write code to convert a timestamp from one time zone to another.
10. How can you create a DataFrame that includes timestamps in multiple time zones?
11. Write a function to check if a timestamp is in a specific time zone.
12. How would you extract the timezone information from a timestamp?
13. Write code to calculate the difference between two time zone-aware timestamps.
14. How can you handle missing values in a time series that includes time zones?
15. Write code to visualize a time series in different time zones on the same plot.

13. Data Aggregation and Grouping with Pandas

Data aggregation and grouping allow you to summarize and analyze large datasets efficiently. Pandas
provides powerful methods for grouping data, applying functions to groups, and aggregating results.

Examples

Example 1: Basic GroupBy Operation

python

import pandas as pd

# Create a DataFrame
data = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Values': [10, 15, 10, 25, 30, 35]
}
df = pd.DataFrame(data)

# Group by 'Category' and calculate the sum

grouped_sum = df.groupby('Category').sum()
print(grouped_sum)

Example 2: Grouping with Multiple Aggregations

python

# Group by 'Category' and apply multiple aggregations

grouped_agg = df.groupby('Category').agg({'Values': ['sum', 'mean', 'max']})
print(grouped_agg)
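
Passing a dictionary of lists to `agg`, as above, produces two-level column names such as `('Values', 'sum')`. If flat column names are preferred, named aggregation is one option. The sketch below uses a copy of the same data under the hypothetical name `df_demo`; the output column names `total`, `average`, and `largest` are arbitrary.

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Values': [10, 15, 10, 25, 30, 35]
})

# Named aggregation: new_column=(source_column, function)
summary = df_demo.groupby('Category').agg(
    total=('Values', 'sum'),
    average=('Values', 'mean'),
    largest=('Values', 'max')
)
print(summary)

# Turn the group labels back into an ordinary column
flat = summary.reset_index()
print(flat)
```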

Example 3: Grouping with a Custom Function

python

# Define a custom function

def custom_function(x):
    return x.max() - x.min()

# Group by 'Category' and apply the custom function

grouped_custom = df.groupby('Category')['Values'].agg(custom_function)
print(grouped_custom)

Example 4: Grouping with DateTime Index

python

# Create a DataFrame with a DateTime index

date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df_time = pd.DataFrame({'Values': range(10)}, index=date_rng)

# Group by day and calculate the sum

daily_sum = df_time.resample('D').sum()
print(daily_sum)

Example 5: Grouping with `transform`

python

# Using transform to get the mean for each group

df['Mean'] = df.groupby('Category')['Values'].transform('mean')
print(df)

Example 6: Grouping with Filtering

python

# Filter groups based on a condition

filtered_groups = df.groupby('Category').filter(lambda x: x['Values'].sum() > 40)
print(filtered_groups)

Example 7: Grouping by Multiple Columns

python

# Create a DataFrame with multiple columns

data2 = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
'Values': [10, 15, 10, 25, 30, 35]
}
df_multi = pd.DataFrame(data2)

# Group by 'Category' and 'Subcategory'

grouped_multi = df_multi.groupby(['Category', 'Subcategory']).sum()
print(grouped_multi)

Example 8: Pivot Table

python

# Create a pivot table from the DataFrame

pivot_table = df_multi.pivot_table(values='Values', index='Category',
columns='Subcategory', aggfunc='sum', fill_value=0)
print(pivot_table)

Example 9: GroupBy with Sort

python

# Group by 'Category' and sort by the sum of values

sorted_groups = df.groupby('Category')['Values'].sum().sort_values(ascending=False)
print(sorted_groups)

Example 10: Combining Aggregation and Transformation

python

# Combine aggregation and transformation

df['Normalized'] = df.groupby('Category')['Values'].transform(lambda x: (x - x.mean()) /
x.std())
print(df)

Assessments (15 Questions)

1. Write code to group a DataFrame by a single column and calculate the sum of another column.
2. How can you apply multiple aggregation functions to a grouped DataFrame?
3. Write a custom aggregation function and apply it to a group.
4. How would you group a DataFrame with a DateTime index and calculate the daily sum?
5. Write code to use the `transform` function after grouping.
6. How can you filter groups based on a specific condition?
7. Write code to group a DataFrame by multiple columns and sum the values.
8. How can you create a pivot table from a DataFrame?
9. Write code to sort groups based on the sum of values in descending order.
10. How would you normalize values within each group using `transform`?
11. Write a function to calculate the range of values in each group.
12. How can you reset the index of a grouped DataFrame?
13. Write code to create a grouped DataFrame and visualize it with a bar plot.
14. How can you group by a time series and calculate the weekly average?
15. Write code to group data and export the summary to a CSV file.

14. Saving DataFrames

Pandas provides various methods to save DataFrames to different file formats. This is essential for
preserving data analysis results, sharing with others, or loading them in other applications.

Examples

Example 1: Saving to CSV

python

import pandas as pd

# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Save DataFrame to a CSV file

df.to_csv('people.csv', index=False)

Example 2: Saving to Excel

python

# Save DataFrame to an Excel file

df.to_excel('people.xlsx', index=False)

Example 3: Saving to JSON

python

# Save DataFrame to a JSON file

df.to_json('people.json', orient='records', lines=True)
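
With `orient='records'` and `lines=True`, each row is written as one JSON object on its own line. Reading such data back needs `lines=True` as well. The sketch below does a round trip in memory instead of writing a file; the `people` DataFrame is a made-up example.

```python
import io
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Without a file path, to_json returns the text instead of writing it
json_lines = people.to_json(orient='records', lines=True)
print(json_lines)

# Read the line-delimited JSON back into a DataFrame
restored = pd.read_json(io.StringIO(json_lines), lines=True)
print(restored)
```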

Example 4: Saving to Parquet

python

# Save DataFrame to a Parquet file

df.to_parquet('people.parquet')

Example 5: Saving to HDF5

python

# Save DataFrame to an HDF5 file

df.to_hdf('people.h5', key='people', mode='w')

Example 6: Saving with Compression

python

# Save DataFrame to a compressed CSV file

df.to_csv('people_compressed.csv.gz', index=False, compression='gzip')

Example 7: Saving with a Custom Separator

python

# Save DataFrame to a CSV file with a custom separator

df.to_csv('people_semicolon.csv', sep=';', index=False)

Example 8: Saving Selected Columns

python

# Save only selected columns to a CSV file

df[['Name', 'City']].to_csv('people_selected.csv', index=False)

Example 9: Appending to an Existing CSV

python

# Append DataFrame to an existing CSV file

new_data = {
'Name': ['David', 'Eva'],
'Age': [40, 45],
'City': ['Houston', 'Phoenix']
}
df_new = pd.DataFrame(new_data)

df_new.to_csv('people.csv', mode='a', header=False, index=False)
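
Appending with `header=False` assumes the file already exists and already has a header row. If that is not guaranteed, one possible approach is to write the header only when the file is new. The helper `append_to_csv` below is a hypothetical function written for this sketch, and it uses a temporary directory so nothing is left on disk.

```python
import os
import tempfile
import pandas as pd

def append_to_csv(frame, path):
    """Append frame to path, writing the header only when the file is new."""
    file_exists = os.path.exists(path)
    frame.to_csv(path, mode='a', header=not file_exists, index=False)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'people.csv')

    # First call creates the file with a header; second call appends rows only
    append_to_csv(pd.DataFrame({'Name': ['Alice'], 'Age': [25]}), csv_path)
    append_to_csv(pd.DataFrame({'Name': ['David'], 'Age': [40]}), csv_path)

    combined = pd.read_csv(csv_path)

print(combined)
```

The appended DataFrame should have the same columns in the same order as the file, since `to_csv` does not check this.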

Example 10: Saving with a Specific Encoding

python

# Save DataFrame to a CSV file with a specific encoding

df.to_csv('people_utf8.csv', index=False, encoding='utf-8')

Assessments (15 Questions)

1. Write code to save a DataFrame to a CSV file.
2. How can you save a DataFrame to an Excel file?
3. Write code to save a DataFrame in JSON format.
4. How can you save a DataFrame to a Parquet file?
5. Write code to save a DataFrame to an HDF5 file.
6. How would you save a DataFrame to a compressed CSV file?
7. Write code to save a DataFrame with a custom separator (e.g., semicolon).
8. How can you save only specific columns from a DataFrame to a CSV file?
9. Write code to append data to an existing CSV file.
10. How can you save a DataFrame using a specific character encoding?
11. Write a function to save a DataFrame to a given file format based on user input.
12. How would you check if a file already exists before saving a DataFrame?
13. Write code to read a saved CSV file back into a DataFrame.
14. How can you save multiple DataFrames to different sheets in an Excel file?
15. Write code to handle exceptions when saving a DataFrame to a file.
