KEMBAR78
Pandas Guide | PDF | Time Series | Computer Programming
0% found this document useful (0 votes)
29 views50 pages

Pandas Guide

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
29 views50 pages

Pandas Guide

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 50

Pandas for Beginners

Pandas Overview
Pandas is a powerful and flexible open-source data analysis and manipulation library in Python. It is
widely used for data analysis tasks such as cleaning, transforming, and visualizing data. Here's a guide
to get you started with Pandas.

1. Installing Pandas

To install Pandas, you can use pip:

bash

pip install pandas

2. Importing Pandas

You start by importing the library:

python

import pandas as pd

3. Creating DataFrames

A DataFrame is a 2-dimensional labeled data structure (like a table with rows and columns).

From a Dictionary:

python

data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)

From a CSV File:

python

df = pd.read_csv('file.csv')

From an Excel File:

python

df = pd.read_excel('file.xlsx')

4. Viewing Data

Head & Tail: View the first or last few rows.

python

df.head() # Default shows the first 5 rows


df.tail(3) # Shows the last 3 rows

Shape: Get the dimensions (rows, columns).

python

print(df.shape)

Info: Get a summary of the DataFrame.

python

print(df.info())

Describe: Get statistical summary of numerical columns.

python

print(df.describe())
5. Selecting Data

Select Columns:

python

print(df['Name']) # Select single column


print(df[['Name', 'Age']]) # Select multiple columns

Select Rows:

Using `iloc` for index-based selection:

python

print(df.iloc[0]) # Select first row


print(df.iloc[0:2]) # Select first two rows

Using `loc` for label-based selection:

python

print(df.loc[0]) # Select first row

Conditional Selection:

python

# Select rows where Age is greater than 30


print(df[df['Age'] > 30])

6. Data Cleaning

Handling Missing Values:

Check for missing values:

python

print(df.isnull().sum())

Drop missing values:


python

df.dropna(inplace=True)

Fill missing values:

python

df.fillna(0, inplace=True) # Fill with a specific value

Rename Columns:

python

df.rename(columns={'Name': 'Full Name'}, inplace=True)

7. Data Manipulation

Adding a New Column:

python

df['Salary'] = [50000, 60000, 70000]

Deleting Columns:

python

df.drop(columns=['Salary'], inplace=True)

Sorting Data:

python

df.sort_values(by='Age', ascending=False, inplace=True)

8. GroupBy and Aggregation

You can group data by certain columns and apply aggregation functions.
python

grouped = df.groupby('City').mean()
print(grouped)

For more complex operations, you can apply custom aggregation:

python

df.groupby('City').agg({'Age': 'mean', 'Salary': 'sum'})

9. Merging and Joining DataFrames

Merging:

You can merge DataFrames similarly to SQL joins.

python

df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})


df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})

merged_df = pd.merge(df1, df2, on='Name')

Concatenating:

Concatenate two DataFrames:

python

df_concat = pd.concat([df1, df2], axis=0) # axis=0 for rows, axis=1 for columns

10. Saving DataFrames

To CSV:

python

df.to_csv('output.csv', index=False)

To Excel:
python

df.to_excel('output.xlsx', index=False)

11. Advanced Topics

Pivot Tables:

python

df.pivot_table(values='Salary', index='City', columns='Age', aggfunc='mean')

Apply Functions: You can apply functions to DataFrame elements.

python

df['Salary'] = df['Salary'].apply(lambda x: x + 1000)

Pandas is a highly flexible tool with a rich API that covers a wide range of data manipulation tasks. This
tutorial covers the most commonly used features to get you started.

I'll break this down by each Pandas topic with 10 examples and then follow up with 15 assessment
questions per topic for practice. I'll first walk through one topic thoroughly, including examples and
assessments.

1. Creating DataFrames

Examples:

Example 1: Creating a DataFrame from a dictionary


python

data = {'Name': ['John', 'Sara', 'Tom'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data)
print(df)

Example 2: Creating a DataFrame with specific indexes

python

data = {'Name': ['John', 'Sara', 'Tom'], 'Age': [28, 22, 35]}


df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)

Example 3: Creating a DataFrame from a list of lists

python

data = [['John', 28], ['Sara', 22], ['Tom', 35]]


df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Example 4: Creating a DataFrame from a list of dictionaries

python

data = [{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}, {'Name': 'Tom', 'Age':
35}]
df = pd.DataFrame(data)
print(df)

Example 5: Creating an empty DataFrame and adding data later

python

df = pd.DataFrame(columns=['Name', 'Age'])
df.loc[0] = ['John', 28]
df.loc[1] = ['Sara', 22]
print(df)

Example 6: Creating a DataFrame from NumPy arrays

python

import numpy as np
data = np.array([['John', 28], ['Sara', 22], ['Tom', 35]])
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Example 7: Creating a DataFrame from a CSV file

python

df = pd.read_csv('people.csv')
print(df)

Example 8: Creating a DataFrame from an Excel file

python

df = pd.read_excel('people.xlsx')
print(df)

Example 9: Creating a DataFrame from a list of tuples

python

data = [('John', 28), ('Sara', 22), ('Tom', 35)]


df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Example 10: Creating a DataFrame from another DataFrame

python

df1 = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})


df2 = pd.DataFrame(df1)
print(df2)

Assessments (15 Questions)

1. Create a DataFrame from a dictionary with the keys `['Name', 'Age']` and values `[['John',
'Tom'], [28, 35]]`.
2. Add a column `['City']` with values `['New York', 'Chicago']` to the DataFrame created in
Question 1.
3. Convert a list of lists `[['John', 28], ['Sara', 22], ['Tom', 35]]` into a DataFrame with column
names `['Name', 'Age']`.
4. Create a DataFrame from an empty DataFrame and manually add a row with values `['John',
28]`.
5. Create a DataFrame from a CSV file `data.csv` where the first row is the header.
6. Use NumPy to create a DataFrame with an array `[['Sara', 22], ['Tom', 35]]` and assign
column names.
7. Convert a dictionary `{'Name': ['Sara', 'Tom'], 'Age': [22, 35]}` into a DataFrame with
specified row indexes `['row1', 'row2']`.
8. Use a list of tuples `('Sara', 22)` and `('Tom', 35)` to create a DataFrame with column names
`['Name', 'Age']`.
9. Create a DataFrame from a list of dictionaries `{'Name': 'Sara', 'Age': 22}` and `{'Name':
'Tom', 'Age': 35}`.
10. Add a new column called `['Gender']` with values `['Female', 'Male']` to an existing
DataFrame.
11. Save the DataFrame created in Question 10 to a CSV file.
12. Create a DataFrame using an Excel file named `people_data.xlsx`.
13. Create a DataFrame by merging two existing DataFrames.
14. Load a DataFrame from a JSON file.
15. Create a DataFrame from another DataFrame and modify one of its columns.
2. Viewing Data

This topic focuses on how to explore and examine a DataFrame, which is crucial in data analysis. Let's
walk through some common methods for viewing the contents of a DataFrame, and then I'll provide
assessment questions afterward.

Examples

Example 1: Viewing the first few rows with `head()`


python

import pandas as pd

# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 22]}
df = pd.DataFrame(data)

# View the first 3 rows


print(df.head(3))

Example 2: Viewing the last few rows with `tail()`

python

# View the last 2 rows


print(df.tail(2))

Example 3: Checking the shape of the DataFrame

python

# Get the number of rows and columns


print(df.shape)

Example 4: Getting information about the DataFrame using `info()`

python

# Summary info about the DataFrame


print(df.info())

Example 5: Viewing basic statistics with `describe()`

python

# Get statistics of numeric columns


print(df.describe())

Example 6: Viewing the index of the DataFrame

python

# Get the index of the DataFrame


print(df.index)

Example 7: Viewing the column names of the DataFrame

python

# Get column names


print(df.columns)

Example 8: Viewing data types of each column

python

# Get data types of each column


print(df.dtypes)

Example 9: Viewing specific elements by position using `iloc[]`

python

# Select the element in the first row and first column


print(df.iloc[0, 0])

Example 10: Viewing specific rows by label using `loc[]`

python

# Select the row where the index is 0


print(df.loc[0])

Assessments (15 Questions)

1. Using the `head()` function, display the first 4 rows of a DataFrame.


2. Use the `tail()` function to display the last 3 rows of a DataFrame.
3. How can you find out how many rows and columns are in a DataFrame?
4. Which function would you use to get a summary of data types and memory usage of a
DataFrame?
5. Get a statistical summary of all numeric columns in a DataFrame.
6. Display the names of all columns in a DataFrame.
7. How would you access the index of a DataFrame?
8. Write a function to get the data type of each column in a DataFrame.
9. How can you extract the element from the second row and second column using `iloc[]`?
10. Use the `loc[]` function to get the row where the index is `2`.
11. How can you view the first row of a DataFrame by position?
12. How would you select the last row of a DataFrame without using `tail()`?
13. Write a function to check if there are any missing values in the DataFrame.
14. How can you view the first 5 rows of only one column (e.g., 'Age')?
15. Get the maximum and minimum values for a numeric column using a DataFrame method.

3. Selecting Data

In this section, we'll explore different ways to select data from a DataFrame, whether it’s selecting
columns, rows, or subsets based on certain conditions.

Examples

Example 1: Selecting a single column

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie'],


'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Select the 'Name' column


print(df['Name'])

Example 2: Selecting multiple columns

python

# Select the 'Name' and 'City' columns


print(df[['Name', 'City']])

Example 3: Selecting rows by index using `iloc[]`


python

# Select the first row


print(df.iloc[0])

# Select the first two rows


print(df.iloc[0:2])

Example 4: Selecting rows by label using `loc[]`

python

# Select the row with index 1


print(df.loc[1])

Example 5: Selecting rows where a column’s value meets a condition

python

# Select rows where 'Age' is greater than 28


print(df[df['Age'] > 28])

Example 6: Selecting rows where multiple conditions are met

python

# Select rows where 'Age' > 28 and 'City' is 'Chicago'


print(df[(df['Age'] > 28) & (df['City'] == 'Chicago')])

Example 7: Selecting rows based on string conditions

python

# Select rows where 'Name' starts with 'A'


print(df[df['Name'].str.startswith('A')])

Example 8: Using `isin()` to filter rows

python

# Select rows where 'City' is either 'New York' or 'Chicago'


print(df[df['City'].isin(['New York', 'Chicago'])])

Example 9: Selecting a subset of the DataFrame based on rows and columns


python

# Select the first two rows and 'Name' and 'Age' columns
print(df.loc[0:1, ['Name', 'Age']])

Example 10: Selecting specific rows and columns using `iloc[]`

python

# Select the element in the second row and third column


print(df.iloc[1, 2])

# Select the first two rows and first two columns


print(df.iloc[0:2, 0:2])

Assessments (15 Questions)

1. Select the column 'Age' from a DataFrame.


2. How would you select multiple columns, say 'Age' and 'City', from a DataFrame?
3. Write code to select the third row from a DataFrame using `iloc[]`.
4. How can you select the first three rows using `iloc[]`?
5. Write a function to select the row with index 2 using `loc[]`.
6. How can you select rows where the 'Age' column is less than 30?
7. How would you select rows where 'City' is 'New York' and 'Age' is greater than 30?
8. Write code to select rows where 'Name' ends with the letter 'e'.
9. Use the `isin()` method to filter rows where 'City' is 'Chicago' or 'Los Angeles'.
10. How would you select specific rows (first and second) and specific columns ('Name' and 'Age')?
11. Select the element in the third row and second column using `iloc[]`.
12. How would you select rows where the 'Age' column is between 25 and 35?
13. Write code to select rows where 'Name' contains the substring 'Bob'.
14. How can you select the first two rows and all columns from a DataFrame?
15. Write a function to filter rows where 'Age' is greater than 25 and 'Name' does not start with 'C'.

4. Handling Missing Data


Dealing with missing or null data is essential when working with real-world datasets. Pandas provides
various tools to handle missing values, such as detecting them, filling them, or removing them.

Examples

Example 1: Detecting missing values with `isna()`

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', None],


'Age': [25, None, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Detect missing values


print(df.isna())

Example 2: Detecting missing values with `isnull()`

python

# Alternative way to detect missing values


print(df.isnull())

Example 3: Counting missing values in each column

python

# Count the number of missing values in each column


print(df.isna().sum())

Example 4: Dropping rows with missing values

python

# Drop rows where any value is missing


df_dropped = df.dropna()
print(df_dropped)

Example 5: Dropping columns with missing values

python

# Drop columns where any value is missing


df_dropped_col = df.dropna(axis=1)
print(df_dropped_col)

Example 6: Filling missing values with a constant

python

# Fill missing values with a specific value


df_filled = df.fillna(0)
print(df_filled)

Example 7: Filling missing values in specific columns

python

# Fill missing values in the 'Age' column with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)

Example 8: Forward filling missing values

python

# Fill missing values using the previous non-null value (forward fill)
df_ffill = df.fillna(method='ffill')
print(df_ffill)

Example 9: Backward filling missing values

python

# Fill missing values using the next non-null value (backward fill)
df_bfill = df.fillna(method='bfill')
print(df_bfill)

Example 10: Interpolating missing values

python

# Interpolate missing numeric values


df['Age'] = df['Age'].interpolate()
print(df)

Assessments (15 Questions)


1. How would you detect missing values in a DataFrame using Pandas?
2. Write a function to count missing values in each column of a DataFrame.
3. How can you drop all rows that have at least one missing value?
4. How would you drop columns that contain missing values?
5. Write code to fill missing values with the constant `0`.
6. How can you replace missing values in the 'Age' column with the mean of that column?
7. Write code to forward-fill missing values in a DataFrame.
8. How can you backward-fill missing values in a DataFrame?
9. Use the `interpolate()` function to fill missing values in a numeric column.
10. Write code to fill missing values in a column with the median of that column.
11. How would you fill missing values in the 'City' column with the string 'Unknown'?
12. Write a function to drop rows where all values are missing.
13. How can you fill missing values in a DataFrame using the previous valid value from the same
row?
14. How would you replace missing values in a column with the mode (most frequent value) of that
column?
15. Write code to detect and count missing values in a specific column (e.g., 'Name').

5. Data Aggregation and Grouping

Data aggregation and grouping are fundamental for summarizing and analyzing data in a structured
way. In Pandas, you can use the `groupby()` function to group data and then apply aggregate
functions like `sum()`, `mean()`, etc.

Examples

Example 1: Grouping data by a single column

python

import pandas as pd

data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
'Age': [25, 30, 35, 28, 32],
'Score': [85, 90, 88, 92, 79]
}
df = pd.DataFrame(data)

# Group by 'Name' and calculate the mean of 'Age' and 'Score'


grouped = df.groupby('Name').mean()
print(grouped)

Example 2: Grouping by multiple columns

python

# Group by 'Name' and 'Age', and calculate the sum of 'Score'


grouped_multi = df.groupby(['Name', 'Age']).sum()
print(grouped_multi)

Example 3: Using aggregation functions with `agg()`

python

# Group by 'Name' and aggregate 'Score' using multiple functions


agg_result = df.groupby('Name').agg({'Score': ['mean', 'max', 'min']})
print(agg_result)

Example 4: Counting the number of occurrences

python

# Count the number of occurrences of each name


count_result = df['Name'].value_counts()
print(count_result)

Example 5: Grouping and applying custom functions

python

# Define a custom function to calculate range


def age_range(x):
return x.max() - x.min()

# Group by 'Name' and apply the custom age_range function


age_range_result = df.groupby('Name')['Age'].agg(age_range)
print(age_range_result)

Example 6: Resetting index after grouping

python

# Group by 'Name' and get mean, then reset index


reset_result = df.groupby('Name').mean().reset_index()
print(reset_result)
Example 7: Grouping and filtering groups

python

# Group by 'Name' and filter groups where the mean Score is greater than 85
filtered = df.groupby('Name').filter(lambda x: x['Score'].mean() > 85)
print(filtered)

Example 8: Grouping with multiple aggregations

python

# Group by 'Name' and calculate the sum and mean of 'Score'


agg_multi = df.groupby('Name').agg({'Score': ['sum', 'mean']})
print(agg_multi)

Example 9: Using `size()` to count the number of entries per group

python

# Count the number of entries per name


size_result = df.groupby('Name').size()
print(size_result)

Example 10: Pivot tables as an alternative to grouping

python

# Create a pivot table for Score based on Name and Age


pivot_table = df.pivot_table(values='Score', index='Name', columns='Age', aggfunc='mean')
print(pivot_table)

Assessments (15 Questions)

1. How would you group a DataFrame by a single column and calculate the mean of another
column?
2. Write code to group data by two columns and calculate the sum of a third column.
3. How can you apply multiple aggregation functions to a grouped DataFrame using `agg()`?
4. Write a function to count the number of occurrences of each unique value in a column.
5. How would you group by a column and apply a custom aggregation function?
6. Write code to reset the index of a grouped DataFrame.
7. How can you filter groups based on the mean of a column after grouping?
8. Write code to create a pivot table that summarizes data based on two columns.
9. How would you group a DataFrame and count the size of each group?
10. Write a function to find the maximum and minimum values of a numeric column after grouping
by a categorical column.
11. How can you group by multiple columns and calculate the average of one column?
12. Write code to create a new DataFrame from the aggregation results of a grouped DataFrame.
13. How would you create a DataFrame that only includes groups where the total count is greater
than 1?
14. Write code to summarize data by calculating the sum and mean for multiple columns in a
grouped DataFrame.
15. How can you create a summary table showing the average score for each name in the original
DataFrame?

6. Merging and Joining DataFrames

Merging and joining DataFrames are essential skills in data manipulation, allowing you to combine
datasets based on common keys or indices. Pandas provides various functions for these operations,
including `merge()`, `join()`, and `concat()`.

Examples

Example 1: Merging DataFrames using `merge()`

python

import pandas as pd

# Creating two DataFrames


df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [25, 30, 35]
})

# Merge the DataFrames on 'ID'


merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

Example 2: Merging with different types of joins


python

# Left join
left_join = pd.merge(df1, df2, on='ID', how='left')
print(left_join)

# Right join
right_join = pd.merge(df1, df2, on='ID', how='right')
print(right_join)

# Outer join
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join)

Example 3: Joining DataFrames using `join()`

python

# Creating two DataFrames with indices


df3 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])

df4 = pd.DataFrame({
'City': ['New York', 'Los Angeles']
}, index=[1, 3])

# Join the DataFrames


joined_df = df3.join(df4, how='outer')
print(joined_df)

Example 4: Concatenating DataFrames using `concat()`

python

# Creating two DataFrames for concatenation


df5 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})

df6 = pd.DataFrame({
'ID': [3, 4],
'Name': ['Charlie', 'David']
})

# Concatenate the DataFrames


concatenated_df = pd.concat([df5, df6], ignore_index=True)
print(concatenated_df)

Example 5: Concatenating along columns

python

# Concatenate DataFrames along columns


df7 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})

df8 = pd.DataFrame({
'Age': [25, 30]
})

concatenated_columns = pd.concat([df7, df8], axis=1)


print(concatenated_columns)

Example 6: Merging on multiple keys

python

# Creating two DataFrames


df9 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'City': ['NY', 'LA', 'SF']
})

df10 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})

# Merge on both 'ID' and 'Name'


merged_multi_key = pd.merge(df9, df10, on=['ID', 'Name'])
print(merged_multi_key)

Example 7: Using suffixes to differentiate columns

python

# Merging with suffixes for overlapping column names


df11 = pd.DataFrame({
'ID': [1, 2],
'Value': [100, 200]
})

df12 = pd.DataFrame({
'ID': [1, 2],
'Value': [300, 400]
})

merged_suffixes = pd.merge(df11, df12, on='ID', suffixes=('_left', '_right'))


print(merged_suffixes)

Example 8: Joining with different indices

python

# Create DataFrames with different indices


df13 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])
df14 = pd.DataFrame({
'City': ['New York', 'Los Angeles']
}, index=[2, 3])

# Join the DataFrames with different indices


joined_diff_index = df13.join(df14, how='outer')
print(joined_diff_index)

Example 9: Using `concat()` with keys

python

# Using concat with keys to create a hierarchical index


df15 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})

df16 = pd.DataFrame({
'ID': [3, 4],
'Name': ['Charlie', 'David']
})

concatenated_keys = pd.concat([df15, df16], keys=['Group1', 'Group2'])


print(concatenated_keys)

Example 10: Merging with indicator

python

# Merge with indicator to see the source of each row


merged_with_indicator = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(merged_with_indicator)

Assessments (15 Questions)

1. Write code to merge two DataFrames on a single key.


2. How would you perform a left join between two DataFrames?
3. Write code to join two DataFrames using the `join()` method.
4. How can you concatenate two DataFrames vertically?
5. Write code to concatenate two DataFrames horizontally.
6. How can you merge two DataFrames on multiple keys?
7. Write a function to merge two DataFrames and include suffixes for overlapping column names.
8. How would you join two DataFrames with different indices?
9. Write code to use `concat()` with keys to create a hierarchical index.
10. How can you merge two DataFrames and see the source of each row?
11. Write code to perform an outer join on two DataFrames.
12. How would you concatenate a list of DataFrames into a single DataFrame?
13. Write a function to merge two DataFrames and drop any duplicates that arise.
14. How can you merge two DataFrames without any duplicate rows?
15. Write code to merge two DataFrames and fill missing values in the resulting DataFrame.
7. Data Visualization with Pandas

Data visualization is an essential part of data analysis, allowing you to see trends, patterns, and
insights that may not be immediately obvious from raw data. Pandas integrates well with Matplotlib
and Seaborn for plotting.

Examples

Example 1: Basic line plot

python

import pandas as pd
import matplotlib.pyplot as plt

# Create a simple DataFrame


data = {'Year': [2018, 2019, 2020, 2021, 2022],
'Sales': [150, 200, 250, 300, 350]}
df = pd.DataFrame(data)

# Basic line plot


df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 2: Bar plot

python

# Bar plot
df.plot(x='Year', y='Sales', kind='bar', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 3: Histogram

python

# Create a DataFrame with random data


data2 = {'Scores': [82, 85, 87, 90, 95, 70, 75, 80, 88, 92]}
df2 = pd.DataFrame(data2)

# Histogram of Scores
df2['Scores'].plot(kind='hist', bins=5, title='Score Distribution')
plt.xlabel('Scores')
plt.show()

Example 4: Scatter plot

python

# Create a DataFrame for scatter plot


data3 = {
'Height': [5.1, 5.5, 5.8, 6.0, 5.7],
'Weight': [100, 150, 130, 165, 155]
}
df3 = pd.DataFrame(data3)

# Scatter plot
df3.plot(kind='scatter', x='Height', y='Weight', title='Height vs Weight')
plt.xlabel('Height (in feet)')
plt.ylabel('Weight (in lbs)')
plt.show()

Example 5: Box plot

python

# Box plot
df2['Scores'].plot(kind='box', title='Box Plot of Scores')
plt.ylabel('Scores')
plt.show()

Example 6: Area plot

python

# Area plot
df.plot(x='Year', y='Sales', kind='area', alpha=0.4, title='Sales Area Plot')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 7: Pie chart

python

# Create a DataFrame for pie chart


data4 = {'Fruits': ['Apple', 'Banana', 'Cherry'],
'Quantity': [30, 25, 45]}
df4 = pd.DataFrame(data4)

# Pie chart
df4.plot.pie(y='Quantity', labels=df4['Fruits'], autopct='%1.1f%%', title='Fruit
Distribution', legend=False)
plt.show()

Example 8: Customizing plots

python

# Line plot with customization


ax = df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', color='orange')
ax.set_xlabel('Year')
ax.set_ylabel('Sales')
ax.grid()
plt.show()

Example 9: Using Seaborn for enhanced visualization

python

import seaborn as sns

# Scatter plot using Seaborn


sns.scatterplot(data=df3, x='Height', y='Weight', hue='Weight', palette='deep',
size='Weight', sizes=(20, 200))
plt.title('Height vs Weight with Seaborn')
plt.show()

Example 10: Pairplot with Seaborn

python

# Creating a DataFrame with more data for pairplot


data5 = {
'Height': [5.1, 5.5, 5.8, 6.0, 5.7],
'Weight': [100, 150, 130, 165, 155],
'Age': [20, 22, 23, 25, 30]
}
df5 = pd.DataFrame(data5)

# Pairplot using Seaborn


sns.pairplot(df5)
plt.show()

Assessments (15 Questions)

1. Write code to create a line plot for sales data over years.
2. How would you create a bar plot for the same sales data?
3. Write code to plot a histogram of randomly generated scores.
4. How can you create a scatter plot to visualize the relationship between height and weight?
5. Write a function to create a box plot for a set of scores.
6. How would you create an area plot for sales data?
7. Write code to create a pie chart showing the distribution of fruit quantities.
8. How can you customize a line plot with specific colors and gridlines?
9. Write code to create a scatter plot using Seaborn, including a color palette.
10. How can you create a pairplot to visualize relationships in a DataFrame with multiple numeric
columns?
11. Write a function to plot a histogram with a specified number of bins.
12. How would you create a customized pie chart with percentages displayed?
13. Write code to create a scatter plot with varying point sizes based on another column.
14. How can you add labels to axes and a title to a plot in Matplotlib?
15. Write a function to visualize the distribution of a DataFrame column using a box plot.

8. Time Series Analysis with Pandas

Time series analysis involves analyzing data points collected or recorded at specific time intervals.
Pandas provides powerful tools for working with time series data, making it easy to manipulate,
visualize, and analyze trends over time.

Examples

Example 1: Creating a Time Series

python

import pandas as pd

# Create a date range


date_range = pd.date_range(start='2020-01-01', periods=5, freq='D')

# Create a DataFrame with a time series


data = {'Value': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data, index=date_range)
print(df)

Example 2: Plotting a Time Series

python

# Plot the time series


df.plot(title='Simple Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

Example 3: Resampling Time Series Data

python

# Create a time series with random data


date_range = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = {'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]}
df = pd.DataFrame(data, index=date_range)

# Resample to weekly frequency and calculate the mean


weekly_mean = df.resample('W').mean()
print(weekly_mean)

Example 4: Shifting Time Series Data

python

# Shift the time series data by 1 day


df_shifted = df.shift(1)
print(df_shifted)

Example 5: Rolling Window Calculations

python

# Calculate the rolling mean with a window of 3 days


rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)

Example 6: Time Series Indexing

python

# Select data for a specific date


specific_date = df.loc['2020-01-03']
print(specific_date)

Example 7: Handling Missing Dates

python

# Create a time series with missing dates


date_range = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = {'Value': [10, None, 20, 25, None, 30, 35, None, 50, 55]}
df_missing = pd.DataFrame(data, index=date_range)
# Fill missing values using forward fill
df_filled = df_missing.fillna(method='ffill')
print(df_filled)

Example 8: Time Series Decomposition

python

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the time series


result = seasonal_decompose(df['Value'], model='additive', period=2)
result.plot()
plt.show()

Example 9: Date Range Generation

python

# Create a date range with business days


business_days = pd.date_range(start='2020-01-01', end='2020-01-10', freq='B')
print(business_days)

Example 10: Converting a Column to DateTime

python

# Create a DataFrame with dates as strings


data2 = {'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
'Value': [10, 20, 30]}
df2 = pd.DataFrame(data2)

# Convert the 'Date' column to datetime


df2['Date'] = pd.to_datetime(df2['Date'])
print(df2)

Assessments (15 Questions)

1. How would you create a time series DataFrame with daily frequency?
2. Write code to plot a time series graph.
3. How can you resample time series data to a weekly frequency and calculate the mean?
4. Write a function to shift a time series by a specified number of periods.
5. How would you calculate the rolling mean over a time series with a window of 3 days?
6. Write code to select data for a specific date in a time series DataFrame.
7. How can you handle missing values in a time series using forward fill?
8. Write code to decompose a time series into its trend, seasonal, and residual components.
9. How would you generate a date range with business days only?
10. Write code to convert a column of strings into datetime format in a DataFrame.
11. How can you plot a time series with different colors for different time periods?
12. Write a function to calculate the cumulative sum of a time series.
13. How would you extract the month and year from a date column in a time series DataFrame?
14. Write code to resample a time series to a monthly frequency and calculate the sum.
15. How can you check for missing dates in a time series?

9. Data Cleaning and Preparation with Pandas

Data cleaning and preparation are essential steps in the data analysis process, ensuring that the data
is accurate, complete, and ready for analysis. Pandas offers various functions and methods for
cleaning, transforming, and preparing data.

Examples

Example 1: Removing Duplicates

python

import pandas as pd

# Create a DataFrame with duplicate rows


data = {
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
}
df = pd.DataFrame(data)

# Remove duplicate rows


df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

Example 2: Filling Missing Values

python

# Create a DataFrame with missing values


data2 = {
'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 35, 40]
}
df2 = pd.DataFrame(data2)

# Fill missing values with a specified value


df_filled = df2.fillna({'Name': 'Unknown', 'Age': df2['Age'].mean()})
print(df_filled)

Example 3: Dropping Missing Values

python

# Drop rows with missing values


df_dropped = df2.dropna()
print(df_dropped)

Example 4: Renaming Columns

python

# Rename columns
df.rename(columns={'Name': 'Full Name', 'Age': 'Age in Years'}, inplace=True)
print(df)

Example 5: Changing Data Types

python

# Create a DataFrame with mixed data types


data3 = {
'ID': ['1', '2', '3'],
'Age': ['25', '30', '35']
}
df3 = pd.DataFrame(data3)

# Convert data types


df3['ID'] = df3['ID'].astype(int)
df3['Age'] = df3['Age'].astype(int)
print(df3.dtypes)

Example 6: String Manipulation

python

# Create a DataFrame with strings


data4 = {
'Names': [' Alice ', 'Bob ', ' Charlie ']
}
df4 = pd.DataFrame(data4)

# Strip whitespace from strings


df4['Names'] = df4['Names'].str.strip()
print(df4)

Example 7: Changing Indexes


python

# Create a DataFrame
data5 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df5 = pd.DataFrame(data5)

# Set 'Name' as the index


df5.set_index('Name', inplace=True)
print(df5)

Example 8: Filtering Data

python

# Filter data based on a condition


filtered_df = df5[df5['Age'] > 30]
print(filtered_df)

Example 9: Categorizing Data

python

# Create a DataFrame with categorical data


data6 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 70, 95]
}
df6 = pd.DataFrame(data6)

# Convert 'Score' to categorical


df6['Grade'] = pd.cut(df6['Score'], bins=[0, 60, 80, 100], labels=['F', 'C', 'A'])
print(df6)

Example 10: Merging and Concatenating DataFrames

python

# Create two DataFrames


df7 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df8 = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})

# Merge the DataFrames


merged_df = pd.merge(df7, df8, on='ID', how='outer')
print(merged_df)

Assessments (15 Questions)

1. Write code to remove duplicate rows from a DataFrame.


2. How can you fill missing values in a DataFrame with a specific value?
3. Write code to drop rows with missing values from a DataFrame.
4. How would you rename columns in a DataFrame?
5. Write code to change the data types of specific columns in a DataFrame.
6. How can you strip whitespace from string entries in a DataFrame column?
7. Write code to change the index of a DataFrame to a specific column.
8. How would you filter a DataFrame to show rows that meet a specific condition?
9. Write code to categorize a numerical column into bins.
10. How can you concatenate two DataFrames vertically?
11. Write a function to check for missing values in a DataFrame.
12. How would you replace specific values in a DataFrame with new values?
13. Write code to sort a DataFrame by a specific column.
14. How can you check the data types of all columns in a DataFrame?
15. Write code to extract a specific column from a DataFrame as a Series.

10. Advanced Data Manipulation with Pandas

Advanced data manipulation techniques are essential for more complex data analysis tasks. This topic
covers advanced features in Pandas that allow you to transform, aggregate, and reshape your data.

Examples

Example 1: Pivot Tables

python

import pandas as pd

# Create a DataFrame for pivot table


data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Sales': [100, 200, 150, 250]
}
df = pd.DataFrame(data)

# Create a pivot table


pivot_table = df.pivot_table(values='Sales', index='Date', columns='Category',
aggfunc='sum')
print(pivot_table)

Example 2: Group By
python

# Group by Category and calculate the total sales


grouped = df.groupby('Category')['Sales'].sum()
print(grouped)

Example 3: Applying Functions

python

# Apply a custom function to calculate the square of sales


df['Sales Squared'] = df['Sales'].apply(lambda x: x ** 2)
print(df)

Example 4: Chaining Methods

python

# Chaining methods for data manipulation


result = (df[df['Sales'] > 100]
.groupby('Category')['Sales']
.sum()
.reset_index())
print(result)

Example 5: Merging with Different Join Types

python

# Create two DataFrames for merging


df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 85, 80]})

# Merge DataFrames with inner join


inner_merge = pd.merge(df1, df2, on='ID', how='inner')
print(inner_merge)

Example 6: Concatenating DataFrames

python

# Create two DataFrames


df3 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df4 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})

# Concatenate DataFrames
concatenated = pd.concat([df3, df4], ignore_index=True)
print(concatenated)
Example 7: Reshaping Data with Melt

python

# Create a DataFrame to reshape


data2 = {
'ID': [1, 2],
'Math': [90, 80],
'Science': [85, 75]
}
df2 = pd.DataFrame(data2)

# Reshape with melt


melted = df2.melt(id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject',
value_name='Score')
print(melted)

Example 8: Using `stack` and `unstack`

python

# Create a multi-index DataFrame


data3 = {
'Category': ['A', 'A', 'B', 'B'],
'Type': ['X', 'Y', 'X', 'Y'],
'Values': [10, 20, 15, 25]
}
df3 = pd.DataFrame(data3).set_index(['Category', 'Type'])

# Use stack
stacked = df3.stack()
print(stacked)

# Use unstack
unstacked = df3.unstack()
print(unstacked)

Example 9: Using `pivot` for Reshaping Data

python

# Create a DataFrame for pivot


data4 = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
'Product': ['A', 'B', 'A'],
'Sales': [100, 200, 150]
}
df4 = pd.DataFrame(data4)

# Create a pivot table


pivot_result = df4.pivot(index='Date', columns='Product', values='Sales')
print(pivot_result)

Example 10: Time Series Aggregation


python

# Create a time series DataFrame


date_range = pd.date_range(start='2021-01-01', periods=5, freq='D')
data5 = {'Sales': [10, 20, 15, 25, 30]}
df5 = pd.DataFrame(data5, index=date_range)

# Resample to weekly frequency and sum sales


weekly_sales = df5.resample('W').sum()
print(weekly_sales)

Assessments (15 Questions)

1. Write code to create a pivot table summarizing sales by date and category.
2. How can you group a DataFrame by a specific column and calculate the sum?
3. Write a function to apply a custom operation on a DataFrame column.
4. How can you chain multiple methods to filter and group data in one statement?
5. Write code to perform an outer join on two DataFrames based on a common column.
6. How would you concatenate two DataFrames and reset the index?
7. Write code to melt a DataFrame from wide to long format.
8. How can you use `stack` and `unstack` to reshape a multi-index DataFrame?
9. Write code to create a pivot table from a DataFrame with date and product data.
10. How would you aggregate time series data to a monthly frequency and calculate the mean?
11. Write a function to merge two DataFrames with a left join.
12. How can you extract unique values from a specific column in a DataFrame?
13. Write code to sort a DataFrame based on multiple columns.
14. How can you calculate cumulative sums for a DataFrame column?
15. Write code to filter a DataFrame based on a condition from multiple columns.

11. Data Visualization with Pandas

Data visualization is a crucial aspect of data analysis that helps in understanding patterns, trends, and
insights. Pandas provides built-in capabilities for creating basic visualizations, leveraging libraries like
Matplotlib and Seaborn for more advanced visualizations.

Examples

Example 1: Basic Line Plot


python

import pandas as pd
import matplotlib.pyplot as plt

# Create a simple DataFrame


data = {
'Year': [2020, 2021, 2022],
'Sales': [100, 150, 200]
}
df = pd.DataFrame(data)

# Plot a line graph


df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', marker='o')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 2: Bar Plot

python

# Create a bar plot


df.plot(x='Year', y='Sales', kind='bar', title='Sales by Year')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()

Example 3: Histogram

python

# Create a DataFrame with random data


data2 = {
'Scores': [88, 92, 85, 95, 70, 76, 89, 90, 83, 94]
}
df2 = pd.DataFrame(data2)

# Plot a histogram
df2['Scores'].plot(kind='hist', bins=5, title='Score Distribution')
plt.xlabel('Scores')
plt.show()

Example 4: Scatter Plot

python

# Create another DataFrame


data3 = {
'Height': [150, 160, 170, 180, 190],
'Weight': [50, 60, 70, 80, 90]
}
df3 = pd.DataFrame(data3)

# Create a scatter plot


df3.plot(kind='scatter', x='Height', y='Weight', title='Height vs. Weight')
plt.show()

Example 5: Box Plot

python

# Create a DataFrame with some data


data4 = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Values': [10, 15, 12, 18, 22, 25]
}
df4 = pd.DataFrame(data4)

# Create a box plot


df4.boxplot(column='Values', by='Category')
plt.title('Box Plot of Values by Category')
plt.suptitle('')
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()

Example 6: Area Plot

python

# Create a DataFrame with cumulative data


data5 = {
'Year': [2020, 2021, 2022],
'Sales': [100, 150, 200],
'Expenses': [80, 120, 160]
}
df5 = pd.DataFrame(data5)

# Plot an area plot


df5.plot(x='Year', kind='area', stacked=True, title='Sales and Expenses Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.show()

Example 7: Pie Chart

python

# Create a pie chart for category distribution


data6 = {
'Category': ['A', 'B', 'C'],
'Values': [30, 50, 20]
}
df6 = pd.DataFrame(data6)

# Plot a pie chart


df6.set_index('Category').plot(kind='pie', y='Values', autopct='%1.1f%%', title='Category
Distribution')
plt.ylabel('')
plt.show()
Example 8: Heatmap with Seaborn

python

import seaborn as sns

# Create a correlation matrix


data7 = {
'A': [1, 2, 3, 4],
'B': [4, 5, 6, 7],
'C': [7, 8, 9, 10]
}
df7 = pd.DataFrame(data7)

# Create a heatmap
sns.heatmap(df7.corr(), annot=True, cmap='coolwarm', title='Correlation Heatmap')
plt.title('Correlation Heatmap')
plt.show()

Example 9: Customizing Plots

python

# Create a line plot with customization


df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', marker='o',
color='red', linestyle='--')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid()
plt.show()

Example 10: Subplots

python

# Create multiple subplots


fig, axs = plt.subplots(2, 1, figsize=(8, 8))

df.plot(x='Year', y='Sales', kind='line', ax=axs[0], title='Sales Over Years', marker='o')


axs[0].set_ylabel('Sales')

df2['Scores'].plot(kind='hist', bins=5, ax=axs[1], title='Score Distribution')


axs[1].set_xlabel('Scores')

plt.tight_layout()
plt.show()

Assessments (15 Questions)

1. Write code to create a line plot for sales data over several years.
2. How can you create a bar plot from a DataFrame?
3. Write code to generate a histogram of a numerical column.
4. How can you create a scatter plot to analyze the relationship between two variables?
5. Write code to create a box plot for data categorized by another variable.
6. How can you create an area plot to visualize cumulative data over time?
7. Write code to generate a pie chart from categorical data.
8. How can you create a heatmap to visualize correlations between multiple variables?
9. Write code to customize a plot with colors, markers, and line styles.
10. How can you create multiple subplots in a single figure?
11. Write a function to save a plot as a PNG file.
12. How would you plot a rolling average on a time series data?
13. Write code to change the title and labels of a plot after creating it.
14. How can you adjust the size of a plot in Matplotlib?
15. Write code to display a legend in a plot.
12. Working with Time Zones in Pandas

Time zone handling is an essential part of time series analysis, especially when dealing with data from
multiple time zones. Pandas provides robust tools for converting, localizing, and performing
operations on time zone-aware datetime objects.

Examples

Example 1: Localizing a Time Series

python

import pandas as pd

# Create a time series


date_range = pd.date_range(start='2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 15, 20, 25, 30], index=date_range)

# Localize to UTC
ts_utc = ts.tz_localize('UTC')
print(ts_utc)

Example 2: Converting Time Zones

python

# Convert time series to a different time zone


ts_new_york = ts_utc.tz_convert('America/New_York')
print(ts_new_york)

Example 3: Creating Time Zone Aware Timestamps


python

# Create a timestamp with a specific time zone


timestamp = pd.Timestamp('2023-01-01 12:00:00', tz='Europe/London')
print(timestamp)

Example 4: Handling Daylight Saving Time (DST)

python

# Create a time series that includes a DST change


date_range = pd.date_range(start='2023-03-01', periods=10, freq='D')
ts_dst = pd.Series(range(10), index=date_range).dt.tz_localize('Europe/London')
print(ts_dst)

Example 5: Using `pd.to_datetime` with Time Zones

python

# Create a time series from a string with a time zone


ts_from_string = pd.to_datetime('2023-01-01 12:00:00 UTC')
print(ts_from_string)

Example 6: Operations with Time Zone Aware Datetime

python

# Create two time series with different time zones


ts_utc1 = pd.Series([1, 2, 3], index=pd.date_range('2023-01-01', periods=3,
freq='H')).tz_localize('UTC')
ts_utc2 = pd.Series([4, 5, 6], index=pd.date_range('2023-01-01', periods=3,
freq='H')).tz_localize('America/New_York')

# Convert to UTC for operations


ts_utc2_converted = ts_utc2.tz_convert('UTC')
result = ts_utc1 + ts_utc2_converted
print(result)

Example 7: Plotting Time Series with Time Zones

python

import matplotlib.pyplot as plt

# Plotting time series with time zones


ts_utc.plot(title='Time Series in UTC', marker='o')
plt.xlabel('Date')
plt.ylabel('Values')
plt.show()
Example 8: Resampling with Time Zones

python

# Resample time series with time zone


ts_resampled = ts_dst.resample('2D').mean()
print(ts_resampled)

Example 9: Converting Timestamps to Local Time

python

# Convert a timestamp to local time


local_time = timestamp.tz_convert('America/New_York')
print(local_time)

Example 10: Handling Multiple Time Zones

python

# Create a DataFrame with different time zones


data = {
'UTC': pd.date_range('2023-01-01', periods=3, freq='H').tz_localize('UTC'),
'New_York': pd.date_range('2023-01-01', periods=3,
freq='H').tz_localize('America/New_York')
}
df_timezones = pd.DataFrame(data)
print(df_timezones)

Assessments (15 Questions)

1. Write code to localize a naive time series to UTC.


2. How can you convert a time series from UTC to a different time zone?
3. Write code to create a timestamp that is aware of a specific time zone.
4. How would you handle daylight saving time when creating a time series?
5. Write code to create a time series from a string that includes a time zone.
6. How can you perform arithmetic operations on time zone-aware time series?
7. Write code to plot a time series that includes time zone information.
8. How would you resample a time series that includes time zone data?
9. Write code to convert a timestamp from one time zone to another.
10. How can you create a DataFrame that includes timestamps in multiple time zones?
11. Write a function to check if a timestamp is in a specific time zone.
12. How would you extract the timezone information from a timestamp?
13. Write code to calculate the difference between two time zone-aware timestamps.
14. How can you handle missing values in a time series that includes time zones?
15. Write code to visualize a time series in different time zones on the same plot.
13. Data Aggregation and Grouping with Pandas

Data aggregation and grouping allow you to summarize and analyze large datasets efficiently. Pandas
provides powerful methods for grouping data, applying functions to groups, and aggregating results.

Examples

Example 1: Basic GroupBy Operation

python

import pandas as pd

# Create a DataFrame
data = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Values': [10, 15, 10, 25, 30, 35]
}
df = pd.DataFrame(data)

# Group by 'Category' and calculate the sum


grouped_sum = df.groupby('Category').sum()
print(grouped_sum)

Example 2: Grouping with Multiple Aggregations

python

# Group by 'Category' and apply multiple aggregations


grouped_agg = df.groupby('Category').agg({'Values': ['sum', 'mean', 'max']})
print(grouped_agg)

Example 3: Grouping with a Custom Function

python

# Define a custom function


def custom_function(x):
return x.max() - x.min()

# Group by 'Category' and apply the custom function


grouped_custom = df.groupby('Category')['Values'].agg(custom_function)
print(grouped_custom)

Example 4: Grouping with DateTime Index

python

# Create a DataFrame with a DateTime index


date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df_time = pd.DataFrame({'Values': range(10)}, index=date_rng)

# Group by day and calculate the sum


daily_sum = df_time.resample('D').sum()
print(daily_sum)

Example 5: Grouping with `transform`

python

# Using transform to get the mean for each group


df['Mean'] = df.groupby('Category')['Values'].transform('mean')
print(df)

Example 6: Grouping with Filtering

python

# Filter groups based on a condition


filtered_groups = df.groupby('Category').filter(lambda x: x['Values'].sum() > 40)
print(filtered_groups)

Example 7: Grouping by Multiple Columns

python

# Create a DataFrame with multiple columns


data2 = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
'Values': [10, 15, 10, 25, 30, 35]
}
df_multi = pd.DataFrame(data2)

# Group by 'Category' and 'Subcategory'


grouped_multi = df_multi.groupby(['Category', 'Subcategory']).sum()
print(grouped_multi)

Example 8: Pivot Table


python

# Create a pivot table from the DataFrame


pivot_table = df_multi.pivot_table(values='Values', index='Category',
columns='Subcategory', aggfunc='sum', fill_value=0)
print(pivot_table)

Example 9: GroupBy with Sort

python

# Group by 'Category' and sort by the sum of values


sorted_groups = df.groupby('Category')['Values'].sum().sort_values(ascending=False)
print(sorted_groups)

Example 10: Combining Aggregation and Transformation

python

# Combine aggregation and transformation


df['Normalized'] = df.groupby('Category')['Values'].transform(lambda x: (x - x.mean()) /
x.std())
print(df)

Assessments (15 Questions)

1. Write code to group a DataFrame by a single column and calculate the sum of another column.
2. How can you apply multiple aggregation functions to a grouped DataFrame?
3. Write a custom aggregation function and apply it to a group.
4. How would you group a DataFrame with a DateTime index and calculate the daily sum?
5. Write code to use the `transform` function after grouping.
6. How can you filter groups based on a specific condition?
7. Write code to group a DataFrame by multiple columns and sum the values.
8. How can you create a pivot table from a DataFrame?
9. Write code to sort groups based on the sum of values in descending order.
10. How would you normalize values within each group using `transform`?
11. Write a function to calculate the range of values in each group.
12. How can you reset the index of a grouped DataFrame?
13. Write code to create a grouped DataFrame and visualize it with a bar plot.
14. How can you group by a time series and calculate the weekly average?
15. Write code to group data and export the summary to a CSV file.
14. Saving DataFrames

Pandas provides various methods to save DataFrames to different file formats. This is essential for
preserving data analysis results, sharing with others, or loading them in other applications.

Examples

Example 1: Saving to CSV

python

import pandas as pd

# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)

# Save DataFrame to a CSV file


df.to_csv('people.csv', index=False)

Example 2: Saving to Excel

python

# Save DataFrame to an Excel file


df.to_excel('people.xlsx', index=False)

Example 3: Saving to JSON

python

# Save DataFrame to a JSON file


df.to_json('people.json', orient='records', lines=True)

Example 4: Saving to Parquet

python

# Save DataFrame to a Parquet file


df.to_parquet('people.parquet')

Example 5: Saving to HDF5

python

# Save DataFrame to an HDF5 file


df.to_hdf('people.h5', key='people', mode='w')

Example 6: Saving with Compression

python

# Save DataFrame to a compressed CSV file


df.to_csv('people_compressed.csv.gz', index=False, compression='gzip')

Example 7: Saving with a Custom Separator

python

# Save DataFrame to a CSV file with a custom separator


df.to_csv('people_semicolon.csv', sep=';', index=False)

Example 8: Saving Selected Columns

python

# Save only selected columns to a CSV file


df[['Name', 'City']].to_csv('people_selected.csv', index=False)

Example 9: Appending to an Existing CSV

python

# Append DataFrame to an existing CSV file


new_data = {
'Name': ['David', 'Eva'],
'Age': [40, 45],
'City': ['Houston', 'Phoenix']
}
df_new = pd.DataFrame(new_data)

df_new.to_csv('people.csv', mode='a', header=False, index=False)

Example 10: Saving with a Specific Encoding


python

# Save DataFrame to a CSV file with a specific encoding


df.to_csv('people_utf8.csv', index=False, encoding='utf-8')

Assessments (15 Questions)

1. Write code to save a DataFrame to a CSV file.


2. How can you save a DataFrame to an Excel file?
3. Write code to save a DataFrame in JSON format.
4. How can you save a DataFrame to a Parquet file?
5. Write code to save a DataFrame to an HDF5 file.
6. How would you save a DataFrame to a compressed CSV file?
7. Write code to save a DataFrame with a custom separator (e.g., semicolon).
8. How can you save only specific columns from a DataFrame to a CSV file?
9. Write code to append data to an existing CSV file.
10. How can you save a DataFrame using a specific character encoding?
11. Write a function to save a DataFrame to a given file format based on user input.
12. How would you check if a file already exists before saving a DataFrame?
13. Write code to read a saved CSV file back into a DataFrame.
14. How can you save multiple DataFrames to different sheets in an Excel file?
15. Write code to handle exceptions when saving a DataFrame to a file.

You might also like