Pandas for Beginners
Pandas Overview
Pandas is a powerful and flexible open-source data analysis and manipulation library in Python. It is
widely used for data analysis tasks such as cleaning, transforming, and visualizing data. Here's a guide
to get you started with Pandas.
1. Installing Pandas
To install Pandas, you can use pip:
bash
pip install pandas
2. Importing Pandas
You start by importing the library:
python
import pandas as pd
3. Creating DataFrames
A DataFrame is a 2-dimensional labeled data structure (like a table with rows and columns).
From a Dictionary:
python
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
print(df)
From a CSV File:
python
df = pd.read_csv('file.csv')
From an Excel File:
python
df = pd.read_excel('file.xlsx')
4. Viewing Data
Head & Tail: View the first or last few rows.
python
df.head() # Default shows the first 5 rows
df.tail(3) # Shows the last 3 rows
Shape: Get the dimensions (rows, columns).
python
print(df.shape)
Info: Get a summary of the DataFrame.
python
print(df.info())
Describe: Get statistical summary of numerical columns.
python
print(df.describe())
5. Selecting Data
Select Columns:
python
print(df['Name']) # Select single column
print(df[['Name', 'Age']]) # Select multiple columns
Select Rows:
Using `iloc` for index-based selection:
python
print(df.iloc[0]) # Select first row
print(df.iloc[0:2]) # Select first two rows
Using `loc` for label-based selection:
python
print(df.loc[0]) # Select first row
Conditional Selection:
python
# Select rows where Age is greater than 30
print(df[df['Age'] > 30])
6. Data Cleaning
Handling Missing Values:
Check for missing values:
python
print(df.isnull().sum())
Drop missing values:
python
df.dropna(inplace=True)
Fill missing values:
python
df.fillna(0, inplace=True) # Fill with a specific value
Rename Columns:
python
df.rename(columns={'Name': 'Full Name'}, inplace=True)
7. Data Manipulation
Adding a New Column:
python
df['Salary'] = [50000, 60000, 70000]
Deleting Columns:
python
df.drop(columns=['Salary'], inplace=True)
Sorting Data:
python
df.sort_values(by='Age', ascending=False, inplace=True)
8. GroupBy and Aggregation
You can group data by certain columns and apply aggregation functions.
python
grouped = df.groupby('City').mean()
print(grouped)
For more complex operations, you can apply custom aggregation:
python
df.groupby('City').agg({'Age': 'mean', 'Salary': 'sum'})
9. Merging and Joining DataFrames
Merging:
You can merge DataFrames similarly to SQL joins.
python
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
df2 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
merged_df = pd.merge(df1, df2, on='Name')
Concatenating:
Concatenate two DataFrames:
python
df_concat = pd.concat([df1, df2], axis=0) # axis=0 for rows, axis=1 for columns
10. Saving DataFrames
To CSV:
python
df.to_csv('output.csv', index=False)
To Excel:
python
df.to_excel('output.xlsx', index=False)
11. Advanced Topics
Pivot Tables:
python
df.pivot_table(values='Salary', index='City', columns='Age', aggfunc='mean')
Apply Functions: You can apply functions to DataFrame elements.
python
df['Salary'] = df['Salary'].apply(lambda x: x + 1000)
Pandas is a highly flexible tool with a rich API that covers a wide range of data manipulation tasks. This
tutorial covers the most commonly used features to get you started.
I'll break this down by each Pandas topic with 10 examples and then follow up with 15 assessment
questions per topic for practice. I'll first walk through one topic thoroughly, including examples and
assessments.
1. Creating DataFrames
Examples:
Example 1: Creating a DataFrame from a dictionary
python
data = {'Name': ['John', 'Sara', 'Tom'], 'Age': [28, 22, 35]}
df = pd.DataFrame(data)
print(df)
Example 2: Creating a DataFrame with specific indexes
python
data = {'Name': ['John', 'Sara', 'Tom'], 'Age': [28, 22, 35]}
df = pd.DataFrame(data, index=['a', 'b', 'c'])
print(df)
Example 3: Creating a DataFrame from a list of lists
python
data = [['John', 28], ['Sara', 22], ['Tom', 35]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Example 4: Creating a DataFrame from a list of dictionaries
python
data = [{'Name': 'John', 'Age': 28}, {'Name': 'Sara', 'Age': 22}, {'Name': 'Tom', 'Age':
35}]
df = pd.DataFrame(data)
print(df)
Example 5: Creating an empty DataFrame and adding data later
python
df = pd.DataFrame(columns=['Name', 'Age'])
df.loc[0] = ['John', 28]
df.loc[1] = ['Sara', 22]
print(df)
Example 6: Creating a DataFrame from NumPy arrays
python
import numpy as np
data = np.array([['John', 28], ['Sara', 22], ['Tom', 35]])
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Example 7: Creating a DataFrame from a CSV file
python
df = pd.read_csv('people.csv')
print(df)
Example 8: Creating a DataFrame from an Excel file
python
df = pd.read_excel('people.xlsx')
print(df)
Example 9: Creating a DataFrame from a list of tuples
python
data = [('John', 28), ('Sara', 22), ('Tom', 35)]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)
Example 10: Creating a DataFrame from another DataFrame
python
df1 = pd.DataFrame({'Name': ['John', 'Sara'], 'Age': [28, 22]})
df2 = pd.DataFrame(df1)
print(df2)
Assessments (15 Questions)
1. Create a DataFrame from a dictionary with the keys `['Name', 'Age']` and values `[['John',
'Tom'], [28, 35]]`.
2. Add a column `['City']` with values `['New York', 'Chicago']` to the DataFrame created in
Question 1.
3. Convert a list of lists `[['John', 28], ['Sara', 22], ['Tom', 35]]` into a DataFrame with column
names `['Name', 'Age']`.
4. Create a DataFrame from an empty DataFrame and manually add a row with values `['John',
28]`.
5. Create a DataFrame from a CSV file `data.csv` where the first row is the header.
6. Use NumPy to create a DataFrame with an array `[['Sara', 22], ['Tom', 35]]` and assign
column names.
7. Convert a dictionary `{'Name': ['Sara', 'Tom'], 'Age': [22, 35]}` into a DataFrame with
specified row indexes `['row1', 'row2']`.
8. Use a list of tuples `('Sara', 22)` and `('Tom', 35)` to create a DataFrame with column names
`['Name', 'Age']`.
9. Create a DataFrame from a list of dictionaries `{'Name': 'Sara', 'Age': 22}` and `{'Name':
'Tom', 'Age': 35}`.
10. Add a new column called `['Gender']` with values `['Female', 'Male']` to an existing
DataFrame.
11. Save the DataFrame created in Question 10 to a CSV file.
12. Create a DataFrame using an Excel file named `people_data.xlsx`.
13. Create a DataFrame by merging two existing DataFrames.
14. Load a DataFrame from a JSON file.
15. Create a DataFrame from another DataFrame and modify one of its columns.
2. Viewing Data
This topic focuses on how to explore and examine a DataFrame, which is crucial in data analysis. Let's
walk through some common methods for viewing the contents of a DataFrame, and then I'll provide
assessment questions afterward.
Examples
Example 1: Viewing the first few rows with `head()`
python
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 22]}
df = pd.DataFrame(data)
# View the first 3 rows
print(df.head(3))
Example 2: Viewing the last few rows with `tail()`
python
# View the last 2 rows
print(df.tail(2))
Example 3: Checking the shape of the DataFrame
python
# Get the number of rows and columns
print(df.shape)
Example 4: Getting information about the DataFrame using `info()`
python
# Summary info about the DataFrame
print(df.info())
Example 5: Viewing basic statistics with `describe()`
python
# Get statistics of numeric columns
print(df.describe())
Example 6: Viewing the index of the DataFrame
python
# Get the index of the DataFrame
print(df.index)
Example 7: Viewing the column names of the DataFrame
python
# Get column names
print(df.columns)
Example 8: Viewing data types of each column
python
# Get data types of each column
print(df.dtypes)
Example 9: Viewing specific elements by position using `iloc[]`
python
# Select the element in the first row and first column
print(df.iloc[0, 0])
Example 10: Viewing specific rows by label using `loc[]`
python
# Select the row where the index is 0
print(df.loc[0])
Assessments (15 Questions)
1. Using the `head()` function, display the first 4 rows of a DataFrame.
2. Use the `tail()` function to display the last 3 rows of a DataFrame.
3. How can you find out how many rows and columns are in a DataFrame?
4. Which function would you use to get a summary of data types and memory usage of a
DataFrame?
5. Get a statistical summary of all numeric columns in a DataFrame.
6. Display the names of all columns in a DataFrame.
7. How would you access the index of a DataFrame?
8. Write a function to get the data type of each column in a DataFrame.
9. How can you extract the element from the second row and second column using `iloc[]`?
10. Use the `loc[]` function to get the row where the index is `2`.
11. How can you view the first row of a DataFrame by position?
12. How would you select the last row of a DataFrame without using `tail()`?
13. Write a function to check if there are any missing values in the DataFrame.
14. How can you view the first 5 rows of only one column (e.g., 'Age')?
15. Get the maximum and minimum values for a numeric column using a DataFrame method.
3. Selecting Data
In this section, we'll explore different ways to select data from a DataFrame, whether it’s selecting
columns, rows, or subsets based on certain conditions.
Examples
Example 1: Selecting a single column
python
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Select the 'Name' column
print(df['Name'])
Example 2: Selecting multiple columns
python
# Select the 'Name' and 'City' columns
print(df[['Name', 'City']])
Example 3: Selecting rows by index using `iloc[]`
python
# Select the first row
print(df.iloc[0])
# Select the first two rows
print(df.iloc[0:2])
Example 4: Selecting rows by label using `loc[]`
python
# Select the row with index 1
print(df.loc[1])
Example 5: Selecting rows where a column’s value meets a condition
python
# Select rows where 'Age' is greater than 28
print(df[df['Age'] > 28])
Example 6: Selecting rows where multiple conditions are met
python
# Select rows where 'Age' > 28 and 'City' is 'Chicago'
print(df[(df['Age'] > 28) & (df['City'] == 'Chicago')])
Example 7: Selecting rows based on string conditions
python
# Select rows where 'Name' starts with 'A'
print(df[df['Name'].str.startswith('A')])
Example 8: Using `isin()` to filter rows
python
# Select rows where 'City' is either 'New York' or 'Chicago'
print(df[df['City'].isin(['New York', 'Chicago'])])
Example 9: Selecting a subset of the DataFrame based on rows and columns
python
# Select the first two rows and 'Name' and 'Age' columns
print(df.loc[0:1, ['Name', 'Age']])
Example 10: Selecting specific rows and columns using `iloc[]`
python
# Select the element in the second row and third column
print(df.iloc[1, 2])
# Select the first two rows and first two columns
print(df.iloc[0:2, 0:2])
Assessments (15 Questions)
1. Select the column 'Age' from a DataFrame.
2. How would you select multiple columns, say 'Age' and 'City', from a DataFrame?
3. Write code to select the third row from a DataFrame using `iloc[]`.
4. How can you select the first three rows using `iloc[]`?
5. Write a function to select the row with index 2 using `loc[]`.
6. How can you select rows where the 'Age' column is less than 30?
7. How would you select rows where 'City' is 'New York' and 'Age' is greater than 30?
8. Write code to select rows where 'Name' ends with the letter 'e'.
9. Use the `isin()` method to filter rows where 'City' is 'Chicago' or 'Los Angeles'.
10. How would you select specific rows (first and second) and specific columns ('Name' and 'Age')?
11. Select the element in the third row and second column using `iloc[]`.
12. How would you select rows where the 'Age' column is between 25 and 35?
13. Write code to select rows where 'Name' contains the substring 'Bob'.
14. How can you select the first two rows and all columns from a DataFrame?
15. Write a function to filter rows where 'Age' is greater than 25 and 'Name' does not start with 'C'.
4. Handling Missing Data
Dealing with missing or null data is essential when working with real-world datasets. Pandas provides
various tools to handle missing values, such as detecting them, filling them, or removing them.
Examples
Example 1: Detecting missing values with `isna()`
python
import pandas as pd
data = {'Name': ['Alice', 'Bob', None],
'Age': [25, None, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Detect missing values
print(df.isna())
Example 2: Detecting missing values with `isnull()`
python
# Alternative way to detect missing values
print(df.isnull())
Example 3: Counting missing values in each column
python
# Count the number of missing values in each column
print(df.isna().sum())
Example 4: Dropping rows with missing values
python
# Drop rows where any value is missing
df_dropped = df.dropna()
print(df_dropped)
Example 5: Dropping columns with missing values
python
# Drop columns where any value is missing
df_dropped_col = df.dropna(axis=1)
print(df_dropped_col)
Example 6: Filling missing values with a constant
python
# Fill missing values with a specific value
df_filled = df.fillna(0)
print(df_filled)
Example 7: Filling missing values in specific columns
python
# Fill missing values in the 'Age' column with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
print(df)
Example 8: Forward filling missing values
python
# Fill missing values using the previous non-null value (forward fill)
df_ffill = df.fillna(method='ffill')
print(df_ffill)
Example 9: Backward filling missing values
python
# Fill missing values using the next non-null value (backward fill)
df_bfill = df.fillna(method='bfill')
print(df_bfill)
Example 10: Interpolating missing values
python
# Interpolate missing numeric values
df['Age'] = df['Age'].interpolate()
print(df)
Assessments (15 Questions)
1. How would you detect missing values in a DataFrame using Pandas?
2. Write a function to count missing values in each column of a DataFrame.
3. How can you drop all rows that have at least one missing value?
4. How would you drop columns that contain missing values?
5. Write code to fill missing values with the constant `0`.
6. How can you replace missing values in the 'Age' column with the mean of that column?
7. Write code to forward-fill missing values in a DataFrame.
8. How can you backward-fill missing values in a DataFrame?
9. Use the `interpolate()` function to fill missing values in a numeric column.
10. Write code to fill missing values in a column with the median of that column.
11. How would you fill missing values in the 'City' column with the string 'Unknown'?
12. Write a function to drop rows where all values are missing.
13. How can you fill missing values in a DataFrame using the previous valid value from the same
row?
14. How would you replace missing values in a column with the mode (most frequent value) of that
column?
15. Write code to detect and count missing values in a specific column (e.g., 'Name').
5. Data Aggregation and Grouping
Data aggregation and grouping are fundamental for summarizing and analyzing data in a structured
way. In Pandas, you can use the `groupby()` function to group data and then apply aggregate
functions like `sum()`, `mean()`, etc.
Examples
Example 1: Grouping data by a single column
python
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
'Age': [25, 30, 35, 28, 32],
'Score': [85, 90, 88, 92, 79]
}
df = pd.DataFrame(data)
# Group by 'Name' and calculate the mean of 'Age' and 'Score'
grouped = df.groupby('Name').mean()
print(grouped)
Example 2: Grouping by multiple columns
python
# Group by 'Name' and 'Age', and calculate the sum of 'Score'
grouped_multi = df.groupby(['Name', 'Age']).sum()
print(grouped_multi)
Example 3: Using aggregation functions with `agg()`
python
# Group by 'Name' and aggregate 'Score' using multiple functions
agg_result = df.groupby('Name').agg({'Score': ['mean', 'max', 'min']})
print(agg_result)
Example 4: Counting the number of occurrences
python
# Count the number of occurrences of each name
count_result = df['Name'].value_counts()
print(count_result)
Example 5: Grouping and applying custom functions
python
# Define a custom function to calculate range
def age_range(x):
return x.max() - x.min()
# Group by 'Name' and apply the custom age_range function
age_range_result = df.groupby('Name')['Age'].agg(age_range)
print(age_range_result)
Example 6: Resetting index after grouping
python
# Group by 'Name' and get mean, then reset index
reset_result = df.groupby('Name').mean().reset_index()
print(reset_result)
Example 7: Grouping and filtering groups
python
# Group by 'Name' and filter groups where the mean Score is greater than 85
filtered = df.groupby('Name').filter(lambda x: x['Score'].mean() > 85)
print(filtered)
Example 8: Grouping with multiple aggregations
python
# Group by 'Name' and calculate the sum and mean of 'Score'
agg_multi = df.groupby('Name').agg({'Score': ['sum', 'mean']})
print(agg_multi)
Example 9: Using `size()` to count the number of entries per group
python
# Count the number of entries per name
size_result = df.groupby('Name').size()
print(size_result)
Example 10: Pivot tables as an alternative to grouping
python
# Create a pivot table for Score based on Name and Age
pivot_table = df.pivot_table(values='Score', index='Name', columns='Age', aggfunc='mean')
print(pivot_table)
Assessments (15 Questions)
1. How would you group a DataFrame by a single column and calculate the mean of another
column?
2. Write code to group data by two columns and calculate the sum of a third column.
3. How can you apply multiple aggregation functions to a grouped DataFrame using `agg()`?
4. Write a function to count the number of occurrences of each unique value in a column.
5. How would you group by a column and apply a custom aggregation function?
6. Write code to reset the index of a grouped DataFrame.
7. How can you filter groups based on the mean of a column after grouping?
8. Write code to create a pivot table that summarizes data based on two columns.
9. How would you group a DataFrame and count the size of each group?
10. Write a function to find the maximum and minimum values of a numeric column after grouping
by a categorical column.
11. How can you group by multiple columns and calculate the average of one column?
12. Write code to create a new DataFrame from the aggregation results of a grouped DataFrame.
13. How would you create a DataFrame that only includes groups where the total count is greater
than 1?
14. Write code to summarize data by calculating the sum and mean for multiple columns in a
grouped DataFrame.
15. How can you create a summary table showing the average score for each name in the original
DataFrame?
6. Merging and Joining DataFrames
Merging and joining DataFrames are essential skills in data manipulation, allowing you to combine
datasets based on common keys or indices. Pandas provides various functions for these operations,
including `merge()`, `join()`, and `concat()`.
Examples
Example 1: Merging DataFrames using `merge()`
python
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
'ID': [1, 2, 4],
'Age': [25, 30, 35]
})
# Merge the DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
Example 2: Merging with different types of joins
python
# Left join
left_join = pd.merge(df1, df2, on='ID', how='left')
print(left_join)
# Right join
right_join = pd.merge(df1, df2, on='ID', how='right')
print(right_join)
# Outer join
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join)
Example 3: Joining DataFrames using `join()`
python
# Creating two DataFrames with indices
df3 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])
df4 = pd.DataFrame({
'City': ['New York', 'Los Angeles']
}, index=[1, 3])
# Join the DataFrames
joined_df = df3.join(df4, how='outer')
print(joined_df)
Example 4: Concatenating DataFrames using `concat()`
python
# Creating two DataFrames for concatenation
df5 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})
df6 = pd.DataFrame({
'ID': [3, 4],
'Name': ['Charlie', 'David']
})
# Concatenate the DataFrames
concatenated_df = pd.concat([df5, df6], ignore_index=True)
print(concatenated_df)
Example 5: Concatenating along columns
python
# Concatenate DataFrames along columns
df7 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})
df8 = pd.DataFrame({
'Age': [25, 30]
})
concatenated_columns = pd.concat([df7, df8], axis=1)
print(concatenated_columns)
Example 6: Merging on multiple keys
python
# Creating two DataFrames
df9 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'City': ['NY', 'LA', 'SF']
})
df10 = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
})
# Merge on both 'ID' and 'Name'
merged_multi_key = pd.merge(df9, df10, on=['ID', 'Name'])
print(merged_multi_key)
Example 7: Using suffixes to differentiate columns
python
# Merging with suffixes for overlapping column names
df11 = pd.DataFrame({
'ID': [1, 2],
'Value': [100, 200]
})
df12 = pd.DataFrame({
'ID': [1, 2],
'Value': [300, 400]
})
merged_suffixes = pd.merge(df11, df12, on='ID', suffixes=('_left', '_right'))
print(merged_suffixes)
Example 8: Joining with different indices
python
# Create DataFrames with different indices
df13 = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30]
}, index=[1, 2])
df14 = pd.DataFrame({
'City': ['New York', 'Los Angeles']
}, index=[2, 3])
# Join the DataFrames with different indices
joined_diff_index = df13.join(df14, how='outer')
print(joined_diff_index)
Example 9: Using `concat()` with keys
python
# Using concat with keys to create a hierarchical index
df15 = pd.DataFrame({
'ID': [1, 2],
'Name': ['Alice', 'Bob']
})
df16 = pd.DataFrame({
'ID': [3, 4],
'Name': ['Charlie', 'David']
})
concatenated_keys = pd.concat([df15, df16], keys=['Group1', 'Group2'])
print(concatenated_keys)
Example 10: Merging with indicator
python
# Merge with indicator to see the source of each row
merged_with_indicator = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(merged_with_indicator)
Assessments (15 Questions)
1. Write code to merge two DataFrames on a single key.
2. How would you perform a left join between two DataFrames?
3. Write code to join two DataFrames using the `join()` method.
4. How can you concatenate two DataFrames vertically?
5. Write code to concatenate two DataFrames horizontally.
6. How can you merge two DataFrames on multiple keys?
7. Write a function to merge two DataFrames and include suffixes for overlapping column names.
8. How would you join two DataFrames with different indices?
9. Write code to use `concat()` with keys to create a hierarchical index.
10. How can you merge two DataFrames and see the source of each row?
11. Write code to perform an outer join on two DataFrames.
12. How would you concatenate a list of DataFrames into a single DataFrame?
13. Write a function to merge two DataFrames and drop any duplicates that arise.
14. How can you merge two DataFrames without any duplicate rows?
15. Write code to merge two DataFrames and fill missing values in the resulting DataFrame.
7. Data Visualization with Pandas
Data visualization is an essential part of data analysis, allowing you to see trends, patterns, and
insights that may not be immediately obvious from raw data. Pandas integrates well with Matplotlib
and Seaborn for plotting.
Examples
Example 1: Basic line plot
python
import pandas as pd
import matplotlib.pyplot as plt
# Create a simple DataFrame
data = {'Year': [2018, 2019, 2020, 2021, 2022],
'Sales': [150, 200, 250, 300, 350]}
df = pd.DataFrame(data)
# Basic line plot
df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Example 2: Bar plot
python
# Bar plot
df.plot(x='Year', y='Sales', kind='bar', title='Sales Over Years')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Example 3: Histogram
python
# Create a DataFrame with random data
data2 = {'Scores': [82, 85, 87, 90, 95, 70, 75, 80, 88, 92]}
df2 = pd.DataFrame(data2)
# Histogram of Scores
df2['Scores'].plot(kind='hist', bins=5, title='Score Distribution')
plt.xlabel('Scores')
plt.show()
Example 4: Scatter plot
python
# Create a DataFrame for scatter plot
data3 = {
'Height': [5.1, 5.5, 5.8, 6.0, 5.7],
'Weight': [100, 150, 130, 165, 155]
}
df3 = pd.DataFrame(data3)
# Scatter plot
df3.plot(kind='scatter', x='Height', y='Weight', title='Height vs Weight')
plt.xlabel('Height (in feet)')
plt.ylabel('Weight (in lbs)')
plt.show()
Example 5: Box plot
python
# Box plot
df2['Scores'].plot(kind='box', title='Box Plot of Scores')
plt.ylabel('Scores')
plt.show()
Example 6: Area plot
python
# Area plot
df.plot(x='Year', y='Sales', kind='area', alpha=0.4, title='Sales Area Plot')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Example 7: Pie chart
python
# Create a DataFrame for pie chart
data4 = {'Fruits': ['Apple', 'Banana', 'Cherry'],
'Quantity': [30, 25, 45]}
df4 = pd.DataFrame(data4)
# Pie chart
df4.plot.pie(y='Quantity', labels=df4['Fruits'], autopct='%1.1f%%', title='Fruit
Distribution', legend=False)
plt.show()
Example 8: Customizing plots
python
# Line plot with customization
ax = df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', color='orange')
ax.set_xlabel('Year')
ax.set_ylabel('Sales')
ax.grid()
plt.show()
Example 9: Using Seaborn for enhanced visualization
python
import seaborn as sns
# Scatter plot using Seaborn
sns.scatterplot(data=df3, x='Height', y='Weight', hue='Weight', palette='deep',
size='Weight', sizes=(20, 200))
plt.title('Height vs Weight with Seaborn')
plt.show()
Example 10: Pairplot with Seaborn
python
# Creating a DataFrame with more data for pairplot
data5 = {
'Height': [5.1, 5.5, 5.8, 6.0, 5.7],
'Weight': [100, 150, 130, 165, 155],
'Age': [20, 22, 23, 25, 30]
}
df5 = pd.DataFrame(data5)
# Pairplot using Seaborn
sns.pairplot(df5)
plt.show()
Assessments (15 Questions)
1. Write code to create a line plot for sales data over years.
2. How would you create a bar plot for the same sales data?
3. Write code to plot a histogram of randomly generated scores.
4. How can you create a scatter plot to visualize the relationship between height and weight?
5. Write a function to create a box plot for a set of scores.
6. How would you create an area plot for sales data?
7. Write code to create a pie chart showing the distribution of fruit quantities.
8. How can you customize a line plot with specific colors and gridlines?
9. Write code to create a scatter plot using Seaborn, including a color palette.
10. How can you create a pairplot to visualize relationships in a DataFrame with multiple numeric
columns?
11. Write a function to plot a histogram with a specified number of bins.
12. How would you create a customized pie chart with percentages displayed?
13. Write code to create a scatter plot with varying point sizes based on another column.
14. How can you add labels to axes and a title to a plot in Matplotlib?
15. Write a function to visualize the distribution of a DataFrame column using a box plot.
8. Time Series Analysis with Pandas
Time series analysis involves analyzing data points collected or recorded at specific time intervals.
Pandas provides powerful tools for working with time series data, making it easy to manipulate,
visualize, and analyze trends over time.
Examples
Example 1: Creating a Time Series
python
import pandas as pd
# Create a date range
date_range = pd.date_range(start='2020-01-01', periods=5, freq='D')
# Create a DataFrame with a time series
data = {'Value': [10, 20, 15, 25, 30]}
df = pd.DataFrame(data, index=date_range)
print(df)
Example 2: Plotting a Time Series
python
# Plot the time series
df.plot(title='Simple Time Series')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
Example 3: Resampling Time Series Data
python
# Create a time series with random data
date_range = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = {'Value': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]}
df = pd.DataFrame(data, index=date_range)
# Resample to weekly frequency and calculate the mean
weekly_mean = df.resample('W').mean()
print(weekly_mean)
Example 4: Shifting Time Series Data
python
# Shift the time series data by 1 day
df_shifted = df.shift(1)
print(df_shifted)
Example 5: Rolling Window Calculations
python
# Calculate the rolling mean with a window of 3 days
rolling_mean = df['Value'].rolling(window=3).mean()
print(rolling_mean)
Example 6: Time Series Indexing
python
# Select data for a specific date
specific_date = df.loc['2020-01-03']
print(specific_date)
Example 7: Handling Missing Dates
python
# Create a time series with missing dates
date_range = pd.date_range(start='2020-01-01', end='2020-01-10', freq='D')
data = {'Value': [10, None, 20, 25, None, 30, 35, None, 50, 55]}
df_missing = pd.DataFrame(data, index=date_range)
# Fill missing values using forward fill
df_filled = df_missing.fillna(method='ffill')
print(df_filled)
Example 8: Time Series Decomposition
python
from statsmodels.tsa.seasonal import seasonal_decompose
# Decompose the time series
result = seasonal_decompose(df['Value'], model='additive', period=2)
result.plot()
plt.show()
Example 9: Date Range Generation
python
# Create a date range with business days
business_days = pd.date_range(start='2020-01-01', end='2020-01-10', freq='B')
print(business_days)
Example 10: Converting a Column to DateTime
python
# Create a DataFrame with dates as strings
data2 = {'Date': ['2020-01-01', '2020-01-02', '2020-01-03'],
'Value': [10, 20, 30]}
df2 = pd.DataFrame(data2)
# Convert the 'Date' column to datetime
df2['Date'] = pd.to_datetime(df2['Date'])
print(df2)
Assessments (15 Questions)
1. How would you create a time series DataFrame with daily frequency?
2. Write code to plot a time series graph.
3. How can you resample time series data to a weekly frequency and calculate the mean?
4. Write a function to shift a time series by a specified number of periods.
5. How would you calculate the rolling mean over a time series with a window of 3 days?
6. Write code to select data for a specific date in a time series DataFrame.
7. How can you handle missing values in a time series using forward fill?
8. Write code to decompose a time series into its trend, seasonal, and residual components.
9. How would you generate a date range with business days only?
10. Write code to convert a column of strings into datetime format in a DataFrame.
11. How can you plot a time series with different colors for different time periods?
12. Write a function to calculate the cumulative sum of a time series.
13. How would you extract the month and year from a date column in a time series DataFrame?
14. Write code to resample a time series to a monthly frequency and calculate the sum.
15. How can you check for missing dates in a time series?
9. Data Cleaning and Preparation with Pandas
Data cleaning and preparation are essential steps in the data analysis process, ensuring that the data
is accurate, complete, and ready for analysis. Pandas offers various functions and methods for
cleaning, transforming, and preparing data.
Examples
Example 1: Removing Duplicates
python
import pandas as pd
# Create a DataFrame with duplicate rows
data = {
'Name': ['Alice', 'Bob', 'Alice', 'Charlie'],
'Age': [25, 30, 25, 35]
}
df = pd.DataFrame(data)
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)
Example 2: Filling Missing Values
python
# Create a DataFrame with missing values
data2 = {
'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 35, 40]
}
df2 = pd.DataFrame(data2)
# Fill missing values with a specified value
df_filled = df2.fillna({'Name': 'Unknown', 'Age': df2['Age'].mean()})
print(df_filled)
Example 3: Dropping Missing Values
python
# Drop rows with missing values
df_dropped = df2.dropna()
print(df_dropped)
Example 4: Renaming Columns
python
# Rename columns
df.rename(columns={'Name': 'Full Name', 'Age': 'Age in Years'}, inplace=True)
print(df)
Example 5: Changing Data Types
python
# Create a DataFrame with mixed data types
data3 = {
'ID': ['1', '2', '3'],
'Age': ['25', '30', '35']
}
df3 = pd.DataFrame(data3)
# Convert data types
df3['ID'] = df3['ID'].astype(int)
df3['Age'] = df3['Age'].astype(int)
print(df3.dtypes)
Example 6: String Manipulation
python
# Create a DataFrame with strings
data4 = {
'Names': [' Alice ', 'Bob ', ' Charlie ']
}
df4 = pd.DataFrame(data4)
# Strip whitespace from strings
df4['Names'] = df4['Names'].str.strip()
print(df4)
Example 7: Changing Indexes
python
# Create a DataFrame
data5 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]
}
df5 = pd.DataFrame(data5)
# Set 'Name' as the index
df5.set_index('Name', inplace=True)
print(df5)
Example 8: Filtering Data
python
# Filter data based on a condition
filtered_df = df5[df5['Age'] > 30]
print(filtered_df)
Example 9: Categorizing Data
python
# Create a DataFrame with categorical data
data6 = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 70, 95]
}
df6 = pd.DataFrame(data6)
# Convert 'Score' to categorical
df6['Grade'] = pd.cut(df6['Score'], bins=[0, 60, 80, 100], labels=['F', 'C', 'A'])
print(df6)
Example 10: Merging and Concatenating DataFrames
python
# Create two DataFrames
df7 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df8 = pd.DataFrame({'ID': [2, 3], 'Name': ['Bob', 'Charlie']})
# Merge the DataFrames
merged_df = pd.merge(df7, df8, on='ID', how='outer')
print(merged_df)
Assessments (15 Questions)
1. Write code to remove duplicate rows from a DataFrame.
2. How can you fill missing values in a DataFrame with a specific value?
3. Write code to drop rows with missing values from a DataFrame.
4. How would you rename columns in a DataFrame?
5. Write code to change the data types of specific columns in a DataFrame.
6. How can you strip whitespace from string entries in a DataFrame column?
7. Write code to change the index of a DataFrame to a specific column.
8. How would you filter a DataFrame to show rows that meet a specific condition?
9. Write code to categorize a numerical column into bins.
10. How can you concatenate two DataFrames vertically?
11. Write a function to check for missing values in a DataFrame.
12. How would you replace specific values in a DataFrame with new values?
13. Write code to sort a DataFrame by a specific column.
14. How can you check the data types of all columns in a DataFrame?
15. Write code to extract a specific column from a DataFrame as a Series.
10. Advanced Data Manipulation with Pandas
Advanced data manipulation techniques are essential for more complex data analysis tasks. This topic
covers advanced features in Pandas that allow you to transform, aggregate, and reshape your data.
Examples
Example 1: Pivot Tables
python
import pandas as pd
# Create a DataFrame for pivot table
data = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'Category': ['A', 'B', 'A', 'B'],
'Sales': [100, 200, 150, 250]
}
df = pd.DataFrame(data)
# Create a pivot table
pivot_table = df.pivot_table(values='Sales', index='Date', columns='Category',
aggfunc='sum')
print(pivot_table)
Example 2: Group By
python
# Group by Category and calculate the total sales
grouped = df.groupby('Category')['Sales'].sum()
print(grouped)
Example 3: Applying Functions
python
# Apply a custom function to calculate the square of sales
df['Sales Squared'] = df['Sales'].apply(lambda x: x ** 2)
print(df)
Example 4: Chaining Methods
python
# Chaining methods for data manipulation
result = (df[df['Sales'] > 100]
.groupby('Category')['Sales']
.sum()
.reset_index())
print(result)
Example 5: Merging with Different Join Types
python
# Create two DataFrames for merging
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Score': [90, 85, 80]})
# Merge DataFrames with inner join
inner_merge = pd.merge(df1, df2, on='ID', how='inner')
print(inner_merge)
Example 6: Concatenating DataFrames
python
# Create two DataFrames
df3 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df4 = pd.DataFrame({'ID': [3, 4], 'Name': ['Charlie', 'David']})
# Concatenate DataFrames
concatenated = pd.concat([df3, df4], ignore_index=True)
print(concatenated)
Example 7: Reshaping Data with Melt
python
# Create a DataFrame to reshape
data2 = {
'ID': [1, 2],
'Math': [90, 80],
'Science': [85, 75]
}
df2 = pd.DataFrame(data2)
# Reshape with melt
melted = df2.melt(id_vars=['ID'], value_vars=['Math', 'Science'], var_name='Subject',
value_name='Score')
print(melted)
Example 8: Using `stack` and `unstack`
python
# Create a multi-index DataFrame
data3 = {
'Category': ['A', 'A', 'B', 'B'],
'Type': ['X', 'Y', 'X', 'Y'],
'Values': [10, 20, 15, 25]
}
df3 = pd.DataFrame(data3).set_index(['Category', 'Type'])
# Use stack
stacked = df3.stack()
print(stacked)
# Use unstack
unstacked = df3.unstack()
print(unstacked)
Example 9: Using `pivot` for Reshaping Data
python
# Create a DataFrame for pivot
data4 = {
'Date': ['2021-01-01', '2021-01-01', '2021-01-02'],
'Product': ['A', 'B', 'A'],
'Sales': [100, 200, 150]
}
df4 = pd.DataFrame(data4)
# Create a pivot table
pivot_result = df4.pivot(index='Date', columns='Product', values='Sales')
print(pivot_result)
Example 10: Time Series Aggregation
python
# Create a time series DataFrame
date_range = pd.date_range(start='2021-01-01', periods=5, freq='D')
data5 = {'Sales': [10, 20, 15, 25, 30]}
df5 = pd.DataFrame(data5, index=date_range)
# Resample to weekly frequency and sum sales
weekly_sales = df5.resample('W').sum()
print(weekly_sales)
Assessments (15 Questions)
1. Write code to create a pivot table summarizing sales by date and category.
2. How can you group a DataFrame by a specific column and calculate the sum?
3. Write a function to apply a custom operation on a DataFrame column.
4. How can you chain multiple methods to filter and group data in one statement?
5. Write code to perform an outer join on two DataFrames based on a common column.
6. How would you concatenate two DataFrames and reset the index?
7. Write code to melt a DataFrame from wide to long format.
8. How can you use `stack` and `unstack` to reshape a multi-index DataFrame?
9. Write code to create a pivot table from a DataFrame with date and product data.
10. How would you aggregate time series data to a monthly frequency and calculate the mean?
11. Write a function to merge two DataFrames with a left join.
12. How can you extract unique values from a specific column in a DataFrame?
13. Write code to sort a DataFrame based on multiple columns.
14. How can you calculate cumulative sums for a DataFrame column?
15. Write code to filter a DataFrame based on a condition from multiple columns.
11. Data Visualization with Pandas
Data visualization is a crucial aspect of data analysis that helps in understanding patterns, trends, and
insights. Pandas provides built-in capabilities for creating basic visualizations, leveraging libraries like
Matplotlib and Seaborn for more advanced visualizations.
Examples
Example 1: Basic Line Plot
python
import pandas as pd
import matplotlib.pyplot as plt
# Create a simple DataFrame
data = {
'Year': [2020, 2021, 2022],
'Sales': [100, 150, 200]
}
df = pd.DataFrame(data)
# Plot a line graph
df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', marker='o')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Example 2: Bar Plot
python
# Create a bar plot
df.plot(x='Year', y='Sales', kind='bar', title='Sales by Year')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.show()
Example 3: Histogram
python
# Create a DataFrame with random data
data2 = {
'Scores': [88, 92, 85, 95, 70, 76, 89, 90, 83, 94]
}
df2 = pd.DataFrame(data2)
# Plot a histogram
df2['Scores'].plot(kind='hist', bins=5, title='Score Distribution')
plt.xlabel('Scores')
plt.show()
Example 4: Scatter Plot
python
# Create another DataFrame
data3 = {
'Height': [150, 160, 170, 180, 190],
'Weight': [50, 60, 70, 80, 90]
}
df3 = pd.DataFrame(data3)
# Create a scatter plot
df3.plot(kind='scatter', x='Height', y='Weight', title='Height vs. Weight')
plt.show()
Example 5: Box Plot
python
# Create a DataFrame with some data
data4 = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Values': [10, 15, 12, 18, 22, 25]
}
df4 = pd.DataFrame(data4)
# Create a box plot
df4.boxplot(column='Values', by='Category')
plt.title('Box Plot of Values by Category')
plt.suptitle('')
plt.xlabel('Category')
plt.ylabel('Values')
plt.show()
Example 6: Area Plot
python
# Create a DataFrame with cumulative data
data5 = {
'Year': [2020, 2021, 2022],
'Sales': [100, 150, 200],
'Expenses': [80, 120, 160]
}
df5 = pd.DataFrame(data5)
# Plot an area plot
df5.plot(x='Year', kind='area', stacked=True, title='Sales and Expenses Over Years')
plt.xlabel('Year')
plt.ylabel('Amount')
plt.show()
Example 7: Pie Chart
python
# Create a pie chart for category distribution
data6 = {
'Category': ['A', 'B', 'C'],
'Values': [30, 50, 20]
}
df6 = pd.DataFrame(data6)
# Plot a pie chart
df6.set_index('Category').plot(kind='pie', y='Values', autopct='%1.1f%%', title='Category
Distribution')
plt.ylabel('')
plt.show()
Example 8: Heatmap with Seaborn
python
import seaborn as sns
# Create a correlation matrix
data7 = {
'A': [1, 2, 3, 4],
'B': [4, 5, 6, 7],
'C': [7, 8, 9, 10]
}
df7 = pd.DataFrame(data7)
# Create a heatmap
sns.heatmap(df7.corr(), annot=True, cmap='coolwarm', title='Correlation Heatmap')
plt.title('Correlation Heatmap')
plt.show()
Example 9: Customizing Plots
python
# Create a line plot with customization
df.plot(x='Year', y='Sales', kind='line', title='Sales Over Years', marker='o',
color='red', linestyle='--')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid()
plt.show()
Example 10: Subplots
python
# Create multiple subplots
fig, axs = plt.subplots(2, 1, figsize=(8, 8))
df.plot(x='Year', y='Sales', kind='line', ax=axs[0], title='Sales Over Years', marker='o')
axs[0].set_ylabel('Sales')
df2['Scores'].plot(kind='hist', bins=5, ax=axs[1], title='Score Distribution')
axs[1].set_xlabel('Scores')
plt.tight_layout()
plt.show()
Assessments (15 Questions)
1. Write code to create a line plot for sales data over several years.
2. How can you create a bar plot from a DataFrame?
3. Write code to generate a histogram of a numerical column.
4. How can you create a scatter plot to analyze the relationship between two variables?
5. Write code to create a box plot for data categorized by another variable.
6. How can you create an area plot to visualize cumulative data over time?
7. Write code to generate a pie chart from categorical data.
8. How can you create a heatmap to visualize correlations between multiple variables?
9. Write code to customize a plot with colors, markers, and line styles.
10. How can you create multiple subplots in a single figure?
11. Write a function to save a plot as a PNG file.
12. How would you plot a rolling average on a time series data?
13. Write code to change the title and labels of a plot after creating it.
14. How can you adjust the size of a plot in Matplotlib?
15. Write code to display a legend in a plot.
12. Working with Time Zones in Pandas
Time zone handling is an essential part of time series analysis, especially when dealing with data from
multiple time zones. Pandas provides robust tools for converting, localizing, and performing
operations on time zone-aware datetime objects.
Examples
Example 1: Localizing a Time Series
python
import pandas as pd
# Create a time series
date_range = pd.date_range(start='2023-01-01', periods=5, freq='D')
ts = pd.Series([10, 15, 20, 25, 30], index=date_range)
# Localize to UTC
ts_utc = ts.tz_localize('UTC')
print(ts_utc)
Example 2: Converting Time Zones
python
# Convert time series to a different time zone
ts_new_york = ts_utc.tz_convert('America/New_York')
print(ts_new_york)
Example 3: Creating Time Zone Aware Timestamps
python
# Create a timestamp with a specific time zone
timestamp = pd.Timestamp('2023-01-01 12:00:00', tz='Europe/London')
print(timestamp)
Example 4: Handling Daylight Saving Time (DST)
python
# Create a time series that includes a DST change
date_range = pd.date_range(start='2023-03-01', periods=10, freq='D')
ts_dst = pd.Series(range(10), index=date_range).dt.tz_localize('Europe/London')
print(ts_dst)
Example 5: Using `pd.to_datetime` with Time Zones
python
# Create a time series from a string with a time zone
ts_from_string = pd.to_datetime('2023-01-01 12:00:00 UTC')
print(ts_from_string)
Example 6: Operations with Time Zone Aware Datetime
python
# Create two time series with different time zones
ts_utc1 = pd.Series([1, 2, 3], index=pd.date_range('2023-01-01', periods=3,
freq='H')).tz_localize('UTC')
ts_utc2 = pd.Series([4, 5, 6], index=pd.date_range('2023-01-01', periods=3,
freq='H')).tz_localize('America/New_York')
# Convert to UTC for operations
ts_utc2_converted = ts_utc2.tz_convert('UTC')
result = ts_utc1 + ts_utc2_converted
print(result)
Example 7: Plotting Time Series with Time Zones
python
import matplotlib.pyplot as plt
# Plotting time series with time zones
ts_utc.plot(title='Time Series in UTC', marker='o')
plt.xlabel('Date')
plt.ylabel('Values')
plt.show()
Example 8: Resampling with Time Zones
python
# Resample time series with time zone
ts_resampled = ts_dst.resample('2D').mean()
print(ts_resampled)
Example 9: Converting Timestamps to Local Time
python
# Convert a timestamp to local time
local_time = timestamp.tz_convert('America/New_York')
print(local_time)
Example 10: Handling Multiple Time Zones
python
# Create a DataFrame with different time zones
data = {
'UTC': pd.date_range('2023-01-01', periods=3, freq='H').tz_localize('UTC'),
'New_York': pd.date_range('2023-01-01', periods=3,
freq='H').tz_localize('America/New_York')
}
df_timezones = pd.DataFrame(data)
print(df_timezones)
Assessments (15 Questions)
1. Write code to localize a naive time series to UTC.
2. How can you convert a time series from UTC to a different time zone?
3. Write code to create a timestamp that is aware of a specific time zone.
4. How would you handle daylight saving time when creating a time series?
5. Write code to create a time series from a string that includes a time zone.
6. How can you perform arithmetic operations on time zone-aware time series?
7. Write code to plot a time series that includes time zone information.
8. How would you resample a time series that includes time zone data?
9. Write code to convert a timestamp from one time zone to another.
10. How can you create a DataFrame that includes timestamps in multiple time zones?
11. Write a function to check if a timestamp is in a specific time zone.
12. How would you extract the timezone information from a timestamp?
13. Write code to calculate the difference between two time zone-aware timestamps.
14. How can you handle missing values in a time series that includes time zones?
15. Write code to visualize a time series in different time zones on the same plot.
13. Data Aggregation and Grouping with Pandas
Data aggregation and grouping allow you to summarize and analyze large datasets efficiently. Pandas
provides powerful methods for grouping data, applying functions to groups, and aggregating results.
Examples
Example 1: Basic GroupBy Operation
python
import pandas as pd
# Create a DataFrame
data = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Values': [10, 15, 10, 25, 30, 35]
}
df = pd.DataFrame(data)
# Group by 'Category' and calculate the sum
grouped_sum = df.groupby('Category').sum()
print(grouped_sum)
Example 2: Grouping with Multiple Aggregations
python
# Group by 'Category' and apply multiple aggregations
grouped_agg = df.groupby('Category').agg({'Values': ['sum', 'mean', 'max']})
print(grouped_agg)
Example 3: Grouping with a Custom Function
python
# Define a custom function
def custom_function(x):
return x.max() - x.min()
# Group by 'Category' and apply the custom function
grouped_custom = df.groupby('Category')['Values'].agg(custom_function)
print(grouped_custom)
Example 4: Grouping with DateTime Index
python
# Create a DataFrame with a DateTime index
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df_time = pd.DataFrame({'Values': range(10)}, index=date_rng)
# Group by day and calculate the sum
daily_sum = df_time.resample('D').sum()
print(daily_sum)
Example 5: Grouping with `transform`
python
# Using transform to get the mean for each group
df['Mean'] = df.groupby('Category')['Values'].transform('mean')
print(df)
Example 6: Grouping with Filtering
python
# Filter groups based on a condition
filtered_groups = df.groupby('Category').filter(lambda x: x['Values'].sum() > 40)
print(filtered_groups)
Example 7: Grouping by Multiple Columns
python
# Create a DataFrame with multiple columns
data2 = {
'Category': ['A', 'A', 'B', 'B', 'C', 'C'],
'Subcategory': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
'Values': [10, 15, 10, 25, 30, 35]
}
df_multi = pd.DataFrame(data2)
# Group by 'Category' and 'Subcategory'
grouped_multi = df_multi.groupby(['Category', 'Subcategory']).sum()
print(grouped_multi)
Example 8: Pivot Table
python
# Create a pivot table from the DataFrame
pivot_table = df_multi.pivot_table(values='Values', index='Category',
columns='Subcategory', aggfunc='sum', fill_value=0)
print(pivot_table)
Example 9: GroupBy with Sort
python
# Group by 'Category' and sort by the sum of values
sorted_groups = df.groupby('Category')['Values'].sum().sort_values(ascending=False)
print(sorted_groups)
Example 10: Combining Aggregation and Transformation
python
# Combine aggregation and transformation
df['Normalized'] = df.groupby('Category')['Values'].transform(lambda x: (x - x.mean()) /
x.std())
print(df)
Assessments (15 Questions)
1. Write code to group a DataFrame by a single column and calculate the sum of another column.
2. How can you apply multiple aggregation functions to a grouped DataFrame?
3. Write a custom aggregation function and apply it to a group.
4. How would you group a DataFrame with a DateTime index and calculate the daily sum?
5. Write code to use the `transform` function after grouping.
6. How can you filter groups based on a specific condition?
7. Write code to group a DataFrame by multiple columns and sum the values.
8. How can you create a pivot table from a DataFrame?
9. Write code to sort groups based on the sum of values in descending order.
10. How would you normalize values within each group using `transform`?
11. Write a function to calculate the range of values in each group.
12. How can you reset the index of a grouped DataFrame?
13. Write code to create a grouped DataFrame and visualize it with a bar plot.
14. How can you group by a time series and calculate the weekly average?
15. Write code to group data and export the summary to a CSV file.
14. Saving DataFrames
Pandas provides various methods to save DataFrames to different file formats. This is essential for
preserving data analysis results, sharing with others, or loading them in other applications.
Examples
Example 1: Saving to CSV
python
import pandas as pd
# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']
}
df = pd.DataFrame(data)
# Save DataFrame to a CSV file
df.to_csv('people.csv', index=False)
Example 2: Saving to Excel
python
# Save DataFrame to an Excel file
df.to_excel('people.xlsx', index=False)
Example 3: Saving to JSON
python
# Save DataFrame to a JSON file
df.to_json('people.json', orient='records', lines=True)
Example 4: Saving to Parquet
python
# Save DataFrame to a Parquet file
df.to_parquet('people.parquet')
Example 5: Saving to HDF5
python
# Save DataFrame to an HDF5 file
df.to_hdf('people.h5', key='people', mode='w')
Example 6: Saving with Compression
python
# Save DataFrame to a compressed CSV file
df.to_csv('people_compressed.csv.gz', index=False, compression='gzip')
Example 7: Saving with a Custom Separator
python
# Save DataFrame to a CSV file with a custom separator
df.to_csv('people_semicolon.csv', sep=';', index=False)
Example 8: Saving Selected Columns
python
# Save only selected columns to a CSV file
df[['Name', 'City']].to_csv('people_selected.csv', index=False)
Example 9: Appending to an Existing CSV
python
# Append DataFrame to an existing CSV file
new_data = {
'Name': ['David', 'Eva'],
'Age': [40, 45],
'City': ['Houston', 'Phoenix']
}
df_new = pd.DataFrame(new_data)
df_new.to_csv('people.csv', mode='a', header=False, index=False)
Example 10: Saving with a Specific Encoding
python
# Save DataFrame to a CSV file with a specific encoding
df.to_csv('people_utf8.csv', index=False, encoding='utf-8')
Assessments (15 Questions)
1. Write code to save a DataFrame to a CSV file.
2. How can you save a DataFrame to an Excel file?
3. Write code to save a DataFrame in JSON format.
4. How can you save a DataFrame to a Parquet file?
5. Write code to save a DataFrame to an HDF5 file.
6. How would you save a DataFrame to a compressed CSV file?
7. Write code to save a DataFrame with a custom separator (e.g., semicolon).
8. How can you save only specific columns from a DataFrame to a CSV file?
9. Write code to append data to an existing CSV file.
10. How can you save a DataFrame using a specific character encoding?
11. Write a function to save a DataFrame to a given file format based on user input.
12. How would you check if a file already exists before saving a DataFrame?
13. Write code to read a saved CSV file back into a DataFrame.
14. How can you save multiple DataFrames to different sheets in an Excel file?
15. Write code to handle exceptions when saving a DataFrame to a file.