3.
Combining Datasets: Merge and Join, Aggregation and Grouping, Pivot Tables, Vectorized String
Operations, Working with Time Series, High-Performance Pandas: eval() and query()
1. Combining Datasets: Merge and Join
Pandas provides several methods to combine multiple datasets. The most common are
merge() and join(): merge() matches rows on common columns (or indices), while join()
matches rows on the index by default.
Merge:
The merge() function is used to combine two DataFrames based on common columns or
indices. It is similar to SQL joins (e.g., inner, left, right, and outer joins).
Example:
import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 35]
})

# Merging the DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
Output:
   ID   Name  Age
0   1  Alice   25
1   2    Bob   30
Explanation:
The merge() function combines df1 and df2 on the common column ID.
The how='inner' argument specifies an inner join, meaning only the rows whose ID
appears in both DataFrames are kept. ID 3 (only in df1) and ID 4 (only in df2) are dropped.
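The other join types mentioned above can be sketched with the same two DataFrames. This is a minimal illustration, not part of the original example; the names left_df and outer_df are chosen here for clarity.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})

# Left join: keep every row of df1; Age is NaN where df2 has no matching ID
left_df = pd.merge(df1, df2, on='ID', how='left')

# Outer join: keep rows from both sides; indicator=True adds a '_merge'
# column recording whether each row came from the left, the right, or both
outer_df = pd.merge(df1, df2, on='ID', how='outer', indicator=True)

print(left_df)
print(outer_df)
```

With how='left' the row for ID 3 survives with a missing Age, and with how='outer' both ID 3 and ID 4 appear.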
Join:
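The join() method combines DataFrames on their index by default (a key column of the
calling DataFrame can be used instead via the on parameter). It performs a left join unless
how is specified, and it is most convenient when the two DataFrames share an index.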
Example:
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=[1, 2, 3])
df2 = pd.DataFrame({
    'Country': ['USA', 'Canada', 'UK']
}, index=[1, 2, 3])

# Using join to combine the DataFrames based on index
joined_df = df1.join(df2)
print(joined_df)
Output:
      Name  Age Country
1    Alice   25     USA
2      Bob   30  Canada
3  Charlie   35      UK
Explanation:
The join() function combines df1 and df2 using their index.
The resulting DataFrame contains the columns from both df1 and df2.
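When the indexes do not match exactly, the how argument decides which rows survive. The sketch below uses two small hypothetical DataFrames (people and countries, not from the example above) and also shows what appears to be the equivalent merge() call on the index.

```python
import pandas as pd

# Hypothetical data: 'countries' has no entry for index label 2
people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie']}, index=[1, 2, 3])
countries = pd.DataFrame({'Country': ['USA', 'UK']}, index=[1, 3])

# Default (left) join: Bob is kept, with a missing Country
left_joined = people.join(countries)

# Inner join: only index labels present in both DataFrames are kept
inner_joined = people.join(countries, how='inner')

# The same inner join written with merge() on the indexes
via_merge = pd.merge(people, countries,
                     left_index=True, right_index=True, how='inner')

print(left_joined)
print(inner_joined)
```

A reasonable rule of thumb is to use join() when the keys are already in the index and merge() when they are ordinary columns.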
2. Aggregation and Grouping
Aggregation and grouping allow you to perform calculations (such as sum, mean, or count)
on subsets of your data.
GroupBy:
The groupby() function splits the data into groups based on some criteria, applies a function
to each group, and then combines the results.
Example:
import pandas as pd

df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Points': [10, 20, 15, 25, 30, 35]
})

# Grouping by 'Team' and calculating the sum of 'Points'
grouped = df.groupby('Team').sum()
print(grouped)
Output:
      Points
Team
A         55
B         80
Explanation:
The groupby() function groups the data by the column Team.
The sum() function is applied to each group to calculate the total Points for each
team.
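Several aggregations can be computed in one pass with agg(). The following is a small sketch on the same data; the variable name summary is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Team': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Points': [10, 20, 15, 25, 30, 35]
})

# Select the 'Points' column, then compute sum, mean, and count per team
summary = df.groupby('Team')['Points'].agg(['sum', 'mean', 'count'])
print(summary)
```

The result has one row per team and one column per aggregation, so the totals (55 and 80) sit next to the per-team mean and the number of rows in each group.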
3. Pivot Tables
Pivot tables allow you to reshape data and perform aggregations. They are similar to Excel
pivot tables.
Example:
import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Temperature': [32, 75, 30, 77]
})

# Creating a pivot table to find the average temperature by date and city
pivot_table = df.pivot_table(values='Temperature', index='Date',
                             columns='City', aggfunc='mean')
print(pivot_table)
Output:
City        Los Angeles  New York
Date
2021-01-01         75.0      32.0
2021-01-02         77.0      30.0
Explanation:
The pivot_table() function reshapes the data.
It aggregates Temperature by Date and City and calculates the mean temperature for
each group.
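The role of aggfunc is easier to see when a (Date, City) pair occurs more than once. The sketch below uses hypothetical readings (not the data above) with two New York values on the same date, and compares aggfunc='mean' with aggfunc='max'. It also shows a groupby() formulation that should give the same table.

```python
import pandas as pd

# Hypothetical data: two New York readings on 2021-01-01
df = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02'],
    'City': ['New York', 'New York', 'Los Angeles', 'New York'],
    'Temperature': [30, 34, 75, 31]
})

# aggfunc decides how duplicate (Date, City) entries are collapsed
mean_table = df.pivot_table(values='Temperature', index='Date',
                            columns='City', aggfunc='mean')
max_table = df.pivot_table(values='Temperature', index='Date',
                           columns='City', aggfunc='max')

# The same reshaping expressed as group, aggregate, then unstack
same_as_groupby = df.groupby(['Date', 'City'])['Temperature'].mean().unstack('City')

print(mean_table)
print(max_table)
```

Cells with no data, such as Los Angeles on 2021-01-02, are filled with NaN.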
4. Vectorized String Operations
Pandas provides vectorized string operations through the .str accessor. These methods are
applied to an entire column or Series in a single call, so no explicit loop is needed, and
missing values are passed through rather than raising an error.
Example:
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['New York', 'Los Angeles', 'Chicago']
})

# Convert all names to uppercase using vectorized string operations
df['Name'] = df['Name'].str.upper()
print(df)
Output:
      Name         City
0    ALICE     New York
1      BOB  Los Angeles
2  CHARLIE      Chicago
Explanation:
The str.upper() method is applied to the entire Name column, converting all the
names to uppercase.
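A few more .str methods are sketched below on a hypothetical Series that contains a missing value. Calling a plain Python string method on None would raise an AttributeError, whereas the .str accessor skips the missing entry.

```python
import pandas as pd

# Hypothetical data with one missing city
cities = pd.Series(['New York', 'Los Angeles', None, 'Chicago'])

# Length of each string; the missing entry stays missing
lengths = cities.str.len()

# Literal substring test; na=False treats the missing entry as "no match"
has_space = cities.str.contains(' ', regex=False, na=False)

# Split on spaces, then take the first piece of each result
first_word = cities.str.split(' ').str[0]

print(lengths)
print(has_space)
print(first_word)
```

Chaining .str calls, as in the last step, is a common way to build multi-step string transformations without writing a loop.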
5. Working with Time Series
Pandas provides extensive functionality for working with time series data, including
generating ranges of dates, resampling data, and performing date/time operations.
Example:
import pandas as pd

# Creating a DateTime index
dates = pd.date_range('20210101', periods=6)
df = pd.DataFrame({
    'Date': dates,
    'Temperature': [32, 35, 31, 30, 29, 28]
})

# Setting 'Date' as the index
df.set_index('Date', inplace=True)

# Resampling the data to get the average temperature per month
# ('M' is deprecated in pandas 2.2+, where the month-end alias is 'ME')
monthly_avg = df.resample('M').mean()
print(monthly_avg)
Output:
            Temperature
Date
2021-01-31    30.833333
Explanation:
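The date_range() function generates six consecutive daily dates starting from 2021-01-01.
We resample the data by month using .resample('M') and calculate the mean of the
Temperature column. All six dates fall in January, so the result has a single row
labelled with the month end, 2021-01-31.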
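Resampling is not limited to months. As a small sketch on the same temperatures, the code below averages over two-day bins and selects a date range by slicing the index with date strings. The names two_day_avg and sliced are chosen here for illustration.

```python
import pandas as pd

dates = pd.date_range('2021-01-01', periods=6, freq='D')
df = pd.DataFrame({'Temperature': [32, 35, 31, 30, 29, 28]}, index=dates)

# Average over consecutive two-day bins (Jan 1-2, Jan 3-4, Jan 5-6)
two_day_avg = df.resample('2D').mean()

# Select a date range by label; both end points are included
sliced = df.loc['2021-01-02':'2021-01-04']

print(two_day_avg)
print(sliced)
```

Each bin in two_day_avg is labelled with its first date, and the slice returns the three rows from January 2 to January 4.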
6. High-Performance Pandas: eval() and query()
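Pandas provides two functions, eval() and query(), that evaluate expressions written as
strings. On large DataFrames they can reduce memory use and run time, because the whole
expression is evaluated in one pass instead of building an intermediate array for each step.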
eval():
The eval() function evaluates an expression given as a string. pd.eval() is the top-level
form; DataFrame.eval() evaluates the expression with the DataFrame's columns available
as names.
Example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Using eval() to perform arithmetic operations on columns
df['C'] = pd.eval('df.A + df.B')
print(df)
Output:
   A   B   C
0  1  10  11
1  2  20  22
2  3  30  33
3  4  40  44
Explanation:
pd.eval() evaluates the expression 'df.A + df.B' and returns the element-wise sum as a
Series, which is then assigned to the new column C.
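The same calculation can also be written with the DataFrame.eval() method, which resolves bare column names and supports assignment inside the expression. This is a minimal sketch; the column D and the local variable scale are hypothetical additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

# Column names can be used directly; the assignment creates column C.
# By default a new DataFrame is returned, so the result is reassigned.
df = df.eval('C = A + B')

# Local Python variables are referenced with the @ prefix
scale = 2
df = df.eval('D = C * @scale')

print(df)
```

On a DataFrame this small there is no speed benefit; any gain from eval() would only be expected on large data.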
query():
The query() function allows you to filter rows of a DataFrame based on a condition
expressed as a string.
Example:
# Using query() to filter rows where A is greater than 2
filtered_df = df.query('A > 2')
print(filtered_df)
Output:
   A   B   C
2  3  30  33
3  4  40  44
Explanation:
Here, query('A > 2') keeps only the rows where the value in column A is greater than 2.
The original row labels (2 and 3) are preserved in the result.
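Conditions can be combined with and / or, and local variables can be referenced with the @ prefix. The sketch below compares a query() call with the equivalent boolean mask; the variable threshold is a hypothetical name introduced for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40]
})

threshold = 2

# query(): the condition is a string; @threshold refers to the local variable
by_query = df.query('A > @threshold and B < 40')

# The same filter written as a boolean mask
by_mask = df[(df['A'] > threshold) & (df['B'] < 40)]

print(by_query)
```

Both forms select the single row with A = 3 and B = 30. The query() form is often easier to read when several conditions are combined.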
Questions:
1. What is the difference between the merge() and join() functions in Pandas? Provide an
example of when to use each of them.
2. What is a pivot table in Pandas? Explain how to create a pivot table and describe the role
of the aggfunc parameter.
3. Define vectorized string operations in Pandas. How are they different from using
Python’s regular string methods? Provide an example of using a vectorized string operation
on a column in a DataFrame.
4. Explain how to handle missing data in Pandas. What are the common techniques for
dealing with NaN values in a DataFrame?
5. What is the purpose of the eval() function in Pandas? How does it improve performance
compared to traditional methods for column operations?
6. Explain the query() function in Pandas. How is it used to filter data based on a specific
condition or expression?