Moderate Level Questions
Groupby Operations
Grouping data and performing aggregations are common operations in data
analysis. Here's how you can do it using Pandas:
Example DataFrame
Let's assume you have the following DataFrame:
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian', 'Jack'],
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000, 90000, 95000],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales', 'HR', 'IT', 'Finance', 'Marketing', 'Sales']
}
df = pd.DataFrame(data)
1. Group Data by a Column and Calculate the Mean
To group data by a column and calculate the mean of another column, you can
use the groupby() method along with mean() :
# Group by 'Department' and calculate the mean 'Salary'
mean_salary_by_department = df.groupby('Department')['Salary'].mean()
print(mean_salary_by_department)
2. Find the Sum of a Numeric Column for Each Category in
Another Column
To find the sum of a numeric column for each category in another column, you
can use the groupby() method along with sum() :
# Group by 'Department' and calculate the sum of 'Salary'
total_salary_by_department = df.groupby('Department')['Salary'].sum()
print(total_salary_by_department)
Explanation
df.groupby('Department')['Salary'].mean() : Groups the DataFrame by the
'Department' column and calculates the mean of the 'Salary' column for
each group.
df.groupby('Department')['Salary'].sum() : Groups the DataFrame by the
'Department' column and calculates the sum of the 'Salary' column for
each group.
These operations allow you to summarize and analyze your data based on
specific categories. You can adjust the column names and aggregation
functions as needed for your specific use case.
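Grouping is not limited to a single aggregation: agg() can compute several statistics in one pass. Here's a minimal sketch using a small made-up DataFrame (not the one above):

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'IT', 'HR', 'IT'],
    'Salary': [50000, 55000, 75000, 80000]
})

# agg() applies several aggregation functions at once;
# the result has one column per function
summary = df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
print(summary)
```

You can also pass a dict to agg() to apply different functions to different columns.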
Merging and Joining
Merging and joining DataFrames are essential operations for combining data
from different sources. Here's how you can perform these operations using
Pandas:
Example DataFrames
Let's assume you have the following two DataFrames:
import pandas as pd

# First DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales']
})

# Second DataFrame
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6, 7],
    'Salary': [60000, 65000, 70000, 75000, 80000],
    'Department': ['Finance', 'Marketing', 'Sales', 'HR', 'IT']
})
1. Merge Two DataFrames on a Common Column
To merge two DataFrames on a common column, you can use the merge()
method:
# Merge df1 and df2 on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
2. Perform an Outer Join on Two DataFrames
To perform an outer join on two DataFrames, you can use the merge() method
with the how='outer' parameter:
# Perform an outer join on df1 and df2 based on the 'ID' column
outer_join_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join_df)
Explanation
pd.merge(df1, df2, on='ID') : Merges df1 and df2 on the 'ID' column. By default,
this performs an inner join, including only rows with matching 'ID' values in
both DataFrames.
pd.merge(df1, df2, on='ID', how='outer') : Performs an outer join, including all rows
from both DataFrames. Rows with no match will have NaN in the resulting
DataFrame.
These operations allow you to combine data from different DataFrames based
on common columns, enabling more comprehensive analysis. You can adjust
the column names and join types as needed for your specific use case.
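When debugging a merge, it can be hard to tell which rows matched. The indicator parameter of pd.merge() helps here; a small sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Salary': [55000, 60000, 65000]})

# indicator=True adds a '_merge' column recording each row's origin:
# 'left_only', 'right_only', or 'both'
merged = pd.merge(df1, df2, on='ID', how='outer', indicator=True)
print(merged)
```

Filtering on the '_merge' column is a quick way to find unmatched rows after a join.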
Let's dive deeper into merging and joining DataFrames with more
examples and their outputs.
Example DataFrames
We'll use the following DataFrames for demonstration:
import pandas as pd

# First DataFrame
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales']
})

# Second DataFrame
df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6, 7],
    'Salary': [60000, 65000, 70000, 75000, 80000],
    'Department': ['Finance', 'Marketing', 'Sales', 'HR', 'IT']
})
1. Inner Join (Default Merge)
An inner join returns only the rows with matching keys in both DataFrames.
# Inner join on 'ID'
inner_join_df = pd.merge(df1, df2, on='ID')
print(inner_join_df)
Output:
   ID     Name Department_x  Salary Department_y
0   3  Charlie      Finance   60000      Finance
1   4    David    Marketing   65000    Marketing
2   5      Eva        Sales   70000        Sales

Note that both DataFrames contain a 'Department' column, so pandas appends
the default suffixes _x and _y to distinguish them.
2. Outer Join
An outer join returns all rows from both DataFrames, filling in NaN for missing
matches.
# Outer join on 'ID'
outer_join_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join_df)
Output:
   ID     Name Department_x   Salary Department_y
0   1    Alice           HR      NaN          NaN
1   2      Bob           IT      NaN          NaN
2   3  Charlie      Finance  60000.0      Finance
3   4    David    Marketing  65000.0    Marketing
4   5      Eva        Sales  70000.0        Sales
5   6      NaN          NaN  75000.0           HR
6   7      NaN          NaN  80000.0           IT
3. Left Join
A left join returns all rows from the left DataFrame and the matched rows from
the right DataFrame.
# Left join on 'ID'
left_join_df = pd.merge(df1, df2, on='ID', how='left')
print(left_join_df)
Output:
   ID     Name Department_x   Salary Department_y
0   1    Alice           HR      NaN          NaN
1   2      Bob           IT      NaN          NaN
2   3  Charlie      Finance  60000.0      Finance
3   4    David    Marketing  65000.0    Marketing
4   5      Eva        Sales  70000.0        Sales
4. Right Join
A right join returns all rows from the right DataFrame and the matched rows
from the left DataFrame.
# Right join on 'ID'
right_join_df = pd.merge(df1, df2, on='ID', how='right')
print(right_join_df)
Output:
   ID     Name Department_x  Salary Department_y
0   3  Charlie      Finance   60000      Finance
1   4    David    Marketing   65000    Marketing
2   5      Eva        Sales   70000        Sales
3   6      NaN          NaN   75000           HR
4   7      NaN          NaN   80000           IT
Explanation
Inner Join: Returns only the rows where there is a match in both
DataFrames.
Outer Join: Returns all rows from both DataFrames, filling in NaN where
there is no match.
Left Join: Returns all rows from the left DataFrame and the matched rows
from the right DataFrame.
Right Join: Returns all rows from the right DataFrame and the matched
rows from the left DataFrame.
These examples illustrate how you can merge DataFrames using different join
types to suit your data analysis needs.
Data Transformation
Data transformation is a crucial step in data preprocessing. Here's how you
can apply a custom function to a column and work with date columns in
Pandas:
Example DataFrame
Let's assume you have the following DataFrame:
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Joining_Date': ['2021-05-15', '2020-06-20', '2019-07-25', '2018-08-30', '2017-09-30'],
    'Salary': [50000, 55000, 60000, 65000, 70000]
}
df = pd.DataFrame(data)
1. Apply a Custom Function to a Column
To apply a custom function to a column, you can use the apply() method. Let's
say you want to increase each salary by 10%.
# Define a custom function to increase salary by 10%
def increase_salary(salary):
    return salary * 1.10

# Apply the custom function to the 'Salary' column
df['Salary'] = df['Salary'].apply(increase_salary)
print(df)
Output:
ID Name Joining_Date Salary
0 1 Alice 2021-05-15 55000.0
1 2 Bob 2020-06-20 60500.0
2 3 Charlie 2019-07-25 66000.0
3 4 David 2018-08-30 71500.0
4 5 Eva 2017-09-30 77000.0
2. Convert a Date Column to Datetime Format and Extract the
Year
To convert a date column to datetime format and extract the year, you can use
pd.to_datetime() and the dt accessor.
# Convert 'Joining_Date' to datetime format
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'])
# Extract the year from 'Joining_Date'
df['Year'] = df['Joining_Date'].dt.year
print(df)
Output:
ID Name Joining_Date Salary Year
0 1 Alice 2021-05-15 55000.0 2021
1 2 Bob 2020-06-20 60500.0 2020
2 3 Charlie 2019-07-25 66000.0 2019
3 4 David 2018-08-30 71500.0 2018
4 5 Eva 2017-09-30 77000.0 2017
Explanation
df['Salary'].apply(increase_salary) : Applies the custom function increase_salary to each
element in the 'Salary' column.
pd.to_datetime(df['Joining_Date']) : Converts the 'Joining_Date' column to datetime
format.
df['Joining_Date'].dt.year : Extracts the year from the 'Joining_Date' column.
These operations allow you to transform and manipulate your data efficiently.
You can adjust the functions and column names as needed for your specific
use case.
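For simple arithmetic like the 10% raise above, a vectorized expression is usually faster than apply(), because it operates on the whole column at once instead of calling a Python function per element. A sketch with a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Salary': [50000, 60000]})

# Vectorized alternative to apply(increase_salary):
# multiply the whole column in one operation
df['Salary'] = df['Salary'] * 1.10
print(df)
```

Reserve apply() for logic that genuinely cannot be expressed with vectorized operations.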
Pivot Tables and Cross Tabulation
Pivot tables and cross-tabulations are powerful tools for summarizing and
analyzing data. Here's how you can create them using Pandas:
Example DataFrame
Let's assume you have the following DataFrame:
import pandas as pd

data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Helen', 'Ian', 'Jack'],
    'Department': ['HR', 'IT', 'Finance', 'Marketing', 'Sales', 'HR', 'IT', 'Finance', 'Marketing', 'Sales'],
    'Salary': [50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000, 90000, 95000],
    'Performance_Score': [4, 5, 3, 4, 5, 4, 3, 5, 4, 3]
}
df = pd.DataFrame(data)
1. Create a Pivot Table
A pivot table summarizes data by aggregating values based on one or more
keys.
# Create a pivot table to show the average salary by department and performance score
pivot_table = df.pivot_table(values='Salary', index='Department', columns='Performance_Score', aggfunc='mean')
print(pivot_table)
Output:
Performance_Score        3        4        5
Department
Finance            60000.0      NaN  85000.0
HR                     NaN  62500.0      NaN
IT                 80000.0      NaN  55000.0
Marketing              NaN  77500.0      NaN
Sales              95000.0      NaN  70000.0
2. Generate a Cross-Tabulation
Cross-tabulation shows the frequency distribution of two categorical variables.
# Generate a cross-tabulation of 'Department' and 'Performance_Score'
cross_tab = pd.crosstab(df['Department'], df['Performance_Score'])
print(cross_tab)
Output:
Performance_Score  3  4  5
Department
Finance            1  0  1
HR                 0  2  0
IT                 1  0  1
Marketing          0  2  0
Sales              1  0  1
Explanation
df.pivot_table(...) : Creates a pivot table with 'Department' as the index,
'Performance_Score' as the columns, and the mean of 'Salary' as the
values.
pd.crosstab(df['Department'], df['Performance_Score']) : Generates a cross-tabulation of
'Department' and 'Performance_Score', showing the frequency of each
combination.
These operations help you summarize and analyze your data effectively. You
can adjust the columns and aggregation functions as needed for your specific
use case.
Rolling and Window Functions
Rolling and window functions are useful for time series analysis and smoothing
data. Here's how you can compute a moving average and a cumulative sum
using Pandas:
Example DataFrame
Let's assume you have the following DataFrame with time series data:
import pandas as pd

data = {
    'Date': pd.date_range(start='2023-01-01', periods=10, freq='D'),
    'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
}
df = pd.DataFrame(data)
1. Compute a Moving Average
To compute a moving average with a window size of 5, you can use the rolling()
method.
# Compute a moving average with a window size of 5
df['Moving_Average'] = df['Value'].rolling(window=5).mean()
print(df)
Output:
Date Value Moving_Average
0 2023-01-01 10 NaN
1 2023-01-02 20 NaN
2 2023-01-03 30 NaN
3 2023-01-04 40 NaN
4 2023-01-05 50 30.0
5 2023-01-06 60 40.0
6 2023-01-07 70 50.0
7 2023-01-08 80 60.0
8 2023-01-09 90 70.0
9 2023-01-10 100 80.0
2. Find the Cumulative Sum
To find the cumulative sum of a column, you can use the cumsum() method.
# Find the cumulative sum of the 'Value' column
df['Cumulative_Sum'] = df['Value'].cumsum()
print(df)
Output:
Date Value Moving_Average Cumulative_Sum
0 2023-01-01 10 NaN 10
1 2023-01-02 20 NaN 30
2 2023-01-03 30 NaN 60
3 2023-01-04 40 NaN 100
4 2023-01-05 50 30.0 150
5 2023-01-06 60 40.0 210
6 2023-01-07 70 50.0 280
7 2023-01-08 80 60.0 360
8 2023-01-09 90 70.0 450
9 2023-01-10 100 80.0 550
Explanation
rolling(window=5).mean() : Computes the moving average of the 'Value' column
with a window size of 5. The first few values are NaN because there aren't
enough preceding values to compute the average.
cumsum() : Computes the cumulative sum of the 'Value' column, which is the
sum of all previous values up to the current row.
These operations are useful for analyzing trends and patterns in time series
data. You can adjust the window size and column names as needed for your
specific use case.
Date Time Operations
Working with date columns often involves extracting components like day,
month, and year, as well as calculating differences between dates. Here's how
you can do these operations using Pandas:
Example DataFrame
Let's assume you have the following DataFrame with date columns:
import pandas as pd

data = {
    'Start_Date': pd.to_datetime(['2023-01-01', '2023-02-15', '2023-03-20', '2023-04-25']),
    'End_Date': pd.to_datetime(['2023-01-10', '2023-02-20', '2023-03-30', '2023-05-01'])
}
df = pd.DataFrame(data)
1. Extract the Day, Month, and Year
To extract the day, month, and year from a date column, you can use the dt
accessor.
# Extract day, month, and year from 'Start_Date'
df['Day'] = df['Start_Date'].dt.day
df['Month'] = df['Start_Date'].dt.month
df['Year'] = df['Start_Date'].dt.year
print(df)
Output:
Start_Date End_Date Day Month Year
0 2023-01-01 2023-01-10 1 1 2023
1 2023-02-15 2023-02-20 15 2 2023
2 2023-03-20 2023-03-30 20 3 2023
3 2023-04-25 2023-05-01 25 4 2023
2. Find the Difference in Days Between Two Date Columns
To find the difference in days between two date columns, you can subtract one
column from the other.
# Calculate the difference in days between 'End_Date' and 'Start_Date'
df['Difference_Days'] = (df['End_Date'] - df['Start_Date']).dt.days
print(df)
Output:
Start_Date End_Date Day Month Year Difference_Days
0 2023-01-01 2023-01-10 1 1 2023 9
1 2023-02-15 2023-02-20 15 2 2023 5
2 2023-03-20 2023-03-30 20 3 2023 10
3 2023-04-25 2023-05-01 25 4 2023 6
Explanation
dt.day , dt.month , dt.year : Extract the day, month, and year from the
'Start_Date' column.
(df['End_Date'] - df['Start_Date']).dt.days : Calculates the difference in days between
the 'End_Date' and 'Start_Date' columns.
These operations allow you to manipulate and analyze date data efficiently.
You can adjust the column names and operations as needed for your specific
use case.
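The dt accessor also exposes derived attributes such as the weekday name, and pd.Timedelta lets you shift dates arithmetically. A small sketch with made-up dates:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2023-01-01', '2023-02-15']))

# dt.day_name() returns the weekday name for each date
print(dates.dt.day_name().tolist())

# Adding a Timedelta shifts every date by a fixed offset
shifted = dates + pd.Timedelta(days=7)
print(shifted.dt.strftime('%Y-%m-%d').tolist())
```

This combines naturally with the extraction examples above, e.g. grouping records by weekday or computing deadlines from a start date.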