Some exercises:
Exercise 1: Data Cleaning
Objective: Cleanse the data to prepare it for analysis.
Dataset: You have a dataset with information about customers, but it contains missing values
and outliers.
Tasks:
1. Identify and handle missing values (you can use techniques like filling with
mean/median or removing rows/columns).
2. Detect and deal with outliers in numerical variables.
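A minimal sketch of both tasks, using a small made-up customer table (the column names and values here are illustrative, not from a real dataset):
import pandas as pd
# Illustrative customer data with a missing value and an extreme income
customers = pd.DataFrame({
    'Age': [25, 31, None, 44, 29, 38],
    'Income': [42000, 58000, 61000, 39000, 47000, 950000]
})
# Task 1: fill missing numerical values (the median is robust to outliers)
customers['Age'] = customers['Age'].fillna(customers['Age'].median())
# Task 2: detect outliers with the 1.5 * IQR rule and remove them
q1, q3 = customers['Income'].quantile([0.25, 0.75])
iqr = q3 - q1
within_bounds = customers['Income'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
customers = customers[within_bounds]
print(customers)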
Exercise 2: Exploratory Data Analysis (EDA)
Objective: Understand the characteristics of the dataset through exploratory data analysis.
Dataset: Use a dataset containing information about sales transactions.
Tasks:
1. Create summary statistics (mean, median, standard deviation, etc.) for numerical
variables.
2. Generate visualizations (histograms, box plots, scatter plots) to explore the
distribution of key variables.
3. Identify correlations between variables.
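A possible sketch of the three tasks, assuming the transactions sit in a pandas DataFrame called sales (the column names below are illustrative):
import pandas as pd
import matplotlib.pyplot as plt
# Illustrative transactions
sales = pd.DataFrame({
    'Quantity': [1, 3, 2, 5, 4, 2, 1, 6],
    'UnitPrice': [9.99, 4.50, 19.00, 4.50, 7.25, 19.00, 9.99, 3.10]
})
sales['Revenue'] = sales['Quantity'] * sales['UnitPrice']
# Task 1: summary statistics for numerical variables
print(sales.describe())
# Task 2: distribution and relationship plots
sales['Revenue'].plot(kind='hist', title='Revenue distribution')
plt.show()
sales.plot(kind='scatter', x='Quantity', y='Revenue', title='Quantity vs. Revenue')
plt.show()
# Task 3: correlations between variables
print(sales.corr())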
Exercise 3: Regression Analysis
Objective: Explore relationships between variables and make predictions.
Dataset: Use a dataset with information about housing prices, including features like square
footage, number of bedrooms, etc.
Tasks:
1. Perform a simple linear regression to predict housing prices based on a single variable
(e.g., square footage).
2. Evaluate the model's performance using metrics like Mean Squared Error or R-squared.
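A sketch of both tasks with scikit-learn, using made-up housing data (the column names and prices are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Illustrative housing data
housing = pd.DataFrame({
    'SquareFootage': [850, 900, 1200, 1500, 1700, 2100, 2300, 2600],
    'Price': [120000, 130000, 165000, 200000, 215000, 260000, 275000, 310000]
})
# Task 1: simple linear regression on a single predictor
X = housing[['SquareFootage']]
y = housing['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = LinearRegression().fit(X_train, y_train)
# Task 2: evaluate with MSE and R-squared
y_pred = model.predict(X_test)
print('Mean Squared Error:', mean_squared_error(y_test, y_pred))
print('R-squared:', r2_score(y_test, y_pred))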
Exercise 4: Clustering
Objective: Group similar data points together based on certain criteria.
Dataset: Use a dataset containing customer behavior data.
Tasks:
1. Apply a clustering algorithm (e.g., k-means) to group customers based on their
behavior.
2. Visualize the clusters and interpret the results.
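A sketch of k-means clustering on made-up behavior data (the two features are illustrative):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Illustrative customer behavior: three loose groups
behavior = pd.DataFrame({
    'VisitsPerMonth': [2, 3, 2, 10, 12, 11, 25, 27, 26],
    'AvgBasketValue': [15, 18, 14, 60, 65, 58, 120, 130, 125]
})
# Task 1: group customers into 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
behavior['Cluster'] = kmeans.fit_predict(behavior[['VisitsPerMonth', 'AvgBasketValue']])
# Task 2: visualize and interpret the clusters
plt.scatter(behavior['VisitsPerMonth'], behavior['AvgBasketValue'], c=behavior['Cluster'])
plt.xlabel('Visits per month')
plt.ylabel('Average basket value')
plt.title('Customer clusters (k-means, k=3)')
plt.show()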
Exercise 5: Classification
Objective: Assign observations to predefined categories or classes.
Dataset: Use a dataset with information about customer purchases.
Tasks:
1. Define a target variable (e.g., whether a customer will make a repeat purchase).
2. Split the data into training and testing sets.
3. Train a classification model (e.g., logistic regression or decision tree) to predict the
target variable.
4. Evaluate the model's performance on the testing set.
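A sketch of the four tasks with a logistic regression classifier, using made-up purchase data (columns and values are illustrative):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Illustrative purchase history; Task 1: RepeatPurchase is the target variable
purchases = pd.DataFrame({
    'NumOrders': [1, 5, 2, 8, 1, 7, 3, 9, 2, 6],
    'TotalSpent': [20, 250, 60, 480, 15, 300, 90, 520, 40, 280],
    'RepeatPurchase': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
})
# Task 2: split into training and testing sets
X = purchases[['NumOrders', 'TotalSpent']]
y = purchases['RepeatPurchase']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
# Task 3: train the classifier
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Task 4: evaluate on the testing set
print('Accuracy:', accuracy_score(y_test, clf.predict(X_test)))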
Exercise 6: Data Visualization
Objective: Create visual representations of data to aid in understanding.
Dataset: Choose a dataset that interests you.
Tasks:
1. Select two or more variables and create appropriate visualizations (e.g., bar chart, line
plot, pie chart).
2. Use color and labeling to enhance the interpretability of the visualizations.
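A sketch with matplotlib, using a small made-up dataset (all values are illustrative):
import matplotlib.pyplot as plt
# Illustrative monthly sales for two product lines
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May']
product_a = [120, 135, 150, 160, 175]
product_b = [90, 95, 110, 105, 120]
# Task 1: line plot of two variables
plt.plot(months, product_a, color='tab:blue', marker='o', label='Product A')
plt.plot(months, product_b, color='tab:orange', marker='s', label='Product B')
# Task 2: color, labels, and a title for interpretability
plt.title('Monthly sales by product line')
plt.xlabel('Month')
plt.ylabel('Units sold')
plt.legend()
plt.show()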
Exercises with solutions
Exercise 1: Data Cleaning
Dataset: Download the dataset
Objective: Cleanse the COVID-19 dataset to prepare it for analysis.
Tasks:
1. Load the dataset into your preferred data analysis tool.
2. Identify and handle missing values.
3. Remove unnecessary columns.
4. Ensure consistency in date formats.
Solution:
import pandas as pd
# Load the dataset
url = "https://raw.githubusercontent.com/datasets/covid-19/master/data/countries-
aggregated.csv"
covid_data = pd.read_csv(url)
# Identify and handle missing values
covid_data.dropna(inplace=True)
# Remove unnecessary columns
covid_data = covid_data[['Date', 'Country', 'Confirmed', 'Recovered', 'Deaths']]
# Ensure consistency in date formats
covid_data['Date'] = pd.to_datetime(covid_data['Date'])
# Display the cleaned dataset
print(covid_data.head())
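To cover the "identify" part of Task 2 explicitly, the missing values can also be counted per column; a small addition that would go just before the dropna call above:
# Count missing values per column (run before dropna)
print(covid_data.isna().sum())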
Exercise 2: Exploratory Data Analysis (EDA)
Dataset: Download the dataset
Objective: Perform exploratory data analysis on the Auto MPG dataset.
Tasks:
1. Load the dataset into your preferred data analysis tool.
2. Generate summary statistics for numerical variables.
3. Create visualizations to explore the distribution of key variables.
4. Identify correlations between variables.
Solution:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
column_names = ["MPG", "Cylinders", "Displacement", "Horsepower", "Weight",
"Acceleration", "Model Year", "Origin", "Car Name"]
auto_data = pd.read_csv(url, delim_whitespace=True, names=column_names, na_values="?")  # "?" marks missing Horsepower values
# Summary statistics
print(auto_data.describe())
# Visualizations
sns.pairplot(auto_data[['MPG', 'Cylinders', 'Displacement', 'Weight']])
plt.show()
# Correlation matrix
correlation_matrix = auto_data.corr(numeric_only=True)  # restrict to numeric columns (Car Name is text)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
Exercise 3: Regression Analysis
Dataset: Download the dataset
Objective: Perform a simple linear regression to predict the age of abalones.
Tasks:
1. Load the dataset into your preferred data analysis tool.
2. Choose a variable to predict (e.g., Rings).
3. Split the data into training and testing sets.
4. Train a linear regression model.
5. Evaluate the model's performance using metrics like Mean Squared Error or R-squared.
Solution:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
column_names = ["Sex", "Length", "Diameter", "Height", "WholeWeight",
"ShuckedWeight", "VisceraWeight", "ShellWeight", "Rings"]
abalone_data = pd.read_csv(url, names=column_names)
# Choose a variable to predict
X = abalone_data[['Length', 'Diameter', 'Height', 'WholeWeight', 'ShuckedWeight',
'VisceraWeight', 'ShellWeight']]
y = abalone_data['Rings']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model's performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
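Because the objective is the age of the abalones, the predicted ring counts are usually converted to years; the dataset's documentation gives age = rings + 1.5. A short follow-up reusing y_pred from the solution above:
# Convert predicted ring counts to estimated age in years (age = rings + 1.5)
predicted_age = y_pred + 1.5
print(predicted_age[:5])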
Basic exercises with solutions
Exercise 1: Data Cleaning
Objective: Cleanse a simple dataset.
Dataset:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma', 'Frank', 'Grace', 'Harry', 'Ivy', 'Jack'],
'Age': [25, 28, None, 22, 30, 35, 28, None, 24, 29],
'Salary': [50000, 60000, 75000, 48000, None, 90000, 80000, 75000, 52000, 60000]
}
df = pd.DataFrame(data)
Tasks:
1. Handle missing values in the 'Age' and 'Salary' columns.
2. Drop any rows with missing values.
3. Display the cleaned dataset.
Solution:
# Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())          # fill missing ages with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].median())  # fill missing salaries with the median
# Drop rows with missing values
df.dropna(inplace=True)
# Display the cleaned dataset
print(df)
Exercise 2: Exploratory Data Analysis (EDA)
Objective: Explore a dataset and generate basic insights.
Dataset:
import pandas as pd
data = {
'Student_ID': [1, 2, 3, 4, 5],
'Math_Score': [85, 90, 78, 92, 88],
'English_Score': [75, 80, 85, 88, 92],
'Science_Score': [90, 85, 88, 80, 95]
}
df = pd.DataFrame(data)
Tasks:
1. Calculate the mean, median, and standard deviation for each subject.
2. Plot a bar chart to visualize the average scores for each subject.
Solution:
# Calculate mean, median, and standard deviation for each subject (excluding the ID column)
subject_stats = df.drop(columns='Student_ID').describe().loc[['mean', '50%', 'std']].transpose()
subject_stats = subject_stats.rename(columns={'50%': 'median'})
print(subject_stats)
# Plot a bar chart
import matplotlib.pyplot as plt
subject_stats.plot(kind='bar', y='mean', yerr='std', legend=False)
plt.title('Average Scores for Each Subject')
plt.ylabel('Score')
plt.xlabel('Subject')
plt.show()
Exercise 3: Data Visualization
Objective: Visualize a dataset using scatter plots.
Dataset:
import pandas as pd
data = {
'Hours_Studied': [2, 3, 5, 1, 4, 6, 7, 3, 2, 5],
'Exam_Score': [50, 65, 80, 40, 75, 90, 95, 60, 55, 85]
}
df = pd.DataFrame(data)
Tasks:
1. Create a scatter plot to visualize the relationship between hours studied and exam
scores.
2. Add labels and a title to the plot.
Solution:
import matplotlib.pyplot as plt
# Scatter plot
plt.scatter(df['Hours_Studied'], df['Exam_Score'])
plt.title('Relationship Between Hours Studied and Exam Score')
plt.xlabel('Hours Studied')
plt.ylabel('Exam Score')
plt.show()