National University of Technology (NUTECH)
Electrical Engineering Department
EE4407-Machine Learning Lab
LAB No: 08
NAME: Muhmmad Ahmed Mustafa
ID NO: F20603040
Lab 08: Linear Regression and Data Preprocessing on Supermarket Sales Dataset
Objective:
Understand and implement data preprocessing techniques, including log transformation and one-hot encoding.
Apply Exploratory Data Analysis (EDA) to gain insights into the dataset.
Explore linear regression for predicting sales in the supermarket dataset.
Tools/Software Requirements:
Python 3.x
Jupyter Notebook or any other Python IDE
Pandas, NumPy for data manipulation
Matplotlib, Seaborn for data visualization
Scikit-learn for linear regression modeling
Sample supermarket sales dataset (cleaned) in CSV format
Data Preprocessing and Feature Engineering
Import Necessary Libraries and Load and Inspect the Dataset:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
file_path = 'C:\\Users\\us\\Desktop\\cleaned\\superstore_final_dataset_cleaned.csv' # Replace with your own file path
dataset = pd.read_csv(file_path, encoding='ISO-8859-1')
# ISO-8859-1 and cp1252 cover most scenarios.
Necessary data preprocessing before feeding the dataset to the model:
#Converting to datetime
dataset['Order_Date'] = pd.to_datetime(dataset['Order_Date'],
errors='coerce', dayfirst=True)
dataset['Ship_Date'] = pd.to_datetime(dataset['Ship_Date'],
errors='coerce', dayfirst=True)
#Extracting year, month, and day from Order_Date
dataset['Order_Year'] = dataset['Order_Date'].dt.year
dataset['Order_Month'] = dataset['Order_Date'].dt.month
dataset['Order_Day'] = dataset['Order_Date'].dt.day
#Extracting year, month, and day from ship date
dataset['Ship_Year'] = dataset['Ship_Date'].dt.year
dataset['Ship_Month'] = dataset['Ship_Date'].dt.month
dataset['Ship_Day'] = dataset['Ship_Date'].dt.day
# Dropping the original datetime columns
dataset.drop(['Order_Date', 'Ship_Date'], axis=1, inplace=True)
# Dropping unnecessary columns
columns_to_drop = ['Row_ID', 'Order_ID', 'Product_ID', 'Customer_Name',
'Customer_ID', 'Product_Name']
dataset.drop(columns=columns_to_drop, inplace=True)
Feature Engineering:
1. Create new features like 'Date_Gap' and apply one-hot encoding to categorical variables.
2. Apply a logarithmic transformation to the 'Sales' variable to address skewness.
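The code in this report does not compute 'Date_Gap' itself, so the following is a minimal sketch of one way it could be derived, assuming it means the number of days between ordering and shipping. The toy rows are illustrative and are not taken from the lab dataset. In the actual notebook this step would have to run before Order_Date and Ship_Date are dropped.

```python
import pandas as pd

# Toy rows standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Order_Date": ["08/11/2017", "12/06/2017", "11/10/2016"],
    "Ship_Date": ["11/11/2017", "16/06/2017", "18/10/2016"],
})

# Parse day-first dates, mirroring the datetime conversion in the preprocessing step
for col in ["Order_Date", "Ship_Date"]:
    toy[col] = pd.to_datetime(toy[col], format="%d/%m/%Y", errors="coerce")

# Date_Gap (assumed definition): whole days between order and shipment
toy["Date_Gap"] = (toy["Ship_Date"] - toy["Order_Date"]).dt.days
print(toy)
```

Rows whose dates fail to parse become NaT under errors="coerce", so their Date_Gap is missing and would need to be handled before modeling.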
# Encoding categorical variables
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# scikit-learn >= 1.2 uses sparse_output=False; older releases use sparse=False
encoder = OneHotEncoder(sparse_output=False)
encoded_categorical_data = encoder.fit_transform(
dataset[['Segment', 'Country', 'City', 'State', 'Category']])
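# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
# index=dataset.index keeps the encoded rows aligned with the original rows for the join
encoded_categorical_df = pd.DataFrame(encoded_categorical_data,
columns=encoder.get_feature_names_out(['Segment', 'Country', 'City',
'State', 'Category']), index=dataset.index)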
# Merging encoded data with the original dataset and dropping original
#categorical columns
dataset = dataset.join(encoded_categorical_df).drop(['Segment',
'Country', 'City', 'State', 'Category'], axis=1)
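As a point of comparison, the same encode-and-merge result can be sketched with pandas alone. This is an alternative to the OneHotEncoder approach used above, not the method of this lab, and the toy rows are illustrative only.

```python
import pandas as pd

# Toy rows with two categorical columns and the numeric target (illustrative values)
toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer"],
    "Category": ["Furniture", "Technology", "Office Supplies"],
    "Sales": [261.96, 731.94, 14.62],
})

# get_dummies encodes the listed columns and drops the originals in one call
encoded = pd.get_dummies(toy, columns=["Segment", "Category"], dtype=float)
print(encoded.columns.tolist())
```

Unlike a fitted OneHotEncoder, get_dummies does not remember the categories it saw, so the encoder remains the safer choice when new data must be transformed with the same columns later.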
# Apply logarithmic transformation to the 'Sales' variable
# log1p computes log(1 + x), which avoids log(0) (undefined)
dataset['Log_Sales'] = np.log1p(dataset['Sales'])
# Display the first few rows of the updated dataset
print(dataset[['Sales', 'Log_Sales']].head())
# Visualization of the 'Sales' distribution and Log Transformed Sales Distribution
plt.figure(figsize=(12, 6))
# Original Sales Distribution
plt.subplot(1, 2, 1)
# histplot replaces the deprecated distplot
sns.histplot(dataset['Sales'], kde=True, bins=30)
plt.title('Original Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')
# Log Transformed Sales Distribution
plt.subplot(1, 2, 2)
# histplot replaces the deprecated distplot
sns.histplot(dataset['Log_Sales'], kde=True, bins=30)
plt.title('Log Transformed Sales Distribution')
plt.xlabel('Log(Sales)')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
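The visual comparison above can be backed by a number. The sketch below computes sample skewness before and after the log transformation on synthetic log-normal values. The data are an assumption chosen to be right-skewed and do not come from the lab dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "sales" values (log-normal); not the lab dataset
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000))

# Same transformation as in the lab: log(1 + x)
log_sales = np.log1p(sales)

# Skewness near 0 indicates a roughly symmetric distribution
skew_before = sales.skew()
skew_after = log_sales.skew()
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)
```

On the real data, the same two calls to .skew() on dataset['Sales'] and dataset['Log_Sales'] would show whether the transformation actually reduced the skew.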
Linear Regression Modeling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Splitting the data into features (X) and target (y)
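# Log_Sales is derived from Sales, so it is dropped as well to avoid target leakage
X = dataset.drop(['Sales', 'Log_Sales'], axis=1) # Features
y = dataset['Sales'] # Target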
# Splitting the data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Training a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Predicting on the test set
y_pred = lin_reg.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R² Score:", r2)
Lab Task
1. Perform data preprocessing on the supermarket sales dataset, including feature
engineering and log transformation.
2. Conduct EDA to understand the relationships in the dataset.
3. Build and evaluate a linear regression model to predict sales.
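The EDA in task 2 could start with summary statistics, a per-category aggregate, and a correlation check. The following is a minimal sketch on toy rows. The column names follow the ones used in this lab, but the values are illustrative only.

```python
import pandas as pd

# Toy rows (illustrative values, not the lab dataset)
toy = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Furniture", "Office Supplies", "Technology"],
    "Order_Month": [1, 3, 3, 7, 11],
    "Sales": [200.0, 900.0, 400.0, 50.0, 1100.0],
})

# Summary statistics for the numeric columns
print(toy.describe())

# Average sales per category, largest first
mean_sales = toy.groupby("Category")["Sales"].mean().sort_values(ascending=False)
print(mean_sales)

# Pairwise correlation between numeric columns
corr = toy[["Order_Month", "Sales"]].corr()
print(corr)
```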
CONCLUSION
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Features (X) and Target Variable (y)
X = dataset.drop(['Sales', 'Log_Sales'], axis=1) # Drop both the raw and the log-transformed target from the features
y = dataset['Log_Sales'] # Target: log-transformed sales
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Training a linear regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
# Predicting on the test set
y_pred = lin_reg.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error (log scale):", mse)
print("R² Score:", r2)
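Because this model is trained on Log_Sales, the error above is measured in log units rather than in sales units. A minimal sketch of mapping predictions back to the original scale with np.expm1 (the inverse of log(1 + x)) follows. The arrays are hypothetical stand-ins for y_test and y_pred.

```python
import numpy as np

# Hypothetical true sales and their log-scale counterparts (stand-ins for y_test)
y_true_sales = np.array([10.0, 100.0, 1000.0])
y_test_log = np.log1p(y_true_sales)

# Hypothetical log-scale predictions with small errors (stand-ins for y_pred)
y_pred_log = y_test_log + np.array([0.1, -0.1, 0.05])

# expm1 inverts log1p, returning predictions in sales units
y_pred_sales = np.expm1(y_pred_log)

# Error on the log scale versus error in the original units
mse_log = np.mean((y_test_log - y_pred_log) ** 2)
rmse_sales = np.sqrt(np.mean((y_true_sales - y_pred_sales) ** 2))
print("MSE (log scale):", mse_log)
print("RMSE (sales units):", rmse_sales)
```

A small error on the log scale can still correspond to a large error in sales units for high-value orders, so reporting both gives a fairer picture of the model.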