Aim: Predicting Sales from Advertisement Data Using Linear Regression for Smart Business
Modeling
Software Tools: Google Colab
Libraries: pandas (pd) → For reading the CSV file and handling tabular data.
numpy (np) → For numerical calculations (mean, sum, array operations).
matplotlib.pyplot (plt) → For visualizing the dataset and regression results.
Description: This exercise predicts sales from advertisement data using a simple linear
regression model implemented from scratch.
1. Data Upload & Loading → The Advertising.csv file is uploaded and loaded, its column names are
stripped of leading and trailing whitespace, and the first rows are displayed.
2. Feature Selection → TV advertising budget is chosen as the input feature (X) and sales as the
output target (Y).
3. Data Splitting → The dataset is split manually by row position: the first 150 rows form the
training set and the remaining rows form the test set.
4. Model Building →
o Slope (m) and intercept (c) are calculated using the least-squares formulas.
o The regression line is defined as:
o y = mx + c
5. Visualization → Scatter plots show actual data points and the fitted regression line.
6. Prediction → Sales prediction is made for a TV ad budget of ₹50,000.
7. Model Testing → Predicted sales are compared with actual test set sales using a scatter plot.
8. Performance Evaluation →
o Mean Squared Error (MSE) measures the average squared difference between actual and predicted sales.
o R-squared indicates how well the model explains variance in sales.
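The slope and intercept in step 4 come from the closed-form least-squares formulas: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and c = ȳ − m·x̄. The following is a minimal sketch of those formulas on a small made-up dataset, cross-checked against np.polyfit. The toy values and the function name fit_line are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through (x, y)."""
    x_mean, y_mean = np.mean(x), np.mean(y)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Toy data (hypothetical): points lying exactly on y = 2x + 1
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

m, c = fit_line(x_toy, y_toy)

# A degree-1 polynomial fit returns [slope, intercept] and should agree
m_ref, c_ref = np.polyfit(x_toy, y_toy, deg=1)

print("From formulas:", m, c)
print("From polyfit: ", m_ref, c_ref)
```

Because the toy points lie exactly on a line, both approaches should recover a slope of 2 and an intercept of 1 up to floating-point rounding.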
Code:
# Step 1: Upload your Advertising.csv file
from google.colab import files
uploaded = files.upload()
# Step 2: Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Step 3: Load dataset
# Get the uploaded file name
uploaded_file_name = list(uploaded.keys())[0]
data = pd.read_csv(uploaded_file_name)
# Clean column names in case there are spaces or casing issues
data.columns = data.columns.str.strip()
# Display the dataset
print("First 5 rows of dataset:")
print(data.head())
print("Column Names:", data.columns)
# Step 4: Extract Feature (TV) and Target (Sales) variables
x = data['TV'].values
y = data['Sales'].values
# Step 5: Split dataset into training and test sets
x_train = x[:150]
x_test = x[150:]
y_train = y[:150]
y_test = y[150:]
# Step 6: Define helper functions
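def errors_product(x, y):
    # Sum of products of deviations from the means (numerator of the slope)
    return np.sum((x - np.mean(x)) * (y - np.mean(y)))

def squared_errors(x):
    # Sum of squared deviations from the mean (denominator of the slope)
    return np.sum((x - np.mean(x))**2)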
# Step 7: Calculate slope and intercept
slope = errors_product(x_train, y_train) / squared_errors(x_train)
intercept = np.mean(y_train) - slope * np.mean(x_train)
print(f"Slope: {slope}")
print(f"Intercept: {intercept}")
# Step 8: Plot the best fit regression line
plt.figure(figsize=(8, 5))
# Label each artist explicitly so the legend entries cannot be swapped
plt.scatter(x, y, color='red', marker='o', label='Data')
plt.plot(x, slope * x + intercept, color='black', linewidth=2, label='Best Fit Line')
plt.title('Regression Line: TV Advertisement vs Sales')
plt.xlabel('TV Advertisement Expense (₹1000s)')
plt.ylabel('Sales (Units in 1000s)')
plt.legend()
plt.grid(True)
plt.show()
# Step 9: Predict sales for Rs. 50,000 spent on TV ads
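def sales_predicted(tv_budget_k):
    return slope * tv_budget_k + intercept

# TV budget is in thousands, so Rs. 50,000 is passed as 50;
# sales are in thousands of units, so multiply by 1000 to get units
predicted_sales = sales_predicted(50) * 1000
print(f"Predicted Sales for Rs 50,000 spent on TV ads: {predicted_sales}")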
# Step 10: Compare original and predicted test data
y_predicted = slope * x_test + intercept
plt.figure(figsize=(8, 5))
plt.scatter(x_test, y_test, color='red', marker='o')
plt.scatter(x_test, y_predicted, color='black', marker='+')
plt.title('Original vs Predicted Sales (Test Set)')
plt.xlabel('TV Advertisement Expense (₹1000s)')
plt.ylabel('Sales (Units in 1000s)')
plt.legend(['Original', 'Predicted'])
plt.grid(True)
plt.show()
# Step 11: Evaluate model performance
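mean_error = np.mean((y_test - y_predicted)**2)
# R-squared as 1 - SS_res / SS_tot (coefficient of determination on the test set)
ss_res = np.sum((y_test - y_predicted)**2)
ss_tot = np.sum((y_test - np.mean(y_test))**2)
r_squared = 1 - ss_res / ss_tot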
print(f"Mean Squared Error: {mean_error}")
print(f"R-squared Value: {r_squared}")
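For reference, the two evaluation metrics can be sketched on a tiny made-up example. R-squared is commonly computed either as 1 − SS_res / SS_tot or as the squared correlation between actual and predicted values. The two coincide on the data a least-squares line was fitted to, but on held-out data the squared correlation ignores any systematic offset in the predictions. The arrays and function names below are illustrative assumptions, not part of the exercise.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)

def r2_from_residuals(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

def r2_from_correlation(y_true, y_pred):
    """Squared Pearson correlation between actual and predicted values."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

# Hypothetical values: every prediction is too high by exactly 2
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_biased = y_true + 2.0

print("MSE:", mse(y_true, y_biased))
print("R-squared (residuals):  ", r2_from_residuals(y_true, y_biased))
print("R-squared (correlation):", r2_from_correlation(y_true, y_biased))
```

Here the residual-based score drops to about 0.2 because of the constant offset, while the squared correlation stays at about 1, which is why the residual-based form is usually preferred for test-set evaluation.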
Output:
Saving Advertising.csv to Advertising (3).csv
First 5 rows of dataset:
   Unnamed: 0     TV  Radio  Newspaper  Sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9
Column Names: Index(['Unnamed: 0', 'TV', 'Radio', 'Newspaper', 'Sales'], dtype='object')
Slope: 0.04906288039571123
Intercept: 7.110732084446855
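With the slope and intercept printed above, the Step 9 prediction can be checked by hand. This sketch simply plugs the reported values into y = mx + c. It assumes, as the code does, that the TV budget is expressed in thousands and that sales are in thousands of units.

```python
# Values copied from the printed slope and intercept
slope = 0.04906288039571123
intercept = 7.110732084446855

tv_budget_k = 50  # Rs. 50,000 expressed in thousands

# Apply y = m*x + c; the result is in thousands of units
predicted_sales_k = slope * tv_budget_k + intercept

# Rescale to individual units
predicted_sales_units = predicted_sales_k * 1000

print(f"Predicted sales: {predicted_sales_k:.3f} thousand units")
print(f"Predicted sales: {predicted_sales_units:.0f} units")
```

Under these assumptions the prediction works out to roughly 9.564 thousand units, or about 9,564 units.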