Linear Regression Analysis Tutorial - Polynomial Regression
Creator: Muhammad Bilal Alam
What is Polynomial Regression?
Polynomial regression is a type of regression analysis in which the relationship
between the independent variable x and the dependent variable y is modeled as an
nth degree polynomial. The formula for a polynomial regression model of degree n
can be written as:
y = B0 + B1x + B2x^2 + ... + Bnx^n + e
where:
* y is the dependent variable
* x is the independent variable
* B0, B1, B2, ..., Bn are the coefficients of the polynomial regression model
* e is the error term or the residual
* n is the degree of the polynomial
The goal of polynomial regression is to find the values of the coefficients B0, B1, B2, ..., Bn that minimize the sum of squared errors between the predicted values of y and the actual values of y.
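As a quick illustration of the idea (a minimal sketch on synthetic data, not yet the housing dataset), scikit-learn's PolynomialFeatures can expand x into polynomial terms that a plain LinearRegression then fits:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following y = 1 + 2x + 3x^2 plus noise (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=100)

# Expand x into [x, x^2]; include_bias=False because LinearRegression fits B0 itself
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # coefficients should land close to 1, [2, 3]

Note that the model is still linear in the coefficients; only the features are non-linear in x, which is why ordinary least squares can fit it.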
The California Housing Dataset
The California Housing Dataset contains information on the median income, housing age, and other features for census tracts in California. The dataset was originally published by Pace, R. Kelley and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in the sklearn.datasets module.
The dataset consists of 20,640 instances, each representing a census tract in California. There are eight features in the dataset, including:
* MedInc: Median income in the census tract
* HouseAge: Median age of houses in the census tract
* AveRooms: Average number of rooms per dwelling in the census tract
* AveBedrms: Average number of bedrooms per dwelling in the census tract
* Population: Total number of people living in the census tract
* AveOccup: Average number of people per household in the census tract
* Latitude: Latitude of the center of the census tract
* Longitude: Longitude of the center of the census tract.
Step 1: Import the necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
Step 2: Load the dataset
# Load the California Housing Dataset from sklearn
california = fetch_california_housing()
# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)
# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target
# Print the first 5 rows of the dataframe
california_df.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
Step 3: Do Data Preprocessing along with Exploratory Data Analysis
Step 3(a): Check Shape of Dataframe
Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.
# Print the shape of the dataframe
print("Data shape:", california_df.shape)
Data shape: (20640, 9)
Step 3(b): Check Info of Dataframe
This is very useful to quickly get an overview of the structure and properties of a dataset, and to check for any missing or null values that may need to be addressed before performing any analysis or modeling.
california_df.info()
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
Step 3(c): Show Descriptive Statistics of each numerical column
Looking at descriptive statistics in machine learning is important because it gives an
overview of the dataset's distribution and key characteristics. Some of the reasons why we
should look at descriptive statistics include:
* Understanding the distribution of data: Descriptive statistics provide information about the central tendency and the spread of the data. This information is useful in determining the type of distribution and whether the data is skewed or symmetrical.
* Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in the dataset. These outliers can have a significant impact on the analysis and should be investigated further.
From the descriptive statistics, we can observe the following:
* Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have high maximum values, indicating the presence of outliers in the data. These outliers may need to be treated or removed before model selection; we will create visuals to see them more clearly.
* Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be roughly normally distributed, as the mean and median values are close to each other and the standard deviation is not very high. The 'Latitude' column is skewed to the right, as its mean (35.63) is greater than its median (34.26), and the 'Longitude' column is skewed to the left, as its mean (-119.57) is less than its median (-118.49). A quick numeric check follows below.
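To back these observations with numbers, one option is pandas' built-in skew() (a small sketch; positive values indicate right skew, negative values left skew):

# Compare mean vs. median and compute sample skewness for each column
summary = pd.DataFrame({
    'mean': california_df.mean(),
    'median': california_df.median(),
    'skew': california_df.skew(),
})
print(summary.round(3))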
california_df.describe().T

              count         mean          std          min          25%          50%          75%           max
MedInc      20640.0     3.870671     1.899822     0.499900     2.563400     3.534800     4.743250     15.000100
HouseAge    20640.0    28.639486    12.585558     1.000000    18.000000    29.000000    37.000000     52.000000
AveRooms    20640.0     5.429000     2.474173     0.846154     4.440716     5.229129     6.052381    141.909091
AveBedrms   20640.0     1.096675     0.473911     0.333333     1.006079     1.048780     1.099526     34.066667
Population  20640.0  1425.476744  1132.462122     3.000000   787.000000  1166.000000  1725.000000  35682.000000
AveOccup    20640.0     3.070655    10.386050     0.692308     2.429741     2.818116     3.282261   1243.333333
Latitude    20640.0    35.631861     2.135952    32.540000    33.930000    34.260000    37.710000     41.950000
Longitude   20640.0  -119.569704     2.003532  -124.350000  -121.800000  -118.490000  -118.010000   -114.310000
MedHouseVal 20640.0     2.068558     1.153956     0.149990     1.196000     1.797000     2.647250      5.000010
Step 3(d): Check for missing values in the Dataframe
This is important because most machine learning algorithms cannot handle missing data and will throw an error if missing values are present. Therefore, it is necessary to check for missing values and impute or remove them before fitting the data into a machine learning model. This helps to ensure that the model is trained on complete and accurate data, which leads to better performance and more reliable predictions.
Here we have no missing values, so let's move on.
# Check for missing values
print("Missing values:\n", california_df.isnull().sum())
Missing values:
MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
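There is nothing to impute here, but if missing values had appeared, one common approach (a sketch, assuming median imputation suits the columns) is scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

# Replace NaNs in each column with that column's median (a no-op on this dataset)
imputer = SimpleImputer(strategy='median')
california_df_imputed = pd.DataFrame(imputer.fit_transform(california_df),
                                     columns=california_df.columns)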
Step 3(e): Check for duplicate values in the Dataframe
Checking for duplicate values is important because duplicates can affect the accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your model is too closely fit to the training data and does not generalize well to new data.
We have no duplicate values, so that's good.
california_df.duplicated().sum()
0
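Had any duplicates shown up, removing them is a one-liner (a sketch; keep='first' retains the first occurrence of each duplicated row):

# Drop exact duplicate rows and reset the index
california_df = california_df.drop_duplicates(keep='first').reset_index(drop=True)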
Step 3(f)(i): Check for Outliers in the Dataframe
We should check for outliers because they can skew the results of the analysis. Outliers can significantly alter the mean, standard deviation, and other statistical measures, which can misrepresent the true characteristics of the data. Linear regression models are sensitive to outliers and can produce inaccurate results if the outliers are not properly handled or removed. Therefore, it is important to identify and handle outliers appropriately to ensure the accuracy and reliability of the models.
In the plots below we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.
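Before plotting, a rough numeric check can count how many points fall outside the 1.5 x IQR whiskers that the boxplots below draw (a sketch of the standard IQR rule):

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column
q1 = california_df.quantile(0.25)
q3 = california_df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (california_df < q1 - 1.5 * iqr) | (california_df > q3 + 1.5 * iqr)
print(outlier_mask.sum().sort_values(ascending=False))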
# Create a boxplot of the 'AveRooms' column
ax = sns.boxplot(x=california_df['AveRooms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Rooms')
ax.set_xlabel('AveRooms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Rooms]
# Create a boxplot of the 'AveBedrms' column
ax = sns.boxplot(x=california_df['AveBedrms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Bedrooms')
ax.set_xlabel('AveBedrms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Bedrooms]
# Create a boxplot of the 'Population' column
ax = sns.boxplot(x=california_df['Population'])
# Set the title and axes labels
ax.set_title('Boxplot of Populations')
ax.set_xlabel('Population')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Populations]
# Create a boxplot of the 'AveOccup' column
ax = sns.boxplot(x=california_df['AveOccup'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Occupancy')
ax.set_xlabel('AveOccup')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Occupancy]
# Create a boxplot of the 'MedInc' column
ax = sns.boxplot(x=california_df['MedInc'])
# Set the title and axes labels
ax.set_title('Boxplot of MedInc')
ax.set_xlabel('MedInc')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of MedInc]
Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization
This method replaces extreme values with the nearest values within a certain percentile range. Here we replace values above the 95th percentile with the value at the 95th percentile, and values below the 10th percentile with the value at the 10th percentile. From the visuals further below we can clearly see that the data is much closer to a normal distribution afterwards.
# Define the percentile limits for winsorization
pct_lower = 0.10
pct_upper = 0.95
# Apply winsorization to the five columns
california_df['AveRooms'] = np.clip(california_df['AveRooms'],
                                    california_df['AveRooms'].quantile(pct_lower),
                                    california_df['AveRooms'].quantile(pct_upper))
california_df['AveBedrms'] = np.clip(california_df['AveBedrms'],
                                     california_df['AveBedrms'].quantile(pct_lower),
                                     california_df['AveBedrms'].quantile(pct_upper))
california_df['Population'] = np.clip(california_df['Population'],
                                      california_df['Population'].quantile(pct_lower),
                                      california_df['Population'].quantile(pct_upper))
california_df['AveOccup'] = np.clip(california_df['AveOccup'],
                                    california_df['AveOccup'].quantile(pct_lower),
                                    california_df['AveOccup'].quantile(pct_upper))
california_df['MedInc'] = np.clip(california_df['MedInc'],
                                  california_df['MedInc'].quantile(pct_lower),
                                  california_df['MedInc'].quantile(pct_upper))
# Create a boxplot of the 'AveRooms' column
ax = sns.boxplot(x=california_df['AveRooms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Rooms')
ax.set_xlabel('AveRooms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Rooms after winsorization]
# Create a boxplot of the 'AveBedrms' column
ax = sns.boxplot(x=california_df['AveBedrms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Bedrooms')
ax.set_xlabel('AveBedrms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Bedrooms after winsorization]
# Create a boxplot of the 'Population' column
ax = sns.boxplot(x=california_df['Population'])
# Set the title and axes labels
ax.set_title('Boxplot of Populations')
ax.set_xlabel('Population')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Populations after winsorization]
# Create a boxplot of the 'MedInc' column
ax = sns.boxplot(x=california_df['MedInc'])
# Set the title and axes labels
ax.set_title('Boxplot of MedInc')
ax.set_xlabel('MedInc')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of MedInc after winsorization]
Step 3(g): Check for Skewness using a Histogram
Skewed data can result in biased estimates of model parameters and reduce the accuracy of predictions. Therefore, it is important to assess the distribution of the features and target variable to identify any potential issues and take appropriate measures to address them. Here almost all of the features and the target look normally distributed. There is some skewness in MedHouseVal, but not enough to warrant a transformation.
Note: For learning purposes I have shown how to transform MedHouseVal for skewness in my previous tutorial on Simple Linear Regression. Feel free to check that out.
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='MedHouseVal', kde=True, bins=50, color='#7fc97f')
# Set x and y axis labels and title
plt.xlabel('Median House Value', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Median House Value in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['MedHouseVal'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Median House Value in California]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='MedInc', kde=True, bins=50, color='#beaed4')
# Set x and y axis labels and title
plt.xlabel('Median Income Value', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Median Income in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['MedInc'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Median Income in California (mean = 3.77)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='HouseAge', kde=True, bins=50, color='#fdc086')
# Set x and y axis labels and title
plt.xlabel('HouseAge', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of House Age in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['HouseAge'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of House Age in California (mean = 28.64)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='AveRooms', kde=True, bins=50, color='#ffff99')
# Set x and y axis labels and title
plt.xlabel('AveRooms', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of AveRooms in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['AveRooms'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of AveRooms in California (mean = 5.28)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='AveBedrms', kde=True, bins=50, color='#386cb0')
# Set x and y axis labels and title
plt.xlabel('AveBedrms', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of AveBedrms in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['AveBedrms'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of AveBedrms in California (mean = 1.06)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='Population', kde=True, bins=50, color='#f0027f')
# Set x and y axis labels and title
plt.xlabel('Population', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Population in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['Population'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Population in California (mean = 1345.78)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='AveOccup', kde=True, bins=50, color='#bf5b17')
# Set x and y axis labels and title
plt.xlabel('AveOccup', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of AveOccup in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['AveOccup'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of AveOccup in California (mean = 2.89)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='Latitude', kde=True, bins=50, color='#aaccee')
# Set x and y axis labels and title
plt.xlabel('Latitude', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Latitude', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['Latitude'].mean()
plt.axvline(mean, color='blue', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Latitude (mean = 35.63)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='Longitude', kde=True, bins=50, color='#ff5733')
# Set x and y axis labels and title
plt.xlabel('Longitude', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Longitude', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['Longitude'].mean()
plt.axvline(mean, color='blue', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Longitude (mean = -119.57)]
Step 3(h): Create a Vertical Correlation Heatmap
The correlation matrix shows the correlation coefficients between every pair of variables in the dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect positive correlation.
# Calculate the correlation matrix
corr_matrix = california_df.corr()
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(6, 12))
# Create the heatmap of each feature's correlation with the target
# (full-matrix version kept for reference)
# sns.heatmap(corr_matrix, cmap='BrBG', annot=True, fmt='.2f', linewidths=.5, ax=ax)
sns.heatmap(corr_matrix[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False),
            cmap='BrBG', annot=True, fmt='.2f', linewidths=.5, ax=ax)
# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)
# Rotate the x-axis labels for readability
plt.xticks(rotation=0, ha='right')
# Show the plot
plt.show()

[Figure: Correlation Heatmap for California Housing Dataset. Correlations with MedHouseVal, sorted descending: MedHouseVal, MedInc, AveRooms, HouseAge, Longitude, Latitude, Population, AveBedrms, AveOccup]
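The ranking the heatmap displays can also be read off directly from the correlation matrix (a small sketch):

# Correlation of each feature with the target, strongest first
target_corr = corr_matrix['MedHouseVal'].drop('MedHouseVal')
print(target_corr.sort_values(ascending=False))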
Step 3(i): Perform Feature Scaling
Feature scaling is the process of transforming numerical features in a dataset to have similar
scales or ranges of values. The purpose of feature scaling is to ensure that all features have
the same level of impact on the model and to prevent certain features from dominating the
model simply because they have larger values. In linear regression, feature scaling is
particularly important because the coefficients of the model represent the change in the
dependent variable associated with a one-unit change in the independent variable. Scaling
the features to have similar ranges can result in a more accurate and reliable model with
more accurate representations of the relationships between the independent variables and
the dependent variable.
scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
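Under the hood, StandardScaler applies z = (x - mean) / std column by column; a quick sketch verifying that against a manual computation (ddof=0 because StandardScaler uses the population standard deviation):

# Manual z-score standardization should match StandardScaler's output
manual = (california_df - california_df.mean()) / california_df.std(ddof=0)
print(np.allclose(manual.values, california_df_scaled.values))  # True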
Step 3(j): Check for Assumptions using Scatter Plots
From the scatter plots, we can observe that there is a linear relationship between the dependent variable (Median House Value) and some of the independent variables, such as Median Income and Total Rooms. However, some of the independent variables, such as Longitude, Latitude, and Housing Median Age, do not have a clear linear relationship with the dependent variable. This suggests that a plain linear regression model might not be the best fit for predicting the Median House Value based on these variables.
# Create scatter plots
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(30, 15))
axs[0,0].scatter(california_df_scaled['Latitude'], california_df_scaled['MedHouseVal'])
axs[0,0].set_xlabel('Latitude')
axs[0,0].set_ylabel('Median House Value')
axs[0,0].set_title('Latitude vs Median House Value')
axs[0,1].scatter(california_df_scaled['Longitude'], california_df_scaled['MedHouseVal'])
axs[0,1].set_xlabel('Longitude')
axs[0,1].set_ylabel('Median House Value')
axs[0,1].set_title('Longitude vs Median House Value')
axs[0,2].scatter(california_df_scaled['HouseAge'], california_df_scaled['MedHouseVal'])
axs[0,2].set_xlabel('Housing Median Age')
axs[0,2].set_ylabel('Median House Value')
axs[0,2].set_title('Housing Median Age vs Median House Value')
axs[0,3].scatter(california_df_scaled['AveRooms'], california_df_scaled['MedHouseVal'])
axs[0,3].set_xlabel('Total Rooms')
axs[0,3].set_ylabel('Median House Value')
axs[0,3].set_title('Total Rooms vs Median House Value')
axs[1,0].scatter(california_df_scaled['AveBedrms'], california_df_scaled['MedHouseVal'])
axs[1,0].set_xlabel('Total Bedrooms')
axs[1,0].set_ylabel('Median House Value')
axs[1,0].set_title('Total Bedrooms vs Median House Value')
axs[1,1].scatter(california_df_scaled['Population'], california_df_scaled['MedHouseVal'])
axs[1,1].set_xlabel('Population')
axs[1,1].set_ylabel('Median House Value')
axs[1,1].set_title('Population vs Median House Value')
axs[1,2].scatter(california_df_scaled['AveOccup'], california_df_scaled['MedHouseVal'])
axs[1,2].set_xlabel('Households')
axs[1,2].set_ylabel('Median House Value')
axs[1,2].set_title('Households vs Median House Value')
axs[1,3].scatter(california_df_scaled['MedInc'], california_df_scaled['MedHouseVal'])
axs[1,3].set_xlabel('Median Income')
axs[1,3].set_ylabel('Median House Value')
axs[1,3].set_title('Median Income vs Median House Value')
plt.show()

Step 4: Define Dependent and Independent Variables
X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df_scaled['MedHouseVal']
Step 5: Do a Train-Test Split in the Ratio 70:30
# Split the data into training and testing sets (70% train, 30% test; random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 6: Search for the Best Polynomial Degree Based on R^2
# Initialize variables for storing the best model and score
best_score = 0
best_degree = 1
# Loop through polynomial degrees from 1 to 10
for degree in range(1, 11):
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_poly_train = poly_features.fit_transform(X_train)
    X_poly_test = poly_features.transform(X_test)
    # Fit a linear regression model to the polynomial features
    poly_reg = LinearRegression()
    poly_reg.fit(X_poly_train, y_train)
    # Evaluate the model on the test set
    y_pred = poly_reg.predict(X_poly_test)
    score = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    # Check if this is the best model so far
    if score > best_score:
        best_score = score
        best_mse = mse
        best_degree = degree
        best_model = poly_reg
# Print the best degree and score
print("Best degree:", best_degree)
print("R^2 score:", best_score)
print("Mean Squared Error:", best_mse)
Best degree: 4
R^2 score: 0.7564759864738175
Mean Squared Error: 0.2438324527757844
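Since GridSearchCV was imported in Step 1 but never used, the manual loop above could equally be written as a Pipeline searched over the degree (a sketch with 3-fold cross-validation; the degree range and cv value here are illustrative choices, not part of the original code):

from sklearn.pipeline import Pipeline

# Search the polynomial degree with cross-validation instead of a manual loop
pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('reg', LinearRegression()),
])
grid = GridSearchCV(pipe, param_grid={'poly__degree': [1, 2, 3, 4]},
                    scoring='r2', cv=3)
grid.fit(X_train, y_train)
print('Best degree:', grid.best_params_['poly__degree'])
print('CV R^2:', grid.best_score_)

Cross-validated selection also avoids choosing the degree on the test set, which the manual loop implicitly does.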
# Recompute predictions with the best model (the loop above leaves y_pred at degree 10)
y_pred = best_model.predict(PolynomialFeatures(degree=best_degree).fit_transform(X_test))
# Plot actual vs predicted values
plt.figure(figsize=(12, 6))
plt.scatter(y_test, y_pred, color='blue')
# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Polynomial Regression Results (Degree = ' + str(best_degree) +
          ', R^2 = ' + str(round(best_score, 2)) + ')')
# Add a diagonal line for reference
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
# Display the plot
plt.show()
[Figure: Polynomial Regression Results (Degree = 4, R^2 = 0.76): scatter of actual vs. predicted values with a diagonal reference line]
Conclusion:
In this polynomial regression tutorial, we explored how to use polynomial regression to model a non-linear relationship between the features and the target, specifically the median house value in California. We started by loading and cleaning the dataset, and then visualized the relationships between the variables using scatter plots. Next, we split the data into training and testing sets and used polynomial regression to fit a model to the training data. We gradually increased the degree of the polynomial until we achieved a good balance between bias and variance. We then used the model to make predictions on the test data and evaluated its performance using the mean squared error and R-squared score.