Linear Regression Analysis Tutorial - Polynomial Regression
Creator: Muhammad Bilal Alam
What is Polynomial Regression?
Polynomial regression is a type of regression analysis in which the relationship
between the independent variable x and the dependent variable y is modeled as an
nth degree polynomial. The formula for a polynomial regression model of degree n
can be written as:
y = B0 + B1x + B2x^2 + ... + Bnx^n + e
where:
* y is the dependent variable
* x is the independent variable
* B0, B1, B2, ..., Bn are the coefficients of the polynomial regression model
* e is the error term or the residual
* n is the degree of the polynomial
The goal of polynomial regression is to find the values of the coefficients B0, B1, B2, ..., Bn that minimize the sum of squared errors between the predicted values of y and the actual values of y.
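As a quick illustration of the idea (a minimal sketch on synthetic data, not yet the housing dataset), scikit-learn's PolynomialFeatures can expand x into polynomial terms that a plain LinearRegression then fits:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following y = 1 + 2x + 3x^2 plus noise (hypothetical example)
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(100, 1))
y = 1 + 2 * x[:, 0] + 3 * x[:, 0] ** 2 + rng.normal(0, 0.5, size=100)

# Expand x into [x, x^2]; include_bias=False because LinearRegression fits B0 itself
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # coefficients should land close to 1, [2, 3]

Note that the model is still linear in the coefficients; only the features are non-linear in x, which is why ordinary least squares can fit it.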
The California Housing Dataset
The California Housing Dataset contains information on the median income, housing age, and other features for census tracts in California. The dataset was originally published by Pace, R. Kelley and Ronald Barry in their 1997 paper "Sparse Spatial Autoregressions" and is available in the sklearn.datasets module.
The dataset consists of 20,640 instances, each representing a census tract in California. There are eight features in the dataset, including:
* MedInc: Median income in the census tract
* HouseAge: Median age of houses in the census tract
* AveRooms: Average number of rooms per dwelling in the census tract
* AveBedrms: Average number of bedrooms per dwelling in the census tract
* Population: Total number of people living in the census tract
* AveOccup: Average number of people per household in the census tract
* Latitude: Latitude of the center of the census tract
* Longitude: Longitude of the center of the census tract.
Step 1: Import the necessary libraries
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
Step 2: Load the dataset
# Load the California Housing Dataset from sklearn
california = fetch_california_housing()
# Convert the data to a pandas dataframe
california_df = pd.DataFrame(data=california.data, columns=california.feature_names)
# Add the target variable to the dataframe
california_df['MedHouseVal'] = california.target
# Print the first 5 rows of the dataframe
california_df.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude  MedHouseVal
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23        4.526
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22        3.585
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24        3.521
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25        3.413
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25        3.422
Step 3: Do Data Preprocessing along with Exploratory Data Analysis
Step 3(a): Check Shape of Dataframe
Checking the shape of the dataframe tells us how many rows and columns we have in the dataset.
# Print the shape of the dataframe
print("Data shape:", california_df.shape)
Data shape: (20640, 9)
Step 3(b): Check Info of Dataframe
This is very useful to quickly get an overview of the structure and properties of a dataset, and to check for any missing or null values that may need to be addressed before performing any analysis or modeling.
california_df.info()
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   MedInc       20640 non-null  float64
 1   HouseAge     20640 non-null  float64
 2   AveRooms     20640 non-null  float64
 3   AveBedrms    20640 non-null  float64
 4   Population   20640 non-null  float64
 5   AveOccup     20640 non-null  float64
 6   Latitude     20640 non-null  float64
 7   Longitude    20640 non-null  float64
 8   MedHouseVal  20640 non-null  float64
dtypes: float64(9)
memory usage: 1.4 MB
Step 3(c): Show Descriptive Statistics of each numerical column
Looking at descriptive statistics in machine learning is important because it gives an
overview of the dataset's distribution and key characteristics. Some of the reasons why we
should look at descriptive statistics include:
* Understanding the distribution of data: Descriptive statistics provide information about the central tendency and the spread of the data. This information is useful in determining the type of distribution and whether the data is skewed or symmetrical.
* Identifying outliers: Descriptive statistics help to identify any extreme values or outliers in the dataset. These outliers can have a significant impact on the analysis and should be investigated further.
From the descriptive statistics, we can observe the following:
* Outliers: The 'AveRooms', 'AveBedrms', 'Population', and 'AveOccup' columns have high maximum values, indicating the presence of outliers in the data. These outliers may need to be treated or removed before model selection; we will create visuals to see them more clearly.
* Distribution: The 'MedInc', 'HouseAge', and 'MedHouseVal' columns appear to be roughly normally distributed, as the mean and median values are close to each other and the standard deviation is not very high. The 'Latitude' column is skewed to the right, as its mean (35.63) is greater than its median (34.26), and the 'Longitude' column is skewed to the left, as its mean (-119.57) is less than its median (-118.49). A quick numeric check follows below.
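To back these observations with numbers, one option is pandas' built-in skew() (a small sketch; positive values indicate right skew, negative values left skew):

# Compare mean vs. median and compute sample skewness for each column
summary = pd.DataFrame({
    'mean': california_df.mean(),
    'median': california_df.median(),
    'skew': california_df.skew(),
})
print(summary.round(3))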
california_df.describe().T

              count         mean          std          min          25%          50%          75%           max
MedInc      20640.0     3.870671     1.899822     0.499900     2.563400     3.534800     4.743250     15.000100
HouseAge    20640.0    28.639486    12.585558     1.000000    18.000000    29.000000    37.000000     52.000000
AveRooms    20640.0     5.429000     2.474173     0.846154     4.440716     5.229129     6.052381    141.909091
AveBedrms   20640.0     1.096675     0.473911     0.333333     1.006079     1.048780     1.099526     34.066667
Population  20640.0  1425.476744  1132.462122     3.000000   787.000000  1166.000000  1725.000000  35682.000000
AveOccup    20640.0     3.070655    10.386050     0.692308     2.429741     2.818116     3.282261   1243.333333
Latitude    20640.0    35.631861     2.135952    32.540000    33.930000    34.260000    37.710000     41.950000
Longitude   20640.0  -119.569704     2.003532  -124.350000  -121.800000  -118.490000  -118.010000   -114.310000
MedHouseVal 20640.0     2.068558     1.153956     0.149990     1.196000     1.797000     2.647250      5.000010
Step 3(d): Check for missing values in the Dataframe
This is important because most machine learning algorithms cannot handle missing data and will throw an error if missing values are present. Therefore, it is necessary to check for missing values and impute or remove them before fitting the data into a machine learning model. This helps to ensure that the model is trained on complete and accurate data, which leads to better performance and more reliable predictions.
Here we have no missing values, so let's move on.
# Check for missing values
print("Missing values:\n", california_df.isnull().sum())
Missing values:
MedInc         0
HouseAge       0
AveRooms       0
AveBedrms      0
Population     0
AveOccup       0
Latitude       0
Longitude      0
MedHouseVal    0
dtype: int64
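There is nothing to impute here, but if missing values had appeared, one common approach (a sketch, assuming median imputation suits the columns) is scikit-learn's SimpleImputer:

from sklearn.impute import SimpleImputer

# Replace NaNs in each column with that column's median (a no-op on this dataset)
imputer = SimpleImputer(strategy='median')
california_df_imputed = pd.DataFrame(imputer.fit_transform(california_df),
                                     columns=california_df.columns)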
Step 3(e): Check for duplicate values in the Dataframe
Checking for duplicate values is important because duplicates can affect the accuracy of your model. Duplicate values can skew your data and lead to overfitting, where your model is too closely fit to the training data and does not generalize well to new data.
We have no duplicate values, so that's good.
california_df.duplicated().sum()
0
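Had any duplicates shown up, removing them is a one-liner (a sketch; keep='first' retains the first occurrence of each duplicated row):

# Drop exact duplicate rows and reset the index
california_df = california_df.drop_duplicates(keep='first').reset_index(drop=True)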
Step 3(f)(i): Check for Outliers in the Dataframe
We should check for outliers because they can skew the results of the analysis. Outliers can significantly alter the mean, standard deviation, and other statistical measures, which can misrepresent the true characteristics of the data. Linear regression models are sensitive to outliers and can produce inaccurate results if the outliers are not properly handled or removed. Therefore, it is important to identify and handle outliers appropriately to ensure the accuracy and reliability of the models.
In the plots below we can clearly see very high outliers on the right-hand side, so we need to deal with them appropriately.
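Before plotting, a rough numeric check can count how many points fall outside the 1.5 x IQR whiskers that the boxplots below draw (a sketch of the standard IQR rule):

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column
q1 = california_df.quantile(0.25)
q3 = california_df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (california_df < q1 - 1.5 * iqr) | (california_df > q3 + 1.5 * iqr)
print(outlier_mask.sum().sort_values(ascending=False))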
# Create a boxplot of the 'AveRooms' column
ax = sns.boxplot(x=california_df['AveRooms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Rooms')
ax.set_xlabel('AveRooms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Rooms]
# Create a boxplot of the 'AveBedrms' column
ax = sns.boxplot(x=california_df['AveBedrms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Bedrooms')
ax.set_xlabel('AveBedrms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Bedrooms]
# Create a boxplot of the 'Population' column
ax = sns.boxplot(x=california_df['Population'])
# Set the title and axes labels
ax.set_title('Boxplot of Populations')
ax.set_xlabel('Population')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Populations]
# Create a boxplot of the 'AveOccup' column
ax = sns.boxplot(x=california_df['AveOccup'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Occupancy')
ax.set_xlabel('AveOccup')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Occupancy]
# Create a boxplot of the 'MedInc' column
ax = sns.boxplot(x=california_df['MedInc'])
# Set the title and axes labels
ax.set_title('Boxplot of MedInc')
ax.set_xlabel('MedInc')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of MedInc]
Step 3(f)(ii): Deal with Outliers in the Dataframe using Winsorization
This method replaces extreme values with the nearest values within a certain percentile range. Here we replace values above the 95th percentile with the value at the 95th percentile, and values below the 10th percentile with the value at the 10th percentile. From the visuals further below we can clearly see that the data is much closer to a normal distribution afterwards.
# Define the percentile limits for winsorization
pct_lower = 0.10
pct_upper = 0.95
# Apply winsorization to the five columns
california_df['AveRooms'] = np.clip(california_df['AveRooms'],
                                    california_df['AveRooms'].quantile(pct_lower),
                                    california_df['AveRooms'].quantile(pct_upper))
california_df['AveBedrms'] = np.clip(california_df['AveBedrms'],
                                     california_df['AveBedrms'].quantile(pct_lower),
                                     california_df['AveBedrms'].quantile(pct_upper))
california_df['Population'] = np.clip(california_df['Population'],
                                      california_df['Population'].quantile(pct_lower),
                                      california_df['Population'].quantile(pct_upper))
california_df['AveOccup'] = np.clip(california_df['AveOccup'],
                                    california_df['AveOccup'].quantile(pct_lower),
                                    california_df['AveOccup'].quantile(pct_upper))
california_df['MedInc'] = np.clip(california_df['MedInc'],
                                  california_df['MedInc'].quantile(pct_lower),
                                  california_df['MedInc'].quantile(pct_upper))
# Create a boxplot of the 'AveRooms' column
ax = sns.boxplot(x=california_df['AveRooms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Rooms')
ax.set_xlabel('AveRooms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Rooms after winsorization]
# Create a boxplot of the 'AveBedrms' column
ax = sns.boxplot(x=california_df['AveBedrms'])
# Set the title and axes labels
ax.set_title('Boxplot of Average Number of Bedrooms')
ax.set_xlabel('AveBedrms')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Average Number of Bedrooms after winsorization]
# Create a boxplot of the 'Population' column
ax = sns.boxplot(x=california_df['Population'])
# Set the title and axes labels
ax.set_title('Boxplot of Populations')
ax.set_xlabel('Population')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of Populations after winsorization]
# Create a boxplot of the 'MedInc' column
ax = sns.boxplot(x=california_df['MedInc'])
# Set the title and axes labels
ax.set_title('Boxplot of MedInc')
ax.set_xlabel('MedInc')
ax.set_ylabel('')
# Customize the y-axis tick labels to display values in millions
yticks = ax.get_yticks() / 1000000
ax.set_yticklabels(['{:.2f}'.format(ytick) for ytick in yticks])
# Add grid lines and remove top and right spines
ax.grid(axis='y', linestyle='--', alpha=0.7)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Show the plot
plt.show()

[Figure: Boxplot of MedInc after winsorization]
Step 3(g): Check for Skewness using a Histogram
Skewed data can result in biased estimates of model parameters and reduce the accuracy of predictions. Therefore, it is important to assess the distribution of the features and target variable to identify any potential issues and take appropriate measures to address them. Here almost all of the features and the target look normally distributed. There is some skewness in MedHouseVal, but not enough to warrant a transformation.
Note: For learning purposes I have shown how to transform MedHouseVal for skewness in my previous tutorial on Simple Linear Regression. Feel free to check that out.
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='MedHouseVal', kde=True, bins=50, color='#7fc97f')
# Set x and y axis labels and title
plt.xlabel('Median House Value', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Median House Value in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['MedHouseVal'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Median House Value in California]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='MedInc', kde=True, bins=50, color='#beaed4')
# Set x and y axis labels and title
plt.xlabel('Median Income Value', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Median Income in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['MedInc'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Median Income in California (mean = 3.77)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='HouseAge', kde=True, bins=50, color='#fdc086')
# Set x and y axis labels and title
plt.xlabel('HouseAge', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of House Age in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['HouseAge'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of House Age in California (mean = 28.64)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='AveRooms', kde=True, bins=50, color='#ffff99')
# Set x and y axis labels and title
plt.xlabel('AveRooms', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of AveRooms in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['AveRooms'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of AveRooms in California (mean = 5.28)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='AveBedrms', kde=True, bins=50, color='#386cb0')
# Set x and y axis labels and title
plt.xlabel('AveBedrms', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of AveBedrms in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['AveBedrms'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of AveBedrms in California (mean = 1.06)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='Population', kde=True, bins=50, color='#f0027f')
# Set x and y axis labels and title
plt.xlabel('Population', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Population in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['Population'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Population in California (mean = 1345.78)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='AveOccup', kde=True, bins=50, color='#bf5b17')
# Set x and y axis labels and title
plt.xlabel('AveOccup', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of AveOccup in California', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['AveOccup'].mean()
plt.axvline(mean, color='red', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of AveOccup in California (mean = 2.89)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='Latitude', kde=True, bins=50, color='#aaccee')
# Set x and y axis labels and title
plt.xlabel('Latitude', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Latitude', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['Latitude'].mean()
plt.axvline(mean, color='blue', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Latitude (mean = 35.63)]
# Set figure size and font scale
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.5)
# Create histogram
sns.histplot(data=california_df, x='Longitude', kde=True, bins=50, color='#ff5733')
# Set x and y axis labels and title
plt.xlabel('Longitude', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plt.title('Distribution of Longitude', fontsize=20)
# Customize x and y axis tick marks and labels
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
# Add vertical line for mean
mean = california_df['Longitude'].mean()
plt.axvline(mean, color='blue', linestyle='--', label=f'Mean: {mean:.2f}')
plt.legend(fontsize=14)
# Show the plot
plt.show()
[Figure: Distribution of Longitude (mean = -119.57)]
Step 3(h): Create a Vertical Correlation Heatmap
The correlation matrix shows the correlation coefficients between every pair of variables in the dataset. A correlation coefficient ranges from -1 to 1 and measures the strength and direction of the linear relationship between two variables. A coefficient of -1 indicates a perfect negative correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect positive correlation.
# Calculate the correlation matrix
corr_matrix = california_df.corr()
# Set up the matplotlib figure
fig, ax = plt.subplots(figsize=(6, 12))
# Create the heatmap of each feature's correlation with the target
# (full-matrix version kept for reference)
# sns.heatmap(corr_matrix, cmap='BrBG', annot=True, fmt='.2f', linewidths=.5, ax=ax)
sns.heatmap(corr_matrix[['MedHouseVal']].sort_values(by='MedHouseVal', ascending=False),
            cmap='BrBG', annot=True, fmt='.2f', linewidths=.5, ax=ax)
# Set the title and axis labels
ax.set_title('Correlation Heatmap for California Housing Dataset', fontsize=16)
ax.set_xlabel('Features', fontsize=14)
ax.set_ylabel('Features', fontsize=14)
# Rotate the x-axis labels for readability
plt.xticks(rotation=0, ha='right')
# Show the plot
plt.show()

[Figure: Correlation Heatmap for California Housing Dataset. Correlations with MedHouseVal, sorted descending: MedHouseVal, MedInc, AveRooms, HouseAge, Longitude, Latitude, Population, AveBedrms, AveOccup]
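The ranking the heatmap displays can also be read off directly from the correlation matrix (a small sketch):

# Correlation of each feature with the target, strongest first
target_corr = corr_matrix['MedHouseVal'].drop('MedHouseVal')
print(target_corr.sort_values(ascending=False))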
Step 3(i): Perform Feature Scaling
Feature scaling is the process of transforming numerical features in a dataset to have similar
scales or ranges of values. The purpose of feature scaling is to ensure that all features have
the same level of impact on the model and to prevent certain features from dominating the
model simply because they have larger values. In linear regression, feature scaling is
particularly important because the coefficients of the model represent the change in the
dependent variable associated with a one-unit change in the independent variable. Scaling
the features to have similar ranges can result in a more accurate and reliable model with
more accurate representations of the relationships between the independent variables and
the dependent variable.
scaler = StandardScaler()
california_df_scaled = scaler.fit_transform(california_df)
california_df_scaled = pd.DataFrame(california_df_scaled, columns=california_df.columns)
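Under the hood, StandardScaler applies z = (x - mean) / std column by column; a quick sketch verifying that against a manual computation (ddof=0 because StandardScaler uses the population standard deviation):

# Manual z-score standardization should match StandardScaler's output
manual = (california_df - california_df.mean()) / california_df.std(ddof=0)
print(np.allclose(manual.values, california_df_scaled.values))  # True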
Step 3(j): Check for Assumptions using Scatter Plots
From the scatter plots, we can observe that there is a linear relationship between the dependent variable (Median House Value) and some of the independent variables, such as Median Income and Total Rooms. However, some of the independent variables, such as Longitude, Latitude, and Housing Median Age, do not have a clear linear relationship with the dependent variable. This suggests that a plain linear regression model might not be the best fit for predicting the Median House Value based on these variables.
# Create scatter plots
fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(30, 15))
axs[0,0].scatter(california_df_scaled['Latitude'], california_df_scaled['MedHouseVal'])
axs[0,0].set_xlabel('Latitude')
axs[0,0].set_ylabel('Median House Value')
axs[0,0].set_title('Latitude vs Median House Value')
axs[0,1].scatter(california_df_scaled['Longitude'], california_df_scaled['MedHouseVal'])
axs[0,1].set_xlabel('Longitude')
axs[0,1].set_ylabel('Median House Value')
axs[0,1].set_title('Longitude vs Median House Value')
axs[0,2].scatter(california_df_scaled['HouseAge'], california_df_scaled['MedHouseVal'])
axs[0,2].set_xlabel('Housing Median Age')
axs[0,2].set_ylabel('Median House Value')
axs[0,2].set_title('Housing Median Age vs Median House Value')
axs[0,3].scatter(california_df_scaled['AveRooms'], california_df_scaled['MedHouseVal'])
axs[0,3].set_xlabel('Total Rooms')
axs[0,3].set_ylabel('Median House Value')
axs[0,3].set_title('Total Rooms vs Median House Value')
axs[1,0].scatter(california_df_scaled['AveBedrms'], california_df_scaled['MedHouseVal'])
axs[1,0].set_xlabel('Total Bedrooms')
axs[1,0].set_ylabel('Median House Value')
axs[1,0].set_title('Total Bedrooms vs Median House Value')
axs[1,1].scatter(california_df_scaled['Population'], california_df_scaled['MedHouseVal'])
axs[1,1].set_xlabel('Population')
axs[1,1].set_ylabel('Median House Value')
axs[1,1].set_title('Population vs Median House Value')
axs[1,2].scatter(california_df_scaled['AveOccup'], california_df_scaled['MedHouseVal'])
axs[1,2].set_xlabel('Households')
axs[1,2].set_ylabel('Median House Value')
axs[1,2].set_title('Households vs Median House Value')
axs[1,3].scatter(california_df_scaled['MedInc'], california_df_scaled['MedHouseVal'])
axs[1,3].set_xlabel('Median Income')
axs[1,3].set_ylabel('Median House Value')
axs[1,3].set_title('Median Income vs Median House Value')
plt.show()

Step 4: Define Dependent and Independent Variables
X = california_df_scaled.drop(['MedHouseVal'], axis=1)
y = california_df_scaled['MedHouseVal']
Step 5: Do a Train-Test Split in the Ratio 70:30
# Split the data into training and testing sets (70% train, 30% test; random_state fixed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 6: Search for the Best Polynomial Degree Based on R^2
# Initialize variables for storing the best model and score
best_score = 0
best_degree = 1
# Loop through polynomial degrees from 1 to 10
for degree in range(1, 11):
    # Create polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    X_poly_train = poly_features.fit_transform(X_train)
    X_poly_test = poly_features.transform(X_test)
    # Fit a linear regression model to the polynomial features
    poly_reg = LinearRegression()
    poly_reg.fit(X_poly_train, y_train)
    # Evaluate the model on the test set
    y_pred = poly_reg.predict(X_poly_test)
    score = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    # Check if this is the best model so far
    if score > best_score:
        best_score = score
        best_mse = mse
        best_degree = degree
        best_model = poly_reg
# Print the best degree and score
print("Best degree:", best_degree)
print("R^2 score:", best_score)
print("Mean Squared Error:", best_mse)
Best degree: 4
R^2 score: 0.7564759864738175
Mean Squared Error: 0.2438324527757844
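Since GridSearchCV was imported in Step 1 but never used, the manual loop above could equally be written as a Pipeline searched over the degree (a sketch with 3-fold cross-validation; the degree range and cv value here are illustrative choices, not part of the original code):

from sklearn.pipeline import Pipeline

# Search the polynomial degree with cross-validation instead of a manual loop
pipe = Pipeline([
    ('poly', PolynomialFeatures()),
    ('reg', LinearRegression()),
])
grid = GridSearchCV(pipe, param_grid={'poly__degree': [1, 2, 3, 4]},
                    scoring='r2', cv=3)
grid.fit(X_train, y_train)
print('Best degree:', grid.best_params_['poly__degree'])
print('CV R^2:', grid.best_score_)

Cross-validated selection also avoids choosing the degree on the test set, which the manual loop implicitly does.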
# Recompute predictions with the best model (the loop above leaves y_pred at degree 10)
y_pred = best_model.predict(PolynomialFeatures(degree=best_degree).fit_transform(X_test))
# Plot actual vs predicted values
plt.figure(figsize=(12, 6))
plt.scatter(y_test, y_pred, color='blue')
# Add labels and title
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Polynomial Regression Results (Degree = ' + str(best_degree) +
          ', R^2 = ' + str(round(best_score, 2)) + ')')
# Add a diagonal line for reference
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
# Display the plot
plt.show()
[Figure: Polynomial Regression Results (Degree = 4, R^2 = 0.76): scatter of actual vs. predicted values with a diagonal reference line]
Conclusion:
In this polynomial regression tutorial, we explored how to use polynomial regression to model a non-linear relationship between the features and the target, specifically the median house value in California. We started by loading and cleaning the dataset, and then visualized the relationships between the variables using scatter plots. Next, we split the data into training and testing sets and used polynomial regression to fit a model to the training data. We gradually increased the degree of the polynomial until we achieved a good balance between bias and variance. We then used the model to make predictions on the test data and evaluated its performance using the mean squared error and R-squared score.