Multiple linear regression
In [1]: 1 # import pandas library
2 import pandas as pd
In [2]: 1 #import data
2 data = pd.read_excel('C:/Users/kriti/OneDrive/Desktop/\
3 machine Learning/experiments/TRUCKING.xlsx')
4 data
Out[2]:
Driving Assignmnet miles_travelled n_of_deliveries travel_time
0 1 100 4 9.3
1 2 50 3 4.8
2 3 100 4 8.9
3 4 100 2 6.5
4 5 50 2 4.2
5 6 80 2 6.2
6 7 75 3 7.4
7 8 65 4 6.0
8 9 90 3 7.6
9 10 90 2 6.1
In [3]: 1 # plot a graph between any input attribute and output label
2 import matplotlib.pyplot as plt
3 plt.scatter(data['miles_travelled'], data['travel_time'])
4 plt.ylabel = ('travel time')
5 plt.title('trend between indenpendent and dependent variable')
6 plt.show()
In [4]: 1 #Extract the input features in X and the output label in Y
2 X = data[['miles_travelled', 'n_of_deliveries']]
3 Y = data['travel_time']
4 print(X)
5 print(Y)
miles_travelled n_of_deliveries
0 100 4
1 50 3
2 100 4
3 100 2
4 50 2
5 80 2
6 75 3
7 65 4
8 90 3
9 90 2
0 9.3
1 4.8
2 8.9
3 6.5
4 4.2
5 6.2
6 7.4
7 6.0
8 7.6
9 6.1
Name: travel_time, dtype: float64
What is sklearn Scikit-learn (Sklearn) is the most useful and robust library for machine
learning in Python. It provides a selection of efficient tools for machine learning and
statistical modeling including classification, regression, clustering and dimensionality
reduction via a consistence interface in Python.
In [5]: 1 # import LinearRegression class from sklearn.linear_model
2 from sklearn.linear_model import LinearRegression
3 # import train_test_split function from sklearn.model_selection
4 from sklearn.model_selection import train_test_split
5 X_train, X_test, Y_train, Y_test = \
6 train_test_split( X, Y, test_size=0.2, random_state=45)
7 #random_state controls the shuffling applied to the data
8 #before applying the split.
In [6]: 1 X_train, X_test, Y_train, Y_test
Out[6]: ( miles_travelled n_of_deliveries
1 50 3
8 90 3
9 90 2
6 75 3
5 80 2
7 65 4
0 100 4
3 100 2,
miles_travelled n_of_deliveries
2 100 4
4 50 2,
1 4.8
8 7.6
9 6.1
6 7.4
5 6.2
7 6.0
0 9.3
3 6.5
Name: travel_time, dtype: float64,
2 8.9
4 4.2
Name: travel_time, dtype: float64)
In [7]: 1 X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
Out[7]: ((8, 2), (2, 2), (8,), (2,))
Create a model and fit it
The next step is to create a linear regression model and fit it using the existing data. With
.fit(), you calculate the optimal values of the weights 𝑏₀ and 𝑏₁, using the existing input and
output (x and y) as the arguments. In other words, .fit() fits the model.
Let’s create an instance of the class LinearRegression, which will represent the regression
model:
In [8]: 1 #Create the model by creating the object of class LinearRegression
2 regr = LinearRegression()
3 #use the object to fit the model on the data
4 #First, you need to call .fit() on model:
5 regr.fit(X_train,Y_train)
Out[8]: LinearRegression()
In [9]: 1 # find intercept of the linear equation
2 regr.intercept_
Out[9]: -1.3154407102092618
In [10]: 1 # find slope for 2 independent variables
2 regr.coef_
Out[10]: array([0.06449588, 0.97831325])
In [11]: 1 # Do the prediction for test data set
2 y_pred= regr.predict(X_test)
3 y_pred
Out[11]: array([9.04740013, 3.86597971])
In [12]: 1 # create a dataframe for actual and predicted values
2 d = pd.DataFrame({"Actual value": Y_test, "predicted values": y_pred})
3 d
Out[12]:
Actual value predicted values
2 8.9 9.04740
4 4.2 3.86598
Prediction on a particular data points suppose miles_travelled= 200 n_of_deliveries=5
In [13]: 1 predicted = regr.predict([[200, 5]])
2 print(predicted)
[16.4753012]
The sklearn. metrics module implements several loss, score, and utility functions to measure
classification performance. Some metrics might require probability estimates of the positive
class, confidence values, or binary decisions values.
In [14]: 1 # calculate mean squared error
2 from sklearn.metrics import mean_squared_error
3 mean_squared_error(Y_test, y_pred)
Out[14]: 0.06664817632509885
R sqaured-coefficient of determination--goodness of fit R-squared (R2) is a statistical
measure that represents the proportion of the variance for a dependent variable that's
explained by an independent variable
In [15]: 1 # Adjusted R squared -coefficient of determination--goodness of fit
2 from sklearn.metrics import r2_score
3 r2_score(Y_test, y_pred)
Out[15]: 0.9879315208103036
In [ ]: 1