Linear Regression

The document outlines the process of performing linear regression on a dataset of years of experience and corresponding salaries. It covers data collection, preprocessing, model building, testing, and evaluation, using libraries such as pandas and scikit-learn. The notebook reports an R² score of approximately 0.986 on the test set, and the fitted model can be used to predict salaries for new inputs.

Uploaded by

Harshitha Kv
Linear Regression

Steps
Data collection
Data wrangling / pre-processing
Model building
Testing the model
Evaluate the model performance
Prediction with random input

In [1]:

# required libraries
import numpy as np
import pandas as pd
In [2]:

# Load the dataset into pandas dataframe


data = pd.read_csv('Salary_Data.csv')
data

Out[2]:

    YearsExperience    Salary
0               1.1   39343.0
1               1.3   46205.0
2               1.5   37731.0
3               2.0   43525.0
4               2.2   39891.0
5               2.9   56642.0
6               3.0   60150.0
7               3.2   54445.0
8               3.2   64445.0
9               3.7   57189.0
10              3.9   63218.0
11              4.0   55794.0
12              4.0   56957.0
13              4.1   57081.0
14              4.5   61111.0
15              4.9   67938.0
16              5.1   66029.0
17              5.3   83088.0
18              5.9   81363.0
19              6.0   93940.0
20              6.8   91738.0
21              7.1   98273.0
22              7.9  101302.0
23              8.2  113812.0
24              8.7  109431.0
25              9.0  105582.0
26              9.5  116969.0
27              9.6  112635.0
28             10.3  122391.0
29             10.5  121872.0
In [3]:

data.head()

Out[3]:

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0

In [4]:

data.head(10)

Out[4]:

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0
5              2.9  56642.0
6              3.0  60150.0
7              3.2  54445.0
8              3.2  64445.0
9              3.7  57189.0

In [5]:

data.tail()

Out[5]:

    YearsExperience    Salary
25              9.0  105582.0
26              9.5  116969.0
27              9.6  112635.0
28             10.3  122391.0
29             10.5  121872.0
In [6]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 YearsExperience 30 non-null float64
1 Salary 30 non-null float64
dtypes: float64(2)
memory usage: 608.0 bytes

In [7]:

# YearsExperience (x) -> independent variable
# Salary (y) -> dependent variable

In [8]:

data.shape

Out[8]:

(30, 2)

In [9]:

data.describe()

Out[9]:

       YearsExperience         Salary
count        30.000000      30.000000
mean          5.313333   76003.000000
std           2.837888   27414.429785
min           1.100000   37731.000000
25%           3.200000   56720.750000
50%           4.700000   65237.000000
75%           7.700000  100544.750000
max          10.500000  122391.000000

Data Pre-processing

Step 1: Handle Missing Data

In [10]:

data.isnull().any()

Out[10]:

YearsExperience False
Salary False
dtype: bool
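
The check above finds no missing values, so nothing needs to be fixed in this dataset. For reference, here is a minimal sketch of two common options when values are missing: dropping incomplete rows or filling gaps with the column mean. The `toy` frame and its NaN entries are made up for illustration and are not part of Salary_Data.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps (not the real dataset, which has none).
toy = pd.DataFrame({
    "YearsExperience": [1.1, 2.0, np.nan, 4.0],
    "Salary": [39343.0, np.nan, 56642.0, 55794.0],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: replace each missing value with the mean of its column.
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Which option is appropriate depends on how much data is missing and why; with only 30 rows, dropping rows would be costly.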
Step 2: Convert text columns, if any, to numeric columns

This is not required here because the dataset has no non-numeric columns.
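
For a dataset that does contain text columns, the conversion could look like the sketch below. The `toy` frame, the `Department` column, and the string-typed `YearsExperience` values are hypothetical and only serve to illustrate the step.

```python
import pandas as pd

# Hypothetical frame: numbers stored as text plus a categorical column.
toy = pd.DataFrame({
    "YearsExperience": ["1.1", "2.0", "3.2"],
    "Department": ["Sales", "IT", "Sales"],
    "Salary": [39343.0, 43525.0, 54445.0],
})

# List the columns that are not numeric yet.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Numbers stored as strings can be parsed directly.
converted = toy.copy()
converted["YearsExperience"] = pd.to_numeric(converted["YearsExperience"])

# Categories have no natural order, so one-hot encode them.
encoded = pd.get_dummies(converted, columns=["Department"])
print(encoded.dtypes)
```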

Step 3: Perform Data Visualization

In [11]:

import matplotlib.pyplot as plt

In [12]:

plt.scatter(data.YearsExperience, data.Salary)

Out[12]:

<matplotlib.collections.PathCollection at 0x23cae99ddf0>

Step 4: Split the data into dependent and independent variables


In [13]:

x = data.iloc[:,:1]
x

Out[13]:

    YearsExperience
0               1.1
1               1.3
2               1.5
3               2.0
4               2.2
5               2.9
6               3.0
7               3.2
8               3.2
9               3.7
10              3.9
11              4.0
12              4.0
13              4.1
14              4.5
15              4.9
16              5.1
17              5.3
18              5.9
19              6.0
20              6.8
21              7.1
22              7.9
23              8.2
24              8.7
25              9.0
26              9.5
27              9.6
28             10.3
29             10.5
In [14]:

y = data.iloc[:,1:2]
y

Out[14]:

      Salary
0    39343.0
1    46205.0
2    37731.0
3    43525.0
4    39891.0
5    56642.0
6    60150.0
7    54445.0
8    64445.0
9    57189.0
10   63218.0
11   55794.0
12   56957.0
13   57081.0
14   61111.0
15   67938.0
16   66029.0
17   83088.0
18   81363.0
19   93940.0
20   91738.0
21   98273.0
22  101302.0
23  113812.0
24  109431.0
25  105582.0
26  116969.0
27  112635.0
28  122391.0
29  121872.0

In [15]:

type(x)

Out[15]:

pandas.core.frame.DataFrame
In [16]:

type(y)

Out[16]:

pandas.core.frame.DataFrame

In [17]:

np.shape(x)

Out[17]:

(30, 1)

In [18]:

np.shape(y)

Out[18]:

(30, 1)

Step 5: Splitting the data into training and testing dataset

In [19]:

# scikit-learn provides the train_test_split function for this

In [20]:

from sklearn.model_selection import train_test_split

In [21]:

x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 0)

In [22]:

print(x_train.shape) # Training input

(24, 1)

In [23]:

print(x_test.shape) # testing input

(6, 1)

In [24]:

print(y_train.shape) # training output

(24, 1)

In [25]:

print(y_test.shape) # testing output

(6, 1)
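
The shapes above follow from `test_size = 0.2`: 20% of 30 rows is 6 test rows, which leaves 24 for training. The sketch below reproduces this on a stand-in array (an assumption, so that it runs without Salary_Data.csv) and shows that fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same number of rows as the salary dataset.
x = np.arange(30).reshape(-1, 1)
y = 2 * x.ravel() + 1

# Same settings as the notebook cell above.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(x_train.shape, x_test.shape)

# Repeating the call with the same random_state gives the same rows.
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=0
)
print(np.array_equal(x_test, x_test_again))
```

Without `random_state`, each run would shuffle differently, so the fitted coefficients and the test score would vary from run to run.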
Model Building - Linear Regression

In [26]:

from sklearn.linear_model import LinearRegression

Perform linear regression by fitting the model to the training data

In [27]:

lr = LinearRegression()
lr.fit(x_train, y_train)

Out[27]:

LinearRegression()
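
With a single input feature, fitting an ordinary least-squares line amounts to computing a slope and an intercept from the training data. The sketch below shows the textbook closed-form formulas in plain Python; the function name `fit_line` and the four sample points are made up for illustration, and this is not a description of scikit-learn's internal solver.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations of x and y, and sum of squared deviations of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Illustrative points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

On the fitted scikit-learn model, the corresponding values are available as `lr.coef_` and `lr.intercept_`.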

Perform Testing

In [28]:

y_pred = lr.predict(x_test)
y_pred

Out[28]:

array([[ 40748.96184072],
[122699.62295594],
[ 64961.65717022],
[ 63099.14214487],
[115249.56285456],
[107799.50275317]])

Compare predicted values to actual values

In [29]:

y_pred # predicted y

Out[29]:

array([[ 40748.96184072],
[122699.62295594],
[ 64961.65717022],
[ 63099.14214487],
[115249.56285456],
[107799.50275317]])
In [30]:

y_test # actual y

Out[30]:

      Salary
2    37731.0
28  122391.0
13   57081.0
10   63218.0
26  116969.0
24  109431.0
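
Reading the two outputs side by side is easier with the differences computed explicitly. The sketch below copies the six predicted and actual salaries from the outputs above into plain lists (the variable names are illustrative) and prints each residual together with the mean absolute error.

```python
# Values copied from the printed predictions and test targets above.
y_pred = [40748.96184072, 122699.62295594, 64961.65717022,
          63099.14214487, 115249.56285456, 107799.50275317]
y_actual = [37731.0, 122391.0, 57081.0, 63218.0, 116969.0, 109431.0]

# Residual = predicted minus actual, one per test row.
residuals = [p - a for p, a in zip(y_pred, y_actual)]

print(f"{'actual':>10} {'predicted':>12} {'residual':>10}")
for a, p, r in zip(y_actual, y_pred, residuals):
    print(f"{a:>10.1f} {p:>12.2f} {r:>+10.2f}")

# Mean absolute error: the average size of the residuals.
mae = sum(abs(r) for r in residuals) / len(residuals)
print(f"mean absolute error: {mae:.2f}")
```

The largest miss is on the row with an actual salary of 57081.0, where the model overestimates by several thousand.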

Model Evaluation

In [31]:

# import the r2_score metric


from sklearn.metrics import r2_score

In [32]:

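# R^2 (coefficient of determination) on the test set.
# r2_score expects (y_true, y_pred); the score is not symmetric, so the order matters.
# The value shown below was recorded with the arguments swapped; in this order it is about 0.988.
r2 = r2_score(y_test, y_pred)
r2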

Out[32]:

0.986482673117654
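
R² compares the squared prediction errors with the spread of the true values around their mean, so it matters which argument is treated as the truth. The sketch below implements the standard formula in plain Python (the helper name `r2` is illustrative) and applies it to the six predicted and actual salaries printed earlier, in both argument orders.

```python
def r2(y_true, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - h) ** 2 for t, h in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Values copied from the printed predictions and test targets.
y_pred = [40748.96184072, 122699.62295594, 64961.65717022,
          63099.14214487, 115249.56285456, 107799.50275317]
y_actual = [37731.0, 122391.0, 57081.0, 63218.0, 116969.0, 109431.0]

# Actual values as the reference: the conventional definition.
score_correct = r2(y_actual, y_pred)

# Predictions as the reference: this matches the 0.986... output above.
score_swapped = r2(y_pred, y_actual)

print(score_correct, score_swapped)
```

With the actual salaries as the reference the score is about 0.988; with the predictions as the reference it is about 0.986. Both are high here, but the two orders can differ far more on a poorly fitting model.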

Let us predict the value of y for some random input x

In [33]:

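# Pass a DataFrame with the training column name so the input matches the fitted feature.
# 15 years lies outside the 1.1 to 10.5 range of the data, so this is an extrapolation.
sal = lr.predict(pd.DataFrame({'YearsExperience': [15]}))
sal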

Out[33]:

array([[166468.72605157]])
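
A prediction from this model is simply intercept + slope × years. The notebook does not print the coefficients, but they can be recovered from any two predictions, because all predictions lie on the fitted line. The sketch below assumes the first two predicted values belong to test rows 2 and 28 (the order shown in the `y_test` output), whose YearsExperience values in the dataset are 1.5 and 10.3.

```python
# Two points on the fitted line: (YearsExperience, predicted salary).
x1, y1 = 1.5, 40748.96184072     # test row 2
x2, y2 = 10.3, 122699.62295594   # test row 28

# Slope and intercept of the line through those two points.
slope = (y2 - y1) / (x2 - x1)
intercept = y1 - slope * x1
print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")

# Reproduce the prediction for 15 years of experience.
pred_15 = intercept + slope * 15
print(f"predicted salary at 15 years: {pred_15:.2f}")
```

The slope is the estimated salary increase per additional year of experience. Since 15 years is well beyond the largest value in the data (10.5), the prediction assumes the linear trend continues unchanged.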

Draw the best-fit line


In [34]:

plt.scatter(x_train, y_train)
plt.plot(x_train, lr.predict(x_train),'r')

Out[34]:

[<matplotlib.lines.Line2D at 0x23cb6ab55b0>]
