Notebook
January 1, 2025
Linear Regression
[19]: import numpy as np
import matplotlib.pyplot as plt
# Data points
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([5, 8, 9, 11, 20, 16, 17, 18, 21, 26])
# Number of observations
n = len(x)
# Mean values of x and y
mean_x = np.mean(x)
mean_y = np.mean(y)
# Calculate coefficients b1 and b0
numerator = np.sum(x * y) - (n * mean_x * mean_y)
denominator = np.sum(x**2) - (n * mean_x**2)
b1 = numerator / denominator
b0 = mean_y - b1 * mean_x
print(f"Estimated coefficients are:")
print(f"b0 = {b0}")
print(f"b1 = {b1}")
# Scatter plot
plt.scatter(x, y, color="b", label='Data', marker="o", s=100)
# Regression line
y_pred = b0 + b1 * x
plt.plot(x, y_pred, color='red', label='Regression Line')
plt.xlabel('x')
plt.ylabel('y')
plt.title("Simple Linear Regression", fontsize=30, color="magenta")
plt.legend()
plt.show()
Estimated coefficients are:
b0 = 3.799999999999999
b1 = 2.0545454545454547
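As a quick sanity check, the same coefficients can be recovered with NumPy's built-in least-squares fit. This is a minimal sketch, assuming x and y from the cell above are still in scope:
[ ]: # Cross-check the hand-computed coefficients with np.polyfit, which
# returns the coefficients highest degree first for a degree-1 fit.
slope, intercept = np.polyfit(x, y, 1)
print(f"polyfit: b1 = {slope}, b0 = {intercept}")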
Multiple Linear Regression
[15]: import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset from a CSV file
file_path = r"C:\Users\P. Shrenik Kumar\Downloads\Housing.csv"  # Replace with your CSV file path
data = pd.read_csv(file_path)
print(data)
# Display the first few rows of the dataset
print(data.head())
# The dependent variable (target) is the 'price' column, and the independent
# variables (features) are 'area', 'bedrooms', and 'bathrooms'.
# Define the independent variables (features) and the dependent variable (target)
X = data[['area', 'bedrooms', 'bathrooms']]
y = data['price']
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the linear regression model
model = LinearRegression()
# Train the model
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Output the model evaluation metrics
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
# Plot Actual vs Predicted
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue')
plt.axline((0, 0), slope=1, color='red')
        price  area  bedrooms  bathrooms  stories mainroad guestroom basement  \
0    13300000  7420         4          2        3      yes        no       no
1    12250000  8960         4          4        4      yes        no       no
2    12250000  9960         3          2        2      yes        no      yes
3    12215000  7500         4          2        2      yes        no      yes
4    11410000  7420         4          1        2      yes       yes      yes
..        ...   ...       ...        ...      ...      ...       ...      ...
540   1820000  3000         2          1        1      yes        no      yes
541   1767150  2400         3          1        1       no        no       no
542   1750000  3620         2          1        1      yes        no       no
543   1750000  2910         3          1        1       no        no       no
544   1750000  3850         3          1        2      yes        no       no

    hotwaterheating airconditioning  parking prefarea furnishingstatus
0                no             yes        2      yes        furnished
1                no             yes        3       no        furnished
2                no              no        2      yes   semi-furnished
3                no             yes        3      yes        furnished
4                no             yes        2       no        furnished
..              ...             ...      ...      ...              ...
540              no              no        2       no      unfurnished
541              no              no        0       no   semi-furnished
542              no              no        0       no      unfurnished
543              no              no        0       no        furnished
544              no              no        0       no      unfurnished

[545 rows x 13 columns]

        price  area  bedrooms  bathrooms  stories mainroad guestroom basement  \
0    13300000  7420         4          2        3      yes        no       no
1    12250000  8960         4          4        4      yes        no       no
2    12250000  9960         3          2        2      yes        no      yes
3    12215000  7500         4          2        2      yes        no      yes
4    11410000  7420         4          1        2      yes       yes      yes

    hotwaterheating airconditioning  parking prefarea furnishingstatus
0                no             yes        2      yes        furnished
1                no             yes        3       no        furnished
2                no              no        2      yes   semi-furnished
3                no             yes        3      yes        furnished
4                no             yes        2       no        furnished
Mean Squared Error: 2750040479309.0513
R-squared: 0.45592991188724474
[15]: <matplotlib.lines.AxLine at 0x1cbc6c36b40>
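To see what the model actually learned, the fitted coefficients and intercept can be read off directly. A minimal sketch, assuming model and X from the cell above are still in scope:
[ ]: # Each coefficient is the change in predicted price per unit change in
# that feature, holding the other features fixed.
for name, coef in zip(X.columns, model.coef_):
    print(f"{name}: {coef}")
print(f"intercept: {model.intercept_}")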
Decision Tree Classifier
[6]: # Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn import tree
import matplotlib.pyplot as plt
# Load the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Labels
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the Decision Tree classifier
clf = DecisionTreeClassifier()
# Train the classifier
clf.fit(X_train, y_train)
# Predict on the test set
y_pred = clf.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(accuracy)
print(cm)
print(class_report)
# Visualize the Decision Tree
plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names)
plt.title("Decision Tree for Iris Dataset", color='red',size=42)
plt.show()
1.0
[[19 0 0]
[ 0 13 0]
[ 0 0 13]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
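The fitted tree also reports how much each feature contributed to its splits. A minimal sketch, assuming clf and iris from the cell above are still in scope:
[ ]: # Feature importances sum to 1 across all features; larger values mean
# the feature was used for more informative splits.
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")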
KNN
[8]: # Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Labels
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the KNN classifier
k = 3 # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
# Predict the labels for the test set
y_pred = knn.predict(X_test)
# Evaluate the classifier
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=data.target_names)
print(accuracy)
print(cm)
print(class_report)
1.0
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
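Accuracy depends on the choice of k, so it is worth refitting for a few values. A minimal sketch, assuming the train/test split from the cell above is still in scope:
[ ]: # Refit KNN for several neighborhood sizes and compare test accuracy.
for k in [1, 3, 5, 7, 9]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    print(f"k={k}: accuracy={acc:.3f}")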
Logistic Regression
[9]: # Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Labels
# Split the dataset into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Logistic Regression model
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)
# Predict the labels for the test set
y_pred = log_reg.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred, target_names=data.target_names)
print(accuracy)
print(cm)
print(class_report)
1.0
[[10 0 0]
[ 0 9 0]
[ 0 0 11]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
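Beyond hard labels, logistic regression also provides class probabilities via predict_proba. A minimal sketch, assuming log_reg, X_test, and data from the cell above are still in scope:
[ ]: # Probability of each class for the first five test samples; columns
# follow the order of data.target_names.
print(data.target_names)
print(log_reg.predict_proba(X_test[:5]).round(3))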
K-Means
[10]: # Import required libraries
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load the Iris dataset
X = load_iris().data
# Create and train the K-Means model
kmeans = KMeans(n_clusters=3, random_state=42).fit(X)
# Plot the clusters (using the first two features)
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.title("K-Means Clustering on Iris Dataset")
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.show()
Let's break down the statement plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_,
cmap='viridis') in detail:
1. plt.scatter
This is a function in the matplotlib.pyplot module that creates a scatter plot. A scatter plot
displays points in a 2D space, where each point represents a data sample, and its position is
determined by two numerical features (x and y).
2. X[:, 0]
• X is the feature matrix (data) loaded from the Iris dataset.
• X[:, 0] selects all rows (:) of the first column (0) from X. This column corresponds to the
feature “sepal length (cm)” in the Iris dataset.
• This becomes the x-coordinate for each data point in the scatter plot.
3. X[:, 1]
• Similar to X[:, 0], this selects the second column (1) of X, which corresponds to the feature
“sepal width (cm)” in the Iris dataset.
• This becomes the y-coordinate for each data point in the scatter plot.
4. c=kmeans.labels_
• kmeans.labels_ contains the cluster labels assigned to each data point by the K-Means
model.
– For example, if there are 3 clusters, the labels might look like [0, 1, 2, 1, 0, ...].
– These labels are used to group data points by their cluster assignment.
• The c parameter assigns a different color to each cluster based on these labels.
5. cmap='viridis'
• cmap stands for “color map,” which defines the set of colors used for the scatter plot.
• 'viridis' is a popular color map that provides a visually appealing gradient of colors,
transitioning from dark blue to bright yellow.
• Each cluster label (e.g., 0, 1, 2) is mapped to a specific color within this gradient.
6. Putting It All Together
This line plots a scatter plot where:
• The x-coordinates are the sepal lengths (X[:, 0]).
• The y-coordinates are the sepal widths (X[:, 1]).
• The points are colored based on the clusters (kmeans.labels_), with colors chosen from the viridis color map.
7. Example in Action
If the Iris dataset contains 150 samples:
• X[:, 0] and X[:, 1] provide 150 x and y coordinates.
• kmeans.labels_ assigns one of three labels (e.g., 0, 1, 2) to each sample.
• cmap='viridis' ensures each label gets a distinct color.
When executed, this produces a visual representation of the clusters found by K-Means, making it
easy to observe patterns or groupings in the data.
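A natural extension of the plot above is to overlay the learned cluster centers, which K-Means exposes as cluster_centers_. A minimal sketch, assuming kmeans and X from the K-Means cell are still in scope:
[ ]: # Redraw the scatter plot and mark each cluster center on the same two
# features (sepal length and sepal width).
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='X', s=200, label='Cluster centers')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Sepal Width (cm)')
plt.legend()
plt.show()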