Machine Learning Model Building

I’ve written a paper on machine learning

Uploaded by

Jessica Hombal

Machine Learning Models:

Supervised Learning: Classification

Step1: read data

Step2: Inspect the data, then impute missing values accordingly.
columns = test_data.columns.to_list()
for col in columns:
    # Percentage of missing values in each column of the training data
    print("Missing values % of", col, 100 * train_data[col].isna().sum() / train_data.shape[0])

def missing_val_treatment(df):
    # Iterate over a copy of the column list, because columns may be dropped
    for col in df.columns.to_list():
        miss_perc = df[col].isna().sum() / df.shape[0]
        if miss_perc == 0:
            continue
        if miss_perc >= 0.7:
            # Too sparse to impute: drop the column
            df.drop(col, axis = 1, inplace = True)
        elif df[col].dtype == 'O':
            # Categorical (object) column: fill with the most frequent value
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            # Numeric column: fill with the median
            df[col] = df[col].fillna(df[col].median())
    return df
To drop the rows that contain NA in a particular column:
df.dropna(subset = [column_name], inplace = True)

Step3: Check the numeric columns for outliers and treat any you find.
# num_cols is the list of numeric column names
train_data.boxplot(num_cols)
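One common way to treat the outliers a boxplot reveals is to clip each column to the whisker range. The sketch below is only an illustration: the helper name clip_outliers_iqr, the multiplier k = 1.5 and the toy "Fare" values are assumptions, not part of the original notes.

```python
import pandas as pd

def clip_outliers_iqr(df, cols, k=1.5):
    """Return a copy of df with each column in cols clipped to
    [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot whisker range."""
    out = df.copy()
    for col in cols:
        q1 = out[col].quantile(0.25)
        q3 = out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - k * iqr, upper=q3 + k * iqr)
    return out

# Toy data (hypothetical): 500 lies far outside the bulk of the values
toy = pd.DataFrame({"Fare": [7.0, 8.0, 9.0, 10.0, 11.0, 500.0]})
clipped = clip_outliers_iqr(toy, ["Fare"])
print(clipped["Fare"].max())
```

Clipping keeps every row; dropping the outlying rows instead is a reasonable alternative when they are clearly data-entry errors.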
Step4: Drop irrelevant columns such as Name or Address, which are mostly free
text and nearly unique for every row.
train_data.drop(['Name','Ticket'], axis = 1, inplace = True)
test_data.drop(['Name','Ticket'], axis = 1, inplace = True)

Step5: Look for date-time columns. Here’s how you can deal with them:
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'],
                                    format = '%Y-%m-%dT%H:%M:%SZ', errors = 'coerce')
Filtering with respect to date columns:
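Once the column is parsed, filtering on it is an ordinary boolean mask. This is a minimal sketch, assuming a toy frame and a hypothetical cutoff date of 1 May 2016; neither comes from the original notes.

```python
import pandas as pd

# Toy data (hypothetical); the last value cannot be parsed
df = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-05-03T08:17:02Z", "not a date"],
})
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"],
                                    format="%Y-%m-%dT%H:%M:%SZ", errors="coerce")

# Keep only the rows scheduled on or after the cutoff
cutoff = pd.Timestamp("2016-05-01")
recent = df[df["ScheduledDay"] >= cutoff]

# Unparseable dates became NaT and never satisfy the comparison
print(len(recent))
```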

Date Feature Engineering:

df['ScheduledDay_year'] = df['ScheduledDay'].dt.year
df['ScheduledDay_month'] = df['ScheduledDay'].dt.month
# .dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
df['ScheduledDay_week'] = df['ScheduledDay'].dt.isocalendar().week
df['ScheduledDay_day'] = df['ScheduledDay'].dt.day
df['ScheduledDay_hour'] = df['ScheduledDay'].dt.hour
df['ScheduledDay_minute'] = df['ScheduledDay'].dt.minute
df['ScheduledDay_dayofweek'] = df['ScheduledDay'].dt.dayofweek

Step6: One-Hot Encoding

train_enc = pd.get_dummies(train_data, drop_first = True)
test_enc = pd.get_dummies(test_data, drop_first = True)
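Encoding the train and test frames separately can produce different columns when a category appears in only one of them, and drop_first may then drop a different category in each frame. One possible workaround, sketched here on toy feature frames (the data is hypothetical and the target column is assumed to have been set aside first), is to encode both frames together and split them again.

```python
import pandas as pd

# Toy feature frames (hypothetical): the test data contains only one category
train_data = pd.DataFrame({"Age": [22, 38, 26], "Embarked": ["S", "C", "Q"]})
test_data = pd.DataFrame({"Age": [30, 41], "Embarked": ["S", "S"]})

# Stack the frames with a key so they can be separated after encoding
combined = pd.concat([train_data, test_data], keys=["train", "test"])
encoded = pd.get_dummies(combined, drop_first=True)

train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]

# Both frames now share exactly the same dummy columns
print(list(train_enc.columns))
print(list(test_enc.columns))
```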

Step7: Data Normalization (Standardization)

from sklearn.preprocessing import StandardScaler
from pandas.api.types import is_numeric_dtype
for col in num_cols:
    if is_numeric_dtype(test_data[col]):
        sc = StandardScaler()
        # Fit on the training data only, then apply the same scaling to the test data
        train_enc[col] = sc.fit_transform(train_enc[[col]]).ravel()
        test_enc[col] = sc.transform(test_enc[[col]]).ravel()
train_enc.hist()

Step8: Check the target variable for class imbalance and, if the classes are
imbalanced, apply a resampling technique such as SMOTE.
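A quick way to check for imbalance is to look at the share of each class in the target. The sketch below uses a toy target and an arbitrary 30% threshold, both assumptions made only for illustration. SMOTE itself comes from the separate imbalanced-learn package (imblearn.over_sampling.SMOTE) and is normally fitted on the training split only, so it is not shown here.

```python
import pandas as pd

# Toy target (hypothetical): 8 negatives and 2 positives
y = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="Survived")

# Share of each class in the target
class_share = y.value_counts(normalize=True)
print(class_share)

# The 0.3 threshold is an illustrative assumption, not a fixed rule
minority_share = class_share.min()
needs_resampling = minority_share < 0.3
print("Resampling suggested:", needs_resampling)
```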

Step9: Apply ML models to the cleaned data:

from sklearn.model_selection import train_test_split
train_data, val_data = train_test_split(train_enc, test_size = 0.1, random_state = 0)
print(train_data.shape)
print(val_data.shape)
X_train = train_data.drop('Survived', axis = 1)
X_test = val_data.drop('Survived', axis = 1)
y_train = train_data['Survived']
y_test = val_data['Survived']

OR
# Split features and target directly from the encoded training data
X = train_enc.drop('Survived', axis = 1)
y = train_enc['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
model = lr.fit(X_train, y_train)

from sklearn.metrics import accuracy_score

y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print("Training accuracy:", train_accuracy)

y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Testing accuracy:", test_accuracy)

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

model2 = RandomForestClassifier(n_estimators = 80, oob_score = True, random_state = 0)
model2 = model2.fit(X_train, y_train)
print(X_train.columns)
model2.feature_importances_
model2.oob_score_

# Try several forest sizes and keep the one with the best out-of-bag score
oobs = []
w_values = list(range(20,300,10))
for w in w_values:
    model2 = RandomForestClassifier(n_estimators = w, oob_score = True, random_state = 0)
    model2.fit(X_train, y_train)
    oob = model2.oob_score_
    oobs.append(oob)
max_oob_index = oobs.index(max(oobs))
best_w = w_values[max_oob_index]
best_w

model2 = RandomForestClassifier(n_estimators = 280, oob_score = True, random_state = 0)
model2.fit(X_train, y_train)
model2.oob_score_

model3 = AdaBoostClassifier(n_estimators = 100, random_state = 0)

model3.fit(X_train, y_train)
model3.score(X_test, y_test)

y_pred2 = model2.predict(test_enc)

result = pd.DataFrame()
result['PassengerId'] = test_enc['PassengerId']
result['Survived'] = y_pred2
result

result.to_csv("gender_submission.csv",index=False)

Supervised Learning: Regression

Import required modules:

from sklearn.model_selection import train_test_split

X = train_data.drop("MedHouseVal", axis = 1)
y = train_data['MedHouseVal']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

from sklearn.linear_model import LinearRegression

import statsmodels.api as sma

X_train = sma.add_constant(X_train)
X_test = sma.add_constant(X_test)

model = sma.OLS(y_train, X_train)

model = model.fit()
model.summary()
y_pred = model.predict(X_test)

import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:",rmse)
r2 = r2_score(y_test, y_pred)
print("R2:",r2)

test_data = sma.add_constant(test_data)
y_pred2 = model.predict(test_data)
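
The notes import LinearRegression from scikit-learn but fit the model with statsmodels. For comparison, here is a minimal sketch of the same fit-and-evaluate workflow with LinearRegression, which fits the intercept itself and so needs no add_constant step. The synthetic data and its coefficients are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): y = 3*x1 - 2*x2 + 5 plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit_intercept=True is the default, so no constant column is added
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("RMSE:", rmse)
print("R2:", r2)
```

statsmodels remains the better choice when the full summary table (standard errors, p-values) is needed; scikit-learn is convenient when the model is one step in a larger pipeline.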

Unsupervised Learning: K-Means Clustering

Following are the steps to perform K-Means:
Step1: Perform EDA
Step2: Check for random
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 3, init = 'k-means++', random_state = 0)
kmeans = kmeans.fit(scaled_data)
wcss = []
for i in range(1, 30):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(scaled_data)
    # inertia_ is the within-cluster sum of squares (WCSS)
    wcss.append(kmeans.inertia_)
import matplotlib.pyplot as plt
plt.plot(range(1,30), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

Step3: Identify the point (number of clusters) after which the WCSS curve flattens
out. If no clear elbow is identifiable (the elbow method fails), use the
silhouette score to choose the number of clusters.
Take the range of cluster counts over which you are unsure where the curve
flattens. Let’s say you’re undecided between k=3 and k=13.
from sklearn.metrics import silhouette_score
# range(3, 14) covers k = 3 through k = 13 inclusive
for i in range(3, 14):
    labels = KMeans(n_clusters = i, n_init = 10, random_state = 0).fit(scaled_data).labels_
    print("SC for k = " + str(i) + " is " + str(silhouette_score(scaled_data, labels)))

Now identify at what value of k the silhouette score is highest.
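
Rather than reading the best k off the printed scores, the scores can be stored and the maximum picked programmatically. This is a minimal sketch: the blob data from make_blobs stands in for scaled_data, and the candidate range of 2 to 6 is an assumption for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled_data: three tight, well-separated blobs
scaled_data, _ = make_blobs(n_samples=300,
                            centers=[[0, 0], [10, 10], [-10, 10]],
                            cluster_std=0.5, random_state=0)

# Silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10,
                    random_state=0).fit_predict(scaled_data)
    scores[k] = silhouette_score(scaled_data, labels)

# The k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(best_k)
```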

km = KMeans(n_clusters = 3, init = 'k-means++', random_state = 0)

y_means = km.fit_predict(scaled_data)
