Regression

data preprocessing in ml steps

Data preprocessing is a crucial step in Machine Learning (ML) that ensures your data is clean, well-
structured, and suitable for training models. Here are the key steps involved:

1. Data Collection

Gather data from multiple sources (databases, CSV files, APIs, web scraping, etc.).
Ensure the data is relevant to the problem statement.
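
A small, hedged sketch of this step (the file path and API URL are placeholders, not from the original): load a CSV and a JSON API response into pandas and combine them.

python

import pandas as pd
import requests

# Load tabular data from a CSV file (placeholder path).
df_csv = pd.read_csv("data/customers.csv")

# Fetch JSON records from a REST API (placeholder URL) and convert to a DataFrame.
response = requests.get("https://example.com/api/records", timeout=10)
response.raise_for_status()                  # fail loudly on HTTP errors
df_api = pd.DataFrame(response.json())

# Combine sources that share the same columns into one dataset.
df = pd.concat([df_csv, df_api], ignore_index=True)
print(df.shape)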

2. Data Cleaning

Handle missing values:


Remove rows/columns with excessive missing data.
Fill missing values using mean, median, mode, or imputation techniques.
Remove duplicates to avoid bias.
Correct inconsistent or wrongly formatted data (e.g., fixing typos, standardizing units).

3. Data Transformation

Feature Scaling: Normalize (MinMax Scaling) or standardize (Z-score) numerical features.


Encoding Categorical Data: Convert categorical variables into numerical form using:
One-Hot Encoding (for nominal data).
Label Encoding (for ordinal data).
Handling Outliers: Detect outliers using Z-score, IQR method, or visualization and decide whether
to remove or transform them.
Feature Engineering: Create new features or transform existing ones to improve model
performance.
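
A brief, hedged sketch of the scaling and outlier-detection steps above (encoding is covered in detail later in this document; the toy columns are illustrative):

python

import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'Age': [22, 25, 30, 35, 90],
                   'Salary': [20000, 25000, 30000, 40000, 200000]})

# Normalize to [0, 1] (MinMax scaling) and standardize to zero mean / unit variance (Z-score).
numeric = df[['Age', 'Salary']]
df_minmax = pd.DataFrame(MinMaxScaler().fit_transform(numeric), columns=['Age_mm', 'Salary_mm'])
df_zscore = pd.DataFrame(StandardScaler().fit_transform(numeric), columns=['Age_z', 'Salary_z'])

# Detect outliers with the IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Salary'] < q1 - 1.5 * iqr) | (df['Salary'] > q3 + 1.5 * iqr)]
print(outliers)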

4. Data Splitting

Split the dataset into:


Training Set (70-80%) - Used to train the model.
Validation Set (10-15%) - Used for hyperparameter tuning.
Test Set (10-15%) - Used for final evaluation.

5. Feature Selection/Reduction

Remove irrelevant or redundant features to reduce dimensionality.


Use techniques like PCA (Principal Component Analysis) to reduce complexity.
Apply statistical methods like correlation analysis to select the most important features.

6. Handling Imbalanced Data (if applicable)

If dealing with an imbalanced dataset (e.g., fraud detection, rare diseases), use:
Oversampling (SMOTE, ADASYN) to increase minority class instances.
Undersampling to reduce majority class instances.
Class-weight adjustment in models.
7. Data Augmentation (if applicable)

In image/text datasets, augment data by rotating, flipping, or generating synthetic data to improve
model generalization.

8. Final Check & Export

Ensure data is properly formatted.


Save the preprocessed dataset in a suitable format (CSV, pickle, database).
Proceed with model training.

Would you like a code example for any of these steps? 🚀

in detail explain, give corresponding syntax and code for


Data Cleaning
Handle missing values:
Remove rows/columns with excessive missing data.
Fill missing values using mean, median, mode, or imputation techniques.
Remove duplicates to avoid bias.
Correct inconsistent or wrongly formatted data (e.g., fixing typos, standardizing units).

Data Cleaning in Machine Learning (With Python Examples)

Data cleaning is a crucial step in the ML pipeline that ensures data quality and reliability. Below is a
detailed explanation with corresponding syntax and Python code for handling missing values,
removing duplicates, and correcting inconsistencies.

1. Handling Missing Values


Missing values can significantly impact ML models. There are multiple ways to handle them:

a) Remove rows/columns with excessive missing data

If a row or column has too many missing values, it may not be useful for the model. We can:

Drop columns if too many values are missing.


Drop rows if only a few rows contain missing values.

Code Example

python

import pandas as pd

# Sample dataset with missing values


data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', None],
'Age': [25, 30, None, 35, 40],
'Salary': [50000, 54000, 58000, None, None]
}

df = pd.DataFrame(data)

print("Original Data:\n", df)

# Drop rows with any missing value


df_dropped_rows = df.dropna()
print("\nAfter Dropping Rows with Missing Values:\n", df_dropped_rows)

# Drop columns if more than 50% values are missing


df_dropped_columns = df.dropna(thresh=int(len(df) * 0.5), axis=1)  # Keep columns with at least 50% non-null values
print("\nAfter Dropping Columns with More Than 50% Missing Values:\n", df_dropped_columns)

b) Fill missing values using mean, median, mode, or imputation techniques

If we don't want to lose data by dropping rows/columns, we can fill missing values using:

Mean: Best for numerical data with normal distribution.


Median: Works well for skewed distributions.
Mode: Best for categorical data.
Forward/Backward Fill: Uses nearby values to fill gaps.

Code Example

python

# Fill missing values in 'Age' column with mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Fill missing values in 'Salary' column with median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())

# Fill missing values in 'Name' column with mode (most frequent value)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])

print("\nAfter Filling Missing Values:\n", df)

Forward Fill & Backward Fill

python

# Forward fill (propagate previous value)
df = df.ffill()

# Backward fill (propagate next value)
df = df.bfill()
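
Beyond mean/median/mode fills, the "imputation techniques" mentioned above can also be handled with scikit-learn's imputers. A hedged sketch, assuming the DataFrame still contains NaNs in its numeric columns:

python

from sklearn.impute import SimpleImputer, KNNImputer

num_cols = ['Age', 'Salary']

# Statistical imputation: replace NaNs with the column median.
median_imputer = SimpleImputer(strategy='median')
df[num_cols] = median_imputer.fit_transform(df[num_cols])

# Alternative: KNN imputation estimates each missing value from the 2 most similar rows.
# knn_imputer = KNNImputer(n_neighbors=2)
# df[num_cols] = knn_imputer.fit_transform(df[num_cols])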

2. Removing Duplicates
Duplicate records can bias ML models. We remove them using `.drop_duplicates()`.

Code Example

python

# Sample dataset with duplicate rows


data_dup = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David'],
'Age': [25, 30, 35, 25, 40],
'Salary': [50000, 54000, 58000, 50000, 62000]
}

df_dup = pd.DataFrame(data_dup)

print("\nOriginal Data with Duplicates:\n", df_dup)

# Remove duplicate rows


df_no_duplicates = df_dup.drop_duplicates()
print("\nAfter Removing Duplicates:\n", df_no_duplicates)

3. Correcting Inconsistent or Wrongly Formatted Data


Data may have inconsistencies like:

Typos (e.g., "USA", "U.S.A", "usA")


Incorrect capitalization (e.g., "New York" vs. "new york")
Standardizing units (e.g., "kg" vs. "kilograms")
Removing extra spaces

a) Fixing Typos and Standardizing Data

python

# Sample dataset with inconsistent data


data_inconsistent = {
'Country': ['USA', 'U.S.A', 'usA', 'India', 'INDIA', 'ind'],
'City': ['New York', 'Los Angeles', 'San Francisco', 'Delhi', 'Mumbai', 'Bangalore']
}

df_inconsistent = pd.DataFrame(data_inconsistent)

print("\nOriginal Data with Inconsistencies:\n", df_inconsistent)

# Convert to lowercase
df_inconsistent['Country'] = df_inconsistent['Country'].str.lower()

# Replace variations of "USA" with a standard format


df_inconsistent['Country'] = df_inconsistent['Country'].replace({'u.s.a': 'usa', 'ind': 'india'})

print("\nAfter Standardizing Country Names:\n", df_inconsistent)

b) Fixing Case Sensitivity (Lowercasing Strings)

python

df_inconsistent['City'] = df_inconsistent['City'].str.title() # Converts "new york" -> "New York"

c) Removing Extra Spaces

python

df_inconsistent['City'] = df_inconsistent['City'].str.strip()

d) Converting Data Types (e.g., Changing Strings to Integers)

python

df['Age'] = df['Age'].astype(int) # Convert Age to integer

Final Cleaned Data

After applying all cleaning techniques, we get a properly structured dataset that can be used for ML.
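
A quick sanity check (a small sketch, not from the original) before moving on:

python

# Sanity checks after cleaning: no missing values, no duplicates, expected dtypes.
print(df.isna().sum())          # should be all zeros
print(df.duplicated().sum())    # should be 0
print(df.dtypes)                # confirm each column has the intended type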

Summary

| Step | Method | Syntax |
|---|---|---|
| Removing rows/columns with excessive missing values | `dropna()` | `df.dropna()` |
| Filling missing values | Mean, median, mode, ffill, bfill | `df.fillna(df.mean())` |
| Removing duplicates | `drop_duplicates()` | `df.drop_duplicates()` |
| Fixing typos/inconsistencies | `replace()`, `str.lower()`, `str.strip()` | `df['col'].replace({'typo1':'correct'})` |
| Standardizing data | Convert case, remove spaces | `df['col'].str.lower()` |
| Converting data types | `astype()` | `df['col'].astype(int)` |

Would you like more details on any part? 🚀

in detail explain with corresponding syntax and code for


Data Splitting
Split the dataset into:
Training Set (70-80%) - Used to train the model.
Validation Set (10-15%) - Used for hyperparameter tuning.
Test Set (10-15%) - Used for final evaluation.
5. Feature Selection/Reduction
Remove irrelevant or redundant features to reduce dimensionality.
Use techniques like PCA (Principal Component Analysis) to reduce complexity.
Apply statistical methods like correlation analysis to select the most important features.
6. Handling Imbalanced Data (if applicable)
If dealing with an imbalanced dataset (e.g., fraud detection, rare diseases), use:
Oversampling (SMOTE, ADASYN) to increase minority class instances.
Undersampling to reduce majority class instances.
Class-weight adjustment in models.
7. Data Augmentation (if applicable)
In image/text datasets, augment data by rotating, flipping, or generating synthetic data to improve
model generalization.
8. Final Check & Export
Ensure data is properly formatted.
Save the preprocessed dataset in a suitable format (CSV, pickle, database).
Proceed with model training.

Data Preprocessing in Machine Learning (Detailed Explanation & Code)

Data preprocessing is essential to prepare raw data for machine learning models. This guide covers
data splitting, feature selection, handling imbalanced data, data augmentation, and final data
export with Python code examples.

1. Data Splitting
Splitting the dataset is crucial to evaluate model performance fairly. The dataset is divided into:

Training Set (70-80%) - Used to train the model.


Validation Set (10-15%) - Used to tune hyperparameters.
Test Set (10-15%) - Used for final model evaluation.

Code Example: Data Splitting using `train_test_split()`

python

from sklearn.model_selection import train_test_split


import pandas as pd

# Sample dataset
data = {
'Feature1': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'Feature2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}

df = pd.DataFrame(data)

# Splitting data into features (X) and target variable (y)


X = df.drop(columns=['Target'])
y = df['Target']

# Splitting into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print("Training Set Size:", X_train.shape)


print("Validation Set Size:", X_val.shape)
print("Test Set Size:", X_test.shape)

Key Syntax:

`train_test_split(X, y, test_size=0.3, random_state=42)`: splits the data into a training set and a remaining set.
`random_state=42`: ensures reproducibility.
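
For classification targets with uneven class counts, `train_test_split` also accepts a `stratify` argument so each split keeps the same class proportions (a hedged variant of the call above):

python

# Stratified split: the class proportions in y are preserved in both splits.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)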

2. Feature Selection/Reduction
Feature selection helps remove irrelevant or redundant features, improving model performance.

a) Removing Irrelevant Features

Irrelevant features don’t contribute much to the target variable and should be dropped.

python

df = df.drop(columns=['Feature2']) # Drop Feature2 if it’s not useful

b) Feature Selection Using Correlation Analysis

Highly correlated features can be redundant. We can use the correlation matrix to identify
them.

python

import seaborn as sns


import matplotlib.pyplot as plt

# Compute correlation matrix


correlation_matrix = df.corr()

# Plot heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

Remove one of the features if two are highly correlated (above 0.9 correlation).
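
One hedged way to automate this rule (a common pattern, not prescribed by the original) is to scan the upper triangle of the correlation matrix and drop one feature from every highly correlated pair:

python

import numpy as np

# Keep only the upper triangle so each feature pair is considered once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any feature that is correlated above 0.9 with an earlier feature.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print("Dropped:", to_drop)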

c) Principal Component Analysis (PCA) for Dimensionality Reduction

PCA reduces the number of features while preserving important information.

python

from sklearn.decomposition import PCA

# Apply PCA to reduce to 2 components


pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

print("Original shape:", X.shape)


print("Reduced shape:", X_pca.shape)

3. Handling Imbalanced Data


If the dataset is imbalanced (e.g., fraud detection), we need techniques to balance the classes.

a) Oversampling (SMOTE) - Increase Minority Class

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority class.

python

from imblearn.over_sampling import SMOTE

# Applying SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Class distribution after SMOTE:\n", y_resampled.value_counts())

b) Undersampling - Reduce Majority Class

Undersampling randomly removes examples from the majority class.

python

from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler(sampling_strategy='auto', random_state=42)


X_resampled, y_resampled = undersample.fit_resample(X, y)

print("Class distribution after undersampling:\n", y_resampled.value_counts())

c) Class Weight Adjustment in Models

Some ML models support adjusting class weights to handle imbalance.

python

from sklearn.ensemble import RandomForestClassifier

# Train model with class weights


model = RandomForestClassifier(class_weight='balanced', random_state=42)
model.fit(X_train, y_train)

4. Data Augmentation (For Image & Text Data)


Data augmentation artificially increases dataset size to improve model generalization.

a) Image Augmentation using `ImageDataGenerator`

python

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create an image generator


datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.2,
height_shift_range=0.2, horizontal_flip=True)

# Apply transformation
augmented_images = datagen.flow_from_directory('dataset/images', batch_size=32)

b) Text Augmentation (Synonym Replacement)

python

from textblob import Word

def synonym_replacement(text):
    words = text.split()
    augmented_text = []
    for word in words:
        synonyms = Word(word.lower()).synsets  # WordNet synsets for this word
        if synonyms:
            augmented_text.append(synonyms[0].lemmas()[0].name())  # Replace with first synonym lemma
        else:
            augmented_text.append(word)  # Keep the original word if no synonym is found
    return " ".join(augmented_text)

text = "The quick brown fox jumps over the lazy dog."
augmented_text = synonym_replacement(text)
print(augmented_text)

5. Final Check & Export


After preprocessing, save the cleaned dataset for model training.

a) Check Data Formatting

python

print(df.info()) # Ensure data types are correct


print(df.describe()) # Check summary statistics

b) Save to CSV

python

df.to_csv('cleaned_data.csv', index=False)

c) Save as Pickle for Faster Loading

python

import pickle

with open('cleaned_data.pkl', 'wb') as file:


pickle.dump(df, file)

Summary
| Step | Method | Code |
|---|---|---|
| Data Splitting | `train_test_split()` | `X_train, X_test = train_test_split(X, test_size=0.2)` |
| Feature Selection | Remove irrelevant features, PCA, correlation | `PCA(n_components=2).fit_transform(X)` |
| Handling Imbalanced Data | SMOTE, undersampling, class weights | `SMOTE().fit_resample(X, y)` |
| Data Augmentation | Image transformations, text synonym replacement | `ImageDataGenerator(rotation_range=20)` |
| Final Check & Export | Check types, save as CSV/pickle | `df.to_csv('cleaned_data.csv')` |

Next Steps

Now that your dataset is clean and prepared:
✅ Train ML models with it
✅ Tune hyperparameters
✅ Evaluate model performance
Would you like a full end-to-end ML pipeline implementation? 🚀

What is Linear Regression, how does it work, when do we use it, when do we not use it, and how do we apply it?

What Is Linear Regression?

Linear Regression is a supervised learning algorithm used to predict continuous values. It establishes a linear relationship between an independent variable (X) and a dependent variable (Y).

The Linear Regression Equation

Y = mX + c

Where:

Y = Predicted value (dependent variable)
X = Feature (independent variable)
m = Slope (coefficient)
c = Intercept (constant)

With multiple features the equation becomes:

Y = b0 + b1·X1 + b2·X2 + ... + bn·Xn

This is called Multiple Linear Regression.

How Does Linear Regression Work?

1. Find the Best-Fit Line
Linear regression fits a straight line through the dataset that gives the best approximation of the data.

2. Minimize the Error (Loss)
The model minimizes the prediction error using Mean Squared Error (MSE), i.e. the least-squares criterion.
Formula:

MSE = (1/n) · Σᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)²

where Yᵢ are the actual values and Ŷᵢ are the predicted values.

3. Gradient Descent
To optimize the slope (m) and intercept (c), a gradient descent algorithm can be used, which finds the best parameters step by step (a minimal sketch follows below).
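
Step 3 can be made concrete with a minimal, self-contained sketch (illustrative only; scikit-learn's `LinearRegression` actually uses a closed-form least-squares solution rather than gradient descent):

python

import numpy as np

# Minimal gradient descent for simple linear regression.
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)  # true relationship: y = 2x + 1

m, c = 0.0, 0.0          # initial slope and intercept
lr, epochs = 0.01, 5000  # learning rate and number of iterations
n = len(X)

for _ in range(epochs):
    y_pred = m * X + c
    dm = (-2 / n) * np.sum(X * (y - y_pred))  # dMSE/dm
    dc = (-2 / n) * np.sum(y - y_pred)        # dMSE/dc
    m -= lr * dm
    c -= lr * dc

print(f"Learned slope m ≈ {m:.2f}, intercept c ≈ {c:.2f}")  # expect m ≈ 2, c ≈ 1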

When Do We Use Linear Regression?

✅ Continuous data prediction: when the target variable is continuous.
✅ Simple relationships: when there is a linear relationship between the independent and dependent variables.
✅ Stock price prediction, house price prediction, salary estimation, etc.
✅ When interpretability is required: linear regression is an explainable model, and feature importance is easy to understand.

When Do We NOT Use Linear Regression?

❌ Non-linear relationship: if the relationship in the data is not linear, linear regression will not work well.
❌ Multicollinearity: if the independent variables are highly correlated with each other, the model can become biased/unstable (see the VIF sketch below for one way to check this).
❌ Many outliers: outliers can pull the fitted line and badly distort the model.
❌ Categorical target: if the dependent variable is categorical, classification algorithms are a better fit (e.g., Logistic Regression, Decision Trees).
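
A common way to check for multicollinearity is the Variance Inflation Factor (VIF). A hedged sketch using the statsmodels library (an assumption, since it is not used elsewhere in this document; the feature frame below is made up):

python

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical feature frame; VIF values above roughly 5-10 are commonly read as multicollinearity.
X_features = pd.DataFrame({
    'Experience': [1, 2, 3, 4, 5, 6],
    'Age':        [22, 23, 25, 26, 28, 29],   # strongly correlated with Experience
    'Projects':   [1, 3, 2, 5, 4, 6],
})

vif = pd.DataFrame({
    'feature': X_features.columns,
    'VIF': [variance_inflation_factor(X_features.values, i) for i in range(X_features.shape[1])]
})
print(vif)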

Python Code: How Do We Apply Linear Regression?

1. Import Libraries

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

2. Sample Dataset

python

# Dummy dataset
data = {
'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
}

df = pd.DataFrame(data)
X = df[['Experience']] # Independent variable
y = df['Salary'] # Dependent variable

3. Split Dataset

python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train Linear Regression Model

python

model = LinearRegression()
model.fit(X_train, y_train)

# Model Coefficients
print("Slope (m):", model.coef_[0])
print("Intercept (c):", model.intercept_)

5. Predict and Evaluate

python

y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)


print("R-squared Score:", r2)

6. Visualizing Regression Line

python

plt.scatter(X, y, color='blue', label="Actual Data")


plt.plot(X, model.predict(X), color='red', label="Regression Line")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()

Summary

| Concept | Explanation |
|---|---|
| Type | Supervised Learning (Regression) |
| Formula | Y = mX + c |
| Use Cases | Salary prediction, stock prices, house prices, etc. |
| When NOT to Use | Non-linear data, multicollinearity, outliers |
| Key Functions in Python | `LinearRegression().fit(X, y)`, `predict()`, `mean_squared_error()` |

If you would also like an example of Multiple Linear Regression or Polynomial Regression, let me know! 🚀

How do we deal with categorical data in ML? What are the different techniques, in detail?

How Do We Handle Categorical Data in Machine Learning?

What Is Categorical Data?

Categorical data is data that falls into fixed categories or labels.
Example:

Nominal Data (No Order) → Gender (`Male`, `Female`), City (`Delhi`, `Mumbai`, `Kolkata`)
Ordinal Data (Ordered Categories) → Education (`High School`, `Bachelor's`, `Master's`), Rating (`Bad`, `Average`, `Good`, `Excellent`)

Machine learning models only understand numerical values, so categorical data must be converted into a numerical format.

Categorical Data Encoding Techniques

1. Label Encoding (For Ordinal Data)

Description: converts categories into unique integers.
Best For: when the categories are ordered (e.g., Small < Medium < Large).
Issue: the model may assume a spurious numerical relationship if the categories are nominal.

🔹 Example:
python

from sklearn.preprocessing import LabelEncoder

data = {'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']}


df = pd.DataFrame(data)

# Apply Label Encoding
encoder = LabelEncoder()
df['Size_Encoded'] = encoder.fit_transform(df['Size'])

print(df)

🔹 Output:
text

Size Size_Encoded
0 Small 2
1 Medium 1
2 Large 0
3 Medium 1
4 Small 2

⚠️ Problem: this implies an ordering like "Small" > "Medium" > "Large", which can create a misleading relationship.
2. One-Hot Encoding (For Nominal Data)

Description: creates a binary column (0/1) for each category.
Best For: when the categories are unordered (e.g., cities, colors).
Issue: if there are many categories (high cardinality), the number of columns can grow very large.

🔹 Example:
python

import pandas as pd

data = {'City': ['Delhi', 'Mumbai', 'Kolkata', 'Mumbai', 'Delhi']}


df = pd.DataFrame(data)

# Apply One-Hot Encoding


df_encoded = pd.get_dummies(df, columns=['City'])

print(df_encoded)

🔹 Output:

text

   City_Delhi  City_Kolkata  City_Mumbai
0           1             0            0
1           0             0            1
2           0             1            0
3           0             0            1
4           1             0            0

⚠️ Issue: if there are very many categories (e.g., 1,000 cities), the dataset ends up with 1,000 extra columns.
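
One hedged workaround for such high-cardinality columns (an illustrative pattern, not from the original) is to keep only the most frequent categories and group the rest as 'Other' before one-hot encoding:

python

# Keep the 10 most frequent cities; everything else becomes 'Other'.
top_cities = df['City'].value_counts().nlargest(10).index
df['City_grouped'] = df['City'].where(df['City'].isin(top_cities), 'Other')

df_encoded = pd.get_dummies(df, columns=['City_grouped'])
print(df_encoded.head())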

3. Ordinal Encoding (For Ordered Categories)

Description: assigns a numerical rank to ordered categories.
Best For: when the categories have a meaningful order (e.g., `Low < Medium < High`).
Issue: if the categories are unordered, it can create false relationships.

🔹 Example:
python

from sklearn.preprocessing import OrdinalEncoder

data = {'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'High School']}


df = pd.DataFrame(data)

# Define order
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]

# Apply Ordinal Encoding


encoder = OrdinalEncoder(categories=education_order)
df['Education_Encoded'] = encoder.fit_transform(df[['Education']])

print(df)

🔹 Output:
text

Education Education_Encoded
0 High School 0.0
1 Bachelor 1.0
2 Master 2.0
3 PhD 3.0
4 High School 0.0

4. Frequency Encoding

Description: assigns each category its frequency count.
Best For: when there are many categories (high cardinality) without an inherent order.
Issue: if the dataset is unbalanced, the model can become biased.

🔹 Example:
python

data = {'Country': ['India', 'USA', 'India', 'UK', 'USA', 'India']}


df = pd.DataFrame(data)

# Apply Frequency Encoding


df['Country_Encoded'] = df['Country'].map(df['Country'].value_counts())

print(df)

🔹 Output:
text

Country Country_Encoded
0 India 3
1 USA 2
2 India 3
3 UK 1
4 USA 2
5 India 3

5. Target Encoding (Mean Encoding)

Description: replaces each category with the mean of its target variable values.
Best For: when the category is strongly correlated with the target variable.
Issue: if the dataset is small, it can overfit.

🔹 Example:
python

data = {'Category': ['A', 'B', 'A', 'C', 'B', 'C'],


'Target': [10, 20, 30, 40, 50, 60]}
df = pd.DataFrame(data)

# Apply Target Encoding


target_means = df.groupby('Category')['Target'].mean()
df['Category_Encoded'] = df['Category'].map(target_means)

print(df)

🔹 Output:
text

Category Target Category_Encoded


0 A 10 20.0
1 B 20 35.0
2 A 30 20.0
3 C 40 50.0
4 B 50 35.0
5 C 60 50.0

⚠️ Overfitting Issue: the model can effectively memorize the training data, which amounts to data leakage; out-of-fold encoding (sketched below) is one mitigation.
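
A hedged sketch of out-of-fold (K-fold) target encoding, where each row's encoding is computed only from the target means of the other folds:

python

import numpy as np
from sklearn.model_selection import KFold

# Out-of-fold target encoding: reduces leakage compared with plain mean encoding.
df['Category_OOF'] = np.nan
global_mean = df['Target'].mean()

kf = KFold(n_splits=3, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    fold_means = df.iloc[train_idx].groupby('Category')['Target'].mean()
    df.iloc[val_idx, df.columns.get_loc('Category_OOF')] = (
        df.iloc[val_idx]['Category'].map(fold_means).fillna(global_mean).values
    )

print(df)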

6. Hash Encoding (Feature Hashing)

Description: converts categories into a fixed-length numeric representation via a hash function.
Best For: when there are very many categories (high cardinality).
Issue: some information can be lost (hash collisions).

🔹 Example:
python

from sklearn.feature_extraction import FeatureHasher

data = {'Item': ['Apple', 'Banana', 'Mango', 'Apple', 'Banana']}


df = pd.DataFrame(data)

# Apply Hash Encoding


hasher = FeatureHasher(n_features=3, input_type='string')
hashed_features = hasher.transform([[item] for item in df['Item']])  # each sample must be an iterable of strings

# Convert to DataFrame
hashed_df = pd.DataFrame(hashed_features.toarray())
print(hashed_df)

Which Encoding Technique Should You Use When?

| Technique | Best For | When NOT to Use |
|---|---|---|
| Label Encoding | Ordered categories (ordinal) | Unordered categories (nominal) |
| One-Hot Encoding | Few unique categories | High-cardinality datasets |
| Ordinal Encoding | Ordered data | If order is not meaningful |
| Frequency Encoding | Large categories with frequency-based importance | If frequency is not a good predictor |
| Target Encoding | When a category is correlated with the target variable | Small datasets (overfitting risk) |
| Hash Encoding | Very high-cardinality features | If interpretability is needed |

Conclusion

For machine learning models to understand categorical data, it must be converted into numerical values.
📌 If there are few categories → One-Hot Encoding
📌 If the categories are ordered → Label/Ordinal Encoding
📌 If there are very many categories → Frequency, Target, or Hash Encoding

If you want the best encoding technique for a specific dataset, let me know! 🚀

What is Support Vector Regression, how does it work, when do we use it, when do we not use it, and how do we apply it?

What Is Support Vector Regression (SVR)?

Support Vector Regression (SVR) is a supervised learning algorithm used for predicting continuous data.
It is the regression version of the Support Vector Machine (SVM).
SVR's main objective is to fit a best-fit hyperplane such that as many data points as possible lie within an ε (epsilon) margin around it.

How Does Support Vector Regression Work?

1. It Fits a Hyperplane

SVR fits a hyperplane (a line in 2D, a plane in 3D) together with an epsilon (ε) margin around it.

2. The Epsilon (ε) Tube Concept

Traditional regression → minimizes the least-squares error.
SVR → defines an ε-tube that should contain as many points as possible.
Only the points that fall outside the ε-margin (the support vectors) contribute to the loss.

3. Kernel Trick

If the data is non-linear, SVR uses the Kernel Trick to map the data into a higher dimension where a linear separation becomes possible.
Common Kernels:
Linear Kernel → K(x, x′) = x · x′
Polynomial Kernel → K(x, x′) = (x · x′ + c)^d
RBF (Radial Basis Function) Kernel → K(x, x′) = exp(−γ‖x − x′‖²)

When Do We Use SVR?

✅ When the influence of outliers should be limited
✅ When there are non-linear relationships
✅ When the data is high-dimensional
✅ Stock price prediction, real estate price prediction, weather forecasting

When Do We NOT Use SVR?

❌ When the dataset is very large (SVR training is slow)
❌ When an exact prediction is needed rather than a value within a margin
❌ When linear regression already gives good results

Python Code: How Do We Apply SVR?
1. Import Libraries

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

2. Dummy Dataset

python

# Create sample dataset


data = {'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [30000, 35000, 40000, 45000, 50000, 60000, 65000, 70000, 75000, 80000]}

df = pd.DataFrame(data)

X = df[['Experience']] # Independent Variable


y = df['Salary'] # Dependent Variable

3. Feature Scaling (Important for SVR)

python

scaler_X = StandardScaler()
scaler_y = StandardScaler()

X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1)).flatten()

4. Split Dataset

python

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=42)

5. Train SVR Model

python

model = SVR(kernel='rbf', C=100, epsilon=0.1, gamma=0.1)


model.fit(X_train, y_train)

🔹 Parameters:
`kernel='rbf'` → the RBF kernel is used
`C=100` → a higher `C` makes the model fit the training data more closely
`epsilon=0.1` → sets the width of the epsilon margin
`gamma=0.1` → controls how far the influence of each training point reaches in the RBF kernel
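
These hyperparameters are usually tuned rather than hand-picked. A hedged sketch using scikit-learn's `GridSearchCV` on the scaled training data from this section (the grid values are illustrative):

python

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Small illustrative grid; on real data use a wider grid and more CV folds.
param_grid = {
    'C': [1, 10, 100],
    'epsilon': [0.01, 0.1, 0.5],
    'gamma': [0.01, 0.1, 1.0],
}
grid = GridSearchCV(SVR(kernel='rbf'), param_grid,
                    cv=3, scoring='neg_mean_squared_error')
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
best_svr = grid.best_estimator_  # tuned model, ready for prediction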

6. Prediction & Evaluation

python

y_pred_scaled = model.predict(X_test)

# Reverse Scaling
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
y_test_orig = scaler_y.inverse_transform(y_test.reshape(-1, 1)).flatten()

# Evaluation
mse = mean_squared_error(y_test_orig, y_pred)
r2 = r2_score(y_test_orig, y_pred)

print("Mean Squared Error:", mse)


print("R-squared Score:", r2)

7. Visualizing the SVR Regression Curve

python

plt.scatter(X, y, color='blue', label="Actual Data")


plt.plot(X, scaler_y.inverse_transform(model.predict(X_scaled).reshape(-1, 1)), color='red', label="SVR Prediction")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()

Summary
| Feature | Explanation |
|---|---|
| Type | Supervised Learning (Regression) |
| Formula | Uses an ε-tube margin instead of directly minimizing squared error |
| Kernel Trick | Maps non-linear data into a higher dimension |
| Best For | Non-linear regression, outlier handling, high-dimensional data |
| When NOT to Use | Large datasets, simple linear relationships |
| Key Functions in Python | `SVR(kernel='rbf', C=1.0, epsilon=0.1)` |

Conclusion:
SVR is useful for both linear and non-linear regression problems, but dataset size and complexity must be kept in mind.
If you want an example of Polynomial SVR or Grid Search for hyperparameter tuning, let me know! 🚀

What is Decision Tree Regression, how does it work, when do we use it, when do we not use it, and how do we apply it?

What Is Decision Tree Regression?

Decision Tree Regression is a supervised learning algorithm used for predicting continuous data.
It breaks the dataset into multiple decision-based splits and predicts the average value within each split.

🌳 Conceptually, Decision Tree Regression works like an actual decision tree, in which the data is recursively split based on if-else conditions.

How Does Decision Tree Regression Work?

1. Recursive Splitting

The data is recursively split into subsets based on features and threshold values.
For regression, each split is chosen to reduce variance (MSE minimization); information gain is the analogous criterion for classification trees.

2. Leaf Nodes & Predictions

Once a leaf node is reached, it is assigned the average of the training targets in that node, which becomes the final prediction.

3. Important Terms

Mean Squared Error (MSE): the decision tree minimizes MSE to find the best split.
Max Depth: defines the maximum height of the tree.
Min Samples Split: the minimum number of samples required to split a node.
Overfitting: if the tree grows too deep, it can overfit.

When Do We Use Decision Tree Regression?

✅ When the data is non-linear
✅ When feature interactions are important
✅ When interpretability is needed (the model is easy to understand)
✅ When the dataset is small or mid-sized

When Do We NOT Use Decision Tree Regression?

❌ When the dataset is very large (decision trees can become slow)
❌ When the data is very noisy (risk of overfitting)
❌ When smooth predictions are needed (decision trees make sharp, step-like splits, which can amount to overfitting)

Python Code: How Do We Apply Decision Tree Regression?

1. Import Libraries

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

2. Dummy Dataset

python

# Create sample dataset


data = {'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [30000, 35000, 40000, 45000, 50000, 60000, 65000, 70000, 75000, 80000]}

df = pd.DataFrame(data)

X = df[['Experience']] # Independent Variable


y = df['Salary'] # Dependent Variable

3. Split Dataset

python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train Decision Tree Regression Model

python

model = DecisionTreeRegressor(max_depth=3, min_samples_split=2)


model.fit(X_train, y_train)

🔹 Parameters:
`max_depth=3` → the maximum depth of the tree is set to 3 (to avoid overfitting)
`min_samples_split=2` → a node must contain at least 2 samples to be split

5. Prediction & Evaluation

python

y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)


print("R-squared Score:", r2)

6. Visualizing the Decision Tree Regression Curve

python

plt.scatter(X, y, color='blue', label="Actual Data")


plt.plot(X, model.predict(X), color='red', label="Decision Tree Prediction")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()
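
Since interpretability is one of the stated reasons to use a decision tree, the fitted tree itself can also be drawn with scikit-learn's `plot_tree` (a small optional sketch):

python

from sklearn.tree import plot_tree

# Each node shows the split rule, the number of samples, and the predicted value.
plt.figure(figsize=(10, 6))
plot_tree(model, feature_names=['Experience'], filled=True)
plt.show()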

Summary
| Feature | Explanation |
|---|---|
| Type | Supervised Learning (Regression) |
| Splitting Criteria | Minimizes MSE (Mean Squared Error) |
| Best For | Non-linear relationships, feature interactions |
| When NOT to Use | Large datasets, highly noisy data |
| Key Function in Python | `DecisionTreeRegressor(max_depth=3, min_samples_split=2)` |

Conclusion

Decision Tree Regression is simple and interpretable, but its hyperparameters must be tuned to avoid overfitting! 🚀
If you want an example of Random Forest Regression or hyperparameter tuning, let me know! 😊

What is Random Forest Regression, how does it work, when do we use it, when do we not use it, and how do we apply it?

What Is Random Forest Regression?

Random Forest Regression is an ensemble learning algorithm that uses multiple decision trees and averages their predictions.
It is an upgraded version of Decision Tree Regression that reduces overfitting and improves accuracy.

🔹 Random Forest = Multiple Decision Trees + Bagging (Bootstrap Aggregation)
How Does Random Forest Regression Work?

1. Bagging (Bootstrap Aggregation)

Multiple decision trees are trained on random (bootstrap) subsets of the data.
Each tree is slightly different, which reduces variance and therefore overfitting.

2. Random Feature Selection

Each decision tree uses randomly selected features at its splits, which helps maintain the balance between bias and variance.

3. Final Prediction

For regression: each tree makes its own prediction, and the final output is the average over all trees (see the sketch after the feature-importance code below).
For classification: the final prediction is decided by majority voting.

When Do We Use Random Forest Regression?

✅ When the dataset is complex and has non-linear relationships
✅ When overfitting needs to be avoided
✅ When feature importance is needed
✅ Stock market prediction, house price prediction, medical diagnosis

When Do We NOT Use Random Forest Regression?

❌ When the dataset is very large (training time can be high)
❌ When real-time predictions are needed (Random Forest can be slow)
❌ When model explainability is needed (a single decision tree is easy to explain, but a Random Forest can be complex)
Python Code: How Do We Apply Random Forest Regression?

1. Import Libraries

python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

2. Dummy Dataset

python

# Create sample dataset


data = {'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [30000, 35000, 40000, 45000, 50000, 60000, 65000, 70000, 75000, 80000]}

df = pd.DataFrame(data)

X = df[['Experience']] # Independent Variable


y = df['Salary'] # Dependent Variable

3. Split Dataset

python

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train Random Forest Regression Model

python

model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)


model.fit(X_train, y_train)

🔹 Parameters:
`n_estimators=100` → builds a forest of 100 decision trees
`max_depth=5` → each tree has a maximum depth of 5
`random_state=42` → sets the seed so that results are reproducible

5. Prediction & Evaluation

python

y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)


print("R-squared Score:", r2)

6. Visualizing the Random Forest Regression Curve

python

plt.scatter(X, y, color='blue', label="Actual Data")


plt.plot(X, model.predict(X), color='red', label="Random Forest Prediction")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()

Feature Importance Extraction

If you want to know which features matter most for the prediction:

python

importance = model.feature_importances_
for i, v in enumerate(importance):
print(f"Feature {i}: {v}")

Summary
| Feature | Explanation |
|---|---|
| Type | Supervised Learning (Regression) |
| Splitting Criteria | Uses multiple decision trees |
| Best For | Non-linear relationships, overfitting avoidance, feature importance |
| When NOT to Use | Large datasets, real-time predictions, explainability needed |
| Key Function in Python | `RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)` |

Conclusion

Random Forest Regression is an improved version of the Decision Tree that increases stability and accuracy.
If you want an example of hyperparameter tuning or Grid Search, let me know! 🚀

