Regression
Data preprocessing is a crucial step in Machine Learning (ML) that ensures your data is clean, well-
structured, and suitable for training models. Here are the key steps involved:
1. Data Collection
Gather data from multiple sources (databases, CSV files, APIs, web scraping, etc.).
Ensure the data is relevant to the problem statement.
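A minimal illustrative sketch of loading data from a couple of these sources with pandas (the file path and URL below are placeholders):
python
import pandas as pd

# From a CSV file (placeholder path)
df_csv = pd.read_csv('data/records.csv')

# From a REST API that returns JSON records (placeholder URL)
df_api = pd.read_json('https://example.com/api/records')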
2. Data Cleaning
Handle missing values, remove duplicates, and correct inconsistent entries.
3. Data Transformation
Scale, encode, and transform features into a model-ready numerical format.
4. Data Splitting
Split the data into training, validation, and test sets.
5. Feature Selection/Reduction
Drop irrelevant or redundant features to improve model performance.
6. Handling Imbalanced Data
If dealing with an imbalanced dataset (e.g., fraud detection, rare diseases), use:
Oversampling (SMOTE, ADASYN) to increase minority class instances.
Undersampling to reduce majority class instances.
Class-weight adjustment in models.
7. Data Augmentation (if applicable)
In image/text datasets, augment data by rotating, flipping, or generating synthetic data to improve
model generalization.
Data cleaning is a crucial step in the ML pipeline that ensures data quality and reliability. Below is a
detailed explanation with corresponding syntax and Python code for handling missing values,
removing duplicates, and correcting inconsistencies.
1. Handling Missing Values
a) Dropping Rows/Columns
If a row or column has too many missing values, it may not be useful for the model. We can drop it using `.dropna()`.
Code Example
python
import pandas as pd
import numpy as np

# Sample data with missing values (illustrative)
data = {'Name': ['Alice', None, 'Bob', 'Charlie'], 'Age': [25, 30, np.nan, 35]}
df = pd.DataFrame(data)

# Drop rows that contain any missing value
df_dropped = df.dropna()
b) Filling (Imputing) Missing Values
If we don't want to lose data by dropping rows/columns, we can fill missing values using the mean, median, or mode.
Code Example
python
# Fill missing values in 'Name' with the mode (most frequent value)
df['Name'] = df['Name'].fillna(df['Name'].mode()[0])

# Fill missing values in 'Age' with the mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
2. Removing Duplicates
Duplicate records can bias ML models. We remove them using `.drop_duplicates()`.
Code Example
python
# Sample data with a duplicate row (illustrative)
data_dup = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df_dup = pd.DataFrame(data_dup)

# Remove duplicate rows
df_dup = df_dup.drop_duplicates()
3. Correcting Inconsistencies
Mixed capitalization and stray whitespace should be standardized.
python
# Sample data with inconsistent formatting (illustrative)
data_inconsistent = {'Country': ['India', 'INDIA', 'india'], 'City': [' Delhi', 'Mumbai ', 'Delhi']}
df_inconsistent = pd.DataFrame(data_inconsistent)

# Convert to lowercase
df_inconsistent['Country'] = df_inconsistent['Country'].str.lower()
python
# Remove leading/trailing whitespace
df_inconsistent['City'] = df_inconsistent['City'].str.strip()
After applying all cleaning techniques, we get a properly structured dataset that can be used for ML.
Summary
Handling missing values, removing duplicates, and correcting inconsistencies leave the dataset clean and reliable for modeling.
1. Data Splitting
Splitting the dataset is crucial to evaluate model performance fairly. The dataset is divided into:
python
import pandas as pd
from sklearn.model_selection import train_test_split

# Sample dataset
data = {
    'Feature1': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'Feature2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df = pd.DataFrame(data)

X = df[['Feature1', 'Feature2']]  # Features
y = df['Target']                  # Target

# Splitting into training (70%), validation (15%), and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
Key Syntax: `train_test_split(X, y, test_size=..., random_state=...)`, where `test_size` sets the held-out fraction and `random_state` makes the split reproducible.
2. Feature Selection/Reduction
Feature selection helps remove irrelevant or redundant features, improving model performance.
Irrelevant features don’t contribute much to the target variable and should be dropped.
python
# Drop columns judged irrelevant to the target (column names here are placeholders)
irrelevant_cols = []  # e.g. an ID or name column
df = df.drop(columns=irrelevant_cols)
Highly correlated features can be redundant. We can use the correlation matrix to identify
them.
python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the numeric features
correlation_matrix = df.corr()

# Plot heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.show()
Remove one of the features if two are highly correlated (above 0.9 correlation).
python
# Feature1 and Feature2 in the sample data are perfectly correlated, so keep only one
df = df.drop(columns=['Feature2'])
3. Handling Imbalanced Data
a) Oversampling (SMOTE) - Increase Minority Class
SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples for the minority class.
python
from imblearn.over_sampling import SMOTE

# Applying SMOTE
smote = SMOTE(sampling_strategy='auto', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
b) Undersampling - Reduce Majority Class
Randomly removes majority-class samples so that the classes become balanced.
python
from imblearn.under_sampling import RandomUnderSampler

undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X, y)

c) Class-Weight Adjustment
Many models accept a `class_weight` parameter that penalizes errors on the minority class more heavily.
python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
4. Data Augmentation
a) Image Augmentation
python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define augmentation transformations (rotation, flips, zoom)
datagen = ImageDataGenerator(rotation_range=30, horizontal_flip=True, zoom_range=0.2)

# Apply transformation
augmented_images = datagen.flow_from_directory('dataset/images', batch_size=32)
b) Text Augmentation (Synonym Replacement)
python
from textblob import Word  # requires the NLTK WordNet corpus to be downloaded

def synonym_replacement(text):
    words = text.split()
    augmented_text = []
    for word in words:
        synonyms = Word(word).synsets  # WordNet synsets for this word
        if synonyms:
            augmented_text.append(synonyms[0].lemmas()[0].name())  # Replace with a synonym
        else:
            augmented_text.append(word)
    return " ".join(augmented_text)

text = "The quick brown fox jumps over the lazy dog."
augmented_text = synonym_replacement(text)
print(augmented_text)
5. Saving the Cleaned Dataset
b) Save to CSV
python
df.to_csv('cleaned_data.csv', index=False)
c) Save with Pickle
python
import pickle

# Serialize the cleaned DataFrame (or any Python object, e.g. a fitted model)
with open('cleaned_data.pkl', 'wb') as f:
    pickle.dump(df, f)
Summary
Step | Method | Code
Handle missing values | Drop or impute | `df.dropna()`, `df.fillna(...)`
Remove duplicates | Drop repeated rows | `df.drop_duplicates()`
Fix inconsistencies | Standardize text | `str.lower()`, `str.strip()`
Split data | Train / validation / test | `train_test_split(...)`
Handle imbalance | Oversample, undersample, class weights | `SMOTE().fit_resample(X, y)`
Augment data | Image/text augmentation | `ImageDataGenerator(...)`
Save data | CSV or pickle | `df.to_csv(...)`, `pickle.dump(...)`
Next Steps
Now that your dataset is clean and prepared:
✅ Train ML models with it
✅ Tune hyperparameters
✅ Evaluate model performance
Would you like a full end-to-end ML pipeline implementation? 🚀
What is Linear Regression, how does it work, when should you use it (and when not), and how do you apply it?
Simple Linear Regression: Y = mX + c
where Y is the dependent variable, X is the independent variable, m is the slope, and c is the intercept.
Multiple Linear Regression: Y = b0 + b1·X1 + b2·X2 + ... + bn·Xn
1. Finding the Best-Fit Line
Linear regression fits a straight line through the dataset that approximates the relationship as closely as possible.
2. Minimizing the Error (Loss)
The model minimizes the prediction error using the Mean Squared Error (MSE) / Least Squares method.
Formula:
MSE = (1/n) · Σ (Yi − Ŷi)², summed over i = 1 … n
where Yi = actual values and Ŷi = predicted values.
3. Gradient Descent
To optimize the slope (m) and intercept (c), the Gradient Descent algorithm is used: it adjusts the parameters step by step until it finds the best fit (see the sketch below).
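To make this concrete, here is a minimal NumPy sketch (an illustrative addition, not from the original notes) of gradient descent fitting m and c by minimizing MSE on toy data generated from y = 5x + 3:
python
import numpy as np

# Toy data generated from y = 5x + 3
X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([8, 13, 18, 23, 28], dtype=float)

m, c = 0.0, 0.0   # start from arbitrary parameters
lr = 0.01         # learning rate (step size)

for _ in range(5000):
    y_pred = m * X + c
    error = y_pred - y
    # Gradients of MSE = (1/n) * sum((y_pred - y)^2) with respect to m and c
    dm = (2 / len(X)) * np.dot(error, X)
    dc = (2 / len(X)) * error.sum()
    m -= lr * dm
    c -= lr * dc

print(f"Slope m ≈ {m:.2f}, Intercept c ≈ {c:.2f}")  # approaches 5 and 3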
1. Import Libraries
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
2. Sample Dataset
python
# Dummy dataset
data = {
'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
}
df = pd.DataFrame(data)
X = df[['Experience']] # Independent variable
y = df['Salary'] # Dependent variable
3. Split Dataset
python
# 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train the Model
python
model = LinearRegression()
model.fit(X_train, y_train)
# Model Coefficients
print("Slope (m):", model.coef_[0])
print("Intercept (c):", model.intercept_)
5. Make Predictions & Evaluate
python
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R² Score:", r2)
Summary
Concept | Explanation
Formula | Y = mX + c
Use Cases | Salary prediction, stock prices, house prices, etc.
If you'd also like an example of Multiple Linear Regression or Polynomial Regression, let me know! 🚀
How do we deal with categorical data in ML? What are the techniques, in detail?
Categorical data is data that takes values from a fixed set of categories or labels.
Example:
Nominal Data (No Order) → Gender (`Male`, `Female`), City (`Delhi`, `Mumbai`, `Kolkata`)
Ordinal Data (Ordered Categories) → Education (`High School`, `Bachelor's`, `Master's`), Rating
(`Bad`, `Average`, `Good`, `Excellent`)
Machine learning models work with numbers, so categorical data must be converted into a numerical format.
1. Label Encoding
Description: Assigns a unique integer to each category.
🔹 Example:
python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Apply Label Encoding
encoder = LabelEncoder()
df['Size_Encoded'] = encoder.fit_transform(df['Size'])
print(df)
🔹 Output:
Size Size_Encoded
0 Small 2
1 Medium 1
2 Large 0
3 Medium 1
4 Small 2
⚠️ Problem: it can create a false ordinal relationship such as "Small" > "Medium" > "Large".
2. One-Hot Encoding (For Nominal Data)
Description: Creates a binary (0/1) column for each category.
Best For: Unordered categories (e.g., cities, colors).
Issue: With many categories (high cardinality), the number of columns can grow very large.
🔹 Example:
python
import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi']})

# One-Hot Encoding with pandas
df_encoded = pd.get_dummies(df, columns=['City'])
print(df_encoded)
⚠️ Issue: If there are very many categories (e.g., 1000 cities), the dataset gains 1000 new columns.
3. Ordinal Encoding (For Ordered Categories)
🔹 Example:
python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'High School']})

# Define order
education_order = [['High School', 'Bachelor', 'Master', 'PhD']]

encoder = OrdinalEncoder(categories=education_order)
df['Education_Encoded'] = encoder.fit_transform(df[['Education']])
print(df)
🔹 Output:
Education Education_Encoded
0 High School 0.0
1 Bachelor 1.0
2 Master 2.0
3 PhD 3.0
4 High School 0.0
4. Frequency Encoding
Description: Replaces each category with its frequency count.
Best For: High-cardinality features where the individual categories are not very meaningful.
Issue: If the dataset is imbalanced, the model can become biased.
🔹 Example:
python
import pandas as pd

df = pd.DataFrame({'Country': ['India', 'USA', 'India', 'UK', 'USA', 'India']})

# Replace each category with how often it appears
freq = df['Country'].value_counts()
df['Country_Encoded'] = df['Country'].map(freq)
print(df)
🔹 Output:
Country Country_Encoded
0 India 3
1 USA 2
2 India 3
3 UK 1
4 USA 2
5 India 3
5. Target Encoding (Mean Encoding)
Description: Replaces each category with the mean of the target variable for that category.
Best For: When the category is strongly correlated with the target variable.
Issue: On small datasets it can overfit.
🔹 Example:
python
import pandas as pd

# Hypothetical example: city vs. house price
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Delhi', 'Kolkata', 'Mumbai'],
    'Price': [100, 200, 150, 120, 250]
})

# Replace each category with the mean target value for that category
df['City_Encoded'] = df.groupby('City')['Price'].transform('mean')
print(df)
⚠️ Overfitting Issue: The model can effectively memorize the training data, which can lead to data leakage.
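One common way to reduce this leakage risk is out-of-fold (cross-validated) target encoding, where each row is encoded using only the target means computed on the other folds. A minimal sketch, assuming a hypothetical DataFrame with a categorical `City` column and a numeric `Price` target:
python
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    'City':  ['Delhi', 'Mumbai', 'Delhi', 'Kolkata', 'Mumbai', 'Delhi'],
    'Price': [100, 200, 150, 120, 250, 130],
})

kf = KFold(n_splits=3, shuffle=True, random_state=42)
global_mean = df['Price'].mean()
df['City_Encoded'] = 0.0

for train_idx, val_idx in kf.split(df):
    # Category means computed only on the training fold, applied to the held-out fold
    fold_means = df.iloc[train_idx].groupby('City')['Price'].mean()
    df.loc[df.index[val_idx], 'City_Encoded'] = (
        df.iloc[val_idx]['City'].map(fold_means).fillna(global_mean).values
    )

print(df)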
6. Hash Encoding (Feature Hashing)
Description: Hashes categories into a fixed number of numeric columns.
Best For: Very high-cardinality features.
Issue: The hashed columns are not interpretable, and different categories can collide.
🔹 Example:
python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi']})

# Hash each category into a fixed number of columns
hasher = FeatureHasher(n_features=4, input_type='string')
hashed_features = hasher.transform([[city] for city in df['City']])

# Convert to DataFrame
hashed_df = pd.DataFrame(hashed_features.toarray())
print(hashed_df)
When Should You Use Which Encoding Technique?
Technique | Best For | When NOT to Use
Label Encoding | Ordered or binary categories | Unordered categories (implies a false order)
One-Hot Encoding | Few, unordered categories | High-cardinality features
Ordinal Encoding | Categories with a known ranking | Unordered categories
Frequency Encoding | Large numbers of categories with frequency-based importance | If frequency is not a good predictor
Target Encoding | When the category is correlated with the target variable | Small datasets (overfitting risk)
Hash Encoding | Very high-cardinality features | If interpretability is needed
Conclusion
Machine learning models need categorical data converted into numerical values.
📌 Few categories → One-Hot Encoding
📌 Ordered categories → Label/Ordinal Encoding
📌 Very many categories → Frequency, Target, or Hash Encoding
If you want the best encoding technique for a specific dataset, let me know! 🚀
What is Support Vector Regression (SVR), how does it work, when should you use it (and when not), and how do you apply it?
SVR fits a hyperplane (a line in 2D, a plane in 3D) so that as many points as possible lie within an epsilon (ε) margin around it.
3. Kernel Trick
If the data is non-linear, SVR uses the Kernel Trick to map it into a higher-dimensional space where a linear fit becomes possible.
Common Kernels:
Linear Kernel → K(x, x′) = x · x′
Polynomial Kernel → K(x, x′) = (x · x′ + c)^d
RBF (Radial Basis Function) Kernel → K(x, x′) = exp(−γ ‖x − x′‖²)
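To see what these kernels actually compute, here is a small illustrative NumPy sketch (not from the original notes; `c`, `d`, and `gamma` are example values) that evaluates each kernel for two sample vectors:
python
import numpy as np

x = np.array([1.0, 2.0])
x_prime = np.array([2.0, 3.0])
c, d, gamma = 1.0, 2, 0.1   # example kernel hyperparameters

linear = np.dot(x, x_prime)                          # x · x'
polynomial = (np.dot(x, x_prime) + c) ** d           # (x · x' + c)^d
rbf = np.exp(-gamma * np.sum((x - x_prime) ** 2))    # exp(-γ ||x - x'||²)

print(linear, polynomial, rbf)   # 8.0, 81.0, ≈0.82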
1. Import Libraries
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
2. Dummy Dataset
python
# Hypothetical experience vs. salary data (same pattern as the earlier example)
data = {
    'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
}
df = pd.DataFrame(data)

X = df[['Experience']]
y = df['Salary']
3. Feature Scaling (Important for SVR)
python
scaler_X = StandardScaler()
scaler_y = StandardScaler()
X_scaled = scaler_X.fit_transform(X)
y_scaled = scaler_y.fit_transform(y.values.reshape(-1, 1)).flatten()
4. Split Dataset
python
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.2, random_state=42)

5. Train the SVR Model
python
model = SVR(kernel='rbf', C=100, epsilon=0.1, gamma=0.1)
model.fit(X_train, y_train)
🔹 Parameters:
`kernel='rbf'` → uses the RBF kernel
`C=100` → a higher `C` makes the model fit the training data more tightly
`epsilon=0.1` → sets the width of the ε margin (errors inside it are ignored)
`gamma=0.1` → controls the reach of the RBF kernel (how locally the data is fit)
6. Make Predictions & Evaluate
python
y_pred_scaled = model.predict(X_test)
# Reverse Scaling
y_pred = scaler_y.inverse_transform(y_pred_scaled.reshape(-1, 1)).flatten()
y_test_orig = scaler_y.inverse_transform(y_test.reshape(-1, 1)).flatten()
# Evaluation
mse = mean_squared_error(y_test_orig, y_pred)
r2 = r2_score(y_test_orig, y_pred)
print("MSE:", mse)
print("R² Score:", r2)
7. Visualize the Results
python
plt.scatter(scaler_X.inverse_transform(X_test), y_test_orig, color='blue', label='Actual')
plt.scatter(scaler_X.inverse_transform(X_test), y_pred, color='red', label='Predicted')
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.legend()
plt.show()
Summary
Feature | Explanation
Type | Supervised Learning (Regression)
Key Idea | Uses an ε-tube margin instead of minimizing every error
Conclusion:
SVR works well for both linear and non-linear regression problems, but dataset size and complexity need to be kept in mind.
If you'd like an example of Polynomial SVR or Grid Search for hyperparameter tuning, let me know! 🚀
What is Decision Tree Regression, how does it work, when should you use it (and when not), and how do you apply it?
🌳 Decision Tree Regression works much like an actual decision tree: the data is split recursively using if-else conditions.
1. Recursive Splitting
The data is recursively split into subsets based on features and threshold values.
Each split is chosen to maximize variance reduction (MSE minimization) or information gain.
2. Leaf Prediction
When a leaf node is reached, the average target value of the samples in that leaf is assigned as the final prediction.
3. Important Terms
Mean Squared Error (MSE): The decision tree minimizes MSE to find the best split.
Max Depth: Defines the maximum height of the tree.
Min Samples Split: The minimum number of samples a node needs before it can be split.
Overfitting: If the tree grows too deep, it can overfit.
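To make the splitting criterion concrete, here is a small illustrative NumPy sketch (not from the original notes) that scores candidate thresholds on one feature by the weighted MSE of the two resulting groups; the tree prefers the split that lowers this value the most:
python
import numpy as np

# Toy feature (e.g. experience) and target (e.g. salary in thousands)
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([30, 35, 40, 60, 65, 70], dtype=float)

def weighted_mse(threshold):
    left, right = y[x <= threshold], y[x > threshold]
    mse = lambda arr: np.mean((arr - arr.mean()) ** 2) if len(arr) else 0.0
    n = len(y)
    return (len(left) / n) * mse(left) + (len(right) / n) * mse(right)

# Candidate thresholds halfway between consecutive feature values
thresholds = (x[:-1] + x[1:]) / 2
scores = {t: weighted_mse(t) for t in thresholds}
best = min(scores, key=scores.get)
print("Best split: x <=", best)   # 3.5, where the target values jump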
1. Import Libraries
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
2. Dummy Dataset
python
# Hypothetical experience vs. salary data (same pattern as the earlier examples)
data = {
    'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
}
df = pd.DataFrame(data)

X = df[['Experience']]
y = df['Salary']
3. Split Dataset
python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train the Decision Tree Model
python
model = DecisionTreeRegressor(max_depth=3, min_samples_split=2)
model.fit(X_train, y_train)
🔹 Parameters:
`max_depth=3` → the maximum tree depth is 3 (to avoid overfitting)
`min_samples_split=2` → a node needs at least 2 samples before it can be split
5. Make Predictions & Evaluate
python
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R² Score:", r2)
Summary
Feature | Explanation
Type | Supervised Learning (Regression)
Splitting Criteria | Minimizes MSE (Mean Squared Error)
Best For | Non-linear relationships, feature interactions
When NOT to Use | Large datasets, highly noisy data
Key Function in Python | `DecisionTreeRegressor(max_depth=3, min_samples_split=2)`
Conclusion
Decision Tree Regression is simple and interpretable, but the hyperparameters must be tuned to avoid overfitting! 🚀
If you'd like an example of Random Forest Regression or hyperparameter tuning, let me know! 😊
What is Random Forest Regression, how does it work, when should you use it (and when not), and how do you apply it?
1. Bootstrap Sampling (Bagging)
Multiple decision trees are trained on random subsets of the data.
Each tree turns out slightly different, which reduces variance and overfitting.
2. Random Feature Selection
Each decision tree uses a randomly selected subset of features, which keeps the bias-variance trade-off balanced.
3. Final Prediction
For regression, the predictions of all the trees are averaged to give the final output (a rough sketch follows below).
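As a rough illustration of bagging and averaging (not from the original notes; in practice `RandomForestRegressor` below does all of this internally):
python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = np.arange(1, 11).reshape(-1, 1).astype(float)   # e.g. years of experience
y = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70, 75], dtype=float)

trees = []
for _ in range(10):
    # Bootstrap sample: draw rows with replacement
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Final prediction = average of all the trees' predictions
x_new = np.array([[4.5]])
prediction = np.mean([t.predict(x_new)[0] for t in trees])
print("Averaged prediction:", prediction)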
1. Import Libraries
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
2. Dummy Dataset
python
# Hypothetical experience vs. salary data (same pattern as the earlier examples)
data = {
    'Experience': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Salary': [30000, 35000, 40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000]
}
df = pd.DataFrame(data)

X = df[['Experience']]
y = df['Salary']
3. Split Dataset
python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

4. Train the Random Forest Model
python
model = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
🔹 Parameters:
`n_estimators=100` → builds a forest of 100 decision trees
`max_depth=5` → each tree has a maximum depth of 5
`random_state=42` → sets the seed so the results are reproducible
5. Make Predictions & Evaluate
python
y_pred = model.predict(X_test)

# Evaluation
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse)
print("R² Score:", r2)
6. Feature Importance
python
importance = model.feature_importances_
for i, v in enumerate(importance):
    print(f"Feature {i}: {v}")
Summary
Feature | Explanation
Type | Supervised Learning (Regression)
Conclusion
Random Forest Regression is an improved version of the Decision Tree that increases stability and accuracy.
If you'd like an example of Hyperparameter Tuning or Grid Search, let me know! 🚀