Mtech Study Material

Subject Code: PGCSE 102

Subject Name: Python for Data Science

Q1. What is the purpose of the SimpleImputer class in Python?

Answer:
The SimpleImputer class in Python (from sklearn.impute) is used to fill missing values in a
dataset with a specific strategy such as mean, median, most frequent, or a constant value.

Code Example:

from sklearn.impute import SimpleImputer


import numpy as np

data = np.array([[1, 2], [np.nan, 3], [7, 6]])


imputer = SimpleImputer(strategy='mean')
result = imputer.fit_transform(data)
print(result)

Q2. Define how the “preprocessing” module is useful in Python for data preprocessing.

Answer:
The sklearn.preprocessing module provides functions and classes for feature scaling,
normalization, encoding categorical features, and transformation, making raw data suitable for
modeling.

Key classes include StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder, etc.
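
A brief sketch showing two of these classes in use (the sample values are made up for illustration):

from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Scale a numeric feature to the [0, 1] range
scaler = MinMaxScaler()
print(scaler.fit_transform([[10.0], [20.0], [40.0]]))  # [[0.], [0.333...], [1.]]

# Map string labels to integer codes
encoder = LabelEncoder()
print(encoder.fit_transform(['cat', 'dog', 'cat']))  # [0 1 0]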


Q3. Describe the significance of StandardScaler class in data preprocessing.

Answer:
StandardScaler standardizes features by removing the mean and scaling to unit variance. It is
crucial for algorithms sensitive to feature scales (e.g., SVM, KNN).

Code Example:

from sklearn.preprocessing import StandardScaler

data = [[1, 20], [2, 40], [3, 60]]

# Standardize each column to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)

Q4. How does Label Encoding affect model performance?

Answer:
Label Encoding converts categorical labels into integer codes. Because the codes imply an ordering, models that treat features as numeric magnitudes (e.g., linear regression, KNN) may learn a spurious order that does not exist in the data. It is safest for tree-based models, which split on thresholds rather than assuming a linear scale.
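
A minimal illustration (the category names are made up); note that the integer codes follow alphabetical order, not any semantic order:

from sklearn.preprocessing import LabelEncoder

sizes = ['small', 'large', 'medium', 'small']
encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)
print(codes)             # [2 0 1 2] -- alphabetical, so 'large' < 'medium' < 'small'
print(encoder.classes_)  # ['large' 'medium' 'small']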

Q5. What are the steps involved in data preprocessing for machine learning?

Answer:

1. Importing libraries

2. Loading the dataset

3. Handling missing values

4. Encoding categorical data

5. Feature scaling

6. Splitting into train-test sets

7. Model fitting (a compact end-to-end sketch follows this list)
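
A compact sketch tying these steps together; the file name data.csv and the columns Age, Salary, City, and Purchased are assumptions for illustration:

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Steps 1-2: import libraries and load the (hypothetical) dataset
df = pd.read_csv("data.csv")
X, y = df[['Age', 'Salary', 'City']], df['Purchased']

# Step 6 first, so preprocessing is learned from training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Steps 3-5: impute and scale numeric columns, one-hot encode the categorical one
numeric = Pipeline([('impute', SimpleImputer(strategy='mean')),
                    ('scale', StandardScaler())])
preprocessor = ColumnTransformer([('num', numeric, ['Age', 'Salary']),
                                  ('cat', OneHotEncoder(handle_unknown='ignore'), ['City'])])

# Step 7: fit a model on the training set, evaluate on the held-out test set
model = Pipeline([('prep', preprocessor), ('clf', LogisticRegression())])
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
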
Q6. Explain the use of train_test_split in data preprocessing.

Answer:
train_test_split (from sklearn.model_selection) is used to divide the dataset into training and
testing sets to evaluate model generalization.

Code Example:

from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4]]
y = [1, 2, 3, 4]

# Hold out 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

******************************************************************************

Q1. Define the term "OneHotEncoding" and its application with a suitable example.

Answer:
One-Hot Encoding converts categorical variables into a binary matrix (dummy variables),
avoiding ordinal relationships.

Code Example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

data = np.array([['red'], ['green'], ['blue']])

# sparse_output=False returns a dense array (the older sparse= argument
# was removed in scikit-learn 1.4)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(data)
print(encoded)

Application: Used in ML models that require numeric input like logistic regression or neural
networks.
Q2. Describe the difference between the fit_transform(), fit() and transform() methods.

Answer:

Method           Description
fit()            Learns parameters from the data (e.g., mean/std)
transform()      Applies the learned parameters to transform data
fit_transform()  Combines fit() and transform() in one step

Example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                         # learns mean/std from the training data
X_train_scaled = scaler.transform(X_train)  # applies the learned parameters
# OR, in one step:
X_train_scaled = scaler.fit_transform(X_train)

Q3. Demonstrate how to load a dataset in Python using Pandas and perform basic
summary statistics.

Answer:

import pandas as pd

# Load dataset
df = pd.read_csv("data.csv")

# Display first 5 rows
print(df.head())

# Summary statistics (count, mean, std, min, quartiles, max)
print(df.describe())

# Check for missing values per column
print(df.isnull().sum())

******************************************************************************

Q1. Analyze a dataset to deal with missing values and the potential impact of these missing
values on a machine learning model.

Answer:
Missing data can reduce model accuracy, introduce bias, or cause errors during training.

Handling Missing Values:

● Remove rows (dropna())

● Impute with mean/median/mode (SimpleImputer)

● Predict missing values (advanced methods)

Code Example:

import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("data.csv")
print("Missing before:\n", df.isnull().sum())

# Imputation
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print("Missing after:\n", df.isnull().sum())

Impact on Model:

● Improved completeness

● Better generalization

● Avoids runtime errors caused by NaN values
Q2. Analyze how the “compose” module is significant in Python for data preprocessing.

Answer:
The sklearn.compose module allows combining multiple preprocessing steps for different
column types using ColumnTransformer.

Significance:

● Streamlines preprocessing for numerical and categorical columns

● Reduces manual processing

● Supports pipeline integration

Code Example:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

df = pd.DataFrame({
    'age': [25, 30, 35],
    'city': ['Delhi', 'Mumbai', 'Chennai']
})

# Apply a different transformer to each group of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age']),
        ('cat', OneHotEncoder(), ['city'])
    ])

processed = preprocessor.fit_transform(df)
print(processed)

*****************************************************************************

Here are detailed notes on data normalization, standardization, and train-test split with clear
explanations of why normalization is done after splitting the data.
Data Normalization, Standardization, and Train-Test Split

1. Data Normalization

Definition:
Normalization is the process of rescaling features to a specific range, typically [0, 1] or [-1, 1],
without distorting differences in the ranges of values.

Formula:
For Min-Max normalization:

X_norm = (X − X_min) / (X_max − X_min)
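
For example, for a feature column [20, 40, 60]: X_min = 20 and X_max = 60, so 40 maps to (40 − 20) / (60 − 20) = 0.5, while 20 maps to 0 and 60 maps to 1.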
Use Case:

● Suitable when the data has varying scales.

● Useful for distance-based models like KNN, K-means, Neural Networks.

Code Example:

from sklearn.preprocessing import MinMaxScaler
import numpy as np

data = np.array([[1, 20], [2, 40], [3, 60]])

# Rescale each column to the [0, 1] range
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)

2. Data Standardization

Definition:
Standardization transforms data to have zero mean and unit variance.
This is achieved using Z-score scaling.
Formula:

z = (x − μ) / σ

where

● μ = mean of the feature values

● σ = standard deviation of the feature values
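
For example, for a feature column [20, 40, 60]: μ = 40 and σ ≈ 16.33 (the population standard deviation, which is what StandardScaler uses), so 60 standardizes to (60 − 40) / 16.33 ≈ 1.22 and 20 to ≈ −1.22.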

Use Case:

● Works well with algorithms like SVM, Logistic Regression, PCA.

● Produces values centered on zero, including negatives (unlike min-max normalization to [0, 1]).

Code Example:

from sklearn.preprocessing import StandardScaler

data = [[1, 20], [2, 40], [3, 60]]

# Transform each column to zero mean and unit variance
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)

3. Train-Test Split

Definition:
The train_test_split function from scikit-learn divides the dataset into training and testing sets,
ensuring the model is trained on one part and evaluated on unseen data.

Why split the data?

● To detect overfitting.

● To check how the model performs on unseen data.

Code Example:

from sklearn.model_selection import train_test_split

X = [[1], [2], [3], [4], [5]]
y = [1, 2, 3, 4, 5]

# Hold out 20% for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train:", X_train, y_train)
print("Test:", X_test, y_test)

4. Why Normalization Should Be Done After Train-Test Split

Key Point:
We must fit the scaler only on training data and then transform both train and test data
using the same parameters (mean, std, min, max from the training set).

Reason:

1. If we normalize the entire dataset before splitting, information from the test set leaks
into the training process (data leakage).

2. The test set should mimic real-world unseen data.

Correct Approach:

from sklearn.preprocessing import StandardScaler

# X_train and X_test come from an earlier train_test_split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit + transform on training data only
X_test_scaled = scaler.transform(X_test)        # transform test data with training parameters

5. Why the Test Split Should Not Be Empty

● An empty test split means no evaluation: if you do not split the dataset, the model is evaluated on the same data it was trained on, leading to over-optimistic performance metrics.

● The generalization check fails: without a test set, we cannot measure how well the model performs on new, unseen data.
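
A short sketch of the contrast, using scikit-learn's built-in iris dataset so it runs as-is (exact scores will vary with the model and split):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Training accuracy is typically optimistic (a deep tree can memorize its data);
# the held-out test score is the honest estimate of generalization
print("Train accuracy:", model.score(X_train, y_train))
print("Test accuracy:", model.score(X_test, y_test))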

Summary

Concept                      Purpose
Normalization                Scale values to a fixed range (0 to 1).
Standardization              Center data around 0 with unit variance.
Train-Test Split             Evaluate generalization of the model.
Normalization After Split    Prevents data leakage.
