Subject Code: PGCSE 102
Subject Name: Python for Data Science
Q1. What is the purpose of the SimpleImputer class in Python?
Answer:
The SimpleImputer class in Python (from sklearn.impute) is used to fill missing values in a
dataset with a specific strategy such as mean, median, most frequent, or a constant value.
Code Example:
from sklearn.impute import SimpleImputer
import numpy as np
data = np.array([[1, 2], [np.nan, 3], [7, 6]])
imputer = SimpleImputer(strategy='mean')  # replace NaN with the column mean
result = imputer.fit_transform(data)      # the NaN becomes (1 + 7) / 2 = 4.0
print(result)
Q2. Define how the “preprocessing” module is useful in Python for data preprocessing.
Answer:
The sklearn.preprocessing module provides functions and classes for feature scaling,
normalization, encoding categorical features, and transformation, making raw data suitable for
modeling.
Key classes include StandardScaler, MinMaxScaler, LabelEncoder, and OneHotEncoder.
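For illustration, a minimal sketch applying two of these utilities to hypothetical toy data:
Code Example:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
# Scale a numeric feature to the [0, 1] range
scaler = MinMaxScaler()
print(scaler.fit_transform([[10], [20], [30]]))
# Encode string categories as integers (assigned in alphabetical order)
encoder = LabelEncoder()
print(encoder.fit_transform(['low', 'high', 'medium']))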
Q3. Describe the significance of StandardScaler class in data preprocessing.
Answer:
StandardScaler standardizes features by removing the mean and scaling to unit variance. It is
crucial for algorithms sensitive to feature scales (e.g., SVM, KNN).
Code Example:
from sklearn.preprocessing import StandardScaler
data = [[1, 20], [2, 40], [3, 60]]
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)
Q4. How does Label Encoding affect model performance?
Answer:
Label Encoding converts categorical labels into numeric values. Because the numbers imply an order, models that treat features numerically (e.g., linear regression) may learn an unintended ordinal relationship, introducing bias. Tree-based models, which only split on feature values, are far less affected, so Label Encoding suits them best.
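For illustration, a minimal sketch with hypothetical color labels; LabelEncoder assigns integers in alphabetical order of the classes:
Code Example:
from sklearn.preprocessing import LabelEncoder
colors = ['red', 'green', 'blue', 'green']
encoder = LabelEncoder()
print(encoder.fit_transform(colors))  # [2 1 0 1]: blue=0, green=1, red=2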
Q5. What are the steps involved in data preprocessing for machine learning?
Answer:
1. Importing libraries
2. Loading the dataset
3. Handling missing values
4. Encoding categorical data
5. Feature scaling
6. Splitting into train-test sets
7. Model fitting
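For illustration, a minimal end-to-end sketch of these steps, assuming a hypothetical CSV file data.csv with numeric feature columns Age and Salary and a target column Purchased (step 4 is omitted because these features are already numeric):
Code Example:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Steps 1-2: import libraries and load the dataset
df = pd.read_csv("data.csv")       # hypothetical file
X = df[['Age', 'Salary']]          # hypothetical feature columns
y = df['Purchased']                # hypothetical target column
# Step 3: handle missing values
X = SimpleImputer(strategy='mean').fit_transform(X)
# Step 6: split before scaling to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Step 5: feature scaling (fit on training data only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Step 7: model fitting and evaluation
model = LogisticRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))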
Q6. Explain the use of train_test_split in data preprocessing.
Answer:
train_test_split (from sklearn.model_selection) is used to divide the dataset into training and
testing sets to evaluate model generalization.
Code Example:
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4]]
y = [1, 2, 3, 4]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
******************************************************************************
Q1. Define the term "One-Hot Encoding" and its application with a suitable example.
Answer:
One-Hot Encoding converts categorical variables into a binary matrix (dummy variables),
avoiding ordinal relationships.
Code Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
data = np.array([['red'], ['green'], ['blue']])
encoder = OneHotEncoder(sparse_output=False)  # 'sparse_output' replaced 'sparse' in scikit-learn 1.2
encoded = encoder.fit_transform(data)
print(encoded)
Application: Used in ML models that require numeric input like logistic regression or neural
networks.
Q2. Describe the difference between fit(), transform(), and fit_transform() methods.
Answer:
● fit(): learns parameters from the data (e.g., mean and standard deviation)
● transform(): applies the learned parameters to transform the data
● fit_transform(): combines fit() and transform() in one step
Example:
scaler = StandardScaler()
scaler.fit(X_train) # learns mean/std
X_train_scaled = scaler.transform(X_train) # uses learned parameters
# OR
X_train_scaled = scaler.fit_transform(X_train)
Q3. Demonstrate how to load a dataset in Python using Pandas and perform basic
summary statistics.
Answer:
import pandas as pd
# Load dataset
df = pd.read_csv("data.csv")
# Display first 5 rows
print(df.head())
# Summary statistics
print(df.describe())
# Check for missing values
print(df.isnull().sum())
******************************************************************************
Q1. Analyze a dataset to deal with missing values and the potential impact of these missing
values on a machine learning model.
Answer:
Missing data can reduce model accuracy, introduce bias, or cause errors during training.
Handling Missing Values:
● Remove rows (dropna())
● Impute with mean/median/mode (SimpleImputer)
● Predict missing values (advanced methods; see the sketch at the end of this answer)
Code Example:
import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.read_csv("data.csv")
print("Missing before:\n", df.isnull().sum())
# Imputation
imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print("Missing after:\n", df.isnull().sum())
Impact on Model:
● Improved completeness
● Better generalization
● Avoids runtime errors
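As an illustration of the "advanced methods" bullet above, a minimal sketch using scikit-learn's KNNImputer, which fills each missing value from the nearest complete rows:
Code Example:
import numpy as np
from sklearn.impute import KNNImputer
data = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0]])
imputer = KNNImputer(n_neighbors=2)  # average the values of the 2 nearest rows
print(imputer.fit_transform(data))   # the NaN becomes (1 + 7) / 2 = 4.0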
Q2. Analyze how the “compose” module is significant in Python for data preprocessing.
Answer:
The sklearn.compose module allows combining multiple preprocessing steps for different
column types using ColumnTransformer.
Significance:
● Streamlines preprocessing for numerical and categorical columns
● Reduces manual processing
● Supports pipeline integration
Code Example:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
df = pd.DataFrame({
    'age': [25, 30, 35],
    'city': ['Delhi', 'Mumbai', 'Chennai']
})
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['age']),
        ('cat', OneHotEncoder(), ['city'])
    ])
processed = preprocessor.fit_transform(df)
print(processed)
*****************************************************************************
Here are detailed notes on data normalization, standardization, and train-test split with clear
explanations of why normalization is done after splitting the data.
Data Normalization, Standardization, and Train-Test Split
1. Data Normalization
Definition:
Normalization is the process of rescaling features to a specific range, typically [0, 1] or [-1, 1],
without distorting differences in the ranges of values.
Formula:
For Min-Max Normalization:
x_scaled = (x - x_min) / (x_max - x_min)
Use Case:
● Suitable when the data has varying scales.
● Useful for distance-based models like KNN, K-means, Neural Networks.
Code Example:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
data = np.array([[1, 20], [2, 40], [3, 60]])
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
2. Data Standardization
Definition:
Standardization transforms data to have zero mean and unit variance.
This is achieved using Z-score scaling.
Formula:
z = (x - μ) / σ
where
● μ = mean of the feature values
● σ = standard deviation of the feature values
Use Case:
● Works well with algorithms like SVM, Logistic Regression, PCA.
● Produces values centered on 0, so negative values can appear (unlike [0, 1] min-max normalization).
Code Example:
from sklearn.preprocessing import StandardScaler
data = [[1, 20], [2, 40], [3, 60]]
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
3. Train-Test Split
Definition:
The train_test_split function from scikit-learn divides the dataset into training and testing sets,
ensuring the model is trained on one part and evaluated on unseen data.
Why split the data?
● To prevent overfitting.
● To check how the model performs on unseen data.
Code Example:
from sklearn.model_selection import train_test_split
X = [[1], [2], [3], [4], [5]]
y = [1, 2, 3, 4, 5]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Train:", X_train, y_train)
print("Test:", X_test, y_test)
4. Why Normalization Should Be Done After Train-Test Split
Key Point:
We must fit the scaler only on training data and then transform both train and test data
using the same parameters (mean, std, min, max from the training set).
Reason:
1. If we normalize the entire dataset before splitting, information from the test set leaks
into the training process (data leakage).
2. The test set should mimic real-world unseen data.
Correct Approach:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit + transform training data
X_test_scaled = scaler.transform(X_test) # Transform test data using training params
5. Why the Train-Test Split Should Not Be Skipped
● No test split means no evaluation: if you do not split the dataset, the model is evaluated on the same data it was trained on, leading to over-optimistic performance metrics.
● The generalization check fails: without a test set, we cannot measure how well the model performs on new, unseen data.
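To make the over-optimism concrete, a minimal sketch comparing train and test accuracy on scikit-learn's built-in Iris dataset (exact scores will vary):
Code Example:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Train accuracy:", model.score(X_train, y_train))  # typically 1.0 (memorized)
print("Test accuracy:", model.score(X_test, y_test))     # the honest estimate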
Summary
● Normalization: scales values to a fixed range (typically 0 to 1).
● Standardization: centers data around 0 with unit variance.
● Train-Test Split: evaluates the generalization of the model.
● Normalization After Split: prevents data leakage.