Machine Learning Question Bank
Unit-1 – Chapter-1
I. Image Recognition:
ML is used to identify people, objects, or scenes in images. For example,
Facebook uses ML to automatically tag friends in photos using face recognition.
II. Speech Recognition:
It helps convert voice commands into text. Examples include Siri, Google
Assistant, and Alexa that understand and act on voice inputs.
III. Traffic Prediction:
Google Maps uses ML to predict traffic conditions using GPS data and past traffic
trends, helping users find the fastest routes.
IV. Product Recommendation:
E-commerce websites like Amazon and streaming services like Netflix use ML to
recommend products or movies based on user preferences and behaviour.
V. Self-Driving Cars:
Companies like Tesla use ML to train cars to detect objects, follow lanes, and
make driving decisions using real-time sensor data.
VI. Spam and Malware Filtering:
Email services use ML algorithms like Naïve Bayes to filter spam and detect
harmful attachments automatically.
VII. Medical Diagnosis:
ML helps doctors identify diseases by analysing medical images, health records,
and symptoms—for example, detecting brain tumours or cancer early.
Discuss the Classification of Machine Learning in detail.
Machine Learning is classified into different types based on how the learning process
happens and what kind of data is provided. The three main types are:
Supervised Learning
Unsupervised Learning
Reinforcement Learning
1. Supervised Learning :
In supervised learning, the model is trained on labelled data—this means the input
data is paired with the correct output. The model learns the relationship and predicts
output for new inputs. Used in applications where historical data is available.
Examples include:
Email Spam Detection
Risk Assessment
Image Classification
Fraud Detection
There are two major problems in supervised learning:
Regression: Predicts continuous values (e.g., house price, temperature)
Classification: Predicts categories (e.g., spam or not spam)
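A minimal supervised-learning sketch, assuming scikit-learn is available; the labelled spam data and feature names here are hypothetical:

```python
# Supervised learning sketch: the model is trained on labelled examples
# and then predicts labels for unseen inputs.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data: [word_count, has_link] -> 1 = spam, 0 = not spam
X = [[120, 1], [300, 0], [45, 1], [500, 0], [60, 1], [250, 0]]
y = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)     # learn the input -> label mapping from labelled data
print(clf.predict(X_test))    # predict labels for new, unseen inputs
```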
2. Unsupervised Learning :
In unsupervised learning, the model is given unlabelled data. The system tries to learn
patterns and structures from the data without known outputs. Used for tasks like:
Customer Segmentation
Anomaly Detection
Market Basket Analysis
Clustering Images or Documents
Two main types of problems are:
Clustering: Grouping similar data points (e.g., K-Means, Hierarchical Clustering)
Association: Discovering rules that describe large portions of data (e.g., Market
Basket Analysis)
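A minimal unsupervised-learning sketch, assuming scikit-learn; the customer data below is hypothetical, and K-Means is asked to discover two segments without any labels:

```python
# Unsupervised learning sketch: K-Means groups unlabelled points into clusters.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual_spend, visits_per_month], no labels given.
X = np.array([[200, 2], [220, 3], [250, 2],
              [900, 12], [950, 11], [1000, 13]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment discovered for each customer
print(kmeans.cluster_centers_)  # centre of each discovered segment
```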
3. Reinforcement Learning :
Reinforcement learning is a feedback-based method where an agent learns by
interacting with the environment, receiving rewards for good actions and penalties for
bad actions. Used in dynamic and sequential decision-making tasks such as:
Game Playing (e.g., Chess, Go)
Robotics
Self-driving cars
Industrial automation systems
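A toy reinforcement-learning sketch using tabular Q-learning on a hypothetical five-state corridor, where the agent is rewarded only for reaching the rightmost state; all parameter values are illustrative:

```python
# Reinforcement learning sketch: the agent interacts with the environment,
# receives rewards, and updates its action values (Q-table) from that feedback.
import random

n_states, n_actions = 5, 2              # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1   # learning rate, discount, exploration rate

for episode in range(500):
    s = 0                               # start at the leftmost state
    while s != n_states - 1:            # episode ends at the rightmost (goal) state
        if random.random() < epsilon:
            a = random.randrange(n_actions)                        # explore
        else:
            a = max(range(n_actions), key=lambda act: Q[s][act])   # exploit
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0                 # reward only at the goal
        # Q-learning update: move Q(s, a) toward reward + discounted best future value
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)   # the "right" action should end up with the higher value in every state
```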
Discuss various steps in designing a learning system
1. Choosing the Training Experience
The first step is to select the right training data or experience that will be fed
into the machine learning algorithm. This data must be relevant and should
have a direct or indirect impact on the success of the model. For example, in a
chess game, the moves played and their outcomes act as the training experience
from which the model learns.
| Aspect | Training Data | Validation Data | Test Data |
|---|---|---|---|
| Data Type | Labelled data used to fit the model | Labelled data used for tuning and model selection | Labelled data used only for final evaluation |
| Usage Time | Used during model training | Used during training (for validation and tuning) | Used after training and validation are complete |
| Seen by Model? | Yes, the model directly learns from it | Yes, used indirectly during model tuning | No, completely unseen during training and tuning |
| Effect on Model | Directly affects how the model is trained | Helps adjust parameters to improve performance | Does not influence the model; only measures performance |
| Risk If Misused | Model may underfit if data is insufficient | May cause overfitting if used excessively | If leaked into training, results in overestimated performance |
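A sketch of how such a three-way split might be produced, assuming scikit-learn; the 60/20/20 proportions are only illustrative:

```python
# Splitting data into training, validation, and test sets as in the table above.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # hypothetical features
y = np.arange(100)                  # hypothetical targets

# First carve out the test set (kept unseen until final evaluation).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Then split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 / 20 / 20
```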
1. Logical Model.
2. Probabilistic Model.
3. Geometric Model.
1. Logical Model
A logical model uses a series of logical conditions to divide the instance space into
segments. These models typically rely on if-then rules and are closely related to
decision trees and rule-based systems. They help classify data into groups by applying
logical expressions. In such models, the learning process involves determining which
conditions lead to which outputs.
This makes the logical model very easy to interpret and implement. It is especially
useful in applications where rule transparency is required.
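A minimal sketch of a logical model as hand-written if-then rules; the feature names are hypothetical:

```python
# Logical model sketch: if-then rules divide the instance space into segments.
def classify_email(has_bonus: bool, has_lottery: bool) -> str:
    # Each rule corresponds to a region of the instance space.
    if has_bonus or has_lottery:
        return "spam"
    return "not spam"

print(classify_email(has_bonus=True, has_lottery=False))   # spam
print(classify_email(has_bonus=False, has_lottery=False))  # not spam
```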
2. Probabilistic Model
A probabilistic model describes the relationship between the features and the class using probabilities. For spam filtering, the model estimates the conditional probability of the class given the observed features, for example:
P(spam | bonus = 1, lottery = 0)
If this value is greater than 0.5, the model classifies the email as spam. Probabilistic
models are suitable when it's important to estimate the degree of belief or confidence
in the prediction.
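A sketch of how such a probability could be estimated from data; the email counts below are invented purely for illustration:

```python
# Estimating P(spam | bonus = 1, lottery = 0) from hypothetical training counts.
emails = [
    # (bonus, lottery, is_spam)
    (1, 0, 1), (1, 0, 1), (1, 0, 0), (1, 1, 1),
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (1, 0, 1),
]

# Keep only the training emails that match the observed features.
matching = [e for e in emails if e[0] == 1 and e[1] == 0]
p_spam = sum(e[2] for e in matching) / len(matching)
print(p_spam)   # 0.75 here -> greater than 0.5, so classify as spam
```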
3. Geometric Model
A geometric model represents each instance as a point in a feature space and uses geometric notions such as lines, planes, and distances to separate the classes.
An example of a geometric model is a linear classifier, where a straight line (in 2D) or
hyperplane (in higher dimensions) separates two classes. The formula used is:
w⋅x=t
Another example is the k-nearest neighbour model, where a new instance is classified
based on the majority class among its closest neighbours in the feature space.
Geometric models are effective when data is numerically represented, and spatial
separation exists between categories.
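A sketch of both geometric models, assuming scikit-learn for the k-nearest-neighbour part; the weights, threshold, and points are hypothetical:

```python
# Geometric model sketch: a linear decision rule w·x = t and k-nearest neighbours.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Linear classifier: which side of the hyperplane does x fall on?
w, t = np.array([0.6, 0.4]), 1.0
x = np.array([1.2, 0.9])
print("positive" if np.dot(w, x) > t else "negative")   # 1.08 > 1.0 -> positive

# k-nearest neighbours: classify by majority vote among the closest points.
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[2, 2], [9, 9]]))   # [0 1]
```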
Machine Learning is all about using the right features to build the right model
that achieves the right task. Justify your answer.
1. Binary Features
Binary features are attributes that can take only two values: typically, 0 or 1. These
values represent true/false, yes/no, or presence/absence of a particular
characteristic. Binary features are widely used in classification tasks, especially
where logical decision rules apply.
Example:
In email classification, a binary feature could be:
o bonus = 1 if the word "bonus" is present
o bonus = 0 if it is not present
This feature allows for straightforward rules like:
if bonus = 1 then Class = spam
2. Nominal Features
Nominal features are categorical attributes that can take on one of several discrete
values, but these values have no inherent order. Each category represents a label,
and all categories are treated as equally distinct without any ranking.
Example:
In movie classification, a feature like Genre can take values like:
o Action
o Comedy
o Drama
o Horror
3. Ordinal Features
Ordinal features are like nominal features, but with an important difference: their
values have a clear, meaningful order. However, the distance between the values is
not defined.
Example:
A satisfaction rating:
o Poor < Fair < Good < Excellent
While "Excellent" is clearly better than "Good", we cannot quantify how much
better. These features are important in models that can handle ordered
information.
4. Quantitative Features
Quantitative features (also called numerical features) are those that take on real
numerical values and have a mathematical meaning. The differences between
values are measurable and consistent.
Example:
o Age: 25, 30, 45
o Price: ₹199, ₹250, ₹399
These features can be directly used in mathematical computations like calculating
averages, distances, or trends, and are essential for regression and geometric
models.
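A sketch of how these four feature types might be encoded in practice, assuming pandas; the column names and values are hypothetical:

```python
# Encoding binary, nominal, ordinal, and quantitative features.
import pandas as pd

df = pd.DataFrame({
    "bonus":        [1, 0, 1],                      # binary feature (presence/absence)
    "genre":        ["Action", "Comedy", "Drama"],  # nominal feature (no order)
    "satisfaction": ["Poor", "Good", "Excellent"],  # ordinal feature (ordered)
    "age":          [25, 30, 45],                   # quantitative feature (numeric)
})

# Nominal: one-hot encode, since categories have no inherent order.
df = pd.get_dummies(df, columns=["genre"])
# Ordinal: map to integers that preserve the order (distances still not meaningful).
order = {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3}
df["satisfaction"] = df["satisfaction"].map(order)

print(df)
```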
In machine learning, raw data collected from various sources often contains irrelevant,
redundant, or unstructured information. Models cannot perform efficiently if the input
features are poorly represented. Therefore, feature construction and feature
transformation are essential steps to improve the learning process by making data
more meaningful and usable for algorithms.
These steps also make the data compatible with algorithms that expect input in a particular form.
1. Feature Construction
Feature construction involves creating new features from the existing raw data to
enhance the model’s predictive power. These new features help represent the
underlying patterns more clearly.
Example :
From an email’s text, new features like "bonus", "lottery", and "win" can be
extracted. These do not exist explicitly in the raw data but are constructed based
on word presence or frequency.
This process helps transform unstructured text into structured features suitable for
machine learning algorithms.
2. Feature Transformation
Feature transformation converts existing features into a more suitable form or scale so that learning algorithms can process them effectively.
Techniques:
o Text to Binary Transformation: Converting words into binary values — e.g., if the
word “bonus” appears, feature = 1, else 0.
o Word Frequency Count: Converting text features into numerical form based on
how often a word appears in the document.
These transformations make the data machine-readable and help algorithms like
decision trees, SVMs, and neural networks learn more efficiently.
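A sketch of this text-to-binary transformation, assuming scikit-learn's CountVectorizer; the vocabulary and emails are hypothetical:

```python
# Turning raw email text into binary word-presence features
# (a simplified version of the "bonus"/"lottery" example above).
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "claim your bonus and lottery prize now",
    "meeting agenda for tomorrow",
    "win a bonus gift today",
]

# binary=True gives presence/absence features; drop it to get word-frequency counts.
vectorizer = CountVectorizer(vocabulary=["bonus", "lottery", "win"], binary=True)
X = vectorizer.fit_transform(emails)
print(vectorizer.get_feature_names_out())   # ['bonus' 'lottery' 'win']
print(X.toarray())                          # one row of 0/1 features per email
```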
Explain the various approaches that can be used for feature selection.
Feature selection is the process of identifying and selecting the most relevant features
from a dataset to improve model performance, reduce overfitting, and lower
computational cost. There are three main approaches to feature selection:
1. Filter Approach
The filter approach scores each feature independently of any learning algorithm, using statistical measures such as correlation, chi-square, or information gain, and keeps the highest-scoring features. It is fast but ignores interactions between features.
2. Wrapper Approach
The wrapper approach evaluates subsets of features by actually training and testing a
model on them. It searches for the best-performing combination of features using
techniques like forward selection, backward elimination, or recursive feature
elimination.
Example: Adding or removing one feature at a time and testing how the model
accuracy changes.
3. Embedded Approach
The embedded approach performs feature selection as part of the model training
process. The learning algorithm itself selects the most important features while
building the model.
Example: Decision trees automatically select the best features at each split;
LASSO regression shrinks less important feature coefficients to zero.
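Sketches of the three approaches, assuming scikit-learn; the synthetic dataset and parameter choices are illustrative:

```python
# Filter, wrapper, and embedded feature selection on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression, Lasso

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)

# 1. Filter: score each feature independently of any model.
filt = SelectKBest(score_func=f_classif, k=3).fit(X, y)
print("filter keeps:", filt.get_support())

# 2. Wrapper: repeatedly train a model and drop weak features (recursive elimination).
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)
print("wrapper keeps:", wrap.support_)

# 3. Embedded: Lasso shrinks unimportant coefficients to exactly zero during training.
emb = Lasso(alpha=0.05).fit(X, y)
print("embedded keeps:", emb.coef_ != 0)
```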
Discuss the term variance and bias with respect to overfitting and
underfitting.
In machine learning, bias and variance are two key components that contribute to a
model's prediction error. Understanding how they relate to underfitting and overfitting
helps in building models that generalize well on unseen data.
High Bias (Underfitting):
Such models are often too simple to capture the complexity of the data.
This results in poor performance on both the training set and the test set.
Example:
Using a straight line to fit a clearly curved dataset results in high bias. The model
cannot learn the curve and gives inaccurate predictions, even on training data.
High Variance (Overfitting):
Such models are typically too complex relative to the amount of data available.
They perform very well on training data but fail to generalize on new, unseen
data.
This leads to overfitting, where the model captures noise instead of the true
signal.
Example:
A deep decision tree that fits all training examples perfectly, including outliers, may
fail to predict well on test data due to high variance.
Bias-Variance Trade-off
High bias, low variance models are stable but often inaccurate — they underfit.
Low bias, high variance models are flexible but often unstable — they overfit.
The challenge is to find the optimal model complexity that balances bias and variance so that the model generalizes well to unseen data.
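A sketch of this trade-off, assuming scikit-learn: polynomials of increasing degree are fit to noisy data, and comparing training and test error shows underfitting at degree 1 and overfitting at degree 15:

```python
# Bias-variance trade-off: vary polynomial degree and compare train vs. test error.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)   # noisy curve
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          round(mean_squared_error(y_tr, model.predict(X_tr)), 3),   # training error
          round(mean_squared_error(y_te, model.predict(X_te)), 3))   # test error
```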
2. The classifier learns patterns from input data and maps them to the correct class
label. During training, it sees many examples of each class so that it can later
identify the correct category for new, unseen data.
3. A common example is crop classification, where the model predicts whether the
data represents wheat, rice, maize, or cotton. Each crop type is treated as a
separate class, and the algorithm uses features like soil type and climate to make
predictions.
Write a note on :
1. R² (Coefficient of Determination) method.
2. Mean Absolute Error (MAE).
3. Root Mean Squared Error (RMSE).
R² (the coefficient of determination) measures how much of the variation in the
target variable is explained by the model. A value of R² close to 1 indicates that
the model explains most of the variance, while a value close to 0 means the model
explains very little. For example, if R² = 0.85, then 85% of the variation in the
target is explained by the model.
MAE measures the average of the absolute differences between actual and
predicted values. It shows how far predictions are from the true values, on
average, without considering direction (positive or negative).
RMSE measures the square root of the average squared differences between
actual and predicted values. Unlike MAE, it penalizes larger errors more heavily
since the errors are squared before averaging.
Example: Using the same data, actual values [10, 15, 20] and predictions [12,
14, 18], squared errors are [(2)², (1)², (2)²] = [4, 1, 4]. The mean squared error =
(4+1+4)/3 = 3. Then RMSE = √3 ≈ 1.73. This tells us the model’s average error
is about 1.73 units, with higher weight given to larger mistakes.
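The worked numbers above can be reproduced with a short sketch, assuming NumPy and scikit-learn; R² is computed for the same toy data as an extra check:

```python
# Recomputing MAE, RMSE, and R² for actual [10, 15, 20] vs. predicted [12, 14, 18].
import numpy as np
from sklearn.metrics import r2_score

actual = np.array([10, 15, 20])
predicted = np.array([12, 14, 18])

mae = np.mean(np.abs(actual - predicted))           # (2 + 1 + 2) / 3 ≈ 1.67
rmse = np.sqrt(np.mean((actual - predicted) ** 2))  # sqrt((4 + 1 + 4) / 3) ≈ 1.73
r2 = r2_score(actual, predicted)                    # 1 - 9/50 = 0.82
print(round(mae, 2), round(rmse, 2), round(r2, 2))
```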
1. Cost Function
In regression, the cost function is used to measure how well the regression line
fits the data points. It calculates the difference between the predicted values (ŷ)
and the actual values (y). The goal is to minimize this difference so that the
model makes accurate predictions.
The most common cost function in regression is the Mean Squared Error
(MSE). It is calculated by squaring the difference between actual and predicted
values, summing them across all data points, and dividing by the total number
of points.
Formula:
MSE = (1/n) Σ (yᵢ − ŷᵢ)²
Example: Suppose we are predicting house prices. If the actual prices are [200,
220] and the model predicts [210, 230], then errors are [-10, -10]. Squared
errors = [100, 100]. MSE = (100+100)/2 = 100. This means on average, the
model makes squared errors of 100 units, which we want to minimize.
2. Gradient Descent
Gradient descent is an optimization algorithm used to minimize the cost function by iteratively adjusting the model parameters.
The process starts with random values of parameters and then iteratively
updates them in the direction of the negative gradient of the cost function. This
helps the model gradually move towards the values that minimize error.
The size of each step is controlled by a parameter called the learning rate. A
small learning rate ensures steady but slow progress, while a large one speeds
up learning but risks overshooting the minimum point.
Example: Imagine standing on a U-shaped hill (representing the cost curve) and
trying to reach the bottom. Each step you take downhill represents an update to
the model parameters. If the steps are too small, it will take longer to reach the
bottom; if they are too large, you might overshoot and miss the lowest point.
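A minimal gradient-descent sketch for simple linear regression; the data, learning rate, and number of steps are illustrative:

```python
# Gradient descent for y ≈ w*x + b, minimizing the MSE cost function.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])          # underlying relation: y = 2x + 1

w, b = 0.0, 0.0                             # start from arbitrary parameter values
lr = 0.01                                   # learning rate: size of each downhill step

for step in range(5000):
    y_pred = w * X + b
    error = y_pred - y
    grad_w = 2 * np.mean(error * X)         # d(MSE)/dw
    grad_b = 2 * np.mean(error)             # d(MSE)/db
    w -= lr * grad_w                        # move in the negative gradient direction
    b -= lr * grad_b

print(round(w, 2), round(b, 2))             # close to 2 and 1
```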
3. Regularization
Regularization adds a constraint or penalty to the cost function to discourage overly complex models. There are two main types of regularization: hard constraints, where strict limits
are set on parameter values, and soft constraints, where penalties are applied
through modified cost functions.
By balancing the trade-off between model accuracy on training data and model
simplicity, regularization reduces variance and ensures that the regression
model generalizes better to unseen data.
Best Use Case:
o Ridge Regression (L2): works well when most predictors are useful and multicollinear.
o Lasso Regression (L1): works best when only a few predictors are truly important and the others can be discarded.
2. This means the model assigns a confidence score for its prediction. For example,
instead of saying an email is spam, it may say “there is an 80% chance this is
spam”. This gives richer information than a hard yes/no output.
3. In binary classification, only one probability is needed, such as P(positive class). For
instance, if P(spam) = 0.8, then P(not spam) = 0.2 automatically, since
probabilities sum to one.
5. Since true probabilities in real data are not directly known, models estimate them
by learning from patterns in the training data. These estimates depend on how
similar new inputs are to examples seen before.
6. Two extreme approaches to probability estimation exist. In one extreme, all
instances are considered identical, so the model always predicts the overall
proportion of positives (e.g., 30% spam for every email).
7. In the other extreme, only identical instances are considered similar. In this case, if
the model has seen the same input before, it predicts with complete certainty, but
it fails to generalize for unseen inputs.
8. A practical balance is achieved using methods like decision trees, where data is
split into groups based on features. At each leaf, the probability is calculated from
the proportion of positives and negatives in that group.
9. Assessing the quality of probability estimates is done using metrics like Squared
Error (SE) or Mean Squared Error (MSE), also known as the Brier Score. These
penalize models for being overconfident or uncertain.
10. In summary, class probability estimation allows models to express how confident
they are about predictions. This makes them more useful in real-world applications
like medical diagnosis, risk analysis, or spam filtering, where knowing the degree of
certainty is just as important as the predicted label.
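A sketch of class probability estimation and the Brier score, assuming scikit-learn; the spam data is hypothetical:

```python
# Class probability estimation: predict_proba gives a confidence score,
# and the Brier score (mean squared error of the probabilities) measures its quality.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Hypothetical features: [contains_bonus, contains_lottery] -> 1 = spam
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 1, 1, 0, 1, 0]

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)[:, 1]          # P(spam) for each email
print(probs.round(2))
print(round(brier_score_loss(y, probs), 3)) # lower is better
```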
Types of Hypotheses
1. Binary Classification
3. Example: In email spam filtering, the model must decide whether an email is
spam (class 1) or not spam (class 0). Each incoming email is analysed, and the
classifier assigns it to one of the two categories.
5. Binary classifiers are built using algorithms such as Logistic Regression, Decision
Trees, Support Vector Machines, or Neural Networks, depending on the
complexity of the data and problem.
7. True Positive (TP) means the model correctly predicted the positive class (e.g.,
predicting spam when it is spam). True Negative (TN) means the model correctly
predicted the negative class (not spam when it is not spam).
8. False Positive (FP) occurs when the model incorrectly predicts a positive
outcome (e.g., classifying a genuine email as spam). False Negative (FN) occurs
when the model misses a positive case (e.g., failing to detect a spam email).
10. Example: Suppose a spam filter is tested on 100 emails. Out of 75 spam
emails, it correctly identifies 60 (TP) but misses 15 (FN). Out of 25 normal
emails, it correctly classifies 15 (TN) but wrongly marks 10 as spam (FP). Using
this, we can calculate Accuracy = (60+15)/100 = 0.75 or 75%, and other
metrics for deeper evaluation.
List and explain at least 3 error measures used to evaluate the performance
of regression model.
1. Mean Absolute Error (MAE)
MAE measures the average of the absolute differences between actual values
and predicted values. It shows how far the predictions are from the true values
on average.
Formula:
MAE = (1/n) Σ |yᵢ − ŷᵢ|
Example: If actual values are [10, 15, 20] and predicted values are [12, 14, 18],
the absolute errors are [2, 1, 2]. MAE = (2+1+2)/3 = 1.67. This means the
model is off by about 1.67 units on average.
2. Root Mean Squared Error (RMSE)
RMSE measures the square root of the average squared differences between
actual and predicted values. It penalizes larger errors more strongly because of
squaring.
Formula:
RMSE = √( (1/n) Σ (yᵢ − ŷᵢ)² )
Example: With actual [10, 15, 20] and predicted [12, 14, 18], squared errors =
[4, 1, 4]. MSE = (4+1+4)/3 = 3. RMSE = √3 ≈ 1.73. Here, the model’s average
error is about 1.73, with larger mistakes weighted more.
3. R-Squared (R²)
R², also called the coefficient of determination, measures how much of the
variation in the dependent variable is explained by the model. It ranges from 0
to 1.
Formula:
R² = 1 − (SSres / SStot)
Example: If R² = 0.85 in a housing price model, it means 85% of the variation in
house prices is explained by features like size or location, and only 15% is
unexplained.
In summary: MAE gives the average size of the errors, RMSE weights larger errors more heavily, and R² shows the proportion of variance in the target explained by the model.
A confusion matrix (also called a contingency table) is a tool used to evaluate the
performance of a classification model.
It compares the actual values from the dataset with the predicted values from the
model and organizes them into a table.
|                 | Predicted Positive  | Predicted Negative  |
|-----------------|---------------------|---------------------|
| Actual Positive | True Positive (TP)  | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN)  |
True Positive (TP): Model predicts positive when it is actually positive.
True Negative (TN): Model predicts negative when it is actually negative.
False Positive (FP): Model predicts positive when it is actually negative (Type I
error).
False Negative (FN): Model predicts negative when it is actually positive (Type II
error).
Example: A spam filter is tested on 100 emails (75 spam and 25 not spam). The
model predicts 60 spam emails correctly (TP = 60), misses 15 spam emails (FN =
15), correctly identifies 15 normal emails (TN = 15), but wrongly marks 10
normal emails as spam (FP = 10).
|                 | Predicted Spam | Predicted Not Spam |
|-----------------|----------------|--------------------|
| Actual Spam     | TP = 60        | FN = 15            |
| Actual Not Spam | FP = 10        | TN = 15            |
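The example above can be reproduced and evaluated with a short sketch, assuming scikit-learn:

```python
# Rebuilding the confusion-matrix example and computing metrics from it.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

# 60 TP, 15 FN, 10 FP, 15 TN, encoded as actual/predicted label lists (spam = 1).
y_true = [1] * 60 + [1] * 15 + [0] * 10 + [0] * 15
y_pred = [1] * 60 + [0] * 15 + [1] * 10 + [0] * 15

print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
print(accuracy_score(y_true, y_pred))    # (60 + 15) / 100 = 0.75
print(precision_score(y_true, y_pred))   # 60 / (60 + 10) ≈ 0.86
print(recall_score(y_true, y_pred))      # 60 / (60 + 15) = 0.80
```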
1. What is VC Dimension?
The VC (Vapnik–Chervonenkis) dimension of a hypothesis class is the size of the
largest set of points that the class can shatter. It was introduced by Vladimir
Vapnik and Alexey Chervonenkis in the 1970s and is central to statistical learning
theory.
2. What is Shattering?
A hypothesis class is said to “shatter” a set of data points if, for every possible
labelling of those points, there exists a hypothesis in the class that classifies
them correctly.
3. VC Dimension and Model Capacity
A higher VC dimension means the hypothesis class is more expressive and can represent more complex decision boundaries. However, too high a VC dimension increases the risk of overfitting, while too low a VC dimension may cause underfitting.
4. Growth Function
The growth function, denoted m_H(n), measures the maximum number of distinct
labellings (dichotomies) that the hypothesis class H can implement on n points.
If d_VC(H) = d, then:
o For n > d, m_H(n) grows more slowly than 2^n and is bounded by Sauer's Lemma:
m_H(n) ≤ Σ_{i=0}^{d} C(n, i)
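A quick numerical check of the bound for a hypothetical class with d = 3 on n = 5 points:

```python
# Sauer's Lemma check: with VC dimension d = 3 and n = 5 points, the bound
# is below 2^5, so not all 32 labellings can be realized by the class.
from math import comb

d, n = 3, 5
bound = sum(comb(n, i) for i in range(d + 1))   # C(5,0)+C(5,1)+C(5,2)+C(5,3)
print(bound, 2 ** n)                            # 26 32
```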
1. Underfitting
Definition: Underfitting occurs when a model is too simple to capture the underlying
patterns in the data. It performs poorly on both training and test data.
Example: Fitting a straight line to data that clearly follows a curve.
Key Sign: Low accuracy on both training data and test data.
2. Overfitting
Definition: Overfitting occurs when a model is too complex, learning not only the
real patterns but also the noise in the training data. It performs very well on
training data but poorly on unseen data.
Example: Fitting a high-degree polynomial to a small dataset. The curve passes
through almost all training points but gives wrong predictions for new data.
Key Sign: High accuracy on training data but low accuracy on test data.
Remedies for Overfitting:
i. Regularization: Apply Ridge (L2), Lasso (L1), or Elastic Net to penalize large
coefficients and simplify the model.
iii. Pruning (for trees): Remove unnecessary branches in decision trees to make the
model simpler.
iv. Early stopping (for neural nets): Stop training once validation error starts
increasing, even if training error decreases.
v. Increase training data: More data helps the model generalize better and
reduces the chance of memorizing noise.
1. Definition of Regularization
Regularization is a technique that adds a penalty term to the model's cost function to discourage overly large coefficients and overly complex models.
2. Contribution to Generalization
A model that fits training data too closely often performs poorly on new data.
This is called overfitting.
By balancing accuracy and simplicity, regularization ensures the model does not
memorize noise but learns true patterns.
3. L1 Regularization (Lasso)
How it works: Adds the sum of absolute values of coefficients to the cost
function as a penalty.
Effect: Can shrink some coefficients to exactly zero, effectively removing those
features from the model.
4. L2 Regularization (Ridge)
How it works: Adds the sum of squared values of coefficients to the cost function
as a penalty.
Effect: Shrinks all coefficients towards zero but does not eliminate them
completely. It distributes weights more evenly, especially when predictors are
correlated.
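A sketch contrasting the two penalties, assuming scikit-learn; the synthetic data and alpha values are illustrative:

```python
# Ridge (L2) shrinks all coefficients; Lasso (L1) drives some exactly to zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)  # only 2 useful features

print(Ridge(alpha=1.0).fit(X, y).coef_.round(2))   # all coefficients shrunk, none zero
print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))   # irrelevant coefficients set to 0
```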