Machine Learning Note
01. What is machine learning and how does it differ from traditional
programming?
Machine learning
Machine learning is a subset of AI, which uses algorithms that learn from data to make
predictions. These predictions can be generated through supervised learning, where
algorithms learn patterns from existing data, or unsupervised learning, where they
discover general patterns in data.
To describe the bias-variance tradeoff in the context of machine learning models, we first
need to understand bias, variance, and the relation between them.
Bias: When a model makes predictions, there is a difference between the values it predicts
and the actual (expected) values. This difference is known as bias error, or error due
to bias.
Variance: Variance measures how much a random variable differs from its expected value.
In machine learning, it describes how much the model's predictions would change if it
were trained on a different training dataset.
Relation between bias and variance:
learn well with the training dataset or uses few numbers of the parameter. It leads
to underfitting problems in the model.
• High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and also inaccurate
on average.
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the
model has a large number of parameters, it will have high variance and low bias. So, it is
required to make a balance between bias and variance errors, and this balance between
the bias error and variance error is known as the Bias-Variance trade-off.
Cross Validation:
Cross validation is a technique used in machine learning to evaluate the performance
of a model on unseen data. It involves dividing the available data into multiple folds or
subsets, using one of these folds as a validation set, and training the model on the
remaining folds. This process is repeated multiple times, each time using a different
fold as the validation set.
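The fold rotation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `k_fold_indices` is hypothetical, the data is not shuffled, and only the index bookkeeping is shown (no model is trained).

```python
def k_fold_indices(n_samples, k):
    """Split the indices 0..n_samples-1 into k (train, validation) pairs."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    splits = []
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        splits.append((train, validation))
        start += size
    return splits


splits = k_fold_indices(n_samples=10, k=5)
for train_idx, val_idx in splits:
    print("train:", train_idx, "validation:", val_idx)
```

Each sample lands in the validation set exactly once, so averaging the k validation scores uses every data point for evaluation.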
04. Can you explain difference between classification and regression algorithm?
Regression Algorithm | Classification Algorithm
map the input value (x) with the continuous output variable (y). | to map the input value (x) with the discrete output variable (y).
Regression Algorithms are used with continuous data. | Classification Algorithms are used with discrete data.
In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes.
Regression algorithms can be used to solve the regression problems such as Weather Prediction, House price prediction, etc. | Classification Algorithms can be used to solve classification problems such as Identification of spam emails, Speech Recognition, Identification of cancer cells, etc.
The regression Algorithm can be further divided into Linear and Non-linear Regression. | The Classification algorithms can be divided into Binary Classifier and Multi-class Classifier.
05. What are some common evaluation metrics used for classification tasks?
There are many ways to measure classification performance. Accuracy, the confusion
matrix, log-loss, and AUC-ROC are some of the most popular metrics. Precision and
recall are also widely used for classification problems.
Confusion matrix:
A confusion matrix, also known as an error matrix, is a table that is often used to
describe the performance of a classification model.
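Confusion matrix (rows: predicted class, columns: actual class)
                     | Actual (Positive) | Actual (Negative)
Predicted (Positive) | TP                | FP
Predicted (Negative) | FN                | TN
• TP (true positive): The number of records classified as true while they were actually
true.
• FP (false positive): The number of records classified as true while they were actually
false.
• FN (false negative): The number of records classified as false while they were actually
true.
• TN (true negative): The number of records classified as false while they were actually
false.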
Accuracy:
Accuracy simply measures how often the classifier correctly predicts. We can define
accuracy as the ratio of the number of correct predictions and the total number of
predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
Precision:
It explains how many of the cases predicted as positive actually turned out to be
positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall:
It explains how many of the actual positive cases we were able to predict correctly
with our model.
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1 Score:
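It gives a combined idea about the Precision and Recall metrics. It is the harmonic mean
of the two, so for a given sum of Precision and Recall it is highest when they are equal.
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$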
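As a quick check of these definitions, the sketch below computes the four metrics from confusion-matrix counts, taking F1 as the harmonic mean of precision and recall. The function name `classification_metrics` and the counts used are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts only: 40 TP, 10 FP, 20 FN, 30 TN.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 0.700, precision 0.800, recall 0.667 and F1 0.727, which shows F1 sitting between precision and recall but closer to the lower of the two.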
06. Explain the concept of feature engineering and its importance in machine
learning?
Feature engineering:
Feature engineering is the process of transforming raw data into features that are
suitable for machine learning models. In other words, it is the process of selecting,
extracting, and transforming the most relevant features from the available data to
build more accurate and efficient machine learning models.
07. What is the curse of dimensionality and how does it affect machine learning
models?
Curse of dimensionality:
The Curse of Dimensionality refers to the various challenges and complications that
arise when analyzing and organizing data in high-dimensional spaces (often hundreds
or thousands of dimensions).
• Exponential Increase in Data Volume: As the number of features grows, the space in
which the data lives grows exponentially, so the data points become spread out. This
makes it harder to get reliable results and increases the chance of overfitting.
• Increased Computational Complexity: Working with many dimensions takes more
computing power and time. Algorithms that handle high-dimensional datasets often
become slow or impractical.
• Data Sparsity: When there are many dimensions, there is little data in any given
region of the space. This makes it hard to estimate probability distributions, find
nearby data points, or spot patterns reliably, so it is harder to make correct
predictions or group things accurately.
• Increased Model Complexity and Overfitting: When the data has many features, models
become more complicated in order to handle all of them. This makes them more likely
to memorize noise in the training data instead of real patterns, which leads to
problems when trying to predict new data accurately.
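The sparsity effect can be made concrete with a small calculation. The sketch below assumes each feature is cut into 10 equal bins and asks how many of one million points fall into each grid cell on average; the function name and the numbers are illustrative only.

```python
def points_per_cell(n_points, bins_per_dimension, n_dimensions):
    """Average number of data points per grid cell when every axis is split into bins."""
    cells = bins_per_dimension ** n_dimensions  # cell count grows exponentially
    return n_points / cells


for d in (1, 2, 3, 6, 10):
    print(f"{d:>2} dimensions: {points_per_cell(1_000_000, 10, d):g} points per cell")
```

With one feature there are 100,000 points per cell, with six features only one, and with ten features most cells are empty, which is why density estimates and nearest-neighbour searches become unreliable.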
08. Describe the difference between batch gradient descent, stochastic gradient
descent and mini-batch gradient descent.
• Provides a balance between the faster updates of SGD and the stable
convergence of batch gradient descent. It is generally faster than batch gradient
descent and less noisy than SGD.
• Requires more memory than SGD but less than batch gradient descent since it
processes a small subset of the data at a time.
09. What is the purpose of regularization in machine learning, and how does it
work?
Regularization is a fundamental concept in machine learning that addresses the issue
of overfitting. Overfitting occurs when a model becomes too complex and memorizes
the training data rather than learning the underlying patterns. This leads to poor
training data. The goal is to minimize the loss function during training.
2. Regularization Term: A penalty term is added to the loss function. This term is
function, which includes both the original loss and the regularization penalty.
4. Simpler Model: By penalizing large coefficients, regularization discourages the model
from fitting too closely to the training data. This leads to a simpler model that is less
likely to overfit.
There are different techniques for regularization, each with its own way of penalizing
coefficients. Some common examples include:
L1 Regularization (Lasso): This technique adds a penalty proportional to the sum of the
absolute values of the coefficients. It can shrink some coefficients exactly to zero,
effectively performing feature selection.
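To see why an absolute-value penalty can produce exact zeros, consider a simplified setting, assumed here only for illustration: the features are orthonormal, so each coefficient can be regularized independently. In that case the L1 penalty (objective 1/2 ||y - Xw||^2 + lam * ||w||_1) subtracts a fixed amount from each coefficient and clips at zero, whereas a squared (L2) penalty (objective 1/2 ||y - Xw||^2 + (lam / 2) * ||w||^2) only rescales it. The function names and numbers below are hypothetical.

```python
def soft_threshold(weight, lam):
    """L1 shrinkage of one coefficient (orthonormal-feature assumption)."""
    if weight > lam:
        return weight - lam
    if weight < -lam:
        return weight + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(weight, lam):
    """L2 shrinkage of one coefficient (same assumption): scaled, never exactly zero."""
    return weight / (1.0 + lam)


unregularized = [3.0, -0.4, 0.2, -2.0]  # illustrative least-squares coefficients
lam = 0.5
l1_weights = [soft_threshold(w, lam) for w in unregularized]
l2_weights = [ridge_shrink(w, lam) for w in unregularized]
print("L1:", l1_weights)
print("L2:", [round(w, 3) for w in l2_weights])
```

The two small coefficients become exactly zero under L1, which is the feature-selection effect, while L2 keeps all four features with reduced magnitude.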
By applying regularization, you achieve a balance between fitting the training data well
and keeping the model simple enough to generalize to unseen data. This is crucial for
Answer:
L1 Regularization | L2 Regularization
1. The penalty term is based on the | 1. The penalty term is based on the squares
parameters are shrunk towards zero). | parameters are used by the model).
4. Selects a subset of the most important features. | 4. All features are used by the model.
6. The penalty term is less sensitive to correlated features. | 6. The penalty term is more sensitive to correlated features.
7. Useful when dealing with high-dimensional data with many correlated features. | 7. Useful when dealing with high-dimensional data with many correlated features and when the goal is to have a
boosting.
Ensemble learning is a powerful technique in machine learning that combines multiple
models to improve overall predictive performance. The core idea is that an ensemble of
weaker models can often outperform a single, complex model. There are various
ensemble methods, but two of the most popular are bagging and boosting:
Bagging (Bootstrap Aggregating):
Step-1: Creates multiple training datasets by sampling with replacement from the
original data (bootstrapping). This means some data points may appear multiple times
dataset.
Step-3: Combines predictions from all the models using averaging (regression) or
scheme, with stronger learners having more influence in the final prediction.
Differences between Bagging and Boosting:
Bagging | Boosting
3. For data sampling, uses bootstrapping with replacement. | 3. For data sampling, uses original data, adjusts weights based on errors.
Choosing between bagging and boosting depends on the specific problem and data.
Bagging is generally simpler to implement and works well when the base models have
high variance. Boosting can achieve higher accuracy, especially for complex problems,
2. Creates new samples by learning the distribution of the underlying data. | 2. Acquires the threshold of judgment that distinguishes various classes or categories.
3. For example: Text generation and | 3. Used in activities like sentiment analysis
learning?
Dimensionality reduction techniques play a crucial role in machine learning by
simplifying complex datasets with many features. Here's how they benefit machine
learning models:
number of dimensions.
3. Visualization: With fewer dimensions, it becomes easier to visualize the data
learning:
identify the most important features for a specific task, allowing you to focus on
the most informative data.
There are various dimensionality reduction techniques, each with its own advantages
and disadvantages. The best choice depends on the specific characteristics of your data
and the machine learning task at hand. Some popular techniques include:
1. Principal Component Analysis (PCA): Identifies a new set of features (principal
between different classes in the data, particularly useful for classification tasks.
3. t-Distributed Stochastic Neighbor Embedding (t-SNE): Effective for visualizing
(SVMs)?
Support Vector Machines (SVMs) are a powerful machine learning algorithm known
SVMs aim to find an optimal hyperplane in high-dimensional space that best separates
from each class, called support vectors. These support vectors play a crucial role in
defining the optimal hyperplane.
SVMs strive to maximize this margin, as a larger margin indicates a clearer separation
between the classes and better generalization to unseen data.
address this, SVMs can leverage a technique called the kernel trick.
The kernel trick implicitly maps the data points to a higher-dimensional feature space
where a linear separation might exist. This allows SVMs to effectively handle non-linear
data.
the margin between the classes. This involves calculating the distance between
the hyperplane and the support vectors.
3. Classification: New data points are mapped to the same feature space, and the
SVM model predicts their class labels based on which side of the hyperplane they
fall on.
Advantages of SVMs:
2. Robust to noise: The focus on maximizing the margin makes SVMs less
kernel trick.
4. Memory efficiency: The model primarily relies on the support vectors for
Overall, SVMs are a versatile and powerful tool for various machine learning tasks,
particularly classification. Their ability to handle high-dimensional data, robustness to
noise, and effectiveness with non-linear data through the kernel trick make them a
popular choice for many applications.
From the figure above we can say that there are multiple lines that segregate our data
points or do a classification between red and blue circles. So how do we choose the best
line or in general the best hyperplane that segregates our data points?
One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.
So we choose the hyperplane whose distance from it to the nearest data point on each
Here we have one blue ball in the boundary of the red ball. The blue ball in the
boundary of red ones is an outlier of blue balls. The SVM algorithm has the
characteristics to ignore the outlier and finds the best hyperplane that maximizes the
For this type of data, the SVM finds the maximum margin as it did with the previous
data sets, and in addition it adds a penalty each time a point crosses the margin. The
margins in these cases are called soft margins. With a soft margin, the SVM tries to
minimize an objective that combines a large-margin term with the total penalty for
margin violations. Hinge loss is a commonly used penalty: if there is no violation, there
is no hinge loss; if there is a violation, the hinge loss is proportional to the distance
of the violation.
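The hinge loss can be written in one line. The sketch below assumes labels encoded as +1 and -1 and a signed classifier score (for a linear SVM, w·x + b); the example scores are made up for illustration.

```python
def hinge_loss(label, score):
    """Hinge loss for one example; label is +1 or -1, score is the signed model output."""
    # Zero when the point is on the correct side with margin at least 1,
    # otherwise grows linearly with the size of the violation.
    return max(0.0, 1.0 - label * score)


examples = [
    (+1, 2.5),   # correct side, outside the margin: no loss
    (+1, 0.4),   # correct side, but inside the margin: small loss
    (-1, 0.3),   # wrong side of the hyperplane: larger loss
]
losses = [hinge_loss(label, score) for label, score in examples]
for (label, score), loss in zip(examples, losses):
    print(f"label={label:+d} score={score:+.1f} hinge loss={loss:.1f}")
```

Points that sit safely outside the margin contribute nothing, so only margin violators influence the penalty term.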
15. Describe a real-world machine learning project you've worked on, including
be a good choice. Random Forests are robust to overfitting and can handle a
variety of data types.
7. Model Training and Evaluation: The data is split into training and testing
sets. The Random Forest model is trained on the training data. The model's
performance is evaluated on the testing set using metrics like accuracy, precision,
recall, and F1-score. These metrics provide insights into how well the model
Challenges Faced:
1. Data Quality: Ensuring the accuracy and completeness of customer data is
churning customers. This can lead to models biased towards the majority class.
Techniques like oversampling or undersampling the minority class can be
explored.
3. Model Interpretability: While Random Forests are powerful, interpreting their
Additional Considerations:
conditions evolve.
This is a simplified example, but it highlights the key steps involved in a real-world
machine learning project. The specific details will vary depending on the problem and
the chosen approach.
Supervised Learning | Unsupervised Learning | Reinforcement Learning
Input data is labeled. | Input data is not labeled. | Input data is not predefined.
Learn pattern of inputs and | Divide data into classes. | Find the best reward
Model is built and trained prior to testing. | Model is built and trained prior to testing. | The model is trained and tested simultaneously.
Deals with regression and classification problems. | Deals with clustering and associative rule mining problems. | Deals with exploration and exploitation problems.
17. What are the main steps involved in a typical machine learning project
pipeline?
A Machine Learning pipeline is a process of automating the workflow of a complete
machine learning task.
1. Data Ingestion
Each ML pipeline starts with the data ingestion step. In this step, the data is organized
into a well-structured format that is suitable for the following steps. This step does
not perform any feature engineering; rather, it may perform versioning of the input
data.
2. Data Validation
The next step is data validation, which must be performed before training a new
model. Data validation focuses on statistics of the new data, e.g., range, number of
categories, distribution of categories, etc. In this step, data scientists can detect
whether any anomaly is present in the data. There are various data validation tools
that enable us to compare different datasets to detect anomalies.
3. Data Pre-processing
Data pre-processing is one of the most crucial steps of the ML lifecycle as well as of
the pipeline. We cannot directly use the collected data to train the model without
pre-processing it, as doing so may produce unreliable results.
The pre-processing step involves preparing the raw data and making it suitable for the
ML model. The process includes different sub-steps, such as data cleaning, feature
scaling, etc. The output of the data pre-processing step becomes the final dataset
that can be used for model training and testing. There are different tools in ML
for data pre-processing that can range from simple Python scripts to graph models.
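As an example of the feature-scaling sub-step, here is a minimal sketch of two common choices, min-max scaling and standardization, using only the Python standard library. The function names and the sample column are illustrative assumptions, and a constant column (which would cause a division by zero) is not handled.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma > 0


ages = [20.0, 30.0, 40.0, 50.0, 60.0]  # illustrative feature column
scaled = min_max_scale(ages)
standardized = standardize(ages)
print(scaled)
print([round(v, 3) for v in standardized])
```

In a real pipeline the minimum, maximum, mean and standard deviation would be computed on the training set only and then reused for validation and test data.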
4. Model Training
The model training step is the core of each ML pipeline. In this step, the model is trained
to take the input (pre-processed dataset) and predict an output with the highest
possible accuracy.
However, there could be some difficulties with larger models or with large training data
sets. So, for this, efficient distribution of the model training or model tuning is required.
This issue of the model training stage can be solved with pipelines as they are scalable,
and a large number of models can be processed concurrently.
5. Model Analysis
After model training, we need to determine the optimal set of parameters by using
loss or accuracy metrics. Apart from this, an in-depth analysis of the model's
performance is crucial for the final version of the model. The in-depth analysis includes
calculating other metrics such as precision, recall, AUC, etc. This also helps us
determine how much the model depends on the features used in training and explore how
the model's predictions would change if we altered the features of a single training
example.
6. Model Versioning
The model versioning step keeps track of which model, set of hyperparameters, and
datasets have been selected as the next version to be deployed. For various situations,
there could occur a significant difference in model performance just by applying
more/better training data and without changing any model parameter. Hence, it is
important to document all inputs into a new model version and track them.
7. Model Deployment
After training and analyzing the model, it's time to deploy the model. An ML model can
be deployed in three ways, which are:
However, the common way to deploy the model is using a model server. Model servers
allow multiple versions to be hosted simultaneously, which makes it possible to run A/B
tests on models and can provide valuable feedback for model improvement.
8. Feedback Loop
Each pipeline forms a closed loop to provide feedback. With this closed loop, data
scientists can determine the effectiveness and performance of the deployed models.
This step could be automated or manual depending on the requirement.
18. Explain the bias-variance tradeoff and its significance in machine learning
model selection.
Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and
variance in order to avoid overfitting and underfitting in the model. If the model is very
simple with fewer parameters, it may have low variance and high bias. Whereas, if the
model has a large number of parameters, it will have high variance and low bias. So, it is
required to make a balance between bias and variance errors, and this balance between
the bias error and variance error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need a low variance and low bias.
But this is not possible because bias and variance are related to each other:
Missing values are a common challenge in data analysis, and there are several
strategies for handling them. Here’s an overview of some common approaches:
Imputation Methods
• Replacing missing values with estimated values.
3. Interpolation Techniques
• Estimate missing values based on surrounding data points using techniques like
linear interpolation or spline interpolation.
• More sophisticated than mean/median imputation: Captures relationships between
variables.
• Requires additional libraries and computational resources.
These interpolation techniques are useful when the relationship between data points
can be reasonably assumed to follow a linear or quadratic pattern.
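A minimal sketch of linear interpolation for a one-dimensional series is shown below. It assumes missing entries are marked with `None` and that the points are evenly spaced; the helper name `interpolate_missing` is hypothetical, and gaps at the start or end are left unfilled because they have a known value on only one side.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / gap  # position of the missing point within the gap
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled


series = [10.0, None, 14.0, None, None, None, 22.0, None]
filled_series = interpolate_missing(series)
print(filled_series)
```

Here the gaps between 10 and 14 and between 14 and 22 are filled with evenly spaced values, while the trailing missing value stays `None`.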
Feature selection is a process that chooses a subset of features from the original
features so that the feature space is optimally reduced according to a certain criterion.
There are three general classes of feature selection algorithms: Filter methods,
wrapper methods and embedded methods.
The main purpose of cross validation is to detect overfitting, which occurs when a
model fits the training data too closely and performs poorly on new, unseen
data. By evaluating the model on multiple validation sets, cross validation provides a
more realistic estimate of the model’s generalization performance, i.e., its ability to
perform well on new, unseen data.
23. What is regularization, and how does it help prevent overfitting in machine
learning models?
Regularization is a technique in machine learning that helps prevent overfitting.
Regularization prevents the model from fitting the training data too closely, which is a
common cause of overfitting. Instead, it promotes a balance between model
complexity and performance, leading to better generalization on new, unseen data.
2. Regularization introduces a trade-off between fitting the training data and keeping
the model’s parameters small. The strength of regularization is controlled by a
hyperparameter, often denoted as lambda (λ). A higher λ value leads to stronger
regularization and a simpler model.
3. Regularization techniques help control the complexity of the model. They make the
model more robust by constraining the parameter space. This results in smoother
decision boundaries in the case of classification and smoother functions in
regression, reducing the potential for overfitting.
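The effect of λ can be illustrated with the simplest possible case, assumed here for illustration: a one-feature linear model without intercept, y ≈ w·x, fitted with an L2 penalty. Minimizing Σ(y − w·x)² + λ·w² gives the closed form w = Σxy / (Σx² + λ). The function name and data are made up.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form weight for y ~ w * x with squared-error loss plus lam * w**2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + lam)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x, illustrative values
weights = [ridge_weight(xs, ys, lam) for lam in (0.0, 1.0, 10.0, 100.0)]
for lam, w in zip((0.0, 1.0, 10.0, 100.0), weights):
    print(f"lambda={lam:>5}: w={w:.3f}")
```

As λ increases the weight is pulled further towards zero, trading a worse fit on the training data for a simpler model.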
24. Discuss the differences between batch gradient descent, stochastic gradient
descent, and mini-batch gradient descent optimization algorithms.
Batch Gradient Descent (GD) | Stochastic Gradient Descent (SGD) | Mini-batch Gradient Descent
As each iteration of the approach requires computing the gradient of the cost function across the whole training dataset, GD takes some time to converge. | SGD adjusts the model parameters more often than GD, which causes it to converge more quickly. | In order to strike a reasonable balance between speed and accuracy, the model parameters are changed more frequently than GD but less frequently than SGD.
With little noise, GD modifies the model's parameters based on the average of all training samples. | Due to the fact that SGD is updated using just one training sample, it has a lot of noise. | Mini-batch Gradient Descent has some noise, though less than SGD, because the update is based on a small number of training examples.
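The three variants differ only in how many samples feed each parameter update, which the sketch below makes explicit with a single `batch_size` argument. It assumes a one-parameter model y ≈ w·x with mean squared error and noiseless toy data; the function name `train` and all hyperparameters are illustrative choices, not recommendations.

```python
import random


def train(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Fit y ~ w * x by gradient descent on mean squared error.

    batch_size == len(xs) -> batch gradient descent (one update per epoch)
    batch_size == 1       -> stochastic gradient descent (one update per sample)
    anything in between   -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noiseless data, true weight is 2
w_batch = train(xs, ys, batch_size=len(xs))
w_sgd = train(xs, ys, batch_size=1)
w_mini = train(xs, ys, batch_size=2)
print(w_batch, w_sgd, w_mini)
```

On this noiseless toy problem all three reach a weight close to 2; on real data the smaller batches give cheaper but noisier updates.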
25. Describe the workings of decision trees and how they handle both
classification and regression tasks.
Decision trees are a type of supervised learning algorithm used for both classification
and regression tasks. They operate by splitting the data into subsets based on the value
of input features.
Tree Structure:
•Each internal node represents a test on an attribute, each branch represents the
outcome of the test.
•Each leaf node represents the final outcome (class label or numerical value).
In classification tasks, decision trees aim to partition the data such that each
partition contains instances of a single class as much as possible.
1. Impurity Measures:
• The quality of a split is evaluated using impurity measures such as Gini impurity, entropy
(information gain), or misclassification error.
• Gini Impurity: Measures the probability of incorrectly classifying a randomly chosen
element if it was randomly labeled according to the distribution of labels in the subset.
• Entropy: Measures the amount of disorder or randomness. Information gain is the
reduction in entropy.
2. Classification Decision:
• At each node, the algorithm calculates the impurity for each possible split and selects
the split that minimizes the impurity.
• Once the tree is built, to classify a new instance, the instance is passed through the tree
following the splits corresponding to the values of its features until it reaches a leaf
node. The class label of the leaf node is the predicted class for the instance.
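The impurity measures above are short to compute. The sketch below, with made-up labels and hypothetical function names, evaluates Gini impurity and entropy for a parent node and compares two candidate splits by their size-weighted impurity.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


parent = ["A", "A", "A", "B", "B", "B"]
split_1 = (["A", "A", "A"], ["B", "B", "B"])   # perfectly separates the classes
split_2 = (["A", "A", "B"], ["A", "B", "B"])   # leaves both children mixed

print("parent gini:", gini(parent), "parent entropy:", entropy(parent))
for name, split in (("split_1", split_1), ("split_2", split_2)):
    info_gain = entropy(parent) - weighted_impurity(*split, entropy)
    print(name, "gini:", round(weighted_impurity(*split, gini), 3),
          "information gain:", round(info_gain, 3))
```

The tree-building algorithm would choose split_1, since it has the lower weighted impurity and the higher information gain.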
In regression tasks, decision trees aim to partition the data such that each partition is as
homogeneous as possible with respect to the target variable (i.e., the values in each
partition should be close to each other).
1. Variance Reduction:
2. Regression Decision:
• At each node, the algorithm evaluates the potential splits and selects the one that
results in the greatest reduction in variance.
• For making predictions, a new instance is passed through the tree down to a leaf node.
The value at the leaf node, often the mean value of the target variable in that leaf, is the
predicted value for the instance.
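For regression, the split search can be sketched as follows for a single numeric feature. This is a simplified illustration with hypothetical names: it tries the midpoints between consecutive distinct feature values, scores each by variance reduction, and uses the mean target of each side as the leaf prediction.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return (threshold, variance_reduction, left_mean, right_mean) for the best split."""
    parent_variance = variance(ys)
    n = len(ys)
    best = None
    candidates = sorted(set(xs))
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        weighted = len(left) / n * variance(left) + len(right) / n * variance(right)
        reduction = parent_variance - weighted
        if best is None or reduction > best[1]:
            best = (threshold, reduction, sum(left) / len(left), sum(right) / len(right))
    return best


xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
threshold, reduction, left_mean, right_mean = best_split(xs, ys)
print(threshold, round(reduction, 3), left_mean, right_mean)
```

The chosen threshold separates the low-valued group from the high-valued group, and a new instance would be predicted as the mean of whichever side it falls on.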
26. What are ensemble learning methods, and how do bagging and boosting
differ?
Ensemble learning methods are techniques that create multiple models and combine
them to solve a particular computational problem. By aggregating the predictions of
several models, they can create a more accurate and robust model that often achieves
better performance than any single model could. The main idea behind ensemble
learning is to leverage the strengths and compensate for the weaknesses of individual
models.
Bagging | Boosting
7. Generally simpler to implement and can be easily parallelized since models are trained independently. | 7. More complex due to the sequential nature and the need to adjust weights and focus on errors iteratively.
Example: The Random Forest model uses Bagging. | Example: AdaBoost and Gradient Boosting use Boosting techniques.
27. Describe the K-nearest neighbors (KNN) algorithm and its applications.
Algorithm Overview:
2. Basic Principle: The core idea of KNN is to predict the label of a new data point based
on the labels of the 'k' nearest data points in the training set. The 'k' in KNN is a user-
defined constant and determines the number of neighbors to consider.
3. Distance Metric: To find the nearest neighbors, a distance metric is used. The most
common distance metric is Euclidean distance, but others like Manhattan distance,
Minkowski distance, or Hamming distance (for categorical data) can also be used.
4. Classification:
• For a classification task, the new data point is assigned the class that is most common
among its 'k' nearest neighbors. This process is often referred to as "majority voting."
• For example, if k=3 and the three nearest neighbors have labels A, B, and B, the new
data point would be classified as B.
5. Regression:
• For a regression task, the value assigned to the new data point is the average (or
sometimes weighted average) of the values of its 'k' nearest neighbors.
• For example, if k=3 and the three nearest neighbors have values 5, 6, and 7, the
predicted value for the new point would be (5+6+7)/3 = 6.
• A small 'k' (like 1 or 3) makes the algorithm sensitive to noise but captures the local
structure well.
• A large 'k' provides a smoother decision boundary but might overlook local structures,
leading to underfitting.
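The classification and regression rules above fit in a few lines. The sketch below uses Euclidean distance via `math.dist` (Python 3.8+); the function names and the tiny datasets are made up so that the results mirror the A/B/B and 5, 6, 7 examples in the text.

```python
from collections import Counter
from math import dist


def k_nearest(train_points, query, k):
    """Return the targets of the k training points closest to the query."""
    ranked = sorted(train_points, key=lambda item: dist(item[0], query))
    return [target for _, target in ranked[:k]]


def knn_classify(train_points, query, k):
    """Majority vote among the k nearest neighbours."""
    votes = Counter(k_nearest(train_points, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train_points, query, k):
    """Unweighted average of the k nearest neighbours' values."""
    neighbours = k_nearest(train_points, query, k)
    return sum(neighbours) / len(neighbours)


labelled = [((1.0, 1.0), "A"), ((2.0, 2.0), "B"), ((2.0, 3.0), "B"), ((8.0, 8.0), "A")]
valued = [((1.0,), 5.0), ((2.0,), 6.0), ((3.0,), 7.0), ((10.0,), 50.0)]

predicted_class = knn_classify(labelled, (2.0, 2.5), k=3)   # neighbours: B, B, A
predicted_value = knn_regress(valued, (2.0,), k=3)          # neighbours: 6, 5, 7
print(predicted_class, predicted_value)
```

Sorting all training points for every query is fine for a sketch like this; larger datasets usually call for a spatial index or a library implementation.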
Applications of KNN
1. Pattern Recognition: KNN is widely used in various pattern recognition tasks, including
image recognition and handwriting recognition. For instance, in optical character
recognition (OCR), KNN can classify characters based on pixel intensity.
3. Medical Diagnosis: In healthcare, KNN can be used for disease prediction and
diagnosis. For instance, it can help predict whether a patient has a particular disease
based on their symptoms and historical data.
4. Finance: In financial applications, KNN can be used for stock price prediction, customer
segmentation, and credit scoring by analyzing historical financial data and trends.
28. What is the difference between precision and recall, and how are they
relevant in classification problems?
No. | Precision | Recall
1 | Precision is about how many of the predicted positives are actual positives. | Recall is about how many of the actual positives were identified by the model.
2 | It focuses on the accuracy of the positive predictions. | It focuses on the coverage of the positive instances.
3 | It is affected by false positives. More false positives lead to lower precision. | It is affected by false negatives. More false negatives lead to lower recall.
4 | Precision is crucial when the cost of false positives is high. | Recall is crucial when the cost of false negatives is high.
5 | A model with high precision makes fewer false positive errors. | A model with high recall captures most positive instances.
6 | Formula: $\frac{TP}{TP + FP}$ | Formula: $\frac{TP}{TP + FN}$
1.Imbalanced Datasets:
2. Application-Specific Requirements:
• The choice between optimizing for precision or recall depends on the specific
requirements of the application. For example, in fraud detection, you might prioritize
recall to catch as many fraudulent transactions as possible, even if it means more false
positives.
© Research Wing | Complexity IT Job Care Page 34
• There is often a trade-off between precision and recall. Improving one usually leads to a
decrease in the other. The balance between them can be evaluated using the F1 score,
which is the harmonic mean of precision and recall:
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
• The F1 score provides a single metric that balances both precision and recall, making it
useful for evaluating model performance when both metrics are important.
Understanding and balancing precision and recall are crucial for developing effective
classification models, particularly in applications where the costs of false positives and
false negatives are significant. By focusing on these metrics, practitioners can tailor their
models to meet the specific needs and constraints of their applications.
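The trade-off is easiest to see by moving the decision threshold of a classifier that outputs scores. In the sketch below the scores, labels and function name are invented for illustration; an example is predicted positive when its score is at or above the threshold.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]  # hypothetical model scores
labels = [1, 1, 0, 1, 1, 0, 0, 0]                          # 1 = actual positive

for threshold in (0.8, 0.5, 0.25):
    precision, recall = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold from 0.8 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.67, so an application such as fraud detection that prioritizes recall would pick a lower threshold.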
Concept of Clustering
Definition:
Clustering involves partitioning a dataset into distinct groups where the data points
within each group are more similar to each other than to those in other groups.
Applications:
Genomic data analysis for grouping genes or proteins with similar expressions.
1. K-Means Clustering:
Description:
Use Cases:
2. Hierarchical Clustering:
Description:
Use Cases:
Feature Scaling:
Purpose:
Importance:
Normalization:
Process:
Importance:
31. What is dimensionality reduction, and how does it help in machine learning
tasks?
2.Computational Efficiency:
3.Visualization:
4.Noise Reduction:
5.Feature Engineering:
32. Describe the working of neural networks and their various architectures.
A neural network is a machine learning model inspired by the function and structure of
the human brain. It consists of interconnected nodes, or neurons, organized into
layers. Each neuron receives input signals, processes them through an activation
function, and passes the result to neurons in the next layer.
Basic Structure:
Forward Propagation:
Backpropagation:
• During training, the network adjusts its weights to minimize the prediction error
(loss).
• Backpropagation computes gradients of the loss with respect to each weight.
• Gradient descent updates weights to minimize the loss.
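These three steps can be traced by hand on the smallest possible network. The sketch below assumes a single sigmoid neuron, a squared-error loss and one training example; the function name, starting weights and learning rate are arbitrary illustrative choices.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_neuron(x, target, w=0.5, b=0.0, lr=0.5, steps=1000):
    """Train y_hat = sigmoid(w * x + b) on one example with squared-error loss."""
    losses = []
    for _ in range(steps):
        # Forward propagation: compute the prediction and the loss.
        z = w * x + b
        y_hat = sigmoid(z)
        losses.append((y_hat - target) ** 2)
        # Backpropagation: apply the chain rule from the loss back to w and b.
        dloss_dyhat = 2.0 * (y_hat - target)
        dyhat_dz = y_hat * (1.0 - y_hat)     # derivative of the sigmoid
        grad_w = dloss_dyhat * dyhat_dz * x  # dz/dw = x
        grad_b = dloss_dyhat * dyhat_dz      # dz/db = 1
        # Gradient descent: step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


w, b, losses = train_neuron(x=1.5, target=0.8)
print("first loss:", losses[0], "last loss:", losses[-1])
print("prediction:", sigmoid(w * 1.5 + b))
```

In a multi-layer network the same chain rule is applied layer by layer, reusing the gradient of each layer's output to compute the gradients of the layer before it.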
FNNs are the simplest form of neural networks, where information flows in one
direction, from input to output layers. They are suitable for tasks such as regression and
classification.
CNNs are specialized for processing grid-like data, such as images. They consist of
convolutional layers that apply filters to extract spatial hierarchies of features, followed
by pooling layers to reduce dimensionality and fully connected layers for classification.
RNNs are designed for sequential data processing, such as time series or natural
language. They have connections that form directed cycles, allowing them to maintain
internal state and capture temporal dependencies.
realistic data samples, while the discriminator learns to distinguish between real and
generated samples.
LSTMs are a variant of RNNs that address the vanishing gradient problem. They use
specialized memory cells with gating mechanisms to selectively retain or forget
information over long sequences.
Autoencoder:
Autoencoders are neural networks trained to reconstruct input data at the output layer,
typically through a bottleneck layer with a lower dimensionality than the input. They are
used for unsupervised feature learning and dimensionality reduction.