
Module 3

Uploaded by

Swathi Y

21CS644- INTRODUCTION TO DATA SCIENCE AND VISUALIZATION

Module-3
Syllabus:
Feature Generation and Feature Selection
Extracting Meaning from Data: Motivating application: user (customer) retention.
Feature Generation (brainstorming, role of domain expertise, and place for
imagination), Feature Selection algorithms. Filters; Wrappers; Decision Trees;
Random Forests. Recommendation Systems: Building a User-Facing Data Product,
Algorithmic ingredients of a Recommendation Engine, Dimensionality Reduction,
Singular Value Decomposition, Principal Component Analysis, Exercise: build your
own recommendation system.

Feature Selection
Feature Selection refers to the process of selecting the most relevant features (or
variables) from the dataset to use in building a predictive model. The goal is to
improve the performance of the model by eliminating irrelevant or redundant features,
which can lead to better accuracy, reduced overfitting, and more efficient computation.

Importance of Feature Selection: Feature selection helps simplify models, making them easier to interpret, reducing computational cost, and often improving generalization by reducing overfitting.

Why is Feature Selection important?

• It simplifies the model: data reduction, less storage, Occam’s razor, and better visualization
• Reduces training time
• Avoids overfitting
• Improves the accuracy of the model
• Avoids the curse of dimensionality

Types of Feature Selection Methods:

Feature selection methods can be grouped into three categories: filter methods, wrapper methods, and embedded methods.

[Figure: Three methods of feature selection]

Filter Methods: These use statistical techniques to evaluate the relevance of each feature individually, based on its relationship with the target variable. Examples include correlation coefficients, Chi-square tests, and mutual information.

A subset of features is selected based on their relationship to the target variable. The selection does not depend on any machine learning algorithm. Instead, filter methods measure the “relevance” of the features to the output via statistical tests. The following tests are commonly used:

Pearson’s Correlation

A statistic that measures the linear correlation between two continuous variables. It varies from −1 to +1, where +1 corresponds to perfect positive linear correlation, 0 to no linear correlation, and −1 to perfect negative linear correlation.

[Figure: Pearson’s r]
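
As a rough illustration of a correlation-based filter, the sketch below computes Pearson's r between each feature and the target and ranks the features by its absolute value. The feature names and numbers are hypothetical toy data, not taken from any dataset in these notes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: two candidate features and a numeric target.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [7, 9, 6, 8, 7],
}
target = [52, 58, 61, 70, 74]

# Filter step: rank features by |r| with the target, strongest first.
ranking = sorted(
    features,
    key=lambda name: abs(pearson_r(features[name], target)),
    reverse=True,
)
print(ranking)
```

A filter like this looks at one feature at a time, so it cannot detect features that are only useful in combination.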

LDA

Linear Discriminant Analysis is a supervised linear algorithm that projects the data into a lower-dimensional subspace of dimension k (at most C − 1 for C classes) while maximising the separation between the classes. More specifically, the model finds linear combinations of the features that achieve maximum separability between the classes and minimum variance within each class.

ANOVA

Analysis of Variance is a statistical method that tests whether different input categories have significantly different values for the output variable. The f_classif method from sklearn allows for the analysis of multiple groups of data to determine the variability between samples and within samples, in order to gain information about the relationship between the dependent and independent variables.

CHI SQUARE

The chi-squared test checks whether the occurrences of a specific feature and a specific class are independent, using their frequency distribution. The null hypothesis is that the two variables are independent; large values of χ² indicate that the null hypothesis should be rejected. When selecting features, we wish to keep those that are highly dependent on the output.
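
The chi-squared statistic can be sketched in a few lines, assuming the data has already been summarised as a contingency table of counts (rows are feature values, columns are classes). The two tables below are hypothetical and chosen only to contrast a dependent feature with an independent one.

```python
def chi_square(observed):
    """Chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Count expected in this cell if feature and class were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are feature values, columns are classes.
dependent = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]
print(chi_square(dependent), chi_square(independent))
```

A feature whose table gives a larger statistic is more strongly associated with the class, so a filter would rank it higher.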

Wrapper methods

In wrapper methods, the feature selection process is based on a specific machine learning algorithm that we are trying to fit on a given dataset.

It follows a greedy search approach, evaluating candidate combinations of features against the evaluation criterion. The evaluation criterion is simply the performance measure, which depends on the type of problem. For regression, the evaluation criterion can be p-values, R-squared, or adjusted R-squared; for classification, it can be accuracy, precision, recall, F1-score, etc. Finally, it selects the combination of features that gives the best results for the specified machine learning algorithm.

Most commonly used techniques under wrapper methods are:

Forward selection

Backward elimination

Bi-directional elimination (Stepwise Selection)
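
Forward selection, the first technique in the list above, can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: the `score` callable stands in for training and evaluating the chosen model on a feature subset, and the feature names and the toy scoring function are hypothetical.

```python
def forward_selection(features, score, max_features=None):
    """Greedily add the feature that most improves score(subset).

    Stops when no remaining feature improves the score, or when
    max_features have been selected.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining and (max_features is None or len(selected) < max_features):
        # Evaluate each remaining feature when added to the current subset.
        candidates = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(candidates)
        if top_score <= best_score:
            break  # no candidate improves the criterion
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Hypothetical stand-in for model performance: each feature adds some
# usefulness, and every extra feature costs a fixed complexity penalty.
usefulness = {"days_active": 0.25, "sessions": 0.15, "signup_month": 0.03, "plan_type": 0.05}

def toy_score(subset):
    return sum(usefulness[f] for f in subset) - 0.04 * len(subset)

selected, best = forward_selection(usefulness, toy_score)
print(selected)
```

Backward elimination follows the same pattern in reverse: start from all features and repeatedly drop the one whose removal improves the score most.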

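Embedded Methods: These involve algorithms that perform feature selection during the model training process. Regularization methods such as LASSO (Least Absolute Shrinkage and Selection Operator) are examples: the L1 penalty can shrink some coefficients exactly to zero, which removes those features from the model. Ridge Regression is often mentioned alongside LASSO, but its L2 penalty only shrinks coefficients toward zero without eliminating features.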

Criteria for Feature Selection:

Relevance: The feature should have a significant relationship with the target variable.

Redundancy: The feature should provide unique information that isn’t already
provided by another feature.

Interpretability: Selected features should make sense in the context of the domain knowledge.

Steps in Feature Selection:

Data Cleaning and Preprocessing: Ensure data quality before selecting features.

Initial Feature Selection: Use domain knowledge and exploratory data analysis
(EDA) to identify potential features.

Model-Based Selection: Apply algorithms and techniques to evaluate and select the
best subset of features.

Validation: Test the selected features on validation data to ensure they generalize
well.

Challenges and Considerations: Challenges include dealing with high-dimensional data, multicollinearity among features, and the trade-off between model complexity and performance. Iterative experimentation and validation are therefore important.

Feature selection is a fundamental step in the data science process that involves choosing the most informative and relevant variables to include in a model, with the aim of enhancing model performance and interpretability.

Decision Trees

A decision tree is a non-parametric supervised learning algorithm used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes, and it produces models that are easy to understand.

Decision trees are also used in decision support, where they depict decisions and their potential outcomes, incorporating chance events, resource expenses, and utility. The model is built from conditional control statements.

As the name suggests, a decision tree uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits. It starts with a root node and ends with a decision made at the leaves.

Example of Decision Tree


Let’s understand decision trees with the help of an example.

Decision trees are drawn upside down: the root is at the top, and it is split into several nodes. In layman’s terms, a decision tree is a series of if-else statements. At each node it checks a condition and, depending on the outcome, moves to the next node attached to that decision.

In the diagram below, the tree first asks what the weather is: sunny, cloudy, or rainy. Depending on the answer, it moves to the next feature, which is humidity or wind. For example, it checks whether the wind is strong or weak; if the wind is weak and the weather is rainy, the person may go and play.
https://www.analyticsvidhya.com/blog/2021/08/decision-tree-algorithm/

https://medium.com/geekculture/step-by-step-decision-tree-id3-algorithm-from-scratch-in-python-no-fancy-library-4822bbfdd88f

ID3 Algorithm Decision Tree – Solved Example – Machine Learning

Problem Definition:
Build a decision tree using ID3 algorithm for the given training data in the table (Buy
Computer data), and predict the class of the following new example: age<=30,
income=medium, student=yes, credit-rating=fair

Solution:

First, check which attribute provides the highest Information Gain in order to split the training set based on that attribute. We need to calculate the expected information needed to classify the set and the entropy of each attribute.

The information gain of an attribute A is the entropy of the whole set minus the weighted entropy after splitting on that attribute:

Gain(A) = Entropy(S) – Entropy(A)

The training set contains 9 “yes” and 5 “no” records, so the entropy of the whole set is

Entropy(S) = E(9,5) = -9/14 log2(9/14) – 5/14 log2(5/14) = 0.94

Now Consider the Age attribute

For Age, we have three values: age<=30 (2 yes and 3 no), age 31..40 (4 yes and 0 no), and age>40 (3 yes and 2 no)

Entropy(age) = 5/14 (-2/5 log2(2/5) - 3/5 log2(3/5)) + 4/14 (0) + 5/14 (-3/5 log2(3/5) - 2/5 log2(2/5))

= 5/14(0.9709) + 0 + 5/14(0.9709) = 0.6935

Gain(age) = 0.94 – 0.6935 = 0.2465

Next, consider Income Attribute

For Income, we have three values: income=high (2 yes and 2 no), income=medium (4 yes and 2 no), and income=low (3 yes and 1 no)

Entropy(income) = 4/14 (-2/4 log2(2/4) - 2/4 log2(2/4)) + 6/14 (-4/6 log2(4/6) - 2/6 log2(2/6)) + 4/14 (-3/4 log2(3/4) - 1/4 log2(1/4))

= 4/14 (1) + 6/14 (0.918) + 4/14 (0.811)

= 0.285714 + 0.393428 + 0.231714 = 0.9108

Gain(income) = 0.94 – 0.9108 = 0.0292

Next, consider Student Attribute

For Student, we have two values: student=yes (6 yes and 1 no) and student=no (3 yes and 4 no)

Entropy(student) = 7/14 (-6/7 log2(6/7) - 1/7 log2(1/7)) + 7/14 (-3/7 log2(3/7) - 4/7 log2(4/7))

= 7/14(0.5916) + 7/14(0.9852)

= 0.2958 + 0.4926 = 0.7884

Gain (student) = 0.94 – 0.7884 = 0.1516

Finally, consider Credit_Rating Attribute


For Credit_Rating, we have two values: credit_rating=fair (6 yes and 2 no) and credit_rating=excellent (3 yes and 3 no)

Entropy(credit_rating) = 8/14 (-6/8 log2(6/8) - 2/8 log2(2/8)) + 6/14 (-3/6 log2(3/6) - 3/6 log2(3/6))

= 8/14(0.8112) + 6/14(1)

= 0.4635 + 0.4285 = 0.8920

Gain(credit_rating) = 0.94 – 0.8920 = 0.048

Since Age has the highest Information Gain we start splitting the dataset using
the age attribute.
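
The hand calculation for the root split can be cross-checked with a short script. This is a minimal sketch: the function names are illustrative, and the (yes, no) counts are copied from the attribute summaries in the worked example.

```python
import math

def entropy(counts):
    """Entropy in bits of a class distribution given as counts, e.g. (9, 5)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, partitions):
    """Parent entropy minus the weighted entropy of the partitions after a split."""
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

# (yes, no) counts for each attribute value, from the worked example.
splits = {
    "age": [(2, 3), (4, 0), (3, 2)],
    "income": [(2, 2), (4, 2), (3, 1)],
    "student": [(6, 1), (3, 4)],
    "credit_rating": [(6, 2), (3, 3)],
}

gains = {name: information_gain((9, 5), parts) for name, parts in splits.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:.4f}")

best_attribute = max(gains, key=gains.get)
print("split on:", best_attribute)
```

Because the script does not round intermediate values, its results can differ from the hand-computed figures in the last decimal place.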

Decision Tree after step 1


Since all records under the branch age 31..40 belong to the class Yes, we can replace that branch with the leaf Class=Yes

Decision Tree after step 1_1

Now build the decision tree for the left subtree

The same process of splitting has to happen for the two remaining branches.
Left sub-branch
For branch age<=30 we still have attributes income, student, and credit_rating. Which
one should be used to split the partition?

The entropy of this partition is E(S_age<=30) = E(2,3) = -2/5 log2(2/5) – 3/5 log2(3/5) = 0.97

For Income, we have three values: income=high (0 yes and 2 no), income=medium (1 yes and 1 no), and income=low (1 yes and 0 no)

Entropy(income) = 2/5(0) + 2/5 (-1/2log2(1/2)-1/2log2(1/2)) + 1/5 (0) = 2/5 (1) = 0.4

Gain(income) = 0.97 – 0.4 = 0.57

For Student, we have two values: student=yes (2 yes and 0 no) and student=no (0 yes and 3 no)

Entropy(student) = 2/5(0) + 3/5(0) = 0

Gain (student) = 0.97 – 0 = 0.97

We can then safely split on the attribute student without checking the remaining attribute, since this gain equals the entropy of the partition, which is the maximum possible.

Decision Tree after step 2


Since each of these two new branches contains records of a single class, we make them into leaf nodes with their respective class as label:

Decision Tree after step 2_2

Now build the decision tree for the right subtree

Right sub-branch
The entropy of this partition is Entropy(S_age>40) = E(3,2) = -3/5 log2(3/5) – 2/5 log2(2/5) = 0.97

For Income, we have two values: income=medium (2 yes and 1 no) and income=low (1 yes and 1 no)

Entropy(income) = 3/5(-2/3log2(2/3)-1/3log2(1/3)) + 2/5 (-1/2log2(1/2)-1/2log2(1/2))

= 3/5(0.9182) + 2/5 (1) = 0.55 + 0.4 = 0.95

Gain(income) = 0.97 – 0.95 = 0.02

For Student, we have two values: student=yes (2 yes and 1 no) and student=no (1 yes and 1 no)

Entropy(student) = 3/5(-2/3log2(2/3)-1/3log2(1/3)) + 2/5(-1/2log2(1/2)-1/2log2(1/2))

= 0.95

Gain (student) = 0.97 – 0.95 = 0.02

For Credit_Rating, we have two values: credit_rating=fair (3 yes and 0 no) and credit_rating=excellent (0 yes and 2 no)

Entropy(credit_rating) = 0
Gain(credit_rating) = 0.97 – 0 = 0.97

We then split based on credit_rating. These splits give partitions each with records
from the same class. We just need to make these into leaf nodes with their class label
attached:

Decision Tree for Buys Computer


New example: age<=30, income=medium, student=yes, credit-rating=fair

Following the branch age<=30 and then student=yes, we predict Class=yes

Buys_computer = yes
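
The finished tree and the prediction step can be written down directly. The nested-dictionary representation and the `predict` helper below are illustrative choices, not part of the ID3 algorithm itself; the branches and leaves follow the worked example.

```python
# An internal node maps an attribute name to {value: subtree};
# a leaf is simply the class label.
tree = {
    "age": {
        "<=30": {"student": {"yes": "yes", "no": "no"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"fair": "yes", "excellent": "no"}},
    }
}

def predict(node, example):
    """Walk from the root to a leaf, following the branch that matches the example."""
    while isinstance(node, dict):
        attribute = next(iter(node))                # attribute tested at this node
        node = node[attribute][example[attribute]]  # follow the matching branch
    return node

new_example = {
    "age": "<=30",
    "income": "medium",
    "student": "yes",
    "credit_rating": "fair",
}
print(predict(tree, new_example))
```

Note that income never appears in the tree, so its value in the new example does not affect the prediction.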

Random Forest

Random forest is a supervised learning algorithm. The “forest” it builds is an ensemble of decision trees, usually trained with the bagging method. The general idea of the bagging method is that a combination of learning models improves the overall result.

Put simply: random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Random Forest Algorithm:

Step 1: Select K random data points from the training set.

Step 2: Build the decision tree associated with the selected data points (subset).

Step 3: Choose the number N of decision trees that you want to build.

Step 4: Repeat Steps 1 and 2 until N trees have been built.

Step 5: For new data points, find the predictions of each decision tree, and assign the new data points to the category that wins the majority vote.

How Does Random Forest Work?

The Random Forest algorithm works in several steps, which are discussed below:

Ensemble of Decision Trees: Random Forest leverages the power of ensemble learning by constructing an army of Decision Trees. These trees are like individual experts, each specializing in a particular aspect of the data. Importantly, they operate independently, minimizing the risk of the model being overly influenced by the nuances of a single tree.

Random Feature Selection: To ensure that each decision tree in the ensemble brings a
unique perspective, Random Forest employs random feature selection. During the
training of each tree, a random subset of features is chosen. This randomness ensures
that each tree focuses on different aspects of the data, fostering a diverse set of
predictors within the ensemble.

Bootstrap Aggregating or Bagging: The technique of bagging is a cornerstone of Random Forest’s training strategy which involves creating multiple bootstrap samples from the original dataset, allowing instances to be sampled with replacement. This results in different subsets of data for each decision tree, introducing variability in the training process and making the model more robust.

Decision Making and Voting: When it comes to making predictions, each decision
tree in the Random Forest casts its vote. For classification tasks, the final prediction is
determined by the mode (most frequent prediction) across all the trees. In regression
tasks, the average of the individual tree predictions is taken. This internal voting
mechanism ensures a balanced and collective decision-making process.
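
The three ingredients above (bootstrap samples, random feature selection, and majority voting) can be sketched together. This is a simplified sketch, not a full random forest: each "tree" is a one-split decision stump, and the function names, toy data, and parameters are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Find the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]  # (feature, threshold, left_label, right_label)

def fit_forest(rows, labels, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]            # sample with replacement
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feats))
    return forest

def forest_predict(forest, row):
    """Each stump votes; the most frequent class wins."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical toy data: two well-separated classes in two dimensions.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
forest = fit_forest(rows, labels)
print(forest_predict(forest, (0, 0)), forest_predict(forest, (10, 10)))
```

For regression, the vote in `forest_predict` would be replaced by the average of the individual predictions.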

Key Features of Random Forest

Some of the Key Features of Random Forest are discussed below:

High Predictive Accuracy: Imagine Random Forest as a team of decision-making wizards. Each wizard (decision tree) looks at a part of the problem, and together, they weave their insights into a powerful prediction tapestry. This teamwork often results in a more accurate model than what a single wizard could achieve.

Resistance to Overfitting: Random Forest is like a cool-headed mentor guiding its apprentices (decision trees). Instead of letting each apprentice memorize every detail of their training, it encourages a more well-rounded understanding. This approach helps prevent getting too caught up with the training data which makes the model less prone to overfitting.

Large Datasets Handling: Dealing with a mountain of data? Random Forest tackles it like a seasoned explorer with a team of helpers (decision trees). Each helper takes on a part of the dataset, ensuring that the expedition is not only thorough but also surprisingly quick.

Variable Importance Assessment: Think of Random Forest as a detective at a crime scene, figuring out which clues (features) matter the most. It assesses the importance of each clue in solving the case, helping you focus on the key elements that drive predictions.

Built-in Cross-Validation: Random Forest is like having a personal coach that keeps
you in check. As it trains each decision tree, it also sets aside a secret group of cases
(out-of-bag) for testing. This built-in validation ensures your model doesn’t just ace
the training but also performs well on new challenges.

Handling Missing Values: Life is full of uncertainties, just like datasets with missing
values. Random Forest is the friend who adapts to the situation, making predictions
using the information available. It doesn’t get flustered by missing pieces; instead, it
focuses on what it can confidently tell us.

Parallelization for Speed: Random Forest is your time-saving buddy. Picture each
decision tree as a worker tackling a piece of a puzzle simultaneously. This parallel
approach taps into the power of modern tech, making the whole process faster and
more efficient for handling large-scale projects.

https://www.geeksforgeeks.org/random-forest-algorithm-in-machine-learning/

Developing a Recommendation System

https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-recommendation-engine-python

Introduction

In today’s world, every customer faces multiple choices, such as finding a book to
read without a specific idea in mind, leading to time-consuming searches and reliance
on recommendations from others. However, a recommendation engine could
streamline this process by suggesting books based on previous reads, saving time and
enhancing the user experience. Recommendation engines, widely used by businesses
like Amazon, Netflix, Google, and Goodreads, leverage machine learning to provide
personalized suggestions. This article explores various recommendation engine
algorithms, the mathematics behind them, and demonstrates creating a
recommendation engine using matrix factorization in Python.

Problem Statement

Many online businesses rely on customer reviews and ratings. Explicit feedback is
especially important in the entertainment and ecommerce industry where all customer
engagements are impacted by these ratings. Netflix relies on such rating data to power
its recommendation engine to provide the best movie and TV series recommendations
that are personalized and most relevant to the user.

This practice problem challenges participants to predict the ratings that users give to jokes, given the ratings the same users gave to another set of jokes. The dataset is taken from the well-known Jester online joke recommender system.

What are Recommendation Engines?

Till recently, people generally tended to buy products recommended to them by their
friends or the people they trust. This used to be the primary method of purchase when
there was any doubt about the product. But with the advent of the digital age, that
circle has expanded to include online sites that utilize some sort of recommendation
engine.

A recommendation engine filters the data using different algorithms and recommends
the most relevant items to users. It first captures the past behavior of a customer and
based on that, recommends products which the users might be likely to buy.

If a completely new user visits an e-commerce site, that site will not have any past
history of that user. So how does the site go about recommending products to the user
in such a scenario? One possible solution could be to recommend the best selling
products, i.e. the products which are high in demand. Another possible solution could
be to recommend the products which would bring the maximum profit to the business.

If we can recommend a few items to a customer based on their needs and interests, it will create a positive impact on the user experience and lead to frequent visits. Hence, businesses nowadays are building smart and intelligent recommendation engines by studying the past behavior of their users.

Now that we have an intuition for recommendation engines, let’s look at how they work.

How does a Recommendation Engine Work?

Before we deep dive into this topic, first we’ll think of how we can recommend items
to users:

We can recommend items to a user which are most popular among all the users

We can divide the users into multiple segments based on their preferences (user
features) and recommend items to them based on the segment they belong to

Both of the above methods have their drawbacks. In the first case, the most popular
items would be the same for each user so everybody will see the same
recommendations. While in the second case, as the number of users increases, the
number of features will also increase. So classifying the users into various segments
will be a very difficult task.

The main problem here is that we are unable to tailor recommendations based on the
specific interest of the users. It’s like Amazon is recommending you buy a laptop just
because it’s been bought by the majority of the shoppers. But thankfully, Amazon (or
any other big firm) does not recommend products using the above mentioned
approach. They use some personalized methods which help them in recommending
products more accurately.

Let’s now focus on how a recommendation engine works by going through the
following steps.

Step1: Data Collection

This is the first and most crucial step for building a recommendation engine. The data
can be collected by two means: explicitly and implicitly. Explicit data is information
that is provided intentionally, i.e. input from the users such as movie ratings. Implicit
data is information that is not provided intentionally but gathered from available data
streams like search history, clicks, order history, etc.

There are various algorithms that help us make the filtering process easier. In the next
section, we will go through each algorithm in detail.

Content based filtering

This algorithm recommends products which are similar to the ones that a user has liked in the past.

For example, if a person has liked the movie “Inception”, then this algorithm will recommend movies that fall under the same genre. But how does the algorithm understand which genre to pick and recommend movies from?

Consider Example of Netflix

Recommendation engines save all information related to each user in a vector form known as the profile vector, which contains the user’s past behavior, including liked or disliked movies and given ratings. Information about movies is stored in another vector called the item vector, which includes details such as genre, cast, and director. The content-based filtering algorithm uses cosine similarity, i.e. the cosine of the angle between the profile vector and the item vector. If A is the profile vector and B is the item vector, the similarity between them is:

sim(A, B) = cos(θ) = (A · B) / (‖A‖ ‖B‖)

Based on the cosine value, which ranges from −1 to 1, the movies are arranged in descending order and one of the two approaches below is used for recommendations:

Top-n approach: where the top n movies are recommended (Here n can be decided by
the business)

Rating scale approach: Where a threshold is set and all the movies above that
threshold are recommended
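
Cosine similarity and the two recommendation approaches above can be sketched as follows. The three-dimensional genre weights, the item titled "A Romance Film", and the threshold of 0.5 are hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical genre weights: [sci-fi, thriller, romance].
profile = [0.9, 0.7, 0.1]  # user profile vector
items = {                  # item vectors
    "Interstellar": [1.0, 0.3, 0.1],
    "The Prestige": [0.5, 1.0, 0.0],
    "A Romance Film": [0.0, 0.0, 1.0],
}

scores = {title: cosine_similarity(profile, vec) for title, vec in items.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # descending similarity

top_n = ranked[:2]                                         # top-n approach, n = 2
above_threshold = [t for t in ranked if scores[t] >= 0.5]  # threshold approach
print(top_n, above_threshold)
```

The top-n approach always returns the same number of items, while the threshold approach can return anywhere from none to all of them, depending on how the scores fall.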

Other methods that can be used to calculate the similarity are:

Euclidean Distance: Similar items will lie in close proximity to each other if plotted in n-dimensional space. So, we can calculate the distance between items and, based on that distance, recommend items to the user. The formula for the Euclidean distance between two items x and y is given by:

d(x, y) = sqrt((x1 – y1)² + (x2 – y2)² + … + (xn – yn)²)

Pearson’s Correlation: It tells us how much two items are correlated. The higher the correlation, the greater the similarity. Pearson’s correlation between two items x and y can be calculated using the following formula:

r(x, y) = Σ (xi – x̄)(yi – ȳ) / ( sqrt(Σ (xi – x̄)²) × sqrt(Σ (yi – ȳ)²) )

where x̄ and ȳ are the means of x and y.
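
With a distance measure, the most similar item is the one with the smallest value rather than the largest. A minimal sketch for Euclidean distance, assuming hypothetical item vectors of genre weights:

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical item vectors: [sci-fi, thriller, romance].
liked_item = [1.0, 0.3, 0.1]
candidates = {
    "The Prestige": [0.5, 1.0, 0.0],
    "A Romance Film": [0.0, 0.0, 1.0],
}

# Recommend the candidate closest to an item the user already liked.
nearest = min(candidates, key=lambda t: euclidean_distance(liked_item, candidates[t]))
print(nearest)
```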

The main flaw of this algorithm is that it is limited to recommending items of the same type: it never recommends items outside the categories the user has previously purchased or liked. To improve on this, an algorithm should also consider user behavior when making recommendations.

Collaborative filtering

Let us understand this with an example. If person A likes 3 movies, say Interstellar,
Inception and Predestination, and person B likes Inception, Predestination and The
Prestige, then they have almost similar interests. We can say with some certainty that
A should like The Prestige and B should like Interstellar. The collaborative filtering
algorithm uses “User Behavior” for recommending items. This is one of the most commonly used algorithms in the industry as it is not dependent on any additional information. There are different types of collaborative filtering techniques and we shall look at them in detail below.

User-User collaborative filtering

This algorithm first finds the similarity score between users. Based on this similarity
score, it then picks out the most similar users and recommends products which these
similar users have liked or bought previously.

In terms of our movies example from earlier, this algorithm finds the similarity
between each user based on the ratings they have previously given to different movies.
The predicted rating of an item i for a user u is calculated as the weighted sum of the ratings given to item i by other users, weighted by their similarity to u.

The prediction Pu,i is given by:

Pu,i = Σv (Rv,i × Su,v) / Σv Su,v

Here,

Pu,i is the predicted rating of item i for user u

Rv,i is the rating given by a user v to a movie i

Su,v is the similarity between users u and v

Now, we have the ratings of the users in their profile vectors, and based on these we have to predict the ratings for other users. The following steps are followed to do so:

For predictions we need the similarity between the users u and v. We can make use of Pearson correlation.

First we find the items rated by both users and, based on those ratings, calculate the correlation between the users.

The predictions can then be calculated using the similarity values. The algorithm first calculates the similarity between each pair of users and then, based on each similarity, calculates the predictions. Users with a higher correlation tend to be more similar.

Based on these prediction values, recommendations are made. Let us understand it with an example:

Consider the user-movie rating matrix:

User/Movie   x1   x2   x3   x4   x5   Mean User Rating
A            4    1    –    4    –    3
B            –    4    –    2    3    3
C            –    1    –    4    4    3

Here we have a user-movie rating matrix. To understand this in a more practical manner, let’s find the similarity between users (A, C) and (B, C) in the above table. The movies rated by both A and C are x2 and x4, and the movies rated by both B and C are x2, x4 and x5.

The correlation between users A and C is higher than the correlation between users B and C. Hence users A and C are more similar, and the movies liked by user A will be recommended to user C and vice versa.
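
The comparison for this rating matrix can be reproduced with a short script. This is a sketch under two assumptions that the notes do not spell out: each user's ratings are centred on the mean user rating from the table, and predictions use only neighbours with positive similarity so that the weighted sum stays well defined.

```python
import math

# Ratings from the user-movie matrix; missing entries are simply absent.
ratings = {
    "A": {"x1": 4, "x2": 1, "x4": 4},
    "B": {"x2": 4, "x4": 2, "x5": 3},
    "C": {"x2": 1, "x4": 4, "x5": 4},
}

def mean_rating(user):
    return sum(ratings[user].values()) / len(ratings[user])

def similarity(u, v):
    """Pearson-style correlation over co-rated items, centred on each user's mean."""
    common = ratings[u].keys() & ratings[v].keys()
    du = [ratings[u][i] - mean_rating(u) for i in common]
    dv = [ratings[v][i] - mean_rating(v) for i in common]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return num / den if den else 0.0

def predict(u, item):
    """Similarity-weighted average of neighbours' ratings (positive similarity only)."""
    neighbours = [
        (similarity(u, v), r[item])
        for v, r in ratings.items()
        if v != u and item in r and similarity(u, v) > 0
    ]
    if not neighbours:
        return None  # nobody similar has rated this item
    return sum(s * r for s, r in neighbours) / sum(s for s, _ in neighbours)

print(similarity("A", "C"), similarity("B", "C"))
print(predict("C", "x1"))
```

User C's predicted rating for x1 comes entirely from user A, the only similar user who has rated it.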

This algorithm is quite time-consuming, as it involves calculating the similarity for each pair of users and then calculating a prediction from each similarity score. One way of handling this problem is to select only a few users (neighbors) instead of all of them to make predictions, i.e. instead of using all similarity values, we use only a few. There are various ways to select the neighbors:

Select a threshold similarity and choose all the users above that value

Randomly select the users

Arrange the neighbors in descending order of their similarity value and choose the top-N users

Use clustering for choosing neighbors

This algorithm is useful when the number of users is small. It is not effective when there are a large number of users, as it takes a lot of time to compute the similarity between all user pairs. This leads us to item-item collaborative filtering, which is effective when the number of users is larger than the number of items being recommended.

Item-Item collaborative filtering

In this algorithm, we compute the similarity between each pair of items.

The algorithm finds the similarity between pairs of movies and recommends movies similar to those the user has already rated highly. Predictions are computed in the same way as in user-user collaborative filtering, except that the weighted sum is taken over the ratings of “item-neighbors” instead of “user-neighbors”.
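
A minimal sketch of the item-item variant on the same small rating matrix is given below. The choice of plain cosine similarity over co-rating users is an assumption made for illustration; with so few ratings, item pairs that share only one rater always get similarity 1, so the numbers are not meaningful beyond showing the mechanics.

```python
import math

# Same user-movie ratings as in the user-user example; missing entries are absent.
ratings = {
    "A": {"x1": 4, "x2": 1, "x4": 4},
    "B": {"x2": 4, "x4": 2, "x5": 3},
    "C": {"x2": 1, "x4": 4, "x5": 4},
}

def item_similarity(i, j):
    """Cosine similarity between two items over the users who rated both."""
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    if not users:
        return 0.0
    a = [ratings[u][i] for u in users]
    b = [ratings[u][j] for u in users]
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def predict(user, item):
    """Weighted sum of the user's own ratings, weighted by item-item similarity."""
    pairs = [(item_similarity(item, j), r) for j, r in ratings[user].items() if j != item]
    total = sum(s for s, _ in pairs)
    return sum(s * r for s, r in pairs) / total if total else None

print(item_similarity("x2", "x4"))
print(predict("C", "x1"))
```

The weights here are similarities between items (the "item-neighbors"), and the ratings being averaged are the target user's own ratings of those items.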

There are various algorithms that help us make the filtering process easier. In the
next section, we will go through each algorithm in detail.

Content based filtering

This algorithm recommends products which are similar to the ones that a user has
liked in the past.
For example, if a person has liked the movie “Inception”, then this algorithm will
recommend movies that fall under the same genre. But how does the algorithm
understand which genre to pick and recommend movies from?

Consider Example of Netflix

Recommendation engines save all information related to each user in a vector


form known as the profile vector, which contains the user’s past behavior,
including liked or disliked movies and given ratings. Information about movies is
stored in another vector called the item vector, which includes details such as
genre, cast, and director. The content-based filtering algorithm uses cosine
similarity to find the cosine of the angle between the profile vector and the item
vector. If A is the profile vector and B is the item vector, the similarity between
them can be calculated as the cosine of the angle between these two vectors.

Based on the cosine value, which ranges between -1 to 1, the movies are arranged
in descending order and one of the two below approaches is used for
recommendations:
 Top-n approach: where the top n movies are recommended (Here n can be
decided by the business)

 Rating scale approach: Where a threshold is set and all the movies above
that threshold are recommended

Other methods that can be used to calculate the similarity are:

 Euclidean Distance: Similar items will lie in close proximity to each other
if plotted in n-dimensional space. So, we can calculate the distance
between items and based on that distance, recommend items to the user.
The formula for the euclidean distance is given by:

 Pearson’s Correlation: It tells us how much two items are correlated.


Higher the correlation, more will be the similarity. Pearson’s correlation
can be calculated using the following formula:

The algorithm’s main flaw is its narrow recommendation of items of the same type,
never recommending products the user hasn’t previously purchased or liked. To
improve, an algorithm should consider user behavior in recommendation.

Collaborative filtering

Let us understand this with an example. If person A likes 3 movies, say Interstellar,
Inception and Predestination, and person B likes Inception, Predestination and The
Prestige, then they have almost similar interests. We can say with some certainty
that A should like The Prestige and B should like Interstellar. The collaborative
filtering algorithm uses “User Behavior” for recommending items. This is one of
the most commonly used algorithms in the industry as it is not dependent on any
additional information. There are different types of collaborating filtering
techniques and we shall look at them in detail below.

User-User collaborative filtering

This algorithm first finds the similarity score between users. Based on this
similarity score, it then picks out the most similar users and recommends products
which these similar users have liked or bought previously.

In terms of our movies example from earlier, this algorithm finds the similarity
between each user based on the ratings they have previously given to different
movies. The prediction of an item for a user u is calculated by computing the
weighted sum of the user ratings given by other users to an item i.

The prediction P_{u,i} is given by:

$$ P_{u,i} = \frac{\sum_{v} R_{v,i} \, S_{u,v}}{\sum_{v} S_{u,v}} $$

Here,
 P_{u,i} is the predicted rating of item i for user u

 R_{v,i} is the rating given by a user v to a movie i

 S_{u,v} is the similarity between users u and v
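The weighted sum can be sketched as follows. This is a minimal version that assumes all similarity values are positive, so that dividing by their sum is meaningful; the function name and the sample numbers are hypothetical.

```python
def predict_rating(similarities, ratings):
    """Similarity-weighted average of neighbors' ratings for one item.

    similarities: {neighbor: S_uv}, similarity of each neighbor v to user u
    ratings:      {neighbor: R_vi}, rating each neighbor v gave to item i
    """
    numerator = 0.0
    denominator = 0.0
    for neighbor, rating in ratings.items():
        similarity = similarities.get(neighbor)
        if similarity is None:
            continue  # no similarity known for this neighbor, skip it
        numerator += rating * similarity
        denominator += similarity
    if denominator == 0:
        return None  # no usable neighbors, so no prediction
    return numerator / denominator

# Hypothetical neighbors of user u and their ratings of item i
similarities = {"v1": 0.9, "v2": 0.1}
ratings = {"v1": 5, "v2": 1}
print(predict_rating(similarities, ratings))
```

The more similar neighbor dominates the result, which is the intent of weighting by similarity.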

Now, we have the users’ ratings in the profile vector and, based on them, we have to
predict the ratings for other users. The following steps are followed to do so:

 For predictions we need the similarity between the users u and v. We can
make use of Pearson correlation.

 First we find the items rated by both the users and, based on those ratings,
the correlation between the users is calculated.

 The predictions are calculated using the similarity values. The
algorithm first calculates the similarity between each pair of users and then,
based on each similarity, calculates the predictions. Users with higher
correlation tend to be similar.

 Based on these prediction values, recommendations are made.

Let us understand it with an example:

Consider the user-movie rating matrix:

User/Movie   x1   x2   x3   x4   x5   Mean User Rating
A            4    1    –    4    –    3
B            –    4    –    2    3    3
C            –    1    –    4    4    3

Here we have a user-movie rating matrix. To understand this in a more practical
manner, let’s find the similarity between users (A, C) and (B, C) in the above table.
The movies rated by both A and C are x2 and x4, and the movies rated by both B and C
are x2, x4 and x5.
The correlation between users A and C is higher than the correlation between B and
C. Hence users A and C are more similar, and the movies liked by user A will
be recommended to user C and vice versa.
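The comparison can be checked with a short script. It is a sketch under one assumption: each rating is centered on that user's mean over all the movies they rated (the "Mean User Rating" column), and the correlation is then taken over the commonly rated movies only. The dictionary layout and function names are illustrative.

```python
import math

# Ratings from the table; missing entries are simply absent
ratings = {
    "A": {"x1": 4, "x2": 1, "x4": 4},
    "B": {"x2": 4, "x4": 2, "x5": 3},
    "C": {"x2": 1, "x4": 4, "x5": 4},
}

def user_mean(user):
    """Mean of all ratings given by one user."""
    values = ratings[user].values()
    return sum(values) / len(values)

def user_similarity(u, v):
    """Correlation between two users over the movies both have rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    mean_u, mean_v = user_mean(u), user_mean(v)
    dev_u = [ratings[u][m] - mean_u for m in common]
    dev_v = [ratings[v][m] - mean_v for m in common]
    numerator = sum(a * b for a, b in zip(dev_u, dev_v))
    denominator = math.sqrt(sum(a * a for a in dev_u)) * math.sqrt(sum(b * b for b in dev_v))
    return numerator / denominator if denominator else 0.0

print(round(user_similarity("A", "C"), 3))  # close to 1: identical ratings on x2 and x4
print(round(user_similarity("B", "C"), 3))  # negative: B and C disagree on x2 and x4
```

Under this assumption A and C come out almost perfectly correlated while B and C are negatively correlated, which agrees with the conclusion in the text.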

This algorithm is quite time consuming as it involves calculating the similarity for
each pair of users and then calculating a prediction for each similarity score. One
way of handling this problem is to select only a few users (neighbors) instead of all
of them to make predictions, i.e. instead of making predictions for all similarity
values, we choose only a few similarity values. There are various ways to select the
neighbors:

 Select a threshold similarity and choose all the users above that value

 Randomly select the users

 Arrange the neighbors in descending order of their similarity value and
choose the top-N users

 Use clustering for choosing neighbors
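Two of these strategies, the threshold and the top-N selection, can be sketched directly. The similarity values and user names below are made up for illustration.

```python
def neighbors_above_threshold(similarities, threshold):
    """Keep every user whose similarity exceeds the threshold."""
    return [user for user, score in similarities.items() if score > threshold]

def top_n_neighbors(similarities, n):
    """Sort users by descending similarity and keep the first n."""
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return [user for user, _ in ranked[:n]]

# Hypothetical similarities between a target user and four other users
similarities = {"u1": 0.9, "u2": 0.2, "u3": 0.6, "u4": -0.3}

print(neighbors_above_threshold(similarities, 0.5))
print(top_n_neighbors(similarities, 2))
```

A threshold gives a variable number of neighbors per user, while top-N fixes the count, so the choice affects both prediction cost and how noisy the neighborhood is.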

This algorithm is useful when the number of users is small. It is not effective when
there are a large number of users, as it takes a lot of time to compute the
similarity between all user pairs. This leads us to item-item collaborative filtering,
which is effective when the number of users is larger than the number of items being
recommended.

Item-Item collaborative filtering

In this algorithm, we compute the similarity between each pair of items.

The algorithm finds the similarity between each pair of movies and, based on that,
recommends movies similar to those the user has liked in the past. It works like
user-user collaborative filtering with one change: it uses the weighted sum of ratings
of “item-neighbors” instead of “user-neighbors” to compute the predictions.

Now we will find the similarity between items.

Now, as we have the similarity between each movie and the ratings, predictions are
made and based on those predictions, similar movies are recommended. Let us
understand it with an example.

User/Movie         x1   x2   x3   x4   x5
A                  4    1    2    4    4
B                  2    4    4    2    1
C                  –    1    –    3    4
Mean Item Rating   3    2    3    3    3

Here the mean item rating is the average of all ratings given to a particular item
(compare this with the mean user rating in the user-user filtering table). Instead of
finding user-user similarity, item-item similarity is calculated. For example, to
compare the movie pairs (x1, x4) and (x1, x5), we first find the users who have rated
both movies in each pair: for x1 and x4 these are A and B, and for x1 and x5 they are
also A and B.

The similarity between movie x1 and x4 is more than the similarity between movie x1
and x5. So based on these similarity values, if any user searches for movie x1, they
will be recommended movie x4 and vice versa.
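The item-item comparison can be reproduced with the sketch below. It assumes that each rating is centered on the item's mean (the "Mean Item Rating" row) and that the similarity is computed over users who rated both movies; the names are illustrative.

```python
import math

# Ratings from the table; user C has not rated x1 or x3
ratings = {
    "A": {"x1": 4, "x2": 1, "x3": 2, "x4": 4, "x5": 4},
    "B": {"x1": 2, "x2": 4, "x3": 4, "x4": 2, "x5": 1},
    "C": {"x2": 1, "x4": 3, "x5": 4},
}

def item_mean(item):
    """Mean of all ratings given to one item."""
    values = [user_ratings[item] for user_ratings in ratings.values() if item in user_ratings]
    return sum(values) / len(values)

def item_similarity(i, j):
    """Correlation between two items over the users who rated both."""
    common = [u for u, user_ratings in ratings.items() if i in user_ratings and j in user_ratings]
    mean_i, mean_j = item_mean(i), item_mean(j)
    dev_i = [ratings[u][i] - mean_i for u in common]
    dev_j = [ratings[u][j] - mean_j for u in common]
    numerator = sum(a * b for a, b in zip(dev_i, dev_j))
    denominator = math.sqrt(sum(a * a for a in dev_i)) * math.sqrt(sum(b * b for b in dev_j))
    return numerator / denominator if denominator else 0.0

print(round(item_similarity("x1", "x4"), 3))
print(round(item_similarity("x1", "x5"), 3))
```

With only two common raters per pair the values are very coarse, which is one reason real systems usually require a minimum number of co-ratings before trusting a similarity.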

https://www.analyticsvidhya.com/blog/2020/08/recommendation-system-k-nearest-neighbors/

The kNN (k-nearest neighbors) algorithm is an intuitive approach to recommendation
that uses user or item similarity to provide personalized recommendations. kNN
recommender systems are used in areas such as e-commerce, social media, and
healthcare.

The nearest neighbor algorithm is a popular approach in recommendation systems for
identifying items or users that are most similar to a given item or user.

Some problems with nearest neighbors:

Data Quality Issues: Check the quality and consistency of your data. Ensure that
your data is clean, free from outliers, and properly preprocessed (e.g., normalized
or standardized).

Distance Metric Selection: The choice of distance metric (e.g., Euclidean
distance, cosine similarity) can significantly impact the performance of the
algorithm. Experiment with different metrics to see which one best captures the
similarity between items or users in your dataset.

Curse of Dimensionality: In high-dimensional spaces, distance-based algorithms
like nearest neighbors can become less effective due to the increased sparsity of
data points. Consider dimensionality reduction techniques like PCA (Principal
Component Analysis) or feature selection to mitigate this issue.

Cold Start Problem: Nearest neighbor algorithms may struggle with cold start
problems, where there isn't enough data available for new users or items. Consider
using hybrid approaches or incorporating content-based features to handle this
scenario.

Scalability: For large datasets, computing distances between all pairs of items or
users can be computationally expensive. Look into approximate nearest neighbor
methods or data structures like KD-trees or Ball trees to improve efficiency.

Normalization: Ensure that features used for calculating similarity are properly
normalized to prevent certain features from dominating the distance calculation.

Hyperparameter Tuning: If using algorithms like k-nearest neighbors (k-NN),
experiment with different values of k and other hyperparameters to find the
optimal configuration for your dataset.

User/item representation: Make sure that your representation of users and items
(feature vectors) appropriately captures the relevant characteristics that define
similarity in your recommendation context.

Evaluation Metrics: Use appropriate evaluation metrics (e.g., precision, recall,
RMSE for rating prediction) to assess the performance of your recommendation
system and identify areas for improvement.

Implementation Bugs: Double-check your implementation for bugs or logical
errors that could affect the correctness of your results.

By systematically addressing these potential issues, you should be able to
diagnose and improve the performance of your nearest neighbor algorithm in your
recommendation system.
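The evaluation metrics mentioned in the list above can be sketched as follows. Computing precision and recall over the first k recommendations is an assumption made here (the notes do not fix a cutoff), and the item names and ratings are hypothetical.

```python
import math

def precision_at_k(recommended, relevant, k):
    """Fraction of the first k recommended items that are relevant."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in relevant)
    return hits / len(top_k)

def recall_at_k(recommended, relevant, k):
    """Fraction of all relevant items that appear in the first k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

recommended = ["x4", "x2", "x5", "x1"]  # ranked output of a recommender
relevant = {"x4", "x5", "x3"}           # items the user actually liked

print(precision_at_k(recommended, relevant, 3))
print(recall_at_k(recommended, relevant, 3))
print(rmse([3.0, 4.0], [4.0, 2.0]))
```

Precision and recall judge the ranked list, while RMSE judges the predicted rating values, so a system can do well on one and poorly on the other.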

Dimensionality Reduction

Dimensionality reduction is a fundamental technique in data science aimed at
reducing the number of input variables (dimensions) under consideration. It's
particularly useful in scenarios where datasets have a large number of features or
dimensions, which can lead to increased computational complexity, overfitting, and
reduced model interpretability. Here are some key points about dimensionality
reduction:

Purpose: The primary goal of dimensionality reduction is to simplify data
representation while retaining important information. This simplification can aid in
better understanding the underlying structure of data, improving computational
efficiency, and enhancing model performance.

Techniques:

Principal Component Analysis (PCA): PCA is one of the most widely used
dimensionality reduction techniques. It transforms the original variables into a new set
of orthogonal variables (principal components) that capture the maximum variance in
the data.

t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is effective for
visualizing high-dimensional data by mapping similar instances to nearby points in a
lower-dimensional space.

Linear Discriminant Analysis (LDA): LDA is often used in supervised learning
tasks to find the feature subspace that maximizes class separability.

Autoencoders: These are neural network models that learn efficient representations
of data by encoding input into a lower-dimensional latent space and then
reconstructing the output from this representation.

Benefits:

Improved Model Performance: By reducing noise and irrelevant features,
dimensionality reduction can lead to better generalization and predictive performance
of machine learning models.

Visualization: Lower-dimensional representations are easier to visualize, enabling
better exploration and understanding of data patterns.

Efficiency: Reduced dimensionality can lead to faster training times and less memory
usage, especially beneficial for large datasets.

Considerations:

Loss of Information: Dimensionality reduction inherently involves a loss of some
information. The challenge lies in finding a balance between reducing dimensionality
and preserving the critical characteristics of the data.

Choice of Technique: The choice of dimensionality reduction technique depends on
factors such as the type of data, desired outcomes (e.g., visualization vs. model
improvement), and underlying assumptions about the data distribution.

Application in Recommendation Systems:

In recommendation systems, dimensionality reduction techniques can be applied to
user-item interaction matrices to uncover latent factors or preferences.

By reducing the dimensionality of feature vectors representing users and items,
recommendation algorithms can efficiently compute similarities or recommendations
while mitigating the effects of sparsity and noise in data.

Singular Value Decomposition (SVD) is a powerful matrix factorization technique
used extensively in data science and recommendation systems. Here’s an overview of
SVD and its relevance:

What is SVD?
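
In its standard textbook form, SVD factorizes a real matrix A into two orthogonal matrices and a diagonal matrix of singular values, and keeping only the k largest singular values gives a rank-k approximation A_k. The symbols U, Σ, V, σ_i and the dimensions m × n below are conventional notation assumed here, not taken from these notes.

```latex
A = U \Sigma V^{T}, \qquad
A \in \mathbb{R}^{m \times n},\;
U \in \mathbb{R}^{m \times m},\;
\Sigma \in \mathbb{R}^{m \times n},\;
V \in \mathbb{R}^{n \times n}

U^{T} U = I, \qquad V^{T} V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge 0

A_k = \sum_{i=1}^{k} \sigma_i \, u_i \, v_i^{T}
```

Here u_i and v_i are the i-th columns of U and V, so A_k is built from the k strongest "directions" in the data.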

Key Concepts and Uses of SVD:

Dimensionality Reduction:

o SVD is used for reducing the dimensionality of data. By retaining only
the most significant singular values and corresponding vectors, you can
represent the original matrix with reduced dimensions.

Matrix Approximation:

o SVD allows for approximating a matrix A by using only the first
k singular values and vectors. This approximation can be useful for
compressing data or denoising.

Collaborative Filtering in Recommendation Systems:

o In recommendation systems, SVD is used for collaborative filtering. It
helps in uncovering latent factors that represent user preferences and
item characteristics. By decomposing the user-item interaction matrix,
recommendations can be generated based on the reduced latent space.

Principal Component Analysis (PCA):

o PCA can be computed with SVD: applying SVD to the mean-centered data
matrix yields the principal components, which are the eigenvectors of
the covariance matrix of the dataset.
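
The uses listed above can be illustrated with NumPy's `np.linalg.svd`. This is a sketch under a strong simplifying assumption: the small ratings matrix `R` is hypothetical and fully observed, whereas real user-item matrices have missing entries that plain SVD does not handle by itself.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are items, no missing values
R = np.array([
    [5.0, 4.0, 1.0, 1.0],
    [4.0, 5.0, 1.0, 2.0],
    [1.0, 1.0, 5.0, 4.0],
    [1.0, 2.0, 4.0, 5.0],
])

def truncated_svd(matrix, k):
    """Keep only the k largest singular values and their vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

k = 2
U_k, s_k, Vt_k = truncated_svd(R, k)

# Rank-k approximation of the original matrix
R_approx = U_k @ np.diag(s_k) @ Vt_k

# Latent representations: one k-dimensional row per user and per item
user_factors = U_k * s_k   # scales each column of U_k by its singular value
item_factors = Vt_k.T

# Score for user 0 on item 2 is the dot product of their latent vectors
score = user_factors[0] @ item_factors[2]

print(np.round(R_approx, 2))
print(round(float(score), 2))
```

Because the first two users and the last two users have clearly different tastes in this toy matrix, two latent factors already reproduce most of its structure.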

Steps for Using SVD in Recommendation Systems:


Advantages of SVD in Recommendation Systems:

Implicit Feedback Handling: SVD can handle implicit feedback data (e.g.,
user views or clicks) effectively by capturing underlying patterns in user-item
interactions.

Personalization: By learning latent factors, SVD can provide personalized
recommendations based on user preferences.

Scalability: Techniques like incremental SVD and stochastic gradient descent
can be used to scale SVD to large datasets.

Principal Component Analysis (PCA) is a widely used technique in data
science for reducing the dimensionality of data while retaining as much
variance as possible. Here’s a detailed overview of PCA and its applications:

What is PCA?

PCA is a statistical method that transforms a set of correlated variables (or
features) into a set of linearly uncorrelated variables called principal
components. These principal components are ordered by the amount of
variance they explain in the original data.

Key Concepts and Steps in PCA:

Covariance Matrix:

PCA starts by computing the covariance matrix of the dataset, which captures
the pairwise relationships between different variables.

Eigen decomposition or Singular Value Decomposition (SVD):

The covariance matrix is then decomposed into its eigenvectors and
eigenvalues (or using SVD for numerical stability), which represent the
directions and magnitudes of maximum variance in the data.

Selecting Principal Components:

Principal components are selected based on the eigenvalues, with higher
eigenvalues indicating greater variance explained. Typically, the number of
principal components chosen is based on the desired level of variance
retention (e.g., 95% of variance explained).

Transforming the Data:

Finally, the original dataset is transformed into the new space defined by the
selected principal components. This transformation projects the data onto a
lower-dimensional subspace while preserving as much variance as possible.
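
The four steps above can be sketched with NumPy. This is an illustrative version using the eigen decomposition route; the function name `pca`, the parameter `variance_to_keep`, and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca(X, variance_to_keep=0.95):
    """Project X onto the fewest components reaching the variance target."""
    # Step 1: center the data and compute the covariance matrix
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)

    # Step 2: eigen decomposition (eigh is meant for symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 3: choose the number of components from cumulative variance
    explained = eigvals / eigvals.sum()
    k = int(np.searchsorted(np.cumsum(explained), variance_to_keep)) + 1
    k = min(k, len(eigvals))
    components = eigvecs[:, :k]

    # Step 4: project the centered data onto the selected components
    return X_centered @ components, components, explained[:k]

# Synthetic data: the third feature is almost a copy of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + 0.05 * rng.normal(size=200)])

Z, components, explained = pca(X)
print(Z.shape)
print(np.round(explained, 3))
```

Since one of the three features is nearly redundant, two components are enough to pass the 95% target on this data.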

Applications of PCA:

Dimensionality Reduction:

PCA is primarily used for reducing the number of variables in a dataset while
retaining most of the information. This is beneficial for improving
computational efficiency, reducing noise, and avoiding overfitting in machine
learning models.

Visualization:

PCA is valuable for visualizing high-dimensional data. By reducing data to
two or three principal components, it becomes easier to plot and understand
the underlying structure and relationships.

Feature Extraction:

PCA can be used as a feature extraction technique where the principal
components serve as new features that may be more informative or less
redundant than the original variables.

Noise Reduction:

PCA can effectively filter out noise by emphasizing variations in data that are
significant (captured by principal components with high eigenvalues) and
disregarding variations that are less significant (captured by components with
low eigenvalues).

Advantages of PCA:

Interpretability: Principal components are linear combinations of the original
variables, so each component can be examined in terms of the contributions
(loadings) of the different features.

Data Compression: PCA allows for data compression by reducing the number
of dimensions while preserving most of the variance, which is useful for
storage and computation.

Improves Model Performance: By reducing the number of input variables,
PCA can lead to simpler and more efficient models that may generalize better to
new data.

Challenges and Considerations:

Loss of Information: PCA involves a trade-off between dimensionality
reduction and information loss. Choosing too few principal components may
result in significant loss of variance and important features.

Assumptions of Linearity: PCA assumes linear relationships between variables,
which may not hold in all datasets.

Scaling: It is important to scale the data appropriately before applying PCA to
ensure that variables with larger scales do not dominate the principal
components.
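
The scaling point can be demonstrated with a small experiment. The two synthetic features below (a height in metres and a weight in grams) are hypothetical and chosen only because their numeric scales differ by several orders of magnitude.

```python
import numpy as np

def first_component(X):
    """Unit vector of the direction with the largest variance."""
    X_centered = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=300)
weight_g = 40000 * height_m + rng.normal(0, 3000, size=300)
X = np.column_stack([height_m, weight_g])

# Standardize each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pc_raw = first_component(X)
pc_scaled = first_component(X_scaled)

print(np.round(np.abs(pc_raw), 4))     # dominated by the weight column
print(np.round(np.abs(pc_scaled), 4))  # both features contribute equally
```

Without scaling, the first component is essentially just the weight axis because its variance is numerically huge; after standardization both correlated features share the component.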
