Data Mining Answers

Section – A

1. Explain the difference between data mining and traditional query processing.
Ans. The differences between data mining and traditional query processing are:
Aspect | Data Mining | Traditional Query Processing
Purpose | Discover hidden patterns and knowledge | Retrieve specific information
Nature of Results | Probabilistic patterns, trends, and predictions | Exact answers to defined queries
Technique Used | Machine learning, statistics, AI algorithms | SQL and relational database operations
Data Requirement | Large volumes of structured, semi-structured, or unstructured data | Clean, structured, relational data
Query Type | Exploratory, pattern-finding tasks | Predefined, specific queries
User Knowledge Required | User may not know what to look for | User knows what to ask
Examples | "Find patterns in customer purchase behavior" | "SELECT * FROM Sales WHERE Region = 'West';"
Goal | Knowledge discovery | Data retrieval

2. What is the importance of data integration in data mining?


Ans. The importance of data integration in data mining is as follows:
Combines data from multiple sources: Data integration allows data
from different databases, formats, or systems to be unified into a
single, coherent dataset, enabling more meaningful analysis.
Increases mining efficiency: Integrated data reduces duplication and
fragmentation, which improves the performance and accuracy of
data mining algorithms.

3. Define data reduction and list two techniques used for it.
Ans. Data reduction is the process of reducing the volume of data
while preserving its integrity and the essential information it
contains. The goal is to make data mining more efficient by
minimizing storage, computation time, and complexity, without
significantly losing accuracy or insights.

Two Common Techniques for Data Reduction:


1. Dimensionality Reduction: A technique that reduces the number
of input variables or features by transforming or selecting the
most relevant ones while retaining essential information.
2. Data Compression: A method that reduces data size by encoding
it more efficiently, often through mathematical transformations
or encoding schemes.
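A minimal sketch of dimensionality reduction using scikit-learn's PCA; the small feature matrix and the choice of 2 components are assumptions for illustration:

```python
# Minimal sketch of dimensionality reduction with PCA (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.decomposition import PCA

# Toy dataset: 6 samples with 4 correlated features (illustrative values only).
X = np.array([
    [2.5, 2.4, 1.2, 0.5],
    [0.5, 0.7, 0.3, 0.1],
    [2.2, 2.9, 1.1, 0.6],
    [1.9, 2.2, 0.9, 0.4],
    [3.1, 3.0, 1.5, 0.7],
    [2.3, 2.7, 1.0, 0.5],
])

# Keep 2 components: the data volume shrinks while most of the variance is retained.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (6, 2) -> fewer features
print(pca.explained_variance_ratio_.sum())  # fraction of variance preserved
```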

4. Explain the role of feature selection in data preprocessing.


Ans. Feature selection is a crucial step in data preprocessing that
involves identifying and selecting the most relevant attributes
(features) from a dataset to improve the performance and efficiency
of data mining or machine learning models.
Here are its key roles:
1. Reduces dimensionality: By eliminating irrelevant or redundant
features, it simplifies the dataset, making models faster and easier
to train.
2. Improves model accuracy: Selecting the most informative
features helps models focus on the most predictive aspects of the
data, enhancing performance.
3. Prevents overfitting: Reducing noise and irrelevant information
decreases the risk of the model fitting random patterns in the
training data.
4. Enhances interpretability: Fewer features make it easier to
understand the model's behavior and the relationships within the
data.
5. Speeds up computation: Smaller datasets require less processing
power and memory, which is critical for large-scale data mining.
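A minimal sketch of filter-based feature selection, assuming scikit-learn is available; SelectKBest with the ANOVA F-score is only one of several possible selection methods:

```python
# Minimal sketch of filter-based feature selection (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest ANOVA F-score against the class label.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)   # (150, 4) -> (150, 2)
print(selector.get_support())            # boolean mask of the selected features
```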

5. Differentiate between noise and missing values in a dataset.


Ans. The differences between noise and missing values are:
Aspect | Noise | Missing Values
Definition | Random or meaningless data that deviates from the true underlying pattern | Absence of data in one or more fields in a dataset
Cause | Measurement errors, data entry mistakes, or outliers | Data not recorded, lost, or unavailable during collection
Impact | Distorts patterns, reduces model accuracy | Can lead to biased or incomplete analysis if not handled properly
Example | A salary entry of "999999" or a sudden spike in sensor readings | A blank cell in a column like "Age" or "Income"
Handling Methods | Smoothing, outlier detection, or filtering techniques | Imputation (mean/mode), deletion, or using algorithms that handle missing data

6. What is the difference between pattern evaluation and pattern discovery in data mining?
Ans. The differences between pattern discovery and pattern evaluation in data mining are:
Aspect | Pattern Discovery | Pattern Evaluation
Definition | The process of identifying patterns, trends, or relationships in data | The process of assessing the usefulness, interestingness, or validity of discovered patterns
Purpose | To uncover new and potentially unknown patterns | To filter and rank patterns based on relevance, significance, or actionability
When it occurs | Early stage in the data mining process | After patterns have been discovered
Output | A large set of raw patterns or rules | A refined set of meaningful and valuable patterns
Example | Finding that customers who buy bread also buy butter | Determining whether this pattern is statistically significant or useful for marketing

7. Describe the significance of scalability in data mining algorithms.


Ans. Scalability in data mining refers to an algorithm's ability to
handle increasing amounts of data effectively without a significant
loss in performance. It is important because:
1. Handles large datasets efficiently.
2. Maintains performance in terms of speed, accuracy, and resource
usage.

8. Explain how the k-Nearest Neighbor (k-NN) algorithm works.


Ans. The k-Nearest Neighbor (k-NN) algorithm is a simple, instance-
based learning method used for classification or regression. It works
by finding the k nearest data points to a given query point based on a
distance metric (usually Euclidean) and making predictions based on
the majority class (for classification) or average value (for regression)
of these neighbors.
Key points:
 For classification: Assigns the most frequent class among the k
nearest neighbors.
 For regression: Predicts the average of the values of the k nearest
neighbors.
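A minimal sketch of k-NN classification, assuming scikit-learn and its built-in Iris dataset; k = 3 and the train/test split are illustrative choices:

```python
# Minimal k-NN sketch (assumes scikit-learn); k = 3 neighbours, Euclidean distance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)   # majority vote of the 3 nearest points
knn.fit(X_train, y_train)                   # k-NN essentially just stores the training data

print("Accuracy:", knn.score(X_test, y_test))
print("Prediction for first test sample:", knn.predict(X_test[:1]))
```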

9. What are the advantages of using ensemble methods in classification?
Ans. The advantages of using ensemble methods in classification are:
Improved Accuracy: By combining multiple models, ensemble
methods reduce the risk of errors from individual models, leading to
more accurate predictions.
Flexibility: Ensemble methods can work with a variety of base
models (like decision trees, neural networks, etc.) and can be applied
to both weak and strong learners.

10. Define a confusion matrix and list its components.


Ans. A confusion matrix is a table used to evaluate the performance
of a classification algorithm by comparing the predicted labels with the
actual true labels. It provides insight into the types of errors made by
the classifier.
Key Components:
1. True Positive (TP): The number of instances correctly classified as
positive (Actual = 1, Predicted = 1).
2. False Positive (FP): The number of instances incorrectly classified
as positive (Actual = 0, Predicted = 1).
3. False Negative (FN): The number of instances incorrectly classified
as negative (Actual = 1, Predicted = 0).
4. True Negative (TN): The number of instances correctly classified
as negative (Actual = 0, Predicted = 0).
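A minimal sketch that builds a confusion matrix from hypothetical true and predicted labels, assuming scikit-learn is available:

```python
# Minimal sketch: building a confusion matrix from toy labels (assumes scikit-learn).
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # classifier output

# With labels=[0, 1] the matrix is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```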
11. How does a support vector machine (SVM) separate data points?
Ans. A Support Vector Machine (SVM) separates data points by
finding a hyperplane that best divides the data into two classes. It
maximizes the margin (distance) between the hyperplane and the
closest data points from each class, known as support vectors. This
helps SVM classify new data points with higher accuracy.

12. Explain the main assumption behind linear regression.


Ans. The main assumptions behind linear regression are:
1. Linearity: The relationship between the independent and
dependent variables is linear.
2. Independence: The residuals (errors) are independent of each
other.
3. Homoscedasticity: The variance of residuals is constant across all
levels of the independent variables.
4. Normality of Errors: The residuals are normally distributed.
5. No Multicollinearity: Independent variables should not be highly
correlated with each other.

13. What is the role of a cost function in regression models?


Ans. The cost function in regression models measures the difference
between predicted and actual values. Its role is to quantify errors,
guiding the optimization process to minimize these errors and
improve the model’s predictions. A lower cost indicates better model
performance.
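A minimal sketch of one common regression cost function, mean squared error, computed with NumPy on hypothetical actual and predicted values:

```python
# Minimal sketch of a regression cost function: mean squared error (assumes NumPy).
import numpy as np

y_actual    = np.array([50, 55, 65, 70, 75])
y_predicted = np.array([50, 55, 60, 70, 80])

# MSE = average of squared differences between predictions and actual values.
mse = np.mean((y_actual - y_predicted) ** 2)
print("MSE:", mse)   # lower cost -> better fit; optimisation tries to minimise this
```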

14. Compare batch gradient descent and stochastic gradient descent.
Ans. The comparison between batch gradient descent (BGD) and stochastic gradient descent (SGD) is:
Aspect | Batch Gradient Descent (BGD) | Stochastic Gradient Descent (SGD)
Update Frequency | Updates model parameters after processing the entire dataset | Updates model parameters after each individual data point
Computation Cost | High, since it requires calculating the gradient for the entire dataset | Low, as it updates after processing one data point at a time
Convergence | More stable and smooth, but can be slower to converge | May fluctuate, but can converge faster due to frequent updates
Memory Usage | Requires storing the entire dataset in memory | Low memory usage, as it processes one data point at a time
Speed | Slower, as it processes the entire dataset before updating | Faster for large datasets, as updates are made more frequently
Effect on Local Minima | More likely to get stuck in local minima due to smooth convergence | Can escape local minima due to more frequent, noisy updates
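A minimal sketch contrasting the two update styles on a one-parameter model y = w·x, assuming NumPy; the learning rate, iteration counts, and toy data are illustrative assumptions:

```python
# Minimal sketch contrasting batch and stochastic updates for y = w*x (assumes NumPy).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x                      # true relationship, so the ideal weight is 2.0
lr = 0.01                        # learning rate

# Batch gradient descent: one update per pass over the whole dataset.
w = 0.0
for _ in range(100):
    grad = np.mean(2 * (w * x - y) * x)   # gradient of MSE over all points
    w -= lr * grad
print("BGD weight:", round(w, 3))

# Stochastic gradient descent: one update per individual data point.
w = 0.0
rng = np.random.default_rng(0)
for _ in range(100):
    for i in rng.permutation(len(x)):
        grad_i = 2 * (w * x[i] - y[i]) * x[i]   # gradient from a single point
        w -= lr * grad_i
print("SGD weight:", round(w, 3))
```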

15. What is the use of cross-validation in machine learning models?


Ans. Cross-validation is a technique used to assess the performance
and generalization ability of a machine learning model. It helps to
evaluate how well the model will perform on unseen data, reducing
the risk of overfitting.
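A minimal sketch of 5-fold cross-validation, assuming scikit-learn; the decision tree and the Iris dataset are illustrative choices:

```python
# Minimal sketch of 5-fold cross-validation (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each fold is held out once for testing while the other folds train the model.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```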

16. How does pruning improve decision tree performance?


Ans. Pruning improves decision tree performance by removing
unnecessary branches, preventing overfitting, and simplifying the
model. It makes the tree more generalizable to unseen data,
enhances prediction accuracy, and improves computational
efficiency.
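A minimal sketch of one pruning approach, scikit-learn's cost-complexity pruning controlled by ccp_alpha; the dataset and the alpha value are illustrative assumptions:

```python
# Minimal sketch of cost-complexity pruning for a decision tree (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A larger ccp_alpha removes more branches, trading training fit for generalisation.
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("Unpruned: nodes =", full_tree.tree_.node_count,
      "test accuracy =", round(full_tree.score(X_test, y_test), 3))
print("Pruned:   nodes =", pruned_tree.tree_.node_count,
      "test accuracy =", round(pruned_tree.score(X_test, y_test), 3))
```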

17. Define multi-class classification with an example.


Ans. Multi-class classification is a type of classification problem
where the goal is to assign each input data point to one of three or
more classes (categories), rather than just two.
Example:
Consider a system that classifies animals based on their
characteristics (e.g., size, number of legs, habitat):
 Classes: "Cat", "Dog", "Bird"
 Input: Features like size, fur type, and habitat.
 Output: The model predicts one of the classes—either "Cat",
"Dog", or "Bird"—for a given animal.

18. Explain the difference between client-side and server-side web usage mining.
Ans. The differences between client-side and server-side web usage mining are:
Aspect | Client-Side Web Usage Mining | Server-Side Web Usage Mining
Location of Data Collection | Data is collected from the user's browser (client) | Data is collected from the web server that hosts the website
Data Sources | Uses data such as cookies, browser logs, or user interactions (clicks, scrolls) | Uses server logs, page access records, and HTTP request data
Scope | Focuses on individual user behavior (what a specific user does on the site) | Focuses on aggregate user behavior (overall site traffic patterns)
Data Privacy | Can potentially track personal data and actions across websites (e.g., through cookies) | More anonymous, as it tracks only server-side interactions, not specific user data
Data Granularity | Provides detailed interaction-level data, such as time spent on each page or button clicks | Provides more general usage data (e.g., number of hits per page, session length)
Example | Tracking user clicks on buttons or navigation through a website using JavaScript | Analyzing server logs to understand the frequency of page visits, page views, and traffic sources

19. What is clickstream analysis in web mining?


Ans. Clickstream analysis refers to the process of tracking and
analyzing the sequence of clicks or actions made by users as they
navigate through a website or web application. It helps understand
user behavior, preferences, and interactions with web pages.
20. How is text mining related to web content mining?
Ans. Text Mining and Web Content Mining are related because text
mining techniques are often used within web content mining to
analyze and extract valuable insights from the textual data found on
web pages. Web content mining focuses on extracting information
from various web sources, and text mining helps process and analyze
the textual content (like articles, blogs, and product descriptions) on
those web pages to identify patterns, trends, and meaningful
information.

21. Explain how web structure mining can help in improving search
engine results.
Ans. Web structure mining analyzes the link structure of the web,
such as how web pages are interconnected through hyperlinks. By
examining the relationships and connections between pages, it helps
in improving search engine results by:
1. Ranking Pages: Identifying important pages based on their
connectivity and link structure (e.g., PageRank algorithm).
2. Improving Relevance: Understanding the structure helps search
engines prioritize relevant pages that are well-linked to others.
3. Optimizing Crawling: It helps search engines efficiently crawl and
index pages, improving the overall search result quality.
22. What are some challenges faced in web mining due to dynamic
web content?
Ans. Frequent Content Changes: Websites with frequently updated
or dynamically generated content make it difficult to mine accurate
and up-to-date data consistently.
Data Inconsistency: With constantly changing content, maintaining
consistency in data structure and format across multiple mining
sessions becomes difficult.

23. Define personalized web mining with an example.


Ans. Personalized web mining refers to the process of using web
mining techniques to analyze user behavior, preferences, and
interactions with web content in order to deliver tailored, customized
experiences. This involves analyzing data to create personalized
recommendations, content, and advertisements based on individual
user profiles.
Examples:
Search Engines: Google tailors search results based on your previous
search history and location to provide more relevant results.

24. What is the Apriori principle in association rule mining?


Ans. The Apriori principle is a fundamental concept used in
association rule mining, which is based on the idea that if an itemset
is frequent, all of its subsets must also be frequent. This principle
helps to efficiently find frequent item sets in large datasets.
Example:-
In a retail setting, if the itemset {milk, bread} is frequent (often
bought together), the Apriori principle implies that {milk} and {bread}
must also be frequent by themselves.

25. Explain the concepts of support and confidence in association rule mining.
Ans. The concepts of support and confidence in association rule mining are:
Support:
 Definition: Support is a measure of how frequently a particular
itemset appears in the dataset. It is used to evaluate the relevance
of the itemset.
 Interpretation: A higher support indicates that the itemset occurs
more frequently in the dataset. In association rule mining, item
sets with higher support are more likely to be considered as
important or useful.
Confidence:
 Definition: Confidence is a measure of how likely it is that an item
Y is purchased when item X is purchased. It evaluates the strength
of the association between two items.
 Interpretation: Confidence represents the probability that item Y
will be purchased when item X is purchased. A higher confidence
indicates a stronger rule.
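A minimal sketch that computes support and confidence for the rule {milk} → {bread} over a small, hypothetical list of transactions, in plain Python:

```python
# Minimal sketch computing support and confidence for the rule {milk} -> {bread}
# over a toy list of transactions (hypothetical data, plain Python).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread", "eggs"},
]
n = len(transactions)

count_milk = sum(1 for t in transactions if "milk" in t)
count_milk_bread = sum(1 for t in transactions if {"milk", "bread"} <= t)

support = count_milk_bread / n                  # how often both items appear together
confidence = count_milk_bread / count_milk      # P(bread | milk)

print("support({milk, bread}) =", support)        # 3/5 = 0.6
print("confidence(milk -> bread) =", confidence)  # 3/4 = 0.75
```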

Section – B
1. Explain the working of the Decision Tree algorithm with an
example.
Ans. A Decision Tree is a supervised learning algorithm used for
classification and regression. It splits the data into subsets
based on feature values to form a tree-like model of decisions.
Steps in Decision Tree Algorithm:
1. Select the Best Feature:
o Choose the feature that best splits the data based on a
criterion like Gini Index, Information Gain, or Entropy.
2. Create a Node:
o Make a decision node for the selected feature.
3. Split the Dataset:
o Divide the dataset into subsets based on the values of the
selected feature.
4. Repeat Recursively:
o Repeat steps 1–3 for each subset until:
 All data in a node belong to the same class (pure), or
 A stopping condition (e.g., maximum depth) is met.
5. Form the Tree:
o The result is a tree where each internal node is a decision
based on a feature, and each leaf node represents a class
label (output).
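A minimal sketch of the algorithm in practice, assuming scikit-learn; the Iris dataset, Gini criterion, and depth limit are illustrative choices:

```python
# Minimal sketch of training a decision tree classifier (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" is the default split measure; "entropy" would use information gain.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned decision rules (internal nodes test features, leaves give classes).
print(export_text(tree, feature_names=load_iris().feature_names))
```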
2. Compare supervised learning, unsupervised learning, and
reinforcement learning with examples.
Ans. The comparison between supervised learning, unsupervised learning, and reinforcement learning is:
Aspect | Supervised Learning | Unsupervised Learning | Reinforcement Learning
Definition | Learns from labeled data (input-output pairs) | Learns from unlabeled data (no output labels) | Learns by interacting with an environment and receiving rewards/punishments
Goal | Predict output from inputs | Find hidden patterns or groupings in data | Learn an optimal policy to maximize cumulative reward
Data Type | Labeled data | Unlabeled data | Sequential data and feedback from the environment
Output | Predictive model (e.g., classification or regression) | Grouping, dimensionality reduction, pattern discovery | Policy or strategy for decision-making
Examples | Spam detection, house price prediction | Customer segmentation, market basket analysis | Game playing (e.g., Chess, Go), robotics navigation
Key Algorithms | Linear Regression, Decision Trees, SVM | K-Means, Hierarchical Clustering, PCA | Q-learning, Deep Q-Networks (DQN), SARSA

3. Explain the K-Means clustering algorithm step by step with a simple example.
Ans. K-Means is an unsupervised clustering algorithm that partitions the data into k clusters, where each point belongs to the cluster with the nearest centroid.
Steps:
1. Choose the number of clusters k.
2. Initialize k centroids (for example, by picking k random points).
3. Assign each data point to its nearest centroid (usually using Euclidean distance).
4. Recompute each centroid as the mean of the points assigned to it.
5. Repeat steps 3–4 until the assignments stop changing or a maximum number of iterations is reached.
Simple example: with the 1-D points 2, 4, 10, 12 and k = 2, starting centroids 2 and 12 give the clusters {2, 4} and {10, 12}; the centroids update to 3 and 11, the assignments do not change on the next pass, and the algorithm stops.
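A minimal sketch that reproduces the toy example above with scikit-learn's KMeans, assuming scikit-learn and NumPy are available:

```python
# Minimal sketch of K-Means on the toy 1-D example above (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2], [4], [10], [12]])   # the four 1-D points from the example

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)                        # e.g. [0 0 1 1]
print("Centroids:", kmeans.cluster_centers_.ravel())    # approximately [3. 11.]
```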

4. Describe the advantages and limitations of the K-Nearest Neighbors (k-NN) algorithm.
Ans. Advantages:
1. Simple and Easy to Implement
o Intuitive and easy to understand with minimal training
phase.
2. No Assumptions About Data
o Works well with non-linear and multi-class problems as it is a
non-parametric method.
3. Effective with Sufficient Data
o Can perform well if there is a large and well-distributed
dataset.
4. Versatile
o Can be used for both classification and regression problems.
Limitations:
1. Computationally Expensive
o Slow for large datasets, as it computes the distance to every
training point during prediction.
2. Memory Intensive
o Requires storing the entire training dataset in memory.
3. Sensitive to Irrelevant Features and Noise
o Performance can degrade if the data has too many irrelevant
or noisy features.
4. Poor with Imbalanced Data
o Biased towards the majority class in case of unbalanced
datasets.

5. Explain Linear Regression with a diagram and an example.


Ans. Linear Regression is a supervised learning algorithm used
to predict a continuous output based on the relationship
between input (independent) and output (dependent) variables
by fitting a straight line to the data.
The general equation for simple linear regression is:
y=mx+c
Where:
 y = predicted output
 x = input feature
 m = slope of the line (how much y changes per unit x)
 c = intercept (value of y when x = 0)

Example:

Hours Studied (x) Score (y)


1 50
2 55
3 65
4 70
5 75
The algorithm finds the best-fitting line through the points,
such as:
y=5x+45
This means:
 For every additional hour studied, the score increases by 5.
 If a student studies 6 hours:

y=5(6)+45=75
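A minimal sketch that fits the hours-studied data above with scikit-learn's LinearRegression; note that the exact least-squares line for these five points is approximately y = 6.5x + 43.5, while the line in the text is a rounded illustration:

```python
# Minimal sketch fitting a line to the hours-studied example (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])   # hours studied
y = np.array([50, 55, 65, 70, 75])        # scores

model = LinearRegression().fit(X, y)

print("slope m:", round(model.coef_[0], 2))        # about 6.5 for this data
print("intercept c:", round(model.intercept_, 2))  # about 43.5
print("predicted score for 6 hours:", round(model.predict(np.array([[6]]))[0], 1))
```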

6. What is Random Forest? How does it improve accuracy compared to a single Decision Tree?
Ans. Random Forest is an ensemble learning algorithm that
builds multiple decision trees and combines their outputs to
make more accurate and stable predictions.
It is used for both classification and regression tasks.
Random Forest improves accuracy by addressing the
weaknesses of a single decision tree using ensemble learning.
Here’s a clear breakdown:
1. Reduces Overfitting
 A single decision tree can memorize training data and perform
poorly on unseen data.
 Random Forest builds multiple trees on different subsets of data
and features.
2. Reduces Variance
 A single tree is unstable — small data changes can produce a very
different model.
 Random Forest creates a "forest" of diverse trees, so the overall
prediction becomes more stable and consistent.
3. Increases Robustness
 Different trees focus on different parts of the data due to random
feature selection and sampling.
 Combining them ensures that the final model is not overly
influenced by any one tree's bias or errors.
4. Better Accuracy Through Majority Voting or Averaging
 In classification, each tree votes for a class — the majority vote
wins.
 In regression, it takes the average prediction of all trees.
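A minimal sketch comparing a single decision tree with a random forest, assuming scikit-learn; the dataset and the choice of 100 trees are illustrative:

```python
# Minimal sketch comparing a single tree with a random forest (assumes scikit-learn).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# 100 trees, each trained on a bootstrap sample with random feature subsets;
# the final class is decided by majority vote.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single tree accuracy:", round(single_tree.score(X_test, y_test), 3))
print("Random forest accuracy:", round(forest.score(X_test, y_test), 3))
```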
7. Describe the working of Support Vector Machine (SVM) and
the concept of hyperplane.
Ans. Support Vector Machines (SVMs) are supervised learning
algorithms primarily used for classification, though they can
also handle regression tasks. They work by finding the optimal
decision boundary that best separates the classes in the feature
space. This boundary is called a hyperplane.

Working of SVM:
1. Data Representation:
SVM begins by representing the input data as points in a high-
dimensional space, where each dimension corresponds to one
feature.
2. Finding the Optimal Hyperplane:
o The goal is to find a hyperplane that separates the data
points belonging to different classes with the maximum
margin.
o The margin is defined as the distance between the
hyperplane and the nearest data points from each class.
These closest points are known as support vectors.
o By maximizing the margin, SVM aims to create a decision
boundary that generalizes better to unseen data.
3. Handling Non-Linearly Separable Data:
o If the data isn’t linearly separable in the original feature
space, SVM can apply a kernel trick to transform the data
into a higher-dimensional space where a linear hyperplane
can be used to separate the classes.

Concept of the Hyperplane:


 A hyperplane is a decision boundary that separates data points in
a multi-dimensional space.
 In two dimensions, a hyperplane is a line; in three dimensions, it is
a plane; and in higher dimensions, it’s called a hyperplane.
 The optimal hyperplane is chosen such that it maximizes the
distance (margin) between the closest points of the different
classes. This maximization helps ensure that the model has the
best possible generalization performance.
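A minimal sketch of a linear SVM and its hyperplane parameters, assuming scikit-learn and NumPy; the 2-D points are hypothetical:

```python
# Minimal sketch of a linear SVM and its hyperplane (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable classes (hypothetical 2-D points).
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# For a linear kernel, the separating hyperplane is w.x + b = 0.
print("w (normal vector):", clf.coef_[0])
print("b (intercept):", clf.intercept_[0])
print("Support vectors:", clf.support_vectors_)
print("Prediction for [3, 3]:", clf.predict([[3, 3]]))
```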
8. Explain the key differences between classification and
regression problems with examples.
Ans. The differences between classification and regression problems are:
Aspect | Classification | Regression
Purpose | Predict categorical labels or classes | Predict continuous numerical values
Output Type | Discrete categories (e.g., yes/no, spam/ham, types) | Real numbers (e.g., height, salary, temperature)
Example | Email spam detection (spam or not spam) | Predicting house prices based on size and location
Algorithms Used | Logistic Regression, Decision Tree, SVM, k-NN | Linear Regression, SVR, Decision Tree Regressor
Evaluation Metrics | Accuracy, Precision, Recall, F1 Score, Confusion Matrix | Mean Squared Error (MSE), Mean Absolute Error (MAE), R²
Prediction Type | Class label | Continuous value
Simple Examples:
Classification:
Email Spam Detection : Predict whether an email is spam or not
spam.
Regression:
House Price Prediction : Predict the price of a house based on
its features(e.g., size, location).

9. What is hierarchical clustering? Explain the concepts of agglomerative and divisive methods.
Ans. Hierarchical Clustering is a clustering algorithm that builds
a hierarchy of clusters, which can be represented as a tree-like
structure called a dendrogram. It does not require the number
of clusters to be specified in advance, and it can be used for
both agglomerative and divisive clustering methods.
The concepts of the agglomerative and divisive methods are:
Agglomerative Method (Bottom-Up Approach)

 Definition: In agglomerative hierarchical clustering, every data


point starts as its own individual cluster, and the algorithm
progressively merges the closest clusters.
 Process:
o Start with each data point as its own cluster.
o Find the two closest clusters and merge them into a single
cluster.
o Repeat this process of merging the closest clusters until all
points are in one cluster.
 Distance Metric: The proximity between clusters can be
measured using various distance metrics like single linkage
(minimum distance), complete linkage (maximum distance), or
average linkage (average distance between all pairs).

Divisive Method (Top-Down Approach)


 Definition: In divisive hierarchical clustering, we start with all data
points in a single cluster and progressively split the cluster into
smaller clusters.
 Process:
1. Start with all data points in a single large cluster.
2. Identify the most distinct subset of points to form a separate
cluster.
3. Split the large cluster into smaller clusters.
4. Repeat the splitting process recursively until each data point
is in its own cluster.
 Distance Metric: Similar to agglomerative clustering, distance
metrics can also be used to measure how to split the clusters.
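A minimal sketch of agglomerative (bottom-up) clustering with average linkage, assuming SciPy and NumPy; the six 2-D points are hypothetical:

```python
# Minimal sketch of agglomerative hierarchical clustering (assumes SciPy and NumPy).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six hypothetical 2-D points forming two obvious groups.
X = np.array([[1, 1], [1.5, 1], [1, 1.5], [8, 8], [8.5, 8], [8, 8.5]])

# 'average' linkage merges the two clusters with the smallest mean pairwise distance.
Z = linkage(X, method="average")

# Cut the dendrogram so that exactly 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print("Cluster labels:", labels)   # e.g. [1 1 1 2 2 2]
```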

10. Explain logistic regression and how it is used for binary classification.
Ans. Logistic Regression is a statistical model used for predicting
the probability of a binary outcome (i.e., two classes) based on
one or more predictor variables. Despite the name
"regression," it is actually a classification algorithm used for
tasks where the response variable is categorical (binary, in this
case).
Logistic Regression for Binary Classification:
In binary classification, the goal is to categorize data into two
possible classes: 0 or 1. Logistic regression is ideal for this task
because it provides a probability score between 0 and 1,
representing the likelihood that an input belongs to class 1.
Example:
Let’s say we want to predict whether a customer will buy a
product (class 1) or not buy it (class 0) based on their age and
income.
 Input features: Age and Income.
 Output: A binary outcome, 1 for "Buy" and 0 for "Not Buy".
In logistic regression, we would:
1. Train the model using a dataset of customer information (age,
income, etc.) and the corresponding labels (1 or 0 for "Buy" or
"Not Buy").
2. The model learns the optimal coefficients (β₀, β₁, β₂) for the features (age and income).
3. For a new customer, the model would use the learned weights to
calculate the probability that the customer will buy the product.
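A minimal sketch of the buy / not-buy example, assuming scikit-learn and NumPy; the age and income values are hypothetical:

```python
# Minimal sketch of logistic regression for the buy / not-buy example
# (assumes scikit-learn and NumPy; hypothetical data, income in thousands).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[22, 25], [25, 32], [47, 50], [52, 80],
              [46, 62], [56, 95], [23, 28], [30, 40]])   # [age, income in $1000s]
y = np.array([0, 0, 1, 1, 1, 1, 0, 0])                   # 1 = Buy, 0 = Not Buy

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

new_customer = np.array([[40, 55]])
print("P(Buy):", round(model.predict_proba(new_customer)[0][1], 3))
print("Predicted class:", model.predict(new_customer)[0])
```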

11. Define web mining and explain its three main categories with examples.
Ans. Web Mining is the process of extracting useful information
and knowledge from web data, which includes web content,
structure, and usage patterns. It applies data mining techniques
to discover patterns from the World Wide Web and helps in
personalizing user experiences, improving web services, and
making business decisions.
Three Main Categories of Web Mining:
Type of Web Mining | Description | Example
1. Web Content Mining | Extracts useful information from the contents of web pages (text, images, videos, etc.) | Analyzing product reviews on e-commerce websites to determine customer sentiment
2. Web Structure Mining | Analyzes the structure of links within and between websites (using graph theory) | Improving search engine ranking by analyzing how websites link to each other (e.g., PageRank)
3. Web Usage Mining | Discovers patterns based on user behavior and interaction with websites (clicks, visits) | Analyzing user clickstreams to recommend products or personalize website content

12. How can web usage mining be applied to improve user experience on e-commerce platforms?
Ans. Web Usage Mining can significantly improve user
experience on e-commerce platforms by analyzing user
behavior patterns, such as page visits, clicks, time spent, and
purchase history. This insight allows platforms to personalize
content, optimize navigation, and enhance customer
engagement.
Applications of Web Usage Mining in E-Commerce:
Application | How It Improves User Experience
Personalized Recommendations | Suggests products based on browsing and purchase history (e.g., "You may also like…")
Improved Search Results | Ranks search results based on popular user clicks and frequent queries
Dynamic Page Layouts | Adjusts homepage or category layout to highlight frequently visited or trending products
Customer Segmentation | Groups users by behavior (e.g., frequent buyers, cart abandoners) to tailor marketing strategies
Targeted Promotions and Ads | Displays relevant discounts or offers based on user interests and purchase patterns
Cart Abandonment Recovery | Identifies when users abandon carts and triggers reminder emails or incentives
Navigation Optimization | Streamlines menu and link structures based on commonly followed user paths to improve usability
Improved Customer Support | Anticipates issues and provides quick help or FAQs based on user behavior patterns

13. Explain web structure mining and how the PageRank algorithm uses it to rank web pages.
Ans. Web Structure Mining is a branch of web mining that
focuses on analyzing the link structure of websites using graph
theory. It views the web as a directed graph, where:
 Nodes represent web pages.
 Edges (links) represent hyperlinks between pages.
PageRank is a link analysis algorithm developed by Google
founders Larry Page and Sergey Brin. It is a key application of
web structure mining, used to rank web pages in search results.
How PageRank Works:
1. Each page starts with an equal initial rank.
2. A page passes its rank evenly to the pages it links to.
3. A page's new rank is the sum of the rank it receives from pages linking to it, combined with a damping factor (typically 0.85) that models a user occasionally jumping to a random page.
4. The process is repeated iteratively until the ranks converge; pages with many incoming links from important pages end up with higher ranks and appear higher in search results.
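A minimal sketch of the PageRank idea on a tiny, hypothetical four-page link graph, in plain Python with a damping factor of 0.85:

```python
# Minimal sketch of PageRank-style iteration on a hypothetical 4-page link graph.
damping = 0.85
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start with equal ranks

for _ in range(50):                            # iterate until ranks stabilise
    new_rank = {}
    for p in pages:
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

for p in sorted(rank, key=rank.get, reverse=True):
    print(p, round(rank[p], 3))                # C (most in-links) ranks highest
```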

14. Discuss the challenges involved in web content mining.


Ans.
Challenge | Description
Heterogeneous data formats | Difficult to standardize across HTML, PDF, images, etc.
Unstructured/noisy content | Requires filtering out ads, comments, or non-relevant data
Multilingual content | Requires advanced multilingual support and translation
Duplicate content | Needs detection and elimination for quality results
Dynamic content | Must track changes to keep mined data current
Legal and ethical concerns | Scraping restrictions, privacy, and compliance issues
Context sensitivity | Challenges in interpreting sarcasm, slang, or sentiment
Scalability | Requires handling vast, fast-growing web content efficiently

15. Explain the various types of data quality problems and how they are addressed during data cleaning.
Ans. Common data quality problems and the cleaning steps used to address them include:
1. Missing values: handled by deleting the affected records or imputing values (mean, median, mode, or model-based estimates).
2. Noisy data and outliers: handled by smoothing techniques (binning, regression) and outlier detection or filtering.
3. Duplicate records: handled by de-duplication and record matching.
4. Inconsistent formats (e.g., different date formats, units, or codes): handled by standardization and conversion rules.
5. Incorrect or invalid values (e.g., out-of-range entries): handled by validation rules and correction against reference data.
16. Describe the methods to handle missing data and their
impact on machine learning models.
Ans.
Method | Description | Impact on ML Models
1. Deletion (Listwise/Pairwise) | Remove rows (listwise) or columns (pairwise) with missing values | Simple but can lead to loss of valuable data and biased results if the data are not MCAR (missing completely at random)
2. Mean/Median/Mode Imputation | Replace missing values with the mean, median, or mode of the feature | Preserves dataset size but may reduce variance and distort relationships
3. Constant/Category Imputation | Replace missing values with a constant (e.g., "Unknown" or 0) | Useful for categorical data; may introduce artificial categories
4. Forward/Backward Fill | Propagate the previous (forward) or next (backward) value | Suitable for time-series data; assumes continuity
5. K-Nearest Neighbors (KNN) | Predict missing values using feature similarity to the nearest neighbors | More accurate but computationally expensive and sensitive to feature scaling
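A minimal sketch of mean imputation, one of the methods in the table above, assuming scikit-learn and NumPy; the feature matrix is hypothetical:

```python
# Minimal sketch of mean imputation for missing values (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.impute import SimpleImputer

# np.nan marks the missing entries in this hypothetical feature matrix.
X = np.array([[25.0, 50000.0],
              [30.0, np.nan],
              [np.nan, 62000.0],
              [40.0, 58000.0]])

imputer = SimpleImputer(strategy="mean")   # other options: "median", "most_frequent", "constant"
X_filled = imputer.fit_transform(X)

print(X_filled)   # NaNs replaced by each column's mean
```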

17. What is data normalization? Explain Min-Max normalization with an example.
Ans. Data normalization is a preprocessing technique used to
rescale features to a common range, typically to improve the
performance and convergence of machine learning algorithms.
It ensures that no single feature dominates due to its scale and
makes the model more stable and efficient.
Min-Max Normalization
Min-Max Normalization rescales a feature to a specified range, usually [0, 1], using the following formula:

Xₙₒᵣₘ = (X − Xₘᵢₙ) / (Xₘₐₓ − Xₘᵢₙ)

Where:
 X = original value
 Xmin = minimum value in the feature
 Xmax = maximum value in the feature
Example
Suppose we have the following values for the feature "Age":
Original Age
18
30
50

Xmin = 18
Xmax = 50
Now, apply Min-Max Normalization:
 Age 18: (18 − 18) / (50 − 18) = 0
 Age 30: (30 − 18) / (50 − 18) = 12 / 32 = 0.375
 Age 50: (50 − 18) / (50 − 18) = 1
So the normalized ages are 0, 0.375, and 1.
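A minimal sketch that reproduces the same normalization with scikit-learn's MinMaxScaler, assuming scikit-learn and NumPy are available:

```python
# Minimal sketch of Min-Max normalization for the "Age" example above
# (assumes scikit-learn and NumPy).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[18], [30], [50]])

scaler = MinMaxScaler(feature_range=(0, 1))
normalized = scaler.fit_transform(ages)

print(normalized.ravel())   # [0.    0.375 1.   ]
```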

18. Discuss the importance of noise detection and noise removal techniques.
Ans. Noise (random errors or outliers in the data) distorts the true patterns in a dataset, so detecting and removing it is important because it:
1. Improves model accuracy, since algorithms are not misled by erroneous values.
2. Reduces overfitting, because models do not learn from random fluctuations.
3. Produces cleaner, more reliable statistics and visualizations.
Common techniques include binning (smoothing values within groups), regression-based smoothing, clustering or outlier detection to isolate abnormal points, and filtering based on domain thresholds.
19. Describe the role of scatter plots and heatmaps in
visualizing complex data relationships.
Ans. Scatter plots show the relationship between two numerical variables by plotting each record as a point, making trends, clusters, correlations, and outliers easy to see. Heatmaps use color intensity to represent values in a matrix (for example, a correlation matrix of many features), which makes it easy to spot strong relationships across many variables at once. Together, they help analysts quickly identify which variables are related before building models.
