Data Science
7. Compare between data analytics and data science. (Short Answer)
Ans:
Aspect | Data Analytics | Data Science
Focus | Descriptive & diagnostic analysis | Predictive & prescriptive analysis
Goal | Understand past trends | Build models to predict or automate tasks
Tools | Excel, SQL, Tableau, Power BI | Python, R, ML libraries, big data tools
Skills Needed | Statistics, visualization, business knowledge | Programming, statistics, ML, data engineering
Output | Reports, dashboards | Predictive models, algorithms
8. Compare box plot and histogram. (Short Answer)
Aspect | Box Plot | Histogram
Purpose | Shows summary statistics and outliers | Shows distribution of data frequencies
Data Type | Continuous or ordinal | Continuous or discrete
Displays | Median, quartiles, min, max, outliers | Frequency of data in intervals (bins)
Shape Info | Limited shape information | Good for visualizing shape (e.g., skewness)
Best Use | Comparing distributions between groups | Understanding overall distribution
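The contrast can be seen directly in code. A minimal matplotlib sketch (the sample values are illustrative, not from the original):
import matplotlib.pyplot as plt

data = [21, 22, 23, 24, 25, 26, 27, 28, 60]  # 60 acts as an outlier
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.boxplot(data)        # box plot: median, quartiles, and the outlier
ax1.set_title("Box plot")
ax2.hist(data, bins=8)   # histogram: frequency of values per bin, shows shape/skewness
ax2.set_title("Histogram")
plt.show()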
9. Define Data mining. (Short Answer)
Ans: Data mining is the process of discovering patterns, correlations, and useful information from large datasets using statistical, mathematical, and computational techniques. It involves techniques like clustering, classification, regression, and association rule learning to extract hidden insights.
10. List popular libraries used in Data Science. (Short Answer)
Ans:
· Pandas – Data manipulation and analysis
· NumPy – Numerical computing and arrays
· Matplotlib – Data visualization
· Seaborn – Statistical data visualization (built on Matplotlib)
· Scikit-learn – Machine learning algorithms
· TensorFlow – Deep learning and neural networks
11. Illustrate the use of Data science with example (Short Answer)
Ans: Problem: A retail company wants to predict which products will be popular in the upcoming season to optimize inventory and marketing efforts.
Steps:
Data Collection: Gather historical sales data, customer reviews, seasonal trends, weather data, etc.
Data Cleaning: Handle missing values, remove duplicates, and standardize formats.
Exploratory Data Analysis (EDA): Visualize trends in sales, customer preferences, and correlations with seasonality.
Feature Engineering: Create features like "seasonality score," "customer sentiment score" from reviews, or "weather impact" based on historical data.
Model Building: Use machine learning algorithms like Random Forest or XGBoost to predict product demand for the next season.
Model Evaluation: Test the model using metrics like accuracy and RMSE (Root Mean Squared Error).
Deployment: Integrate the model into the company's inventory system to automatically adjust stock levels based on predictions.
Outcome:
The company optimizes inventory, reduces stockouts, and increases sales by stocking the right products at the right time.
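A hedged sketch of the model-building and evaluation steps above, using a tiny illustrative DataFrame in place of the real sales data (column names such as seasonality_score are assumptions, not from the original):
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative stand-in for the cleaned, feature-engineered sales data
sales_df = pd.DataFrame({
    "seasonality_score": [0.9, 0.2, 0.7, 0.1, 0.8, 0.3, 0.6, 0.4],
    "customer_sentiment": [0.8, 0.4, 0.9, 0.3, 0.7, 0.5, 0.6, 0.2],
    "demand": [120, 40, 110, 30, 100, 50, 90, 35],
})
X = sales_df[["seasonality_score", "customer_sentiment"]]
y = sales_df["demand"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5  # evaluation by RMSE
print("RMSE:", rmse)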
12. Summarize the goals of data science? (Short Answer)
Ans:
· Data Exploration and Insight Discovery: Understand patterns, relationships, and trends within data to derive actionable insights.
· Prediction and Forecasting: Use historical data to build models that predict future events or trends, such as customer behavior or sales forecasting.
· Automation of Processes: Build models and algorithms that automate decision-making, optimizing tasks like inventory management or fraud detection.
· Optimization: Improve processes or systems, such as resource allocation, by using data-driven approaches.
· Data-Driven Decision Making: Support business strategies and decisions by analyzing data and providing recommendations for growth and improvement.
13. Define an outlier. Explain its significance in statistics with an example. (Long Answer)
Ans: An outlier is a data point that significantly differs from the rest of the dataset. It is an extreme value that lies far outside the general pattern of the data.
Significance in Statistics:
Distortion of Mean: Outliers can skew the mean, making it unrepresentative of the data.
Model Impact: Outliers can affect the accuracy of models like regression, leading to inaccurate predictions.
Error Detection: They help identify errors or unusual events in the data.
Statistical Tests: They can affect hypothesis tests and lead to incorrect conclusions.
Example:
In the dataset of ages: [20, 22, 23, 24, 21, 100], the value 100 is an outlier. It skews the mean age higher, but most ages are around 21-24. The outlier might be due to an error or a rare case, and should be handled carefully.
14. Break down the concept of Exploratory Data Analysis (EDA) and illustrate
its significance in statistics. Highlight how EDA facilitates the identification of patterns, detection of outliers, and validation of assumptions using both graphical and statistical methods. (Long Answer)
Ans: Exploratory Data Analysis (EDA)
EDA is the process of analyzing and visualizing data to summarize its main characteristics, often with the help of graphical methods. It is used to understand the structure, distribution, and relationships in the data before applying formal statistical methods or models.
Significance in Statistics:
Identification of Patterns: EDA helps uncover patterns, trends, and relationships in data. By visualizing data through methods like histograms and scatter plots, one can identify key characteristics and form hypotheses.
Detection of Outliers: Outliers can be detected using box plots and scatter plots, which highlight values that fall far outside the normal range. Outliers can distort analysis and models, so identifying them early is crucial.
Validation of Assumptions: Many statistical methods assume normality or linearity. EDA allows validation of these assumptions through histograms, QQ plots, and correlation matrices, ensuring the appropriateness of subsequent models.
Graphical Methods:
Histograms: Show data distribution, helping identify skewness.
Box Plots: Reveal data spread and outliers.
Scatter Plots: Show relationships between two variables.
Statistical Methods:
Summary Statistics: Mean, median, and standard deviation provide insight into the central tendency and variability.
Correlation: Quantifies relationships between variables.
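A brief pandas sketch of these EDA steps on an illustrative DataFrame:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [22, 25, 29, 31, 35, 41, 46, 90],
                   "income": [18, 22, 27, 30, 34, 40, 48, 52]})
print(df.describe())   # summary statistics: mean, std, quartiles
print(df.corr())       # correlation matrix between variables
df.hist()              # histograms to inspect distribution and skewness
plt.figure()
df.boxplot()           # box plots to flag potential outliers (e.g., age = 90)
plt.show()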
15. Analyze how outliers can be identified in a dataset by examining their impact on data distribution. Compare at least two different detection methods, such as the Z-score method and the Interquartile Range (IQR) method, and explain how each method determines outliers. (Long Answer)
Ans: Outliers are data points that deviate significantly from the majority of the data in a dataset. Their presence can distort key statistical measures such as the mean, standard deviation, and correlations, leading to incorrect conclusions. Identifying and handling outliers is essential for accurate analysis.
Detection Methods:
1. Z-Score Method: The Z-score indicates how many standard deviations a data point is from the mean.
Formula: Z = (X − μ) / σ
Where:
X is the data point,
μ is the mean,
σ is the standard deviation.
Outlier Determination: Data points with a Z-score greater than 3 or less than -3 are typically considered outliers (though this threshold can be adjusted depending on the dataset).
2. Interquartile Range (IQR) Method: The IQR measures the range between the 25th percentile (Q1) and the 75th percentile (Q3), which captures the middle 50% of the data.
Formula: IQR = Q3 − Q1
Outlier Determination:
Lower Bound: Q1 − 1.5 × IQR
Upper Bound: Q3 + 1.5 × IQR
Data points outside these bounds are considered outliers.
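Both rules are easy to apply in code; a small NumPy sketch using the age data from Question 13:
import numpy as np

data = np.array([20, 22, 23, 24, 21, 100])
z = (data - data.mean()) / data.std()                    # Z-score method
print("Z-score outliers (|z| > 3):", data[np.abs(z) > 3])
q1, q3 = np.percentile(data, [25, 75])                   # IQR method
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])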
16. Find the probability of throwing two fair dice when the sum is 5 and when the sum is 8. Show your calculation. (Long Answer)
Ans:
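A worked calculation (each die has 6 faces, so there are 6 × 6 = 36 equally likely outcomes):
Sum = 5: (1,4), (2,3), (3,2), (4,1) → 4 outcomes → P(sum = 5) = 4/36 = 1/9 ≈ 0.111.
Sum = 8: (2,6), (3,5), (4,4), (5,3), (6,2) → 5 outcomes → P(sum = 8) = 5/36 ≈ 0.139.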
17. Compare quantitative data and qualitative data? (Long Answer)
Ans:
Aspect | Quantitative Data | Qualitative Data
Definition | Data that can be measured and expressed numerically. | Data that describes qualities or characteristics.
Type | Numeric (e.g., height, weight, temperature). | Categorical (e.g., colors, names, labels).
Examples | Age, salary, distance, time. | Gender, nationality, type of car.
Measurement Scale | Interval or ratio scale (e.g., distance, temperature). | Nominal or ordinal scale (e.g., yes/no, ratings).
Analysis | Mathematical operations (e.g., mean, median, variance). | Non-numeric analysis (e.g., frequency, mode, grouping).
Graphical Representation | Histograms, bar graphs, line charts. | Pie charts, bar graphs, word clouds.
18. Define covariance. Analyze how it helps in understanding the direction and strength of the relationship between two variables. (Long Answer)
Ans: Covariance is a statistical measure that quantifies the relationship between two variables, showing how they change together. It indicates whether an increase in one variable leads to an increase or decrease in another.
Formula: Cov(X, Y) = (1/n) Σ (Xᵢ − X̄)(Yᵢ − Ȳ), summing over i = 1 to n
Where:
X and Y are the two variables,
X̄ and Ȳ are the means of X and Y,
n is the number of data points.
Significance of Covariance: Covariance helps in understanding both the direction and strength of the relationship between two variables.
Direction of Relationship:
Positive Covariance: If the covariance is positive, it indicates that as one variable increases, the other tends to increase as well (positive relationship).
Negative Covariance: If the covariance is negative, it suggests that as one variable increases, the other tends to decrease (negative relationship).
Zero Covariance: A covariance close to zero suggests no linear relationship between the variables.
Strength of Relationship:
Magnitude: The larger the magnitude of the covariance, the stronger the relationship between the variables. However, covariance alone does not provide a standardized measure of strength. The value depends on the scale of the variables, so comparing covariances across datasets with different units can be misleading.
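A quick NumPy check of the covariance formula on two illustrative variables:
import numpy as np

x = np.array([2, 4, 6, 8, 10])
y = np.array([1, 3, 5, 7, 11])
cov_population = np.mean((x - x.mean()) * (y - y.mean()))  # (1/n) Σ (Xi − X̄)(Yi − Ȳ)
print("Covariance (population formula):", cov_population)
print("np.cov (sample formula, divides by n − 1):")
print(np.cov(x, y))   # 2x2 matrix; the off-diagonal entries are Cov(X, Y)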
19. A distribution is skewed to the right and has a median of 20. Will the mean be greater than or less than 20? Explain briefly. (Long Answer)
Ans: In a right-skewed distribution (positively skewed), the mean is pulled in the direction of the longer tail on the right side. Since the median (which represents the central value) is given as 20, the mean will be greater than 20 because the higher values in the tail raise the average.
This happens because extreme values on the right increase the mean more than they affect the median.
20. Define one-sample t-test. Explain when it is used in statistical analysis. (Long Answer)
Ans: A one-sample t-test is a statistical test used to determine whether the mean of a sample is significantly different from a known or hypothesized population mean. It compares the sample mean to the population mean to assess if any observed difference is statistically significant or if it could have occurred due to random chance.
When to Use a One-Sample T-Test:
The one-sample t-test is used in the following scenarios:
When you have a sample and you want to compare its mean to a known population mean (or hypothesized value).
When the population standard deviation is unknown and the sample size is small (typically n < 30).
When the data is approximately normally distributed, as the t-test assumes normality for the sample data, especially when the sample size is small.
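A short SciPy sketch of a one-sample t-test, with illustrative sample values and a hypothesized population mean of 10:
from scipy import stats

sample = [9.8, 10.2, 9.5, 10.1, 9.9, 9.7, 10.0, 9.6]
t_stat, p_value = stats.ttest_1samp(sample, popmean=10)
print("t =", t_stat, "p =", p_value)
# If p < 0.05, the sample mean differs significantly from the hypothesized mean of 10.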
21. Apply your understanding of outliers to identify situations where retaining them is appropriate. Support your response with relevant examples. (Long Answer)
Ans: While outliers are often removed to ensure accurate analysis, there are situations where retaining them is necessary and informative. Here are some scenarios:
True Extreme Values:
Example: In financial analysis, extreme stock prices or large transactions may be legitimate and critical to understanding market behavior.
Reason: Removing these outliers could distort trends and risk analysis, as they represent rare but significant events.
Scientific Research:
Example: In genetics research, some outliers may represent important variations or mutations that offer insight into rare diseases.
Reason: Outliers could reveal important patterns or rare phenomena crucial to the study.
Fraud Detection:
Example: In banking, abnormally large withdrawals could be outliers, but they might indicate fraudulent activity.
Reason: Retaining these outliers can help detect suspicious behaviors or fraud.
Process Monitoring:
Example: In manufacturing, a machine failure that causes a spike in data could be an outlier but indicates a problem needing attention.
Reason: The outlier could point to a process anomaly that requires corrective measures.
22. List the key properties of a normal distribution. Analyze any two of these properties by explaining how they affect the shape and behavior of the distribution. (Long Answer)
Ans: Key Properties of a Normal Distribution:
Symmetry: The normal distribution is symmetrical around its mean.
Bell-Shaped Curve: It has a bell-shaped curve, with the highest point at the mean, and tails that extend infinitely in both directions.
Mean, Median, and Mode Equality: In a normal distribution, the mean, median, and mode are all the same.
68-95-99.7 Rule:
68% of the data lies within one standard deviation of the mean,
95% lies within two standard deviations,
99.7% lies within three standard deviations.
Asymptotic: The tails of the distribution approach the horizontal axis but never touch it.
Defined by Mean and Standard Deviation: The shape of the normal distribution is fully defined by its mean (μ) and standard deviation (σ).
Analysis of Two Properties:
1. Symmetry:
Impact on Shape: The symmetry of a normal distribution means that the curve is identical on both sides of the mean. This property leads to a balanced distribution of data, where half of the values are less than the mean and the other half are greater.
Behavior: Since the distribution is symmetrical, it implies that the probability of observing a value above the mean is equal to the probability of observing a value below the mean. This is crucial in hypothesis testing and confidence interval estimation.
2. 68-95-99.7 Rule:
Impact on Shape: This rule indicates that the majority of the data is concentrated around the mean. For a standard normal distribution (mean = 0, standard deviation = 1), 68% of the data falls within one standard deviation, making the curve steepest at the mean and flatter as you move away from it.
Behavior: This property shows that most values lie close to the mean, and as you move further from the mean, the probability of observing values decreases rapidly.
This is important in predicting probabilities and understanding the spread of data within a normal distribution.
23. Explain different stages of data Science? (Long Answer)
Ans: The main stages are data collection, data cleaning/preprocessing, exploratory data analysis, feature engineering, model building, model evaluation, and deployment (each stage is illustrated by the retail example in Question 11).
24. What is machine learning? Justify its role and importance in data science with a brief explanation. (Long Answer)
Ans: Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn from data and make predictions or decisions without being explicitly programmed.
Role and Importance in Data Science:
Core Analytical Tool: ML provides powerful algorithms (like regression, classification, clustering) to extract patterns from data.
Automates Decision-Making: It helps build models that can automatically predict outcomes (e.g., customer churn, sales forecasts).
Handles Large Data: ML efficiently processes and learns from massive, complex datasets that are difficult to analyze manually.
Supports Real-Time Applications: Powers applications like recommendation systems, fraud detection, and image recognition.
25. What is data preprocessing? Explain its role in data analysis and describe common methods used for preprocessing data. (Long Answer)
Ans: Data preprocessing is the process of cleaning and transforming raw data into a suitable format for analysis or modeling.
Role in Data Analysis:
Ensures data quality and consistency.
Removes errors, missing values, and noise.
Enhances the accuracy and performance of models.
Makes data ready for statistical or machine learning tasks.
Common Methods of Data Preprocessing:
Missing Value Treatment – Filling or removing missing data.
Data Cleaning – Removing duplicates, correcting errors.
Normalization/Scaling – Standardizing feature ranges.
Encoding Categorical Data – Converting categories into numerical values (e.g., one-hot encoding).
Outlier Detection and Handling – Identifying and treating extreme values.
Data Transformation – Log, square root, or other transformations to reduce skewness.
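A compact pandas/scikit-learn sketch combining several of these preprocessing steps (the small DataFrame and column names are illustrative):
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, None, 32, 40, 40],
                   "city": ["Delhi", "Pune", "Delhi", None, "Pune"],
                   "income": [30000, 42000, 42000, 58000, 1000000]})
df = df.drop_duplicates()                                      # data cleaning
df["age"] = df["age"].fillna(df["age"].median())               # missing value treatment
df = pd.get_dummies(df, columns=["city"])                      # one-hot encoding
df["income"] = StandardScaler().fit_transform(df[["income"]]).ravel()  # scaling
print(df)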
MODULE 2
1. Define a population and a sample in the context of statistical analysis. (Short Answer)
Ans: Population: A population is the complete set of individuals or items of interest (e.g., all students in a university).
Sample: A sample is a subset drawn from the population used for analysis (e.g., 200 selected students).
Sampling allows for efficient estimation of population characteristics.
2. List any two types of probability distributions and provide a brief example of each. (Short Answer)
Ans:
· Binomial Distribution: This distribution is used when there are two possible outcomes (success or failure) in a fixed number of independent trials.
Example: Tossing a coin 10 times and counting the number of heads.
· Normal Distribution: This is a continuous distribution that is symmetric about the mean, often called a bell curve.
Example: Heights of adult humans typically follow a normal distribution.
3. Explain the difference between a parameter and a statistic with an example. (Short Answer)
Ans:
Aspect | Parameter | Statistic
Definition | A value that describes a characteristic of a population | A value that describes a characteristic of a sample
Based on | Entire population | Subset (sample) of the population
Example | Average height of all students in a school | Average height of 50 randomly selected students
4. Describe how a probability distribution helps in statistical modeling. (Short Answer)
Ans: A probability distribution models how likely different outcomes are, helping to:
Quantify uncertainty,
Make predictions,
Support statistical inference (e.g., hypothesis testing and confidence intervals).
5. Interpret what it means when a model is said to have a "good fit" in statistical modeling. (Short Answer)
Ans: A model has a good fit when it accurately represents the relationship between variables in the data:
Residuals are small and random,
Predictions closely match observed values,
High R² value (in regression), indicating high explanatory power.
6. A researcher collects a sample of students' heights from a university. Classify whether the mean height of the sample is a parameter or a statistic, and justify your answer. (Short Answer)
Ans: It is a statistic because it is calculated from a sample (not the entire population).
Parameters are based on entire populations, which are often impractical to measure.
7. A dataset follows a normal distribution with a mean of 50 and a standard deviation of 5. Calculate the probability of getting a value greater than 60 using the empirical rule. (Short Answer)
Ans:
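A worked calculation using the empirical rule: 60 is (60 − 50) / 5 = 2 standard deviations above the mean. About 95% of the data lies within ±2σ, so roughly 5% lies outside that range, split equally between the two tails. Hence P(X > 60) ≈ 5% / 2 = 2.5% (about 0.025).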
8. Given a dataset, a researcher fits a linear model to predict sales based on advertising expenditure. Illustrate how they can check whether the model provides a good fit using residual analysis.
Ans:
· Plot residuals vs. predicted values.
· If residuals are randomly scattered (no pattern), the model is a good fit.
· Patterns or non-random spread suggest model misspecification or non-linearity (a residual-plot sketch of this check follows below). process and interpreting its significance through an example. (Long Answer)
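A minimal matplotlib/scikit-learn sketch of that residual check (the advertising/sales numbers are illustrative):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

ads = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
sales = np.array([12, 15, 21, 24, 30, 33, 40, 41])
model = LinearRegression().fit(ads, sales)
residuals = sales - model.predict(ads)
plt.scatter(model.predict(ads), residuals)   # residuals vs. predicted values
plt.axhline(0, color="red")                  # random scatter around 0 suggests a good fit
plt.xlabel("Predicted sales")
plt.ylabel("Residual")
plt.show()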
9. A researcher fits a linear regression model to predict monthly sales based on Ans:The mean (or average) is a measure of central tendency that represents the
advertisement spending. Differentiate between systematic and random errors in the typical value in a data set.
model, and analyze how these errors affect prediction accuracy. (Short Answer) Calculation Process:
Ans:·Systematic errors: Persistent bias due to flawed assumptions or data (e.g., Add all data values.
missing a key variable). These skew predictions consistently. Divide the total by the number of values.
·Random errors: Unpredictable variations caused by unknown factors. These cause Formula:
fluctuations around the true value. Mean=Sum of all values \Number of values
Systematic errors reduce accuracy; random errors reduce precision. Example:
10.Evaluate the effectiveness of the median over the mean in specific data scenarios. Test scores: 70, 80, 90, 85, 75
Justify your reasoning with appropriate examples. (Long Answer) Mean=70+80+90+85+75/5=400/5=80
Ans:The median is often more effective than the mean in scenarios with skewed data Interpretation:The mean score is 80, meaning that on average, students scored 80 on
or outliers, as it better represents the central tendency by not being affected by the test.
extreme values. Significance:The mean provides a summary of the data's overall level, useful for
Example 1 (Income data): comparison and analysis. However, it can be influenced by extreme values (outliers),
Consider incomes: $30k, $35k, $40k, $45k, $1 million. so context matters.
Mean = $230k (skewed by outlier) 13. Analyze the different types of mean by classifying them and interpreting their
Median = $40k (more representative of most people's income) relevance with appropriate examples. (Long Answer)
Example 2 (Test scores): Ans:
Scores: 10, 20, 30, 40, 100 14. Analyze the characteristics and statistical significance of a bell-curve
Mean = 40
distribution by deconstructing its shape, features, and implications. (Long
Median = 30
If 100 is an outlier, median better reflects typical performance. Answer)
Conclusion: Use median when data is non-symmetric or contains outliers; use Ans:A bell-curve distribution, also known as a normal distribution, is a
mean when data is symmetrical and clean. fundamental concept in statistics with several key characteristics and implications.
11. Analyze the relationship between a population and a sample in inferential Shape & Features:
statistics, and examine their respective roles through examples. (Long Answer) Symmetrical: The curve is mirrored around the mean.
Unimodal: One peak at the center (mean = median = mode).
Ans:In inferential statistics, a sample is a subset of a population, and it is used to
Tails extend infinitely: But never touch the x-axis.
draw conclusions or make predictions about the entire population.
68-95-99.7 Rule (Empirical Rule):
Relationship:
68% of data within 1 standard deviation (σ) of the mean
The population includes all members of a group (e.g., all college students in a
95% within 2σ
country).
99.7% within 3σ
A sample is a smaller group selected from the population (e.g., 500 students
Statistical Significance:
from different colleges).
Predictability: Allows for probability estimation (e.g., likelihood of a test
Roles:
score).
Population is the target of the inference.
Benchmarking: Often used in grading, IQ tests, and standardized exams.
Sample provides the data used to estimate population characteristics (e.g., mean,
Foundation for inference: Many statistical tests (like t-tests and z-tests)
proportion).
assume normality.
Example:
Example:
If you want to know the average study time of college students in India:
If the mean SAT score is 1000 with σ = 100:
You can't ask every student (population is too large).
A score of 1100 is within 1σ → above average.
Instead, survey a sample (e.g., 500 students).
A score of 1300 is 3σ → exceptional (top ~0.15%)
Use the sample mean to estimate the population mean.
15. Analyze and contrast the characteristics of left-skewed and right-skewed
Conclusion:The sample acts as a tool to study the population. Inferential statistics use
sample data to make educated guesses about population parameters. distributions. Interpret their shapes using diagrams and explain their implications in
12. Analyze the statistical concept of the mean by breaking down its calculation data analysis. (Long Answer)
Ans: Role:
Feature Left-Skewed (Negative) Right-Skewed (Positive) Inferential statistics go a step further by using sample data to make predictions or
Right (toward higher inferences about a larger population. This is essential when it's impractical or
Tail Direction Left (toward lower values) impossible to collect data from every member of a population.
values)
Example:
Order of A political analyst surveys 1,000 people out of a population of 1 million to estimate
Mean < Median < Mode Mode < Median < Mean
Mean/Median/Mode voting preferences. Based on the sample, the analyst infers that 60% of the population
Shape Peak on right, tail on left Peak on left, tail on right supports a particular candidate. This is a classic case of inferential statistics.
Retirement age, exam Significance:
Example Income, housing prices
failures Inferential statistics are vital for:
Best Measure of Center Median Median Making decisions under uncertainty
Mean underestimates typical Mean overestimates typical Predicting future trends or behaviors
Implication Testing theories or assumptions based on limited data
value value
17. Imagine that Jeremy took part in an examination. The test is having a mean score
of 160, and it has a standard deviation of 15. If Jeremy’s z-score is 1.20, Evaluate his
score on test. (Long Answer)
Ans:
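A worked evaluation: z = (x − μ) / σ, so x = μ + z·σ = 160 + 1.20 × 15 = 160 + 18 = 178. Jeremy's test score is 178.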
18. Analyze a left-skewed distribution with a median of 60 to draw conclusions
about the relative positions of the mean and the mode. Explain the relationship
among these measures of central tendency based on the skewness of the
distribution. (Long Answer)
Ans:In a left-skewed distribution (also called negatively skewed), the tail of the
distribution extends more toward the lower values (left side). Given that the median is
60, here's how the mean and mode relate to it:
Mean: The mean will be less than the median in a left-skewed distribution.
16. Evaluate the roles of descriptive and inferential statistics in data analysis. This is because the long left tail pulls the mean down. The mean is sensitive to
Justify the significance of each through practical examples. (Long Answer) extreme values, so outliers on the lower end will reduce the mean value.
Ans:1. Descriptive Statistics Mode: The mode, which is the most frequent value in the dataset, is usually
Role: greater than the median in a left-skewed distribution. This is because the
Descriptive statistics are used to summarize, organize, and present data in a majority of the data points are clustered toward the higher values on the right
meaningful way. They help to describe the basic features of a dataset, offering a clear side of the distribution.
picture of what the data shows without making any generalizations or predictions Summary of the relationship:
beyond the data itself. Mode > Median > Mean in a left-skewed distribution.
Example: The mean is pulled to the left by the skewness, making it the smallest measure
Suppose a teacher collects test scores from her class of 30 students. She calculates the of central tendency, while the mode is the largest, as it is typically positioned
average (mean) score as 75, the highest score as 95, and the standard deviation as 10. near the peak of the distribution.
This summary gives her a snapshot of class performance—this is descriptive 19. Evaluate the effect of outliers on statistical measures such as mean and
statistics in action. standard deviation. Justify when and why outliers may distort data interpretation
Significance: using examples. (Long Answer)
Descriptive statistics help in: Ans:Outliers can significantly impact statistical measures like the mean and standard
Understanding patterns in data deviation:
Identifying outliers or trends Mean: The mean is sensitive to outliers because it's calculated by summing all values.
Presenting data clearly for reporting or communication Outliers can skew the mean toward extreme values, making it unrepresentative of the
2. Inferential Statistics majority of the data.
Example: In the dataset (10, 12, 14, 16, 1000), the mean is disproportionately high False Positive (FP = 15): The model incorrectly predicted that 15 people had the
due to the outlier, making it misleading about the typical value. disease when they did not (false alarms).
Standard Deviation: Outliers increase the spread of data, leading to a higher standard True Negative (TN = 95): The model correctly predicted that 95 people did not have
deviation. This makes it seem like the data is more variable than it actually is. the disease.
Example: In datasets with and without outliers, the dataset with an outlier will show a Model Performance Metrics from the Confusion Matrix:
much higher standard deviation, which could distort interpretations of data variability. From the confusion matrix, we can derive various performance metrics to evaluate the
When and Why Outliers Distort Data Interpretation: model:
Skewed Conclusions: Outliers can lead to skewed interpretations, especially when Accuracy: The proportion of correctly predicted instances.
making decisions based on the mean or standard deviation. For instance, if a business Accuracy=TP+TN\TP+TN+FP+FN
is using the mean income to set salary levels, outliers could lead to unreasonably high Precision (Positive Predictive Value): The proportion of positive predictions that are
salaries. actually correct.
Misleading Comparisons: When comparing datasets or groups, outliers can give the Precision=TP\TP+FP
false impression that one group has more variability or a higher average than it Recall (Sensitivity, True Positive Rate): The proportion of actual positives that are
actually does. correctly identified.
Data Cleaning: In many cases, it's important to identify and decide whether outliers Recall=TP\TP+FN
should be removed or treated separately. They may represent important but rare F1 Score: The harmonic mean of precision and recall, balancing both metrics.
occurrences (e.g., a $100,000,000 income is a legitimate but rare outlier in an income F1=2×(Precision×Recall\Precision+Recall)
dataset), but they should not unduly influence overall statistical analyses. 21. Analyze the role of sampling in statistical investigations by comparing different
20. Analyze the structure and components of a confusion matrix in classification sampling techniques. Illustrate each method with examples and assess their
tasks. Interpret its elements (TP, FP, TN, FN) with an example to assess model applicability in various scenarios. (Long Answer)
performance. (Long Answer) Ans:1. Simple Random Sampling (SRS)
Ans:A confusion matrix is a table used to evaluate the performance of a classification Description: Every individual in the population has an equal chance of being selected.
model. It compares the predicted labels with the actual labels, showing how many Example: A researcher randomly selects 100 students from a school’s list of 500
correct and incorrect predictions were made. The confusion matrix is typically students to study their academic performance.
structured as follows: Applicability: Ideal for situations where the population is homogeneous and every
Components of a Confusion Matrix: member has an equal chance of being selected. Works well when there is no clear
True Positive (TP): The number of correct predictions where the model correctly subgroup structure in the population.
predicts the positive class. Pros:
False Positive (FP): The number of incorrect predictions where the model incorrectly Easy to understand and implement.
predicts the positive class (type I error). Results in an unbiased sample.
True Negative (TN): The number of correct predictions where the model correctly Cons:
predicts the negative class. Can be impractical for large populations.
False Negative (FN): The number of incorrect predictions where the model May not represent smaller subgroups well.
incorrectly predicts the negative class (type II error). 2. Systematic Sampling
Example: Description: A sample is selected by choosing every kth individual from the
Consider a binary classification model used to predict whether a person has a disease population after selecting a random starting point.
(Positive) or not (Negative), with actual values (True labels) and predicted values Example: A researcher decides to survey every 10th customer in a store’s queue,
(Predicted labels): starting from a random point.
Actual \ Predicted Positive (1) Negative (0) Applicability: Useful when a list of the population is available, and the researcher
Positive (1) 80 (TP) 10 (FN) wants a quick and simple sampling method. It works well when there’s no hidden
Negative (0) 15 (FP) 95 (TN) pattern in the population.
Interpretation: Pros:
True Positive (TP = 80): The model correctly predicted 80 cases of the disease. Simple and faster than simple random sampling.
False Negative (FN = 10): The model incorrectly predicted that 10 people did not Ensures even coverage of the population.
have the disease when they actually did (missed diagnoses). Cons:Can be biased if there's an underlying pattern in the population (e.g., a queue
system where every 10th customer has the same characteristics).
3. Stratified Sampling Example: A researcher decides to interview 50 men and 50 women in a city to
Description: The population is divided into subgroups (strata) based on specific understand their shopping habits, ensuring gender balance.
characteristics (e.g., age, gender, income level), and a sample is randomly selected Applicability: Useful when the researcher wants to ensure certain demographic
from each stratum. groups are represented in the sample but doesn't have the resources for random
Example: A researcher wants to survey employees about job satisfaction in a sampling.
company. The company is divided into strata based on departments (HR, marketing, Pros:
finance), and a random sample is taken from each department. Ensures diversity in the sample.
Applicability: Ideal when the population has distinct subgroups that may vary Easier and cheaper than random sampling.
significantly, and the researcher wants to ensure representation from all subgroups. Cons:
Pros:Ensures proportional representation of all key subgroups. Can be biased if the selection process is not random within each quota group.
Cons:More complex to implement and requires knowledge about the population The sample may not represent the overall population accurately.
structure. 7. Snowball Sampling
4. Cluster Sampling Description: A non-probability sampling method where existing participants recruit
Description: The population is divided into clusters (often geographically), and then a new participants, often used in hard-to-reach or niche populations.
random sample of clusters is selected. All members of the chosen clusters are Example: A researcher studying the experiences of rare disease patients may start by
surveyed. interviewing a few known patients, who then refer other patients.
Example: A researcher wants to study household energy usage across a country. They Applicability: Ideal for studying hidden populations or when there’s no clear list of
randomly select 10 cities (clusters) and survey all households in those cities. the population (e.g., drug users, homeless individuals).
Applicability: Useful when the population is geographically spread out, or when it’s Pros:
difficult to compile a complete list of the population. Often used in large-scale surveys Useful for accessing populations that are difficult to identify or reach.
or government studies. Can be helpful in exploratory research.
Pros: Cons:
Cost-effective and easier to manage when the population is geographically Prone to bias, as the sample may not be representative of the larger population.
dispersed. Limited to social networks or groups.
Requires fewer resources than surveying a random sample from the entire 22. Explain the normal distribution with examples. Describe its properties and
population. applications in data analysis. (Long Answer)
Cons: Ans:The normal distribution is a continuous probability distribution that is
May introduce bias if the clusters are not homogeneous. symmetric and bell-shaped, commonly used in statistics. It is defined by its mean (μ)
Less precise than stratified sampling. and standard deviation (σ), and many real-world variables, like heights or exam
5. Convenience Sampling scores, follow this distribution.
Description: The sample is selected based on what is easiest for the researcher (e.g., Properties of the Normal Distribution:
surveying people nearby or accessible). Symmetry: The distribution is symmetrical around the mean.
Example: A researcher surveys students in a class to understand their opinions about Bell-shaped: The curve is highest at the mean and tapers off towards the tails.
online education. 68-95-99.7 Rule:
Applicability: Often used in exploratory research or when time, cost, or access to the 68% of data lies within one standard deviation of the mean.
population is limited. 95% lies within two standard deviations.
Pros: 99.7% lies within three standard deviations.
Quick, easy, and inexpensive. Mean, Median, Mode: All are equal and located at the center.
Can be used for preliminary research. Asymptotic: The tails of the distribution approach but never reach the
Cons: horizontal axis.
Highly prone to bias, as it does not represent the broader population well. Example:
Results may not be generalizable to the entire population. For a set of student exam scores with a mean of 75 and a standard deviation of 10,
6. Quota Sampling most students would score near 75, with fewer students scoring very high or low.
Description: The researcher selects participants to fulfill certain quotas, ensuring Applications in Data Analysis:
representation of key demographic factors. It’s similar to stratified sampling, but the Hypothesis Testing: Many statistical tests assume normality (e.g., z-tests).
selection is non-random. Confidence Intervals: Used to estimate population parameters.
Z-Scores: Standardize data to compare different datasets. Underfitting is the opposite of overfitting. It happens when a model is too simple to
Quality Control: Applied to monitor consistency in manufacturing. capture what’s going on in the data.
Finance: Models returns on investment and risk assessment. For example, imagine drawing a straight line to fit points that actually follow a
23. Draw and explain the bias-variance tradeoff in machine learning. Interpret its curve. The line misses most of the pattern.
impact on model performance. (Long Answer) In this case, the model doesn’t work well on either the training or testing data.
Ans:The bias-variance tradeoff is a fundamental concept in machine learning,
representing the relationship between the model's complexity and its performance.
Bias: The error introduced by approximating a real-world problem with a simplified
model. A model with high bias makes strong assumptions and underfits the data,
leading to systematic errors. As model complexity decreases, bias increases.
Variance: The error introduced by the model's sensitivity to small fluctuations in the
training data. A model with high variance can overfit, capturing noise rather than the
true patterns. As model complexity increases, variance increases.
Impact on Model Performance:
High bias (underfitting) occurs when the model is too simple, resulting in poor
performance on both training and test data.
High variance (overfitting) happens when the model is too complex, performing well
Underfitting : Straight line trying to fit a curved dataset but cannot capture the
on training data but poorly on new, unseen data.
data’s patterns, leading to poor performance on both training and test sets.
The goal is to find a balance where both bias and variance are minimized, leading to
Overfitting: A squiggly curve passing through all training points, failing to
the best model performance. The total error is the sum of bias and variance, and the
generalize performing well on training data but poorly on test data.
optimal model complexity lies where this total error is lowest.
Appropriate Fitting: Curve that follows the data trend without overcomplicating
to capture the true patterns in the data
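A small scikit-learn sketch contrasting an underfit (degree-1) and an overfit (high-degree) polynomial on noisy curved data; the dataset and degrees are illustrative:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.linspace(0, 1, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(scale=0.2, size=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
for degree in (1, 15):   # degree 1 tends to underfit, degree 15 tends to overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    print(degree, "train R2:", model.score(X_train, y_train), "test R2:", model.score(X_test, y_test))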
25. Apply the concept of p-values in a hypothesis testing scenario and interpret the
results using a suitable example. (Long Answer)
Ans:In hypothesis testing, the p-value helps determine the strength of evidence
against the null hypothesis (H₀). It represents the probability of obtaining results as
extreme as the observed, assuming H₀ is true.
Example Scenario:
A company claims that their battery lasts at least 10 hours on average. A researcher
believes the true average is less than 10 hours, and tests this by taking a sample of 30
batteries.
H₀ (null hypothesis): μ = 10 hours
24. Identify underfitting and overfitting in machine learning through model behavior H₁ (alternative hypothesis): μ < 10 hours
and apply your understanding to interpret their impact on prediction accuracy, using After testing, the researcher finds the sample mean is 9.6 hours, and the p-value is
diagrams where appropriate. (Long Answer) 0.03.
Ans: Overfitting in Machine Learning Interpretation:
Overfitting happens when a model learns too much from the training data, including If the significance level (α) is 0.05:
details that don’t matter (like noise or outliers). Since p = 0.03 < 0.05, we reject H₀.
For example, imagine fitting a very complicated curve to a set of points. The There is significant evidence that the battery lasts less than 10 hours
curve will go through every point, but it won’t represent the actual pattern. on average.
As a result, the model works great on training data but fails when tested on new If p > 0.05, we would fail to reject H₀, meaning the data doesn't provide strong
data. enough evidence to dispute the company's claim.
2. Underfitting in Machine Learning 26. Use Bayes’s theorem to solve a probability-based problem and explain the
steps involved with a real-life example. (Long Answer)
Ans: Encoding it using one-hot or dummy variables allows the model to understand
27. Analyze the application of Bayes’ Theorem in the Naive Bayes algorithm and different contract types numerically and associate them with churn risk.
illustrate how the algorithm functions in a classification task using an example.
(Long Answer)
Ans: MODULE 3
1. A company is using two different probability distributions to model customer
28. Apply one-hot encoding and dummy variable techniques to a given
purchasing behavior. One follows a normal distribution, while the other follows a
dataset and demonstrate their use with examples. (Long Answer) Poisson distribution. Critique which distribution is more appropriate for modeling
Ans:One-Hot Encoding daily purchase counts and justify your reasoning with statistical properties. (Short
Definition: One-hot encoding transforms each categorical value into a new binary Answer)
column, representing the presence (1) or absence (0) of each category. Ans:The Poisson distribution is more appropriate because it models count data—
Example daaset: specifically, the number of events (purchases) in a fixed time interval.
ID Color Poisson handles discrete, non-negative values, while Normal assumes continuous,
1 Red symmetric data.
2 Blue Daily purchase counts typically follow Poisson, especially when the counts are low or
3 Green vary over time.
2. Define unconstrained multivariate optimization and provide an example. (Short
4 Red
Answer)
After one hot en-coding: Ans: Unconstrained multivariate optimization involves finding the minimum or
ID Color_Red Color_Blue Color_Green maximum of a function with multiple variables, without any constraints.
1 1 0 0 Example: Minimize f(x,y)=x^2+y^2
2 0 1 0 Solution: The minimum occurs at (0, 0), where the gradient is zero.
3 0 0 1 3. List two key differences between equality and inequality constraints in
optimization. (Short Answer)
4 1 0 0
Ans:Two key differences between equality and inequality constraints in
Dummy Variable Encoding
optimization:
Definition: Dummy encoding is a form of one-hot encoding that drops one column
Equality constraints (e.g., g(x) = 0) require exact satisfaction;
to avoid the dummy variable trap — a situation where predictors are perfectly
inequality constraints (e.g., h(x) ≤ 0) require a limit not to be
multicollinear (i.e., one column can be predicted from others).
Dummy Variable Result (Red dropped):
ID Color_Blue Color_Green
exceeded.
Lagrange multipliers are used for equality constraints; Karush-Kuhn-Tucker
(KKT) conditions are used for inequality constraints.
1 0 0
2 1 0 4. Explain the role of the gradient in gradient descent optimization. (Short Answer)
3 0 1 Ans:The gradient points in the direction of steepest ascent of the function.
4 0 0 Gradient descent uses the negative gradient to iteratively update variables to
minimize the function.
Encoding
Use Case It guides the model towards local minima step-by-step.
Type
5. Describe how Lagrange multipliers are used to handle equality
Ideal for tree-based models (e.g., Decision Trees, constraints in optimization. (Short Answer)
One-Hot
Random Forest, XGBoost) that are not affected by Ans:Lagrange multipliers help solve constrained optimization problems by converting
Encoding
multicollinearity. them into an unconstrained form. Given a function f(x, y, ...) to maximize or minimize
Dummy Preferred for linear models (e.g., Linear or Logistic under an equality constraint g(x, y, ...) = 0, the method introduces a multiplier λ
Encoding Regression) where multicollinearity can affect the model. (Lagrange multiplier) to construct the Lagrangian function:
Real-World Application L(x,y,...,λ)=f(x,y,...)+λg(x,y,...)
In a customer churn prediction model, a feature like Contract Type = [Monthly, By solving the gradient equations ∇L=0, we ensure that the constraint g(x, y, ...) = 0
One Year, Two Year] is categorical. is satisfied while optimizing f(x, y, ...).
6. Interpret what happens when the learning rate in gradient descent is set too high Common Evaluation Metrics for Decision Tree Classifiers
or too low. (Short Answer) Accuracy:Measures the proportion of correctly predicted instances.
Ans:· Too high: The model may overshoot minima, fail to converge, or diverge. Formula:Accuracy=TP+TN \ TP+TN+FP+FN
· · Too low: Convergence becomes very slow, increasing computation time. Use: Works well when classes are balanced.
Optimal learning rate balances speed and stability. Confusion Matrix:A summary of prediction results:
7. A function has the form f(x, y) = x^2 + 2xy + y^2. Use the gradient descent TP (True Positive): Correctly predicted positive.
method to determine the direction of the steepest descent at the point (1,1). FP (False Positive): Incorrectly predicted positive.
Ans: TN (True Negative): Correctly predicted negative.
8. A company wants to minimize the cost function C(x,y)=x^2+y^2, subject to the FN (False Negative): Incorrectly predicted negative.
constraint that the total resource allocation satisfies x^2 + y^2 = 4. Use the Lagrange Precision:Indicates how many predicted positives were correct.
multiplier method to find the optimal values of x and y. Precision=TP \ TP+FP
Ans: Recall (Sensitivity):Shows how many actual positives were identified.
9. A machine learning model is trained using gradient descent. Illustrate how the Recall=TP \ TP+FN
model updates its weights iteratively using the learning rule. (Short Answer) F1 Score:Harmonic mean of precision and recall.
Ans:Weights www are updated iteratively using: F1 Score=2⋅Precision*Recall \ Precision+Recall
Wnew=Wold−η⋅∇L(w)
Use: Especially helpful for imbalanced datasets.
where η is the learning rate and ∇L(w) is the gradient of the loss function.
ROC Curve and AUC Score
This process continues until the loss converges to a minimum.
ROC Curve: Plots True Positive Rate vs. False Positive Rate.
10. A function f(x, y) has multiple local minima. Analyze how different
AUC (Area Under Curve): Measures model’s ability to distinguish between classes.
initialization points in gradient descent impact convergence to the global minimum.
Closer to 1 = better.
Ans: If f(x, y) has multiple local minima, gradient descent may converge to
different minima depending on the starting point.
Example: Titanic Dataset
Poor initialization may lead to local, not global, minima.
Using the Titanic dataset:
Multiple runs with different initial points or using stochastic approaches can help
A decision tree classifier predicts if a passenger survived.
find better minima.
Evaluation results:
11. Two optimization algorithms, Gradient Descent and Newton’s Method, are used
Accuracy: 82%
for unconstrained multivariate optimization. Compare their efficiency in terms of
Precision: 78%
convergence speed and computational cost, and justify which method would be more
Recall: 71%
suitable for large scale machine learning problems.
F1 Score: 74%
Ans:
AUC Score: 0.85
Aspect Gradient Descent Newton’s Method
Interpretation:
Slower, linear (depends on Faster (quadratic, if near The model performs well overall.
Convergence Speed
learning rate) minimum) Good at identifying survivors (high recall).
High (requires Hessian and Balanced performance (F1 score is strong).
Computational Cost Low per iteration
matrix inverse) Some tuning (e.g., pruning or feature selection) may improve performance.
Suitability for Large Better (less resource- 13. Analyze the impact of different imputation techniques on a given dataset with
Not ideal for large datasets
Scale intensive) missing values by comparing their outcomes based on relevant criteria. (Long
Answer)
12. Analyze the results of a decision tree classifier implemented on a real-world
Ans:Imputation techniques handle missing data differently, affecting statistical
dataset by examining its performance through suitable evaluation metrics. (Long
properties, model performance, and bias. Below is a comparison based on key criteria:
Answer)
1. Mean/Median/Mode Imputation:
Ans:A Decision Tree Classifier is a supervised machine learning algorithm used for
classification tasks. After training it on a real-world dataset (like Iris, Titanic, or loan Pros: Simple, fast, preserves data size.
prediction), we must evaluate its performance to assess its effectiveness and Cons: Reduces variance, distorts correlations, ignores relationships.
reliability. Best for: Small missingness (<5%), numerical data.
2. K-Nearest Neighbors (KNN) Imputation: Median
Pros: Uses similarity between samples, better for relationships. Metric Mean Imp. KNN Imp.
Imp.
Cons: Computationally heavy, sensitive to outliers.
Best for: Small to medium datasets with meaningful patterns. Accuracy (%) 78.2 79.1 82.5
3. Multiple Imputation (MICE - Multivariate Imputation by Chained RMSE 4.25 4.12 3.68
Equations):
Pros: Accounts for uncertainty, preserves variance, robust. F1-Score 0.76 0.77 0.82
Cons: Complex, slower, requires more tuning.
R² Score 0.71 0.73 0.79
Best for: Datasets with multivariate dependencies.
4. Regression-Based Imputation: Speed Fastest Fast Slow
Pros: Uses feature relationships, more accurate than mean.
Cons: Overfits if relationships are weak. 15. Develop a real-world scenario that illustrates an alternative hypothesis in
Best for: Linear relationships in data. hypothesis testing. Critically evaluate and justify the formulation of the hypothesis
5. Forward/Backward Fill (Time-Series Data) based on the scenario's context and underlying data. (Long Answer)
Pros: Maintains temporal order. Ans:Scenario:A pharmaceutical company claims its new drug (Drug X) lowers
Cons: Introduces bias if data isn’t sequential. cholesterol more effectively than the current standard (Drug Y).
Best for: Time-series with ordered missingness. Hypothesis Testing:
6. MissForest (Random Forest-Based Imputation): Null (H₀): Drug X = Drug Y (no difference in cholesterol reduction)
Pros: Handles non-linearities, works for mixed data types. Alternative (H₁): Drug X > Drug Y (Drug X is more effective)
Cons: Slow for large datasets. Justification:
Best for: Complex, non-linear datasets. Context: Prior studies suggest Drug X’s mechanism should outperform Drug Y.
14. Analyze the effect of different imputation techniques (such as mean, median,
Data: Randomized trial with pre/post cholesterol measurements (quantitative,
and KNN imputation) on model performance when applied to a dataset containing
normally distributed).
missing values. Compare their outcomes using appropriate performance metrics
Directionality: One-tailed test (H₁ uses ">") because only superior efficacy is
(Long Answer)
clinically meaningful.
Ans:Impact of Imputation Techniques on Model Performance:
Risk: Type I error (false claim of superiority) is costlier than Type II (missing a true
Mean Imputation:
effect), so α=0.01 is set.
Effect: Introduces bias, reduces variance
Critical Evaluation:
Performance: Decreases accuracy (especially for linear models)
Strengths: Aligns with biological plausibility and uses rigorous experimental data.
Best for: Quick solutions with minimal missing data (<5%)
Limitations: Assumes normal distribution; non-inferiority testing might be safer if
Median Imputation:
small margins matter.
Effect: More robust to outliers than mean
16. Critically evaluate the different types of biases encountered during the sampling
Performance: Slightly better than mean, but still suboptimal process by analyzing their impact on research validity. Support your evaluation with
Best for: Skewed data distributions real-life examples for each type of bias. (Long Answer)
KNN Imputation: Ans: Selection Bias
Effect: Preserves data relationships Definition: When the sample is not representative of the target population.
Performance: Highest accuracy (2-5% improvement over mean/median) Impact: Invalid generalizations, skewed results.
Best for: Critical applications where accuracy matters Example:1936 Literary Digest Poll predicted Landon would beat Roosevelt in the
U.S. election. They sampled only car owners and telephone users (wealthier groups),
missing broader voter sentiment. Result: Wrong prediction.
Recall measures the proportion of actual positive cases that the model correctly with a simple dataset.
identifies. It’s crucial when missing a positive case is costly (e.g., failing to detect a Ans:from sklearn.linear_model import LogisticRegression
fraud transaction). from sklearn.datasets import load_iris
Scenarios: from sklearn.model_selection import train_test_split
Medical Testing: Imagine a model that predicts whether a patient has a rare disease. X, y = load_iris(return_X_y=True)
If the disease occurs in 1% of the population, the model could predict "no disease" for X_train, X_test, y_train, y_test = train_test_split(X, y == 0, test_size=0.2)
everyone and still achieve 99% accuracy, missing every actual case of the disease. model = LogisticRegression()
Here, recall is more important—it's crucial that the model identifies as many true model.fit(X_train, y_train)
cases of the disease as possible, even if it means accepting a few false positives (lower print("Accuracy:", model.score(X_test, y_test))
precision).
Fraud Detection: In fraud detection, where fraudulent transactions make up a small 8. Given a dataset of customer transactions, apply the k-means clustering
percentage of the total transactions, a model that predicts "no fraud" most of the time algorithm to segment customers into three groups and briefly describe the steps.
could still have high accuracy. However, such a model wouldn't help in catching Ans:·Preprocess data (e.g., scale features).
fraudulent transactions. Here, both precision (to avoid false alarms) and recall (to · Use KMeans(n_clusters=3) from sklearn.cluster.
catch as many fraudulent transactions as possible) are more meaningful. · Fit the model and obtain cluster labels.
MODULE 4 · Analyze cluster profiles to understand customer segments.
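A minimal scikit-learn sketch of those k-means steps on illustrative transaction features (visits and spend):
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[5, 100], [6, 120], [40, 900], [42, 950], [20, 400], [22, 420]])
X_scaled = StandardScaler().fit_transform(X)          # preprocess: scale features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print("Cluster labels:", kmeans.labels_)              # segment assignment per customer
print("Cluster centers:", kmeans.cluster_centers_)    # profile of each segment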
1. Define logistic regression and mention one real-world application. 9. Given an imbalanced dataset, analyze whether logistic regression or k-NN would
Ans:Logistic regression is a classification algorithm that models the probability be more effective and justify your answer.
of a binary outcome using the sigmoid function. Ans:Logistic Regression is generally more effective for imbalanced data:
Real-world application: Predicting whether a customer will buy a product · Logistic Regression works well when the dataset has a class imbalance, as it
(yes/no) based on features like age and browsing behavior. assigns probabilities and can be adjusted using class weights to handle the imbalance
2. List the key assumptions of the k-nearest neighbor (k-NN) algorithm. better.
Ans:Similarity matters: Similar inputs have similar outputs (based on distance). · k-NN is sensitive to class imbalances—since it relies on neighbors, the majority
Feature scaling is important: Distance metrics are sensitive to scale. class can dominate predictions, making it less reliable in skewed datasets
No assumptions about data distribution: It's a non-parametric method. 10.Analyze how Logistic Regression is applied to classify binary outcomes. Examine
3. Explain how the sigmoid function is used in logistic regression. how its approach, underlying assumptions, and output differ from those of Linear
Ans:The sigmoid function transforms linear output into a probability between 0 and Regression. (Long Answer)
1: Ans:Logistic Regression for Binary Classification:
σ(z)=1/1+e−z 1. Application:
where z=wTx+b. If probability > 0.5, it predicts class 1; else, class 0. Logistic Regression is used when the dependent variable is binary (e.g., 0 or 1, Yes or
No). It estimates the probability that a given input belongs to a particular class.
4. Describe the role of the distance metric in k-NN classification. It uses a sigmoid function to map predicted values to a range between 0 and 1.
Ans:The distance metric (e.g., Euclidean, Manhattan) determines how similarity Based on a threshold (commonly 0.5), the output is classified into one of two
is measured between data points. categories.
In k-NN, classification is based on the 'k' closest neighbors. The choice of 2. Approach:
metric directly impacts model accuracy. The model calculates the log-odds (logit) of the probability:
5. Interpret how the number of clusters (k) affects the output of the k-means P
l og( )=β 0+ β 1 X
clustering algorithm. P −1
Ans:Too few clusters: Underfitting; dissimilar points grouped together. This is transformed into a probability using the sigmoid function:
Too many clusters: Overfitting; noise may be treated as structure. 1
p=
Choosing optimal k (e.g., using the elbow method) balances compactness and 1+e −( β 0+ β 1 X )
separation. 3. Underlying Assumptions:
6. Calculate the Euclidean distance between the two points (2,3) and (5,7) in a k-NN Binary dependent variable.
model. Linearity in the logit (not in the raw output like in linear regression).
Ans: No multicollinearity among predictors.
7. Illustrate how logistic regression can be implemented in Python using scikit-learn with a simple dataset.
Ans:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y == 0, test_size=0.2)
model = LogisticRegression()
model.fit(X_train, y_train)
print("Accuracy:", model.score(X_test, y_test))
8. Given a dataset of customer transactions, apply the k-means clustering algorithm to segment customers into three groups and briefly describe the steps.
Ans: · Preprocess data (e.g., scale features).
· Use KMeans(n_clusters=3) from sklearn.cluster.
· Fit the model and obtain cluster labels.
· Analyze cluster profiles to understand customer segments.
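A minimal sketch of the steps listed in question 8; the customer features and cluster centres below are synthetic stand-ins, since the question supplies no actual dataset:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
rng = np.random.default_rng(42)
# Hypothetical customer features: [annual spend, number of transactions]
centers = np.array([[500.0, 10.0], [2000.0, 40.0], [5000.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=[100.0, 3.0], size=(100, 2)) for c in centers])
# Step 1: scale features so that spend does not dominate the distance calculation
X_scaled = StandardScaler().fit_transform(X)
# Steps 2-3: fit k-means with three clusters and obtain the cluster labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)
# Step 4: inspect cluster profiles in the original units
for cluster_id in range(3):
    print(f"Cluster {cluster_id}: mean [spend, transactions] =",
          X[labels == cluster_id].mean(axis=0).round(1))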
9. Given an imbalanced dataset, analyze whether logistic regression or k-NN would be more effective and justify your answer.
Ans: Logistic Regression is generally more effective for imbalanced data:
· Logistic Regression works well when the dataset has a class imbalance, as it assigns probabilities and can be adjusted using class weights to handle the imbalance better.
· k-NN is sensitive to class imbalance—since it relies on neighbors, the majority class can dominate predictions, making it less reliable in skewed datasets.
10. Analyze how Logistic Regression is applied to classify binary outcomes. Examine how its approach, underlying assumptions, and output differ from those of Linear Regression. (Long Answer)
Ans: Logistic Regression for Binary Classification:
1. Application:
Logistic Regression is used when the dependent variable is binary (e.g., 0 or 1, Yes or No). It estimates the probability that a given input belongs to a particular class.
It uses a sigmoid function to map predicted values to a range between 0 and 1.
Based on a threshold (commonly 0.5), the output is classified into one of two categories.
2. Approach:
The model calculates the log-odds (logit) of the probability:
log(P / (1 − P)) = β₀ + β₁X
This is transformed into a probability using the sigmoid function:
p = 1 / (1 + e^(−(β₀ + β₁X)))
3. Underlying Assumptions:
Binary dependent variable.
Linearity in the logit (not in the raw output, as in linear regression).
No multicollinearity among predictors.
Independence of observations.
Errors do not need to be normally distributed.
4. Output:
Produces probabilities.
Final classification depends on a decision threshold.
Can be evaluated using metrics like accuracy, precision, recall, ROC-AUC, etc.
11. Analyze how varying the value of 'k' influences the accuracy and decision boundaries of the K-Nearest Neighbors (KNN) algorithm. Examine the consequences of choosing a 'k' that is too small or too large, and how it impacts model bias, variance, and overall performance. (Long Answer)
Ans: Effect of Varying 'k' in K-Nearest Neighbors (KNN):
The value of 'k' (number of nearest neighbors) is a key hyperparameter in the KNN algorithm, and it greatly influences model accuracy, decision boundaries, bias, and variance.
1. Small 'k' (e.g., k = 1):
Decision Boundary: Very complex and irregular; tightly follows the training data.
Accuracy: Can be high on training data but poor on test data.
Bias: Low – model fits training data closely.
Variance: High – sensitive to noise and outliers.
Overfitting: Likely.
2. Large 'k' (e.g., k = 20 or more):
Decision Boundary: Smoother and more generalized.
Accuracy: May improve on test data up to a point, then degrade if k is too large.
Bias: Higher – model becomes too simple.
Variance: Lower – less sensitive to noise.
Underfitting: Risk increases if 'k' is too large, as it averages over too many neighbors, possibly from different classes.
Value of k | Bias | Variance | Risk
Small (e.g., 1–3) | Low | High | Overfitting
Optimal | Balanced | Balanced | Best performance
Large (e.g., >20) | High | Low | Underfitting
Model Performance:
The optimal value of 'k' is usually found using cross-validation.
Even values of 'k' can lead to ties; an odd k is often preferred for binary classification.
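The bias-variance effect of 'k' described above can be checked empirically with cross-validation; a sketch using the iris dataset as a convenient stand-in:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
# Small k -> flexible, high-variance boundary; large k -> smooth, high-bias boundary.
# Cross-validated accuracy typically peaks at an intermediate, odd value of k.
for k in [1, 5, 15, 45]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}  mean CV accuracy = {scores.mean():.3f}")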
12. Critically evaluate how the K-Means clustering algorithm partitions data points into clusters. Justify the methods used to determine the optimal number of clusters, such as the Elbow Method or Silhouette Score, and assess their effectiveness in different scenarios. (Long Answer)
Ans: 1. How K-Means Works:
K-Means is an unsupervised learning algorithm that partitions data into k clusters by:
Randomly initializing k centroids.
Assigning each data point to the nearest centroid (using Euclidean distance).
Updating centroids by computing the mean of assigned points.
Repeating steps 2–3 until centroids stabilize (convergence).
2. Strengths:
Simple and computationally efficient.
Works well when clusters are spherical, equal-sized, and well-separated.
3. Limitations:
Requires predefining k.
Sensitive to initial centroid placement.
Struggles with non-spherical or overlapping clusters.
Not suitable for categorical data (unless preprocessed).
4. Determining the Optimal Number of Clusters (k):
a. Elbow Method:
Plots Within-Cluster Sum of Squares (WCSS) against different values of k.
Look for the "elbow" point where adding more clusters gives diminishing returns.
Limitation: The elbow is not always clear, especially for overlapping or noisy data.
b. Silhouette Score:
Measures how similar a point is to its own cluster compared to other clusters.
Score ranges from -1 to 1; higher is better.
More reliable than the elbow method, especially for non-uniform cluster sizes or shapes.
Scenario | Elbow Method | Silhouette Score
Well-separated spherical clusters | Effective | Effective
Overlapping or irregular clusters | Less reliable | More informative
High-dimensional data | Difficult to interpret | Useful, but may vary
Noisy data or outliers | Misleading | May detect poor clustering
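A sketch of both selection methods on synthetic blob data; the true number of blobs is an assumption made for illustration:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the WCSS curve used by the Elbow Method; the silhouette needs labels
    print(f"k={k}: WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
In this setting the silhouette score usually peaks at the true k, while the elbow has to be judged by eye from the WCSS column.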
13. Analyze how Support Vector Machines (SVM) separate data points into distinct classes by identifying optimal hyperplanes. Examine the role of kernel functions in transforming non-linearly separable data into a higher-dimensional space for effective classification. (Long Answer)
Ans: Support Vector Machines (SVM) classify data by finding the optimal hyperplane that best separates two classes. This hyperplane maximizes the margin—the distance between the hyperplane and the nearest data points from each class, called support vectors. A larger margin leads to better generalization and classification accuracy. In cases where data is not linearly separable, kernel functions can be used to transform the data into a higher-dimensional space where a linear hyperplane can separate the classes effectively.
Role of Kernel Functions in SVM:
When data is not linearly separable in its original space, kernel functions enable Support Vector Machines (SVM) to classify it effectively by transforming the data into a higher-dimensional space where a linear separation becomes possible.
Key Points:
1. Transformation Without Explicit Computation (Kernel Trick):
Kernel functions compute the dot product of two data points in the transformed space without explicitly mapping them.
This is known as the kernel trick, which saves computational effort and avoids working directly in high-dimensional space.
2. How It Helps:
Transforms complex, non-linear boundaries into linearly separable ones in a higher dimension.
Allows SVM to draw a linear hyperplane in this new space, which corresponds to a non-linear boundary in the original space.
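A sketch contrasting a linear-kernel and an RBF-kernel SVM on data that is not linearly separable; the two-moons generator is an assumed toy example:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Linear kernel: a straight hyperplane in the original feature space
linear_svm = SVC(kernel="linear").fit(X_train, y_train)
# RBF kernel: uses the kernel trick, i.e. a linear hyperplane in an implicit
# higher-dimensional space that corresponds to a curved boundary here
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print("Linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))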
14. Analyze how Decision Trees determine splits at each node based on feature selection criteria such as Information Gain or Gini Impurity. Examine how the choice of splitting feature at each node impacts the tree's accuracy, complexity, and potential for overfitting. (Long Answer)
Ans: Decision Tree Splitting Criteria
Feature Selection with Information Gain or Gini Impurity:
Decision Trees choose the best feature to split a node based on criteria like Information Gain (IG) or Gini Impurity.
Information Gain measures the reduction in entropy after a split. A higher IG indicates a more informative split.
Gini Impurity measures how often a randomly chosen element would be incorrectly classified. A lower Gini score is preferred.
Impact on Accuracy and Complexity:
The selected splitting feature directly affects the accuracy of the tree. Good splits lead to purer nodes, improving classification or prediction performance. However, too many splits can increase complexity, making the tree deeper and harder to interpret.
Potential for Overfitting:
If the model keeps splitting based on minor differences (high complexity), it may overfit the training data and perform poorly on unseen data. Using pruning techniques or setting depth limits helps reduce this risk.
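A minimal sketch of selecting the splitting criterion in scikit-learn; the choice between Gini impurity and entropy (information gain) maps to the criterion parameter below, with iris used as a stand-in dataset:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
X, y = load_iris(return_X_y=True)
# "gini" minimizes Gini impurity; "entropy" maximizes information gain at each split
for criterion in ["gini", "entropy"]:
    tree = DecisionTreeClassifier(criterion=criterion, random_state=0)
    print(criterion, "CV accuracy:", cross_val_score(tree, X, y, cv=5).mean().round(3))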
15. Critically evaluate the problem of overfitting in Decision Trees and its impact on model generalization. Justify how pruning techniques, such as pre-pruning and post-pruning, can effectively reduce overfitting and improve the model's performance on unseen data. (Long Answer)
Ans: Overfitting in Decision Trees and Pruning Techniques:
Problem of Overfitting:
Overfitting occurs when a Decision Tree becomes too complex and captures noise in the training data rather than the actual patterns. This results in high training accuracy but poor generalization to new, unseen data.
Impact on Model Generalization:
An overfitted tree fails to perform well on test data, reducing its real-world effectiveness. It may make overly specific rules that don't apply beyond the training set.
Pruning Techniques to Reduce Overfitting:
Pre-Pruning (Early Stopping): Stops the tree from growing beyond a certain depth or requires a minimum number of samples to split a node. This prevents over-complex trees.
Post-Pruning (Reduced Error or Cost-Complexity Pruning): Allows the full tree to grow and then removes branches that don't significantly improve accuracy on validation data.
Both methods simplify the model, reducing overfitting and improving performance on unseen data.
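A sketch of both pruning styles: pre-pruning via max_depth and min_samples_split, and post-pruning via scikit-learn's cost-complexity parameter ccp_alpha; the dataset and the parameter values are illustrative assumptions:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
# Unpruned tree: tends to overfit (high train accuracy, lower test accuracy)
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Pre-pruning: stop growth early with depth / minimum-sample limits
pre = DecisionTreeClassifier(max_depth=4, min_samples_split=10,
                             random_state=0).fit(X_train, y_train)
# Post-pruning: grow fully, then prune weak branches via cost-complexity alpha
post = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12s} train={model.score(X_train, y_train):.3f} "
          f"test={model.score(X_test, y_test):.3f}")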
16. Analyze the role of distance metrics in the K-Nearest Neighbors (KNN) algorithm and how they influence the classification or regression outcomes. Examine why feature scaling through normalization or standardization is crucial when using KNN, particularly in datasets with varying feature ranges. (Long Answer)
Ans: Distance Metrics and Feature Scaling in KNN:
Role of Distance Metrics in KNN:
KNN relies on distance metrics to identify the 'K' nearest neighbors of a query point. Common metrics include:
Euclidean Distance (default for continuous features)
Manhattan Distance (suitable for high-dimensional data)
Minkowski or Cosine Distance (for customized similarity)
These metrics determine how "close" other points are, directly influencing classification or regression results.
Influence on Outcomes:
The choice of distance metric affects which neighbors are selected. Incorrect or unbalanced metrics can lead to poor predictions by favoring irrelevant features or distant points.
Importance of Feature Scaling:
When features have different ranges (e.g., age: 0–100 vs. income: 0–100,000), larger-range features dominate the distance calculation.
Normalization (scales values to [0, 1])
Standardization (scales to mean = 0, std = 1)
These methods ensure all features contribute equally, making KNN more accurate and fair.
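A sketch of the scaling point above: the same KNN model with and without standardization, wrapped in a Pipeline. The wine dataset is assumed here only because its features span very different ranges:
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_wine(return_X_y=True)
raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
# Without scaling, large-range features dominate the Euclidean distance
print("No scaling:  ", cross_val_score(raw_knn, X, y, cv=5).mean().round(3))
print("Standardized:", cross_val_score(scaled_knn, X, y, cv=5).mean().round(3))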
17. Analyze how Logistic Regression models probabilities to predict binary outcomes. Examine how adjusting the decision threshold influences model performance, particularly in handling imbalanced datasets. (Long Answer)
Ans: Logistic Regression and Decision Thresholds
Modeling Probabilities:
Logistic Regression predicts binary outcomes by modeling the probability that an instance belongs to a class using the sigmoid function:
p = 1 / (1 + e^(−z)), where z is the linear combination of the input features.
This maps any real-valued input to a value between 0 and 1, interpreted as the probability of the positive class.
Decision Threshold:
A default threshold of 0.5 is typically used to classify outcomes:
If P ≥ 0.5, predict class 1
If P < 0.5, predict class 0
This threshold can be adjusted depending on the problem.
Impact on Imbalanced Datasets:
In imbalanced datasets (e.g., 95% negative, 5% positive), a 0.5 threshold may miss minority cases (false negatives).
Lowering the threshold (e.g., to 0.3) increases sensitivity (recall) to detect more positives.
Adjusting the threshold helps balance precision and recall, improving performance on real-world, skewed data.
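A sketch of lowering the decision threshold to raise recall on an imbalanced problem; the synthetic class ratio and the 0.3 threshold are assumptions for illustration:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
# Imbalanced data: roughly 95% negatives, 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]   # probability of the positive class
for threshold in [0.5, 0.3]:
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: precision={precision_score(y_test, pred):.2f} "
          f"recall={recall_score(y_test, pred):.2f}")
Raising the threshold instead would trade recall back for precision.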
MODULE 5
1. A company is using k-means clustering for customer segmentation but is unsure about the optimal number of clusters. Critique whether the Elbow Method or Silhouette Score is a better approach to determine the optimum number of clusters and justify your reasoning.
Ans: The Elbow Method and Silhouette Score are both valid techniques for finding the optimal number of clusters in K-means.
Elbow Method plots the within-cluster sum of squares (WCSS) vs. the number of clusters. The "elbow" point suggests the best balance between model complexity and compactness. However, it can be subjective and unclear if no obvious elbow exists.
Silhouette Score measures how well each point fits within its cluster compared to others. It ranges from -1 to 1, with higher scores indicating well-separated, dense clusters.
2. Define simple linear regression and mention one assumption it makes.
Ans: Simple linear regression models the relationship between one independent variable and one dependent variable.
Assumption: There is a linear relationship between the independent and dependent variables.
3. List two key differences between simple linear regression and multiple linear regression.
Ans:
Feature | Simple Linear Regression | Multiple Linear Regression
Independent Variables | One | Two or more
Model Complexity | Simple, straight-line relationship | More complex, considers multiple factors
4. Explain the role of the R-squared (R²) value in evaluating a regression model.
Ans: The R-squared (R²) value measures how well a regression model explains the variance in the dependent variable.
It ranges from 0 to 1:
R² = 1 means perfect prediction.
R² = 0 means the model explains none of the variance.
Role: R² indicates the goodness of fit — the higher the R², the better the model explains the data. However, it does not indicate causation or model correctness and can be misleading if used alone, especially in overfitted models.
5. Describe how residual plots can help diagnose model fit issues in linear regression.
Ans: Residual plots show the difference between predicted and actual values (residuals) vs. predicted values or input features.
How they help:
A random scatter of points suggests a good model fit.
Patterns or curves indicate non-linearity — the model may be missing relationships.
Increasing or decreasing spread signals heteroscedasticity (non-constant variance).
Outliers or clustering can reveal data issues or the need for a better model.
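A sketch of a residual plot with matplotlib; the synthetic regression data is an assumption, and the same visual checks apply to any fitted linear model:
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
X, y = make_regression(n_samples=200, n_features=1, noise=15.0, random_state=0)
model = LinearRegression().fit(X, y)
predicted = model.predict(X)
residuals = y - predicted
# A healthy fit shows residuals scattered randomly around the zero line;
# curvature suggests non-linearity, a funnel shape suggests heteroscedasticity.
plt.scatter(predicted, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. predicted values")
plt.show()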
6. Interpret the purpose of cross-validation in regression model selection.
Ans: Cross-validation is used to assess how well a regression model generalizes to unseen data.
Purpose:
It splits the dataset into training and validation sets multiple times (e.g., in k-fold cross-validation).
Helps avoid overfitting by testing the model on different data subsets.
Provides a more reliable estimate of model performance compared to a single train-test split.
Aids in model selection by comparing performance metrics (like RMSE or R²) across different models.
7. A simple linear regression model is given by Y = 3X + 5. Calculate the predicted value of Y when X = 4.
Ans: Y = 3(4) + 5 = 12 + 5 = 17
8. Illustrate how to implement multiple linear regression in Python using the scikit-learn library with a sample dataset.
Ans:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
model = LinearRegression()
model.fit(X, y)
print(model.coef_, model.intercept_)
9. Given a dataset with multicollinearity, apply an appropriate regression technique to reduce its impact and describe your approach.
Ans: To reduce multicollinearity in a dataset, we can apply Ridge Regression or Principal Component Regression (PCR):
Ridge Regression: Adds an L2 regularization penalty to the regression model, shrinking correlated variable coefficients to stabilize estimates.
Principal Component Regression (PCR): Uses Principal Component Analysis (PCA) to transform correlated predictors into uncorrelated components, reducing redundancy.
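A sketch of the Ridge approach on deliberately collinear synthetic predictors; the alpha value is an illustrative assumption that would normally be tuned by cross-validation:
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
# Two highly correlated predictors (multicollinearity) plus noise in the target
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)   # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=200)
# Ordinary least squares: coefficients can become unstable and inflated
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_.round(2))
# Ridge (L2 penalty): shrinks the correlated coefficients toward stable values
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_.round(2))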
10. A researcher builds two multiple linear regression models with different feature sets. Analyze how cross-validation can help determine which model is more reliable.
Ans: Cross-validation helps compare the reliability and performance of two multiple linear regression models with different feature sets.
How it helps:
Both models are evaluated on multiple data splits (e.g., k-fold CV), reducing bias from a single train-test split.
Metrics like Mean Squared Error (MSE) or R² are averaged across folds.
The model with lower average error and less variance in scores is considered more reliable.
Helps identify if extra features improve performance or just cause overfitting.
11. A company is using Akaike Information Criterion (AIC) and Adjusted R-squared to select the best multiple regression model. Critique which metric is more effective for model selection and justify your reasoning.
Ans: Both Akaike Information Criterion (AIC) and Adjusted R-squared are useful for model selection, but they serve slightly different purposes.
AIC balances model fit and complexity. It penalizes the number of predictors to avoid overfitting. A lower AIC indicates a better model. It is more effective when comparing non-nested models and focuses on predictive accuracy.
Adjusted R-squared adjusts R² for the number of predictors. It increases only if a new feature improves the model more than by chance. However, it is less sensitive to overfitting compared to AIC.
12. Compare and contrast the advantages and disadvantages of ensemble methods like Bagging, Boosting, and Stacking. (Long Answer)
Ans:
Method | Advantages | Disadvantages
Bagging (e.g., Random Forest) | Reduces variance and overfitting; easy to parallelize; improves stability | Less improvement if base learners are strong; can be computationally expensive with many trees
Boosting (e.g., AdaBoost, XGBoost) | Reduces bias and variance; focuses on hard-to-predict examples; often achieves high accuracy | Sensitive to noisy data and outliers; sequential training is slower and harder to parallelize
Stacking | Combines diverse models for better performance; flexible, can use any base learners | Complex to implement and tune; risk of overfitting if the meta-model is not carefully chosen
13. Examine the trade-offs between bias and variance in machine learning models and their effect on overfitting and underfitting. (Long Answer)
Ans: Bias: Bias refers to errors from overly simple models that underfit data, missing important patterns. High bias leads to poor training and test accuracy.
Variance: Variance refers to errors from models that are too complex and overfit training data, capturing noise. High variance leads to good training but poor test accuracy.
Trade-off:
Reducing bias usually increases variance and vice versa. The goal is to find a balance that minimizes total error.
Effect on Overfitting/Underfitting:
High bias → Underfitting (model too simple)
High variance → Overfitting (model too complex)
14. Evaluate the performance of different classification models (e.g., Logistic Regression, Random Forest, and SVM) on a given dataset using precision, recall, and F1-score. (Long Answer)
Ans: Overview of Metrics:
Precision: Measures the proportion of true positive predictions among all positive predictions.
Precision = TP / (TP + FP)
High precision means fewer false positives.
Recall (Sensitivity): Measures the proportion of true positives identified out of actual positives.
Recall = TP / (TP + FN)
High recall means fewer false negatives.
F1-score: The harmonic mean of precision and recall, balancing the two.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Useful when you want a balance between precision and recall, especially in imbalanced datasets.
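A sketch of the comparison described in question 14, reporting precision, recall, and F1 for the three model families on one assumed synthetic dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name)
    # classification_report prints per-class precision, recall and F1
    print(classification_report(y_test, pred, digits=3))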
15. Analyze the impact of hyperparameter tuning with Random Search in deep learning models. (Long Answer)
Ans:
· Efficiency: Random Search explores the hyperparameter space by sampling combinations randomly, often finding good configurations faster than exhaustive grid search, especially when only a few hyperparameters matter.
· Improved Model Performance: Proper tuning of parameters like learning rate, batch size, and dropout via Random Search can significantly enhance model accuracy and generalization by optimizing training dynamics.
· Better Exploration: Random Search can discover non-intuitive hyperparameter values by sampling broadly, increasing the chances of escaping local optima in complex deep learning models.
· Scalability: It handles high-dimensional hyperparameter spaces efficiently, making it suitable for deep learning models with many parameters.
· Limitations: Despite its efficiency, Random Search requires multiple training runs, which can be computationally expensive, and results depend on the chosen parameter ranges.
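Random Search is usually run inside a deep-learning framework, but the same idea can be sketched with scikit-learn's RandomizedSearchCV around a small MLP; the parameter ranges and the iteration budget below are assumptions made for illustration:
from scipy.stats import loguniform
from sklearn.datasets import load_digits
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
X, y = load_digits(return_X_y=True)
# Randomly sample a handful of configurations instead of an exhaustive grid
param_distributions = {
    "hidden_layer_sizes": [(32,), (64,), (64, 32)],
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "alpha": loguniform(1e-5, 1e-2),
}
search = RandomizedSearchCV(
    MLPClassifier(max_iter=300, random_state=0),
    param_distributions,
    n_iter=8,        # number of sampled configurations (the search budget)
    cv=3,
    random_state=0,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
This can take a while to run, since each sampled configuration trains its own network, which is the computational cost noted under Limitations above.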
16. A regression analysis between apples (y) and oranges (x) resulted in the following least squares line: y = 100 + 2x. Predict the implication if oranges are increased by 1. (Long Answer)