Foundations of Data Science Questions
Part A
Linking means that data selected (brushed) in one visualization is automatically highlighted in other
visualizations. This helps explore relationships across multiple variables.
Example: Selecting outliers in a scatter plot and seeing where they lie on a histogram.
4. How does regression toward the mean differ from other statistical concepts? Give an
example.
Regression toward the mean is a statistical phenomenon where extreme values tend to move closer to
the average in subsequent measurements.
Example: A student scoring exceptionally high on one test is likely to score closer to the class average on
the next, not due to a change in ability but due to natural variation.
It differs from regression analysis, which models relationships between variables; regression toward the
mean instead describes the statistical tendency of extreme values to return toward the average.
Example (population vs. sample): all college students in India form the population; the 500 students
selected for the survey form the sample.
Significance of the p-value:
• Low p-value (< 0.05): Strong evidence against the null hypothesis → reject it.
• High p-value (> 0.05): Weak evidence against the null → fail to reject.
Purpose (t-test vs. ANOVA): a t-test compares the means of two groups, while ANOVA compares the means of three or more groups.
Applications:
• Medical research: patient survival time.
Ease of analysis (structured vs. unstructured data): structured data is easy to analyze with traditional methods, whereas unstructured data needs advanced tools (NLP, CV).
Outlier detection methods:
• Box plot
• Z-score
Hypothesis direction (one-tailed vs. two-tailed test): a one-tailed test examines an effect in one direction only, while a two-tailed test examines it in both directions.
Example: In fraud detection, fraud samples may be rare but must be sampled more during training.
Part B
1. Structured Data
Definition:
Structured data is highly organized and easily searchable using traditional databases and tools like SQL.
Characteristics:
• Organized in rows and columns
• Follows a fixed schema
• Easily stored in relational databases
Examples:
• Student database (Name, Roll Number, Marks)
• Banking transactions (Date, Account No, Amount)
Tabular Example:
Student ID Name Age Marks
101 Ravi 20 87
102 Priya 21 91
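To illustrate, here is a minimal sketch (the table name and values are hypothetical, mirroring the example above) of storing and querying such structured data with SQL from Python:

import sqlite3

# Hypothetical student table mirroring the tabular example above
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (student_id INTEGER, name TEXT, age INTEGER, marks INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?, ?, ?)",
                 [(101, "Ravi", 20, 87), (102, "Priya", 21, 91)])

# A fixed schema means simple SQL retrieves exactly the rows and columns needed
for row in conn.execute("SELECT name, marks FROM students WHERE marks > 85"):
    print(row)        # ('Ravi', 87) and ('Priya', 91)
conn.close()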
2. Unstructured Data
Definition:
Unstructured data lacks a predefined format or organization, making it harder to analyze directly.
Characteristics:
• No fixed schema
• Requires advanced tools (NLP, CV)
• High in volume and complexity
Examples:
• Emails, social media posts
• Audio/video files
• Medical images
3. Semi-Structured Data
Definition:
Data that is not stored in a relational database but still has some organizational properties.
Characteristics:
• Has tags or markers
• Easier to analyze than unstructured data
• Common in data exchange formats
Examples:
• JSON, XML, HTML documents
Example JSON:
{
"name": "Anil",
"age": 25,
"city": "Mumbai"
}
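A short sketch of how this semi-structured record can be parsed in Python (values taken from the JSON above):

import json

record = '{"name": "Anil", "age": 25, "city": "Mumbai"}'
data = json.loads(record)            # parse the JSON string into a dictionary
print(data["name"], data["age"])     # Anil 25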
4. Temporal Data
Definition:
Temporal data is time-stamped and varies with time.
Characteristics:
• Time-series in nature
• Useful for forecasting
Examples:
• Stock market prices
• Temperature recordings
Visualization:
Stock Price vs Time → Line chart
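A minimal sketch of such a line chart with matplotlib (the price values are made up for illustration):

import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]                        # time axis
price = [101.2, 102.5, 101.8, 103.9, 104.3]   # hypothetical daily closing prices

plt.plot(days, price, marker="o")             # line chart: value against time
plt.xlabel("Day")
plt.ylabel("Stock price")
plt.title("Stock Price vs Time")
plt.show()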
5. Spatial Data
Definition:
Spatial data is related to physical space or geography.
Characteristics:
• Includes coordinates, maps
• Requires GIS tools
Examples:
• GPS data
• Satellite images
• Store locations
6. Quantitative and Qualitative Data
Definition:
Quantitative data consists of numerical measurements (e.g., marks, height, income), while qualitative data consists of categorical attributes (e.g., gender, colour, grade).
7. Big Data
Definition:
Big data refers to large, complex data sets that cannot be handled using traditional processing tools.
3Vs of Big Data:
• Volume: Huge amount
• Velocity: Speed of data generation
• Variety: Different types and sources
Example:
• Data generated by Facebook every second
8. Metadata
Definition:
Metadata is "data about data". It provides information about other data.
Example:
• Author name and date for a file
• File size and type
Summary Table
Facet Description Tools Used
Structured Tabular data SQL, Excel
Unstructured Raw formats like text, video NLP, Computer Vision
Semi-structured Tagged data (JSON, XML) XML parsers, NoSQL
Temporal Time-based measurements Time series analysis
Spatial Location-based data GIS, Google Maps
Quantitative Numerical measurements Statistical analysis
Qualitative Categorical attributes Label encoding
Big Data High-volume and fast data Hadoop, Spark
Metadata Info about other data Data catalogs
ii) Sketch and outline the step-by-step activities in the data science process.
The Data Science Life Cycle defines the structured workflow followed by data scientists to solve real-world
problems using data.
1. Problem Definition
Goal: Understand the problem clearly.
• Define objectives
• Identify business needs
• Set success criteria
Example: "Predict customer churn in a telecom company."
2. Data Collection
Goal: Gather data from various sources.
• Internal databases
• Public datasets
• APIs, sensors, surveys
Example: Collect customer data, call records, service usage logs.
3. Data Preparation (Cleaning)
Goal: Make the raw data usable.
• Handle missing values and duplicates
• Correct inconsistent formats
4. Exploratory Data Analysis
Goal: Understand patterns and relationships in the data.
• Summary statistics and visualizations
• Detect outliers
5. Feature Engineering
Goal: Create meaningful features from raw data.
• Normalize/scale data
• Extract useful variables
• Reduce dimensionality (PCA)
Example: Derive average monthly usage from daily logs.
6. Model Building
Goal: Train machine learning models.
• Choose algorithms (Linear Regression, Decision Tree, etc.)
• Split data into train/test sets
• Use libraries like Scikit-learn
Example: Use Logistic Regression to predict churn.
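A minimal sketch of this step with Scikit-learn (the file name and feature columns are assumed for illustration, not part of the original example):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("churn.csv")                       # hypothetical customer dataset
X = df[["monthly_usage", "tenure", "complaints"]]   # assumed feature columns
y = df["churn"]                                     # 0/1 churn label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))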
7. Model Evaluation
Goal: Check model performance.
• Metrics: Accuracy, Precision, Recall, F1-Score
• Use validation techniques (K-Fold, Cross-validation)
Example:
• Accuracy = 92%
• F1-score = 0.88
8. Model Deployment
Goal: Deploy the model into production.
• Use tools: Flask API, Streamlit, Docker
• Make predictions accessible to users
Example: Integrate churn prediction into CRM software.
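A minimal deployment sketch with Flask (assumes the trained model was saved as model.pkl; the endpoint name is illustrative):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)
model = pickle.load(open("model.pkl", "rb"))     # previously trained model

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]    # e.g. [[12.5, 24, 1]]
    prediction = model.predict(features).tolist()
    return jsonify({"churn": prediction})

if __name__ == "__main__":
    app.run(port=5000)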
Conclusion
The facets of data help us understand the variety and complexity of real-world information. Each type of
data requires specific tools and handling techniques. Similarly, the data science process is a structured
approach involving multiple iterative stages — from understanding the problem to deploying the model
and monitoring it.
By mastering both the facets of data and the workflow of data science, students and professionals are
better equipped to solve analytical problems in domains like healthcare, finance, marketing, and more.
11. (b) Explain in Detail About Cleansing, Integrating, and Transforming Data
with Examples
Data preprocessing is a crucial step in the data science pipeline. It ensures that the data used for analysis
and modeling is clean, consistent, and usable.
It involves three main steps:
1. Data Cleansing
2. Data Integration
3. Data Transformation
🔹 1. Data Cleansing (Data Cleaning)
✅ Definition:
Data cleansing is the process of detecting and correcting (or removing) inaccurate, incomplete,
inconsistent, or irrelevant parts of the data.
Common Cleaning Methods:
1. Handling Missing Values:
o Remove rows/columns with many missing entries
o Fill missing values with the mean/median:
df['Age'].fillna(df['Age'].mean(), inplace=True)
2. Removing Duplicates:
df.drop_duplicates(inplace=True)
4. Handling Outliers:
o Z-score method
o IQR method
o Winsorizing
5. Standardizing Values:
Replace variations of categorical values with a standard label.
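A short sketch of the outlier handling and standardization steps above (the DataFrame and its values are made up for illustration):

import pandas as pd

# Small illustrative DataFrame
df = pd.DataFrame({"Salary": [30000, 32000, 31000, 900000],
                   "City": ["Mumbai", "mumbai", "MUMBAI", "Delhi"]})

# 4. Handling outliers with the IQR rule: keep rows within 1.5*IQR of the quartiles
q1, q3 = df["Salary"].quantile(0.25), df["Salary"].quantile(0.75)
iqr = q3 - q1
df = df[(df["Salary"] >= q1 - 1.5 * iqr) & (df["Salary"] <= q3 + 1.5 * iqr)]

# 5. Standardizing values: map spelling variations to one standard label
df["City"] = df["City"].replace({"mumbai": "Mumbai", "MUMBAI": "Mumbai"})
print(df)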
🔹 2. Data Integration
✅ Definition:
Data integration is the process of combining data from multiple sources into a coherent dataset.
• SQL Databases
• CSV Files
• APIs
• Cloud Storage
• IoT Sensors
1. Schema Integration:
Aligning different structures into one format.
2. Entity Identification:
Matching records that refer to the same entity.
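As a concrete sketch (the two source tables and their columns are hypothetical), combining data with pandas:

import pandas as pd

crm = pd.DataFrame({"cust_id": [1, 2], "name": ["Anil", "Priya"]})       # source 1
billing = pd.DataFrame({"cust_id": [1, 2], "amount": [2500, 4000]})      # source 2

# Schema integration + entity identification via the shared key column
combined = pd.merge(crm, billing, on="cust_id", how="inner")
print(combined)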
🔹 3. Data Transformation
✅ Definition:
Data transformation is the process of converting data into a suitable format or structure for analysis or
modeling.
1. Normalization/Standardization:
2. Encoding Categorical Variables (One-Hot):
pd.get_dummies(df['Gender'])
3. Binning:
4. Log Transformation:
df['Salary'] = np.log(df['Salary'])
5. Feature Construction:
6. Date-Time Transformation:
df['Year'] = df['Date'].dt.year
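A brief sketch of normalization and binning to complement the snippets above (values are made up for illustration):

import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 45], "Salary": [50000, 60000, 90000]})

# 1. Min-max normalization of Age to the [0, 1] range
df["Age_scaled"] = (df["Age"] - df["Age"].min()) / (df["Age"].max() - df["Age"].min())

# 3. Binning Age into labelled groups
df["Age_group"] = pd.cut(df["Age"], bins=[0, 30, 60], labels=["young", "middle"])
print(df)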
📝 Example:
Before Transformation:
Age Gender Salary
25 Male ₹50,000
30 Female ₹60,000
After Transformation (Age normalized, Gender label-encoded, Salary log-transformed):
Age Gender Salary
0.33 1 10.82
0.67 0 11.00
[Raw Data] → [Cleansing] → [Integration] → [Transformation] → [Analysis / Modeling]
🧠 Conclusion
The success of any data science project depends heavily on data preprocessing. Cleansing removes
errors, integration combines data from diverse sources, and transformation makes the data suitable for
modeling. Mastering these techniques ensures accuracy, efficiency, and robustness in analytics and
predictive models.
“Garbage in, garbage out” — if raw data is poor, the model output will also be poor. Hence,
preprocessing is the foundation of data science success.
11) c) Explain in detail about the benefits and uses of data science with
counter examples.
📘 Introduction to Data Science
Data Science is the study of extracting knowledge and insights from large volumes of data using a
combination of statistical methods, algorithms, machine learning, and domain expertise. In today’s
digital world, where data is generated at an unprecedented scale, data science plays a vital role in
helping individuals, organizations, and governments make informed, accurate, and strategic decisions.
1. Improved Decision-Making
Data science enables companies to make decisions based on facts, patterns, and data-driven
predictions, rather than gut instinct.
• Example: A retail chain analyzes purchasing patterns to stock high-demand products before a
festival season.
• Counter Example: A business that makes assumptions without analyzing customer behavior may
stock the wrong inventory, resulting in losses.
2. Accurate Predictions
Data science models can forecast future outcomes with high accuracy when trained on historical data.
• Example: Weather forecasting using historical climate and satellite data helps predict storms.
• Counter Example: Without using data science models, predictions may be based on outdated or
insufficient information, causing inaccurate planning or disaster response.
3. Personalization and Better User Experience
Data science tailors products, content, and advertisements to user behavior, improving satisfaction and
engagement.
• Example: Netflix recommends shows based on a user’s watch history and similar viewer profiles.
4. Fraud Detection
Data science helps detect anomalies and frauds by identifying unusual patterns in large datasets.
• Example: Banks use machine learning models to detect credit card fraud in real time.
• Counter Example: Institutions without such detection systems are prone to higher risks and
financial loss due to undetected fraud.
5. Operational Efficiency
Data science improves efficiency by automating processes and optimizing resource allocation.
• Example: Logistics companies like FedEx use data to optimize delivery routes and reduce fuel
costs.
• Counter Example: Without analysis, inefficient routing could cause delays and operational losses.
🔬 1. Healthcare
• Example: Predicting patient readmission rates using electronic health records (EHR).
• Counter Example: Without data science, patient monitoring may miss early warning signs, risking
health outcomes.
🛒 2. Retail and E-Commerce
• Counter Example: Retailers without data analysis may fail to anticipate customer needs or run
inefficient sales campaigns.
💰 3. Banking and Finance
• Example: Loan eligibility prediction based on customer income, credit history, and spending
behavior.
• Counter Example: Relying only on fixed rules for credit approval may exclude creditworthy but
unconventional applicants.
• Counter Example: Poor planning may result in long wait times and customer dissatisfaction.
🎓 5. Education
• Counter Example: Without adaptation, learners may receive one-size-fits-all material, reducing
effectiveness.
📈 6. Business Intelligence
• Example: BI tools like Power BI and Tableau display real-time KPIs for executive decisions.
• Counter Example: Without visualization and trend insights, businesses may miss critical
opportunities or risks.
Despite its benefits, data science is not foolproof. It can lead to negative consequences if not applied
properly.
Biased Data: If training data contains bias, the model will reflect that bias. Example: a recruitment model favoring one gender due to biased historical data.
Overfitting: The model fits the training data too closely but fails on new data. Example: a spam detection model that marks all unfamiliar emails as spam.
Lack of Context: Models cannot understand real-world implications without human insight. Example: a model might suggest layoffs to cut costs without considering morale.
Fraud detection: Banking systems detect anomalies; missed fraud leads to huge losses.
Resource optimization: Allocating doctors based on patient flow data; poor allocation wastes staff or causes long patient wait times.
🧠 Ethical Considerations
📌 Conclusion
Data Science has revolutionized the way the world operates, bringing efficiency, accuracy, and
intelligence into decision-making and operations. From healthcare to banking, education to
entertainment, the uses of data science are vast and impactful.
However, it must be used with caution, ethical responsibility, and awareness of its limitations. Blind
reliance without understanding the data, context, or model behavior can lead to bias, errors, and real-
world harm.
Exploratory Data Analysis (EDA) is the first and most critical step in any data science process. It involves
analyzing datasets to summarize their main characteristics, often using visualization techniques.
🎯 Goals of EDA
3. Test assumptions
1. Univariate Analysis
Example:
Example:
2. Bivariate Analysis
o +1: Perfect positive correlation
o −1: Perfect negative correlation
o 0: No correlation
Example:
📊 Categorical vs Numerical:
Example:
📋 Categorical vs Categorical:
Example:
3. Multivariate Analysis
🖼️ Techniques:
Example:
Example:
2. Outlier Detection
🖼️ Visualization Tools
Tool Use
Example:
• Selecting high-income customers in one graph also highlights their age in another.
2. Dimensionality Reduction
• Techniques like PCA (Principal Component Analysis) reduce features to 2D/3D for visualization.
Example:
Step-by-step EDA:
2. Check shape:
df.shape
3. Null values:
df.isnull().sum()
4. Univariate Analysis:
o Histogram of Age
5. Bivariate Analysis:
6. Outliers:
o Boxplot of Fare
7. Multivariate Analysis:
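A compact sketch of these steps in pandas/seaborn (the file name and the Age/Fare columns are assumptions, in the spirit of the Titanic-style dataset implied above):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("titanic.csv")          # 1. load the data (hypothetical file)

print(df.shape)                          # 2. shape of the dataset
print(df.isnull().sum())                 # 3. null values per column

df["Age"].hist()                         # 4. univariate: histogram of Age
sns.boxplot(x=df["Fare"])                # 6. outliers: boxplot of Fare
sns.heatmap(df.select_dtypes(include="number").corr(), annot=True)   # 7. correlation heatmap
plt.show()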
Purpose Description
Feature selection Identify which features are most useful for modeling
📌 Conclusion
Exploratory Data Analysis (EDA) is not just the first step — it’s the foundation of effective data science.
By using visualizations and statistical techniques, data scientists gain intuitive and structured
understanding of the data before model building.
EDA:
• Reduces errors
“If data is the new oil, then EDA is the refining process.”
12 (a) For each of the following pairs of distributions, first decide whether
their standard deviations are about the same or different. If their standard
deviations are different, indicate which distribution should have the larger
standard deviation. Note that the distribution with the more dissimilar set of
scores or individuals should produce the larger standard deviation regardless
of whether, on average, scores or individuals in one distribution differ from
those in the other distribution.
(i) SAT scores for all graduating high school seniors (a1) or all college freshmen (a2)
(ii) Ages of patients in a community hospital (b1) or a children’s hospital (b2)
(iii) Motor skill reaction times of professional baseball players (c1) or college students (c2)
(iv) GPAs of students at some university as revealed by a random sample (d1) or a census of the entire student body (d2)
(v) Anxiety scores (on a scale from 0 to 50) of a random sample of college students taken from the senior class (e1) or those who plan to attend an anxiety-reduction clinic (e2)
(vi) Annual incomes of recent college graduates (f1) or of 20-year alumni (f2)
Rule: The standard deviation (SD) is a measure of how spread out the values in a dataset are.
A larger SD means the data values vary more from the mean.
✅ (i) SAT scores for all graduating high school seniors (a1) or all college freshmen (a2)
• Explanation: SAT scores of college freshmen (a2) are from students who were admitted to college,
possibly a more academically filtered group, while high school seniors (a1) include all levels of
ability.
• Conclusion: a1 (all high school seniors) should have the larger standard deviation, since college freshmen are a more selected, homogeneous group.
✅ (ii) Ages of patients in a community hospital (b1) or a children’s hospital (b2)
• Explanation: A community hospital (b1) serves people of all ages — children, adults, seniors —
whereas a children’s hospital (b2) focuses on a narrow age group (usually 0–18).
• Conclusion: b1 (community hospital) should have the larger standard deviation.
✅ (iii) Motor skill reaction times of professional baseball players (c1) or college students (c2)
• Explanation: Professional baseball players (c1) have consistent, trained reaction times due to
athletic conditioning, whereas college students (c2) show more variability in skill levels.
• Conclusion: c2 (college students) should have the larger standard deviation.
✅ (iv) GPAs of students at some university revealed by a random sample (d1) or a census of the entire
student body (d2)
• Explanation: A random sample (d1) may have sampling variation, while a census (d2) captures
everyone, thus being more stable and less varied.
• Conclusion: the standard deviations should be about the same, since both the sample and the census describe the same student body.
✅ (v) Anxiety scores (scale 0–50) of a random sample of senior college students (e1) or those who plan
to attend an anxiety-reduction clinic (e2)
• Explanation: A random sample of seniors (e1) spans the full range of anxiety levels, from very low to very high, whereas students who plan to attend an anxiety-reduction clinic (e2) are a self-selected, more homogeneous group with mostly high anxiety scores.
• Conclusion: e1 (the random sample) should have the larger standard deviation.
✅ (vi) Annual incomes of recent college graduates (f1) or of 20-year alumni (f2)
• Explanation: Recent graduates (f1) typically have similar entry-level salaries. In contrast, 20-year
alumni (f2) are likely to have diverse careers, promotions, and financial outcomes, leading to
greater spread.
• Conclusion: f2 (20-year alumni) should have the larger standard deviation.
📌 Summary Table
Pair Larger Standard Deviation
(i) a1 — all high school seniors
(ii) b1 — community hospital
(iii) c2 — college students
(iv) About the same
(v) e1 — random sample of seniors
(vi) f2 — 20-year alumni
Question Recap:
Sorted Data:
0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 7, 8, 11
Total numbers (n = 18)
Step 2: Mode
Frequency table:
Value Frequency
0 3
1 2
2 3
3 4
4 2
5 1
7 1
8 1
11 1
Step 3: Median
Since n = 18 (even), median is the average of the 9th and 10th values.
• 9th value = 3
• 10th value = 3
Median = (3 + 3)/2 = 3
Step 4: Mean
Mean = Σxi / n = 59 / 18 ≈ 3.28
Squared deviations from the mean (μ ≈ 3.28):
xi xi−μ (xi−μ)²
1 -2.28 5.20
3 -0.28 0.08
4 0.72 0.52
1 -2.28 5.20
0 -3.28 10.76
2 -1.28 1.64
5 1.72 2.96
8 4.72 22.28
0 -3.28 10.76
2 -1.28 1.64
3 -0.28 0.08
4 0.72 0.52
7 3.72 13.84
11 7.72 59.60
0 -3.28 10.76
2 -1.28 1.64
3 -0.28 0.08
3 -0.28 0.08
Sum of squared deviations ≈ 147.6
Measure Value
Mode 3
Median 3
Mean 3.28
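These values can be checked quickly with Python's statistics module (a verification sketch using the data above):

import statistics

data = [0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 7, 8, 11]
print(statistics.mode(data))              # 3
print(statistics.median(data))            # 3.0
print(round(statistics.mean(data), 2))    # 3.28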
✅ Question:
The frequency distribution for the length (in seconds) of 100 telephone calls is:
Length (seconds) Frequency (f)
0–20 0
21–40 5
41–60 7
61–80 14
81–100 28
101–120 21
121–140 13
141–160 9
161–180 3
✅ Step 1: Compute the Class Midpoints
Class Frequency (f) Midpoint (x)
0–20 0 10
21–40 5 30.5
41–60 7 50.5
61–80 14 70.5
81–100 28 90.5
101–120 21 110.5
121–140 13 130.5
141–160 9 150.5
161–180 3 170.5
f x (midpoint) f·x
0 10 0
5 30.5 152.5
7 50.5 353.5
14 70.5 987.0
28 90.5 2534.0
21 110.5 2320.5
13 130.5 1696.5
9 150.5 1354.5
3 170.5 511.5
Total: Σf = 100, Σf·x = 9910
Mean = Σf·x / N = 9910 / 100 = 99.1 seconds
Use the formula for the median of grouped data:
Median = L + ((N/2 − F) / f) × h
Where:
• N = 100, so N/2 = 50
• L = lower boundary of the median class, F = cumulative frequency before it, f = frequency of the median class, h = class width
The cumulative frequency first reaches 50 in the 81–100 class, so L = 80.5, F = 26, f = 28, h = 20:
Median = 80.5 + ((50 − 26) / 28) × 20 ≈ 97.6 seconds
f x x² f · x²
0 10 100 0
Measure Value
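Completing the computation from the frequency table above (a sketch; class boundaries treated as continuous, e.g. 80.5–100.5):

freqs = [0, 5, 7, 14, 28, 21, 13, 9, 3]
mids = [10, 30.5, 50.5, 70.5, 90.5, 110.5, 130.5, 150.5, 170.5]
n = sum(freqs)                                                   # 100

mean = sum(f * x for f, x in zip(freqs, mids)) / n               # ~99.1 seconds

# Median class is 81-100: L = 80.5, cumulative freq before = 26, f = 28, width = 20
median = 80.5 + ((n / 2 - 26) / 28) * 20                         # ~97.6 seconds

variance = sum(f * x**2 for f, x in zip(freqs, mids)) / n - mean**2
print(mean, median, variance ** 0.5)                             # SD ~32.9 seconds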
b) (i) The wind speed X in miles per hour and wave height Y in feet were
measured under various conditions on an enclosed deep water sea, with
the results shown in the table.
X 0 2 7 9 13 22
Y 0 5 10 14 22 31
Create a scatter plot and predict the type of correlation.
Given Data
Below is a simple, hand-drawable version of the scatter plot for graph paper.
X-axis: Wind Speed (mph), Range: 0 to 25
Y-axis: Wave Height (feet)
Range: 0 to 35
Plot points:
• (0, 0)
• (2, 5)
• (7, 10)
• (9, 14)
• (13, 22)
• (22, 31)
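A small sketch that draws this scatter plot and checks the correlation numerically, using the six given points:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([0, 2, 7, 9, 13, 22])      # wind speed (mph)
y = np.array([0, 5, 10, 14, 22, 31])    # wave height (feet)

plt.scatter(x, y)
plt.xlabel("Wind Speed (mph)")
plt.ylabel("Wave Height (feet)")
plt.show()

print(np.corrcoef(x, y)[0, 1])          # ~0.99, i.e. a strong positive correlation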
Definition:
Correlation describes the strength and direction of a linear relationship between two variables.
Observation:
This suggests:
• The relationship looks close to linear, though the increase may become more rapid for higher
values
Conclusion:
The scatter plot shows a strong positive correlation between wind speed and wave height.
As wind speed increases, wave height increases, indicating that higher wind speeds generate taller waves
in deep water conditions.
Key Exam Writing Points:
Given: X̄ = 5, Ȳ = 60, SSx = 35, SSy = 70
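Assuming r = −.80 (the correlation itself does not appear above, but a slope of about −1.13 with SSx = 35 and SSy = 70 corresponds to this value), the standard formulas give:
b = r × √(SSy / SSx) = (−0.80) × √(70 / 35) = (−0.80)(1.414) ≈ −1.13
a = Ȳ − b·X̄ = 60 − (−1.13)(5) = 60 + 5.65 ≈ 65.65
Regression equation: Y′ ≈ 65.65 − 1.13X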
✅ Interpretation:
This regression equation predicts life expectancy (Y) from years of heavy smoking (X).
• The slope is negative, indicating that as years of heavy smoking increase, life expectancy decreases.
• For each additional year of heavy smoking, life expectancy drops by approximately 1.13 years.
📝 Use in Exam:
Make sure to:
• Clearly define what each variable stands for
• Show each step of calculation (slope and intercept)
• Write the final equation in a box or underline it
13 (a) (i) Reading achievement scores are obtained for a group of fourth
graders. A score of 4.0 indicates a level of achievement appropriate for
fourth grade, a score below 4.0 indicates underachievement, and a score
above 4.0 indicates overachievement. Assume that the population
standard deviation equals 0.4. A random sample of 64 fourth graders
reveal a mean achievement score of 3.82. Construct a 95 percent
confidence interval for the unknown population mean. (Remember to
convert the standard deviation to a standard error.)
Interpret this confidence interval; that is, do you find any consistent
evidence either of overachievement or of underachievement?
✅ Question Breakdown:
We are given:
• Population standard deviation σ = 0.4
• Sample size n = 64
• Sample mean x̄ = 3.82
• Confidence level = 95%
🧮 Step-by-Step Solution
✅ Step 2: Determine the Z-value for 95% confidence
For a 95% confidence level, the critical z-value is:
z = 1.96
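Completing the remaining steps from the given values:
Step 1: Standard error = σ / √n = 0.4 / √64 = 0.4 / 8 = 0.05
Step 3: 95% CI = x̄ ± z × SE = 3.82 ± 1.96(0.05) = 3.82 ± 0.098 → (3.72, 3.92)
Interpretation: the entire interval lies below 4.0, so the sample provides consistent evidence of underachievement among these fourth graders.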
1. Introduction
Estimation in statistics is the process of inferring the value of a population parameter using data from a
sample. Since we rarely have access to entire populations, estimation helps us make educated guesses
about characteristics like mean, proportion, or standard deviation.
• Point Estimation
• Interval Estimation
2. Estimation Methods
A. Point Estimation
• A single sample statistic (e.g., the sample mean) is used as the estimate of the population parameter.
Example:
If the sample mean of height of students is 167.2 cm, it is a point estimate of the population mean.
B. Interval Estimation
• Provides a range (interval) of values believed to contain the population parameter with a certain
level of confidence.
3. Confidence Interval
A Confidence Interval (CI) is a range of values, derived from the sample data, that is likely to contain the
true population parameter with a specified probability (confidence level).
Confidence Levels:
Confidence Level Critical z-value
90% 1.645
95% 1.96
99% 2.576
Suppose: a sample yields x̄ = 50 with a standard error of 2.
Then: 95% CI = 50 ± 1.96 × 2 = 50 ± 3.92
Interpretation:
We are 95% confident that the true population mean lies between 46.08 and 53.92.
7. Conclusion
• Confidence Interval gives a range where the true parameter likely lies, with a confidence level.
• Confidence intervals are more informative as they quantify the uncertainty of estimation.
(b) (i) For the population at large, the Wechsler Adult Intelligence Scale is designed to yield a normal
distribution of test scores with a mean of 100 and a standard deviation of 15. School district
officials wonder whether, on the average, an IQ score different from 100 describes the intellectual
aptitudes of all students in their district. Wechsler IQ scores are obtained for a random sample of
25 of their students, and the mean IQ is found to equal 105. Using the step-by-step procedure,
test the null hypothesis at the .05 level of significance.
ii) Imagine a simple population consisting of only five observations: 2, 4, 6, 8, 10. List all
possible samples of size two. Construct a relative frequency table showing the sampling
distribution of the mean.
Here is a step-by-step solution to the hypothesis-testing problem (Part B, 10 marks).
🔍 Question Overview:
The question asks whether the mean IQ in the district is different from 100, so this is a two-tailed test.
Step 1: State the hypotheses — H₀: μ = 100, H₁: μ ≠ 100 (two-tailed).
Step 2: Standard error = σ / √n = 15 / √25 = 3.
Step 3: z = (x̄ − μ) / SE = (105 − 100) / 3 ≈ 1.67.
Critical value: z critical = ±1.96
Step 4: Compare z with Critical Value
• Computed z = 1.67
Since 1.67 < 1.96, the test statistic does not fall in the rejection region.
Step 5: Conclusion
Interpretation:
At the 0.05 level of significance, there is not enough evidence to conclude that the average IQ of students
in the district is significantly different from the general population mean of 100.
• Z-statistic: 1.67
• Conclusion: No statistically significant difference in IQ levels of district students at the 0.05 level.
13. (a) Imagine that one of the following 95 percent confidence intervals estimates the effect of vitamin
C on IQ scores:
95% Confidence Interval Lower Limit Upper Limit
1 100 102
2 95 99
3 102 106
4 90 111
6 91 98
(i) Which one most strongly supports the conclusion that vitamin C increases IQ scores?(4)
(ii) . Which one implies the largest sample size? (3)
(iii) Which one most strongly supports the conclusion that vitamin C decreases IQ scores?(3)
(iv) Which one would most likely stimulate the investigator to conduct an additional
experiment using larger sample sizes? (3)
✅ (i) Which one most strongly supports the conclusion that vitamin C increases IQ scores?
(4 marks)
• Answer: Interval 3 (102–106).
• Reason: This interval lies entirely above 100, the average IQ score in the population. It strongly
suggests that IQ scores increased due to vitamin C.
✅ (ii) Which one implies the largest sample size? (3 marks)
• Answer: Interval 1 (100–102).
• Reason: The narrowest confidence interval typically comes from a larger sample size, which
reduces the margin of error. Interval 1 is the narrowest (just 2 units wide), implying a high level of
precision.
✅ (iii) Which one most strongly supports the conclusion that vitamin C decreases IQ scores?
(3 marks)
• Answer: Interval 6 (91–98), which lies furthest below 100.
• Reason: This interval lies entirely below 100, indicating a consistent drop in IQ scores, and thus
supports the idea that vitamin C may reduce IQ.
✅ (iv) Which one would most likely stimulate the investigator to conduct another
experiment with larger sample sizes? (3 marks)
• Answer: Interval 4 (90–111).
• Reason: This interval is very wide, suggesting high variability and uncertainty in the results. A wide
confidence interval usually indicates the sample size was small, and more data is needed to
improve precision.
(b) (i) Exemplify in detail about the significance of a t-test, its procedure and
decision rule with example. (6)
✅ What is a t-test?
A t-test is a statistical method used to determine whether there is a significant difference between the
mean of a sample and a known population mean, or between two sample means, when the population
standard deviation is unknown.
✅ Significance of t-test
✅ Types of t-tests
Type Purpose
One-sample t-test Compare a sample mean with a known population mean
Independent (two-sample) t-test Compare the means of two independent groups
Paired t-test Compare the means of the same group before and after a treatment
Step 3: Compute the Test Statistic
t = (x̄ − μ₀) / (s / √n)
Step 4: Determine Degrees of Freedom
df = n − 1
✅ Example:
✅ Conclusion
There is no statistically significant difference between the sample mean and population mean.
(ii) A study finds that racism at cricket events more often takes place when the
game is played in England, Australia, or New Zealand (say, the EAN countries).
Given that
✅ Given:
Let:
• R = racism takes place during the event
• EAN = the game is played in England, Australia, or New Zealand
We are to find:
• (a) P(No Racism takes place)
We are not directly given P(R), but we can use conditional probability:
To solve this t-test problem for the library loan periods, we will perform a one-sample t-test to determine
whether the average loan period differs significantly from the current loan period policy of 21 days.
✅ Problem Summary
• Sample data (n = 8): 21, 15, 12, 24, 20, 21, 13, 16
• Hypothesized population mean (μ0) = 21 days
• Calculated t = -2.12
• Critical t = ±2.365
Since |−2.12| < 2.365, we fail to reject the null hypothesis.
✅ Conclusion
At the 5% level of significance, there is not enough evidence to conclude that the average loan period
differs significantly from 21 days. Therefore, the library may keep the current policy unless further data
suggests otherwise.
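The same test can be reproduced with SciPy as a quick check (a verification sketch, not part of the original working):

from scipy import stats

loans = [21, 15, 12, 24, 20, 21, 13, 16]
t_stat, p_value = stats.ttest_1samp(loans, popmean=21)
print(round(t_stat, 2), round(p_value, 3))   # t ≈ -2.12, p ≈ 0.07 > 0.05, so fail to reject H0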
✅ Given:
A sample of 10 wheat package weights (in ounces):
16, 15, 14, 15, 14, 15, 16, 14, 14, 14
Value xi xi−xˉ (xi−xˉ)2
16 1.3 1.69
15 0.3 0.09
14 -0.7 0.49
15 0.3 0.09
14 -0.7 0.49
15 0.3 0.09
16 1.3 1.69
14 -0.7 0.49
14 -0.7 0.49
14 -0.7 0.49
Sum of squares: Σ(xi − x̄)² = 6.10
Sample standard deviation: s = √(6.10 / 9) ≈ 0.82
Estimated standard error: s / √n = 0.82 / √10 ≈ 0.26
✅ Final Answers:
• Sample Mean: 14.7 ounces
• Estimated Standard Error of the Mean: 0.26 ounces
b)(i) Illustrate in detail about one factor ANOVA with example. (7)
Definition:
One-Factor ANOVA is a statistical technique used to test if three or more group means are significantly
different from each other. It analyzes the impact of a single independent variable (factor) on a
dependent variable.
Example Scenario:
Suppose a researcher wants to test whether three different fertilizers (A, B, and C) lead to different crop
yields.
Fertilizer A Fertilizer B Fertilizer C
20 25 30
21 27 29
19 26 31
• k= number of groups = 3
• n = total observations = 9
• dfB=k−1=2
• dfW=n−k=6
Step 6: F-ratio
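Filling in the sums of squares from the table data (A: 20, 21, 19; B: 25, 27, 26; C: 30, 29, 31):
Group means: 20, 26, 30; grand mean = 228 / 9 ≈ 25.33
SSB = 3[(20 − 25.33)² + (26 − 25.33)² + (30 − 25.33)²] = 3(28.44 + 0.44 + 21.78) ≈ 152
SSW = (0² + 1² + 1²) + (1² + 1² + 0²) + (0² + 1² + 1²) = 6
MSB = SSB / dfB = 152 / 2 = 76; MSW = SSW / dfW = 6 / 6 = 1
F = MSB / MSW = 76, which far exceeds the critical F(2, 6) ≈ 5.14 at α = 0.05, so H₀ is rejected.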
Interpretation:
If the F-test is significant, it means at least one fertilizer type results in a significantly different crop
yield.
Total SST = 158 df = 8
Conclusion:
One-Factor ANOVA helps test mean differences among more than two groups using variance analysis.
It’s a fundamental concept in statistical analysis and a cornerstone of inferential data science.
ii) A random sample of 90 College students indicates whether they most desire
love, wealth power, health, fame, or family happiness. Using the .05 level of
significance and the following results, test the null hypothesis that in the
underlying Population, the various desires are equally popular using chi-
square test. (6)
Category O (Observed) E (Expected) (O−E)²/E
Love 25 15 6.67
Wealth 10 15 1.67
Power 5 15 6.67
Health 25 15 6.67
Fame 10 15 1.67
Family Happiness 15 15 0.00
Total 90 90 χ² = 23.33
Critical value: χ²(df = 5, α = 0.05) = 11.07. Since 23.33 > 11.07, reject H₀.
✅ Conclusion:
At the 0.05 level of significance, there is sufficient evidence to conclude that not all desires
are equally popular among college students.
(i) Using t, test the null hypothesis at the .05 level of significance. (5)
(iii) Are there any special precautions that should be taken with the present
experimental design?
🔹 Given:
• df=29
• α= 0.05
• One-tailed test
t critical = 1.699
Step 5: Decision Rule
Since:
t calculated = 1.41 < t critical = 1.699
❌ Do not reject H₀ — there is insufficient evidence at the 0.05 level to conclude that the
additive improves mileage.
p-value ≈ 0.085
✅ Final Summary:
b)(i) A library system lends books for periods of 21 days. This policy is
being reevaluated in view of a possible new loan period that could be
either longer or shorter than 21 days. To aid in making this decision,
book-lending records were consulted to determine the loan periods
actually used by the patrons. A random sample of eight records
revealed the following loan periods in days: 21, 15, 12, 24, 20, 21, 13,
and 16. Test the null hypothesis with t-test, using the .05 level of
significance.
xi xi − x̄ (xi − x̄)² (with x̄ = 17.75)
21 3.25 10.56
15 -2.75 7.56
12 -5.75 33.06
24 6.25 39.06
20 2.25 5.06
21 3.25 10.56
13 -4.75 22.56
16 -1.75 3.06
✅ Conclusion:
At the 0.05 level of significance, there is insufficient evidence to conclude that the average loan period is
different from 21 days.
ii) A random sample of 90 college students indicates whether they most desire
love, wealth, power, health, fame, or family happiness. Using the .05 level of
significance and the following results, test the null hypothesis that, in the
underlying population, the various desires are equally popular using chi-square
test.
Desires of college students
Frequency Love Wealth Power Health Fame Family Hap. Total
Observed (fo) 25 10 5 25 10 15 90
✅ Problem Details:
• Total sample size (n) = 90
• Desire categories: Love, Wealth, Power, Health, Fame, Family Happiness
• Number of categories (k) = 6
• Significance level (α) = 0.05
Observed Frequencies (O):
Desire Frequency (O)
Love 25
Wealth 10
Power 5
Health 25
Fame 10
Family Happiness 15
Total 90
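Filling in the computation from these observed frequencies:
Expected frequency for each category: E = 90 / 6 = 15
χ² = Σ(O − E)²/E = (25−15)²/15 + (10−15)²/15 + (5−15)²/15 + (25−15)²/15 + (10−15)²/15 + (15−15)²/15
= 6.67 + 1.67 + 6.67 + 6.67 + 1.67 + 0 = 23.33
df = k − 1 = 5; critical χ² at α = 0.05 is 11.07. Since 23.33 > 11.07, reject H₀.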
✅ Final Conclusion:
At the 0.05 level of significance, there is sufficient evidence to conclude that not all desires are equally
popular among college students.
Logistic Regression is a classification algorithm used when the dependent variable is categorical,
typically binary (0 or 1, yes or no, success or failure).
Unlike linear regression, logistic regression predicts the probability of class membership rather than a
continuous outcome. It is widely used in predictive analysis to model outcomes such as disease
presence, customer churn, fraud detection, etc.
• To identify influential factors and quantify their effect on the binary outcome.
3. Mathematical Model
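In standard form, the logistic (sigmoid) model referred to here is:
P(Y = 1 | X) = 1 / (1 + e^−(β₀ + β₁X₁ + … + βₙXₙ))
Equivalently, the log-odds (logit) are linear in the predictors: log[p / (1 − p)] = β₀ + β₁X₁ + … + βₙXₙ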
4. Key Concepts
5. Assumptions of Logistic Regression
1. The dependent variable is binary
2. Independent observations
6. Example
Use Case: Predict whether a student will pass (1) or fail (0) an exam based on study hours.
7. Evaluation Metrics
Conclusion
1. Introduction
Multiple Linear Regression (MLR) is a supervised learning algorithm used to predict the value of a
continuous dependent variable based on two or more independent (predictor) variables.
It extends simple linear regression, which uses one predictor, by considering multiple factors that jointly
affect the outcome.
3. Mathematical Model
Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
where β₀ is the intercept, βᵢ are the coefficients of the predictors, and ε is the error term.
4. Assumptions of MLR
5. Example Scenario
Problem:
A university wants to predict a student’s final score (Y) based on the following factors:
Sample Data:
Student Hours Studied (X₁) Attendance (%) (X₂) Final Score (Y)
A 10 90 85
B 8 80 78
C 12 95 88
D 6 70 72
Fitted Model (using regression):
Y= 5+ 3.2X1 + 0.4X2
Prediction Example (for a student with 9 hours studied and 85% attendance):
Y = 5 + 3.2(9) + 0.4(85) = 5 + 28.8 + 34 = 67.8
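A small sketch fitting this kind of model with Scikit-learn on the four sample students (with so few points the learned coefficients will not exactly match the illustrative equation above):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10, 90], [8, 80], [12, 95], [6, 70]])    # hours studied, attendance %
y = np.array([85, 78, 88, 72])                          # final scores

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)      # fitted b0 and [b1, b2]
print(model.predict([[9, 85]]))           # predicted score for 9 hours, 85% attendance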
• Adjusted R²: Adjusts R² for number of predictors; better for comparing models.
Conclusion
Multiple Linear Regression is a fundamental predictive modeling tool that captures the combined effect
of multiple independent variables on a single continuous outcome. It's widely used in business, science,
healthcare, and social sciences to make informed decisions based on data.
b) Explain in depth about Time series analysis and its techniques with
relevant examples.
Time Series Analysis is a statistical technique used to analyze data points collected or recorded at
successive points in time, usually at equal intervals (daily, monthly, yearly, etc.).
The goal is to identify patterns (trends, seasonality, cycles) and make predictions about future values
based on previously observed values.
Property Description
Trend Long-term increase/decrease in the data
Seasonality Regular patterns at specific intervals (e.g., monthly, yearly)
Cyclic Repeating patterns but not fixed like seasonality
Irregular Random or unpredictable variations
Decomposition model: Y = T + S + C + I (additive) or Y = T × S × C × I (multiplicative)
Where:
• T: Trend component
• S: Seasonal component
• C: Cyclical component
• I: Irregular (random) component
Metric Description
MAE Mean Absolute Error
RMSE Root Mean Squared Error
MAPE Mean Absolute Percentage Error
AIC/BIC Model selection criteria for ARIMA-type models
Conclusion
Time Series Analysis is a vital tool in data science for understanding and forecasting temporal patterns.
By applying techniques like ARIMA, exponential smoothing, and LSTM, analysts can build models that
capture both short-term fluctuations and long-term trends, making it essential for industries like
finance, meteorology, retail, and healthcare.
15) (a) Illustrate in depth about time series forecasting, its components, moving
averages and its various methods with examples.
Time Series Forecasting is the process of using historical time-stamped data to make predictions about
future values. This is a key technique in data science for analyzing trends, seasonality, and temporal
patterns in data over consistent intervals (e.g., hourly, daily, monthly).
Applications:
• Weather forecasting
Component Description
Trend (T) The long-term progression of the data (increasing, decreasing, or stable).
Seasonality (S) Repetitive short-term cycles that follow a fixed calendar-based period.
Definition:
Moving Average (MA) is a technique that smooths out short-term fluctuations by taking the average of
data points over a fixed number of time periods.
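A brief sketch of a 3-period moving average with pandas (the monthly sales figures are hypothetical):

import pandas as pd

sales = pd.Series([120, 130, 125, 140, 150, 160])   # hypothetical monthly sales
ma3 = sales.rolling(window=3).mean()                 # 3-period moving average
print(ma3)    # first two entries are NaN, then (120 + 130 + 125) / 3 = 125.0, ...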
4. Time Series Forecasting Methods
A. Statistical Methods
Method Description
Naive Forecasting Assumes the future value equals the last observed value; useful as a baseline.
Moving Average Smooths data and forecasts based on the average of the past n periods.
Simple Exponential Smoothing (SES) Uses a weighted average with exponentially decreasing weights.
Holt’s Linear Trend Model Captures linear trends.
Holt-Winters Method Captures both trend and seasonality.
ARIMA (Auto-Regressive Integrated Moving Average) Captures autocorrelation, trends, and lags; best for non-seasonal stationary data.
SARIMA (Seasonal ARIMA) Extension of ARIMA that handles seasonality.
Example (ARIMA):
ARIMA(p,d,q) where:
• p = order of the autoregressive (lag) terms
• d = degree of differencing
• q = order of the moving-average terms
B. Machine Learning Methods
Method Use
Random Forest Regressor Used for multivariate forecasting.
Support Vector Regression (SVR) Works well with non-linear data.
LSTM (Long Short-Term Memory) Deep learning model for sequence prediction; suitable for complex temporal dependencies.
Conclusion
Time Series Forecasting is a crucial part of data science that involves identifying patterns in temporal
data and predicting future outcomes. From classical methods like ARIMA and Holt-Winters to modern
deep learning models like LSTM, each technique serves unique needs based on the data structure and
forecasting goals. Mastery of these concepts enables accurate, actionable predictions in real-world
scenarios.
(b) (i) Compare and contrast between multiple regression and logistic
regression techniques with examples.
Below is a comparison of Multiple Regression and Logistic Regression, including definitions, differences, and examples.
Aspect Multiple Regression Logistic Regression
Nature of Output Real numbers (e.g., 45.3, 99.1) Probabilities (0–1), converted into class labels
Equation Form Y = b₀ + b₁X₁ + … + bₙXₙ p = 1 / (1 + e^−(b₀ + b₁X₁ + … + bₙXₙ))
Model Output Predicts an actual value Predicts a probability and class label
Example Use Case Predicting house prices based on features Predicting whether a student will pass or fail
Example 1 (Multiple Regression — predicting house prices):
Features (X):
• Number of bedrooms
• Age of house
Example 2 (Logistic Regression — predicting a binary customer outcome):
Features (X):
• Age of customer
• Salary
• Previous purchases
Summary Points
• Logistic regression applies a logit function (sigmoid) to map outputs between 0 and 1.
Problem Statement
Temperature t (°C) 10 20 30 40 50 60 70 80 90
We aim to:
1. Find the linear regression equation: y=a + bt
t y t² y² t·y
90 5 8100 25 450
• n = 9
• ∑t = 450
• ∑y = 1691
• ∑t² = 28500
• ∑t·y = 52670
Item Result
Slope b = (n∑ty − ∑t∑y) / (n∑t² − (∑t)²) = (474030 − 760950) / (256500 − 202500) ≈ −5.31
Intercept a = ȳ − b·t̄ = 187.9 − (−5.31)(50) ≈ 453.6
Regression equation y ≈ 453.6 − 5.31t
(i) What is the estimated number of hours for the shortest-suffering 5 percent? (3)
(ii) What proportion of sufferers estimate that their colds lasted longer than
48 hours? (2)
(v) What proportion suffered for between 1 and 3 days, that is, between
24 and 75 hours? (3)
Given Data:
• Mean μ = 83 hours
• Standard deviation σ = 20 hours
Sub-question Answer
(i) 50.1 hours
(ii) 95.99%
(iii) 13.57%
(v) 34.30%
(vi) 70.21%
(i) If applicants with GPAs of 3.50 or above are automatically admitted, what
proportion of applicants will be in this category? (4)
(iii) A special honors program is open to all applicants with GPAs of 3.75 or
better. What proportion of applicants are eligible? (4)
(vi) If the special honors program is limited to students whose GPAs rank
in the upper 10 percent, what will Brittany’s GPA have to be for admission.
Let’s solve this problem using the standard normal distribution (Z-distribution). We’re given that GPAs are normally distributed with mean μ = 3.20 and standard deviation σ = 0.30.
Question Answer
(i) Proportion admitted (GPA ≥ 3.50) 15.87%
(ii) Proportion denied (GPA ≤ 2.50) 0.99%
(iii) Proportion eligible for honors (GPA ≥ 3.75) 3.36%
(vi) GPA needed for the upper 10 percent 3.20 + 1.28(0.30) ≈ 3.58
16. (a) An investigator polls common cold sufferers, asking them to estimate the number
of hours of physical discomfort caused by their most recent colds. Assume that
their estimates approximate a normal curve with a mean of 83 hours and a
standard deviation of 20 hours.
(ii) What proportion of sufferers estimate that their colds lasted longer than
48 hours?
(iv) What is the estimated number of hours suffered by the extreme 1 percent
either above or below the mean?
(v) What proportion suffered for between 1 and 3 days, that is, between 24
and 72 hours?
(vi) What proportion suffered for between 2 and 4 days?
Given:
• Mean μ = 83 hours
• Standard deviation σ = 20 hours
• Distribution: Normal
Sub-question Answer
(i) Shortest 5% suffering hours 50.1 hours
(ii) > 48 hours 95.99%
(iii) < 61 hours 13.57%
(iv) Extreme 1% range [31.5, 134.5] hours
(v) 1–3 days (24–72 hrs) 28.96%
(vi) 2–4 days (48–96 hrs) 70.21%
p-value ≈ 0.0015
Final Summary:
Question Result