Foundations of Data Science Questions

AD 3491 - FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

Part A

1. What is brushing and linking in exploratory data analysis?


Brushing is the process of interactively selecting data points in a plot. It allows the analyst to highlight
specific data points and observe how they behave in other related visualizations.

Linking means when data selected (brushed) in one visualization is automatically highlighted in other
visualizations. This helps explore relationships across multiple variables.

Example: Selecting outliers in a scatter plot and seeing where they lie on a histogram.
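A sketch of how brushing and linking can be wired up in code, assuming Altair 5 and its bundled vega_datasets sample data (neither is part of the original answer): an interval brush on a scatter plot filters a linked bar chart.

import altair as alt
from vega_datasets import data

cars = data.cars()
brush = alt.selection_interval()          # the "brush" the analyst drags on the scatter plot

points = alt.Chart(cars).mark_point().encode(
    x="Horsepower:Q", y="Miles_per_Gallon:Q",
    color=alt.condition(brush, "Origin:N", alt.value("lightgray"))
).add_params(brush)

bars = alt.Chart(cars).mark_bar().encode(
    x="count()", y="Origin:N", color="Origin:N"
).transform_filter(brush)                 # linking: only the brushed points are counted here

points & bars                             # render the two linked views together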

2. How does a confusion matrix define the performance of a classification algorithm?

A confusion matrix shows the actual vs. predicted classifications. It includes:

• True Positive (TP): Correctly predicted positives

• False Positive (FP): Incorrectly predicted positives

• True Negative (TN): Correctly predicted negatives

• False Negative (FN): Incorrectly predicted negatives

Performance Metrics:

• Accuracy = (TP + TN) / Total

• Precision = TP / (TP + FP)

• Recall = TP / (TP + FN)

• F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

It provides insights into misclassifications and class balance.
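A minimal sketch of these metrics in scikit-learn, using small hypothetical label lists (the data is illustrative only):

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]          # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]          # hypothetical predicted labels

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))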

3. Identify the distribution that reflects the following scenario: "Reading achievement scores for a third-grade class consisting of about equal numbers of regular students and learning-challenged students."

The correct distribution is a bimodal distribution.
Explanation: Since the class has two distinct groups (regular and learning-challenged students), their
scores likely form two peaks or modes, each representing a subgroup, which leads to a bimodal pattern.

4. How does regression toward the mean differ from regression analysis? Give an example.
Regression toward the mean is a statistical phenomenon where extreme values tend to move closer to
the average in subsequent measurements.

Example: A student scoring exceptionally high on one test is likely to score closer to the class average on
the next, not due to a change in ability but due to natural variation.

It differs from regression analysis, which models relationships between variables, while this concept
explains statistical tendency of extremes returning to normal.

5. State the Central Limit Theorem.


Central Limit Theorem (CLT):
For any population with finite mean (μ) and variance (σ²), the sampling distribution of the sample mean
tends to become approximately normal as the sample size (n) becomes large (typically n > 30),
regardless of the original distribution.

Significance:

• Allows use of normal distribution for inference.

• Forms the basis for hypothesis testing and confidence intervals.
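A short simulation sketch (illustrative, not part of the theorem statement): averaging samples drawn from a skewed exponential population still produces an approximately normal distribution of sample means.

import numpy as np

np.random.seed(0)
# Draw 10,000 samples of size n = 50 from a skewed population (exponential, mean = 2.0)
sample_means = [np.random.exponential(scale=2.0, size=50).mean() for _ in range(10_000)]
print(np.mean(sample_means))   # close to the population mean 2.0
print(np.std(sample_means))    # close to sigma / sqrt(n) = 2.0 / sqrt(50), about 0.28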

6. Compare between population and sample.


Feature Population Sample

Definition Entire group under study Subset of the population

Size Usually large or infinite Smaller, manageable portion

Representation Contains all elements Should represent the population well

Usage Ideal but hard to collect Used in practical statistical analysis

Example All college students in India 500 students selected for survey

7. What is the significance of p-value in hypothesis testing?


P-value measures the probability of obtaining a test statistic as extreme as, or more extreme than, the
observed value under the null hypothesis.

• Low p-value (< 0.05): Strong evidence against null hypothesis → reject it.

• High p-value (> 0.05): Weak evidence against null → fail to reject.

Significance: Helps in decision-making about hypotheses.

8. Compare between t-test and ANOVA.


Feature t-test ANOVA

Purpose Compare means of two groups Compare means of three or more groups

Output t-statistic F-statistic

Types One-sample, independent, paired One-way, two-way

Assumptions Normal distribution, equal variance Same as the t-test

Example Drug A vs Drug B Drug A vs Drug B vs Drug C

9. Why do we need Goodness of Fit?


Goodness of Fit tests how well a statistical model (or distribution) fits observed data.

• It checks if sample data follows a specific distribution (e.g., normal, Poisson).

• Common test: Chi-Square Goodness of Fit Test

Use Cases:

• Validating assumptions before using statistical tests.

• Confirming theoretical distributions match real data.

10. What is survival analysis?


Survival analysis involves statistical methods to analyze time-to-event data.

• Event: death, failure, churn, etc.

• Key function: the survival function S(t), which gives the probability of surviving beyond time t.

Applications:
• Medical research: patient survival time.

• Business: time until customer churn.

11. Differentiate between structured and unstructured data.


Feature Structured Data Unstructured Data

Format Rows and columns (tabular) Free-form (text, images, video)

Storage Relational DBs (SQL) NoSQL, Hadoop

Examples Sales data, employee records Tweets, emails, medical scans

Ease of Analysis Easy with traditional methods Needs advanced tools (NLP, CV)

14. What are outliers in the data?


Outliers are data points that significantly differ from the rest of the dataset.

• Can be due to variability or errors.

• Affect mean, standard deviation, and model accuracy.

Detection Methods:

• Box plot

• Z-score

• IQR (Interquartile Range)

Example: A person earning ₹1 crore in a dataset of ₹20,000–₹80,000 salaries.
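A small sketch of the IQR and z-score rules on a hypothetical salary list (values chosen only to mirror the example above):

import numpy as np

salaries = np.array([20000, 35000, 42000, 58000, 76000, 80000, 10000000])  # one ₹1-crore outlier
q1, q3 = np.percentile(salaries, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(salaries[(salaries < lower) | (salaries > upper)])   # IQR rule flags the ₹1-crore salary

z = (salaries - salaries.mean()) / salaries.std()
print(salaries[np.abs(z) > 3])   # z-score rule; may miss the outlier in a sample this small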

15. Compare between one-tailed and two-tailed tests.


Feature One-Tailed Test Two-Tailed Test

Hypothesis Direction Tests effect in one direction Tests effect in both directions

P-value split All in one tail Split between both tails

Use Case Checking if A > B Checking if A ≠ B

Sensitivity More sensitive to one side Balanced for any deviation


18. Write a note on F-test.
F-test is used to compare variances between two or more groups.

• Formula: F = (variance of group 1) / (variance of group 2)

• Used in ANOVA, regression model testing, and comparing model fits.

Applications:

• Determine if groups have equal variances.

• Model selection in regression.

20. Why do we need weighted resampling?


Weighted resampling assigns different sampling probabilities to data points based on importance or
frequency.

Use Cases:

• Handling imbalanced datasets.

• Emphasize rare but important classes.

• Bootstrapping models to improve generalization.

Example: In fraud detection, fraud samples may be rare but must be sampled more during training.
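A minimal pandas sketch of weighted resampling for the fraud example (the column names and values are hypothetical):

import pandas as pd

df = pd.DataFrame({"amount": [120, 80, 9500, 60, 10200],
                   "is_fraud": [0, 0, 1, 0, 1]})
weights = df["is_fraud"].map({0: 1, 1: 10})        # give rare fraud rows 10x the sampling weight
resampled = df.sample(n=1000, replace=True, weights=weights, random_state=42)
print(resampled["is_fraud"].mean())                 # fraud is now far more frequent than in the raw data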

Part B

1. (a)(i) Exemplify in detail about different Facets of data with examples.


Data is at the heart of data science, and it exists in many different forms. Understanding the facets of data
is essential for data scientists to prepare, analyze, and make meaningful interpretations. The key facets of
data include:

1. Structured Data
Definition:
Structured data is highly organized and easily searchable using traditional databases and tools like SQL.
Characteristics:
• Organized in rows and columns
• Follows a fixed schema
• Easily stored in relational databases
Examples:
• Student database (Name, Roll Number, Marks)
• Banking transactions (Date, Account No, Amount)
Tabular Example:
Student ID Name Age Marks
101 Ravi 20 87
102 Priya 21 91

2. Unstructured Data
Definition:
Unstructured data lacks a predefined format or organization, making it harder to analyze directly.
Characteristics:
• No fixed schema
• Requires advanced tools (NLP, CV)
• High in volume and complexity
Examples:
• Emails, social media posts
• Audio/video files
• Medical images

3. Semi-Structured Data
Definition:
Data that is not stored in a relational database but still has some organizational properties.
Characteristics:
• Has tags or markers
• Easier to analyze than unstructured data
• Common in data exchange formats
Examples:
• JSON, XML, HTML documents
Example JSON:
{
"name": "Anil",
"age": 25,
"city": "Mumbai"
}

4. Temporal Data
Definition:
Temporal data is time-stamped and varies with time.
Characteristics:
• Time-series in nature
• Useful for forecasting
Examples:
• Stock market prices
• Temperature recordings
Visualization:
Stock Price vs Time → Line chart

5. Spatial Data
Definition:
Spatial data is related to physical space or geography.
Characteristics:
• Includes coordinates, maps
• Requires GIS tools
Examples:
• GPS data
• Satellite images
• Store locations

6. Quantitative and Qualitative Data


Type Description Example
Quantitative Numerical data Age, salary, temperature
Qualitative Descriptive data (categorical) Gender, color, city name

7. Big Data
Definition:
Big data refers to large, complex data sets that cannot be handled using traditional processing tools.
3Vs of Big Data:
• Volume: Huge amount
• Velocity: Speed of data generation
• Variety: Different types and sources
Example:
• Data generated by Facebook every second

8. Metadata
Definition:
Metadata is "data about data". It provides information about other data.
Example:
• Author name and date for a file
• File size and type

Summary Table
Facet Description Tools Used
Structured Tabular data SQL, Excel
Unstructured Raw formats like text, video NLP, Computer Vision
Semi-structured Tagged data (JSON, XML) XML parsers, NoSQL
Temporal Time-based measurements Time series analysis
Spatial Location-based data GIS, Google Maps
Quantitative Numerical measurements Statistical analysis
Qualitative Categorical attributes Label encoding
Big Data High-volume and fast data Hadoop, Spark
Metadata Info about other data Data catalogs

ii) Sketch and outline the step-by-step activities in the data science process.
The Data Science Life Cycle defines the structured workflow followed by data scientists to solve real-world
problems using data.

Diagram of Data Science Process


[Problem Definition]

[Data Collection]

[Data Cleaning & Preparation]

[Exploratory Data Analysis (EDA)]

[Model Building]

[Model Evaluation]

[Deployment]

[Monitoring & Maintenance]

1. Problem Definition
Goal: Understand the problem clearly.
• Define objectives
• Identify business needs
• Set success criteria
Example: "Predict customer churn in a telecom company."

2. Data Collection
Goal: Gather data from various sources.
• Internal databases
• Public datasets
• APIs, sensors, surveys
Example: Collect customer data, call records, service usage logs.

3. Data Cleaning & Preparation


Goal: Prepare data for analysis.
• Handle missing values
• Remove duplicates
• Convert data types
• Encode categorical variables
Example: Replace missing age values with the mean.

4. Exploratory Data Analysis (EDA)


Goal: Understand patterns, trends, and anomalies.
• Descriptive statistics (mean, median, mode)
• Visualizations (histograms, boxplots, scatterplots)
• Correlation analysis
Example: Plotting customer age vs churn rate.

5. Feature Engineering
Goal: Create meaningful features from raw data.
• Normalize/scale data
• Extract useful variables
• Reduce dimensionality (PCA)
Example: Derive average monthly usage from daily logs.

6. Model Building
Goal: Train machine learning models.
• Choose algorithms (Linear Regression, Decision Tree, etc.)
• Split data into train/test sets
• Use libraries like Scikit-learn
Example: Use Logistic Regression to predict churn.

7. Model Evaluation
Goal: Check model performance.
• Metrics: Accuracy, Precision, Recall, F1-Score
• Use validation techniques (K-Fold, Cross-validation)
Example:
• Accuracy = 92%
• F1-score = 0.88

8. Model Deployment
Goal: Deploy the model into production.
• Use tools: Flask API, Streamlit, Docker
• Make predictions accessible to users
Example: Integrate churn prediction into CRM software.

9. Monitoring & Maintenance


Goal: Ensure model works well over time.
• Retrain with new data
• Monitor drift in performance
• Handle system issues
Example: If churn prediction accuracy drops below 80%, retrain model.

Conclusion
The facets of data help us understand the variety and complexity of real-world information. Each type of
data requires specific tools and handling techniques. Similarly, the data science process is a structured
approach involving multiple iterative stages — from understanding the problem to deploying the model
and monitoring it.
By mastering both the facets of data and the workflow of data science, students and professionals are
better equipped to solve analytical problems in domains like healthcare, finance, marketing, and more.

11. (b) Explain in Detail About Cleansing, Integrating, and Transforming Data
with Examples
Data preprocessing is a crucial step in the data science pipeline. It ensures that the data used for analysis
and modeling is clean, consistent, and usable.

The three core components of data preprocessing are:

1. Data Cleansing (Cleaning)

2. Data Integration

3. Data Transformation
🔹 1. Data Cleansing (Data Cleaning)

✅ Definition:

Data cleansing is the process of detecting and correcting (or removing) inaccurate, incomplete,
inconsistent, or irrelevant parts of the data.

🧽 Why is Data Cleansing Important?

• Ensures data quality

• Increases model performance

• Removes noise and errors

⚠️ Common Issues in Raw Data:

Issue Example

Missing values Employee Age = NULL

Duplicates Same record entered twice

Inconsistencies “Male”, “M”, “male” – same category, different values

Outliers Salary = ₹99,99,999 in dataset of ₹20,000–₹60,000

Spelling mistakes "Kolkatta", "Kolkata"

🛠️ Techniques of Data Cleaning

1. Handling Missing Data:

o Remove rows/columns

o Fill with mean, median, or mode

o Predict missing values using models

df['Age'].fillna(df['Age'].mean(), inplace=True)

2. Removing Duplicates:

df.drop_duplicates(inplace=True)

3. Correcting Inconsistent Data:


Convert all strings to lowercase or use mapping.
df['Gender'] = df['Gender'].str.lower()

4. Handling Outliers:

o Z-score method

o IQR method

o Winsorizing

5. Standardizing Values:
Replace variations of categorical values with a standard label.

📝 Example:

Raw Data:

Name Age Gender City

Ravi 24 Male Chennai

RAVI MALE chennai

Priya 22 Female Bengaluru

Cleaned Data:

Name Age Gender City

Ravi 24 male chennai

Priya 22 female bengaluru

🔹 2. Data Integration

✅ Definition:

Data integration is the process of combining data from multiple sources into a coherent dataset.

🧩 Why Data Integration Is Needed?

• Data may come from different departments or databases.

• Ensures a unified view.

• Essential for data warehouses and business intelligence.


📚 Sources of Data Integration:

• SQL Databases

• CSV Files

• APIs

• Cloud Storage

• IoT Sensors

🛠️ Techniques in Data Integration

1. Schema Integration:
Aligning different structures into one format.

2. Entity Identification:
Matching records that refer to the same entity.

o E.g., "Customer ID" vs "Client No."

3. Data Value Conflicts Resolution:


Handling differences in naming, units, encoding.

📝 Example:

Customer Table (CRM):

Cust_ID Name Email

101 Ravi ravi@mail.com

Transaction Table (Billing):

Client_No Name Amount

101 Ravi K. ₹5000

After Integration:

Customer_ID Name Email Amount

101 Ravi ravi@mail.com ₹5000


💻 Tools for Data Integration:

• ETL tools: Talend, Apache Nifi

• Cloud platforms: AWS Glue, Azure Data Factory

• Manual scripting: Python (pandas.merge), as sketched below
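A minimal sketch of the pandas.merge approach for the CRM/billing example above (the column names follow that example):

import pandas as pd

crm = pd.DataFrame({"Cust_ID": [101], "Name": ["Ravi"], "Email": ["ravi@mail.com"]})
billing = pd.DataFrame({"Client_No": [101], "Name": ["Ravi K."], "Amount": [5000]})

# Entity identification: Cust_ID and Client_No refer to the same customer
merged = pd.merge(crm, billing, left_on="Cust_ID", right_on="Client_No", how="inner")
print(merged[["Cust_ID", "Name_x", "Email", "Amount"]])   # Name_x is the CRM name after the join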

🔹 3. Data Transformation

✅ Definition:

Data transformation is the process of converting data into a suitable format or structure for analysis or
modeling.

🔁 Why is Transformation Needed?

• Prepares data for machine learning

• Enhances data quality

• Increases model accuracy

🛠️ Techniques in Data Transformation

1. Normalization/Standardization:

o Brings numerical values to a similar scale.

o Example: Min-Max Scaling

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df[['Age']])

2. Encoding Categorical Variables:

o Convert text labels into numerical form.

o One-Hot Encoding or Label Encoding

pd.get_dummies(df['Gender'])

3. Binning:

o Convert continuous data into categories.


o E.g., Age into child, adult, senior.

4. Log Transformation:

o Useful for skewed data.

df['Salary'] = np.log(df['Salary'])

5. Feature Construction:

o Creating new features from existing data.

o Example: BMI = weight / (height)^2

6. Date-Time Transformation:

o Extracting year, month, day from timestamp

df['Year'] = df['Date'].dt.year

📝 Example:

Before Transformation:

Age Gender Income

25 Male ₹50,000

30 Female ₹60,000

After Transformation:

Age_norm Gender_male Income_log

0.33 1 10.82

0.67 0 11.00

📊 Diagram: Data Preprocessing Flow

[Raw Data]

[Cleansing]

[Integration]

[Transformation]

[Analysis / Modeling]

✅ Benefits of Cleansing, Integrating, and Transforming Data

Process Benefit

Cleansing Improves data quality

Integration Unified and consistent dataset

Transformation Better performance of algorithms

🧠 Conclusion

The success of any data science project depends heavily on data preprocessing. Cleansing removes
errors, integration combines data from diverse sources, and transformation makes the data suitable for
modeling. Mastering these techniques ensures accuracy, efficiency, and robustness in analytics and
predictive models.

“Garbage in, garbage out” — if raw data is poor, the model output will also be poor. Hence,
preprocessing is the foundation of data science success.

11) c) Explain in detail about the benefits and uses of data science with
counter examples.
📘 Introduction to Data Science

Data Science is the study of extracting knowledge and insights from large volumes of data using a
combination of statistical methods, algorithms, machine learning, and domain expertise. In today’s
digital world, where data is generated at an unprecedented scale, data science plays a vital role in
helping individuals, organizations, and governments make informed, accurate, and strategic decisions.

✅ Benefits of Data Science

1. Improved Decision-Making

Data science enables companies to make decisions based on facts, patterns, and data-driven
predictions, rather than gut instinct.
• Example: A retail chain analyzes purchasing patterns to stock high-demand products before a
festival season.

• Counter Example: A business that makes assumptions without analyzing customer behavior may
stock the wrong inventory, resulting in losses.

2. Accurate Predictions

Data science models can forecast future outcomes with high accuracy when trained on historical data.

• Example: Weather forecasting using historical climate and satellite data helps predict storms.

• Counter Example: Without using data science models, predictions may be based on outdated or
insufficient information, causing inaccurate planning or disaster response.

3. Personalized Customer Experiences

Data science tailors products, content, and advertisements to user behavior, improving satisfaction and
engagement.

• Example: Netflix recommends shows based on a user’s watch history and similar viewer profiles.

• Counter Example: Without personalization, customers may be overwhelmed or dissatisfied by


irrelevant content, reducing user retention.

4. Fraud Detection and Risk Management

Data science helps detect anomalies and frauds by identifying unusual patterns in large datasets.

• Example: Banks use machine learning models to detect credit card fraud in real time.

• Counter Example: Institutions without such detection systems are prone to higher risks and
financial loss due to undetected fraud.

5. Operational Efficiency

Data science improves efficiency by automating processes and optimizing resource allocation.

• Example: Logistics companies like FedEx use data to optimize delivery routes and reduce fuel
costs.

• Counter Example: Without analysis, inefficient routing could cause delays and operational losses.

🎯 Applications and Uses of Data Science


Data science is applied across numerous industries:

🔬 1. Healthcare

• Use: Disease prediction, image diagnostics, drug discovery.

• Example: Predicting patient readmission rates using electronic health records (EHR).

• Counter Example: Without data science, patient monitoring may miss early warning signs, risking
health outcomes.

🛍️ 2. Retail and E-Commerce

• Use: Customer segmentation, inventory prediction, recommendation engines.

• Example: Amazon uses recommendation systems to show similar products.

• Counter Example: Retailers without data analysis may fail to anticipate customer needs or run
inefficient sales campaigns.

🏦 3. Banking and Finance

• Use: Credit scoring, fraud detection, algorithmic trading.

• Example: Loan eligibility prediction based on customer income, credit history, and spending
behavior.

• Counter Example: Relying only on fixed rules for credit approval may exclude creditworthy but
unconventional applicants.

🚗 4. Transportation and Logistics

• Use: Route planning, vehicle maintenance prediction, ride matching (Uber/Ola).

• Example: Predicting taxi demand in urban areas to optimize driver deployment.

• Counter Example: Poor planning may result in long wait times and customer dissatisfaction.

🎓 5. Education

• Use: Personalized learning, performance prediction, dropout prediction.

• Example: Online platforms recommend courses based on previous user activity.

• Counter Example: Without adaptation, learners may receive one-size-fits-all material, reducing
effectiveness.
📈 6. Business Intelligence

• Use: Dashboards, sales forecasting, trend analysis.

• Example: BI tools like Power BI and Tableau display real-time KPIs for executive decisions.

• Counter Example: Without visualization and trend insights, businesses may miss critical
opportunities or risks.

⚖️ Counter Examples – Where Data Science May Fail

Despite its benefits, data science is not foolproof. It can lead to negative consequences if not applied
properly.

Problem | Description | Example

Biased Data | If training data contains bias, the model will reflect that bias. | A recruitment model favoring one gender due to biased historical data.

Overfitting | Model fits training data too closely, but fails on new data. | A spam detection model that marks all unfamiliar emails as spam.

Incorrect Assumptions | Misinterpretation of data relationships. | Assuming correlation implies causation.

Lack of Context | Models can't understand real-world implications without human insight. | A model might suggest layoffs to cut costs without considering morale.

Privacy Violations | Improper handling of sensitive data can lead to ethical concerns. | The Facebook-Cambridge Analytica scandal.

📌 Summary Table: Benefits vs Counter Examples

Benefit | Example | Counter Example

Better decision-making | Analyzing customer trends | Decisions made on assumptions

Accurate predictions | Forecasting product demand | Guesswork leads to over/understock

Personalization | Recommending products on shopping sites | Irrelevant suggestions frustrate users

Fraud detection | Banking systems detecting anomalies | Missed fraud leads to huge losses

Operational efficiency | Optimized delivery routes in logistics | Manual scheduling leads to inefficiencies

Health predictions | Predicting high-risk patients | Misdiagnosis due to lack of data insight

Resource optimization | Allocating doctors based on patient flow data | Wastage of staff or long patient wait times

🧠 Ethical Considerations

Data scientists must also consider:

• Fairness: Avoid discrimination.

• Transparency: Explainable models.

• Accountability: Know who’s responsible for decisions.

• Privacy: Respect user data.

📌 Conclusion

Data Science has revolutionized the way the world operates, bringing efficiency, accuracy, and
intelligence into decision-making and operations. From healthcare to banking, education to
entertainment, the uses of data science are vast and impactful.

However, it must be used with caution, ethical responsibility, and awareness of its limitations. Blind
reliance without understanding the data, context, or model behavior can lead to bias, errors, and real-
world harm.

"With great data comes great responsibility."


A good data scientist not only analyzes data but also understands the implications of its use.

ii) Describe in depth about Exploratory Data Analysis (EDA) Techniques

📘 Introduction to Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is the first and most critical step in any data science process. It involves
analyzing datasets to summarize their main characteristics, often using visualization techniques.

EDA helps in:

• Understanding data distribution

• Identifying missing values and outliers

• Uncovering patterns, trends, and relationships


• Selecting features for model building

“EDA is the process of letting the data speak for itself.”

🎯 Goals of EDA

1. Gain insight into data structure

2. Detect outliers and anomalies

3. Test assumptions

4. Prepare for model selection

5. Guide feature engineering

🛠️ Key Techniques of EDA

1. Univariate Analysis

Analysis of a single variable to understand its distribution.

📊 For Categorical Variables:

• Bar Chart: Shows frequency of each category

• Pie Chart: Shows proportion of each category

Example:

• Plotting the number of male and female customers

📈 For Numerical Variables:

• Histogram: Shows distribution (e.g., normal, skewed)

• Boxplot: Identifies median, quartiles, and outliers

• Density Plot: Smooth curve to represent distribution

Example:

• Histogram of salaries of employees

2. Bivariate Analysis

Analyzing two variables to find relationships between them.


🔁 Numerical vs Numerical:

• Scatter Plot: Visualizes correlation

• Correlation Coefficient (Pearson’s r):

o +1: Perfect positive correlation

o 0: No correlation

o –1: Perfect negative correlation

Example:

• Scatter plot of age vs income

📊 Categorical vs Numerical:

• Box Plot: Distribution of a numeric variable across categories

Example:

• Boxplot of customer spending across regions

📋 Categorical vs Categorical:

• Stacked Bar Chart / Grouped Bar Chart

• Contingency Table: Cross-tabulation

Example:

• Frequency table of gender vs purchase decision

3. Multivariate Analysis

Analysis involving more than two variables to discover interactions.

🖼️ Techniques:

• Pair Plot (Scatterplot Matrix): All pairwise combinations

• Heatmaps: Correlation matrix for numerical features

• Bubble Plot: Uses size, color, and position

Example:

• Heatmap showing correlation between height, weight, and BMI


🔍 Data Cleaning Techniques in EDA

Before visualizing, EDA includes preprocessing:

1. Handling Missing Data

• Removal: Drop rows/columns with missing values

• Imputation: Fill with mean, median, mode, or predicted values

Example:

• Replace missing age values with average age

2. Outlier Detection

• Boxplot: Visual identification

• Z-Score: Values with |z| > 3 are outliers

• IQR Method: Values outside [Q1 – 1.5×IQR, Q3 + 1.5×IQR]

3. Data Type Conversion

• Convert strings to datetime or numeric format

🖼️ Visualization Tools

Tool Use

Matplotlib Base Python plotting library

Seaborn Statistical plots with advanced themes

Plotly Interactive graphs

Tableau Drag-and-drop business visualizations

Excel Basic charts and pivot tables

📚 Advanced EDA Concepts

1. Brushing and Linking

• Highlighting data in one plot reflects in other plots.

• Used in interactive dashboards.

Example:
• Selecting high-income customers in one graph also highlights their age in another.

2. Dimensionality Reduction

• Techniques like PCA (Principal Component Analysis) reduce features to 2D/3D for visualization.

Example:

• Visualizing high-dimensional genetic data in 2D

📌 EDA Case Study Example

Suppose we have a dataset: Titanic.csv (passenger data)

Step-by-step EDA:

1. Read the data:


df = pd.read_csv("titanic.csv")

2. Check shape:
df.shape

3. Null values:
df.isnull().sum()

4. Univariate Analysis:

o Bar plot of Sex

o Histogram of Age

5. Bivariate Analysis:

o Survival rate by Sex: sns.barplot(x='Sex', y='Survived', data=df)

6. Outliers:

o Boxplot of Fare

7. Multivariate Analysis:

o Heatmap of correlations between Age, Fare, and Pclass

📊 Table: Summary of EDA Techniques

Technique Variable Type Purpose Visualization


Histogram Numerical Distribution Matplotlib/Seaborn

Boxplot Numerical Outlier Detection Seaborn

Scatter Plot Num vs Num Relationship (correlation) Seaborn/Plotly

Bar Chart Categorical Frequency count Matplotlib

Heatmap Numerical features Correlation Matrix Seaborn

Pair Plot Multiple variables Pairwise scatter plots Seaborn

🧠 Importance of EDA in Data Science

Purpose Description

Understand the data Know structure, size, and type of data

Clean the data Detect missing values, wrong types, errors

Find patterns Spot trends, groupings, or clusters

Build assumptions Hypothesize relationships between features

Feature selection Identify which features are most useful for modeling

📌 Conclusion

Exploratory Data Analysis (EDA) is not just the first step — it’s the foundation of effective data science.
By using visualizations and statistical techniques, data scientists gain intuitive and structured
understanding of the data before model building.

EDA:

• Reduces errors

• Improves model quality

• Helps select relevant features

• Saves time during modeling

“If data is the new oil, then EDA is the refining process.”

12 (a) For each of the following pairs of distributions, first decide whether
their standard deviations are about the same or different. If their standard
deviations are different, indicate which distribution should have the larger
standard deviation. Note that the distribution with the more dissimilar set of
scores or individuals should produce the larger standard deviation regardless
of whether, on average, scores or individuals in one distribution differ from
those in the other distribution.
(i) SAT scores for all graduating high school seniors (a1) or all college freshmen (a2)
(ii) Ages of patients in a community hospital (b1) or a children's hospital (b2)
(iii) Motor skill reaction times of professional baseball players (c1) or college students (c2)
(iv) GPAs of students at some university as revealed by a random sample (d1) or a census of the entire student body (d2)
(v) Anxiety scores (on a scale from 0 to 50) of a random sample of college students taken from the senior class (e1) or those who plan to attend an anxiety-reduction clinic (e2)
(vi) Annual incomes of recent college graduates (f1) or of 20-year alumni (f2)

Rule: The standard deviation (SD) is a measure of how spread out the values in a dataset are.
A larger SD means the data values vary more from the mean.

✅ (i) SAT scores for all graduating high school seniors (a1) or all college freshmen (a2)

• Explanation: SAT scores of college freshmen (a2) are from students who were admitted to college,
possibly a more academically filtered group, while high school seniors (a1) include all levels of
ability.

• Conclusion:

SD is larger for a1 (all high school seniors).

✅ (ii) Ages of patients in a community hospital (b1) or a children’s hospital (b2)

• Explanation: A community hospital (b1) serves people of all ages — children, adults, seniors —
whereas a children’s hospital (b2) focuses on a narrow age group (usually 0–18).

• Conclusion:

SD is larger for b1 (community hospital).

✅ (iii) Motor skill reaction times of professional baseball players (c1) or college students (c2)
• Explanation: Professional baseball players (c1) have consistent, trained reaction times due to
athletic conditioning, whereas college students (c2) show more variability in skill levels.

• Conclusion:

SD is larger for c2 (college students).

✅ (iv) GPAs of students at some university revealed by a random sample (d1) or a census of the entire
student body (d2)

• Explanation: A random sample (d1) may have sampling variation, while a census (d2) captures
everyone, thus being more stable and less varied.

• Conclusion:

SD is larger for d1 (random sample).

✅ (v) Anxiety scores (scale 0–50) of a random sample of senior college students (e1) or those who plan
to attend an anxiety-reduction clinic (e2)

• Explanation: Students planning to attend a clinic (e2) are likely to show higher and more varied
anxiety levels, while the random sample (e1) might show a narrower, more moderate distribution.

• Conclusion:

SD is larger for e2 (clinic group).

✅ (vi) Annual incomes of recent college graduates (f1) or of 20-year alumni (f2)

• Explanation: Recent graduates (f1) typically have similar entry-level salaries. In contrast, 20-year
alumni (f2) are likely to have diverse careers, promotions, and financial outcomes, leading to
greater spread.

• Conclusion:

SD is larger for f2 (20-year alumni).

📌 Summary Table

Pair Larger Standard Deviation Reason

(i) a1 (HS seniors) Broader academic ability range

(ii) b1 (Community hospital) Serves all age groups

(iii) c2 (College students) More variable reaction times

(iv) d1 (Random sample) Random samples show more variability


(v) e2 (Clinic group) Greater variability in anxiety

(vi) f2 (20-year alumni) Wide range of income paths

(b) In a survey, a question was asked


“During your lifetime, how often have you changed your permanent
residence?” a group of 18 college students replied as follows:
1,3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 3.
Find the mode, median, mean and standard deviation.

Question Recap:

In a survey, 18 college students were asked:


“During your lifetime, how often have you changed your permanent residence?”
The responses were:
1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 3

Step 1: Organize the Data

Sorted Data:
0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 7, 8, 11
Total numbers (n = 18)

Step 2: Mode

Mode = value(s) that occur most frequently

Frequency table:

Value Frequency

0 3

1 2

2 3

3 4

4 2

5 1

7 1
8 1

11 1

Mode = 3 (appears 4 times)

Step 3: Median

Since n = 18 (even), median is the average of the 9th and 10th values.

From sorted list:

• 9th value = 3

• 10th value = 3

Median = (3 + 3)/2 = 3

Step 4: Mean

Mean = Σx / n = (1 + 3 + 4 + 1 + 0 + 2 + 5 + 8 + 0 + 2 + 3 + 4 + 7 + 11 + 0 + 2 + 3 + 3) / 18 = 59 / 18 ≈ 3.28

Step 5: Standard Deviation

xi xi − x̄ (xi − x̄)²

1 −2.28 5.20
3 −0.28 0.08
4 0.72 0.52
1 −2.28 5.20
0 −3.28 10.76
2 −1.28 1.64
5 1.72 2.96
8 4.72 22.28
0 −3.28 10.76
2 −1.28 1.64
3 −0.28 0.08
4 0.72 0.52
7 3.72 13.84
11 7.72 59.60
0 −3.28 10.76
2 −1.28 1.64
3 −0.28 0.08
3 −0.28 0.08

Σ(xi − x̄)² ≈ 147.6, so s² = 147.6 / (18 − 1) ≈ 8.68 and s = √8.68 ≈ 2.95

Final Answers Summary

Measure Value

Mode 3

Median 3

Mean 3.28

Standard Deviation 2.95 (sample, n − 1)
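These answers can be checked with Python's statistics module (a verification sketch, not required by the question):

import statistics

data = [1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 3]
print(statistics.mode(data))              # 3
print(statistics.median(data))            # 3.0
print(round(statistics.mean(data), 2))    # 3.28
print(round(statistics.stdev(data), 2))   # 2.95 (sample standard deviation, n - 1)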


12) a) The frequency distribution for the length, in seconds, of 100
telephone calls was:
Time (seconds) Frequency
0-20 0
21-40 5
41-60 7
61-80 14
81-100 28
101-120 21
121-140 13
141-160 9
161-180 3

Compute mean, median and variance.

✅ Question:

The frequency distribution for the length (in seconds) of 100 telephone calls is:

Time Interval (sec) Frequency (f)

0–20 0

21–40 5

41–60 7

61–80 14

81–100 28

101–120 21

121–140 13

141–160 9

161–180 3
✅ Step 1: Compute the Class Midpoints

Midpoint for each class = (Lower Limit + Upper Limit) / 2

Class Interval Frequency (f) Midpoint (x)

0–20 0 10

21–40 5 30.5

41–60 7 50.5

61–80 14 70.5

81–100 28 90.5

101–120 21 110.5

121–140 13 130.5

141–160 9 150.5

161–180 3 170.5

✅ Step 2: Compute Mean

f x (midpoint) f·x

0 10 0

5 30.5 152.5

7 50.5 353.5

14 70.5 987.0

28 90.5 2534.0

21 110.5 2320.5

13 130.5 1696.5

9 150.5 1354.5
3 170.5 511.5

Now sum up: Σf·x = 9910, so Mean = Σf·x / N = 9910 / 100

➡️ Mean = 99.1 seconds

✅ Step 3: Compute Median

Use the grouped-data formula:

Median = L + ((N/2 − F) / fm) · h

Where:

• N = 100, so N/2 = 50

• Median class: the class whose cumulative frequency first reaches 50 → 81–100

• L = 80.5 (lower boundary of class 81–100)

• fm = 28 (frequency of the median class)

• F = 26 (cumulative frequency before the median class = 0 + 5 + 7 + 14 = 26)

• h = 20 (class width)

Now plug in:

Median = 80.5 + ((50 − 26) / 28) · 20 = 80.5 + (24 / 28) · 20 = 80.5 + 17.14 ≈ 97.64 seconds


✅ Step 4: Compute Variance

f x x² f · x²

0 10 100 0

5 30.5 930.25 4651.25

7 50.5 2550.25 17851.75

14 70.5 4970.25 69583.5

28 90.5 8190.25 229327.0

21 110.5 12210.25 256415.25

13 130.5 17030.25 221393.25

9 150.5 22650.25 203852.25

3 170.5 29070.25 87210.75

Σf·x² = 1,090,285, so Variance = Σf·x² / N − (Mean)² = 10902.85 − (99.1)² = 10902.85 − 9820.81 ≈ 1082.04

✅ Final Answers:

Measure Value

Mean 99.1 sec

Median 97.64 sec

Variance 1082.04 sec²
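A quick NumPy check of the grouped-data calculations (verification sketch only):

import numpy as np

f = np.array([0, 5, 7, 14, 28, 21, 13, 9, 3])                              # class frequencies
x = np.array([10, 30.5, 50.5, 70.5, 90.5, 110.5, 130.5, 150.5, 170.5])     # class midpoints
n = f.sum()
mean = (f * x).sum() / n
variance = (f * x**2).sum() / n - mean**2
print(round(mean, 1), round(variance, 2))   # 99.1, 1082.04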

b) (i) The wind speed X in miles per hour and wave height Y in feet were
measured under various conditions on an enclosed deep water sea, with
the results shown in the table.
X 0 2 7 9 13 22
Y 0 5 10 14 22 31
Create a scatter plot and predict the type of correlation.

Given Data

Wind Speed (X) in mph 0 2 7 9 13 22

Wave Height (Y) in ft 0 5 10 14 22 31

Step 1: Plotting the Scatter Plot

Since you're preparing for a written exam, below is a simple hand-drawable version of how to sketch the
scatter plot on graph paper.

X-axis: Wind Speed (mph)

Range: 0 to 25
Y-axis: Wave Height (feet)

Range: 0 to 35

Plot points:

• (0, 0)

• (2, 5)

• (7, 10)

• (9, 14)

• (13, 22)

• (22, 31)

Hand Sketch Suggestion:

• Start with labeled X and Y axes

• Use equal spacing for both axes

• Plot each point carefully with small circles or dots

• Draw a trend line or observe the pattern

Step 2: Determine the Type of Correlation

Definition:
Correlation describes the strength and direction of a linear relationship between two variables.

Observation:

As X (wind speed) increases, Y (wave height) also increases consistently.

This suggests:

• Positive relationship between wind speed and wave height

• The relationship looks close to linear, though the increase may become more rapid for higher
values

Conclusion:

The scatter plot shows a strong positive correlation between wind speed and wave height.
As wind speed increases, wave height increases, indicating that higher wind speeds generate taller waves
in deep water conditions.
Key Exam Writing Points:

Include the following:

1. Labelled scatter plot with all points plotted

2. Mention of increasing trend

3. Definition and conclusion on positive correlation
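A short sketch of the same plot and correlation coefficient in Python (NumPy and matplotlib are assumed to be available; the exam answer itself only needs the hand-drawn plot):

import numpy as np
import matplotlib.pyplot as plt

X = np.array([0, 2, 7, 9, 13, 22])
Y = np.array([0, 5, 10, 14, 22, 31])
print(round(np.corrcoef(X, Y)[0, 1], 3))   # about 0.99: strong positive correlation

plt.scatter(X, Y)
plt.xlabel("Wind speed (mph)")
plt.ylabel("Wave height (ft)")
plt.show()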

ii) Assume that an r of -.80 describes the strong negative relationship


between years of heavy smoking (X) and life expectancy (Y). Assume,
furthermore, that the distributions of heavy smoking and life expectancy
each have the following means and sums of squares:

X̄ = 5, Ȳ = 60
SSx = 35, SSy = 70

Determine the least squares regression equation for predicting life


expectancy from years of heavy smoking. (7)
✅ Step 1: Find the Slope (b)

b = r · √(SSy / SSx) = −0.80 · √(70 / 35) = −0.80 · 1.414 ≈ −1.131

✅ Step 2: Find the Intercept (a)

a = Ȳ − b·X̄ = 60 − (−1.131)(5) = 60 + 5.655 = 65.655

✅ Step 3: Final Regression Equation

Y = 65.655 − 1.131X

✅ Interpretation:
This regression equation predicts life expectancy (Y) from years of heavy smoking (X).
• The slope is negative, indicating that as years of heavy smoking increase, life expectancy decreases.
• For each additional year of heavy smoking, life expectancy drops by approximately 1.13 years.

📝 Use in Exam:
Make sure to:
• Clearly define what each variable stands for
• Show each step of calculation (slope and intercept)
• Write the final equation in a box or underline it
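A verification sketch of the slope and intercept (rounding in the worked answer gives 65.655; computing from the unrounded slope gives about 65.657):

import math

r, mean_x, mean_y, ss_x, ss_y = -0.80, 5, 60, 35, 70
b = r * math.sqrt(ss_y / ss_x)     # slope ≈ -1.131
a = mean_y - b * mean_x            # intercept ≈ 65.66
print(round(b, 3), round(a, 3))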

13 (a) (i) Reading achievement scores are obtained for a group of fourth
graders. A score of 4.0 indicates a level of achievement appropriate for
fourth grade, a score below 4.0 indicates underachievement, and a score
above 4.0 indicates overachievement. Assume that the population
standard deviation equals 0.4. A random sample of 64 fourth graders
reveal a mean achievement score of 3.82. Construct a 95 percent
confidence interval for the unknown population mean. (Remember to
convert the standard deviation to a standard error.)
Interpret this confidence interval; that is, do you find any consistent
evidence either of overachievement or of underachievement?

✅ Question Breakdown:
We are given:
• Population standard deviation σ = 0.4
• Sample size n = 64
• Sample mean x̄ = 3.82
• Confidence level = 95%

🧮 Step-by-Step Solution

✅ Step 1: Convert the standard deviation to a standard error
SE = σ / √n = 0.4 / √64 = 0.4 / 8 = 0.05

✅ Step 2: Determine the Z-value for 95% confidence
For a 95% confidence level, the critical z-value is z = 1.96

✅ Step 3: Construct the confidence interval
CI = x̄ ± z · SE = 3.82 ± 1.96(0.05) = 3.82 ± 0.098 → [3.722, 3.918]

✅ Step 4: Interpretation of the Confidence Interval


• Confidence Interval: [3.722, 3.918]
• This means we are 95% confident that the true population mean reading achievement score lies
between 3.722 and 3.918.

🎯 Interpretation in Plain Terms:


• Since the entire confidence interval lies below 4.0, there is statistical evidence of
underachievement.
• No part of the interval reaches the “expected level” of 4.0.
• Therefore, it is reasonable to conclude that fourth graders in this group are underachieving in
reading compared to the expected level.

✅ Final Answer Summary


• Confidence Interval: [3.722, 3.918]
• Interpretation: There is statistically significant evidence of underachievement among these
fourth graders.
• Graph: Bell curve with shaded region below 4.0, CI marked.
Here is the graph showing the 95% confidence interval for the reading achievement scores:
• The blue curve is the sampling distribution of the sample mean.
• The red shaded area shows the confidence interval from 3.722 to 3.918.
• The dashed black line marks the sample mean (3.82).
• The dotted green line at 4.0 shows the expected standard achievement score.
Since the entire shaded region lies below 4.0, it visually confirms underachievement.
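A quick sketch to reproduce the interval numerically (verification only):

import math

sigma, n, x_bar, z = 0.4, 64, 3.82, 1.96
se = sigma / math.sqrt(n)                 # standard error = 0.05
ci = (x_bar - z * se, x_bar + z * se)
print(ci)                                 # approximately (3.722, 3.918)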

ii)Illustrate in detail about estimation method and confidence interval.

1. Introduction

Estimation in statistics is the process of inferring the value of a population parameter using data from a
sample. Since we rarely have access to entire populations, estimation helps us make educated guesses
about characteristics like mean, proportion, or standard deviation.

There are two major types of estimation methods:

• Point Estimation

• Interval Estimation
2. Estimation Methods

A. Point Estimation

• Provides a single value (such as the sample mean or sample proportion) as the best guess of the population parameter.

Example:
If the sample mean height of students is 167.2 cm, it is a point estimate of the population mean.

B. Interval Estimation

• Provides a range (interval) of values believed to contain the population parameter with a certain
level of confidence.

• The most common form of interval estimation is the confidence interval.

3. Confidence Interval

A Confidence Interval (CI) is a range of values, derived from the sample data, that is likely to contain the
true population parameter with a specified probability (confidence level).
Confidence Levels:

Confidence Level Z-value

90% 1.645

95% 1.96

99% 2.576

4. Example of Confidence Interval

Suppose:

• Sample mean (x̄) = 50

• Population standard deviation (σ) = 10

• Sample size (n) = 25

• Confidence level = 95% (z = 1.96)

Then:

CI = x̄ ± z · (σ / √n) = 50 ± 1.96 · (10 / √25) = 50 ± 3.92 → [46.08, 53.92]

Interpretation:
We are 95% confident that the true population mean lies between 46.08 and 53.92.

5. Importance of Confidence Interval

• Gives a range of likely values rather than a single estimate.

• Reflects the uncertainty in estimation.

• Widely used in hypothesis testing, market surveys, clinical trials, etc.

6. Graphical Illustration (Visual for Exams)

Sketch a bell curve:

• X-axis: Sample data

• Mark center with sample mean

• Mark both ends of the CI (e.g., x̄ ± 1.96 · SE)

• Shade the region between CI bounds

• Label the confidence level (e.g., 95%)

7. Conclusion

• Point Estimation gives a single value as an estimate of the parameter.

• Confidence Interval gives a range where the true parameter likely lies, with a confidence level.

• Confidence intervals are more informative as they quantify the uncertainty of estimation.

(b) (i) For the population at large, the Wechsler Adult Intelligence Scale is designed to yield a normal
distribution of test scores with a mean of 100 and a standard deviation of 15. School district
officials wonder whether, on the average, an IQ score different from 100 describes the intellectual
aptitudes of all students in their district. Wechsler IQ scores are obtained for a random sample of
25 of their students, and the mean IQ is found to equal 105. Using the step-by-step procedure,
test the null hypothesis at the .05 level of significance.

ii) Imagine a simple population consisting of only five observations: 2, 4, 6, 8, 10. List all
possible samples of size two. Construct a relative frequency table showing the sampling
distribution of the mean.

Here’s a step-by-step solution to the hypothesis testing problem for Part B, 10 marks in the subject
Fundamentals of Data Science.
🔍 Question Overview:

The question asks whether the mean IQ in the district is different from 100, so this is a two-tailed test.

✅ Step-by-Step Hypothesis Testing Procedure:

Step 1: State the Hypotheses

• Null Hypothesis (H₀): μ = 100

• Alternative Hypothesis (H₁): μ ≠ 100

Step 2: Choose the Significance Level

• Given: α = 0.05 (two-tailed test)

• Corresponding critical z-values: z_critical = ±1.96

Step 3: Compute the Test Statistic

z = (x̄ − μ) / (σ / √n) = (105 − 100) / (15 / √25) = 5 / 3 ≈ 1.67
Step 4: Compare z with Critical Value

• Computed z = 1.67

• Critical values = ±1.96

Since 1.67 < 1.96, the test statistic does not fall in the rejection region.

Step 5: Conclusion

We fail to reject the null hypothesis.

Interpretation:
At the 0.05 level of significance, there is not enough evidence to conclude that the average IQ of students
in the district is significantly different from the general population mean of 100.

✅ Final Answer Summary:

• Test Used: Z-test for population mean

• Z-statistic: 1.67

• Decision: Fail to reject H₀

• Conclusion: No statistically significant difference in IQ levels of district students at the 0.05 level.
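A short sketch of the same z-test with SciPy (assuming SciPy is available; the p-value is extra information, not required by the question):

import math
from scipy import stats

mu0, sigma, n, x_bar = 100, 15, 25, 105
z = (x_bar - mu0) / (sigma / math.sqrt(n))      # 5 / 3 ≈ 1.67
p_value = 2 * (1 - stats.norm.cdf(abs(z)))      # two-tailed p ≈ 0.096
print(round(z, 2), round(p_value, 3))           # 1.67 lies inside ±1.96, so fail to reject H0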

13. (a) Imagine that one of the following 95 percent confidence intervals estimates the effect of vitamin
C on IQ scores:
95% Confidence Interval Lower Limit Upper Limit
1 100 102
2 95 99
3 102 106
4 90 111
6 91 98

(i) Which one most strongly supports the conclusion that vitamin C increases IQ scores?(4)
(ii) . Which one implies the largest sample size? (3)
(iii) Which one most strongly supports the conclusion that vitamin C decreases IQ scores?(3)

(iv) Which one would most likely stimulate the investigator to conduct an additional
experiment using larger sample sizes? (3)

13 (a): Interpretation of Confidence Intervals

🔢 Given: 95% Confidence Intervals for the effect of Vitamin C on IQ scores

Interval No. Lower Limit Upper Limit

1 100 102

2 95 99

3 102 106

4 90 111

6 91 98

✅ (i) Which one most strongly supports the conclusion that vitamin C increases IQ scores?
(4 marks)

• Answer: Interval 3: [102, 106]

• Reason: This interval lies entirely above 100, the average IQ score in the population. It strongly
suggests that IQ scores increased due to vitamin C.

✅ (ii) Which one implies the largest sample size? (3 marks)


• Answer: Interval 1: [100, 102]

• Reason: The narrowest confidence interval typically comes from a larger sample size, which
reduces the margin of error. Interval 1 is the narrowest (just 2 units wide), implying a high level of
precision.

✅ (iii) Which one most strongly supports the conclusion that vitamin C decreases IQ scores?
(3 marks)

• Answer: Interval 6: [91, 98]

• Reason: This interval lies entirely below 100, indicating a consistent drop in IQ scores, and thus
supports the idea that vitamin C may reduce IQ.

✅ (iv) Which one would most likely stimulate the investigator to conduct another
experiment with larger sample sizes? (3 marks)

• Answer: Interval 4: [90, 111]

• Reason: This interval is very wide, suggesting high variability and uncertainty in the results. A wide
confidence interval usually indicates the sample size was small, and more data is needed to
improve precision.

(b) (i) Exemplify in detail about the significance of a t-test, its procedure and
decision rule with example. (6)

✅ What is a t-test?

A t-test is a statistical method used to determine whether there is a significant difference between the
mean of a sample and a known population mean, or between two sample means, when the population
standard deviation is unknown.

✅ Significance of t-test

• Used when sample size is small (n < 30)

• Applicable when population standard deviation is unknown

• Useful in hypothesis testing to check the statistical significance of differences in data


• Common in medical, social sciences, and business analytics research

✅ Types of t-tests

Type Purpose

One-sample t-test Compares a sample mean to a known population mean

Two-sample t-test Compares means of two independent samples

Paired t-test Compares means of the same group at different times

✅ Procedure for t-test

Step 1: State the Hypotheses

• Null Hypothesis (H₀): No significant difference exists (e.g., μ=μ0)

• Alternative Hypothesis (H₁): Significant difference exists (e.g., μ≠μ0)

Step 2: Select Significance Level

• Common value: α = 0.05

Step 3: Calculate Test Statistic

t = (x̄ − μ₀) / (s / √n)

Step 4: Determine Degrees of Freedom

df = n − 1

Step 5: Decision Rule

Reject H₀ if |t| exceeds the critical t-value for the chosen α and df; otherwise, fail to reject H₀.

✅ Example:

• Sample mean (x̄) = 72

• Population mean (μ) = 70

✅ Conclusion

There is no statistically significant difference between the sample mean and population mean.
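In practice the whole procedure can be run with SciPy's one-sample t-test; the sample below is hypothetical, since the worked example above does not state s or n:

from scipy import stats

sample = [72, 69, 75, 71, 68, 74, 70, 73]        # hypothetical scores with mean near 72
t_stat, p_value = stats.ttest_1samp(sample, popmean=70)
print(round(t_stat, 2), round(p_value, 3))       # compare the two-tailed p-value with alpha = 0.05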
(ii) A study finds that racism in cricket event more often takes place when the
game is played in England or Australia or New Zealand (Say EAN countries).
Given that

• Racism takes place or Game is played in EAN is 9/13


• Racism takes place and Game is played in BAN is 5/7
• Game is played in EAN given that Racism takes place is 4/5
Find the probability of

• No Racism takes place

• Game is played in EAN

• Racism takes place given that Game is played in EAN (7)

✅ Given:

Let:

• R = Racism takes place

• EAN = Game is played in England, Australia, or New Zealand

• BAN = Game is played in Bangladesh, Australia, or New Zealand (but here it's a typo — we assume
it still means “EAN”)

We are to find:
• (a) P(No Racism takes place)

• (b) P(Game is played in EAN)

• (c) P(Racism | EAN)

✅ (a) P(No Racism takes place)

We are not directly given P(R)P(R), but we can use conditional probability:

✅ (b) P(Game is played in EAN)


✅ (c) P(R | EAN)

Using conditional probability:


14) a) (i) A library system lends books for periods of 21 days. This policy is
being reevaluated in view of a possible new loan period that could be
either longer or shorter than 21 days. To aid in making this decision,
book-lending records were consulted to determine the loan periods
actually used by the patrons. A random sample of eight records revealed
the following loan periods in days: 21, 15, 12, 24, 20, 21, 13, and 16. Test
the null hypothesis with t-test, using the .05 level of significance.

To solve this t-test problem for the library loan periods, we will perform a one-sample t-test to determine
whether the average loan period differs significantly from the current loan period policy of 21 days.

✅ Problem Summary

• Sample data (n = 8): 21, 15, 12, 24, 20, 21, 13, 16
• Hypothesized population mean (μ0) = 21 days

• Significance level (α) = 0.05

This is a two-tailed test.

✅ Step 1: Calculate Sample Mean (x̄)

x̄ = (21 + 15 + 12 + 24 + 20 + 21 + 13 + 16) / 8 = 142 / 8 = 17.75

✅ Step 2: Calculate Sample Standard Deviation (s)

Sum of squared deviations from 17.75 = 131.5, so s² = 131.5 / 7 ≈ 18.79 and s ≈ 4.33

✅ Step 3: Calculate the t-Statistic

t = (x̄ − μ) / (s / √n) = (17.75 − 21) / (4.33 / √8) = −3.25 / 1.53 ≈ −2.12

✅ Step 4: Determine the Critical Value

df = n − 1 = 7; for a two-tailed test at α = 0.05, t_critical = ±2.365
✅ Step 5: Make the Decision

• Calculated t = -2.12

• Critical t = ±2.365
Since |−2.12| < 2.365, we fail to reject the null hypothesis.

✅ Conclusion

At the 5% level of significance, there is not enough evidence to conclude that the average loan period
differs significantly from 21 days. Therefore, the library may keep the current policy unless further data
suggests otherwise.
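A SciPy check of this result (verification sketch):

from scipy import stats

loans = [21, 15, 12, 24, 20, 21, 13, 16]
t_stat, p_value = stats.ttest_1samp(loans, popmean=21)
print(round(t_stat, 2), round(p_value, 3))   # t ≈ -2.12, p ≈ 0.07 > 0.05, so fail to reject H0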

ii) A consumers’ group randomly samples 10 “one-pound” packages of ground


wheat sold by a supermarket. Calculate the mean and the estimated standard
error of the mean for this sample, given the following weights in ounces:
16,15, 14, 15, 14, 15, 16, 14, 14, 14 (6)

✅ Given:
A sample of 10 wheat package weights (in ounces):
16, 15, 14, 15, 14, 15, 16, 14, 14, 14
Value (xi) xi − x̄ (xi − x̄)²
16 1.3 1.69
15 0.3 0.09
14 -0.7 0.49
15 0.3 0.09
14 -0.7 0.49
15 0.3 0.09
16 1.3 1.69
14 -0.7 0.49
14 -0.7 0.49
14 -0.7 0.49
Sum of squares: Σ(xi − x̄)² = 6.10, so s² = 6.10 / 9 ≈ 0.68, s ≈ 0.82, and the estimated standard error is s / √n = 0.82 / √10 ≈ 0.26

✅ Final Answers:
• Sample Mean: 14.7 ounces
• Estimated Standard Error of the Mean: 0.26 ounces
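A verification sketch using Python's statistics module:

import math
import statistics

weights = [16, 15, 14, 15, 14, 15, 16, 14, 14, 14]
mean = statistics.mean(weights)                              # 14.7
sem = statistics.stdev(weights) / math.sqrt(len(weights))    # s / sqrt(n) ≈ 0.26
print(mean, round(sem, 2))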
b)(i) Illustrate in detail about one factor ANOVA with example. (7)

(i) One-Factor ANOVA – Detailed Illustration

Definition:

One-Factor ANOVA is a statistical technique used to test if three or more group means are significantly
different from each other. It analyzes the impact of a single independent variable (factor) on a
dependent variable.

Why use One-Factor ANOVA?

When you have:

• One categorical independent variable (e.g., treatment type, brand, location)

• One continuous dependent variable (e.g., weight, test score, yield)

• More than two groups to compare

Example Scenario:

Suppose a researcher wants to test whether three different fertilizers (A, B, and C) lead to different crop
yields.

Fertilizer A Fertilizer B Fertilizer C

20 25 30

21 27 29

19 26 31

Step-by-Step ANOVA Calculation:


Let’s denote:

• k= number of groups = 3

• n = total observations = 9

• ni = observations per group = 3

Step 1: State the Hypotheses

• H₀: all group means are equal; H₁: at least one group mean differs

Step 2: Compute Group Means and the Grand Mean

• Mean of A = 20, Mean of B = 26, Mean of C = 30; Grand mean = 228 / 9 ≈ 25.33

Step 3: Compute Sum of Squares

• SSB (between groups) = Σ nᵢ(x̄ᵢ − grand mean)² = 3(28.44 + 0.44 + 21.78) ≈ 152.0

• SSW (within groups) = Σ(xᵢⱼ − x̄ᵢ)² = 2 + 2 + 2 = 6.0

Step 4: Degrees of Freedom

• dfB = k − 1 = 2

• dfW = n − k = 6

Step 5: Mean Squares

• MSB = SSB / dfB = 152.0 / 2 = 76.0

• MSW = SSW / dfW = 6.0 / 6 = 1.0

Step 6: F-ratio

• F = MSB / MSW = 76.0, which far exceeds the critical value F(2, 6) ≈ 5.14 at α = 0.05
Interpretation:

If the F-test is significant, it means at least one fertilizer type results in a significantly different crop
yield.

Summary Table of ANOVA:

Source Sum of Squares df Mean Square F

Between (A) SSB 2 MSB F-ratio

Within (Error) SSW 6 MSW

Total SST 8

Conclusion:

One-Factor ANOVA helps test mean differences among more than two groups using variance analysis.
It’s a fundamental concept in statistical analysis and a cornerstone of inferential data science.
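The same one-factor ANOVA can be checked with SciPy (verification sketch using the fertilizer data above):

from scipy import stats

fert_a = [20, 21, 19]
fert_b = [25, 27, 26]
fert_c = [30, 29, 31]
f_stat, p_value = stats.f_oneway(fert_a, fert_b, fert_c)
print(round(f_stat, 1), p_value)   # F = 76.0 with a p-value far below 0.05, so the means differ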

ii) A random sample of 90 College students indicates whether they most desire
love, wealth power, health, fame, or family happiness. Using the .05 level of
significance and the following results, test the null hypothesis that in the
underlying Population, the various desires are equally popular using chi-
square test. (6)

Desires of College Students

Frequency Love Wealth Power Health Fame Family Hap. Total
Observed (O) 25 10 5 25 10 15 90
✅ Problem Breakdown
• Objective: Test whether all six desires are equally popular in the population.
• Significance level: α = 0.05
• Total sample size: 90 students
• Observed frequencies (O):
o Love: 25
o Wealth: 10
o Power: 5
o Health: 25
o Fame: 10
o Family Happiness: 15

🔍 Step-by-Step Chi-Square Goodness-of-Fit Test


🔹 Step 1: Null and Alternate Hypothesis
• H₀ (Null Hypothesis): All categories are equally popular (uniform distribution).
• H₁ (Alternate Hypothesis): At least one category is not equally popular.

🔹 Step 2: Expected Frequencies (E)


Since all 6 desires are equally likely under H₀:
E = 90 / 6 = 15 for each category

🔹 Step 3: Chi-Square Test Statistic

χ² = Σ (O − E)² / E

Category | O (Observed) | E (Expected) | (O − E)² / E

Love | 25 | 15 | (10)² / 15 = 6.67

Wealth | 10 | 15 | (−5)² / 15 = 1.67

Power | 5 | 15 | (−10)² / 15 = 6.67

Health | 25 | 15 | 6.67

Fame | 10 | 15 | 1.67

Family Happiness | 15 | 15 | 0.00

χ2 = 6.67 + 1.67 + 6.67 + 6.67 + 1.67 + 0 = 23.35

🔹 Step 4: Degrees of Freedom


df = k - 1 = 6 - 1 = 5

🔹 Step 5: Critical Value at α = 0.05, df = 5

The critical value is χ² = 11.07. Since the computed χ² = 23.35 > 11.07, we reject the null hypothesis.

✅ Conclusion:
At the 0.05 level of significance, there is sufficient evidence to conclude that not all desires
are equally popular among college students.
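A SciPy check of the chi-square statistic (verification sketch; 23.33 here versus 23.35 above is only rounding):

from scipy import stats

observed = [25, 10, 5, 25, 10, 15]
expected = [15] * 6
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p_value, 4))   # chi-square ≈ 23.33, p < 0.05, so reject H0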

14.(a) A manufacturer of a gas additive claims that it improves gas mileage. A random sample of 30 drivers tests this claim by determining their gas mileage for a full tank of gas that contains the additive (X₁) and for a full tank of gas that does not contain the additive (X₂). The sample mean difference, D̄, equals 2.12 miles (in favor of the additive), and the estimated standard error equals 1.50 miles.

(i) Using t, test the null hypothesis at the .05 level of significance. (5)

(ii) Specify the p - value for this result. (4)

(iii) Are there any special precautions that should be taken with the present
experimental design?
🔹 Given:

• Sample size, n=30

• Sample mean difference, D̄ = 2.12 miles

• Standard error of the mean difference, SE = 1.50 miles

• Significance level, α= 0.05

• Claim: Additive improves mileage → this is a one-tailed test

✅ (i) Hypothesis Test Using t-distribution

Step 1: Set Hypotheses

• Null Hypothesis (H₀): The additive has no effect: μD = 0

• Alternative Hypothesis (H₁): The additive improves mileage: μD>0

Step 2: Compute the Test Statistic

t = D̄ / SE = 2.12 / 1.50 ≈ 1.41

Step 3: Degrees of Freedom

df = n − 1 = 30 − 1 = 29

Step 4: Critical t-value

From the t-distribution table at df = 29, α = 0.05, one-tailed test:

t_critical = 1.699

Step 5: Decision Rule

Since:

t_calculated = 1.41 < t_critical = 1.699

❌ Do not reject H₀ — there is insufficient evidence at the 0.05 level to conclude that the
additive improves mileage.

✅ (ii) p-value for the result

• For t=1.41 and df = 29, using t-distribution tables or calculator:

p-value ≈ 0.085

🔹 Since p = 0.085 > α = 0.05, the result is not statistically significant.

✅ (iii) Experimental Design Considerations

Yes — there are important precautions for this type of design:

1. Paired Sample Validity:


Ensure each driver tests with and without the additive under similar conditions (same vehicle,
same route, similar load, etc.).
2. Order Effect:
The order in which the tests are taken may influence results (e.g., one test after a tune-up).
Randomize the order or counterbalance it.
3. External Factors:
Factors like weather, tire pressure, and traffic conditions can bias mileage. Control or record these
confounders.
4. Sample Size:
While n = 30 is decent, a larger sample may provide better statistical power to detect
smaller effects.
5. Blinding:
Ideally, drivers should be blind to the presence of the additive to prevent bias in driving behavior.

✅ Final Summary:

• (i) t = 1.41, not statistically significant at α = 0.05

• (ii) p-value ≈ 0.085


• (iii) Experimental design should control for pairing validity, external variables, and order effects
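
A minimal Python sketch (assuming SciPy) of parts (i) and (ii), computed directly from the reported summary statistics:

from scipy.stats import t

n, d_bar, se = 30, 2.12, 1.50   # sample size, mean difference, standard error
df = n - 1                      # 29

t_stat = d_bar / se             # ≈ 1.41
p_value = t.sf(t_stat, df)      # one-tailed p ≈ 0.085
t_crit = t.ppf(0.95, df)        # ≈ 1.699 for a one-tailed test at alpha = 0.05

print(f"t = {t_stat:.2f}, critical t = {t_crit:.3f}, p = {p_value:.3f}")
# t < critical t and p > 0.05, so H0 is not rejected.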

b)(i) A library system lends books for periods of 21 days. This policy is
being reevaluated in view of a possible new loan period that could be
either longer or shorter than 21 days. To aid in making this decision,
book-lending records were consulted to determine the loan periods
actually used by the patrons. A random sample of eight records
revealed the following loan periods in days: 21, 15, 12, 24, 20, 21, 13,
and 16. Test the null hypothesis with t-test, using the .05 level of
significance.

🔹 Step 1: State Hypotheses


• Null Hypothesis (H₀): μ = 21
• Alternative Hypothesis (H₁): μ≠ 21

🔹 Step 2: Calculate Sample Mean and Standard Deviation


Sample Mean:
x̄ = (21 + 15 + 12 + 24 + 20 + 21 + 13 + 16) / 8 = 142 / 8 = 17.75

Sample Standard Deviation:

xᵢ     xᵢ − x̄    (xᵢ − x̄)²
21     3.25      10.56
15     −2.75     7.56
12     −5.75     33.06
24     6.25      39.06
20     2.25      5.06
21     3.25      10.56
13     −4.75     22.56
16     −1.75     3.06

🔹 Step 3: Test Statistic

Sum of squared deviations = 131.50, so s = √(131.50 / 7) ≈ 4.33 and SE = s / √8 ≈ 1.53.
t = (x̄ − μ) / SE = (17.75 − 21) / 1.53 ≈ −2.12

🔹 Step 4: Degrees of Freedom and Critical Value

df = n − 1 = 8 − 1 = 7; for a two-tailed test at α = 0.05, t_critical = ±2.365.
Since |−2.12| < 2.365, we fail to reject H₀.
✅ Conclusion:
At the 0.05 level of significance, there is insufficient evidence to conclude that the average loan period is
different from 21 days.
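
A minimal Python sketch (assuming SciPy) of this one-sample, two-tailed t-test on the raw loan periods:

from scipy.stats import ttest_1samp

loan_days = [21, 15, 12, 24, 20, 21, 13, 16]
t_stat, p_value = ttest_1samp(loan_days, popmean=21)   # two-tailed test of mu = 21

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# t ≈ -2.12 with p ≈ 0.07 > 0.05, so H0 (mean loan period = 21 days) is not rejected.
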
ii) A random sample of 90 college students indicates whether they most desire
love, wealth, power, health, fame, or family happiness. Using the .05 level of
significance and the following results, test the null hypothesis that, in the
underlying population, the various desires are equally popular using chi-square
test.
Desires of college students
Frequency Love Wealth Power Health Fame Family Hap. Total
Observed (O) 25 10 5 25 10 15 90

✅ Problem Details:
• Total sample size (n) = 90
• Desire categories: Love, Wealth, Power, Health, Fame, Family Happiness
• Number of categories (k) = 6
• Significance level (α) = 0.05
Observed Frequencies (O):
Desire Frequency (O)
Love 25
Wealth 10
Power 5
Health 25
Fame 10
Family Happiness 15
Total 90

✅ Step-by-Step: Chi-Square Goodness-of-Fit Test


🔹 Step 1: State the Hypotheses
• Null Hypothesis (H₀): All desires are equally popular → uniform distribution.
• Alternative Hypothesis (H₁): At least one desire has a different frequency.

🔹 Step 2: Compute Expected Frequencies (E)


Under H₀ (equal popularity), each desire should have:
E=90/6=15
So, expected frequency for each = 15

🔹 Step 3: Calculate Chi-Square Test Statistic


Desire              O    E    (O − E)² / E
Love                25   15   (10)² / 15 = 6.67
Wealth              10   15   (−5)² / 15 = 1.67
Power               5    15   (−10)² / 15 = 6.67
Health              25   15   (10)² / 15 = 6.67
Fame                10   15   (−5)² / 15 = 1.67
Family Happiness    15   15   (0)² / 15 = 0.00
Chi-square statistic:
χ2=6.67+1.67+6.67+6.67+1.67+0=23.35

🔹 Step 4: Degrees of Freedom


df=k−1=6−1=5

🔹 Step 5: Find the Critical Value


From the chi-square distribution table at:
• df = 5
• α= 0.05
χ²_critical = 11.07

✅ Since the calculated χ² = 23.35 > 11.07, we reject the null hypothesis.

✅ Final Conclusion:
At the 0.05 level of significance, there is sufficient evidence to conclude that not all desires are equally
popular among college students.

15) a) i) Describe in detail about logistic regression model in predictive analysis.


1. Introduction

Logistic Regression is a classification algorithm used when the dependent variable is categorical,
typically binary (0 or 1, yes or no, success or failure).
Unlike linear regression, logistic regression predicts the probability of class membership rather than a
continuous outcome. It is widely used in predictive analysis to model outcomes such as disease
presence, customer churn, fraud detection, etc.

2. Purpose in Predictive Analysis

• To estimate the probability that a given input belongs to a particular category.

• To classify observations based on one or more independent (predictor) variables.

• To identify influential factors and quantify their effect on the binary outcome.

3. Mathematical Model

The logistic regression model uses the logit (log-odds) function:

logit(p) = ln(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

so that the predicted probability is obtained from the sigmoid function:

p = 1 / (1 + e^−(β₀ + β₁x₁ + … + βₖxₖ))

4. Key Concepts

• Odds and log-odds: the ratio p / (1 − p) and its natural logarithm, which the model treats as linear in the predictors.
• Sigmoid (logistic) function: maps any real-valued score to a probability between 0 and 1.
• Decision threshold: a cutoff (commonly 0.5) that converts the predicted probability into a class label.
• Coefficients: interpreted through odds ratios (e^β), showing how the odds change per unit increase in a predictor.

5. Assumptions of Logistic Regression

1. Binary output (for basic logistic regression)

2. Independent observations

3. No multicollinearity among independent variables

4. Linearity of logit (not the raw outcome) with predictors

6. Example

Use Case: Predict whether a student will pass (1) or fail (0) an exam based on study hours.

Student Study Hours (X) Result (Y)


A 2 0
B 4 1
C 6 1
Model: fitting by maximum likelihood gives P(pass) = 1 / (1 + e^−(β₀ + β₁ · StudyHours)), where the coefficients β₀ and β₁ are estimated from the data.
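
As a minimal illustration (assuming scikit-learn is available), the pass/fail example can be fitted as follows; the three rows are far too few for a real model and only demonstrate the API:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[2], [4], [6]])   # study hours
y = np.array([0, 1, 1])         # fail / pass labels from the table

model = LogisticRegression()
model.fit(X, y)

# Predicted probability of passing for a student who studies 5 hours
prob_pass = model.predict_proba([[5]])[0, 1]
print(f"P(pass | 5 study hours) = {prob_pass:.2f}")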

7. Evaluation Metrics

• Accuracy: Proportion of correct predictions

• Precision & Recall: Especially for imbalanced data

• ROC Curve & AUC: Overall model performance

• Confusion Matrix: True Positive/Negative, False Positive/Negative

8. Applications in Predictive Analysis

• Healthcare: Predict disease presence

• Finance: Predict loan default or credit card fraud


• Marketing: Predict if a customer will buy or churn

• E-commerce: Click-through rate prediction

Conclusion

Logistic Regression is a foundational classification model in predictive analysis, offering interpretable,


probabilistic predictions for binary outcomes. Its simplicity, mathematical elegance, and wide
applicability make it a first choice for many real-world problems in data science.

ii) Exemplify in detail about multiple regression model with example.

1. Introduction

Multiple Linear Regression (MLR) is a supervised learning algorithm used to predict the value of a
continuous dependent variable based on two or more independent (predictor) variables.

It extends simple linear regression, which uses one predictor, by considering multiple factors that jointly
affect the outcome.

2. Purpose in Predictive Analysis

• To model relationships between multiple features and a continuous target.

• To forecast outcomes using historical data.

• To understand the influence of each variable on the target.

3. Mathematical Model

Y = β₀ + β₁X₁ + β₂X₂ + … + βₖXₖ + ε

Where Y is the dependent variable, X₁ … Xₖ are the predictors, β₀ is the intercept, β₁ … βₖ are the regression coefficients, and ε is the random error term.

4. Assumptions of MLR

1. Linearity: The relationship between dependent and independent variables is linear.

2. Independence: Observations are independent.

3. Homoscedasticity: Constant variance of residuals.

4. Normality: Residuals are normally distributed.

5. No multicollinearity: Independent variables are not highly correlated.

5. Example Scenario

Problem:

A university wants to predict a student’s final score (Y) based on the following factors:

• X1: Hours studied

• X2: Attendance percentage

Sample Data:

Student Hours Studied (X₁) Attendance (%) (X₂) Final Score (Y)
A 10 90 85
B 8 80 78
C 12 95 88
D 6 70 72
Fitted Model (using regression):

Y= 5+ 3.2X1 + 0.4X2

Prediction Example:

For a student who studied 9 hours and had 85% attendance:

Y=5+3.2(9)+0.4(85)=5+28.8+34=67.8

Predicted Final Score = 67.8
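
A minimal Python sketch (assuming scikit-learn) of fitting this example; note that the least-squares coefficients estimated from these four illustrative rows need not match the simplified equation above:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[10, 90], [8, 80], [12, 95], [6, 70]])   # [hours studied, attendance %]
y = np.array([85, 78, 88, 72])                          # final scores

model = LinearRegression()
model.fit(X, y)

print("Intercept:", round(float(model.intercept_), 2))
print("Coefficients (hours, attendance):", np.round(model.coef_, 2))
print("Prediction for 9 hours, 85% attendance:", round(float(model.predict([[9, 85]])[0]), 1))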

6. Model Evaluation Metrics

• R² (Coefficient of Determination): Proportion of variance in YY explained by predictors.

• Adjusted R²: Adjusts R² for number of predictors; better for comparing models.

• RMSE (Root Mean Squared Error): Average prediction error.

• F-test: Tests if at least one predictor variable is statistically significant.


7. Applications of MLR in Predictive Analysis

Domain Use Case


Economics Predict GDP based on inflation, employment
Marketing Predict sales based on price, advertising
Health Predict blood pressure based on BMI, age
Education Predict student performance from study habits

Conclusion

Multiple Linear Regression is a fundamental predictive modeling tool that captures the combined effect
of multiple independent variables on a single continuous outcome. It's widely used in business, science,
healthcare, and social sciences to make informed decisions based on data.

b) Explain in depth about Time series analysis and its techniques with
relevant examples.

1. What is Time Series Analysis?

Time Series Analysis is a statistical technique used to analyze data points collected or recorded at
successive points in time, usually at equal intervals (daily, monthly, yearly, etc.).

The goal is to identify patterns (trends, seasonality, cycles) and make predictions about future values
based on previously observed values.

2. Characteristics of Time Series Data

Property Description
Trend Long-term increase/decrease in the data
Seasonality Regular patterns at specific intervals (e.g., monthly, yearly)
Cyclic Repeating patterns but not fixed like seasonality
Irregular Random or unpredictable variations

3. Components of a Time Series

Time Series (Y)=T+S+C+I

Where:

• T: Trend component

• S: Seasonal component

• C: Cyclical component

• I: Irregular (random noise)


4. Importance in Predictive Analysis

• Forecasting future sales, temperature, stock prices, etc.

• Detecting anomalies or sudden shifts (e.g., fraud detection).

• Supporting business decisions and resource planning.

5. Time Series Analysis Techniques

Technique Description & Use


Moving Average (MA) Smoothens data to identify trend by averaging over fixed window.
Exponential Smoothing Gives more weight to recent observations; useful for short-term forecasting.
ARIMA Combines AutoRegression (AR), Integration (I), and Moving Average (MA) to
model time series data.
Seasonal ARIMA Extension of ARIMA that handles seasonality explicitly.
(SARIMA)
Decomposition Separates time series into trend, seasonal, and residual components.
Prophet (Facebook) Open-source forecasting tool that handles missing data and trend changes
automatically.
LSTM (Deep Learning) Recurrent Neural Network model that learns long-term dependencies in
sequential data.
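
A minimal Python sketch (assuming pandas) of two smoothing techniques from the table above, applied to a small hypothetical monthly sales series:

import pandas as pd

# Hypothetical monthly sales figures
sales = pd.Series(
    [120, 135, 150, 145, 160, 175, 170, 185, 200, 195, 210, 225],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

moving_avg = sales.rolling(window=3).mean()   # 3-month moving average
exp_smooth = sales.ewm(alpha=0.3).mean()      # simple exponential smoothing

print(pd.DataFrame({"sales": sales, "MA(3)": moving_avg, "EWM": exp_smooth}).round(1))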

6. Examples of Time Series Applications

Example 1: Sales Forecasting

• Data: Monthly sales of a retail store over 3 years.

• Goal: Predict sales for next 6 months.

• Technique: SARIMA or Exponential Smoothing

Example 2: Weather Prediction

• Data: Daily temperature data

• Goal: Forecast temperature for the next week

• Technique: ARIMA or LSTM (if data is nonlinear)

Example 3: Stock Market

• Data: Hourly stock prices

• Goal: Predict future stock prices

• Technique: LSTM (for nonlinear, high-frequency data)


7. Steps in Time Series Analysis

1. Collect Data: Ensure regular time intervals.

2. Visualize: Plot to detect patterns or anomalies.

3. Decompose: Separate components (trend, seasonality).

4. Stationarize: Remove trend/seasonality for modeling.

5. Model: Use ARIMA, MA, LSTM, etc.

6. Evaluate: RMSE, MAPE, AIC, BIC, etc.

7. Forecast: Predict future values and validate.

8. Evaluation Metrics for Time Series Models

Metric Description
MAE Mean Absolute Error
RMSE Root Mean Squared Error
MAPE Mean Absolute Percentage Error
AIC/BIC Model selection criteria for ARIMA-type models

Conclusion

Time Series Analysis is a vital tool in data science for understanding and forecasting temporal patterns.
By applying techniques like ARIMA, exponential smoothing, and LSTM, analysts can build models that
capture both short-term fluctuations and long-term trends, making it essential for industries like
finance, meteorology, retail, and healthcare.
15) (a) Illustrate in depth about time series forecasting, its components, moving
averages and its various methods with examples.

1. What is Time Series Forecasting?

Time Series Forecasting is the process of using historical time-stamped data to make predictions about
future values. This is a key technique in data science for analyzing trends, seasonality, and temporal
patterns in data over consistent intervals (e.g., hourly, daily, monthly).

Applications:

• Stock price prediction

• Weather forecasting

• Sales and demand forecasting

• Traffic volume prediction

• Economic indicators (GDP, inflation)

2. Components of Time Series

A time series can be mathematically decomposed into four key components:

Component Description

Trend (T) The long-term progression of the data (increasing, decreasing, or stable).
Component Description

Seasonality (S) Repetitive short-term cycles that follow a fixed calendar-based period.

Cyclic (C) Long-term oscillations due to economic, environmental, or political cycles.

Irregular (I) Residual/random noise that cannot be explained by trend or seasonality.

3. Moving Averages in Time Series

Definition:

Moving Average (MA) is a technique that smooths out short-term fluctuations by taking the average of
data points over a fixed number of time periods.
4. Time Series Forecasting Methods

A. Statistical Methods

Method Description
Naive Forecasting Assumes future value = last observed value. Useful as a
baseline.
Moving Average Smooths data and forecasts based on average of past n
periods.
Exponential Smoothing (SES) Uses weighted average with exponentially decreasing
weights.
Holt’s Linear Trend Model Captures linear trends.
Holt-Winters Method Captures both trend and seasonality.
ARIMA (Auto-Regressive Integrated Captures autocorrelation, trends, and lags. Best for non-
Moving Average) seasonal stationary data.
SARIMA (Seasonal ARIMA) Extension of ARIMA that handles seasonality.

Example (ARIMA):

ARIMA(p,d,q) where:

• p = lag order

• d = differencing required to make data stationary

• q = size of moving average window
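
A minimal Python sketch (assuming statsmodels) of fitting an ARIMA model; the series and the order (1, 1, 1) are hypothetical choices made only for illustration:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly observations
series = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),
)

model = ARIMA(series, order=(1, 1, 1))    # p = 1, d = 1, q = 1
fitted = model.fit()
print(fitted.forecast(steps=3).round(1))  # forecast the next 3 months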

B. Machine Learning Methods

Method Use
Random Forest Regressor Used for multivariate forecasting.
Support Vector Regression Works well with non-linear data.
(SVR)
LSTM (Long Short-Term Deep learning model for sequence prediction. Suitable for complex
Memory) temporal dependencies.

5. Steps in Time Series Forecasting

1. Data Collection: Time-indexed data at regular intervals.

2. Visualization: Line plots to identify patterns.


3. Stationarity Check: Using ADF test; if needed, apply differencing.

4. Decomposition: Separate trend, seasonality, and noise.

5. Model Building: Use ARIMA, Holt-Winters, Prophet, or LSTM.

6. Model Evaluation: Using MAE, RMSE, MAPE.

7. Forecast: Predict future values and validate with actual outcomes.

Conclusion

Time Series Forecasting is a crucial part of data science that involves identifying patterns in temporal
data and predicting future outcomes. From classical methods like ARIMA and Holt-Winters to modern
deep learning models like LSTM, each technique serves unique needs based on the data structure and
forecasting goals. Mastery of these concepts enables accurate, actionable predictions in real-world
scenarios.

(b) (i) Compare and contrast between multiple regression and logistic
regression techniques with examples.
Here is a clear and in-depth comparison of Multiple Regression and Logistic Regression, including
definitions, differences, and examples — suitable for your Fundamentals of Data Science and Analysis
exam.

(b)(i) Multiple Regression vs Logistic Regression – Comparison Table

Feature                  Multiple Regression                                    Logistic Regression
Purpose                  Predicts a continuous numeric value                    Predicts a categorical outcome (typically binary or multiclass)
Dependent Variable (Y)   Quantitative (e.g., income, weight, score)             Categorical (e.g., yes/no, 0/1, pass/fail)
Nature of Output         Real numbers (e.g., 45.3, 99.1)                        Probabilities (0–1), converted into class labels
Equation Form            Y = β₀ + β₁X₁ + … + βₖXₖ + ε                           ln(p / (1 − p)) = β₀ + β₁X₁ + … + βₖXₖ
Technique                Least Squares Minimization                             Maximum Likelihood Estimation
Evaluation Metrics       R², RMSE, MAE                                          Accuracy, Precision, Recall, AUC-ROC
Assumptions              Linearity, homoscedasticity, normality of residuals    Independence, logit linearity, large sample size
Model Output             Predicts actual value                                  Predicts probability & class label
Example Use Case         Predicting house prices from features                  Predicting whether a student will pass or fail

Explanation with Examples

1. Multiple Linear Regression – Example

Problem: Predict the price of a house based on features.

Features (X):

• Size of the house (sqft)

• Number of bedrooms

• Age of house

2. Logistic Regression – Example

Problem: Predict if a customer will buy a product (Yes/No)

Features (X):

• Age of customer

• Salary

• Previous purchases
Summary Points

• Multiple Regression is used when target is continuous.

• Logistic Regression is used when target is categorical (binary/multiclass).

• Logistic regression applies a logit function (sigmoid) to map outputs between 0 and 1.

• Both models can be extended:

o Multiple Logistic Regression for binary/multiclass

o Polynomial Regression for non-linear trends

ii) A company manufactures an electronic device to be used in a very wide


temperature range. The company knows that increased temperature shortens
the life time of the device, and a study is therefore performed in which the life
time is determined as a function of temperature. The following data is found:
Temperature in Celsius (t) 10 20 30 40 50 60 70 80 90
Life time in hours(y) 420 365 285 220 176 117 69 34 5
Find the linear regression equation. Also find the estimated life time when temperature
is 55.

Problem Statement

We’re given the following dataset:

Temperature t (°C) 10 20 30 40 50 60 70 80 90

Life Time y (hrs) 420 365 285 220 176 117 69 34 5

We aim to:
1. Find the linear regression equation: y=a + bt

2. Use this equation to estimate life time at temperature = 55°C

Step 1: Compute Required Sums

Let’s calculate the necessary summations:

t y t2 y2 t.y

10 420 100 176400 4200

20 365 400 133225 7300

30 285 900 81225 8550

40 220 1600 48400 8800

50 176 2500 30976 8800

60 117 3600 13689 7020

70 69 4900 4761 4830

80 34 6400 1156 2720

90 5 8100 25 450

Σ 450 1691 28500 498857 52670

• n=9

• ∑t = 450

• ∑y= 1691

• ∑t2= 28500

• ∑ty= 52670

Step 2: Calculate Regression Coefficients

b = (nΣty − Σt Σy) / (nΣt² − (Σt)²) = (9(52670) − 450(1691)) / (9(28500) − (450)²) = −286920 / 54000 ≈ −5.313

a = (Σy − bΣt) / n = (1691 + 5.313(450)) / 9 ≈ 453.54

Regression equation: y = 453.54 − 5.313t

Step 3: Estimate Life Time at t = 55 °C

y = 453.54 − 5.313(55) = 453.54 − 292.22 ≈ 161.33 hours

Final Answer Summary:

Item Result

Regression Equation y=453.54−5.313t

Estimated life at 55°C 161.33 hours
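
A minimal Python sketch (assuming NumPy) that reproduces this regression:

import numpy as np

temp = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])
life = np.array([420, 365, 285, 220, 176, 117, 69, 34, 5])

slope, intercept = np.polyfit(temp, life, 1)   # least-squares line: life = intercept + slope * temp
print(f"y = {intercept:.2f} + ({slope:.3f})t")
print(f"Estimated life time at 55 °C: {intercept + slope * 55:.1f} hours")
# Matches the hand calculation above (≈ 453.5 − 5.313t, about 161 hours at 55 °C) up to rounding.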


Part C
16.(a) An investigator polls common cold sufferers, asking them to estimate the
number of hours of physical discomfort caused by their most recent colds. Assume
that their estimates approximate a normal curve with a mean of 83 hours and a
standard deviation of 20 hours.

(i) What is the estimated number of hours for the shortest-suffering
5 percent? (3)

(ii) What proportion of sufferers estimate that their colds lasted longer than
48 hours? (2)

(iii) What proportion suffered for fewer than 61 hours? (2)

(iv) What is the estimated number of hours suffered by the extreme


1 percent either above or below the mean? (2)

(v) What proportion suffered for between 1 and 3 days, that is, between
24 and 75 hours? (3)

(vi) What proportion suffered for between 2 and 4 days? (3)


Here’s a detailed solution to Q16(a) using standard normal distribution (Z-distribution) concepts.

Given Data:

• Mean μ = 83 hours

• Standard Deviation σ= 20 hours

• Assumption: Normally distributed


Using z = (X − μ) / σ:

(i) z = −1.645 → X = 83 − 1.645(20) ≈ 50.1 hours
(ii) z = (48 − 83) / 20 = −1.75 → P(X > 48) = 0.9599
(iii) z = (61 − 83) / 20 = −1.10 → P(X < 61) = 0.1357
(iv) z = ±2.575 → X = 83 ± 2.575(20) ≈ [31.5, 134.5] hours
(v) Between 24 and 75 hours: z = −2.95 and z = −0.40 → 0.3446 − 0.0016 = 0.3430
(vi) Between 48 and 96 hours: z = −1.75 and z = 0.65 → 0.7422 − 0.0401 = 0.7021

Final Boxed Answers Summary:

Sub-question Answer

(i) 50.1 hours

(ii) 95.99%

(iii) 13.57%

(iv) [31.5, 134.5] hours

(v) 34.30%

(vi) 70.21%
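
A minimal Python sketch (assuming SciPy) that reproduces these answers:

from scipy.stats import norm

mu, sigma = 83, 20   # mean and standard deviation of discomfort hours

print(f"(i)   shortest-suffering 5%: {norm.ppf(0.05, mu, sigma):.1f} hours")             # ≈ 50.1
print(f"(ii)  P(X > 48): {norm.sf(48, mu, sigma):.4f}")                                  # ≈ 0.9599
print(f"(iii) P(X < 61): {norm.cdf(61, mu, sigma):.4f}")                                 # ≈ 0.1357
print(f"(iv)  extreme 1% cutoffs: {norm.ppf([0.005, 0.995], mu, sigma).round(1)}")       # ≈ [31.5 134.5]
print(f"(v)   P(24 < X < 75): {norm.cdf(75, mu, sigma) - norm.cdf(24, mu, sigma):.4f}")  # ≈ 0.3430
print(f"(vi)  P(48 < X < 96): {norm.cdf(96, mu, sigma) - norm.cdf(48, mu, sigma):.4f}")  # ≈ 0.7021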

b) Admission to a state university depends partially on the applicant’s high school


GPA. Assume that the applicants’ GPAs approximate a normal curve with a
mean of 3.20 and a standard deviation of 0.30.

(i) If applicants with GPAs of 3.50 or above are automatically admitted, what
proportion of applicants will be in this category? (4)

(ii) If applicants with GPAs of 2.50 or below are automatically denied


admission, what proportion of applicants will be in this category? (3)

(iii) A special honors program is open to all applicants with GPAs of 3.75 or
better. What proportion of applicants are eligible? (4)

(iv) If the special honors program is limited to students whose GPAs rank
in the upper 10 percent, what will Brittany’s GPA have to be for admission?
Let’s solve this problem using the standard normal distribution (Z-distribution). We’re given:

• Mean GPA μ= 3.20

• Standard deviation σ= 0.30

• GPA follows a normal distribution

Using z = (X − μ) / σ:

(i) z = (3.50 − 3.20) / 0.30 = 1.00 → P(GPA ≥ 3.50) = 1 − 0.8413 = 0.1587
(ii) z = (2.50 − 3.20) / 0.30 = −2.33 → P(GPA ≤ 2.50) = 0.0099
(iii) z = (3.75 − 3.20) / 0.30 = 1.83 → P(GPA ≥ 3.75) = 1 − 0.9664 = 0.0336
(iv) GPA cutoff for the top 10%: z = 1.28 → GPA = 3.20 + 1.28(0.30) ≈ 3.58


Final Summary

Question Answer
(i) Proportion admitted (GPA ≥ 3.50) 15.87%
(ii) Proportion denied (GPA ≤ 2.50) 0.99%
(iii) Proportion eligible for honors (GPA ≥ 3.75) 3.36%

(iv) GPA required for top 10% honors 3.58

16. (a) An investigator polls common cold sufferers, asking them to estimate the number
of hours of physical discomfort caused by their most recent colds. Assume that
their estimates approximate a normal curve with a mean of 83 hours and a
standard deviation of 20 hours.

i) What is the estimated number of hours for the shortest-suffering 5 percent?

(ii) What proportion of sufferers estimate that their colds lasted longer than
48 hours?

(iii) What proportion suffered for fewer than 61 hours?

(iv) What is the estimated number of hours suffered by the extreme 1 percent
either above or below the mean?
(v) What proportion suffered for between 1 and 3 days, that is, between 24
and 72 hours?
(vi) What proportion suffered for between 2 and 4 days?
Given:

• Mean μ = 83 hours

• Standard deviation σ = 20 hours

• Distribution: Normal

(iv) Hours for extreme 1% (0.5% on each tail)


Final Answer Summary

Sub-question Answer
(i) Shortest 5% suffering hours 50.1 hours
(ii) > 48 hours 95.99%
(iii) < 61 hours 13.57%
(iv) Extreme 1% range [31.5, 134.5] hours
(v) 1–3 days (24–72 hrs) 28.96%
(vi) 2–4 days (48–96 hrs) 70.21%

(b) An investigator wishes to determine whether alcohol consumption causes a


deterioration in the performance of automobile drivers. Before the driving test,
subjects drink a glass of orange juice, which, in the case of the treatment group,
is laced with two ounces of vodka. Performance is measured by the number of
errors made on a driving simulator. A total of 120 volunteer subjects are
randomly assigned, in equal numbers, to the two groups. For subjects in the
treatment group, the mean number of errors X̄₁ equals 26.4, and for subjects in
the control group, the mean number of errors X̄₂ equals 18.6. The estimated
standard error equals 2.4.
(i) Use t to test the null hypothesis at the .05 level of significance.
(ii) Specify the p-value for this test result.
(iii) If appropriate, construct a 95 percent confidence interval for the true
population mean difference and interpret this interval.
(iv) If the test result is statistically significant, use Cohen’s d to estimate the effect
size, given that the standard deviation equals 13.15.
Given:

• n1=n2=60 (equal groups from 120 subjects)

• Xˉ1 = 26.4 (treatment: vodka group)

• Xˉ2= 18.6 (control: no alcohol)

• Estimated standard error SE = 2.4

• Population standard deviation (for Cohen's d): s = 13.15

(i) Hypothesis Test using t-test

Null Hypothesis H0:

There is no difference in mean performance errors: μ₁ − μ₂ = 0
Alternative Hypothesis H₁: μ₁ − μ₂ ≠ 0

Degrees of Freedom (df):

df = n₁ + n₂ − 2 = 60 + 60 − 2 = 118

Test Statistic:

t = (X̄₁ − X̄₂) / SE = (26.4 − 18.6) / 2.4 = 7.8 / 2.4 = 3.25

At α = 0.05 (two-tailed) with df = 118, t_critical ≈ 1.98. Since 3.25 > 1.98, we reject H₀.


(ii) p-value

Using a t-distribution table or calculator for t = 3.25, df=118:

p-value≈0.0015

Since p < 0.05, the result is statistically significant.

(iii) 95% Confidence Interval

CI = (X̄₁ − X̄₂) ± t_critical × SE = 7.8 ± 1.98(2.4) = 7.8 ± 4.75 = [3.05, 12.55] errors

(iv) Effect Size (Cohen’s d)

d = (X̄₁ − X̄₂) / s = 7.8 / 13.15 ≈ 0.59, a medium-to-large effect.

Final Summary:

Question Result

(i) t-statistic 3.25 (Significant)

(ii) p-value 0.0015

(iii) 95% CI [3.05, 12.55] errors

(iv) Cohen's d 0.593 (medium-large effect)
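
A minimal Python sketch (assuming SciPy) that reproduces these results from the reported summary statistics:

from scipy.stats import t

x1_bar, x2_bar, se, s, df = 26.4, 18.6, 2.4, 13.15, 118   # reported summary statistics

diff = x1_bar - x2_bar                    # 7.8 errors
t_stat = diff / se                        # ≈ 3.25
p_value = 2 * t.sf(t_stat, df)            # two-tailed p ≈ 0.0015
margin = t.ppf(0.975, df) * se            # half-width of the 95% confidence interval
cohens_d = diff / s                       # ≈ 0.59

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI = [{diff - margin:.2f}, {diff + margin:.2f}], Cohen's d = {cohens_d:.3f}")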
