MEENAKSHI RAMASWAMY ENGINEERING COLLEGE,
Thathanur- 621804
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AIML)
PART A, B & C QUESTION BANK
SUB CODE: CS3352
SUB NAME: FOUNDATIONS OF DATA SCIENCE
PREPARED BY
R.JANCY RANI
AP/CSE,MREC
CS3352 FOUNDATIONS OF DATA SCIENCE L T P C 3 0 0 3
COURSE OBJECTIVES:
To understand the data science fundamentals and process.
To learn to describe the data for the data science process.
To learn to describe the relationship between data.
To utilize the Python libraries for Data Wrangling.
To present and interpret data using visualization libraries in Python
UNIT I INTRODUCTION 9
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the model–
presenting findings and building applications - Data Mining - Data Warehousing – Basic Statistical
descriptions of Data
UNIT II DESCRIBING DATA 9
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing Data
with Averages - Describing Variability - Normal Distributions and Standard (z) Scores
UNIT III DESCRIBING RELATIONSHIPS 9
Correlation –Scatter plots –correlation coefficient for quantitative data –computational formula for
correlation coefficient – Regression –regression line –least squares regression line – Standard
error of estimate – interpretation of r2 –multiple regression equations –regression towards the
mean
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING 9
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean
logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and
selection – operating on data – missing data – Hierarchical indexing – combining datasets –
aggregation and grouping – pivot tables
UNIT V DATA VISUALIZATION 9
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three
dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn
TEXTBOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”,
ManningPublications, 2016. (Unit I)
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.(Units II
and III)
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Units IV and V)
REFERENCE:
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
UNIT 1
PART A
1.Define Data Science.
Answer:
Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data. It combines statistics, computer science, and domain expertise.
2.What are the facets of data?
✅ Answer:
Volume: Amount of data generated.
Velocity: Speed at which data is generated and processed.
Variety: Different forms of data (text, images, video).
Veracity: Quality and trustworthiness of data.
Value: Usefulness of data for decision-making.
3. Define Data Mining.
✅ Answer:
Data Mining is the process of discovering hidden patterns, correlations, and trends in
large datasets using statistical, machine learning, and AI techniques.
4. What is Data Warehousing?
✅ Answer:
Data Warehousing is a system used to collect, store, and manage large volumes of
historical data from multiple sources for analysis and reporting.
5. What is Exploratory Data Analysis (EDA)?
✅ Answer:
EDA is the initial phase of data analysis where data is summarized and visualized to
understand patterns, detect anomalies, and form hypotheses.
6. Write any two stages of the Data Science Process.
✅ Answer:
1. Data Preparation: Cleaning and transforming raw data.
2. Model Building: Applying machine learning algorithms for predictions.
7.Differentiate between Data Mining and Data Warehousing (one point).
✅ Answer:
Data Mining: Extracts patterns and knowledge from data.
Data Warehousing: Stores and organizes data for analysis.
8.What is the role of defining research goals in Data Science?
✅ Answer:
It establishes the purpose of analysis, defines questions to be answered, and guides data
collection and modeling.
9.Give any two basic statistical descriptions of data.
✅ Answer:
1. Mean (Average): Sum of all values divided by count.
2. Standard Deviation: Measure of data dispersion.
10. Outline the difference between structured and unstructured data
Aspect     | Structured Data                                         | Unstructured Data
Definition | Data organized in a predefined format (rows & columns). | Data that has no predefined structure or format.
Format     | Tabular format (e.g., database tables).                 | Text, images, videos, social media posts, etc.
Storage    | Stored in relational databases (RDBMS).                 | Stored in data lakes, NoSQL databases, or files.
Processing | Easy to search, query, and analyze using SQL.           | Requires advanced tools (NLP, AI) for analysis.
Examples   | Customer records, transaction details, inventory data.  | Emails, PDFs, images, audio files, CCTV footage.
11. Identify and write down various data analytics challenges faced in conventional systems.
Handling large volumes of data efficiently
Managing data variety and inconsistency
Dealing with data quality issues like missing or inaccurate data
Limited processing speed and computational power
Integration of data from multiple sources
12. How are the missing values present in a dataset treated during the data analysis phase?
Handling missing values is an important step in data cleaning and preprocessing to
ensure accuracy in analysis and modeling.
Missing values are treated either by removing rows or columns with missing data, or by
imputing values using methods such as the mean, median, mode, forward/backward fill, or
advanced techniques like KNN or regression imputation, to ensure data quality during analysis.
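The two approaches above can be illustrated with a short Pandas sketch; the column names and values below are hypothetical, chosen only to show dropna() and fillna():

import pandas as pd
import numpy as np

# Hypothetical dataset containing missing values
df = pd.DataFrame({
    "age":    [21, np.nan, 23, 22, np.nan],
    "salary": [30000, 32000, np.nan, 31000, 29000],
})

# Option 1: remove rows that contain any missing value
dropped = df.dropna()

# Option 2: impute missing values (mean for age, median for salary)
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["salary"] = imputed["salary"].fillna(imputed["salary"].median())

# Forward fill is another simple choice for ordered data
ffilled = df.ffill()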
13. Define data science and big data
Data Science:
Data Science is the field that uses scientific methods, algorithms, and tools to extract
meaningful insights and knowledge from structured and unstructured data.
Big Data:
Big Data refers to extremely large and complex datasets characterized by Volume,
Velocity, Variety, Veracity, and Value (5Vs), which cannot be processed efficiently
using traditional data processing tools.
14. What is structured data?
Structured data is data that is organized in a fixed format, usually in rows and columns,
making it easy to store, search, and analyze using relational databases and SQL.
Examples include spreadsheets and tables in databases.
15. Discern the differences between data analysis and data science
Data analysis focuses on inspecting and interpreting data to find insights and support
decision-making, mainly using statistical and visualization techniques. Data science is a
broader field that includes data analysis but also involves building predictive models,
handling large datasets, and using machine learning to extract deeper knowledge from
data.
16. Give an approach to solve any data analytics-based project.
Understand the problem and collect relevant data.
Clean and preprocess the data, analyze it, and build models to extract insights.
17. Give an overview of common errors.
Data quality issues such as missing, duplicate, or incorrect data.
Incorrect analysis due to wrong assumptions, biased data, or improper methods.
PART B
15 marks
1. Elaborate about the steps in the Data Science process with a diagram.
The Data Science Process is a systematic approach used to extract meaningful insights
and build data-driven solutions.
Steps in Data Science Process:
1. Defining Research Goals:
o Identify the problem or objective clearly.
o Set measurable goals (e.g., predicting sales, detecting fraud).
2. Retrieving Data:
o Collect data from multiple sources (databases, APIs, sensors, web scraping,
etc.).
o Ensure the data is relevant and comprehensive.
3. Data Preparation:
o Clean and preprocess data (remove duplicates, handle missing values,
normalize data).
o Transform raw data into a usable format.
4. Exploratory Data Analysis (EDA):
o Analyze data to understand patterns and relationships.
o Use visualization tools (e.g., histograms, scatter plots) and descriptive
statistics.
5. Build the Model:
o Apply machine learning or statistical models.
o Train and test models to improve accuracy.
6. Present Findings:
o Visualize insights through dashboards, graphs, or reports.
o Communicate results clearly to stakeholders.
7. Build Applications:
o Deploy predictive models into production.
o Automate processes or integrate with applications.
Diagram of Data Science Process:
[Define Goals]
      ↓
[Retrieve Data]
      ↓
[Data Preparation]
      ↓
[Exploratory Data Analysis]
      ↓
[Build Model]
      ↓
[Present Findings]
      ↓
[Build Applications]
2. What is a Data Warehouse? Outline the architecture of a data warehouse with a neat diagram.
Definition of Data Warehouse:
A Data Warehouse is a centralized repository that stores large volumes of data
collected from multiple sources. It is designed to support business intelligence (BI),
reporting, and data analysis.
It is subject-oriented, integrated, time-variant, and non-volatile.
Used for decision-making and strategic planning.
Key Features:
1. Subject-Oriented: Organized around major subjects like sales, customers, etc.
2. Integrated: Data is collected from multiple sources and combined.
3. Time-Variant: Stores historical data for trend analysis.
4. Non-Volatile: Data is stable; only read and analyzed, not frequently updated.
Architecture of Data Warehouse:
The three-tier architecture is most common:
1. Bottom Tier (Data Sources):
Includes operational databases, ERP, CRM, flat files, and external data.
Data is extracted, cleaned, and transformed using ETL (Extract, Transform,
Load) tools.
2. Middle Tier (Data Warehouse Server):
Central data warehouse database where integrated data is stored.
Organized using schemas (Star, Snowflake).
Supports OLAP (Online Analytical Processing) for fast query performance.
3. Top Tier (Front-End Tools):
Reporting and analysis tools used by end-users.
Includes dashboards, visualization tools (Tableau, Power BI), and ad-hoc query
tools.

Diagram (Three-Tier Architecture):

Top Tier:    [ Front-End / BI Tools (Reports, Dashboards) ]
                        |
Middle Tier: [ OLAP Server / Query Engine ]
                        |
Bottom Tier: [ Data Sources ] --> [ ETL Tools ] --> [ Data Warehouse ]
3.Examine the different facets of data with the challenges in their processing.
Facets of Data:
Data in data science can be characterized by several facets (dimensions) often referred to
as the 5 V’s of Big Data:
1. Volume (Size of Data)
Description: Refers to the massive amount of data generated daily from various
sources (social media, IoT, business transactions, etc.).
Challenge:
o Storing and managing large datasets efficiently.
o Requires distributed storage systems (Hadoop, Cloud storage).
2. Velocity (Speed of Data Generation)
Description: Speed at which data is created, processed, and analyzed (real-time
streaming data).
Challenge:
o Real-time data processing (e.g., financial transactions, IoT sensors).
o Implementing stream-processing tools like Apache Kafka or Spark
Streaming.
3. Variety (Types of Data)
Description: Data exists in multiple formats – structured (databases), semi-
structured (XML, JSON), and unstructured (images, audio, video).
Challenge:
o Integrating heterogeneous data sources.
o Processing unstructured data requires NLP, image recognition, or deep
learning tools.
4. Veracity (Data Quality and Reliability)
Description: Refers to the accuracy, consistency, and trustworthiness of data.
Challenge:
o Handling noisy, incomplete, or biased data.
o Data cleaning and validation are time-consuming.
5. Value (Extracting Insights)
Description: Ability to derive meaningful business value or insights from raw
data.
Challenge:
o Identifying relevant features for analysis.
o High costs in processing irrelevant or redundant data.
Additional Facets:
Variability: Fluctuations in data flow (e.g., seasonal data spikes).
Visualization: Presenting insights in an understandable way for decision-makers.
Challenges in Processing Data:
1. Scalability: Processing huge volumes of data requires advanced tools and high-
performance systems.
2. Data Integration: Combining data from multiple sources with different formats
and standards.
3. Storage and Retrieval: Efficiently storing and retrieving historical and real-time
data.
4. Security and Privacy: Protecting sensitive data against breaches and compliance
violations.
5. Complex Analytics: Implementing advanced algorithms (ML/AI) requires
expertise and computing power.
6. Data Governance: Maintaining proper metadata, ownership, and regulatory
compliance.
7. Cost Management: Infrastructure and software for big data processing can be
expensive.
4.Explain in detail about Data Cleaning, Data Integration, Transforming Data,
and Building a Model.
1. Data Cleaning
Data cleaning is the process of detecting and correcting errors or inconsistencies in data
to improve its quality and reliability.
Steps in Data Cleaning:
Handling Missing Values:
o Fill missing values using mean, median, or mode (imputation).
o Remove rows/columns with too many missing values.
Removing Duplicates:
o Identify and remove duplicate records.
Correcting Inaccurate Data:
o Validate data using reference sources or business rules.
Handling Outliers:
o Detect extreme values using statistical methods (z-score, IQR) and remove
or cap them.
Standardizing Formats:
o Convert data into uniform formats (e.g., date formats, units of measure).
Importance:
Clean data ensures accurate analysis, reliable models, and better decision-making.
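A minimal Pandas sketch of two of these steps, duplicate removal and IQR-based outlier capping, is given below; the DataFrame and column names are assumed purely for illustration:

import pandas as pd

# Hypothetical raw data with a duplicate row and an extreme value
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "purchase":    [250.0, 300.0, 300.0, 275.0, 9000.0],
})

# Remove exact duplicate records
df = df.drop_duplicates()

# Cap outliers in 'purchase' using the IQR rule
q1, q3 = df["purchase"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["purchase"] = df["purchase"].clip(lower, upper)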
2. Data Integration
Data integration is the process of combining data from different sources into a unified
view.
Steps in Data Integration:
Data Extraction:
o Gather data from multiple sources like databases, APIs, or files.
Schema Integration:
o Resolve differences in schema (field names, data types).
Entity Resolution:
o Identify and merge duplicate entities across datasets.
Data Consolidation:
o Combine datasets into a central repository (e.g., Data Warehouse).
Importance:
Integrated data eliminates redundancy, creates a single source of truth, and provides a
complete view for analysis.
3. Data Transformation
Data transformation converts raw data into a suitable format for analysis or modeling.
Common Transformation Techniques:
Normalization/Standardization:
o Scaling data to a fixed range (e.g., 0-1) or standard normal distribution (z-
score).
Encoding Categorical Variables:
o Convert categories into numerical values (One-hot encoding, label
encoding).
Feature Engineering:
o Create new derived features from existing ones.
Aggregation:
o Summarizing data (e.g., monthly sales from daily sales).
Log Transformation:
o Reduce skewness and stabilize variance.
Importance:
Properly transformed data improves model accuracy and performance.
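A short sketch of two of these transformations, min-max normalization and one-hot encoding, using Pandas (the columns are assumed for illustration):

import pandas as pd

df = pd.DataFrame({
    "income": [20000, 35000, 50000, 80000],
    "city":   ["Chennai", "Trichy", "Chennai", "Madurai"],
})

# Min-max normalization: scale 'income' to the 0-1 range
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# One-hot encoding of the categorical 'city' column
df = pd.get_dummies(df, columns=["city"])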
4. Building a Model
This step involves using machine learning or statistical algorithms to create predictive or
analytical models.
Steps in Building a Model:
Split Data:
o Divide into training (to train model) and testing (to evaluate performance)
sets.
Select Algorithm:
o Choose based on the problem type:
Regression: Linear regression, decision trees (for continuous data).
Classification: Logistic regression, random forest, SVM (for
categorical data).
Train the Model:
o Fit the model using training data to learn patterns.
Validate and Tune:
o Use cross-validation to fine-tune hyperparameters.
Evaluate Model:
o Measure accuracy, precision, recall, F1-score, RMSE, etc.
Deploy Model:
o Integrate into applications for real-world use.
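These steps can be demonstrated with a compact scikit-learn sketch on synthetic data; it shows the split/train/evaluate flow rather than a production model, and the data itself is generated only for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic classification data: 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Select and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on unseen data
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))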
5. Explain in brief about Exploratory Data Analysis (EDA).
Definition:
Exploratory Data Analysis (EDA) is the process of examining and visualizing datasets
to summarize their main characteristics, discover patterns, detect anomalies, and test
hypotheses before applying formal modeling or machine learning techniques.
Objectives of EDA:
1. Understand data structure: Identify data types, dimensions, and distributions.
2. Detect errors/anomalies: Find missing values, outliers, or inconsistencies.
3. Visualize relationships: Explore relationships between variables (e.g.,
correlation).
4. Feature selection: Identify important variables for modeling.
5. Guide modeling: Helps in choosing suitable algorithms and transformations.
Steps in EDA:
1. Data Collection and Loading:
Import the dataset from sources (CSV, databases, APIs).
Example tools: Pandas, NumPy.
2. Data Inspection:
Understand structure using commands like .head(), .info(), .describe().
Check data types (numeric, categorical).
3. Data Cleaning:
Handle missing values (imputation or removal).
Remove duplicates and correct inconsistencies.
4. Descriptive Statistics:
Compute measures such as:
o Central tendency: Mean, median, mode.
o Dispersion: Standard deviation, variance, range.
o Shape: Skewness, kurtosis.
5. Data Visualization:
Univariate analysis (single variable): Histograms, box plots.
Bivariate analysis (two variables): Scatter plots, correlation heatmaps.
Multivariate analysis: Pair plots, 3D plots.
6. Identify Patterns and Relationships:
Use correlation analysis to find linear relationships.
Detect multicollinearity among features.
7. Outlier Detection:
Use boxplots or z-scores to identify and handle outliers.
8. Feature Engineering:
Create derived features, encode categorical data, and scale numerical data.
Techniques and Tools:
Statistical methods: Mean, variance, correlation.
Visualization tools: Matplotlib, Seaborn, Plotly.
Python libraries: Pandas (EDA), NumPy (computations).
Benefits of EDA:
1. Improves data quality by identifying issues early.
2. Provides insights for better feature selection.
3. Reduces modeling errors through better understanding of data.
4. Ensures efficient modeling by selecting appropriate algorithms.
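A minimal Pandas/Matplotlib sketch of the inspection, description, and visualization steps listed above; the CSV path and the 'age' column are placeholders, not part of the syllabus example:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (path is a placeholder)
df = pd.read_csv("data.csv")

# Inspect structure and summary statistics
print(df.head())
print(df.info())
print(df.describe())

# Univariate view: histogram of a numeric column
df["age"].hist(bins=20)
plt.title("Distribution of age")
plt.show()

# Bivariate view: correlation matrix of the numeric columns
print(df.select_dtypes("number").corr())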
6. Identify and list down various data analytics challenges faced in conventional systems. Explain any two challenges in detail.
Data Analytics Challenges in Conventional Systems:
1. Handling Large Volume of Data
2. Managing Data Variety (Structured and Unstructured Data)
3. Data Quality Issues (Missing, Noisy, or Inaccurate Data)
4. Data Integration from Multiple Sources
5. Limited Processing Speed and Computational Power
6. Lack of Real-Time Processing Capabilities
7. Ensuring Data Security and Privacy
8. Scalability Issues in Traditional Systems
Explanation of Any Two Challenges:
1. Handling Large Volume of Data
Description:
In conventional systems, processing and storing massive volumes of data generated from sources
such as social media, IoT devices, and e-commerce is extremely difficult.
Why it’s a challenge:
o Traditional relational databases have limited storage and cannot efficiently handle Big
Data.
o Processing large datasets is slow and resource-intensive.
Solution (Modern Approach):
o Use of distributed storage (Hadoop HDFS) and parallel computing frameworks (Apache
Spark).
o Cloud storage platforms (AWS, Azure, Google Cloud) provide scalability.
2. Data Quality Issues
Description:
Data collected from multiple sources often contains errors, missing values, duplicates, and
noise. Poor data quality directly impacts the accuracy of analysis and decision-making.
Why it’s a challenge:
o Time-consuming data cleaning in conventional systems.
o Manual error correction leads to delays.
o No automation tools in traditional systems to validate and cleanse data.
Solution (Modern Approach):
o Use automated data cleaning tools (Python Pandas, OpenRefine).
o Apply data preprocessing techniques like imputation, normalization, and deduplication.
UNIT 2
PART B
15 marks
1. Demonstrate the different types of variables used in data analysis with examples:
1. Quantitative Variables (Numerical)
These represent measurable quantities.
(a) Discrete Variables:
Definition: Take countable, distinct values (no decimals).
Examples:
o Number of students in a class (e.g., 30, 31).
o Goals scored in a match (e.g., 0, 1, 2).
(b) Continuous Variables:
Definition: Can take any value within a range (including decimals).
Examples:
o Height of students (e.g., 165.5 cm).
o Temperature (e.g., 36.7°C).
2. Qualitative Variables (Categorical)
These describe qualities or characteristics.
(a) Nominal Variables:
Definition: Categories with no order or ranking.
Examples:
o Gender (Male/Female).
o Blood Group (A, B, AB, O).
(b) Ordinal Variables:
Definition: Categories with an inherent order, but differences between ranks are not measurable.
Examples:
o Education level (Primary, Secondary, College).
o Customer satisfaction (Poor, Good, Excellent).
3. Binary (Dichotomous) Variables
Definition: Special case of nominal variables with only two categories.
Examples:
o Yes/No answers.
o Pass/Fail result.
4. Dependent and Independent Variables
Independent Variable (Predictor):
o Controlled or manipulated during analysis.
o Example: Hours studied.
Dependent Variable (Response):
o Outcome affected by the independent variable.
o Example: Exam marks scored.
5. Example Table Demonstrating Variables
Variable           | Type                       | Example Value
Age                | Quantitative (Continuous)  | 21.5 years
No. of Siblings    | Quantitative (Discrete)    | 2
Gender             | Qualitative (Nominal)      | Male
Satisfaction Level | Qualitative (Ordinal)      | Good
Result             | Binary (Dichotomous)       | Pass
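The same distinctions can be expressed through Pandas data types; the values below are assumed only to mirror the table:

import pandas as pd

df = pd.DataFrame({
    "age": [21.5, 19.0, 22.3],                      # quantitative, continuous
    "siblings": [2, 0, 1],                          # quantitative, discrete
    "gender": ["Male", "Female", "Male"],           # qualitative, nominal
    "satisfaction": ["Good", "Poor", "Excellent"],  # qualitative, ordinal
    "result": [True, False, True],                  # binary (dichotomous)
})

# Mark 'satisfaction' as an ordered categorical to capture its ranking
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["Poor", "Good", "Excellent"], ordered=True)
print(df.dtypes)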
2. The following data shows the weights (in kg) of 12 students in a class:
50, 52, 48, 55, 60, 50, 58, 53, 57, 54, 52, 56
(a) Calculate the mean, median, and mode of the weights.
(b) Calculate the variance and standard deviation of the weights.
(a) Calculate the Mean:
Mean=∑X/N
Sum of weights = 50 + 52 + 48 + 55 + 60 + 50 + 58 + 53 + 57 + 54 + 52 + 56 = 645
Number of students, N=12
Mean=645/12=53.75 kg
Calculate the Median:
Sort the data in ascending order:
48, 50, 50, 52, 52, 53, 54, 55, 56, 57, 58, 60
Since N = 12 (an even number), the median is the average of the 6th and 7th values:
Median = (53 + 54) / 2 = 107 / 2 = 53.5 kg
Calculate the Mode:
The values 50 and 52 appear twice, more than any others.
Mode = 50 kg and 52 kg (Bimodal)
(b) Calculate the Variance:
Calculate the squared deviations (X − μ)²:
Weight (X) | X − μ | (X − μ)²
48 -5.75 33.06
50 -3.75 14.06
50 -3.75 14.06
52 -1.75 3.06
52 -1.75 3.06
53 -0.75 0.56
54 0.25 0.06
55 1.25 1.56
56 2.25 5.06
57 3.25 10.56
58 4.25 18.06
60 6.25 39.06
Sum of squared deviations = 142.25
Variance: σ² = Σ(X − μ)² / N = 142.25 / 12 ≈ 11.85 kg²
Calculate the Standard Deviation:
σ = √11.85 ≈ 3.44 kg
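The answers above can be verified with a short NumPy sketch using the same 12 weights (population formulas, i.e. dividing by N):

import numpy as np
from statistics import multimode

weights = np.array([50, 52, 48, 55, 60, 50, 58, 53, 57, 54, 52, 56])

print("Mean:", weights.mean())                  # 53.75
print("Median:", np.median(weights))            # 53.5
print("Mode(s):", multimode(weights.tolist()))  # [50, 52]
print("Variance:", weights.var())               # ≈ 11.85 (population variance)
print("Std deviation:", weights.std())          # ≈ 3.44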
3. Explain the normal curve and z-score.
Normal Curve (Normal Distribution)
The normal distribution is a continuous probability distribution that is symmetric and bell-
shaped.
It is one of the most important distributions in statistics because many natural phenomena
approximate this pattern (e.g., heights, test scores, measurement errors).
The curve is defined by two parameters:
o Mean (μ): The center of the curve, where the peak occurs.
o Standard deviation (σ): Measures the spread or width of the curve.
Properties of the Normal Curve:
1. Symmetry: The curve is perfectly symmetric about the mean (μ).
2. Mean = Median = Mode: All three measures of central tendency are equal and located at the
center.
3. Total Area = 1: The total area under the curve equals 1, representing the entire probability.
4. Empirical Rule:
o About 68% of data falls within ±1σ from the mean.
o About 95% falls within ±2σ.
o About 99.7% falls within ±3σ.
Z-Score
A z-score (standard score) tells how many standard deviations a particular data point (x) is from
the mean (μ).
It standardizes different data points allowing comparison across different scales or distributions.
Formula:
z = (x − μ) / σ
Where:
x = observed data value
μ = mean of the dataset
σ = standard deviation
Interpretation:
z = 0: The data point is exactly at the mean.
Positive z-score: Data point is above the mean.
Negative z-score: Data point is below the mean.
Example: If z = 2, the data point is 2 standard deviations above the mean.
Why use z-scores?
To identify outliers (usually if |z| > 3).
To find the relative position of data points in a distribution.
To calculate probabilities and percentiles using the standard normal distribution table.
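A small sketch showing how z-scores map to probabilities with SciPy (not part of the prescribed toolset, used here only to verify the empirical-rule percentages quoted earlier):

from scipy.stats import norm

# Proportion of values within ±1, ±2, and ±3 standard deviations of the mean
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"Within ±{k} sigma: {prob:.4f}")  # ≈ 0.6827, 0.9545, 0.9973

# Percentile corresponding to z = 2 (about the 97.7th percentile)
print("Percentile for z = 2:", norm.cdf(2))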
4. What is a z-score? Outline the steps to obtain a z-score.
A z-score (also called a standard score) is a statistical measurement that describes a data point’s position
relative to the mean of a group of values, expressed in terms of standard deviations. It tells us how many
standard deviations a particular value x is above or below the mean μ.
Importance of Z-Score:
It standardizes different data points from different scales to a common scale.
Helps compare scores from different distributions.
Useful in identifying outliers and calculating probabilities using the standard normal distribution
table.
Steps to Obtain a Z-Score
Step 1: Calculate the Mean (μ)
Find the average of all data points by adding them up and dividing by the number of data points N:
μ = ΣX / N
Step 2: Calculate the Standard Deviation (σ)
Calculate the dispersion or spread of data points around the mean. Use the formula:
σ = √( Σ(X − μ)² / N )
Step 3: Select the Data Value (x)
Choose the specific data point for which you want to find the z-score.
Step 4: Calculate the Z-Score
Use the formula:
z = (x − μ) / σ
This gives the number of standard deviations the data value x is from the mean.
Example
Suppose the mean weight of students is 60 kg with a standard deviation of 5 kg. Find the z-score
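A minimal sketch of the four steps in Python; since the example above does not state the student's weight, a value of 70 kg is assumed purely for illustration:

# Step 1 and Step 2: mean and standard deviation (given in the example)
mu = 60.0     # mean weight in kg
sigma = 5.0   # standard deviation in kg

# Step 3: the data value of interest (assumed for illustration)
x = 70.0

# Step 4: compute the z-score
z = (x - mu) / sigma
print("z =", z)  # 2.0 -> the weight is 2 standard deviations above the mean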