MEENAKSHI RAMASWAMY ENGINEERING COLLEGE,
Thathanur- 621804
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING (AIML)
PART A, B & C QUESTION BANK
SUB CODE: CS3352
SUB NAME: FOUNDATIONS OF DATA SCIENCE
PREPARED BY
R.JANCY RANI
AP/CSE,MREC
CS3352 FOUNDATIONS OF DATA SCIENCE L T P C 3 0 0 3
COURSE OBJECTIVES:
To understand the data science fundamentals and process.
To learn to describe the data for the data science process.
To learn to describe the relationship between data.
To utilize the Python libraries for Data Wrangling.
To present and interpret data using visualization libraries in Python
UNIT I INTRODUCTION 9
Data Science: Benefits and uses – facets of data - Data Science Process: Overview – Defining
research goals – Retrieving data – Data preparation - Exploratory Data analysis – build the model–
presenting findings and building applications - Data Mining - Data Warehousing – Basic Statistical
descriptions of Data
UNIT II DESCRIBING DATA 9
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing Data
with Averages - Describing Variability - Normal Distributions and Standard (z) Scores
UNIT III DESCRIBING RELATIONSHIPS 9
Correlation –Scatter plots –correlation coefficient for quantitative data –computational formula for
correlation coefficient – Regression –regression line –least squares regression line – Standard
error of estimate – interpretation of r2 –multiple regression equations –regression towards the
mean
UNIT IV PYTHON LIBRARIES FOR DATA WRANGLING 9
Basics of Numpy arrays –aggregations –computations on arrays –comparisons, masks, boolean
logic – fancy indexing – structured arrays – Data manipulation with Pandas – data indexing and
selection – operating on data – missing data – Hierarchical indexing – combining datasets –
aggregation and grouping – pivot tables
UNIT V DATA VISUALIZATION 9
Importing Matplotlib – Line plots – Scatter plots – visualizing errors – density and contour plots –
Histograms – legends – colors – subplots – text and annotation – customization – three
dimensional plotting - Geographic Data with Basemap - Visualization with Seaborn
TEXTBOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”,
ManningPublications, 2016. (Unit I)
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.(Units II
and III)
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016. (Units IV and V)
REFERENCE:
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
UNIT 1
PART A
1.Define Data Science.
Answer:
Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data. It combines statistics, computer science, and domain expertise.
2.What are the facets of data?
✅ Answer:
Volume: Amount of data generated.
Velocity: Speed at which data is generated and processed.
Variety: Different forms of data (text, images, video).
Veracity: Quality and trustworthiness of data.
Value: Usefulness of data for decision-making.
3. Define Data Mining.
✅ Answer:
Data Mining is the process of discovering hidden patterns, correlations, and trends in
large datasets using statistical, machine learning, and AI techniques.
4. What is Data Warehousing?
✅ Answer:
Data Warehousing is a system used to collect, store, and manage large volumes of
historical data from multiple sources for analysis and reporting.
5. What is Exploratory Data Analysis (EDA)?
✅ Answer:
EDA is the initial phase of data analysis where data is summarized and visualized to
understand patterns, detect anomalies, and form hypotheses.
6. Write any two stages of the Data Science Process.
✅ Answer:
1. Data Preparation: Cleaning and transforming raw data.
2. Model Building: Applying machine learning algorithms for predictions.
7.Differentiate between Data Mining and Data Warehousing (one point).
✅ Answer:
Data Mining: Extracts patterns and knowledge from data.
Data Warehousing: Stores and organizes data for analysis.
8.What is the role of defining research goals in Data Science?
✅ Answer:
It establishes the purpose of analysis, defines questions to be answered, and guides data
collection and modeling.
9.Give any two basic statistical descriptions of data.
✅ Answer:
1. Mean (Average): Sum of all values divided by count.
2. Standard Deviation: Measure of data dispersion.
10. Outline the difference between structured and unstructured data
Aspect     | Structured Data                                         | Unstructured Data
Definition | Data organized in a predefined format (rows & columns). | Data that has no predefined structure or format.
Format     | Tabular format (e.g., database tables).                 | Text, images, videos, social media posts, etc.
Storage    | Stored in relational databases (RDBMS).                 | Stored in data lakes, NoSQL databases, or files.
Processing | Easy to search, query, and analyze using SQL.           | Requires advanced tools (NLP, AI) for analysis.
Examples   | Customer records, transaction details, inventory data.  | Emails, PDFs, images, audio files, CCTV footage.
11. Identify and write down various data analytics challenges faced in conventional systems.
Handling large volumes of data efficiently
Managing data variety and inconsistency
Dealing with data quality issues like missing or inaccurate data
Limited processing speed and computational power
Integration of data from multiple sources
12. How are the missing values present in a dataset treated during the data analysis phase?
Handling missing values is an important step in data cleaning and preprocessing to
ensure accuracy in analysis and modeling.
Missing values are treated either by removing rows or columns with missing data, or by
imputing values using methods such as the mean, median, mode, forward/backward fill, or
advanced techniques like KNN or regression imputation, to ensure data quality during analysis.
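The two approaches above can be illustrated with a short Pandas sketch; the column names and values below are hypothetical, chosen only to show dropna() and fillna():

import pandas as pd
import numpy as np

# Hypothetical dataset containing missing values
df = pd.DataFrame({
    "age":    [21, np.nan, 23, 22, np.nan],
    "salary": [30000, 32000, np.nan, 31000, 29000],
})

# Option 1: remove rows that contain any missing value
dropped = df.dropna()

# Option 2: impute missing values (mean for age, median for salary)
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["salary"] = imputed["salary"].fillna(imputed["salary"].median())

# Forward fill is another simple choice for ordered data
ffilled = df.ffill()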
13. Define data science and big data
Data Science:
Data Science is the field that uses scientific methods, algorithms, and tools to extract
meaningful insights and knowledge from structured and unstructured data.
Big Data:
Big Data refers to extremely large and complex datasets characterized by Volume,
Velocity, Variety, Veracity, and Value (5Vs), which cannot be processed efficiently
using traditional data processing tools.
14. What is structured data?
Structured data is data that is organized in a fixed format, usually in rows and columns,
making it easy to store, search, and analyze using relational databases and SQL.
Examples include spreadsheets and tables in databases.
15. Discern the differences between data analysis and data science
Data analysis focuses on inspecting and interpreting data to find insights and support
decision-making, mainly using statistical and visualization techniques. Data science is a
broader field that includes data analysis but also involves building predictive models,
handling large datasets, and using machine learning to extract deeper knowledge from
data.
16. Give an approach to solve any data analytics-based project.
Understand the problem and collect relevant data.
Clean and preprocess the data, analyze it, and build models to extract insights.
17. Give an overview of common errors.
Data quality issues such as missing, duplicate, or incorrect data.
Incorrect analysis due to wrong assumptions, biased data, or improper methods.
PART B
15 marks
1. Elaborate about the steps in the Data Science process with a diagram.
The Data Science Process is a systematic approach used to extract meaningful insights
and build data-driven solutions.
Steps in Data Science Process:
1. Defining Research Goals:
o Identify the problem or objective clearly.
o Set measurable goals (e.g., predicting sales, detecting fraud).
2. Retrieving Data:
o Collect data from multiple sources (databases, APIs, sensors, web scraping,
etc.).
o Ensure the data is relevant and comprehensive.
3. Data Preparation:
o Clean and preprocess data (remove duplicates, handle missing values,
normalize data).
o Transform raw data into a usable format.
4. Exploratory Data Analysis (EDA):
o Analyze data to understand patterns and relationships.
o Use visualization tools (e.g., histograms, scatter plots) and descriptive
statistics.
5. Build the Model:
o Apply machine learning or statistical models.
o Train and test models to improve accuracy.
6. Present Findings:
o Visualize insights through dashboards, graphs, or reports.
o Communicate results clearly to stakeholders.
7. Build Applications:
o Deploy predictive models into production.
o Automate processes or integrate with applications.
Diagram of Data Science Process:
[Define Goals]
      ↓
[Retrieve Data]
      ↓
[Data Preparation]
      ↓
[Exploratory Data Analysis]
      ↓
[Build Model]
      ↓
[Present Findings]
      ↓
[Build Applications]
2. What is a Data Warehouse? Outline the architecture of a data warehouse with a neat diagram.
Definition of Data Warehouse:
A Data Warehouse is a centralized repository that stores large volumes of data
collected from multiple sources. It is designed to support business intelligence (BI),
reporting, and data analysis.
It is subject-oriented, integrated, time-variant, and non-volatile.
Used for decision-making and strategic planning.
Key Features:
1. Subject-Oriented: Organized around major subjects like sales, customers, etc.
2. Integrated: Data is collected from multiple sources and combined.
3. Time-Variant: Stores historical data for trend analysis.
4. Non-Volatile: Data is stable; only read and analyzed, not frequently updated.
Architecture of Data Warehouse:
The three-tier architecture is most common:
1. Bottom Tier (Data Sources):
Includes operational databases, ERP, CRM, flat files, and external data.
Data is extracted, cleaned, and transformed using ETL (Extract, Transform,
Load) tools.
2. Middle Tier (Data Warehouse Server):
Central data warehouse database where integrated data is stored.
Organized using schemas (Star, Snowflake).
Supports OLAP (Online Analytical Processing) for fast query performance.
3. Top Tier (Front-End Tools):
Reporting and analysis tools used by end-users.
Includes dashboards, visualization tools (Tableau, Power BI), and ad-hoc query
tools.

Diagram (Three-Tier Architecture):

Top Tier:    [ Front-End / BI Tools (Reports, Dashboards) ]
                        |
Middle Tier: [ OLAP Server / Query Engine ]
                        |
Bottom Tier: [ Data Sources ] --> [ ETL Tools ] --> [ Data Warehouse ]
3.Examine the different facets of data with the challenges in their processing.
Facets of Data:
Data in data science can be characterized by several facets (dimensions) often referred to
as the 5 V’s of Big Data:
1. Volume (Size of Data)
Description: Refers to the massive amount of data generated daily from various
sources (social media, IoT, business transactions, etc.).
Challenge:
o Storing and managing large datasets efficiently.
o Requires distributed storage systems (Hadoop, Cloud storage).
2. Velocity (Speed of Data Generation)
Description: Speed at which data is created, processed, and analyzed (real-time
streaming data).
Challenge:
o Real-time data processing (e.g., financial transactions, IoT sensors).
o Implementing stream-processing tools like Apache Kafka or Spark
Streaming.
3. Variety (Types of Data)
Description: Data exists in multiple formats – structured (databases), semi-
structured (XML, JSON), and unstructured (images, audio, video).
Challenge:
o Integrating heterogeneous data sources.
o Processing unstructured data requires NLP, image recognition, or deep
learning tools.
4. Veracity (Data Quality and Reliability)
Description: Refers to the accuracy, consistency, and trustworthiness of data.
Challenge:
o Handling noisy, incomplete, or biased data.
o Data cleaning and validation are time-consuming.
5. Value (Extracting Insights)
Description: Ability to derive meaningful business value or insights from raw
data.
Challenge:
o Identifying relevant features for analysis.
o High costs in processing irrelevant or redundant data.
Additional Facets:
Variability: Fluctuations in data flow (e.g., seasonal data spikes).
Visualization: Presenting insights in an understandable way for decision-makers.
Challenges in Processing Data:
1. Scalability: Processing huge volumes of data requires advanced tools and high-
performance systems.
2. Data Integration: Combining data from multiple sources with different formats
and standards.
3. Storage and Retrieval: Efficiently storing and retrieving historical and real-time
data.
4. Security and Privacy: Protecting sensitive data against breaches and compliance
violations.
5. Complex Analytics: Implementing advanced algorithms (ML/AI) requires
expertise and computing power.
6. Data Governance: Maintaining proper metadata, ownership, and regulatory
compliance.
7. Cost Management: Infrastructure and software for big data processing can be
expensive.
4.Explain in detail about Data Cleaning, Data Integration, Transforming Data,
and Building a Model.
1. Data Cleaning
Data cleaning is the process of detecting and correcting errors or inconsistencies in data
to improve its quality and reliability.
Steps in Data Cleaning:
Handling Missing Values:
o Fill missing values using mean, median, or mode (imputation).
o Remove rows/columns with too many missing values.
Removing Duplicates:
o Identify and remove duplicate records.
Correcting Inaccurate Data:
o Validate data using reference sources or business rules.
Handling Outliers:
o Detect extreme values using statistical methods (z-score, IQR) and remove
or cap them.
Standardizing Formats:
o Convert data into uniform formats (e.g., date formats, units of measure).
Importance:
Clean data ensures accurate analysis, reliable models, and better decision-making.
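A minimal Pandas sketch of two of these steps, duplicate removal and IQR-based outlier capping, is given below; the DataFrame and column names are assumed purely for illustration:

import pandas as pd

# Hypothetical raw data with a duplicate row and an extreme value
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "purchase":    [250.0, 300.0, 300.0, 275.0, 9000.0],
})

# Remove exact duplicate records
df = df.drop_duplicates()

# Cap outliers in 'purchase' using the IQR rule
q1, q3 = df["purchase"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["purchase"] = df["purchase"].clip(lower, upper)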
2. Data Integration
Data integration is the process of combining data from different sources into a unified
view.
Steps in Data Integration:
Data Extraction:
o Gather data from multiple sources like databases, APIs, or files.
Schema Integration:
o Resolve differences in schema (field names, data types).
Entity Resolution:
o Identify and merge duplicate entities across datasets.
Data Consolidation:
o Combine datasets into a central repository (e.g., Data Warehouse).
Importance:
Integrated data eliminates redundancy, creates a single source of truth, and provides a
complete view for analysis.
3. Data Transformation
Data transformation converts raw data into a suitable format for analysis or modeling.
Common Transformation Techniques:
Normalization/Standardization:
o Scaling data to a fixed range (e.g., 0-1) or standard normal distribution (z-
score).
Encoding Categorical Variables:
o Convert categories into numerical values (One-hot encoding, label
encoding).
Feature Engineering:
o Create new derived features from existing ones.
Aggregation:
o Summarizing data (e.g., monthly sales from daily sales).
Log Transformation:
o Reduce skewness and stabilize variance.
Importance:
Properly transformed data improves model accuracy and performance.
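A short sketch of two of these transformations, min-max normalization and one-hot encoding, using Pandas (the columns are assumed for illustration):

import pandas as pd

df = pd.DataFrame({
    "income": [20000, 35000, 50000, 80000],
    "city":   ["Chennai", "Trichy", "Chennai", "Madurai"],
})

# Min-max normalization: scale 'income' to the 0-1 range
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# One-hot encoding of the categorical 'city' column
df = pd.get_dummies(df, columns=["city"])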
4. Building a Model
This step involves using machine learning or statistical algorithms to create predictive or
analytical models.
Steps in Building a Model:
Split Data:
o Divide into training (to train model) and testing (to evaluate performance)
sets.
Select Algorithm:
o Choose based on the problem type:
Regression: Linear regression, decision trees (for continuous data).
Classification: Logistic regression, random forest, SVM (for
categorical data).
Train the Model:
o Fit the model using training data to learn patterns.
Validate and Tune:
o Use cross-validation to fine-tune hyperparameters.
Evaluate Model:
o Measure accuracy, precision, recall, F1-score, RMSE, etc.
Deploy Model:
o Integrate into applications for real-world use.
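These steps can be demonstrated with a compact scikit-learn sketch on synthetic data; it shows the split/train/evaluate flow rather than a production model, and the data itself is generated only for illustration:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic classification data: 2 features, binary target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Select and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on unseen data
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))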
5. Explain in brief about Exploratory Data Analysis (EDA).
Definition:
Exploratory Data Analysis (EDA) is the process of examining and visualizing datasets
to summarize their main characteristics, discover patterns, detect anomalies, and test
hypotheses before applying formal modeling or machine learning techniques.
Objectives of EDA:
1. Understand data structure: Identify data types, dimensions, and distributions.
2. Detect errors/anomalies: Find missing values, outliers, or inconsistencies.
3. Visualize relationships: Explore relationships between variables (e.g.,
correlation).
4. Feature selection: Identify important variables for modeling.
5. Guide modeling: Helps in choosing suitable algorithms and transformations.
Steps in EDA:
1. Data Collection and Loading:
Import the dataset from sources (CSV, databases, APIs).
Example tools: Pandas, NumPy.
2. Data Inspection:
Understand structure using commands like .head(), .info(), .describe().
Check data types (numeric, categorical).
3. Data Cleaning:
Handle missing values (imputation or removal).
Remove duplicates and correct inconsistencies.
4. Descriptive Statistics:
Compute measures such as:
o Central tendency: Mean, median, mode.
o Dispersion: Standard deviation, variance, range.
o Shape: Skewness, kurtosis.
5. Data Visualization:
Univariate analysis (single variable): Histograms, box plots.
Bivariate analysis (two variables): Scatter plots, correlation heatmaps.
Multivariate analysis: Pair plots, 3D plots.
6. Identify Patterns and Relationships:
Use correlation analysis to find linear relationships.
Detect multicollinearity among features.
7. Outlier Detection:
Use boxplots or z-scores to identify and handle outliers.
8. Feature Engineering:
Create derived features, encode categorical data, and scale numerical data.
Techniques and Tools:
Statistical methods: Mean, variance, correlation.
Visualization tools: Matplotlib, Seaborn, Plotly.
Python libraries: Pandas (EDA), NumPy (computations).
Benefits of EDA:
1. Improves data quality by identifying issues early.
2. Provides insights for better feature selection.
3. Reduces modeling errors through better understanding of data.
4. Ensures efficient modeling by selecting appropriate algorithms.
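A minimal Pandas/Matplotlib sketch of the inspection, description, and visualization steps listed above; the CSV path and the 'age' column are placeholders, not part of the syllabus example:

import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (path is a placeholder)
df = pd.read_csv("data.csv")

# Inspect structure and summary statistics
print(df.head())
print(df.info())
print(df.describe())

# Univariate view: histogram of a numeric column
df["age"].hist(bins=20)
plt.title("Distribution of age")
plt.show()

# Bivariate view: correlation matrix of the numeric columns
print(df.select_dtypes("number").corr())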
6. Identify and list down various data analytics challenges faced in conventional systems. Explain any two challenges in detail.
Data Analytics Challenges in Conventional Systems:
1. Handling Large Volume of Data
2. Managing Data Variety (Structured and Unstructured Data)
3. Data Quality Issues (Missing, Noisy, or Inaccurate Data)
4. Data Integration from Multiple Sources
5. Limited Processing Speed and Computational Power
6. Lack of Real-Time Processing Capabilities
7. Ensuring Data Security and Privacy
8. Scalability Issues in Traditional Systems
Explanation of Any Two Challenges:
1. Handling Large Volume of Data
Description:
In conventional systems, processing and storing massive volumes of data generated from sources
such as social media, IoT devices, and e-commerce is extremely difficult.
Why it’s a challenge:
o Traditional relational databases have limited storage and cannot efficiently handle Big
Data.
o Processing large datasets is slow and resource-intensive.
Solution (Modern Approach):
o Use of distributed storage (Hadoop HDFS) and parallel computing frameworks (Apache
Spark).
o Cloud storage platforms (AWS, Azure, Google Cloud) provide scalability.
2. Data Quality Issues
Description:
Data collected from multiple sources often contains errors, missing values, duplicates, and
noise. Poor data quality directly impacts the accuracy of analysis and decision-making.
Why it’s a challenge:
o Time-consuming data cleaning in conventional systems.
o Manual error correction leads to delays.
o No automation tools in traditional systems to validate and cleanse data.
Solution (Modern Approach):
o Use automated data cleaning tools (Python Pandas, OpenRefine).
o Apply data preprocessing techniques like imputation, normalization, and deduplication.
UNIT 2
PART B
15 marks
1. Demonstrate the different types of variables used in data analysis with examples:
1. Quantitative Variables (Numerical)
These represent measurable quantities.
(a) Discrete Variables:
Definition: Take countable, distinct values (no decimals).
Examples:
o Number of students in a class (e.g., 30, 31).
o Goals scored in a match (e.g., 0, 1, 2).
(b) Continuous Variables:
Definition: Can take any value within a range (including decimals).
Examples:
o Height of students (e.g., 165.5 cm).
o Temperature (e.g., 36.7°C).
2. Qualitative Variables (Categorical)
These describe qualities or characteristics.
(a) Nominal Variables:
Definition: Categories with no order or ranking.
Examples:
o Gender (Male/Female).
o Blood Group (A, B, AB, O).
(b) Ordinal Variables:
Definition: Categories with an inherent order, but differences between ranks are not measurable.
Examples:
o Education level (Primary, Secondary, College).
o Customer satisfaction (Poor, Good, Excellent).
3. Binary (Dichotomous) Variables
Definition: Special case of nominal variables with only two categories.
Examples:
o Yes/No answers.
o Pass/Fail result.
4. Dependent and Independent Variables
Independent Variable (Predictor):
o Controlled or manipulated during analysis.
o Example: Hours studied.
Dependent Variable (Response):
o Outcome affected by the independent variable.
o Example: Exam marks scored.
5. Example Table Demonstrating Variables
Variable           | Type                       | Example Value
Age                | Quantitative (Continuous)  | 21.5 years
No. of Siblings    | Quantitative (Discrete)    | 2
Gender             | Qualitative (Nominal)      | Male
Satisfaction Level | Qualitative (Ordinal)      | Good
Result             | Binary (Dichotomous)       | Pass
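The same distinctions can be expressed through Pandas data types; the values below are assumed only to mirror the table:

import pandas as pd

df = pd.DataFrame({
    "age": [21.5, 19.0, 22.3],                      # quantitative, continuous
    "siblings": [2, 0, 1],                          # quantitative, discrete
    "gender": ["Male", "Female", "Male"],           # qualitative, nominal
    "satisfaction": ["Good", "Poor", "Excellent"],  # qualitative, ordinal
    "result": [True, False, True],                  # binary (dichotomous)
})

# Mark 'satisfaction' as an ordered categorical to capture its ranking
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["Poor", "Good", "Excellent"], ordered=True)
print(df.dtypes)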
2. The following data shows the weights (in kg) of 12 students in a class:
50, 52, 48, 55, 60, 50, 58, 53, 57, 54, 52, 56
(a) Calculate the mean, median, and mode of the weights.
(b) Calculate the variance and standard deviation of the weights.
(a) Calculate the Mean:
Mean=∑X/N
Sum of weights = 50 + 52 + 48 + 55 + 60 + 50 + 58 + 53 + 57 + 54 + 52 + 56 = 645
Number of students, N=12
Mean=645/12=53.75 kg
Calculate the Median:
Sort the data in ascending order:
48, 50, 50, 52, 52, 53, 54, 55, 56, 57, 58, 60
Since N = 12 (an even number), the median is the average of the 6th and 7th values:
Median = (53 + 54) / 2 = 107 / 2 = 53.5 kg
Calculate the Mode:
The values 50 and 52 appear twice, more than any others.
Mode = 50 kg and 52 kg (Bimodal)
(b) Calculate the Variance:
Calculate the squared deviations (X − μ)²:
Weight (X) | X − μ | (X − μ)²
48 -5.75 33.06
50 -3.75 14.06
50 -3.75 14.06
52 -1.75 3.06
52 -1.75 3.06
53 -0.75 0.56
54 0.25 0.06
55 1.25 1.56
56 2.25 5.06
57 3.25 10.56
58 4.25 18.06
60 6.25 39.06
Sum of squared deviations = 142.25
Variance: σ² = Σ(X − μ)² / N = 142.25 / 12 ≈ 11.85 kg²
Calculate the Standard Deviation:
σ = √11.85 ≈ 3.44 kg
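The answers above can be verified with a short NumPy sketch using the same 12 weights (population formulas, i.e. dividing by N):

import numpy as np
from statistics import multimode

weights = np.array([50, 52, 48, 55, 60, 50, 58, 53, 57, 54, 52, 56])

print("Mean:", weights.mean())                  # 53.75
print("Median:", np.median(weights))            # 53.5
print("Mode(s):", multimode(weights.tolist()))  # [50, 52]
print("Variance:", weights.var())               # ≈ 11.85 (population variance)
print("Std deviation:", weights.std())          # ≈ 3.44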
3. Explain the normal curve and z-score.
Normal Curve (Normal Distribution)
The normal distribution is a continuous probability distribution that is symmetric and bell-
shaped.
It is one of the most important distributions in statistics because many natural phenomena
approximate this pattern (e.g., heights, test scores, measurement errors).
The curve is defined by two parameters:
o Mean (μ): The center of the curve, where the peak occurs.
o Standard deviation (σ): Measures the spread or width of the curve.
Properties of the Normal Curve:
1. Symmetry: The curve is perfectly symmetric about the mean (μ).
2. Mean = Median = Mode: All three measures of central tendency are equal and located at the
center.
3. Total Area = 1: The total area under the curve equals 1, representing the entire probability.
4. Empirical Rule:
o About 68% of data falls within ±1σ from the mean.
o About 95% falls within ±2σ.
o About 99.7% falls within ±3σ.
Z-Score
A z-score (standard score) tells how many standard deviations a particular data point (x) is from
the mean (μ).
It standardizes different data points allowing comparison across different scales or distributions.
Formula:
z = (x − μ) / σ
Where:
x = observed data value
μ = mean of the dataset
σ = standard deviation
Interpretation:
z = 0: The data point is exactly at the mean.
Positive z-score: Data point is above the mean.
Negative z-score: Data point is below the mean.
Example: If z = 2, the data point is 2 standard deviations above the mean.
Why use z-scores?
To identify outliers (usually if |z| > 3).
To find the relative position of data points in a distribution.
To calculate probabilities and percentiles using the standard normal distribution table.
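A small sketch showing how z-scores map to probabilities with SciPy (not part of the prescribed toolset, used here only to verify the empirical-rule percentages quoted earlier):

from scipy.stats import norm

# Proportion of values within ±1, ±2, and ±3 standard deviations of the mean
for k in (1, 2, 3):
    prob = norm.cdf(k) - norm.cdf(-k)
    print(f"Within ±{k} sigma: {prob:.4f}")  # ≈ 0.6827, 0.9545, 0.9973

# Percentile corresponding to z = 2 (about the 97.7th percentile)
print("Percentile for z = 2:", norm.cdf(2))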
4. What is a z-score? Outline the steps to obtain a z-score.
A z-score (also called a standard score) is a statistical measurement that describes a data point’s position
relative to the mean of a group of values, expressed in terms of standard deviations. It tells us how many
standard deviations a particular value x is above or below the mean μ.
Importance of Z-Score:
It standardizes different data points from different scales to a common scale.
Helps compare scores from different distributions.
Useful in identifying outliers and calculating probabilities using the standard normal distribution
table.
Steps to Obtain a Z-Score
Step 1: Calculate the Mean (μ)
Find the average of all data points by adding them up and dividing by the number of data points N:
μ = ΣX / N
Step 2: Calculate the Standard Deviation (σ)
Calculate the dispersion or spread of data points around the mean. Use the formula:
σ = √( Σ(X − μ)² / N )
Step 3: Select the Data Value (x)
Choose the specific data point for which you want to find the z-score.
Step 4: Calculate the Z-Score
Use the formula:
z = (x − μ) / σ
This gives the number of standard deviations the data value x is from the mean.
Example
Suppose the mean weight of students is 60 kg with a standard deviation of 5 kg. Find the z-score
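A minimal sketch of the four steps in Python; since the example above does not state the student's weight, a value of 70 kg is assumed purely for illustration:

# Step 1 and Step 2: mean and standard deviation (given in the example)
mu = 60.0     # mean weight in kg
sigma = 5.0   # standard deviation in kg

# Step 3: the data value of interest (assumed for illustration)
x = 70.0

# Step 4: compute the z-score
z = (x - mu) / sigma
print("z =", z)  # 2.0 -> the weight is 2 standard deviations above the mean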