Data Literacy – Data
Collection to Data Analysis
Unit 5 - XI
About Data
● Data can be defined as facts or instructions about some entity (students,
school, sports, business, animals etc.)
● AI is essentially data-driven, i.e., data is the core input for training
and running AI models.
● Data must be collected, organized, and analyzed properly.
● The data you collect affects model performance and
decision-making.
AI Data Analysis
● Uses AI techniques and data science to improve the processes of
cleaning, inspecting, and modelling data.
● Used to extract valuable information for drawing meaningful
conclusions and decision-making.
Types of Data
● Structured Data – neatly organized (tables, rows, columns)
● Semi-structured Data – partial organization (JSON, XML)
● Unstructured Data – no defined format (text, audio, video)
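To make the three types concrete, here is a minimal Python sketch that holds similar information in each form. The student records and field names are made-up illustrations, not data from this unit.

```python
import csv
import io
import json

# Structured: fixed rows and columns (hypothetical student table in CSV form)
csv_text = "name,age,height_cm\nAsha,16,152\nRavi,17,149\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["name"], rows[0]["height_cm"])

# Semi-structured: keys give some organization, but records may differ (JSON)
json_text = '{"name": "Asha", "hobbies": ["chess", "football"]}'
record = json.loads(json_text)
print(record["hobbies"])

# Unstructured: free text with no predefined fields
note = "Asha enjoyed the science fair and wants to build a robot next year."
print(len(note.split()), "words")
```

Note that values read from CSV arrive as text, so numeric columns usually need converting before analysis.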
Data Literacy
Data literacy means being able to find and use data effectively.
This includes skills like collecting data, organizing it, checking its
quality, analysing it, understanding the results and using it ethically.
Data Collection
● It means gathering data from many places.
● This includes websites, devices, surveys, and even offline
sources
● Methods include scraping, capturing, and loading data into
systems
● It’s one of the most time-consuming and challenging parts
of any AI project
The Data Collection Process
1. First, understand the problem you're solving
2. Decide what data is needed to solve it
3. Find good sources of that data
4. Collect and test the data in small steps
5. Repeat this process as your model improves
What data is to be collected?
● Diverse data to avoid bias and inaccuracy in predictions
Example: A face recognition model should work for different ages, skin
tones, and angles
● Simple AI models (tasks like reading number plates) need less data
● Complex AI models (tasks like detecting diseases from X-rays) need huge
amounts of data
● The more complex your model, the more data it will need.
How much Data is enough?
● Depends on how many features or variables your model needs
● More features = more data required
● Start with small amounts and improve as needed
● But overall, more data gives better predictions
Sources of Data Collection
● Primary Data Source
● Secondary Data Source
Primary Data Source
Primary sources are sources which are created to collect the
data for analysis.
Primary Data Sources
Secondary Data Source
Secondary data sources are where the data is already
stored and ready for use.
Examples: data given in books, journals, newspapers,
websites, internal transactional databases etc.
Secondary Data Sources
EXPLORING DATA
Exploring data is about “getting to know” the data.
The main goal of data exploration is to:
● Understand the overall structure and quality of the data.
● Identify any errors, missing values, or inconsistencies.
● Detect outliers or extreme values that may affect results.
● Gain insights that can guide further data analysis or modeling.
Levels of Measurement
- NOMINAL
● Nominal scales are used for
labeling variables, without any
quantitative value.
● “Nominal” scales could simply
be called “labels.”
Levels of Measurement
- ORDINAL
● With ordinal scales, the order of the values is
what’s important and significant, but the
differences between them are not really known.
● Eg: We can’t say that the difference between “OK” and
“Unhappy” is the same as the difference between
“Very Happy” and “Happy”.
● Ordinal scales are typically measures of
non-numeric concepts like satisfaction,
happiness, discomfort, etc.
Examples of ordinal data:
● Race positions: 1st, 2nd, 3rd
● School grades: A, B, C, D
● Customer satisfaction: Happy, Okay, Sad
● Education levels: Primary, Secondary, College
● Movie ratings: ⭐ Poor, ⭐⭐ Fair, ⭐⭐⭐ Good
Levels of Measurement
- INTERVAL
● Interval scale data is similar to ordinal data because it has a definite order.
● The key difference is that differences between values can be measured in
interval data.
● No true zero point — zero does not mean “nothing” in interval scale.
Example: Temperature in °C or °F.
● 40° is 20° more than 20° (difference makes sense).
● 0° is not the absolute lowest temperature — negative values exist.
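The missing true zero is also why ratios are not meaningful on an interval scale. A small sketch of this, using the standard Celsius-to-kelvin conversion (the kelvin scale does have a true zero); the helper name `to_kelvin` is only illustrative:

```python
def to_kelvin(celsius):
    """Convert degrees Celsius to kelvin (a scale with a true zero)."""
    return celsius + 273.15

# Differences are meaningful on the Celsius scale
difference = 40 - 20

# The ratio 40 / 20 = 2 does NOT mean "twice as hot":
# measured from absolute zero, the ratio is much smaller
ratio_celsius = 40 / 20
ratio_kelvin = to_kelvin(40) / to_kelvin(20)

print("Difference:", difference)
print("Ratio in Celsius:", ratio_celsius)
print("Ratio in kelvin:", round(ratio_kelvin, 2))
```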
Examples of interval data:
● Temperature: 0°C, 10°C, 20°C
● Calendar years: 2000, 2010, 2020
● Time of day: 1:00, 2:00, 3:00 (on a 12-hour clock)
● IQ scores: 90, 100, 110
● Dates on a calendar: Jan 1, Jan 2, Jan 3
Levels of Measurement
- RATIO
● Similar to interval scale, but has a true zero point.
● Ratios can be calculated.
Example: Exam scores – 80 is four times 20.
● Allows all math operations: add, subtract, multiply, divide.
● Real zero means complete absence (e.g., zero weight = no weight).
Example: Weight, height, age.
Examples of ratio data:
● Height: 0 cm, 50 cm, 100 cm
● Weight: 0 kg, 10 kg, 20 kg
● Age: 0 years, 5 years, 10 years
● Money: ₹0, ₹100, ₹200
● Distance: 0 km, 5 km, 10 km
Statistical Analysis of Data
● Statistics is the science of data that uses mathematical techniques to
extract meaningful information.
● Statistics involves collecting, organizing, analyzing, interpreting, and
presenting data.
● In AI, statistics turns observations into insights that can be understood
and shared.
● It often works with large datasets, using Central Tendency (mean,
median, mode) to understand and analyze data.
Central tendency is the summary of a dataset in a single value that
represents the entire distribution of the data.
Statistical Analysis using Python
What is Mean?
The mean in statistics is calculated by dividing the sum of all
values by the total number of observations in a sample.
Example -1
The set S = {5,10,15,20,30}
Mean = (5 + 10 + 15 + 20 + 30) / 5 = 80 / 5 = 16
Program-1
There are 25 students in a class. Their heights are given below.
Write a Python Program to find the mean.
heights → 145, 151, 152, 149, 147, 152, 151,149, 152, 151, 147, 148,
155, 147,152,151, 149,145, 147, 152,146, 148, 150, 152, 151
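One possible solution, as a minimal sketch: it computes the mean directly from the definition and then checks it against `statistics.mean` from Python's standard library. The variable names are illustrative.

```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148,
           155, 147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

# Mean = sum of all values / number of observations
mean_manual = sum(heights) / len(heights)

# The same calculation using the standard library
mean_library = statistics.mean(heights)

print("Mean height (manual):", mean_manual)
print("Mean height (statistics):", mean_library)
```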
What is Median?
The median is the middle value of a dataset when the
numbers are arranged in ascending or descending order.
Program-2
There are 25 students in a class. Their heights are given below.
Write a Python Program to find the median.
heights → 145, 151, 152, 149, 147, 152, 151,149, 152, 151, 147, 148,
155, 147,152,151, 149,145, 147, 152,146, 148, 150, 152, 151
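One possible solution, as a minimal sketch: sort the values, pick the middle one (or average the two middle ones when the count is even), and compare with `statistics.median`. The variable names are illustrative.

```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148,
           155, 147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

# Arrange the values in ascending order
ordered = sorted(heights)
n = len(ordered)

if n % 2 == 1:
    # Odd count: the single middle value
    median_manual = ordered[n // 2]
else:
    # Even count: the average of the two middle values
    median_manual = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# The same calculation using the standard library
median_library = statistics.median(heights)

print("Median height (manual):", median_manual)
print("Median height (statistics):", median_library)
```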
What is Mode?
The mode is the value that appears most often in a dataset,
representing the highest bar in a bar chart or histogram.
Sorting might make it easier to spot the most frequent value
in small datasets.
Program-3
Write a program to find the mode
(heights → 145,151, 152, 149, 147, 152, 151,149, 152, 151, 147, 148,
155, 147,152,151, 149, 145, 147, 152,146, 148, 150, 152, 151)
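One possible solution, as a minimal sketch: count how often each value occurs with `collections.Counter`, take the most frequent one, and compare with `statistics.mode`. The variable names are illustrative.

```python
import statistics
from collections import Counter

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148,
           155, 147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

# Count how many times each height appears
counts = Counter(heights)

# most_common(1) returns a list holding the (value, frequency) pair
# with the highest frequency
mode_manual, frequency = counts.most_common(1)[0]

# The same result using the standard library
mode_library = statistics.mode(heights)

print("Mode (manual):", mode_manual, "appearing", frequency, "times")
print("Mode (statistics):", mode_library)
```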
Comparison - Mean, Median, Mode
Measures of Dispersion
Variance and Standard Deviation
Measures of central tendency (mean, median, mode) show
the central value of a dataset, while Measures of Dispersion
(Variance, Standard Deviation) describe how the data is
spread around that center.
Let us understand these two using an example:
Measure the height (at the shoulder) of 5 dogs (in millimetres).
Heights: 600 mm, 470 mm, 170 mm, 430 mm, 300 mm
Mean calculation: (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394 mm
Calculate the difference (from mean height), square them, and find the
average. This average is the value of the Variance.
And Standard Deviation is the square root of the variance.
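The steps just described can be sketched in Python for the five dog heights. This follows the "average of squared differences" definition given here (dividing by the number of values); the variable names are illustrative.

```python
import math

heights_mm = [600, 470, 170, 430, 300]

# Step 1: the mean height
mean = sum(heights_mm) / len(heights_mm)

# Step 2: difference of each height from the mean, squared
squared_diffs = [(h - mean) ** 2 for h in heights_mm]

# Step 3: variance = average of the squared differences
variance = sum(squared_diffs) / len(squared_diffs)

# Step 4: standard deviation = square root of the variance
std_dev = math.sqrt(variance)

print("Mean:", mean)
print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
```

For these heights the mean is 394 mm, the variance is 21704 and the standard deviation is roughly 147 mm.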
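Formula - Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Formula - Standard Deviation: $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
where $x_i$ are the data values, $\mu$ is their mean, and $N$ is the number of values.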
Important facts about Variance and Standard Deviation
● Small variance → Data points are very close to the mean and to
each other.
● High variance → Data points are widely spread from the mean
and from one another.
● Low standard deviation → Data points are very close to the mean.
● High standard deviation → Data points are spread out over a
large range of values.
Program-4
Write a program to find the variance and standard deviation.
heights → 145,151, 152, 149, 147, 152, 151,149, 152, 151, 147, 148,
155, 147,152,151, 149,145, 147, 152,146, 148, 150, 152, 151
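One possible solution, as a minimal sketch using Python's `statistics` module. It assumes the population versions (`pvariance`, `pstdev`), which divide by the number of values and so match the "average of squared differences" definition used in this unit. The sample version (`variance`, which divides by n - 1) is printed for comparison.

```python
import statistics

heights = [145, 151, 152, 149, 147, 152, 151, 149, 152, 151, 147, 148,
           155, 147, 152, 151, 149, 145, 147, 152, 146, 148, 150, 152, 151]

# Population variance: average of the squared differences from the mean
variance = statistics.pvariance(heights)

# Population standard deviation: square root of the population variance
std_dev = statistics.pstdev(heights)

# Sample variance divides by (n - 1) instead of n, so it is slightly larger
sample_variance = statistics.variance(heights)

print("Variance:", round(variance, 4))
print("Standard deviation:", round(std_dev, 4))
print("Sample variance:", round(sample_variance, 4))
```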
Data Representation
Statistics uses data representation techniques to summarize large
datasets into a compact, meaningful form, allowing important
information to be understood quickly with minimum effort.
Data representation techniques are broadly classified in two ways:
● Non-Graphical technique
● Graphical Technique
Non-Graphical Technique
Examples: tabular form and case form.
These are older methods of data representation and are not suitable for
large datasets.
Non-graphical techniques are less effective when the goal is to make
decisions based on data analysis.
Graphical Technique
Data visualization is the graphical or pictorial representation of
statistical data using points, lines, charts, and other shapes, making
complex or large datasets easier for the human brain to understand,
as it is in visual format.
Data visualization can be done in Python using the
library Matplotlib.
pyplot is a submodule of Matplotlib that provides a
MATLAB-like interface to the library.
Line Graph
A line graph is a powerful tool used to represent continuous data
along a numbered axis.
It allows us to visualize trends and changes in data points over time.
The line can slope upwards, indicating an increase, or downwards,
signifying a decrease, reflecting the changes in the data over time.
A line chart is plotted in Python using the function plot().
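A minimal sketch of plot() in use. Because it needs some numbers to draw, it borrows the weekly customer inquiry counts listed in the Python Programs exercises of this unit. The labels and the output file name are assumptions.

```python
import matplotlib.pyplot as plt

weeks = ["Week 1", "Week 2", "Week 3", "Week 4"]
inquiries = [150, 170, 180, 200]

plt.figure()
# Draw the line, with a round marker at every data point
plt.plot(weeks, inquiries, marker="o")
plt.title("Weekly customer inquiries")
plt.xlabel("Week")
plt.ylabel("Number of inquiries")

# Save the chart to an image file (file name chosen for this sketch)
plt.savefig("line_chart.png")
```

In an interactive session, calling plt.show() instead of plt.savefig() displays the chart in a window.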
Activity -3: Construct a simple line graph to represent the rainfall
data of Kerala as shown in the table below:
Bar Graph
A bar chart or bar graph is a graph that presents categorical data with
rectangular bars with heights or lengths proportional to the values
that they represent.
It is a good way to show comparison between different categories.
A bar chart is plotted in Python using the function bar().
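A minimal sketch of bar() in use. It borrows the bookstore genre figures listed in the Python Programs exercises of this unit. The labels and the output file name are assumptions.

```python
import matplotlib.pyplot as plt

genres = ["Fiction", "Mystery", "Science Fiction", "Romance", "Biography"]
books_sold = [120, 90, 80, 110, 70]

plt.figure()
# One bar per category; bar height is proportional to the value
bars = plt.bar(genres, books_sold)
plt.title("Books sold by genre")
plt.xlabel("Genre")
plt.ylabel("Number of books sold")

# Save the chart to an image file (file name chosen for this sketch)
plt.savefig("bar_chart.png")
```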
Create a bar graph to illustrate the distribution of students from various
schools who attended a seminar on “Deep Learning”. The total number
of students from each school is provided below.
Histogram
Histograms use vertical rectangles to depict the frequencies of
different value ranges.
They are drawn on a natural scale, making it easy to interpret the
central tendency, such as the mode, of the data.
Histograms can only represent one data distribution per axis.
A histogram is plotted in Python using the function hist().
Example -7
Given a dataset containing the heights of girls in class XII, construct a
histogram to visualize the distribution of heights.
141,145,142,147,144,148,141,142,149,144,143,149,146,141, 147, 142, 143
To draw a histogram from this, we first need to organize the data into intervals.
These intervals are also called logical ranges or bins.
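A minimal sketch of this example using hist(). The bin edges (2 cm wide intervals from 140 to 150) are an assumption made for this sketch, since the intervals used in the original example are not shown here.

```python
import matplotlib.pyplot as plt

heights = [141, 145, 142, 147, 144, 148, 141, 142, 149, 144, 143, 149,
           146, 141, 147, 142, 143]

# Assumed bins: 140-142, 142-144, 144-146, 146-148, 148-150
bin_edges = [140, 142, 144, 146, 148, 150]

plt.figure()
# hist() counts how many heights fall into each bin and draws the bars
counts, edges, patches = plt.hist(heights, bins=bin_edges, edgecolor="black")
plt.title("Heights of girls in class XII")
plt.xlabel("Height")
plt.ylabel("Frequency")

# Save the chart to an image file (file name chosen for this sketch)
plt.savefig("histogram.png")

print("Frequency per bin:", list(counts))
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.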
Scatterplot
Scatter plots represent relationships between two variables by
plotting data points along both the x and y axes.
They reveal trends, clusters, and relationships within datasets.
A student had a hypothesis for a science project. He believed that the more the
students studied Math, the better their math scores would be. He took a poll in
which he asked students the average number of hours that they studied per
week during a given semester. He then found out the overall percentage that
they received in their Math classes. His data is shown in the table below:
A scatter plot is plotted in Python using the function scatter().
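A minimal sketch of scatter() for a study-hours scenario like the one above. The student's actual table is not reproduced here, so the numbers below are placeholder values chosen only to make the sketch runnable. The labels and the output file name are also assumptions.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only (NOT the student's actual data)
hours_studied = [1, 2, 3, 4, 5, 6]
math_scores = [52, 58, 65, 70, 78, 85]

plt.figure()
# Each student becomes one point: x = hours studied, y = math score
points = plt.scatter(hours_studied, math_scores)
plt.title("Study hours vs. Math score")
plt.xlabel("Average hours studied per week")
plt.ylabel("Math score (%)")

# Save the chart to an image file (file name chosen for this sketch)
plt.savefig("scatter_plot.png")
```

If the points rise from left to right, as in these placeholder values, the plot suggests that more study hours go along with higher scores.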
Pie Chart
A pie chart is a circular graph divided into slices showing proportions
or percentages of a whole.
It is best for visualizing small tables (limit to ~7 categories for clarity).
A pie chart is plotted in Python using the function pie().
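A minimal sketch of pie() in use. It borrows the commuter transport percentages listed in the Python Programs exercises of this unit. The title and the output file name are assumptions.

```python
import matplotlib.pyplot as plt

modes = ["Car", "Public Transit", "Walking", "Bicycle"]
shares = [40, 30, 20, 10]  # percentages of commuters

plt.figure()
# autopct prints the percentage on each slice
plt.pie(shares, labels=modes, autopct="%1.0f%%")
plt.title("Types of transportation used by commuters")

# Save the chart to an image file (file name chosen for this sketch)
plt.savefig("pie_chart.png")
```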
A school conducted a survey to find out students’ favorite sports. The
results are shown below:
Write a Python program to create a pie chart showing the distribution
of students’ favorite sports.
MATRICES
● A matrix is a rectangular arrangement of numbers in rows
and columns.
● The numbers are arranged in tabular form as rows and
columns.
● In computer vision (AI), images are represented as
matrices of pixels.
Order of a matrix
● A matrix has m rows and n columns.
● It is called a matrix of order m × n or simply m×n
matrix (m by n matrix)
Operations on Matrices
1. Addition of matrices
2. Difference of matrices
3. Transpose of a matrix
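The three operations can be sketched in plain Python with matrices stored as lists of rows. The matrices A and B and the helper function names are hypothetical examples, not the ones shown in the original slides.

```python
# Two hypothetical 2 x 3 matrices, each stored as a list of rows
A = [[1, 2, 3],
     [4, 5, 6]]
B = [[6, 5, 4],
     [3, 2, 1]]

def order(M):
    """Return (rows, columns), i.e. the order m x n of the matrix."""
    return len(M), len(M[0])

def add(X, Y):
    """Add matrices of the same order, element by element."""
    return [[x + y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def subtract(X, Y):
    """Subtract Y from X (same order), element by element."""
    return [[x - y for x, y in zip(row_x, row_y)] for row_x, row_y in zip(X, Y)]

def transpose(M):
    """Swap rows and columns: an m x n matrix becomes n x m."""
    return [list(column) for column in zip(*M)]

print("Order of A:", order(A))
print("A + B =", add(A, B))
print("A - B =", subtract(A, B))
print("Transpose of A =", transpose(A))
```

Addition and difference are only defined when both matrices have the same order.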
Applications of matrices in AI
• Image Processing
• Recommendation systems use matrices to relate users to
the products they purchased or viewed
• In NLP, vectors (numerical form of words) depict the
distribution of a particular word in a document. Vectors are
one-dimensional matrices.
DATA PREPROCESSING
1. Data Cleaning (Missing Data, Outliers, Inconsistent Data, Duplicate
Data)
2. Data Transformation
3. Data Reduction
4. Data Integration and Normalization
5. Feature Selection
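A few of these steps can be sketched in plain Python. This is only an illustration under stated assumptions: the records are made up, missing values are filled with the mean (one of several possible strategies), and normalization is done with min-max scaling.

```python
import statistics

# Hypothetical raw records: (student, height in cm); None marks a missing value
raw = [("Asha", 152), ("Ravi", None), ("Meena", 147), ("Asha", 152), ("John", 155)]

# Data cleaning: drop exact duplicate records, keeping the original order
unique = list(dict.fromkeys(raw))

# Data cleaning: replace missing heights with the mean of the known heights
known = [h for _, h in unique if h is not None]
fill_value = statistics.mean(known)
cleaned = [(name, h if h is not None else fill_value) for name, h in unique]

# Normalization (min-max): rescale heights to the range 0 to 1
heights = [h for _, h in cleaned]
low, high = min(heights), max(heights)
normalized = [(h - low) / (high - low) for h in heights]

print("After removing duplicates:", unique)
print("After filling missing values:", cleaned)
print("Normalized heights:", [round(v, 3) for v in normalized])
```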
DATA IN MODELLING & EVALUATION
Data Split: Training dataset, Testing dataset
Model Selection: Algorithms chosen based on problem type:
classification, regression, clustering.
Techniques for Evaluation:
● Train-Test Split: Train on training set, evaluate on test set.
● Cross-Validation: Ensures consistent performance across
different subsets.
● Error Analysis: Identifies areas for improvement.
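A train-test split can be sketched in plain Python as follows. The dataset is hypothetical, and the 80/20 ratio and the fixed random seed are assumptions chosen for illustration.

```python
import random

# Hypothetical dataset: 10 labelled examples as (feature, label) pairs
data = [(x, x % 2) for x in range(10)]

# Shuffle a copy so the split is random; a fixed seed makes it repeatable
random.seed(42)
shuffled = data[:]
random.shuffle(shuffled)

# Keep 80% for training and hold back 20% for testing
split_point = int(0.8 * len(shuffled))
train_set = shuffled[:split_point]
test_set = shuffled[split_point:]

print("Training examples:", len(train_set))
print("Testing examples:", len(test_set))
```

The model is fitted only on the training set. The testing set is kept aside so that evaluation uses examples the model has not seen.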
Evaluation Metrics:
● Classification: Accuracy, Precision, Recall, F1-score, ROC curve.
● Regression: MSE, RMSE, MAE, R-squared.
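Several of these metrics can be computed by hand, as in the sketch below. The actual and predicted values are hypothetical, and the formulas are the standard textbook definitions rather than anything specific to this unit.

```python
import math

# Hypothetical classification results (1 = positive, 0 = negative)
actual = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)    # share of all predictions that are right
precision = tp / (tp + fp)           # share of predicted positives that are right
recall = tp / (tp + fn)              # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

# Hypothetical regression results
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]
errors = [t - p for t, p in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / len(errors)    # mean squared error
rmse = math.sqrt(mse)                              # root mean squared error
mae = sum(abs(e) for e in errors) / len(errors)    # mean absolute error

print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall, "F1:", f1)
print("MSE:", mse, "RMSE:", rmse, "MAE:", mae)
```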
Importance of Data:
● Understanding data helps make informed decisions.
● Data literacy is essential for using AI and technology wisely.
A. Multiple-choice questions
1. Which of the following best defines data literacy?
A) The ability to read and write data
B) The ability to find and use data effectively
C) The ability to analyse data using AI
D) The ability to collect and store data securely
2. What is the purpose of data preprocessing?
A) To make data more complex
B) To make data less accessible
C) To clean and prepare data for analysis
D) To increase the size of the dataset
3. How can missing data be handled in a dataset?
A) By ignoring it
B) By replacing missing values with estimates
C) By deleting rows or columns with missing values
D) By converting missing values to zero
4. Which of the following statements about the quantity of
data needed for machine learning projects is true?
A) More data is always better for good predictions.
B) Small batches of data are sufficient for complex
models.
C) Data quantity depends solely on the number of
features.
D) Data diversity is not essential for model performance.
5. Which of the following is an example of a primary
source of data collection?
A) Web scraping
B) Social media data tracking
C) Surveys
D) Kaggle datasets
6. What method of data collection involves direct
communication with individuals or groups to gather
information?
A) Observations
B) Experiments
C) Interviews
D) Marketing campaigns
7. Which of the following is an example of ratio scale data?
A) Grading students' exam papers as ‘A’, ‘B’, ‘C’, ‘D’, and ‘F’
B) Measuring the temperature in Celsius
C) Rating a meal at a restaurant as ‘unpalatable’,
‘unappetizing’, ‘just okay’, ‘tasty’ and ‘delicious’
D) Recording the weight of a person in kilograms
8. What is the distinguishing feature of ratio scale data?
A) It involves categories without a specific order
B) It has a zero point and allows for ratios to be calculated
C) It involves categories with a strict order but no
measurable differences between categories
D) It has a definite order, but the differences between
categories cannot be measured
9. Which statistical measure is most suitable for data sets
with evenly spread values and no exceptionally high or
low values?
A) Mean
B) Median
C) Mode
D) Variance
10. What is the term used to describe the graphical or
pictorial representation of data?
A) Statistical summary
B) Data organization
C) Data visualization
D) Data interpretation
B. Short answer questions:
1. Explain the concept of data literacy and its importance in today's
digital age.
2. What is data preprocessing?
3. What is data visualization and why is it important?
4. How does a line graph differ from a bar graph?
5. When would you use a scatter plot?
6. What is data?
7. What do you mean by web scraping?
8. If a matrix has 6 elements, what are the possible orders it can
have?
9. Construct a 3×2 matrix where each element is given by a_ij = i × j
10. Find the transpose of the matrix B =
C. Long answer questions:
1. Discuss the advantages and limitations of using a pie
chart in data visualization. Provide examples to illustrate
your points.
2. Explain the terms mean, median and mode.
3. Explain the four levels of measurement.
4. Given the matrices A and B, calculate A + B and B – A.
Python Programs
1. The ages of a group of people in a community are: 25, 28, 30,
35, 40, 45, 50, 55, 60, 65.
Write a program to calculate the mean, median, and mode of the
ages.
2. A company recorded the daily temperatures (in degrees
Celsius) for five consecutive days:
20°C, 22°C, 25°C, 18°C, and 23°C. Determine the variance and
standard deviation of the temperatures.
3. Plot a line chart representing the weekly number of customer
inquiries received by a customer service center:
• Week 1: 150 inquiries
• Week 2: 170 inquiries
• Week 3: 180 inquiries
• Week 4: 200 inquiries
4. Plot a bar chart representing the number of books sold by
different genres in a bookstore:
• Fiction: 120 books
• Mystery: 90 books
• Science Fiction: 80 books
• Romance: 110 books
• Biography: 70 books
5. Visualize the distribution of different types of transportation
used by commuters in a city using a pie chart:
• Car: 40%
• Public Transit: 30%
• Walking: 20%
• Bicycle: 10%