Dealing with Different Types of Data
Learning Objectives
By the end of this lesson, you will be able to:
List the terminologies used in data analytics
Describe the types of data
Explain the levels of measurement
Terminologies in Data Analytics
Data Sampling
Observation
Dataset
Prediction
Observation
● An observation is a single row or record of data from the database.
● Any data set can be viewed as a set of observations.
[Figure: a database table whose columns (Age, Height, Nationality, Gender) are the variables and whose rows are the observations.]
Observation is the unit of analysis on which the measurements are taken.
It is also known as a case, record, pattern, or row.
Data Sampling
● Data sampling is a statistical analysis
technique used to select, manipulate,
and analyze a representative subset of
data points.
● Data sampling identifies patterns and
trends in the larger data set.
● If a randomly selected sample contains n observations, then n is the sample size.
● The chart illustrates the sampling process, in which a few people are randomly sampled from a population.
● Data sampling is cost-effective because only a representative sample is surveyed.
● It enables data scientists, predictive modelers, and data analysts to produce accurate findings without examining every observation.
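The random selection described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lesson: the population of 1,000 numbered people, the `draw_sample` helper, and the seed are all assumptions made for the example.

```python
import random

# Hypothetical population: 1,000 people identified by number (illustrative only).
population = list(range(1, 1001))

def draw_sample(items, n, seed=None):
    """Return n items chosen at random, without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(items, n)

sample = draw_sample(population, n=50, seed=42)
print(len(sample))  # the sample size n: 50
```

Sampling without replacement means no person appears twice, so the sample size equals the number of distinct people surveyed.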
Data Set
● A data set is a collection of data, or the total data captured about a particular use case.
● It can hold information such as medical, insurance, and loan approval records.
● It is not limited to numbers and text and may include collections of images or videos.
The table represents loan data with attributes such as loan ID, borrower’s gender,
education, employment status, credit history, loan amount, and property details.
Prediction
● The goal of prediction is to move from what has happened to the best assessment of what will happen.
● In the graph, a linear prediction technique is used to predict the number of children at different education levels.
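One common linear prediction technique is a least-squares line fit. The sketch below assumes that technique for illustration; the lesson does not say which method produced its graph. The `fit_line` helper and the input values are made up and are not the data behind the graph.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations (what has happened), chosen to lie exactly on a line.
xs = [1, 2, 3, 4, 5]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)

# Prediction (best assessment of what will happen) for an unseen x value.
predicted = slope * 6 + intercept
print(slope, intercept, predicted)  # 2.0 1.0 13.0
```

With real data the points rarely fall exactly on the line, so the prediction is an estimate rather than a certainty.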
Types of Data
Structured Data
● It is data that is processed, stored, and retrieved in a fixed format.
● Example: Employee details, job positions, and salaries.
Unstructured Data
● It is data that lacks any specific form or structure.
● Example: Email.
Semi-Structured Data
● It is data that contains both structured and unstructured elements.
● Example: CSV and JSON documents.
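The three types can be compared side by side in a short Python sketch. All names and values below (the `employee` table, the people, the message) are hypothetical and exist only to illustrate the distinction.

```python
import json
import sqlite3

# Structured: a table with a fixed schema (hypothetical employee data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, position TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [("Asha", "Analyst", 50000), ("Ben", "Engineer", 65000)],
)
salaries = [row[0] for row in conn.execute("SELECT salary FROM employee ORDER BY salary")]

# Semi-structured: JSON records share some keys, but fields can be nested or missing.
documents = json.loads(
    '[{"name": "Asha", "skills": ["SQL", "Python"]},'
    ' {"name": "Ben", "manager": {"name": "Cara"}}]'
)
keys_per_record = [sorted(doc.keys()) for doc in documents]

# Unstructured: free text with no predefined fields.
message = "Hi team, Ben joined last week and will report to Cara."

print(salaries)         # [50000, 65000]
print(keys_per_record)  # [['name', 'skills'], ['manager', 'name']]
```

The table can be queried by column because every row has the same fields, whereas each JSON record has to be inspected to learn which keys it carries.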
Analyzing Unstructured Data
● About 80% of business data is unstructured.
● Unstructured information is text-heavy and contains data such as dates, numbers, and facts.
● Internally generated information is considered unstructured because it does not fit neatly into a database.
● Unstructured data is primarily used for BI and analytics but not for transaction processing applications.
Retailers and manufacturers analyze unstructured data to:
● Improve customer relationship management processes
● Enable targeted marketing
● Perform sentiment analysis on product reviews
The line between unstructured and semi-structured data is not clearly defined, because even unstructured data usually has some level of structure in it.
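As a small illustration of the structure hidden in text-heavy data, the sketch below pulls dates and numbers out of free text with regular expressions. The review text is invented, and the patterns assume ISO-style dates (YYYY-MM-DD) and plain decimal numbers; real text would need more robust handling.

```python
import re

# Hypothetical product review (made up for illustration).
review = "Ordered on 2023-05-14 and it arrived in 3 days. Paid 49.99 and I love it!"

DATE_PATTERN = r"\b\d{4}-\d{2}-\d{2}\b"

dates = re.findall(DATE_PATTERN, review)

# Blank out the dates first so their digits are not counted as separate numbers.
without_dates = re.sub(DATE_PATTERN, " ", review)
numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", without_dates)]

print(dates)    # ['2023-05-14']
print(numbers)  # [3.0, 49.99]
```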
Qualitative and Quantitative Data
Qualitative Data
Data in which objects are classified by their attributes and properties.
Example: Softness of skin.
Quantitative Data
Data that can be measured and expressed numerically.
Example: Your height and shoe size.
Comparison of Qualitative and Quantitative Data
Qualitative Data
● Data collection is unstructured.
● It asks why.
● It cannot be computed as it is non-statistical.
● It develops initial understanding and defines the problem.
Quantitative Data
● Data collection is structured.
● It is all about how much or how many.
● It is statistical and is about numbers.
● It recommends the final course of action.
Subgroups of Qualitative Data
Nominal data
● Unordered data assigned to named categories that have no ranking relative to one another.
● Example: Grade classification, such as pass or fail, for a student's test results.
Ordinal data
● Ordered data assigned to categories in a ranked fashion.
● Example: Feedback on a product using a 1–5 ranking.
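The practical difference between the two subgroups shows up in which summaries make sense. A minimal sketch, using made-up values: for nominal data only counting (for example, the most frequent category) is meaningful, while ordinal data can also be sorted and summarized by its median.

```python
from statistics import median, mode

# Nominal: named categories with no order (hypothetical test results).
results = ["pass", "fail", "pass", "pass"]
most_common_result = mode(results)

# Ordinal: ranked categories (hypothetical 1-5 product feedback).
ratings = [5, 3, 4, 1, 4]
sorted_ratings = sorted(ratings)   # ordering is meaningful for ordinal data
typical_rating = median(ratings)   # so is the middle value

print(most_common_result)  # pass
print(sorted_ratings)      # [1, 3, 4, 4, 5]
print(typical_rating)      # 4
```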
Subgroups of Quantitative Data
Discrete data
● It can only take certain values.
● Example: The number of students in a class.
Continuous data
● It can take any value within a specified range.
● Example: Share price of a company.
Data Levels of Measurement
It is a classification that describes the nature of information within the values assigned to variables.
Ratio
Interval
Ordinal
Nominal
Nominal
● In the nominal level of measurement, the numbers or labels assigned to a variable are used only to classify data.
● At this level, words, letters, and alphanumeric symbols can be used.
● Example: People in the female gender category are classified as F, and those in the male gender category are classified as M.
Ordinal
● The ordinal level of measurement depicts an ordered relationship among the variable's observations.
● It indicates the order of the measurements, but not the size of the differences between them.
● Example: A student with a 100% score is assigned the first rank, another student with a 95% score is assigned the second rank, and so on.
Interval
● The interval level of measurement classifies and orders the measurements.
● It also specifies that the distances between each interval on the scale are equivalent.
● Example: Temperature in centigrade, where the distance between 80 degrees and 100 degrees is the same as the distance between 1000 degrees and 1020 degrees.
100°C - 80°C = 1020°C - 1000°C = 20°C
Ratio
● In the ratio level of measurement, the scale has a true zero: a value of zero means that none of the measured quantity is present.
● Although the properties of ratio measurement are similar to those of the interval level of measurement, the true zero on the scale sets it apart from the other levels and makes ratios between values meaningful.
Note: The nominal level classifies data, while the ordinal level indicates an order of measurements. The interval level and the ratio level both provide equal distances between values; only the ratio level has a true zero.
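The difference between the interval and ratio levels can be checked with simple arithmetic. The sketch below reuses the centigrade example and, as an added assumption not found in the lesson, uses the Kelvin scale as an example of a ratio scale because its zero is a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin."""
    return celsius + 273.15

# Interval scale: differences are meaningful and comparable.
diff_low = 100 - 80
diff_high = 1020 - 1000

# Ratios on an interval scale are not meaningful: 0 °C is an arbitrary point,
# so 100 °C is not "1.25 times as hot" as 80 °C.
naive_ratio = 100 / 80

# On a ratio scale (Kelvin), the same two temperatures give a meaningful ratio.
kelvin_ratio = celsius_to_kelvin(100) / celsius_to_kelvin(80)

print(diff_low == diff_high)   # True
print(naive_ratio)             # 1.25
print(round(kelvin_ratio, 3))  # 1.057
```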
Normal Distribution of Data
● Normal distribution is also known as the Gaussian distribution or bell curve.
● It is often considered the most important probability distribution in statistics.
● It is a perfectly symmetric bell-shaped distribution curve with only one peak.
● Many natural phenomena and occurrences approximately follow a bell curve.
● It is continuous and has tails that approach the horizontal axis asymptotically.
● It is denser at the center and has equal mean, median, and mode values.
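These properties can be verified numerically from the standard density formula of the normal distribution, f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma sqrt(2 pi)). The formula is not given in the lesson; the `normal_pdf` helper and the default parameters mu = 0 and sigma = 1 are choices made for this sketch.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

# Symmetric: equal distances on either side of the mean have equal density.
symmetric = math.isclose(normal_pdf(1.5), normal_pdf(-1.5))

# Single peak at the mean: the density is highest at the center.
peak_at_mean = normal_pdf(0.0) > normal_pdf(0.1) > normal_pdf(1.0)

# Asymptotic tails: far from the mean the density is tiny but never zero.
tail = normal_pdf(10.0)

print(symmetric, peak_at_mean, 0 < tail < 1e-20)  # True True True
```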
Statistical Parameters
Basic Statistical Parameters
Mean
● Mean is the average of all data points for a given set of data.
● It is used to derive the central tendency of the data.
● It is measured by adding all data points and dividing the sum by the number of data points.
Variance
● Variance is the sum of the squared differences between each data point and the mean, divided by the number of data points.
● It gives a measure of how the data distributes itself about the mean.
● It looks at all the data points and then determines their distribution.
Standard Deviation
● Standard deviation is the square root of variance and shows the extent to which data varies from the mean.
● It shows how tightly data points are clustered around the mean.
● It is expressed in the same units as the data, which makes it more concrete and easier to interpret than variance.
Basic Statistical Parameters: Example
Dataset x = {1, 2, 3, 4, 5, 6}
Mean = (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5
Variance = [(1 - 3.5)² + (2 - 3.5)² + (3 - 3.5)² + (4 - 3.5)² + (5 - 3.5)² + (6 - 3.5)²]/6 = 17.5/6 ≈ 2.917
Standard deviation = √2.917 ≈ 1.708
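The same example can be reproduced with Python's standard `statistics` module. Because the example divides by the number of data points, the sketch uses the population functions `pvariance` and `pstdev`; the sample versions (`variance` and `stdev`) divide by n - 1 and would give different results.

```python
from statistics import mean, pstdev, pvariance

x = [1, 2, 3, 4, 5, 6]

x_mean = mean(x)           # sum of the data points divided by their count
x_variance = pvariance(x)  # population variance: squared deviations divided by n
x_std = pstdev(x)          # square root of the population variance

print(x_mean, round(x_variance, 3), round(x_std, 3))  # 3.5 2.917 1.708
```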
Key Takeaways
• Structured data, unstructured data, and semi-structured data are the three types of data.
• Nominal, ordinal, interval, and ratio are four data levels of measurement.
• Normal distribution of data is the most important probability distribution in statistics.
• Mean, variance and standard deviation are the basic statistical parameters.