CHAPTER 4 Data Management

REVIEWER

Uploaded by

brentogale59

CHAPTER 4

DATA MANAGEMENT

Data management is the practice of collecting, organizing, storing, and analyzing data
to derive meaningful insights. In today's digital age, data has become an invaluable asset,
driving innovation and decision-making across various industries. Statistics, a branch of
mathematics, plays a crucial role in data management by providing the tools and techniques
to extract valuable information from raw data.

Learning Outcomes

At the end of this chapter, you are expected to:

1. Utilize various data management tools to process and manage quantitative data;
2. Identify the types of data and their levels of measurement;
3. Calculate the measures of central tendency and measures of dispersion of a given set
of data;
4. Interpret data based on the results of computation; and
5. Appreciate the importance and application of measures of central tendency and
measures of dispersion in real-life situations.

Components of Data Management

A. Data Collection

a. Primary Data - Collected firsthand through surveys, experiments, or
observations.

b. Secondary Data - Obtained from existing sources like government reports,
academic papers, or public databases.

B. Data Cleaning and Preparation

a. Data Cleaning - Identifying and correcting errors, inconsistencies, and
missing values.

b. Data Preparation - Transforming raw data into a suitable format for analysis,
which may involve:

• Data normalization - Scaling data to a common range.
• Data imputation - Filling in missing values.
• Feature engineering - Creating new features from existing ones.
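
As an illustration of data preparation, the sketch below fills in missing values and then scales the data to a common range. It assumes mean imputation and min-max scaling to the range 0 to 1, which are only two of several possible choices; the function names and the sample values are hypothetical.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw measurements; None marks a missing value.
raw = [10, None, 30, 20]
filled = impute_with_mean(raw)       # [10, 20, 30, 20]
scaled = min_max_normalize(filled)   # [0.0, 0.5, 1.0, 0.5]
print(filled, scaled)
```

Imputation is done first here so that the normalization step never encounters a missing value.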

C. Data Storage and Organization

a. Data Warehouses - Centralized repositories for storing large volumes of data.

b. Data Lakes - Storage systems that hold raw data, including unstructured data, in its original format.

c. Data Marts - Smaller, focused data warehouses.

D. Data Analysis and Interpretation

a. Descriptive Statistics - Summarizing data using measures like mean, median,
mode, standard deviation, and variance.

b. Inferential Statistics - Drawing conclusions about a population based on a
sample.

c. Hypothesis Testing - Making inferences about population parameters.

d. Regression Analysis - Modelling relationships between variables.

e. Machine Learning - Using algorithms to learn patterns from data.

What is Data?

Data refers to raw and unprocessed facts, figures, or values collected or observed from the
real world. It can take various forms, including numbers, text, images, or any other
representations. Data by itself lacks context and meaning until it is processed, organized, and
interpreted.

Examples:
1. Numeric Data - Numbers like age, height, weight, temperature, sales figures, etc.
2. Textual Data - Words, sentences, and paragraphs, as in articles, emails, social media posts,
etc.
3. Categorical Data - Red, blue, green; Yes or No; Categories like "High," "Medium," "Low"
4. Video Data - Movies, TV shows, video clips, etc.
5. Image Data - Pixel values in a digital image
6. Audio Data - Waveform values in digital audio

Types of Data
Data, the raw material of the information age, comes in various forms and can be categorized
based on different criteria.

1. Quantitative Data - These variables represent measurable quantities and can be either
discrete or continuous.
• Discrete Data - Take on distinct, separate values with no intermediate values. Often
whole numbers. Examples include the number of siblings or the number of cars in a
parking lot.
• Continuous Data - Can take on any value within a given range and have infinitely
many possible values. Examples include height, weight, and temperature.
2. Qualitative Data - These variables represent categories or groups and can be
either nominal or ordinal.
• Nominal Data - Categories with no inherent order or ranking. Examples include
gender, ethnicity, or types of fruits.
• Ordinal Data - Categories with a meaningful order or ranking but without consistent
intervals between them. Examples include education levels (e.g., high school,
college, graduate), satisfaction ratings (low, medium, high), and Likert scale responses.

Level of Data Measurement
Understanding the level of measurement of your data is crucial for selecting
appropriate statistical techniques and interpreting your findings accurately. There are four
primary levels of measurement, each with distinct characteristics and limitations.

1. Nominal Data - Nominal data represent categories or groups with no inherent order
or ranking.
Examples:
• Gender (categories: male, female)
• Eye color (categories: brown, blue, green)
• Marital status (single, married, divorced, widowed)
• Hair type (straight, wavy, curly, kinky)
• Car brands (Toyota, Ford, Honda, Chevrolet)
• Political affiliation (Democrat, Republican, independent)

2. Ordinal Data - Ordinal data have ordered categories, but the intervals between
them are not consistent or meaningful.

Examples:
• Educational levels (categories: high school, college, graduate)
• Customer satisfaction ratings (categories: dissatisfied, neutral,
satisfied)
• Socio-economic status (lower class, working class, middle class,
upper-middle class, upper class)
• Likert scale for agreement (strongly disagree, disagree, neutral, agree,
strongly agree)
• Performance rating (below expectations, meets expectations,
exceeds expectations)

3. Interval Data - Interval data have ordered values with consistent and meaningful
intervals between them, but they lack a true zero point.

Examples:
• Temperature in Celsius or Fahrenheit
• IQ scores
• pH level
• Longitude or latitude
• Standardized Test Scores

4. Ratio Data - Ratio data have all the properties of interval variables, but they also
have a true zero point, indicating the absence of the attribute.
Examples:
• Height in centimeters or inches (for example, the measured height of a plant)

• Income
• Weight
• Distance travelled
• Time (in seconds, minutes, hours)

Measure of Central Tendency
A measure of central tendency is a summary statistic that represents the center point
or typical value of a dataset. It is also referred to as the central location of a distribution. There
are three measures of central tendency: mean, median, and mode. Choosing the best
measure of central tendency depends on the type of data.

A. Mode

Mode is a statistical measure that represents the most frequently occurring value in a dataset.
It is the value with the greatest frequency. Mode is appropriate to use when the variable
measured is in the nominal scale.

Example 1.

Let's say we surveyed a group of people about their favorite color. Here are the results:
• Blue: 15 people
• Red: 10 people
• Green: 8 people
• Yellow: 7 people
In this case, blue is the mode because it is the most frequently chosen color.

Example 2

A teacher records the following scores for a class of 10 students on a recent test:

75, 82, 85, 85, 85, 90, 92, 95, 95, 100

Solution

To find the mode, we identify the score that appears most frequently. In this case, the score
85 appears three times, which is more frequent than any other score. Therefore, the mode of
the test scores is 85.
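
The same count can be reproduced in Python. The sketch below uses the test scores from Example 2; `Counter` and `statistics.mode` come from the standard library, and the variable names are only illustrative.

```python
from collections import Counter
from statistics import mode

# Test scores from Example 2.
scores = [75, 82, 85, 85, 85, 90, 92, 95, 95, 100]

# Tally how many times each score appears.
counts = Counter(scores)

# most_common(1) returns a list holding the single most frequent (value, count) pair.
most_common_score, frequency = counts.most_common(1)[0]

print(most_common_score, frequency)  # 85 3
print(mode(scores))                  # 85
```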

Real-world examples:

Fashion. A clothing store owner might notice that a particular style of jeans is selling more
than any other. The mode would be the most popular style.

Weather. A meteorologist might observe that the most common daily high temperature in a
particular city during a specific month is 25 degrees Celsius. This would be the modal
temperature.

Product Sales. A supermarket manager might identify the best-selling brand of cereal by
determining the brand that appears most frequently in sales records.

Quality Control. A manufacturer might inspect a batch of products and find that a certain
defect occurs most often. This would be the modal defect.

Characteristics:

• Multiple Modes. A dataset can have more than one mode. Such a dataset is called bimodal
(two modes) or multimodal (more than two modes). For instance, if "red" and "blue" are
equally popular colors in the survey example, the dataset would be bimodal.
• No Mode. A dataset might not have a mode if all values occur with the same
frequency.
• Mode for Categorical Data. Mode is often used for categorical data, as it helps
identify the most common category.
• Mode for Numerical Data. While it can be used for numerical data, it's less common
than the mean or median, especially for large datasets.
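
The first two characteristics can be checked with a short sketch. It assumes the convention stated above, that a dataset has no mode when all values occur with the same frequency; `modes_or_none` is a hypothetical helper and the sample data are made up for illustration. Note that `statistics.multimode` on its own returns every value when all are tied.

```python
from statistics import multimode


def modes_or_none(values):
    """Return the list of modes, or an empty list when every distinct value is tied."""
    modes = multimode(values)
    # If all distinct values share the top frequency, treat the data as having no mode.
    if len(modes) > 1 and len(modes) == len(set(values)):
        return []
    return modes


# Hypothetical survey answers where two colors tie for first place.
colors = ["red", "blue", "red", "blue", "green"]
print(modes_or_none(colors))     # ['red', 'blue'] -> bimodal

# Hypothetical data where every value occurs exactly once.
print(modes_or_none([1, 2, 3]))  # [] -> no mode
```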

When to Use Mode

• Identifying the Most Common Value - When you want to know the most frequent
occurrence.
• Categorical Data Analysis - When dealing with categorical data, mode is a useful
measure of central tendency.
• Non-Normal Distributions - In cases where the data is not normally distributed, the
mode can provide insights that might be missed by the mean or median.

B. Median

The median is the middle entry or term in a set of data arranged in either increasing or
decreasing order. The median is a positional measure: it depends on the order of the
measures rather than on the size of each one. It is affected by the number of measures and not by
the size of the extreme values. This measure is appropriate to use when the data are measured on at
least an ordinal scale, since ranking of the data is involved.

To find the median of a given set of data, take note of the following:
1. Arrange the data in either increasing or decreasing order.
2. Locate the middle value. If the number of cases is odd, the middle value is the
median. If the number of cases is even, take the arithmetic mean of the two
middle measures.

Example 1

The number of books borrowed in the library from Monday to Friday last week were 58, 60,
54, 35, and 97 respectively. Find the median.

Solution: Arrange the number of books borrowed in increasing order.

35, 54, 58, 60, 97

The median is 58.

Example 2

Cora’s quizzes for the second quarter are 8, 7, 6, 10, 9, 5, 9, 6, 10, and 7. Find the median.

Solution: Arrange the scores in increasing order.

5, 6, 6, 7, 7, 8, 9, 9, 10, 10

Since the number of measures is even, the median is the average of the two
middle scores, 7 and 8:

Median = (7 + 8) / 2 = 7.5
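
The two steps for finding the median can be sketched in Python as follows. `median_by_steps` is an illustrative helper, not a standard function; `statistics.median` from the standard library is shown beside it as a cross-check.

```python
from statistics import median


def median_by_steps(values):
    """Find the median by sorting, then locating the middle value(s)."""
    ordered = sorted(values)          # Step 1: arrange in increasing order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # Step 2a: odd count, take the middle value
        return ordered[middle]
    # Step 2b: even count, average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


books = [58, 60, 54, 35, 97]                 # Example 1
quizzes = [8, 7, 6, 10, 9, 5, 9, 6, 10, 7]   # Example 2

print(median_by_steps(books), median(books))      # 58 58
print(median_by_steps(quizzes), median(quizzes))  # 7.5 7.5
```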

Characteristics of Median

• Less Affected by Outliers. Unlike the mean, the median is not significantly affected by
extremely large or small values.
• Quick and Easy to Calculate. It's relatively simple to find the median, especially for
smaller datasets.
• Represents the Middle Value. It provides a good measure of central tendency,
indicating the value that separates the lower half of the data from the upper half.

Real-world examples

Real Estate. When analyzing housing market trends, real estate agents often use the median
home price. This is because it's less affected by outliers like extremely expensive or
inexpensive homes.

Income. Economists and policymakers often use the median income to gauge the overall
economic health of a population. This is because it provides a better picture of typical income
levels, as it's less influenced by very high or very low incomes.

Demographics. Demographers use the median age to understand the age distribution of a
population. This can help in planning for future needs like healthcare, education, and social
services.

C. Mean

The mean (also known as the arithmetic mean) is the most commonly used measure
of central position. It is the sum of the measures divided by the number of measures in a variable.
It is symbolized as $\bar{x}$ (read as "x bar"). The mean is appropriate to use when the data are
measured on at least an interval scale.

To find the mean of ungrouped data, use the formula:

$$\bar{x} = \frac{\sum x}{n}$$

where $\sum x$ is the sum of all the measures and $n$ is the number of measures.

Example 1

The grades in Chemistry of 10 students are 87, 84, 85, 85, 86, 90, 79, 82, 78, 76. What is
the average grade of the 10 students?

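Solution:

$$\bar{x} = \frac{87 + 84 + 85 + 85 + 86 + 90 + 79 + 82 + 78 + 76}{10} = \frac{832}{10} = 83.2$$

The average grade of the 10 students is 83.2.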

Example 2. Find the Average Salary of a Company

Suppose a company in the Philippines has 10 employees with the following annual salaries
in Philippine Pesos (PHP):

Php 150,000
Php 175,000
Php 200,000
Php 225,000
Php 250,000
Php 250,000
Php 275,000
Php 300,000
Php 350,000
Php 1,000,000

Solution: To find the average salary

1. Add up all the salaries: Php 150,000 + Php 175,000 + Php 200,000 + Php 225,000 +
Php 250,000 + Php 250,000 + Php 275,000 + Php 300,000 + Php 350,000 + Php
1,000,000 = Php 3,175,000
2. Divide the total salary by the number of employees: Php 3,175,000/10 = Php 317,500

Therefore, the average salary of the company is Php 317,500.

Characteristics of Mean:

• Sensitivity to Outliers. The mean is sensitive to outliers. This means that extreme
values can significantly influence the mean, potentially skewing it. For example, if a few
very high salaries are included in a dataset of employee salaries, the mean salary will be
higher than the typical salary.

• Uses All Data Points. The mean takes into account every data point in the dataset. This
makes it a comprehensive measure of central tendency.

• Unique Value. A dataset can only have one mean.
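
The sensitivity to outliers can be checked on the salary data from Example 2. The sketch below compares the mean with the median, and then recomputes the mean without the single Php 1,000,000 salary; the variable names are illustrative only.

```python
from statistics import mean, median

# Annual salaries (in Php) from Example 2.
salaries = [150_000, 175_000, 200_000, 225_000, 250_000,
            250_000, 275_000, 300_000, 350_000, 1_000_000]

mean_salary = mean(salaries)      # uses every data point, including the outlier
median_salary = median(salaries)  # depends only on the middle positions

# Drop the one extreme salary to see how far it pulls the mean.
without_outlier = [s for s in salaries if s < 1_000_000]
mean_without_outlier = mean(without_outlier)

print(mean_salary)                     # 317500
print(median_salary)                   # 250000.0
print(round(mean_without_outlier, 2))  # 241666.67
```

Nine of the ten employees earn less than the mean, which is why the median is often the better description of a typical salary in data like this.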

Real-world examples

Academic Performance. Teachers often calculate the average score on a test to assess
class performance and identify areas where students may need additional support.
Additionally, a student's Grade Point Average (GPA) is calculated by averaging their grades
in different courses.

Finance. Investors track the average price of a stock over a specific period to assess its
performance. Moreover, investors calculate the average return on their investments (ROI) to
evaluate their portfolio's performance.

Business. Businesses use average sales figures to track performance and set sales targets.

Weather. Meteorologists use the average temperature to predict weather patterns and climate
trends.

Weighted Mean

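A weighted mean is a type of average that assigns different weights to different data
points. This is useful when some data points are more important or reliable than others. The
formula for weighted mean is:

$$\bar{x}_w = \frac{\sum w x}{\sum w}$$

where $x$ is a data point and $w$ is the weight assigned to it.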

Example 1.

Below are Maria’s subjects and the corresponding number of units and grades she got for
the first grading period. Compute her grade point average.

Therefore, Maria has the GPA of 81.86 for the first grading period.
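
A weighted mean of this kind can be sketched in Python as shown below. The units and grades used here are hypothetical and are not Maria's actual record; the function name `weighted_mean` is likewise only illustrative.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical grades and the number of units of each subject (the weights).
grades = [85, 90, 80]
units = [3, 3, 2]

gpa = weighted_mean(grades, units)
print(gpa)  # 85.625, from (3*85 + 3*90 + 2*80) / 8
```

Subjects with more units pull the result toward their grade, which is the difference from a simple mean of the three grades.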

Measures of Dispersion
The measures that describe the degree of spread of the data are called “measures of
dispersion”, “measures of variability”, or “measures of spread”. These measures are used to
determine how scattered the values are in the distribution. In this topic, we will consider four
measures of dispersion, namely: range, average deviation, variance, and standard deviation.

Range for Ungrouped Data

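The range is the simplest measure of variability. It is the difference between the largest and
smallest measurements. To determine the range of ungrouped data, the formula is:

$$R = x_{\max} - x_{\min}$$

where $x_{\max}$ is the largest measurement and $x_{\min}$ is the smallest.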

Example 1

Consider the four data sets presented below. Find the range of each data set.

Comparing the data sets, Data Set 1 has the least variation because it has the smallest
value of R. On the other hand, Data Set 3 has the most variation because it has the largest
value of R.
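
A comparison of ranges like the one above can be sketched in Python. The two data sets below are hypothetical and are not the ones from the example; `data_range` is an illustrative helper name.

```python
def data_range(values):
    """Range = largest measurement minus smallest measurement."""
    return max(values) - min(values)


# Hypothetical data sets, used only to illustrate the comparison.
data_sets = {
    "A": [12, 15, 17, 20],
    "B": [5, 15, 17, 30],
}

ranges = {name: data_range(values) for name, values in data_sets.items()}
print(ranges)  # {'A': 8, 'B': 25}

# The set with the largest R shows the most variation by this measure.
most_varied = max(ranges, key=ranges.get)
print(most_varied)  # B
```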

Average Deviation for Ungrouped Data

The Average Deviation (AD) is a measure of absolute dispersion that is affected by
every individual score. It is the mean of the absolute deviations of the individual scores from
the mean of all the scores.

A large average deviation would mean that a set of scores is widely dispersed about
the mean, while a small average deviation would imply that the set of scores is closer to the
mean.

The formula of average deviation for ungrouped data is:

$$AD = \frac{\sum |x - \bar{x}|}{n - 1}$$

where $x$ is an individual score, $\bar{x}$ is the mean, and $n$ is the total number of scores.

To apply the formula, the following steps can be observed:

1. Compute the mean from the given scores.
2. Subtract the mean from the individual scores to get the deviations. That is, $x - \bar{x}$.
3. Get the absolute value of each deviation.
4. Get the sum of the absolute deviations and divide it by (n-1), where n is the total
number of scores. The quotient is the average deviation.

Example 1

The raw scores of eight students in Statistics are given as follows: 17, 17, 26, 28, 30,
30, 31, and 37. Compute the average deviation.

Example 2.

The scores of nine students in Psychology are given as follows: 15, 19, 20, 24, 28,
30, 32, 32, and 40. Calculate the average deviation.

The computed average deviation (A.D.) of the scores in Statistics is 6, while that of the scores in
Psychology is 7.17. This can be interpreted as follows: the scores in Statistics are less dispersed, or
more closely distributed near the mean (homogeneous), while the scores in Psychology are more
dispersed away from the mean (heterogeneous).
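
The four steps for the average deviation can be sketched in Python and checked against both examples. The sketch follows this chapter's convention of dividing by (n-1); many references divide by n instead, which would give smaller values. The function and variable names are illustrative.

```python
def average_deviation(scores):
    """Average deviation using this chapter's (n - 1) divisor."""
    n = len(scores)
    mean = sum(scores) / n                                 # Step 1: compute the mean
    absolute_deviations = [abs(x - mean) for x in scores]  # Steps 2-3: |x - mean|
    return sum(absolute_deviations) / (n - 1)              # Step 4: divide the sum by n - 1


statistics_scores = [17, 17, 26, 28, 30, 30, 31, 37]      # Example 1
psychology_scores = [15, 19, 20, 24, 28, 30, 32, 32, 40]  # Example 2

ad_statistics = average_deviation(statistics_scores)
ad_psychology = average_deviation(psychology_scores)

print(round(ad_statistics, 2))  # 6.0
print(round(ad_psychology, 2))  # 7.17
```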

Variance for Ungrouped Data

Another way to avoid a sum of zero for the deviation scores is to square each deviation
score and get the average of all squared deviation scores. The resulting measure is called
“variance”, which is expressed in squared units. In symbols, $s^2$.

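To compute the variance of ungrouped data, the following formula may be used:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n}$$

where $x$ is an individual value, $\bar{x}$ is the mean, and $n$ is the total number of cases.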

To be able to apply the formula, the following steps can be observed:

1. Arrange the values in a column from lowest to highest.
2. Compute the mean of the distribution.
3. Determine the deviations ($d = x - \bar{x}$).
4. Square the deviations.
5. Get the sum of the squared deviations.
6. Divide the sum by the total number of cases. The quotient is the variance.

Example 1. Consider the data set below. Compute the variance of each data set.
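
The six steps can be sketched in Python as shown below. Since the data sets for this example are not reproduced here, the sketch reuses the eight Statistics scores from the average deviation example as an assumption. It divides by the total number of cases, as in step 6; `statistics.pvariance` does the same, while `statistics.variance` divides by n - 1 instead.

```python
from statistics import pvariance, variance


def variance_by_steps(values):
    """Variance computed by dividing the sum of squared deviations by n."""
    ordered = sorted(values)                        # Step 1: lowest to highest
    n = len(ordered)
    mean = sum(ordered) / n                         # Step 2: mean of the distribution
    deviations = [x - mean for x in ordered]        # Step 3: d = x - mean
    squared = [d ** 2 for d in deviations]          # Step 4: square the deviations
    total = sum(squared)                            # Step 5: sum of squared deviations
    return total / n                                # Step 6: divide by the number of cases


# Statistics scores reused from the average deviation example (an assumption,
# not the data set of this example).
scores = [17, 17, 26, 28, 30, 30, 31, 37]

result = variance_by_steps(scores)
print(result)             # 42.0
print(pvariance(scores))  # 42, same divisor (n)
print(variance(scores))   # 48, divisor n - 1
```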

Standard Deviation for Ungrouped Data

Recall that, in the computation of the variance, the deviations were squared. This implies
that the variance is expressed in squared units. Extracting the square root of the value of the
variance will give the value of the standard deviation. In symbols, $s$.

To take the standard deviation of ungrouped data, extract the square root of the
variance. As a mathematical formula,

$$s = \sqrt{s^2}$$

Example 1. Consider the data set below. Compute the standard deviation of each data set.

On the basis of the obtained standard deviation, we say that the scores in Data Set 1
deviate from the mean by 2.06 units, on the average. For Data Set 2, the scores deviate from
the mean by an average of 2.56 units.
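
The relation between variance and standard deviation can be sketched in Python. The scores below are the eight Statistics scores from the average deviation example, reused here as an assumption because the data sets of this example are not reproduced. The variance is computed with n as the divisor, which matches `statistics.pstdev`.

```python
import math
from statistics import pstdev

# Statistics scores reused from the average deviation example (an assumption).
scores = [17, 17, 26, 28, 30, 30, 31, 37]

n = len(scores)
mean = sum(scores) / n
variance = sum((x - mean) ** 2 for x in scores) / n  # squared units

# The standard deviation is the square root of the variance,
# which returns the measure to the original units of the scores.
std_dev = math.sqrt(variance)

print(variance)           # 42.0
print(round(std_dev, 2))  # 6.48
```

Here the scores deviate from the mean of 27 by about 6.48 points on the average.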
