cs3352 Foundations of Data Science Unit II

The document covers the foundations of data science, focusing on the types of data, including qualitative and quantitative data, and their respective advantages and disadvantages. It also explains various scales of measurement, variables, and the differences between discrete and continuous variables, as well as the concepts of independent and dependent variables. Additionally, it discusses observational studies, confounding variables, and the use of frequency distributions for describing data.

Uploaded by

P SANTHIYA

lOMoARcPSD|15656136

CS3352 Foundations of Data Science UNIT II

Information Technology (Anna University)


Downloaded by P. SANTHIYA - CSE (psa.cse@builderscollege.edu.in)

UNIT II : Describing Data

Syllabus

Types of Data - Types of Variables - Describing Data with Tables and Graphs -
Describing Data with Averages - Describing Variability - Normal Distributions
and Standard (z) Scores.

Types of Data

• Data is a collection of facts and figures that convey something specific but
are not organized in any way. Data can be numbers, words, measurements,
observations or even just descriptions of things. In other words, data is the raw
material in the production of information.

• A data set is a collection of related records or information. The information may
be about some entity or some subject area.

• A data set is a collection of data objects and their attributes. Attributes capture
the basic characteristics of an object.

• Each row of a data set is called a record. Each data set also has multiple
attributes, each of which gives information on a specific characteristic.

Qualitative and Quantitative Data

• Data can broadly be divided into the following two types: qualitative data and
quantitative data.

Qualitative data:

• Qualitative data provides information about the quality of an object, or
information that cannot be measured. Qualitative data cannot be expressed as a
number. Data that represent nominal scales, such as gender, economic status and
religious preference, are usually considered to be qualitative data.

• Qualitative data is concerned with descriptions, which can be observed but
cannot be computed. Qualitative data is also called categorical data. Qualitative
data can be further subdivided into two types as follows:

1. Nominal data

2. Ordinal data

Quantitative data:

• Quantitative data focuses on numbers and mathematical calculations; it can be
calculated and computed.

• Quantitative data is anything that can be expressed as a number or quantified.
Examples of quantitative data are scores on achievement tests, number of hours of
study or weight of a subject. These data may be represented by ordinal, interval or
ratio scales and lend themselves to most statistical manipulation.

• There are two types of quantitative data: Interval data and ratio data.

Difference between Qualitative and Quantitative Data

Advantages and Disadvantages of Qualitative Data

1. Advantages:

• It supports in-depth analysis.

• Qualitative data helps market researchers to understand the mindset of their
customers.

• It avoids pre-judgments.

2. Disadvantages:

• Time consuming

• Not easy to generalize

• Difficult to make systematic comparisons

Advantages and Disadvantages of Quantitative Data

1. Advantages:

• Easier to summarize and make comparisons.

• It is often easier to obtain large sample sizes

• It is less time consuming since it is based on statistical analysis.

2. Disadvantages:

• The cost is relatively high.

• The data the researcher receives cannot always be generalized accurately.

Ranked Data

• Ranked data is a variable whose values are taken from an ordered set and
recorded in order of magnitude. Ranked data is also called ordinal data.

• Ordinal represents the "order." Ordinal data is a kind of qualitative or
categorical data. It can be grouped, named and also ranked.

• Characteristics of the Ranked data:

a) The ordinal data shows the relative ranking of the variables

b) It identifies and describes the magnitude of a variable

c) Along with the information provided by the nominal scale, ordinal scales give
the rankings of those variables

d) The interval properties are not known

e) The surveyors can quickly analyze the degree of agreement concerning the
identified order of variables

• Examples:

a) University ranking : 1st, 9th, 87th...

b) Socioeconomic status: poor, middle class, rich.

c) Level of agreement: yes, maybe, no.

d) Time of day: dawn, morning, noon, afternoon, evening, night

Scale of Measurement

• Scales of measurement are also called levels of measurement. Each level of
measurement has specific properties that determine which statistical analyses
can be used.

• There are four different scales of measurement. The data can be defined as being
one of the four scales. The four types of scales are: Nominal, ordinal, interval and
ratio.

Nominal

• Nominal data is the first level of measurement, in which numbers serve only as
"tags" or "labels" to classify or identify objects.

• Nominal data usually deals with non-numeric variables or with numbers that
do not carry any quantitative value. While developing statistical models, nominal
data are usually transformed before building the model.

• Nominal variables are also known as categorical variables.

Characteristics of nominal data:

1. A nominal data variable is classified into two or more categories. In this
measurement mechanism, the answer should fall into one of the classes.

2. It is qualitative. The numbers are used here to identify the objects.

3. The numbers don't define the object characteristics. The only permissible aspect
of numbers in the nominal scale is "counting".

• Example:

1. Gender: Male, female, other.

2. Hair Color: Brown, black, blonde, red, other.

Interval

• Interval data corresponds to a variable whose value is chosen from an
interval set.

• It is defined as a quantitative measurement scale in which the difference between
two values is meaningful. In other words, the variables are measured in an
exact manner rather than a relative one, and the zero point of the scale is arbitrary.

• Characteristics of interval data:

a) The interval data is quantitative as it can quantify the difference between the
values.

b) It allows calculating the mean and median of the variables.

c) To understand the difference between the variables, you can subtract the values
between the variables.

d) The interval scale is the preferred scale in statistics as it helps to assign
numerical values to arbitrary assessments such as feelings, calendar types, etc.

• Examples:

1. Celsius temperature

2. Fahrenheit temperature

3. Time on a clock with hands.

Ratio

• Any variable for which ratios can be computed and are meaningful is called
ratio data.

• It is a type of variable measurement scale. It allows researchers to compare
differences or intervals. The ratio scale has a unique feature: it possesses a
true origin, or zero point.

• Characteristics of ratio data:

a) Ratio scale has a feature of absolute zero.

b) It doesn't have negative numbers, because of its zero-point feature.

c) It affords unique opportunities for statistical analysis. The values can be
meaningfully added, subtracted, multiplied and divided. Mean, median and mode can
be calculated using the ratio scale.

d) Ratio data has unique and useful properties. One such feature is that it allows
unit conversions, such as kilogram-calories to gram-calories.

• Examples: Age, weight, height, ruler measurements, number of children.
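
The contrast between interval and ratio scales can be illustrated with a small sketch. The helper functions and the sample readings below are illustrative assumptions, not part of the original notes. The idea is that a ratio of two interval-scale readings changes when the unit changes, while a ratio of two ratio-scale readings does not.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius reading to Fahrenheit (both are interval scales)."""
    return c * 9 / 5 + 32


def kilograms_to_grams(kg):
    """Convert a weight in kilograms to grams (both are ratio scales)."""
    return kg * 1000


# Interval scale: the zero point is arbitrary, so ratios change with the unit.
cold_c, warm_c = 10, 20
ratio_celsius = warm_c / cold_c
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cold_c)

# Ratio scale: zero means "none", so ratios survive a unit conversion.
light_kg, heavy_kg = 10, 20
ratio_kg = heavy_kg / light_kg
ratio_g = kilograms_to_grams(heavy_kg) / kilograms_to_grams(light_kg)

print("temperature ratios:", ratio_celsius, ratio_fahrenheit)
print("weight ratios:", ratio_kg, ratio_g)
```

Here 20 degrees Celsius looks like "twice" 10 degrees only until the same readings are expressed in Fahrenheit, whereas 20 kg is twice 10 kg in any unit.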

Example 2.1.1: Indicate whether each of the following terms is qualitative, ranked
or quantitative:

(a) ethnic group

(b) academic major

(c) age

(d) family size

(e) net worth (in Rupees)

(f) temperature

(g) sexual preference

(h) second-place finish

(i) IQ score

(j) gender

Solution :

(a) ethnic group → Qualitative

(b) academic major → Qualitative

(c) age → Quantitative

(d) family size → Quantitative

(e) net worth (in Rupees) → Quantitative

(f) temperature → Quantitative

(g) sexual preference → Qualitative

(h) second-place finish → Ranked

(i) IQ score → Quantitative

(j) gender → Qualitative

Types of Variables

• A variable is a characteristic or property that can take on different values.

Discrete and Continuous Variables

• Quantitative variables can be further distinguished in terms of whether they are
discrete or continuous.

Discrete variables:

• The word discrete means countable. For example, the number of students in a
class is countable or discrete. The value could be 2, 24, 34 or 135 students, but it
cannot be 23/32 or 12.23 students.

• The number of pages in a book is a discrete variable. Discrete data can only take
on certain individual values.

Continuous variables:

• A continuous variable is a variable that can take all values within a given
interval or range. A continuous variable consists of numbers whose values, at least
in theory, have no restrictions.

• Examples of continuous variables are blood pressure, weight, height and income.

• Continuous data can take on any value in a certain range. Length of a file is a
continuous variable.

Difference between Discrete variables and Continuous variables

Approximate Numbers

• An approximate number is a number that stands in for an exact number;
there is always a difference between the exact and approximate numbers.

• For example, 2, 4, 9 are exact numbers as they do not need any approximation.

• But √2, π, √3 are approximate numbers as they cannot be expressed exactly with a
finite number of digits. They can be written as 1.414, 3.1416, 1.7320 etc., which are
only approximations to the true values.

• Whenever values are rounded off, as is always the case with actual values for
continuous variables, the resulting numbers are approximate, never exact.

• An approximate number is one that does have uncertainty. A number can be
approximate for one of two reasons:

a) The number can be the result of a measurement.

b) Certain numbers simply cannot be written exactly in decimal form. Many
fractions and all irrational numbers fall into this category.
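
A short sketch of the gap between an irrational number and its rounded form. The choice of three decimal places is an assumption made for illustration, and the values returned by `math.sqrt` and `math.pi` are themselves only double-precision approximations of the true values.

```python
import math

# Irrational numbers cannot be written exactly with finitely many digits,
# so any decimal form we store or print is an approximation.
exact_values = {"sqrt(2)": math.sqrt(2), "pi": math.pi, "sqrt(3)": math.sqrt(3)}

approximations = {}
for name, value in exact_values.items():
    approx = round(value, 3)       # keep three decimal places
    error = abs(value - approx)    # gap between the two numbers
    approximations[name] = (approx, error)
    print(f"{name}: {approx} (error about {error:.6f})")
```

The error is never zero, which matches the statement that rounded values are approximate, never exact.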

Independent and Dependent Variables

• The two main variables in an experiment are the independent and dependent
variables. An experiment is a study in which the investigator decides who receives
the special treatment.

1. Independent variables

• An independent variable is the variable that is changed or controlled in a
scientific experiment to test the effects on the dependent variable.

• An independent variable is a variable that represents a quantity that is being
manipulated in an experiment.

• The independent variable is the one that the researcher intentionally changes or
controls.

• In an experiment, an independent variable is the treatment manipulated by the
investigator. In mathematical equations, independent variables are mostly denoted
by 'x'.

• Independent variables are also called "explanatory variables," "manipulated
variables," or "controlled variables." In a graph, the independent variable is usually
plotted on the X-axis.

2. Dependent variables

• A dependent variable is the variable being tested and measured in a scientific
experiment.

• The dependent variable is 'dependent' on the independent variable. As the
experimenter changes the independent variable, the effect on the dependent
variable is observed and recorded.

• The dependent variable is the factor that the research measures. It changes in
response to the independent variable or depends upon it.

• A dependent variable represents a quantity whose value depends on how the
independent variable is manipulated.

• In mathematical equations, dependent variables are mostly denoted by 'y'.

• Dependent variables are also called the "measured variable," the "responding
variable," or the "explained variable". In a graph, dependent variables are usually
plotted on the Y-axis.

• When a variable is believed to have been influenced by the independent variable,
it is called a dependent variable. In an experimental setting, the dependent variable
is measured, counted or recorded by the investigator.

• Example: Suppose we want to know whether or not eating breakfast affects
student test scores. The factor under the experimenter's control is the presence or
absence of breakfast, so we know it is the independent variable. The experiment
measures test scores of students who ate breakfast versus those who did not.
Theoretically, the test results depend on breakfast, so the test results are the
dependent variable. Note that test scores are the dependent variable, even if it turns
out there is no relationship between scores and breakfast.

Observational Study

• An observational study focuses on detecting relationships between variables not
manipulated by the investigator. An observational study is used to answer a
research question based purely on what the researcher observes. There is no
interference or manipulation of the research subjects and no control and treatment
groups.

• These studies are often qualitative in nature and can be used for both exploratory
and explanatory research purposes. While quantitative observational studies exist,
they are less common.

• Observational studies are generally used in hard science, medical and social
science fields. This is often due to ethical or practical concerns that prevent the
researcher from conducting a traditional experiment. However, the lack of control
and treatment groups means that forming inferences is difficult and there is a risk
of confounding variables affecting the analysis.

Confounding Variable

• Confounding variables are those that affect other variables in a way that produces
spurious or distorted associations between two variables. They confound the "true"
relationship between two variables. Confounding refers to differences in outcomes
that occur because of differences in the baseline risks of the comparison groups.

• For example, if we have an association between two variables (X and Y) and that
association is due entirely to the fact that both X and Y are affected by a third
variable (Z), then we would say that the association between X and Y is spurious
and that it is a result of the effect of a confounding variable (Z).

• A difference between groups might be due not to the independent variable but to
a confounding variable.

• For a variable to be confounding:

a) It must be connected with the independent variable of interest and

b) It must be connected to the outcome or dependent variable directly.

• For example, consider research into whether alcohol drinkers have more heart
disease than non-drinkers. The result can be influenced by another factor: alcohol
drinkers might smoke more cigarettes than non-drinkers, so cigarette smoking acts
as a confounding variable when studying the association between drinking alcohol
and heart disease.

• For example, suppose a researcher collects data on ice cream sales and shark
attacks and finds that the two variables are highly correlated. Does this mean that
increased ice cream sales cause more shark attacks? That's unlikely. The more
likely cause is the confounding variable temperature. When it is warmer outside,
more people buy ice cream and more people go in the ocean.
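
The ice cream example can be sketched as a small simulation. Everything numeric here is a made-up assumption for illustration (the temperature range, the coefficients, the noise levels and the seed); no real sales or attack data are used. Temperature drives both quantities, and the partial correlation formula is used to check what remains of the association once temperature is controlled for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical daily data: temperature (Z) drives both X and Y independently.
temperature = [rng.uniform(10, 35) for _ in range(200)]
ice_cream_sales = [20 * t + rng.gauss(0, 20) for t in temperature]
shark_attacks = [0.5 * t + rng.gauss(0, 1) for t in temperature]

r_xy = pearson(ice_cream_sales, shark_attacks)
r_xz = pearson(ice_cream_sales, temperature)
r_yz = pearson(shark_attacks, temperature)

# Partial correlation of X and Y after controlling for Z.
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print(f"raw correlation:     {r_xy:.2f}")
print(f"partial correlation: {partial:.2f}")
```

In this sketch the raw correlation is strong, while the partial correlation is close to zero, which is the pattern expected when an association is due to a shared third variable.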

Describing Data with Tables

Frequency Distributions for Quantitative Data


• Frequency distribution is a representation, either in a graphical or tabular format,
that displays the number of observations within a given interval. The interval size
depends on the data being analyzed and the goals of the analyst.

• In order to find the frequency distribution of quantitative data, we can use the
following table that gives information about "the number of smartphones owned
per family."

• For such quantitative data, it is quite straightforward to make a frequency
distribution table. Each family owns 1, 2, 3, 4 or 5 smartphones. Then, all we need
to do is to find the frequency of 1, 2, 3, 4 and 5. Arranging this information in table
format gives the frequency table for quantitative data.

• When observations are sorted into classes of single values, the result is referred to
as a frequency distribution for ungrouped data. It is the representation of
ungrouped data and is typically used when we have a smaller data set.

• A frequency distribution is a means to organize a large amount of data. It takes
data from a population based on certain characteristics and organizes the data in a
way that is comprehensible to an individual that wants to make assumptions about
a given population.

• Types of frequency distribution are grouped frequency distribution, ungrouped
frequency distribution, cumulative frequency distribution, relative frequency
distribution and relative cumulative frequency distribution.

1. Grouped data:

• Grouped data refers to the data which is bundled together in different classes or
categories.

• Data are grouped when the variable stretches over a wide range and there are a
large number of observations, so that listing every individual value would be
impractical and time-consuming. Hence, the values are combined into groups
called class intervals.

• Suppose we conduct a survey in which we ask 15 families how many pets they
have in their home. The results are as follows:

1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8

• Often we use grouped frequency distributions, in which we create groups of
values and then summarize how many observations from a dataset fall into those
groups. Here's an example of a grouped frequency distribution for our survey data :
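
The original table is not reproduced in this copy of the notes. As a minimal sketch, the survey data can be tallied both ways in code. The class intervals of width 2 used here (1-2, 3-4, 5-6, 7-8) are an assumption and may differ from the grouping in the original table.

```python
from collections import Counter

pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]  # survey data from the text

# Ungrouped distribution: one class per distinct value.
ungrouped = Counter(pets)

# Grouped distribution: classes of width 2 (an assumed grouping).
classes = [(1, 2), (3, 4), (5, 6), (7, 8)]
grouped = {
    f"{low}-{high}": sum(low <= value <= high for value in pets)
    for low, high in classes
}

print("ungrouped:", dict(sorted(ungrouped.items())))
for label, frequency in grouped.items():
    print(f"{label}: {frequency}")
```

The grouped frequencies add up to 15, the number of families surveyed, which is a quick check that every observation landed in exactly one class.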

Guidelines for Constructing Frequency Distributions


1. All classes should be of the same width.

2. Classes should be set up so that they do not overlap and so that each piece of
data belongs to exactly one class.

3. List all classes, even those with zero frequencies.

4. There should be between 5 and 20 classes.

5. The classes are continuous.

• The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled
boundary.

• Table 2.3.4 gives a frequency distribution of the IQ test scores for 75 adults.

• IQ score is a quantitative variable and, according to Table 2.3.4, eight of the
individuals have an IQ score between 80 and 94, fourteen have scores between 95
and 109, twenty-four have scores between 110 and 124, sixteen have scores between
125 and 139 and thirteen have scores between 140 and 154.

• The frequency distribution given in Table 2.3.4 is composed of five classes. The
classes are: 80-94, 95-109, 110-124, 125-139 and 140-154. Each class has a lower
class limit and an upper class limit. The lower class limits for this distribution are
80, 95, 110, 125 and 140. The upper class limits are 94, 109, 124, 139 and 154.

• If the lower class limit for the second class, 95, is added to the upper class limit
for the first class, 94, and the sum divided by 2, the result, 94.5, is both the upper
boundary for the first class and the lower boundary for the second class. Table 2.3.5
gives all the boundaries for Table 2.3.4.

• If the lower class limit is added to the upper class limit for any class and the sum
divided by 2, the class mark for that class is obtained. The class mark for a class is
the midpoint of the class and is sometimes called the class midpoint rather than the
class mark.
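
The boundary and class mark rules can be sketched in code using the IQ classes and frequencies quoted in the text. The variable names are illustrative, and the unit of measurement is assumed to be one whole IQ point.

```python
# Class limits and frequencies as described for the IQ distribution in the text.
class_limits = [(80, 94), (95, 109), (110, 124), (125, 139), (140, 154)]
frequencies = [8, 14, 24, 16, 13]

unit = 1  # scores are recorded to the nearest whole number (assumption)

rows = []
for (lower, upper), frequency in zip(class_limits, frequencies):
    lower_boundary = lower - unit / 2   # half a unit below the lower limit
    upper_boundary = upper + unit / 2   # half a unit above the upper limit
    class_mark = (lower + upper) / 2    # midpoint of the class
    rows.append((lower, upper, lower_boundary, upper_boundary, class_mark, frequency))

print("limits    boundaries      mark  f")
for lower, upper, lb, ub, mark, f in rows:
    print(f"{lower}-{upper:<5} {lb}-{ub:<9} {mark:<5} {f}")
```

Each upper boundary equals the next class's lower boundary, so the classes become continuous with no gaps between them.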

Example 2.3.1: Following table gives the frequency distribution for the
cholesterol values of 45 patients in a cardiac rehabilitation study. Give the
lower and upper class limits and boundaries as well as the class marks for
each class.

• Solution: Below table gives the limits, boundaries and marks for the classes.

Example 2.3.2: The IQ scores for a group of 35 school dropouts are as follows:

a) Construct a frequency distribution for grouped data.

b) Specify the real limits for the lowest class interval in this frequency distribution.

• Solution: Calculating the class width

(123 - 69) / 10 = 54 / 10 = 5.4 ≈ 5

a) Frequency distribution for grouped data

b) Real limits for the lowest class interval in this frequency distribution = 64.5-69.5.

Example 2.3.3: Given below are the weekly pocket expenses (in Rupees) of a
group of 25 students selected at random.

37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35, 39, 37, 49, 27, 37, 33, 38, 49, 45,
44, 37, 36

Construct a grouped frequency distribution table with class intervals of equal
widths, starting from 25-30, 30-35 and so on. Also, find the range of weekly
pocket expenses.

Solution:

• In the given data, the smallest value is 26 and the largest value is 49. So, the
range of the weekly pocket expenses = 49-26=23.
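
The solution table is not reproduced in this copy, so the tally can be sketched in code from the 25 values given in the example. Because the stated classes share endpoints (25-30, 30-35), this sketch assumes each class includes its lower limit and excludes its upper limit; that convention is an assumption, not something the example states.

```python
expenses = [37, 41, 39, 34, 41, 26, 46, 31, 48, 32, 44, 39, 35,
            39, 37, 49, 27, 37, 33, 38, 49, 45, 44, 37, 36]

width = 5
start = 25

# Each class is read as [lower, upper): 30 belongs to 30-35, not 25-30
# (an assumed convention, since the stated classes share endpoints).
table = {}
for lower in range(start, max(expenses) + 1, width):
    upper = lower + width
    table[f"{lower}-{upper}"] = sum(lower <= x < upper for x in expenses)

value_range = max(expenses) - min(expenses)

for label, frequency in table.items():
    print(f"{label}: {frequency}")
print("range:", value_range)
```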

Outliers
• In statistics, an outlier is an observation point that is distant from other
observations.

• An outlier is a value that escapes normality and can cause anomalies in the results
obtained through algorithms and analytical systems. Therefore, outliers always need
some degree of attention.

• Understanding the outliers is critical in analyzing data for at least two aspects:

a) The outliers may negatively bias the entire result of an analysis;

b) The behavior of outliers may be precisely what is being sought.

• The simplest way to find outliers in data is to look directly at the data table, or
dataset, as data scientists call it. The case of the following table clearly exemplifies
a typing error, that is, an error in data input.

• The age field for the individual Antony Smith certainly does not represent an
age of 470 years. Looking at the table it is possible to identify the outlier, but it is
difficult to say which would be the correct age. There are several possibilities for
the right age, such as 47, 70 or even 40 years.
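
Looking at the table by eye can be mimicked with a simple plausibility check. This is only a sketch: the records other than the 470 entry are hypothetical placeholders, and the range of 0 to 120 years is an assumed rule of thumb, not a method given in the notes.

```python
# Hypothetical records; only the 470 entry mirrors the example in the text.
records = [
    {"name": "Person A", "age": 34},
    {"name": "Antony Smith", "age": 470},
    {"name": "Person B", "age": 52},
]

MIN_AGE, MAX_AGE = 0, 120  # assumed plausible range for a human age

# Flag every record whose age falls outside the plausible range.
suspects = [r for r in records if not MIN_AGE <= r["age"] <= MAX_AGE]

for r in suspects:
    print(f"check {r['name']}: age {r['age']} is outside {MIN_AGE}-{MAX_AGE}")
```

The check only flags the suspicious value. As the text notes, it cannot say whether the intended age was 47, 70 or 40.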

Relative and Cumulative Frequency Distribution


• Relative frequency distributions show the frequency of each class as a part or
fraction of the total frequency for the entire distribution. Frequency distributions
can show either the actual number of observations falling in each range or the
percentage of observations. In the latter instance, the distribution is called a
relative frequency distribution.

• To convert a frequency distribution into a relative frequency distribution, divide
the frequency for each class by the total frequency for the entire distribution.

• A relative frequency distribution lists the data values along with the percent of
all observations belonging to each group. These relative frequencies are calculated
by dividing the frequencies for each group by the total number of observations.

• Example: Suppose we take a sample of 200 Indian families and record the number
of people living in each household. We obtain the following:

Cumulative frequency:

• A cumulative frequency distribution can be useful for ordered data (e.g. data
arranged in intervals, measurement data, etc.). Instead of reporting frequencies, the
recorded values are the sum of all frequencies for values less than and including
the current value.

• Example: Suppose we take a sample of 200 Indian families and record the number
of people living in each household. We obtain the following:

• To convert a frequency distribution into a cumulative frequency distribution, add
to the frequency of each class the sum of the frequencies of all classes ranked
below it.
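
Both conversions can be sketched together. Since the household table is not reproduced in this copy, the sketch reuses the IQ class frequencies quoted earlier in the text (8, 14, 24, 16 and 13 for 75 adults); the variable names are illustrative.

```python
from itertools import accumulate

classes = ["80-94", "95-109", "110-124", "125-139", "140-154"]
frequencies = [8, 14, 24, 16, 13]   # IQ class frequencies quoted in the text

total = sum(frequencies)
relative = [f / total for f in frequencies]           # share of the total
cumulative = list(accumulate(frequencies))            # running sum of frequencies
cumulative_relative = [c / total for c in cumulative]

print("class     f   rel.f  cum.f  cum.rel.f")
for row in zip(classes, frequencies, relative, cumulative, cumulative_relative):
    label, f, rel, cum, cum_rel = row
    print(f"{label:<9} {f:<3} {rel:<6.3f} {cum:<6} {cum_rel:.3f}")
```

The relative frequencies sum to 1 and the last cumulative frequency equals the total number of observations, which are two quick checks on any table built this way.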

Frequency Distributions for Qualitative (Nominal) Data


• If every observation in a set is a word, numerical code or letter, then the data
are qualitative. Frequency distributions for qualitative data are easy to construct.

• It is possible to convert frequency distributions for qualitative variables into
relative frequency distributions.

• If measurement is ordinal because observations can be ordered from least to
most, cumulative frequencies can be used.

Describing Data with Averages

• Averages consist of numbers (or words) about which the data are, in some sense,
centered. They are often referred to as measures of central tendency. This topic is
also covered in section 1.12.1.

1.12.1 Measuring the Central Tendency

• There are various ways to measure the central tendency of data, including the
mean, weighted mean, trimmed mean, median, mode and midrange.

1. Mean :

• The mean of a data set is the average of all the data values. The sample mean
$\bar{x}$ is the point estimator of the population mean $\mu$.

Sample mean: $\bar{x} = \frac{\sum x_i}{n}$ (sum of the values of the n observations divided by the number of observations in the sample)

Population mean: $\mu = \frac{\sum x_i}{N}$ (sum of the values of the N observations divided by the number of observations in the population)

2. Median :

• The median of a data set is the value in the middle when the data items are
arranged in ascending order. Whenever a data set has extreme values, the median is
the preferred measure of central location.

• The median is the measure of location most often reported for annual income and
property value data. A few extremely large incomes or property values can inflate
the mean.

• For an odd number of observations:

7 observations = 26, 18, 27, 12, 14, 29, 19

Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29

• The median is the middle value.

Median = 19

• For an even number of observations :

8 observations = 26, 18, 29, 12, 14, 27, 30, 19

Numbers in ascending order = 12, 14, 18, 19, 26, 27, 29, 30

The median is the average of the middle two values.

Median = (19 + 26) / 2 = 22.5

3. Mode:

• The mode of a data set is the value that occurs with greatest frequency. The
greatest frequency can occur at two or more different values. If the data have
exactly two modes, the data are bimodal. If the data have more than two modes,
the data are multimodal.
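
The mean, median and mode can be checked with Python's standard `statistics` module. The two samples are the ones used in the median examples above; the `scores` list is a hypothetical bimodal data set added only to illustrate the mode.

```python
import statistics

odd_sample = [26, 18, 27, 12, 14, 29, 19]
even_sample = [26, 18, 29, 12, 14, 27, 30, 19]

mean_odd = statistics.mean(odd_sample)
median_odd = statistics.median(odd_sample)     # middle of the sorted values
median_even = statistics.median(even_sample)   # average of the two middle values

# Hypothetical scores with two equally frequent values (bimodal).
scores = [2, 3, 3, 5, 7, 7, 9]
modes = statistics.multimode(scores)

print("mean:", mean_odd)
print("medians:", median_odd, median_even)
print("modes:", modes)
```

`statistics.multimode` returns every value that ties for the greatest frequency, so a list with two entries indicates bimodal data.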

• Weighted mean : Sometimes, each value in a set may be associated with a
weight. The weights reflect the significance, importance or occurrence frequency
attached to their respective values.

• Trimmed mean : A major problem with the mean is its sensitivity to extreme (e.g.,
outlier) values. Even a small number of extreme values can corrupt the mean. The
trimmed mean is the mean obtained after cutting off values at the high and low
extremes.

• For example, we can sort the values and remove the top and bottom 2 % before
computing the mean. We should avoid trimming too large a portion (such as 20 %)
at both ends as this can result in the loss of valuable information.
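
A minimal sketch of the weighted mean and the trimmed mean. The function names, the marks with credit weights and the salary figures are all hypothetical, chosen only to show the effect of one extreme value; the 10 % trim is likewise an assumption for this small sample.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # number of values cut per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical marks weighted by credit hours.
marks, credits = [80, 90, 70], [3, 4, 1]
print("weighted mean:", weighted_mean(marks, credits))

# Hypothetical salaries (in thousands) with one extreme value.
salaries = [30, 31, 33, 35, 36, 38, 40, 41, 43, 400]
print("plain mean:   ", sum(salaries) / len(salaries))   # pulled up by 400
print("trimmed mean: ", trimmed_mean(salaries, 0.10))    # drops 30 and 400
```

With ten values, trimming 10 % removes one value from each end, so the single extreme salary no longer dominates the result.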

• Holistic measure is a measure that must be computed on the entire data set as a
whole. It cannot be computed by partitioning the given data into subsets and
merging the values obtained for the measure in each subset.

Describing Variability

• Variability, almost by definition, is the extent to which data points in a statistical
distribution or data set diverge from the average value, as well as the extent
to which these data points differ from each other. Variability refers to the
divergence of data from its mean value and is commonly used in the statistical and
financial sectors.

• The goal for variability is to obtain a measure of how spread out the scores are in
a distribution. A measure of variability usually accompanies a measure of central
tendency as basic descriptive statistics for a set of scores.

• Central tendency describes the central point of the distribution and variability
describes how the scores are scattered around that central point. Together, central
tendency and variability are the two primary values that are used to describe a
distribution of scores.

• Variability serves both as a descriptive measure and as an important component
of most inferential statistics. As a descriptive statistic, variability measures the
degree to which the scores are spread out or clustered together in a distribution.

• Variability can be measured with the range, the interquartile range and the
standard deviation/variance. In each case, variability is determined by measuring
distance.

Range

• The range is the total distance covered by the distribution, from the highest score
to the lowest score (using the upper and lower real limits of the range).

Range=Maximum value - Minimum value

Merits :

a) It is easy to compute.

b) It can be used as a measure of variability where precision is not
required.

Demerits :

a) Its value depends on only two scores.

b) It is not sensitive to the overall shape of the distribution.

Variance


• Variance is the expected value of the squared deviation of a random variable
from its mean. In short, it measures how far a set of numbers is spread out from
their average value. Variance is used in statistics as a way of better understanding
a data set's distribution.

• Variance is the square of the standard deviation. For a population, it is calculated
as the mean of the squared deviations from the mean:

σ² = Σ(X − μ)² / N

• In the formula above, μ represents the mean of the data points, X is the value of
an individual data point and N is the total number of data points.

• Data scientists often use variance to better understand the distribution of a data
set. Machine learning methods also rely on variance calculations when they
generalize from a data set. Variance is often used in conjunction with probability
distributions.

Standard Deviation
• Standard deviation is simply the square root of the variance. Standard deviation
measures the standard distance between a score and the mean.

Standard deviation=√Variance

• The standard deviation is a measure of how the values in data differ from one
another or how spread out data is. There are two types of variance and standard
deviation in terms of sample and population.

• The standard deviation measures how far the data points are from the mean and,
therefore, from each other. We can calculate it by subtracting the mean from each
data point, squaring the differences and then taking the mean of the squared
differences; this is called the variance. The square root of the variance gives us the
standard deviation.
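
The population and sample versions mentioned above can be compared with a short sketch. It uses Python's standard `statistics` module; the scores are hypothetical, and the n − 1 divisor for the sample versions is the usual convention rather than something spelled out in these notes.

```python
import statistics

# Hypothetical scores, chosen only so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Population versions: divide the sum of squared deviations by N.
pop_var = statistics.pvariance(scores)
pop_sd = statistics.pstdev(scores)

# Sample versions: divide by n - 1 instead of N.
sample_var = statistics.variance(scores)
sample_sd = statistics.stdev(scores)

print("population:", pop_var, pop_sd)
print("sample:", round(sample_var, 3), round(sample_sd, 3))
```

For the same scores the sample values come out slightly larger than the population values, because the sum of squared deviations is divided by a smaller number.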

• Properties of the Standard Deviation :


a) If a constant is added to every score in a distribution, the standard deviation will
not be changed.

b) The center of the distribution (the mean) changes, but the standard deviation
remains the same.

c) If each score is multiplied by a constant, the standard deviation will be
multiplied by the same constant.

d) Multiplying by a constant will multiply the distance between scores and because
the standard deviation is a measure of distance, it will also be multiplied.
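
Properties (a) to (d) can be checked numerically. The following is a minimal sketch with hypothetical scores and arbitrarily chosen constants (10 and 3); it uses the population standard deviation from Python's `statistics` module.

```python
import statistics

# Hypothetical scores; the constants 10 and 3 below are arbitrary choices.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

shifted = [x + 10 for x in scores]  # add a constant to every score
scaled = [x * 3 for x in scores]    # multiply every score by a constant

sd = statistics.pstdev(scores)
sd_shifted = statistics.pstdev(shifted)
sd_scaled = statistics.pstdev(scaled)

# Adding a constant moves the mean but leaves the spread unchanged.
print(statistics.mean(scores), statistics.mean(shifted))
print(sd, sd_shifted)

# Multiplying by 3 stretches every distance, so the SD is 3 times larger.
print(sd, sd_scaled)
```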

• If we are given numerical values for the mean and the standard deviation, we
should be able to construct a visual image (or a sketch) of the distribution of
scores. As a general rule, for a roughly normal (bell-shaped) distribution, about
68% (roughly 70%) of the scores will be within one standard deviation of the mean
and about 95% of the scores will be within a distance of two standard deviations
of the mean.
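
This rule of thumb can be illustrated with simulated data. The sketch below is not taken from the notes: it draws 10,000 roughly normal scores (the mean of 50, the standard deviation of 10 and the seed are arbitrary assumptions) and counts how many fall within one and two standard deviations of the mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated, roughly normal scores; mean 50 and SD 10 are arbitrary choices.
scores = [random.gauss(50, 10) for _ in range(10_000)]

mean = statistics.fmean(scores)
sd = statistics.pstdev(scores)


def share_within(k):
    """Fraction of scores lying within k standard deviations of the mean."""
    return sum(abs(x - mean) <= k * sd for x in scores) / len(scores)


print(f"within 1 SD: {share_within(1):.1%}")
print(f"within 2 SD: {share_within(2):.1%}")
```

The rule applies to bell-shaped data; a strongly skewed distribution can give noticeably different percentages.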

• The mean is a measure of position, but the standard deviation is a measure of
distance (on either side of the mean of the distribution).

• Standard deviation distances always originate from the mean and are expressed as
positive deviations above the mean or negative deviations below the mean.

• Sum of Squares (SS) for population, definition formula:

SS = Σ(X − μ)²

• Sum of Squares (SS) for population, computation formula:

SS = ΣX² − (ΣX)² / N

• Sum of Squares (SS) for sample, definition formula:

SS = Σ(X − X̄)²

• Sum of Squares (SS) for sample, computation formula:

SS = ΣX² − (ΣX)² / n


Example 2.8.1: The heights of animals are: 600 mm, 470 mm, 170 mm, 430
mm and 300 mm. Find out the mean, the variance and the standard deviation.

Solution:

Mean = (600 + 470 + 170 + 430 + 300) / 5

= 1970 / 5 = 394 mm

σ² = Σ(X − μ)² / N

Variance = [(600 − 394)² + (470 − 394)² + (170 − 394)² + (430 − 394)² + (300 − 394)²] / 5

Variance = (42436 + 5776 + 50176 + 1296 + 8836) / 5

Variance = 108520 / 5 = 21704 mm²

Standard deviation = √Variance = √21704

= 147.32 ≈ 147 mm
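
The arithmetic of Example 2.8.1 can be recomputed directly from the population variance formula. This is a small sketch in plain Python; the variable names are illustrative.

```python
import math

heights_mm = [600, 470, 170, 430, 300]  # data from Example 2.8.1

n = len(heights_mm)
mean = sum(heights_mm) / n

# Population variance: the mean of the squared deviations from the mean.
variance = sum((x - mean) ** 2 for x in heights_mm) / n
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 2))  # 394.0 21704.0 147.32
```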

Example 2.8.2: Using the computation formula for the sum of squares,
calculate the population standard deviation for the scores: 1, 3, 7, 2, 0, 4, 7, 3.

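Solution: Calculate the mean of the data.

Mean = (1 + 3 + 7 + 2 + 0 + 4 + 7 + 3) / 8 = 27 / 8 = 3.375

Using the computation formula, with ΣX = 27 and ΣX² = 1 + 9 + 49 + 4 + 0 + 16 + 49 + 9 = 137:

SS = ΣX² − (ΣX)² / N = 137 − (27)² / 8 = 137 − 91.125 = 45.875

The definition formula gives the same sum of squares:

SS = (1 − 3.375)² + (3 − 3.375)² + (7 − 3.375)² + (2 − 3.375)² + (0 − 3.375)² +
(4 − 3.375)² + (7 − 3.375)² + (3 − 3.375)²

= (−2.375)² + (−0.375)² + (3.625)² + (−1.375)² + (−3.375)² + (0.625)² + (3.625)² +
(−0.375)²

= 5.64 + 0.14 + 13.14 + 1.89 + 11.39 + 0.39 + 13.14 + 0.14

≈ 45.87 (45.875 before rounding)

Variance = SS / N = 45.875 / 8 ≈ 5.73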


The population standard deviation is the square root of the variance = √5.73 ≈ 2.39
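
As a cross-check on Example 2.8.2, the sketch below computes the sum of squares with both the definition formula and the computation formula and then takes the population standard deviation. The variable names are illustrative.

```python
import math

scores = [1, 3, 7, 2, 0, 4, 7, 3]  # data from Example 2.8.2
n = len(scores)
mean = sum(scores) / n

# Definition formula: SS = sum of (X - mean)^2
ss_definition = sum((x - mean) ** 2 for x in scores)

# Computation formula: SS = sum(X^2) - (sum(X))^2 / N
ss_computation = sum(x * x for x in scores) - sum(scores) ** 2 / n

# Population variance and standard deviation from the sum of squares.
variance = ss_computation / n
std_dev = math.sqrt(variance)

print(ss_definition, ss_computation)
print(round(variance, 2), round(std_dev, 2))
```

Both formulas give the same sum of squares; the computation formula avoids working with the fractional deviations, which is convenient when the mean is not a whole number.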

The Interquartile Range

• The interquartile range (IQR) is the distance covered by the middle 50% of the
distribution, that is, the difference between the third and first quartiles:
IQR = Q3 − Q1.

• Fig. 2.8.1 shows IQR.

• The first quartile, denoted Q1, is the value in the data set that holds 25% of the
values below it. The third quartile, denoted Q3, is the value in the data set that
holds 25% of the values above it.

Example 2.8.3: Determine the values of the range and the IQR for the
following sets of data.

(a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

(b) Residence changes: 1, 3, 4, 1, 0, 2, 5, 8, 0, 2, 3, 4, 7, 11, 0, 2, 3, 4


Solution:

a) Retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63

Range = Max number - Min number = 70-45

Range = 25

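IQR:

Step 1: Arrange the given numbers from lowest to highest.

45, 55, 60, 60, 63, 63, 63, 63, 65, 65, 70

Step 2: Find the median. With 11 values, the median is the 6th value: Median = 63.

Step 3: Find the quartiles. Q1 is the median of the five values below the median
(45, 55, 60, 60, 63) and Q3 is the median of the five values above it
(63, 63, 65, 65, 70).

Q1 = 60, Q3 = 65

IQR = Q3 − Q1 = 65 − 60 = 5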
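
The range and IQR for part (a) can be reproduced with a short sketch. The helper `quartiles_by_halves` is a hypothetical name, and it assumes the median-of-halves convention (the overall median is left out of both halves when the count is odd). Library routines that interpolate between values may report slightly different quartiles for the same data.

```python
import statistics


def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two values. When the number of values is odd, the
    overall median is excluded from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]    # values below the median position
    upper = data[-half:]   # values above the median position
    return statistics.median(lower), statistics.median(upper)


ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]  # Example 2.8.3 (a)

value_range = max(ages) - min(ages)
q1, q3 = quartiles_by_halves(ages)
iqr = q3 - q1

print(value_range, q1, q3, iqr)  # 25 60 65 5
```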
