CHENNAI INSTITUTE OF TECHNOLOGY
(Affiliated to Anna University, Approved by AICTE, Accredited by NAAC & NBA)
Sarathy Nagar, Kundrathur, Chennai – 600069, India.
Lecture Notes Unit II
Department of Artificial Intelligence and Data Science
Subject: Fundamentals of Data Science
Dr.R. Ponnusamy
Course: II B.Tech. III Sem
1. What is statistics?
Statistics is the science concerned with developing and studying methods for collecting,
analyzing, interpreting and presenting empirical data, especially for the purpose of inferring
proportions in a whole from those in a representative sample. Statistics is a highly
interdisciplinary field. In developing methods and studying the theory that underlies them,
statisticians draw on a variety of mathematical and computational tools.
2. What is descriptive statistics?
Descriptive statistics summarize and organize characteristics of a data set. A data set is a
collection of responses or observations from a sample or entire population.
In quantitative research, after collecting data, the first step of statistical analysis is to describe
characteristics of the responses, such as the average of one variable (e.g., age), or the relation
between two variables (e.g., age and creativity).
Example:
A graph showing the annual change in global temperature during the last 30 years.
A report that describes the average difference in grade point average (GPA) between
college students who regularly drink alcoholic beverages and those who don’t.
3. What is inferential statistics?
Inferential statistics help you decide whether your data confirm or refute your hypothesis and
whether the result is generalizable to a larger population. This more advanced area of statistics
supplies a variety of tests and estimates for generalizing beyond collections of actual observations.
Tools from inferential statistics permit us to use a relatively small collection of actual observations to
evaluate claims about a larger population.
Example:
• A pollster’s claim that a majority of all U.S. voters favor stronger gun control laws.
• An assertion about the relationship between job satisfaction and overall happiness.
4. Explain the following terms. Population, Sample, Random Sampling, Random Assignment.
A population refers to any complete collection of observations or potential observations.
A sample refers to any smaller collection of actual observations drawn from a population.
Random sampling is a procedure designed to ensure that each potential observation in the population
has an equal chance of being selected in a survey.
Random assignment signifies that each person has an equal chance of being assigned to any
group in an experiment.
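The difference between random sampling and random assignment can be sketched in a few lines of Python. This is only an illustration: the population of 100 numbered IDs, the sample size of 10, and the seed are all assumptions made for the example.

```python
import random

# Hypothetical population: ID numbers for 100 potential observations.
population = list(range(1, 101))

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Random sampling: every member of the population has an equal
# chance of being selected.
sample = rng.sample(population, k=10)

# Random assignment: shuffle the sampled subjects, then split them into
# two groups, so each subject has an equal chance of landing in either.
subjects = sample[:]
rng.shuffle(subjects)
treatment, control = subjects[:5], subjects[5:]

print("sample:   ", sample)
print("treatment:", treatment)
print("control:  ", control)
```

Sampling decides who enters the study; assignment decides which group each selected subject joins.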
5. Different types of data in statistics?
Numerical data / Quantitative data
These data have meaning as a measurement, such as a person’s height, weight, IQ, or blood
pressure; or they’re a count, such as the number of stock shares a person owns, how many teeth
a dog has, or how many pages you can read of your favorite book before you fall asleep.
Numerical data can be further broken into two types: discrete and continuous.
Discrete data represent items that can be counted; they take on possible values that can be listed
out. The list of possible values may be fixed (also called finite); or it may go from 0, 1, 2, on
to infinity (making it countably infinite). For example, the number of heads in 100 coin flips
takes on values from 0 through 100 (finite case), but the number of flips needed to get 100
heads takes on values from 100 (the fastest scenario) on up to infinity (if you never get to that
100th heads). Its possible values are listed as 100, 101, 102, 103 . . . (representing the countably
infinite case).
Continuous data represent measurements; their possible values cannot be counted and can
only be described using intervals on the real number line. For example, the exact amount of
gas purchased at the pump for cars with 20-gallon tanks would be continuous data from 0
gallons to 20 gallons, represented by the interval [0, 20], inclusive. You might pump 8.40
gallons, or 8.41, or 8.414863 gallons, or any possible number from 0 to 20. In this way,
continuous data can be thought of as being uncountably infinite. For ease of recordkeeping,
statisticians usually pick some decimal place at which to round off.
Categorical data / Qualitative Data
Categorical data represent characteristics such as a person’s gender, marital status, hometown,
or the types of movies they like. Categorical data can take on numerical values (such as “1”
indicating male and “2” indicating female), but those numbers don’t have mathematical
meaning. You couldn’t add them together, for example. (Other names for categorical data are
qualitative data, or Yes/No data.)
Ordinal data / Ranked Data
Ordinal data mixes numerical and categorical data. The data fall into categories, but the
numbers placed on the categories have meaning. For example, rating a restaurant on a scale
from 0 (lowest) to 4 (highest) stars gives ordinal data.
The distinctive property of ordinal measurement is order. Comparatively speaking, a
first-place finish reflects the fastest finish in a horse race or the highest GPA among
graduating seniors.
6. Give example for measurement of different data types?
7. Explain the different types of variables?
Variable: A variable is a characteristic or property that can take on different values.
Constant: A constant takes only one value.
Discrete Variable: A discrete variable consists of isolated numbers separated by gaps.
Examples: the number of children in a family; the number of foreign countries visited.
Continuous variable: A continuous variable consists of numbers whose values, at least in
theory, have no restrictions. Example: the grade of a school child.
Confounding Variable: An uncontrolled variable that compromises the interpretation of a
study is known as a confounding variable.
8. Differentiate discrete variable and continuous variable.
9. What is approximate number?
Approximate numbers are numbers that are rounded off, as is always the case with values for
continuous variables.
10. Explain the frequency distribution of data sets.
A frequency distribution is a collection of observations produced by sorting observations
into classes and showing their frequency of occurrence in each class.
A frequency distribution helps us to detect any pattern in the data (assuming a pattern
exists) by superimposing some order on the inevitable variability among observations.
• It can be represented as graph or table.
• Example - (Frequency Distributions)
11. What is Statistical Experiment?
Statisticians gather information through observational studies and experiments.
Observational studies observe and measure specific characteristics without modifying the
subjects under study. In contrast, a statistical experiment applies a treatment to the subjects
to see if a causal relationship exists. (Treatments can also be called factors, but this can be
confusing because latent variables in factor analysis are also called factors.)
Statistical experiments are designed to compare the outcomes of applying one or more
treatments to experimental units, then comparing the results to a control group that does
not receive a treatment.
12. Explain the Guidelines for frequency Distributions.
13. Frequency Distributions for Quantitative Data.
A frequency distribution is a collection of observations produced by sorting observations into
classes and showing their frequency (f) of occurrence in each class.
When observations are sorted into classes of single values, as in Table 2.1, the result is referred
to as a frequency distribution for ungrouped data.
• A frequency distribution for grouped data is produced whenever observations are sorted
into classes of more than one value.
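The two kinds of frequency distribution can be sketched in Python as follows. The scores and the class width of 5 are hypothetical values chosen for illustration; they are not taken from Table 2.1.

```python
from collections import Counter

# Hypothetical scores, invented for illustration only.
scores = [3, 5, 5, 7, 8, 8, 8, 10, 12, 12, 14, 15]

# Ungrouped data: each class is a single value.
ungrouped = Counter(scores)

# Grouped data: each class spans more than one value (assumed width of 5).
width = 5
grouped = Counter((x // width) * width for x in scores)

print("value  f")
for value in sorted(ungrouped):
    print(f"{value:5d}  {ungrouped[value]}")

print("class  f")
for lower in sorted(grouped):
    print(f"{lower}-{lower + width - 1}  {grouped[lower]}")
```

In both tables the frequencies add up to the total number of observations, which is a quick check that no score was dropped or counted twice.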
14. What is Real Limits of Class Intervals?
The real limits of a class interval are located at the midpoint of the gap between adjacent
tabled boundaries, that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled boundary.
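As a small illustration of this rule, the function below computes real limits from tabled boundaries. The function name `real_limits` and the class interval 140-149 are assumptions made for this sketch.

```python
def real_limits(lower, upper, unit=1):
    """Return the real limits of a class interval.

    lower, upper: the tabled boundaries of the class.
    unit: the unit of measurement (1 for whole numbers).
    """
    half = unit / 2
    return lower - half, upper + half

# Hypothetical class interval 140-149, measured in whole units.
print(real_limits(140, 149))   # (139.5, 149.5)
```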
15. How to construct the frequency distribution tables?
16. What is an Outlier? Give an example.
An outlier is a very extreme score. For example, consider the sales in a store over one week in
which the sales on Mar 25, 2019 reach an extreme value. That observation is an outlier.
17. What are all the ways to handle the outlier?
Once an outlier has been found, the following actions can be taken.
• Check for Accuracy
Whenever you encounter an outrageously extreme value, such as a GPA of 0.06,
attempt to verify its accuracy.
For instance, was a respectable GPA of 3.06 recorded erroneously as 0.06?
If the outlier survives an accuracy check, it should be treated as a legitimate score.
• Might Exclude from Summaries
You might choose to segregate (but not to suppress!) an outlier from any summary of
the data.
For example, you might relegate it to a footnote instead of using excessively wide
class intervals in order to include it in a frequency distribution.
• Might Enhance Understanding
Insofar as a valid outlier can be viewed as the product of special circumstances, it
might help you to understand the data.
18. What is Relative Frequency Distributions? Give an example.
A relative frequency distribution shows the proportion of the total number of observations
associated with each value or class of values and is related to a probability distribution, which
is extensively used in statistics.
Relative frequency distributions show the frequency of each class as a part or fraction of the
total frequency for the entire distribution.
The weights of different persons, with their frequencies and relative frequencies, are shown in
the following table.
19. How to convert the frequency into relative frequency table?
To convert a frequency distribution into a relative frequency distribution, divide the frequency
for each class by the total frequency for the entire distribution.
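The conversion rule can be sketched in Python. The class intervals and frequencies below are hypothetical and are not the weight data referred to in these notes.

```python
# Hypothetical frequency distribution: class interval -> frequency (f).
freq = {"130-139": 3, "140-149": 6, "150-159": 8, "160-169": 3}

# Total frequency for the entire distribution.
total = sum(freq.values())

# Divide the frequency for each class by the total frequency.
rel_freq = {cls: f / total for cls, f in freq.items()}

for cls, p in rel_freq.items():
    print(f"{cls}  f={freq[cls]:2d}  relative f={p:.2f}")
```

Apart from rounding, the relative frequencies always sum to 1.00, which is a useful check on the arithmetic.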
20. What is Cumulative Frequency Distributions? Give an example.
A frequency distribution showing the total number of observations in each class and all lower-
ranked classes.
The weights of different persons, with their frequencies, cumulative frequencies, and
cumulative percents, are shown in the following table.
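A cumulative frequency column is a running total taken from the lowest-ranked class upward. The sketch below uses hypothetical classes and frequencies, not the weight data referred to above.

```python
from itertools import accumulate

# Hypothetical classes, listed from lowest to highest, with frequencies.
classes = ["130-139", "140-149", "150-159", "160-169"]
freqs = [3, 6, 8, 3]

# Running totals: each entry counts its own class and all lower classes.
cum_freqs = list(accumulate(freqs))
total = cum_freqs[-1]

# Cumulative percent: cumulative frequency as a percentage of the total.
cum_percents = [100 * cf / total for cf in cum_freqs]

for cls, f, cf, cp in zip(classes, freqs, cum_freqs, cum_percents):
    print(f"{cls}  f={f:2d}  cum f={cf:2d}  cum %={cp:5.1f}")
```

The last cumulative frequency always equals the total number of observations, and the last cumulative percent is always 100.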
21. What is Percentile Rank of an Observation?
The percentile rank of a score indicates the percentage of scores in the entire distribution
with similar or smaller values than that score.
The percentile rank of a score (PR) is the percentage of scores in its frequency distribution that
are less than that score.[1] Its mathematical formula is
$$PR = \frac{CF - 0.5 \times F}{N} \times 100$$
where CF, the cumulative frequency, is the count of all scores less than or equal to the score
of interest, F is the frequency for the score of interest, and N is the number of scores in the
distribution.
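A minimal sketch of a percentile rank calculation, assuming the commonly cited formula PR = (CF − 0.5 × F) / N × 100 with CF, F, and N as defined above. The function name and the scores are hypothetical.

```python
def percentile_rank(scores, score):
    """Percentile rank, assuming PR = (CF - 0.5 * F) / N * 100.

    CF counts scores less than or equal to `score`,
    F counts scores equal to `score`, and N is the number of scores.
    """
    n = len(scores)
    cf = sum(1 for x in scores if x <= score)
    f = sum(1 for x in scores if x == score)
    return 100 * (cf - 0.5 * f) / n

# Hypothetical scores, invented for illustration.
scores = [1, 2, 2, 3, 4, 4, 4, 5, 6, 7]

# For a score of 4: CF = 7, F = 3, N = 10.
print(percentile_rank(scores, 4))   # 55.0
```

Subtracting half of F counts only half of the tied scores as falling below the score of interest.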
22. What is cumulative frequency for distributions for nominal data?
Frequency distributions for qualitative data are easy to construct. Simply determine the
frequency with which observations occupy each class, and report these frequencies as shown
in Table 2.7 for the Facebook profile survey. This frequency distribution reveals that Yes replies
are approximately twice as prevalent as No replies.
Nominal-level variables can be displayed as frequency tables, but you should only include raw
and relative frequencies (cumulative frequency and cumulative percent are inappropriate for
use with nominal-level data).
23. Explain the Relative and Cumulative Distributions for Qualitative Data, with example.
Frequency distributions for qualitative variables can always be converted into relative
frequency distributions, as illustrated in Table 2.8. Furthermore, if measurement is ordinal
because observations can be ordered from least to most, cumulative frequencies (and
cumulative percentages) can be used.
24. What is histogram? Give example.
A histogram is a bar-type graph for quantitative data. The common boundaries between adjacent
bars emphasize the continuity of the data, as with continuous variables.
25. What is Frequency Polygon?
A frequency polygon is a line graph for quantitative data that also emphasizes the continuity of
continuous variables.
26. List the steps to convert the given histogram into frequency polygon.
Steps in constructing the frequency polygon
(a) Construct a histogram.
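(b) Construct a frequency polygon by placing a dot at the midpoint of the top of each bar and
connecting the dots with straight lines.
(c) Assess whether the distribution is balanced or lopsided.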
27. What is Stem and Leaf Displays?
A stem and leaf display is a device for sorting quantitative data on the basis of leading and
trailing digits. Stem and leaf displays are ideal for summarizing distributions, such as that for
weight data, without destroying the identities of individual observations. An example is shown
in the following table.
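A stem and leaf display can be built by splitting each score into its leading digits (the stem) and its trailing digit (the leaf). The sketch below uses hypothetical weights in pounds, not the weight data from these notes.

```python
from collections import defaultdict

# Hypothetical weights in pounds, invented for illustration.
weights = [152, 135, 148, 160, 157, 143, 139, 165, 150, 172, 158, 146]

stems = defaultdict(list)
for w in sorted(weights):
    stem, leaf = divmod(w, 10)   # leading digits, trailing digit
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem:2d} | {leaves}")
```

Every original score can be read back from the display (for example, stem 15 with leaf 7 is 157), which is what is meant by preserving the identities of individual observations.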
28. Explain the typical shapes of a frequency distribution shown as a polygon or histogram.
The typical shapes of a frequency polygon or histogram are shown in the following figure.
Normal: The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions.
Bimodal: Reflects the coexistence of two different types of observations in the same distribution.
Positively skewed: A lopsided distribution caused by a few extreme observations in the positive direction.
Negatively skewed: A lopsided distribution caused by a few extreme observations in the negative direction.
29. How can nominal data be represented as a graph? Give an example.
A bar graph is a bar-type graph for qualitative data. Gaps between adjacent bars emphasize the
discontinuous nature of the data.
30. How can data be misrepresented when drawing a graph?
A graph can be drawn in a number of ways that misrepresent the data and mislead the reader.
Some of these ways are given below with graphs.
• The width of the Yes bar is more than three times that of the No bar, thus violating the
custom that bars be equal in width.
• The lower end of the frequency scale is omitted, thus violating the custom that the entire
scale be reproduced, beginning with zero.
• The height of the vertical axis is several times the width of the horizontal axis, thus
violating the custom, heretofore unmentioned, that the vertical axis be approximately
as tall as the horizontal axis is wide.
31. Explain the procedure for constructing graph.
32. What is the difference between the sample and population?
A population is the entire group that you want to draw conclusions about. A sample is the
specific group that you will collect data from. The size of the sample is always less than the
total size of the population.
33. What is mode? Give example.
The mode reflects the value of the most frequently occurring score; that is, the mode is the
value that appears most often in a set of data. The mode of a discrete probability distribution
is the value x at which its probability mass function takes its maximum value.
In this example there are two modes, 4 and 8: 4 occurs 7 times and 8 occurs 6 times. Such a
distribution is known as bimodal. Bimodal describes any distribution with two obvious peaks.
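Finding the mode amounts to counting how often each value occurs. The sketch below uses hypothetical scores and the standard library function `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest frequency.

```python
from collections import Counter
from statistics import multimode

# Hypothetical scores in which 4 and 8 each occur three times.
scores = [2, 4, 4, 4, 5, 7, 8, 8, 8, 9]

counts = Counter(scores)
modes = multimode(scores)   # all values tied for the highest frequency

print(counts.most_common(2))   # [(4, 3), (8, 3)]
print(modes)                   # [4, 8]
```

Note that `multimode` reports two modes only for an exact tie. In a case where one peak is slightly higher than the other (7 occurrences against 6), it would return the single most frequent value, even though the distribution would still be described as bimodal because of its two obvious peaks.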
34. What is median? Give example.
The median reflects the middle value when observations are ordered from least to most.
35. How to compute or find median?
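One common way to find the median is sketched below, as an illustration rather than as the procedure prescribed in these notes: order the scores, then take the middle score if the number of scores is odd, or the average of the two middle scores if it is even. The function name and the data are hypothetical.

```python
from statistics import median

def find_median(scores):
    """Return the middle value of the scores after ordering them."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd: single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2   # even: average of two

odd = [7, 3, 9, 1, 5]        # ordered: 1, 3, 5, 7, 9
even = [7, 3, 9, 1, 5, 11]   # ordered: 1, 3, 5, 7, 9, 11

print(find_median(odd))    # 5
print(find_median(even))   # 6.0
```

The standard library function `statistics.median` gives the same results and can be used instead of the hand-written version.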
36. How to compute mean value?
The mean is found by adding all scores and then dividing by the number of scores.
37. What is the difference between the population mean and sample mean?
Sample Mean (X̄): The balance point for a sample, found by dividing the sum for the values of
all scores in the sample by the number of scores in the sample.
Population Mean (μ): The balance point for a population, found by dividing the sum for all
scores in the population by the number of scores in the population.
Population Size (N): The total number of scores in the population.
38. What is the purpose and nature of mean?
The mean serves as the balance point for its distribution because of a special property:
The sum of all scores, expressed as positive and negative deviations from the mean, always
equals zero.
The mean reflects the values of all scores, not just those that are middle ranked (as with the
median), or those that occur most frequently.
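The balance-point property can be checked numerically. The scores below are hypothetical, chosen so that the arithmetic is exact.

```python
# Hypothetical scores, invented for illustration.
scores = [2, 4, 6, 8, 10]

# Mean: add all scores, then divide by the number of scores.
mean = sum(scores) / len(scores)

# Deviations from the mean: negative below it, positive above it.
deviations = [x - mean for x in scores]

print(mean)              # 6.0
print(deviations)        # [-4.0, -2.0, 0.0, 2.0, 4.0]
print(sum(deviations))   # 0.0
```

With means that are not whole numbers, floating-point rounding can leave a sum that is extremely close to zero rather than exactly zero.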
39. How to Interpret the Differences between Mean and Median?
Ideally, when a distribution is skewed, report both the mean and the median.
Appreciable differences between the values of the mean and median signal the presence of a
skewed distribution.
If the mean exceeds the median, as it does for the infant death rates, the underlying
distribution is positively skewed because of one or more scores with relatively large values,
such as the very high infant death rates for a number of countries, especially Sierra Leone.
On the other hand, if the median exceeds the mean, the underlying distribution is negatively
skewed because of one or more scores with relatively small values.
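The comparison of mean and median can be sketched as follows. The helper name `skew_direction` and both data sets are hypothetical, and the comparison is only a signal of skewness, not a formal measure of it.

```python
from statistics import mean, median

def skew_direction(scores):
    """Suggest the direction of skew by comparing mean and median."""
    m, md = mean(scores), median(scores)
    if m > md:
        return "positively skewed"
    if m < md:
        return "negatively skewed"
    return "no skew indicated"

# Hypothetical scores with one relatively large value.
high_tail = [2, 3, 3, 4, 4, 5, 30]
# Hypothetical scores with one relatively small value.
low_tail = [1, 26, 27, 27, 28, 28, 29]

print(skew_direction(high_tail))   # positively skewed
print(skew_direction(low_tail))    # negatively skewed
```

In the first data set the single score of 30 pulls the mean above the median of 4. In the second, the single score of 1 pulls the mean below the median of 27.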
Special Status of the Mean
As has been seen, the mean sometimes fails to describe the typical or middle-ranked value of
a distribution.
Therefore, it should be used in conjunction with another average, such as the median.
In the long run, however, the mean is the single most preferred average for quantitative data.
If Distribution Is Not Skewed
When a distribution of scores is not too skewed, the values of the mode, median, and mean
are similar, and any of them can be used to describe the central tendency of the distribution.
40. Explain the averages for qualitative and Ranked Data.
Always Appropriate for Qualitative Data
The mode always can be used with qualitative data.
Median Sometimes Appropriate
The median can be used whenever it is possible to order qualitative data from least to most
because the level of measurement is ordinal.
It’s easiest to determine the median class for ordered qualitative data by using relative
frequencies, as in Table 3.5.
Inappropriate Averages
Mean cannot be used with qualitative data.
It would not be appropriate to report a median for unordered qualitative data with nominal
measurement, such as the ancestries of Americans.
Nor would it be appropriate to report a mean for any qualitative data.
Averages for Ranked Data
When the data consist of a series of ranks, with its ordinal level of measurement, the median
rank always can be obtained. It’s simply the middlemost or average of the two middlemost
ranks.
41. Explain the variability and its importance.
Describing Variability
When summarizing a set of data, we specify not only measures of central tendency, such
as the mean, but also measures of variability, that is, measures of the amount by which
scores are dispersed or scattered in a distribution.
In Figure 4.1, each of the three frequency distributions (A, B, and C) consists of seven scores
with the same mean (10) but with different variability.
Your intuition was correct if you concluded that distribution A has the least variability,
distribution B has intermediate variability, and distribution C has the most variability.
For distribution A with the least (zero) variability, all seven scores have the same value (10).
For distribution B with intermediate variability, the values of scores vary slightly (one 9 and
one 11).
For distribution C with most variability, they vary even more (one 7, two 9s, two 11s, and
one 13).
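Therefore, variability is an important characteristic of a distribution: distributions with the
same mean can differ considerably in how widely their scores are spread.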
Importance of Variability
Variability assumes a key role in an analysis of research results.
Does fitness training improve, on average, the scores of depressed patients on a mental-
wellness test?
To answer this question, depressed patients are randomly assigned to two groups, fitness
training is given to one group, and wellness scores are obtained for both groups.
Let’s assume that the mean wellness score is larger for the group with fitness training. Is
the observed mean difference between the two groups real or merely transitory?
This decision depends not only on the size of the mean difference between the two groups
but also on the inevitable variability of individual scores within each group.
Figure 4.2 shows the outcomes for two experiments, each with the same mean difference of 2,
but with the two groups in experiment B having less variability than the two groups in
experiment C.
Notice that groups B and C in Figure 4.2 are the same as their counterparts in Figure 4.1.
Although the new group B* retains exactly the same (intermediate) variability as group B,
each of its seven scores and its mean have been shifted 2 units to the right.
Likewise, although the new group C* retains exactly the same (most) variability as group
C, each of its seven scores and its mean have been shifted 2 units to the right.
Consequently, the crucial mean difference of 2 (from 12 − 10 = 2) is the same for both
experiments.
Briefly, the relatively smaller variability within groups in experiment B translates into more
statistical stability for the observed mean difference of 2 when it is viewed as just one
outcome among many possible outcomes for repeat experiments.
On the other hand, the relatively larger variability within groups in experiment C translates
into less statistical stability for the observed mean difference of 2 when it is viewed as just
one outcome among many possible outcomes for repeat experiments.
42. What is a range?
The range is the difference between the largest and smallest scores.
In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to 10);
distribution B, the moderately variable, has an intermediate range of 2 (from 9 to 11); and
distribution C, the most variable, has the largest range of 6 (from 7 to 13), in agreement with
our intuitive judgments about differences in variability.
The range is a handy measure of variability that can readily be calculated and understood.
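The three ranges quoted above can be reproduced with a short sketch. The scores are reconstructed from the description of distributions A, B, and C in these notes. The 10s that fill each distribution out to seven scores are an inference from the stated mean of 10, not values read from Figure 4.1.

```python
# Scores reconstructed from the description in these notes; the filler
# 10s are inferred from the stated mean of 10 and count of seven scores.
distributions = {
    "A": [10, 10, 10, 10, 10, 10, 10],
    "B": [9, 10, 10, 10, 10, 10, 11],
    "C": [7, 9, 9, 10, 11, 11, 13],
}

def value_range(scores):
    """Range: the difference between the largest and smallest scores."""
    return max(scores) - min(scores)

ranges = {name: value_range(s) for name, s in distributions.items()}
print(ranges)   # {'A': 0, 'B': 2, 'C': 6}
```

Because the range depends only on the two most extreme scores, it ignores how the remaining scores are spread between them.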
43. What are all the methods for computing variability?
Three measures serve as valid measures of variability: the first is the interquartile range
(IQR), the second is the variance, and the third is the standard deviation.
The most important roles are reserved for the variance and particularly for its square root, the
standard deviation, because these measures serve as key components for other important
statistical measures. The variance and standard deviation occupy the same exalted position
among measures of variability as does the mean among measures of central tendency.
44. What is variance, how it is measured/computed/calculated?
The variance is the mean of all squared deviation scores.
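This definition can be followed step by step in Python. The scores are those described for distribution C in these notes (one 7, two 9s, two 11s, one 13), with a seventh score of 10 inferred from the stated mean of 10. Dividing by the number of scores treats them as a complete population, which is an assumption of this sketch.

```python
# Distribution C as described in these notes; the score of 10 is inferred
# from the stated mean of 10 and the count of seven scores.
scores = [7, 9, 9, 10, 11, 11, 13]

n = len(scores)
mean = sum(scores) / n

# Deviation scores, squared so that negatives and positives do not cancel.
squared_deviations = [(x - mean) ** 2 for x in scores]

# Variance: the mean of all squared deviation scores.
variance = sum(squared_deviations) / n

print(mean)                 # 10.0
print(squared_deviations)   # [9.0, 1.0, 1.0, 0.0, 1.0, 1.0, 9.0]
print(round(variance, 2))   # 3.14
```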
45. Explain the standard deviation, its nature, and the different methods of calculating it, with a
suitable example in detail.
Standard Deviation - A rough measure of the average (or standard) amount by which scores
deviate on either side of their mean.
It tells how the values are spread across the data sample and it is the measure of the variation
of the data points from the mean.
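For most frequency distributions, a majority (often as many as 68 percent) of all scores are
within one standard deviation on either side of the mean, and a small minority (often as few as
5 percent) of all scores deviate more than two standard deviations on either side of the mean.
These two generalizations about the majority and minority of scores are independent of the
particular shape of the distribution.
They apply to both the balanced distribution of IQ scores and the positively skewed
distribution of study times.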
In fact, the balanced distribution of IQ scores approximates an important theoretical
distribution, the normal distribution.
The mean is a measure of position, but the standard deviation is a measure of distance (on
either side of the mean).
A mean (X̄) of 169.51 lbs has a particular position or location along the horizontal axis: it is
located at the point, and only at the point, corresponding to 169.51 lbs.
On the other hand, the standard deviation (s) of 23.33 lbs for the same distribution has no
particular location along the horizontal axis.
Using the standard deviation as a measure of distance on either side of the mean, we could
describe one person’s weight as two standard deviations above the mean, X̄ + 2s, another
person’s weight as two-thirds of one standard deviation below the mean, X̄ − (2/3)s, and so on.
Sum of Squares (SS) - The sum of squared deviation scores.
Population Standard Deviation (σ) - A rough measure of the average amount by which
scores in the population deviate on either side of their population mean.
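Sample Standard Deviation (s) - A rough measure of the average amount by which scores in
the sample deviate on either side of their sample mean.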
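The definitions above can be written compactly as formulas. The block below is a sketch in conventional notation that these notes do not spell out: X stands for a score, μ and N for the population mean and size, and X̄ and n for the sample mean and size. The sample formulas divide by n − 1 rather than n, in line with the degrees of freedom discussed in these notes.

```latex
\begin{align*}
\text{Population:}\quad SS &= \sum (X - \mu)^2, &
\sigma^2 &= \frac{SS}{N}, &
\sigma &= \sqrt{\frac{SS}{N}} \\
\text{Sample:}\quad SS &= \sum (X - \bar{X})^2, &
s^2 &= \frac{SS}{n - 1}, &
s &= \sqrt{\frac{SS}{n - 1}}
\end{align*}
```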
46. What is degree of freedom?
Degrees of Freedom (df) - The number of values that are free to vary, given one or more
mathematical restrictions, in a sample being used to estimate a population characteristic.
As has been noted, when n deviations about the sample mean are used to estimate variability
in the population, only n − 1 are free to vary.
As a result, there are only n − 1 degrees of freedom, that is, df = n − 1. One df is lost because
of the zero-sum restriction.
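The zero-sum restriction, and its effect on the sample standard deviation, can be sketched as follows. The sample is hypothetical, and the division by n − 1 follows the standard sample formula, which `statistics.stdev` also uses.

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample, invented for illustration.
sample = [7, 9, 9, 10, 11, 11, 13]

n = len(sample)
x_bar = sum(sample) / n
deviations = [x - x_bar for x in sample]

# Zero-sum restriction: once n - 1 deviations are known, the last one
# is fixed, because all n deviations must sum to zero.
last_deviation = -sum(deviations[:-1])
print(last_deviation, deviations[-1])   # 3.0 3.0

# Sample standard deviation: divide the sum of squares by df = n - 1.
ss = sum(d ** 2 for d in deviations)
df = n - 1
s = sqrt(ss / df)

print(round(s, 4))               # 1.9149
print(round(stdev(sample), 4))   # 1.9149
```

Only six of the seven deviations carry independent information, which is why the sum of squares is divided by 6 rather than 7.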
47. What is Interquartile Range (IQR) ?
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores.
More specifically, the IQR equals the distance between the third quartile (or 75th percentile)
and the first quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent)
and the lowest quarter (or bottom 25 percent) have been trimmed from the original set of
scores.
48. Explain the method of computation of IQR with an example.
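A minimal sketch of an IQR calculation, using hypothetical scores. Quartile conventions differ between textbooks and software. This example relies on the default "exclusive" method of the standard library function `statistics.quantiles` (Python 3.8 or later), which is an assumption and may not match the procedure intended in these notes.

```python
from statistics import quantiles

# Hypothetical scores, invented for illustration.
scores = [7, 9, 9, 10, 11, 11, 13]

# Cut points dividing the ordered scores into four equal parts.
q1, q2, q3 = quantiles(scores, n=4)

# IQR: the distance between the third and first quartiles.
iqr = q3 - q1

print(q1, q2, q3)   # 9.0 10.0 11.0
print(iqr)          # 2.0
```

Here the full range is 6 (from 7 to 13), but the IQR is only 2, because the extreme scores 7 and 13 fall outside the middle 50 percent.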