Chapter 1 - Introduction To ASDS
Importance of Statistics in Data Science: Statistics is of immense importance to both
data science and data analytics. Its role can be summarized in the following points:
• To identify data and convert data patterns into a usable format
• To collect, analyze, and evaluate data and to draw conclusions from the results using
appropriate mathematical models
Statistics for data science is also very important and useful in business and industry, as
listed below:
• Useful in risk assessment, fraud detection, and portfolio optimization. It also contributes
to forecasting market trends, modeling financial data, and making investment
decisions.
• Statistics aids in medical science through clinical trials, patient data analysis, disease
diagnosis and identifying the treatment effectiveness.
• It helps evaluate teaching methodologies, assess student performance, and improve the
curriculum and educational policy.
• Manufacturers also benefit from process optimization and quality control through
defect identification, reduced downtime, and improved efficiency.
Fundamentals of Statistics: Data science deals with different types of structured
and unstructured data. Statistical analysis helps in enhancing predictability, analyzing
patterns, and interpreting and drawing conclusions from the results of data
analysis. The two fundamental statistical concepts that play a vital role in data
science are descriptive and inferential statistics.
Dr. Mohd. Muzibur Rahman 2
Professor, Department of SDS
Introduction to Statistics and Data Science
Descriptive Statistics: It includes the methods for summarizing and describing the main
features of a dataset. The measures of central tendency used in descriptive
statistics include the mean, median, and mode. In addition, measures of dispersion such as
the range, standard deviation, and variance are included to provide a comprehensive
overview of the data's characteristics for interpretation or conclusion.
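As a minimal sketch of the measures named above, the following uses Python's standard `statistics` module. The sample values and the helper name `describe` are illustrative assumptions, not part of the original notes.

```python
import statistics

def describe(values):
    """Return common descriptive measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }

sample = [84, 85, 86, 90, 90, 95]  # illustrative values only
summary = describe(sample)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The sketch reports the sample variance and standard deviation (divisor n - 1); `statistics.pvariance` and `statistics.pstdev` would give the population versions.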
Inferential Statistics: Inferential statistics is concerned with using sample data to make
inferences or predictions about the population. It includes tests of hypotheses to assess
the validity of assumptions or claims about the population. It is also used to construct
confidence intervals that estimate the likely range of values for population parameters.
Inferential statistics is therefore significant in decision-making.
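To make the idea of a confidence interval concrete, here is a rough sketch that estimates a 95% interval for a population mean from a sample. It assumes a normal approximation (a t-based interval would be more appropriate for small samples), and the sample values and the function name `mean_confidence_interval` are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for the population mean.

    Assumption: uses the normal (z) approximation, which is only a
    rough guide when the sample is small.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

scores = [62, 70, 55, 81, 67, 74, 59, 66, 72, 64]  # hypothetical sample
low, high = mean_confidence_interval(scores)
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```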
Importance of statistics:
Statistics and state: A state in the modern setup collects the largest amount of statistics
for various purposes. It collects data relating to prices, production, consumption, income
and expenditure, investment and profits. Popular statistical methods such as time-series
analysis, index numbers, forecasting and demand analysis are extensively used in
formulating economic policies. Government also collects data on population dynamics in
order to initiate and implement various welfare policies and programs.
Statistics and economics: (i) Time series analysis is used for studying the behavior of
prices, production, and consumption of commodities, money in circulation, and bank
deposits and clearings. (ii) Index numbers are useful in economic planning as they indicate
the changes over a specified period of time in (a) prices of commodities, (b) imports and
exports, (c) industrial/agricultural production, and (d) cost of living. (iii) Demand analysis
is used to study the relationship between the price of a commodity and its output (supply).
(iv) Forecasting techniques are used to predict the inflation rate, unemployment rate, or
manufacturing capacity utilization.
Statistics in business management: From the business point of view, statistics may be
defined as a method of decision making in the face of uncertainty on the basis of numerical
data and calculated risks. The following are certain activities of a typical organization
where statistics plays an important role in efficient execution.
Marketing: Before a product is launched, the market research team of an organization,
through a pilot survey, makes use of various statistical techniques to analyze data on
population, purchasing power, habits of the consumers, competitors, pricing and other
aspects. Such studies reveal the possible market potential for the product.
Production: Statistical methods are used to carry out programs for improvement in the
quality of the existing products and for setting quality control standards for new ones.
Decisions about the quality and time of either self-manufacturing or buying from outside
are based on statistically analyzed data.
Finance: A statistical study through correlation analysis of profit and dividend helps to
predict and decide probable dividends for future years. Statistics applied to the analysis
of data on assets and liabilities and on income and expenditure helps to ascertain the
financial results of various operations.
Personnel: In the process of manpower planning, a personnel department makes statistical
studies of wage rates, incentive plans, cost of living, labor turnover rates, employment
trends, accident rates, performance, and training and development programs.
Statistics in social science: (i) Regression and correlation analysis techniques are used to
study and isolate the factors associated with each social phenomenon which bring about
the changes in data with respect to time, place and object.
(ii) Sampling techniques and estimation theory are indispensable methods for
conducting any social survey and for drawing valid inferences.
(iii) In sociology, statistical methods are used to study death rates, birth
rates, population growth and other aspects of vital statistics.
Statistics in medical science: The knowledge of statistical methods and techniques is of
great importance in all natural sciences (zoology, botany, meteorology, and medicine). For
example, for proper diagnosis of a disease, the doctor relies heavily on factual
data relating to pulse rate, body temperature, blood pressure, heart beat, and body weight.
An important application of statistics lies in testing the efficacy of a particular drug or
vaccine to cure or prevent a specific disease.
Statistics and computer: Computers and information technology in general have had a
fundamental effect on most business and service organizations. Over the last decades the
personal computer (PC) has revolutionized the areas to which statistical techniques
are applied. PC facilities such as spreadsheets or common statistical packages have now
made such analysis readily available to any business decision-maker. Computers help in
processing and maintaining past records of operations involving payroll calculations,
inventory management, railway/airline reservation, etc.
Limitation of statistics: Although Statistics has its application in almost all sciences
(social, physical, and natural), it has its own limitations as well, which restrict its scope
and utility.
a) Statistics does not study qualitative phenomena: Since Statistics deals with
numerical data, it cannot be applied in studying those problems which cannot be stated
and expressed quantitatively. Qualitative characteristics such as honesty, poverty,
welfare, beauty or health cannot be measured directly in quantitative terms. However,
these subjective concepts can be related in an indirect manner to numerical data after
assigning particular scores.
b) Statistics does not study individuals: Statistics always deals with aggregated data;
it helps in taking decisions about a population based on sample
information, not about any single individual.
c) Statistics can be misused: Statistics are liable to be misused. For proper use of
Statistics one should have enough skill, knowledge, and experience to draw accurate
and sensible conclusions. Further, valid results cannot be drawn from the use of
statistics unless one has a proper understanding of the subject to which it is applied.
d) Laws are not exact: Statistical laws are based on probability, so the results
will not always be as exact as those of scientific laws. On the basis of probability or
interpolation, we can only estimate the production of paddy in 2008; we cannot
claim that the estimate is 100% accurate. It is only an approximate
estimate.
e) Results are true only on average: As discussed above, the results are
interpolated using time series, regression, or probability.
They are not absolutely true. If the average marks in statistics of two sections of
students are the same, it does not mean that all 50 students in section A have obtained
the same marks as those in section B. There may be much variation between the two. So we
get only average results.
f) Statistical results are not always beyond doubt: Although we use many laws
and formulae in statistics, the results achieved are not final and
conclusive. As they are unable to give a complete solution to a problem, the results
must be interpreted and used with much wisdom and knowledge.
Population: A well-defined group, sector, or area about which you want to draw a
conclusion or make a decision is known as the population. For example, a
high school administrator wants to know the trend of final exam scores of all graduating
senior students. Here all the graduating students form the population, and each graduate
of the school is a population unit. A population characteristic is known as a parameter,
and it is usually unknown; an example is the population mean 𝜇.
Finite Population: If the number of population units is countable or finite,
the population is known as a finite population. For example, if you want to
know the socio-economic condition of the students of Jahangirnagar University (JU), the
students form a finite population, because the number of students of JU at any particular
time is a known fixed number.
Infinite Population: If the number of population units is not countable or not finite,
the population is known as an infinite population. For example, if you
want to know the total number of fish in a particular river or pond at a particular
time, the fish are treated as an infinite population, because the number of fish in a pond
or river at any particular time is not known.
Sampling Unit: The smallest unit from which we want to collect the information for
drawing conclusion about the defined population is known as sampling unit.
Sampling Frame: The entire list of population units is known as the sampling frame. For
example, the list of all JU students is a sampling frame.
Variable: A characteristic which varies from individual to individual is known as a
variable. Examples are the height and weight of a person, the income of a family, the number
of family members, and the total monthly expenditure of a family. We have two types of
variables: qualitative and quantitative.
Qualitative Variable: A characteristic which cannot be expressed or measured quantitatively
or numerically is known as a qualitative variable, such as gender, IQ level of an individual,
or honesty of a person.
Quantitative Variable: A characteristic which can be expressed or measured quantitatively
or numerically is known as a quantitative variable, such as the height and weight of a person,
percentage of attendance of the students, temperature of a particular area, or number of
family members. We can classify quantitative variables into two categories: discrete
variables and continuous variables.
Discrete Variable: A quantitative variable is said to be a discrete variable if it takes only
integer values. Examples are the number of students enrolled in a class, the number of
university credits earned by a student at the end of a particular semester, the number of
insurance claims filed following a particular hurricane in a particular state, and the number
of brothers and sisters of the students. It usually arises from a counting
process.
Continuous Variable: A quantitative variable is said to be a continuous variable if it can
take any value within a given range of real numbers; it usually arises from a measurement
process, not from counting. Examples of continuous variables include the height and
weight of a person, waiting time for a person, distance of JU from your residence, and
temperature of a region. Someone might say that he is 6 feet (or 72 inches) tall, but
his height could actually be 72.1 inches, 71.8 inches or some other similar number,
depending on the accuracy of the instrument used to measure height. Other examples of
continuous numerical variables include the weight of cereal boxes, the time to run a race,
and the distance between two cities.
Depending on the situation or the nature of the data, we sometimes categorize variables
as dummy variables and categorical variables.
Dummy Variable: When we assign only two values, 0 or 1, to a variable,
the variable is known as a dummy variable. For example, the variable age can be
coded as 0 for those under 18 years of age and 1 for those aged 18 years or over.
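The age coding just described can be sketched in a few lines. The function name `age_dummy` and the list of ages are illustrative assumptions.

```python
def age_dummy(age, threshold=18):
    """Return 0 for ages under the threshold and 1 for ages at or above it."""
    return 1 if age >= threshold else 0

ages = [12, 17, 18, 25, 40]           # illustrative ages
dummies = [age_dummy(a) for a in ages]
print(dummies)  # [0, 0, 1, 1, 1]
```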
Categorical Variables: Categorical variables produce responses that belong to groups or
categories. For example, responses to yes/no questions (coded, say, as 1 for yes and 2 for
no) are categorical. Questions such as “Do you own a mobile phone?” and “Did you ever
visit Hiroshima, Japan?” are limited to yes or no answers. Other examples of categorical
variables include questions on gender, marital status, literacy status and your major in
college. Sometimes categorical variables include a range of choices, such as responses to
the statement “The instructor in this course was an effective teacher” (1:
strongly disagree, 2: slightly disagree, 3: neither agree nor disagree, 4: slightly agree, 5:
strongly agree).
Data: Data is a collection of facts, observations, or measurements used for analysis and
decision-making. Data can be qualitative, such as gender or intelligence; numerical, such
as counts or measurements (number of family members, height, etc.); or categorical, such as
labels or classifications. Data serves as the starting point for analysis: it is
examined, manipulated, and interpreted to draw conclusions or make predictions about a
particular phenomenon or population. We have two types of data for decision making:
primary data and secondary data.
Primary Data: Data collected directly from respondents through surveys
using a questionnaire or schedule is known as primary data.
Secondary Data: Data collected from published journals, books, or other
sources is known as secondary data.
second player. However, with quantitative data, there is a measurable meaning to the
difference in numbers. When one student scores 90 on an examination and another student
scores 45, the difference is measurable and meaningful.
Scale: A scale may be defined as any series of items that are arranged progressively
according to value or magnitude into which an item can be placed according to its
quantification. It can be defined as a continuous spectrum or series of categories.
Ordinal Scale: It is a type of scale that arranges objects or alternatives according to their
magnitudes in an ordered relationship. When the respondents are ordered, ordinal values
are assigned. Thus it is possible to determine whether an object has more or less of a
characteristic than some other object.
Example: In business research, if we ask respondents to rate companies as excellent, good,
fair, or poor, we know that excellent is higher than good.
Interval Scale: It is another type of scale that not only arranges objects according to their
magnitudes but also distinguishes their ordered arrangement in units of equal intervals. An
interval scale contains all the information of an ordinal scale, but it also allows you to
compare the differences between objects. The difference between any two adjacent scale
values is identical to the difference between any other two adjacent values of an interval
scale. There is a constant or equal interval between scale values.
Example: The classic example is the Fahrenheit temperature scale. If a temperature is 80°,
it cannot be said that it is twice as hot as 40°. The reason is that 0° does not represent the
lack of temperature, but a relative point on the Fahrenheit scale.
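A short numerical check illustrates why ratios are not meaningful on an interval scale. The sketch below converts the two Fahrenheit readings to Celsius and to Kelvin using the standard conversion formulas (these conversions are an addition for illustration, not part of the original notes) and compares the resulting ratios.

```python
def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvin (absolute zero = 0 K)."""
    return fahrenheit_to_celsius(f) + 273.15

hot, cold = 80, 40  # the two readings from the example, in Fahrenheit

ratio_f = hot / cold
ratio_c = fahrenheit_to_celsius(hot) / fahrenheit_to_celsius(cold)
ratio_k = fahrenheit_to_kelvin(hot) / fahrenheit_to_kelvin(cold)

print(f"Ratio in Fahrenheit: {ratio_f:.2f}")
print(f"Ratio in Celsius:    {ratio_c:.2f}")
print(f"Ratio in Kelvin:     {ratio_k:.2f}")
```

The same two temperatures give a different ratio on each scale, because the Fahrenheit and Celsius zero points are arbitrary. Only the Kelvin value, measured from an absolute zero, supports a ratio statement.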
Ratio Scale: A ratio scale possesses all the properties of the nominal, ordinal and interval
scales and, in addition, an absolute zero point. Thus, in ratio
scales we can identify or classify objects, rank the objects and compare intervals or
differences. It is also meaningful to compute ratios of scale values.
Example: Money and weight are ratio scales because they possess an absolute zero and
interval properties.
Mathematical and Statistical Analysis of Scales

| Scale | Basic characteristics | Common examples | Marketing examples | Numerical operation | Permissible statistics (descriptive) | Permissible statistics (inferential) |
|---|---|---|---|---|---|---|
| Nominal | Numbers identify and classify objects | Social Security numbers, numbering of football players | Brand numbers, store types, sex classification | Counting | Percentages, mode | Chi-square, binomial test |
| Ordinal | Numbers indicate the relative positions of the objects but not the magnitude of differences between them | Quality rankings, rankings of teams in a tournament | Preference rankings, market position, social class | Rank ordering | Percentile, median | Rank-order correlation, Friedman ANOVA |
| Interval | Differences between objects can be compared; zero point is arbitrary | Temperature (Fahrenheit, Centigrade), IQ score | Attitudes, opinions, index numbers | Arithmetic operations on intervals between numbers | Range, mean, standard deviation | Product-moment correlations, t-tests, ANOVA, regression, factor analysis |
| Ratio | Zero point is fixed; ratios of scale values can be computed | Length, weight | Age, income, costs, sales, market shares | Arithmetic operations on actual quantities | Geometric mean, harmonic mean | Coefficient of variation |
Data Set: Data is nothing but systematically recorded values and facts about a
characteristic in any enquiry. When the data available to us is not systematic or organized,
it is known as raw data. Mostly, the data given to us is in the form of raw data;
organizing it systematically, in the form of a bar graph, pictograph,
double bar graph, or any other visual representation, is called organization of
raw data. For example, 15 people were asked about their favorite sports, and these are
the answers they gave: cricket, volleyball, tennis, cricket, cricket, tennis, badminton,
volleyball, badminton, badminton, cricket, tennis, volleyball, cricket, tennis.
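As a first step in organizing these raw answers, they can be tallied into a frequency count. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# The 15 raw answers, in the order they were given
answers = [
    "cricket", "volleyball", "tennis", "cricket", "cricket",
    "tennis", "badminton", "volleyball", "badminton", "badminton",
    "cricket", "tennis", "volleyball", "cricket", "tennis",
]

counts = Counter(answers)
for sport, frequency in counts.most_common():
    print(f"{sport:<12}{frequency}")
```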
Editing/Cleaning of Raw Data: Raw data is sometimes called primary data, and this
type of data may be affected by inconsistencies, duplications, omissions, etc. That is why
it is always necessary to edit or clean the raw data before any summarization or statistical
analysis.
Organization of Data: It is a critical process that involves structuring, categorizing, and
managing data to make it more accessible, usable, and analyzable. Whether in research,
business, or everyday applications, well-organized data can significantly enhance
efficiency and decision-making. The importance of data organization has grown
exponentially with the increasing volume of data generated in today’s digital age. By
organizing data, we can ensure it is clean, accurate, and ready for analysis, leading to more
informed insights and better outcomes. Key components of data organization include
classification, categorization, and structuring. For example, in a business setting, customer
data might be organized by demographics, purchase history, and engagement levels,
allowing for targeted marketing efforts and personalized customer service.
• Removes possible errors: In unorganized data, the possibility of error is not
zero; errors or inconsistencies can arise while gathering, tabulating, analyzing, or
representing the data. In organized data, care is taken to ensure that the data
provided is correct and free of such errors.
• Easy to understand and memorize: Organized data is visually appealing and is
much easier to memorize and understand than raw data.
a) Classification
b) Tabulation
c) Frequency Distribution
d) Graphical Representation
e) Descriptive Measures
- Measures of Central Tendency (Mean, Median, Mode)
- Measures of Dispersion (Range, Quartile Deviation, Mean
Deviation, Variance and Standard Deviation)
- Trimmed Mean
- Shape Characteristics
- Correlation
- Regression
- Stem and Leaf Plot
- Box whisker Plot
- Dot Plot
Classification: The most common summarization technique is classification. This
technique is mostly used for qualitative data, for example to classify by gender, area,
region, literacy, etc.
Tabulation: The technique of tabulation is mostly used for quantitative data. In tabulation
we have two types of tables: univariate and bivariate. In a univariate
table we observe the trend, behavior, or pattern of the variable or characteristic of
interest. In a bivariate table we examine the relationship between two variables.
Frequency distribution: A frequency distribution is a table used to organize the raw data.
We have two types of frequency distribution: the discrete frequency distribution and
the grouped frequency distribution.
Discrete Frequency Distribution: A discrete frequency distribution divides the observations
in the data set into conveniently established single-value classes. When the range of
the data is small, we construct a discrete frequency distribution. The number of
observations corresponding to each value is referred to as its frequency. For example,
the following is the frequency distribution of the number of family members of 25 families.
5 2
6 1
Step 2: Determine the number of classes: If K denotes the number of classes and N the
total number of observations, then K is the smallest integer exponent of 2 such that
$2^K \ge N$.
Another way to find the value of K is Sturges' rule, given by $K = 1 + 3.322 \log_{10} N$,
where $\log_{10} N$ is the logarithm (base 10) of the total number of observations.
Step 3: Determine the width or the interval of classes: For constructing the frequency
distribution, determine a suitable class interval i:
$$ i = \frac{\text{Range}}{K} = \frac{\text{Range}}{1 + 3.322 \log_{10} N} $$
Generally we take i as a multiple of 5.
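Steps 2 and 3 can be sketched as small helper functions. The function names are hypothetical, and the inputs N = 50 and Range = 65 are used only as illustrative values.

```python
import math

def classes_power_of_two(n):
    """Smallest integer K such that 2**K >= n."""
    k = 1
    while 2 ** k < n:
        k += 1
    return k

def classes_sturges(n):
    """Number of classes suggested by Sturges' rule (not rounded)."""
    return 1 + 3.322 * math.log10(n)

def class_width(data_range, k):
    """Class interval i = Range / K."""
    return data_range / k

n, data_range = 50, 65  # illustrative inputs
k_power = classes_power_of_two(n)
k_sturges = classes_sturges(n)
print("K by powers of 2:", k_power)
print(f"K by Sturges' rule: {k_sturges:.3f}")
print(f"Width using Sturges' K: {class_width(data_range, k_sturges):.3f}")
```

In practice the computed width is rounded to a convenient value, typically a multiple of 5 as the step suggests.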
Step 4: Determine the class limits (boundaries): The limits of each class interval should
be clearly defined so that each observation (element) of the data set belongs to one and
only one class. The class intervals must be non-overlapping; inclusive classes look like
20-29, 30-39, etc. Sometimes we also need exclusive classes, where the upper limit of each
class is excluded from that class (such as 20-30, 30-40, 40-50, etc.).
Step 5: Mid-point of class interval: The class mid-point is the point halfway between the
boundaries of each class. That is, it is the average of the upper limit and the lower limit
of the class.
Step 6: Tally marks: Each observation of the data set is matched with
its class and a tally mark is put against that class. After the whole data set has been
processed, the tallies of every class are counted and written against the corresponding
class. These counts are known as frequencies.
Step 7: Cumulative frequency (less than): If you add the frequency of each class to those
of the classes above it, working down from the top, the running totals are known as
less-than cumulative frequencies. If you do the same thing from the bottom, they are known
as more-than cumulative frequencies. If we divide the frequency of each class by the total
frequency, the result is known as the relative frequency.
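The quantities defined in Steps 5 and 7 can be computed directly from the class limits and frequencies. The following is a sketch under assumed inputs: the three classes and their frequencies are hypothetical, and `frequency_table` is an illustrative name.

```python
from itertools import accumulate

def frequency_table(classes, frequencies):
    """Return mid-points, less-than and more-than cumulative frequencies,
    and relative frequencies for grouped data."""
    total = sum(frequencies)
    midpoints = [(low + high) / 2 for low, high in classes]
    # Running totals from the top class downward
    less_than = list(accumulate(frequencies))
    # Running totals from the bottom class upward
    more_than = list(accumulate(reversed(frequencies)))[::-1]
    relative = [f / total for f in frequencies]
    return midpoints, less_than, more_than, relative

classes = [(20, 30), (30, 40), (40, 50)]  # hypothetical class limits
frequencies = [4, 5, 6]                   # hypothetical frequencies
midpoints, less_than, more_than, relative = frequency_table(classes, frequencies)

for (low, high), m, f, lt, mt, r in zip(
        classes, midpoints, frequencies, less_than, more_than, relative):
    print(f"{low}-{high}  mid={m}  f={f}  cf<={lt}  cf>={mt}  rel={r:.3f}")
```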
Example 1: The following data show the total time (in hours) worked by 30 machinists.
Construct a frequency distribution.
90 88 90 89 90 84 86 90 84 89 93 84 90 94 91
94 93 93 92 92 85 88 86 91 87 94 89 85 90 95
Solution: Here the variation among the data values is not very wide, so we construct an
ungrouped frequency distribution as follows:
Table 1: Frequency distribution of total time (in hours) worked by 30 machinists.
Working hours     84 85 86 87 88 89 90 91 92 93 94 95
No. of employees   3  2  2  1  2  3  6  2  2  3  3  1
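The ungrouped distribution for the 30 machinists can be reproduced by counting how often each value occurs. This is a minimal sketch using the data of Example 1; the variable names are illustrative.

```python
from collections import Counter

hours = [
    90, 88, 90, 89, 90, 84, 86, 90, 84, 89, 93, 84, 90, 94, 91,
    94, 93, 93, 92, 92, 85, 88, 86, 91, 87, 94, 89, 85, 90, 95,
]

frequency = Counter(hours)
for value in sorted(frequency):
    print(value, frequency[value])
```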
Example 2: Following data shows the weekly overtime (in hours) of 50 employees in a
reputed fashion design company. Construct a frequency distribution by taking suitable
class interval.
22 77 79 82 65 50 65 73 60 33 75 66 65 30 63 41 55
65 67 62 45 49 75 59 55 54 51 28 39 25 50 48 68 55
81 35 65 65 79 61 45 53 81 49 37 57 78 27 87 77
Solution: Here $2^6 = 64 \ge 50$, so the number of classes is K = 6. The range of the data
is 87 − 22 = 65, so the width of the class interval is 65/6 ≈ 10.8. On the other hand, using
Sturges' rule,
$$ i = \frac{\text{Range}}{1 + 3.322 \log_{10} 50} = \frac{65}{1 + 3.322 \times 1.6989} \approx 9.784 $$
Here the nearest convenient value of the width is 10 (we select a multiple of 5).
Table: Frequency distribution of weekly overtime (in hours) of 50 employees.
Overtime (in hours) No. of employee
20-30 4
30-40 5
40-50 6
50-60 10
60-70 13
70-80 8
80-90 4
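The grouped table can be checked by counting the observations that fall in each exclusive class (lower limit included, upper limit excluded). The sketch below uses the 50 overtime values of Example 2; the function name `grouped_frequency` is an illustrative assumption.

```python
def grouped_frequency(data, start, width, n_classes):
    """Count observations in exclusive classes [low, low + width)."""
    counts = [0] * n_classes
    for x in data:
        index = (x - start) // width
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

overtime = [
    22, 77, 79, 82, 65, 50, 65, 73, 60, 33, 75, 66, 65, 30, 63, 41, 55,
    65, 67, 62, 45, 49, 75, 59, 55, 54, 51, 28, 39, 25, 50, 48, 68, 55,
    81, 35, 65, 65, 79, 61, 45, 53, 81, 49, 37, 57, 78, 27, 87, 77,
]

counts = grouped_frequency(overtime, start=20, width=10, n_classes=7)
for i, count in enumerate(counts):
    low = 20 + 10 * i
    print(f"{low}-{low + 10}: {count}")
```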
Example: The management of a factory wants to know the monthly working pattern of
the workers of the factory. For this purpose, a survey was conducted on 48 randomly
selected workers. The following data give the number of hours worked per month by the
48 workers.
140 165 103 110 130 144 133 204 175 156 187 195
162 161 167 184 151 149 157 124 87 71 79 155
164 40 94 113 108 146 122 87 69 164 116 203
121 128 149 148 30 93 114 104 150 62 143 42
Construct a frequency distribution by using suitable class interval.
Describing Data: Graphical: Once we carefully define a problem, we need to collect
data for making decisions. Often the number of observations collected is so large that the
actual findings of the study are unclear. For this reason, it is necessary to summarize data
in such a way that a clear and accurate picture emerges. Unfortunately, there is no single
method or way to describe data. Rather, the appropriate choice is typically problem-specific
and depends on two factors: the type of data and the purpose of the study. Tables and graphs
help us to gain a better understanding of data and provide visual support for improved
decision making.
Bar diagram and pie diagram are mainly used for representing qualitative data. The former
is also frequently used for depicting numerical values of a given item over a period of time.
Histogram, frequency polygon and ogive curve are used to represent frequency
distributions. Line diagram is widely used to study the changes in the values of a variable
with the passage of time. Scatter diagram is very useful in studying the interrelationship of
two variables.
Bar Diagram: This diagram is drawn by constructing a series of blocks of equal width
whose heights are proportional to the values corresponding to
different time periods or categories. The following table shows the distribution of the
expenditure budget (in crore taka) of different sectors of a country in the year 2012:
Now if we put the categories (sectors) on the x-axis and the expenditure on the y-axis, the
diagram will be a bar diagram in which the widths of the bars are equal but the heights are
proportional to the expenditure of the sectors.
[Figure: Bar diagram of expenditure by sector. x-axis: sector (Agriculture, Transport, Education, Industry, Others); y-axis: expenditure, scale 0 to 100.]
An interesting and useful extension of the simple bar chart can be used when components
of individual categories are also of interest. For example, the following table shows the
number of students enrolled in three business majors (three departments of the business
faculty of Jahangirnagar University) in three different years.
[Figure: Component bar chart of enrollment in Accounting, Marketing, and Finance & Banking for 2001-02, 2002-03 and 2003-04. y-axis: number of students, scale 0 to 100.]
This information can be shown in a bar chart by breaking down the total number of students
for each year so that the three components are distinguished from one another; such a chart
is called a component bar chart. This graph allows us to make visual comparisons of totals
and individual components. In this example it appears that the increase in enrollment
between 2001 and 2004 was almost uniform over the three majors.
Pie diagram: Pie charts are also used to describe categorical data. If we want to draw
attention to the proportion of frequencies in each category, then we will probably use a pie
chart to depict the division of a whole into its constituent parts. The circle represents the
total, and the segments cut from its center depict shares of that total. The following table
shows the distribution of monthly expenditure of the students of JU.
[Figure: Pie diagram of monthly expenditure. Segments: Food, Clothing, House rent, Education, Miscellaneous.]
[Figure: Histogram. x-axis: Audit Time (classes 10-15, 15-20, 20-25, 25-30, 30-35); y-axis: No. of Client, scale 0 to 9.]
[Figure: Frequency Polygon. x-axis: Mid value (15, 25, 35, 45, 55, 65); y-axis: Frequency, scale 0 to 14.]
Ogive (less than): On the x-axis we plot the upper limit of each class and on the y-axis we
plot the less-than cumulative frequency.
Class interval         5-10  10-15  15-20  20-25  25-30  30-35
Frequency                 5      7     11      8      4      2
Cumulative frequency      5     12     23     31     35     37
[Figure: Ogive Curve. x-axis: Upper limits (10, 15, 20, 25, 30, 35); y-axis: Cumulative frequency, scale 0 to 40.]
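The points plotted on a less-than ogive are the pairs (upper class limit, cumulative frequency). A minimal sketch that derives them from the class frequencies in the table follows; the variable names are illustrative.

```python
from itertools import accumulate

upper_limits = [10, 15, 20, 25, 30, 35]
frequencies = [5, 7, 11, 8, 4, 2]

# Running total of frequencies gives the less-than cumulative frequency
cumulative = list(accumulate(frequencies))
ogive_points = list(zip(upper_limits, cumulative))

for limit, cf in ogive_points:
    print(f"less than {limit}: {cf}")
```

Joining these points in order with line segments gives the ogive.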
Year of enrollment       2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
No. of students ('00)      12   14   15   17   18   15   16   19   20   21
[Figure: Line diagram titled "Enrollment". x-axis: year index 1 to 10; y-axis: number of students ('00), scale 0 to 25.]
a) When the emphasis is on the movement of a variable rather than on its actual
magnitude.
b) When several series are compared on the same chart.
c) When estimates or forecasts of a variable are to be obtained or displayed
graphically.
Scatter diagram: Sometimes the data consist of paired values of two related variables, and
the statistical problem is to investigate the inter-relationship between the variables.
Examples of such related variables are height and weight, income and expenditure, and
price and consumption. When the given pairs of values are plotted on ordinary graph
paper, we get a scatter diagram. If the plotted points form an upward trend on the graph
paper, the relationship between the variables is positive. If they form a downward trend,
the relationship between the two variables is negative.
Expense of Ad. 10 12 15 20 23 9 6 7 11 12 13
Sales (in lac) 14 17 23 21 25 11 8 9 14 13 27
[Figure: Scatter diagram titled "Sales". x-axis: expense of advertisement, scale 0 to 25; y-axis: sales (in lac), scale 0 to 30.]
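The direction of the relationship seen in a scatter diagram can also be checked numerically. The sketch below computes the Pearson correlation coefficient for the advertisement and sales data; the coefficient is a standard measure used here for illustration, and the notes themselves only describe the visual trend.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

ad_expense = [10, 12, 15, 20, 23, 9, 6, 7, 11, 12, 13]
sales = [14, 17, 23, 21, 25, 11, 8, 9, 14, 13, 27]

r = pearson_r(ad_expense, sales)
direction = "positive" if r > 0 else "negative"
print(f"r = {r:.3f} ({direction} relationship)")
```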
Example: The prices (in taka) of 20 different brands of walking shoes are given below:
45 70 70 55 75 73 70 65 68 60 74 80 83 58 68 85 90 64 75 82
Construct a stem and leaf plot to display the distribution of the data.
Solution: The stem and leaf display of the data is as follows:
Stem | Leaf
4    | 5
5    | 5 8
6    | 5 8 0 8 4
7    | 0 0 5 3 0 4 5
8    | 0 3 5 2
9    | 0
Arranging the leaves of each stem in ascending order gives:
Stem | Leaf
4    | 5
5    | 5 8
6    | 0 4 5 8 8
7    | 0 0 0 3 4 5 5
8    | 0 2 3 5
9    | 0
From the display it is seen that lowest price of walking shoe is 45 and highest is 90. And
the most common price is 70.
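The ordered display can be generated by splitting each price into a stem (tens digit) and a leaf (units digit). This is a sketch using the 20 shoe prices of the example; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group two-digit values by tens digit, with leaves in ascending order."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

prices = [45, 70, 70, 55, 75, 73, 70, 65, 68, 60,
          74, 80, 83, 58, 68, 85, 90, 64, 75, 82]

plot = stem_and_leaf(prices)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```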
• s an excess of females, and denotes low sex ratio
Dot Plot: A dot plot encodes data as dots or small circles. It is drawn
on a number line and displays the distribution of a numerical variable, with each dot
standing for one value.
A dot plot is similar to a simplified histogram or a bar diagram, as the height of the
column formed by the dots represents the numerical value for each category. Dot plots are
used to represent small amounts of data.
For example, a dot plot can be used to display the vaccination report of newborns in an
area, which is represented in the following table.
Colony A B C D
Now let's see the number of newborn babies who got a vaccine in each colony. Colony A
has a total of 7 dots, which means that seven babies have been vaccinated. Similarly, colony
B has three babies, colony C has five babies, and colony D has one baby who has been
vaccinated. The other way to represent it through a dot plot is given below:
There are two types of dot plot: Wilkinson dot plot and Cleveland dot plot.
The Wilkinson dot plot represents the distribution of continuous data in the form of
individual dots for each value. For example, if 10 students like math it is represented by 10
dots on a dot plot. In the above example of the number of kids vaccinated, the first graph
showing 7 dots for colony A, 3 dots for colony B, etc. is an example of a Wilkinson dot
plot.
The Cleveland dot plot is a good alternative to a simple bar chart if you have more than a
few elements. It doesn't take much for a bar chart to look cluttered. Many more values can
be shown in a dot plot in the same amount of space, and it is also simpler to read. This type
of plot is similar to a bar chart but uses the position of a single dot instead of the length
of a bar formed by multiple dots. Just as the height of a bar represents the number of items,
the position of the dot on the number line or on the graph represents the number of items
for that category. In the above example of vaccinated children in 4 colonies, the second
graph showing only one dot for each colony is an example of a Cleveland dot plot.
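As a rough text-only illustration of the first (Wilkinson-style) form, the sketch below prints one symbol per vaccinated baby for each colony, using the counts stated in the example (A: 7, B: 3, C: 5, D: 1). The function name `dot_plot` and the choice of `*` as the dot symbol are assumptions made for illustration.

```python
def dot_plot(counts, symbol="*"):
    """Return one text row per category, with one symbol per observation."""
    return [f"{label} | {symbol * n}" for label, n in counts.items()]

vaccinated = {"A": 7, "B": 3, "C": 5, "D": 1}  # counts from the example
lines = dot_plot(vaccinated)
print("\n".join(lines))
```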
Example 1: The following dot plot illustrates each student's essay score in Mr. Jhonson's
class. A different student is represented by each dot. What was the minimum essay score
earned by a student, and what is the score earned by the maximum number of students?
Solution: The dot plot shows the number of students who received each score for essays
marked on a 6-point scale.
Thus, the minimum essay score that a student received is 2 points, and 3 is the score earned
by the maximum number of students.
Example 2: The following dot plot shows the height of each toddler (a child who has just
started to walk) at Mrs. Bell's daycare. Each dot represents a different toddler. What is the
height of the shortest toddler?