Chapter 1

Introduction to Statistics and Data Science


The English word “statistics” originates from the Italian word “statista” or the Latin word
“status”. Statistics is concerned with producing information from numerical data by
collecting and recording it. It may be defined as the science of numbers. Scholars have
defined statistics in different ways; some of these definitions are given below:
A.L. Bowley defines, “Statistics are numerical statements of facts in any department of
enquiry placed in relation to each other.”
Wallis and Roberts define, “Statistics is a body of methods for making wise decisions in
the face of uncertainty.”
Croxton and Cowden define, “Statistics may be defined as the collection, presentation,
analysis, and interpretation of numerical data.”
W.I. King defines, “The science of statistics is the method of judging collective, natural
or social phenomena from the results obtained from the analysis or enumeration or
collection of estimates.”
Professor Boddington defines statistics as “the science of estimates and probabilities.”
Professor Ya-Lun Chou defines, “Statistics is a method of decision making in the face
of uncertainty on the basis of numerical data and calculated risk.”
Professor Horace Secrist defines, “Statistics is the aggregate of facts affected to a marked
extent by the multiplicity of causes, numerically expressed, enumerated or estimated
according to reasonable standards of accuracy, collected in a systematic manner for a
predetermined purpose and placed in relation to each other.”
American Heritage Dictionary defines, “Statistics is the mathematics of collection,
organization, and interpretation of numerical facts or data, especially the analysis of
population characteristics by inference from sampling.”
Definition of Statistics: Finally, statistics may be defined as “the science which deals with
the collection, organization, classification, summarization, presentation, and analysis of data
and, finally, the interpretation of the statistical results obtained from the analysis in any
field of inquiry.”
Data Science: The field of data science has gained significant prominence in recent years,
with organizations relying on data-driven insights to make informed decisions. At the heart
of data science lies the discipline of statistics, which plays a crucial role in extracting
meaningful information from large volumes of data (big data).
Definition of Data Science: Data science may be defined as an interdisciplinary field that
combines scientific methods, algorithms, and systems to extract knowledge and insights
from structured and unstructured data. It encompasses various techniques such as data
mining, machine learning, and statistical analysis to uncover patterns, trends, and
correlations within data. Data scientists utilize their expertise to generate actionable
insights and drive evidence-based decision-making in any field of enquiry.

Importance of Statistics in Data Science: Statistics is of immense importance for data
science and data analytics. It is used:

• To describe and quantify data

• To identify patterns in data and convert them into a usable format

• To collect, analyze, and evaluate data and to draw conclusions from the results using
appropriate mathematical models

• To organize data while spotting trends

• To work with probability distributions and estimation

• To enhance data visualization and reduce the number of assumptions needed

Statistics for data science is also very important and useful in business and industry, as
listed below:

• Useful in risk assessment, fraud detection, and portfolio optimization. It also contributes
to forecasting market trends, modeling financial data, and making investment
decisions.

• Statistics aids medical science through clinical trials, patient data analysis, disease
diagnosis, and the assessment of treatment effectiveness.

• It helps evaluate teaching methodologies, assess student performance, and improve the
curriculum and educational policy.

• Retailers benefit through inventory management, demand forecasting, and customer
segmentation. Statistics helps them maintain optimal stock levels according to
requirements, improve pricing strategies, and enhance the overall customer
experience.

• Manufacturers also benefit through process optimization and quality control through
defect identification, reducing downtime, and improving efficiency.

• It assists in environmental studies for ecological monitoring and climate modeling to
support conservation efforts and build environmental policies.

Fundamentals of Statistics: Data science deals with different types of structured and
unstructured data. Statistical analysis helps to improve predictability, to analyze
patterns, and to interpret and draw conclusions from the results of data analysis. The two
fundamental statistical concepts that play a vital role in data science are descriptive
and inferential statistics.
Dr. Mohd. Muzibur Rahman 2
Professor, Department of SDS

Descriptive Statistics: Descriptive statistics comprises methods for summarizing and
describing the main features of a dataset. Its measures of central tendency include the
mean, median, and mode. Measures of dispersion, such as the range, standard deviation, and
variance, are also included to provide a comprehensive overview of the data's
characteristics for interpretation and conclusion.
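
As an illustration, the measures named above can be computed with Python's standard `statistics` module. This is a minimal sketch; the exam scores are hypothetical values invented for the example and do not come from the chapter.

```python
import statistics

# Hypothetical exam scores, invented purely for illustration.
scores = [62, 75, 75, 80, 68, 90, 75, 84]

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of dispersion
data_range = max(scores) - min(scores)
variance = statistics.variance(scores)   # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)       # sample standard deviation

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", variance, "std dev:", std_dev)
```

The sample forms of the variance and standard deviation are used here; `statistics.pvariance` and `statistics.pstdev` give the population forms.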

Inferential Statistics: Inferential statistics is concerned with using sample data to make
inferences or predictions about the population. It includes tests of hypotheses to assess
the validity of assumptions or claims about the population. It is also used to construct
confidence intervals that estimate the likely range of values of population parameters.
Inferential statistics is therefore significant in decision-making.
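
A confidence interval for a population mean can be sketched as follows. The sample values are hypothetical, and the interval uses a normal approximation as a simplifying assumption; for a sample this small a t-distribution would usually be preferred.

```python
import math
import statistics

# Hypothetical sample of 10 measurements, invented for illustration.
sample = [52, 48, 55, 50, 47, 53, 51, 49, 54, 51]

n = len(sample)
sample_mean = statistics.mean(sample)

# Standard error of the mean, based on the sample standard deviation.
std_error = statistics.stdev(sample) / math.sqrt(n)

# z-value for a two-sided 95% interval under the normal approximation (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z * std_error
upper = sample_mean + z * std_error

print(f"Approximate 95% confidence interval for the mean: ({lower:.2f}, {upper:.2f})")
```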

Importance of statistics:
Statistics and state: A state in the modern setup collects the largest amount of statistics
for various purposes. It collects data relating to prices, production, consumption, income
and expenditure, investment and profits. Popular statistical methods such as time-series
analysis, index numbers, forecasting and demand analysis are extensively used in
formulating economic policies. Government also collects data on population dynamics in
order to initiate and implement various welfare policies and programs.
Statistics and economics: (i) Time series analysis is used for studying the behavior of
prices, production, and consumption of commodities, money in circulation, and bank
deposits and clearings. (ii) Index numbers are useful in economic planning as they indicate
the changes over a specified period of time in (a) prices of commodities, (b) imports and
exports, (c) industrial/agricultural production, and (d) cost of living. (iii) Demand
analysis is used to study the relationship between the price of a commodity and its output
(supply). (iv) Forecasting techniques are used to predict the inflation rate, the
unemployment rate, or manufacturing capacity utilization.
Statistics in Business management: From the business point of view, statistics may be
defined as a method of decision making in the face of uncertainty on the basis of numerical
data and calculated risks. The following are some activities of a typical organization in
which statistics plays an important role in efficient execution.
Marketing: Before a product is launched, the market research team of an organization,
through a pilot survey, makes use of various statistical techniques to analyze data on
population, purchasing power, habits of the consumers, competitors, pricing, and other
aspects. Such studies reveal the possible market potential for the product.
Production: Statistical methods are used to carry out programs for improving the
quality of existing products and setting quality control standards for new ones.
Decisions about the quality and time of either self-manufacturing or buying from outside
are based on statistically analyzed data.
Finance: A statistical study of profit and dividends through correlation analysis helps to
predict and decide probable dividends for future years. Statistics applied to the analysis
of data on assets and liabilities and on income and expenditure helps to ascertain the
financial results of various operations.
Personnel: In the process of manpower planning, the personnel department makes statistical
studies of wage rates, incentive plans, cost of living, labor turnover rates, employment
trends, accident rates, performance, and training and development programs.
Statistics in social science: (i) Regression and correlation analysis techniques are used to
study and isolate the factors associated with each social phenomenon which bring about
changes in data with respect to time, place, and object.
(ii) Sampling techniques and estimation theory are indispensable methods for
conducting any social survey and drawing valid inferences.
(iii) In sociology, statistical methods are used to study death rates, birth
rates, population growth, and other aspects of vital statistics.
Statistics in medical science: Knowledge of statistical methods and techniques is of great
importance in all natural sciences (zoology, botany, meteorology, and medicine). For
example, for the proper diagnosis of a disease, a doctor relies heavily on factual
data relating to pulse rate, body temperature, blood pressure, heartbeat, and body weight.
An important application of statistics lies in testing the efficacy of a particular drug or
vaccine in curing or preventing a specific disease.
Statistics and computer: Computers and information technology in general have had a
fundamental effect on most business and service organizations. Over the last decades the
personal computer (PC) has revolutionized the areas to which statistical techniques
are applied. PC facilities such as spreadsheets and common statistical packages have now
made such analysis readily available to any business decision-maker. Computers help in
processing and maintaining past records of operations involving payroll calculations,
inventory management, railway/airline reservations, etc.
Limitations of statistics: Although statistics has applications in almost all sciences
(social, physical, and natural), it has its own limitations as well, which restrict its
scope and utility.
a) Statistics does not study qualitative phenomena: Since statistics deals with
numerical data, it cannot be applied to problems which cannot be stated
and expressed quantitatively. Qualitative characteristics such as honesty, poverty,
welfare, beauty, or health cannot be measured directly in quantitative terms. However,
these subjective concepts can be related indirectly to numerical data by
assigning particular scores.
b) Statistics does not study individuals: Statistics always deals with aggregated data;
it helps in making decisions about a population based on sample information, not
about any single individual.
c) Statistics can be misused: Statistics are liable to be misused. For the proper use of
statistics one should have enough skill, knowledge, and experience to draw accurate
and sensible conclusions. Further, valid results cannot be drawn from the use of
statistics unless one has a proper understanding of the subject to which it is applied.


d) It does not depict the entire story of a phenomenon: A phenomenon under enquiry
may have many causes, but not all of these causes can be expressed in terms of
data, so we cannot always reach the correct conclusions. For example, the development
of a group depends upon many social factors, such as parents’ economic condition,
education, culture, region, and government administration, but not all of these factors
can be captured in data. We therefore analyze only the aspects that can be measured
quantitatively and not those that are qualitative. As a result, conclusions are not
100% correct because many aspects are ignored.

e) Laws are not exact: Statistical laws are based on probability, so the results
are not as exact as those of scientific laws. On the basis of probability or
interpolation, we can only estimate the production of paddy in 2008; we cannot
claim that the estimate is 100% accurate. It is only an approximate
estimate.

f) Results are true only on average: As discussed above, results are
estimated using methods such as time series analysis, regression, or probability,
and they are not absolutely true. If the average marks in statistics of two sections
of students are the same, it does not mean that all 50 students in section A have
obtained the same marks as those in section B. There may be much variation between
the two. So the results hold only on average.

g) Too many methods to study a problem: In this subject many different methods can be
used to find a single result. For example, variation can be measured by the quartile
deviation, the mean deviation, or the standard deviation, and the results vary in each case.

h) Statistical results are not always beyond doubt: Although many laws
and formulae are used in statistics, the results achieved are not final and
conclusive. As they are unable to give a complete solution to a problem, the results
must be taken and used with much wisdom and knowledge.
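
Limitation (g) can be made concrete with a short sketch that measures the variation of one data set in three ways. The observations are hypothetical, and the quartiles follow the default method of Python's `statistics.quantiles`, which is only one of several conventions in use.

```python
import statistics

# Hypothetical observations, invented for illustration.
data = [4, 7, 9, 10, 12, 15, 27]

mean = statistics.mean(data)

# Quartile deviation: half the distance between the first and third quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
quartile_deviation = (q3 - q1) / 2

# Mean deviation: average absolute distance of the observations from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

# Standard deviation (population form, dividing by N).
standard_deviation = statistics.pstdev(data)

print("quartile deviation:", quartile_deviation)
print("mean deviation:", round(mean_deviation, 3))
print("standard deviation:", round(standard_deviation, 3))
```

The three measures give three different numbers for the same data, which is why the chosen measure should always be stated alongside the result.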

Population: The entire well-defined group, sector, or area about which you want to draw a
conclusion or make a decision is known as the population. For example, suppose a
high school administrator wants to know the trend of the final exam scores of all graduating
senior students. Here all the graduating students form the population, and each graduate
of the school is a population unit. A characteristic of the population is known as a
parameter, and it is generally unknown; an example is the population mean 𝜇.
Finite Population: If the number of population units is countable and finite, the
population is known as a finite population. For example, if you want to know the
socio-economic condition of the students of Jahangirnagar University (JU), the students
form a finite population, because the number of students of JU at any particular time is
a known, fixed number.

Infinite Population: If the number of population units is not countable or not finite,
the population is known as an infinite population. For example, if you want to know the
total number of fish in a particular river or pond at a particular time, the fish are
treated as an infinite population, because their number at any particular time is not
known.

Sample: A representative part of a population is known as a sample. For example, to know
the socio-economic condition of JU students, we may study a representative part of the
students instead of all JU students; this part is known as a sample.

Sampling Unit: The smallest unit from which we collect information in order to draw
conclusions about the defined population is known as the sampling unit.

Sampling Frame: The entire list of population units is known as the sampling frame. For
example, the list of all JU students is a sampling frame.
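
The relationship between a sampling frame, sampling units, and a sample can be sketched in a few lines. The frame of 100 student identifiers is hypothetical, and simple random sampling is assumed here only as one possible way of selecting units; the chapter does not prescribe a selection method.

```python
import random

# Hypothetical sampling frame: a list of 100 student identifiers.
sampling_frame = [f"student_{i:03d}" for i in range(1, 101)]

# A seeded generator keeps the illustration reproducible.
rng = random.Random(42)

# Draw 10 distinct sampling units from the frame (simple random sampling).
sample = rng.sample(sampling_frame, k=10)

print("Population size:", len(sampling_frame))
print("Sample size:", len(sample))
```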
Variable: A characteristic which varies from individual to individual is known as a
variable. Examples include the height and weight of a person, the income of a family, the
number of family members, and the total monthly expenditure of a family. There are two
types of variables: qualitative and quantitative.
Qualitative Variable: A characteristic which cannot be expressed or measured
quantitatively or numerically is known as a qualitative variable. Examples include gender,
the IQ level of an individual, and the honesty of a person.
Quantitative Variable: A characteristic which can be expressed or measured quantitatively
or numerically is known as a quantitative variable. Examples include the height and weight
of a person, the percentage of attendance of students, the temperature of a particular
area, and the number of family members. Quantitative variables can be classified into two
categories: discrete variables and continuous variables.
Discrete Variable: A quantitative variable is said to be a discrete variable if it takes
only integer values. Examples include the number of students enrolled in a class, the
number of university credits earned by a student at the end of a particular semester, the
number of insurance claims filed following a particular hurricane in a particular state,
and the number of brothers and sisters of a student. A discrete variable usually arises
from a counting process.
Continuous Variable: A quantitative variable is said to be a continuous variable if it can
take any value within a given range of real numbers. It usually arises from a measurement
process rather than from counting. Examples of continuous variables include the height and
weight of a person, the waiting time of a person, the distance of JU from your residence,
and the temperature of a region. Someone might say that he is 6 feet (or 72 inches) tall,
but his height could actually be 72.1 inches, 71.8 inches, or some other similar number,
depending on the accuracy of the instrument used to measure height. Other examples of
continuous numerical variables include the weight of cereal boxes, the time to run a race,
and the distance between two cities.
Depending on the situation or the nature of the data, we sometimes categorize variables
as dummy variables and categorical variables.

Dummy Variable: When we assign only two values, either 0 or 1, to a variable, the
variable is known as a dummy variable. For example, the variable age can be coded as 0
for those under 18 years of age and 1 for those aged 18 years or over.
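
The age example can be written as a short sketch. The ages are hypothetical values invented for illustration; the coding rule (0 for under 18, 1 for 18 or over) is the one described above.

```python
# Hypothetical ages, invented for illustration.
ages = [12, 17, 18, 25, 40]

# Dummy coding: 0 if under 18 years of age, 1 if aged 18 years or over.
is_adult = [1 if age >= 18 else 0 for age in ages]

for age, code in zip(ages, is_adult):
    print(age, "->", code)
```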
Categorical Variables: Categorical variables produce responses that belong to groups or
categories. For example, responses to yes/no questions (coded, say, 1 for yes and 2 for no)
are categorical. “Do you own a mobile phone?” and “Did you ever visit Hiroshima, Japan?”
are limited to yes or no answers. Other examples of categorical variables include questions
on gender, marital status, literacy status, and your major in college. Sometimes
categorical variables include a range of choices, such as responses to the statement “The
instructor in this course was an effective teacher” (1: strongly disagree, 2: slightly
disagree, 3: neither agree nor disagree, 4: slightly agree, 5: strongly agree).
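
A minimal sketch of how such coded responses might be stored and tallied is given below. The seven responses are hypothetical; the code-to-label mapping follows the five-point scale in the text.

```python
from collections import Counter

# Codes and labels of the five-point agreement scale described above.
likert_labels = {
    1: "strongly disagree",
    2: "slightly disagree",
    3: "neither agree nor disagree",
    4: "slightly agree",
    5: "strongly agree",
}

# Hypothetical coded responses from seven students.
responses = [5, 4, 4, 3, 5, 2, 4]

# Count how many responses fall into each category.
counts = Counter(likert_labels[code] for code in responses)

for code, label in likert_labels.items():
    print(f"{code}: {label:<28} {counts[label]}")
```

The codes serve as category labels for counting; whether arithmetic on them is meaningful depends on the scale of measurement.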
Data: Data is a collection of facts, observations, or measurements used for analysis and
decision-making. Data can be qualitative, such as gender or intelligence; numerical, such
as counts or measurements (number of family members, height, etc.); or categorical, such as
labels or classifications. Data serves as the starting point for analysis: it is examined,
manipulated, and interpreted to draw conclusions or make predictions about a
particular phenomenon or population. There are two types of data for decision making:
primary data and secondary data.

Primary Data: Data collected directly from respondents through surveys
using a questionnaire or schedule is known as primary data.
Secondary Data: Data collected from published journals, books, or other
sources is known as secondary data.

Measurement and Scaling Concept:


Measurement: Measurement is the assignment of numbers or other symbols to
characteristics of objects according to certain pre-specified rules. Note that what we
measure is not the object, but some characteristic of it. Thus, we do not measure objects,
only their perceptions, attitudes, preferences, or other relevant characteristics. In
research, numbers are usually assigned for one of two reasons. First, numbers permit
statistical analysis of the resulting data. Second, numbers facilitate the communication of
measurement rules and results. In the case of qualitative data there is no measurable
meaning to the “difference” in numbers. If one basketball player is assigned the number
“20” and another player the number “10”, we cannot conclude that the first player is twice
as good as the second player. However, with quantitative data, there is a measurable
meaning to the difference in numbers. When one student scores 90 on an examination and
another student scores 45, the difference is measurable and meaningful.

Scale: A scale may be defined as any series of items that are arranged progressively
according to value or magnitude into which an item can be placed according to its
quantification. It can be defined as a continuous spectrum or series of categories.

Purpose of Scaling: The purpose of scaling is to represent, usually quantitatively, an
item’s, a person’s, or an event’s place in the spectrum. The type of scale determines what
numerical and statistical operations can be used in analyzing measurements.
Types of Scale: There are four types of scales: (i) Nominal scale (ii) Ordinal scale
(iii) Interval scale (iv) Ratio scale.
Nominal Scale: It is the simplest type of measurement scale, in which the numbers or
letters assigned to objects serve only as labels or tags for identifying and classifying
objects, with a strict one-to-one correspondence between the numbers and the objects.
Example: In business research we may code males as 1 and females as 2. These
two numbers are nothing but labels.

Ordinal Scale: It is a type of scale that arranges objects or alternatives according to
their magnitudes in an ordered relationship. When the respondents are ordered, ordinal
values are assigned. Thus it is possible to determine whether an object has more or less of
a characteristic than some other object.
Example: In business research, if we ask respondents to rate companies as excellent, good,
fair, or poor, we know that excellent is the highest rating.
Interval Scale: It is another type of scale that not only arranges objects according to
their magnitudes but also distinguishes their ordered arrangement in units of equal
intervals. An interval scale contains all the information of an ordinal scale, but it also
allows you to compare the differences between objects. The difference between any two
adjacent scale values is identical to the difference between any other two adjacent values
of an interval scale. There is a constant or equal interval between scale values.
Example: The classic example is the Fahrenheit temperature scale. If a temperature is 80°,
it cannot be said that it is twice as hot as 40°. The reason is that 0° does not represent
the absence of temperature, but a relative point on the Fahrenheit scale.
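
The Fahrenheit example can be checked numerically. The sketch below converts both temperatures to the Kelvin scale, which is brought in here as an assumption for comparison because its zero point is absolute; the conversion formula is the standard one.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert a Fahrenheit temperature to Kelvin (a scale with an absolute zero)."""
    return (temp_f - 32) * 5 / 9 + 273.15

# Differences are meaningful on an interval scale.
difference_f = 80 - 40

# The naive ratio of the Fahrenheit readings suggests "twice as hot".
naive_ratio = 80 / 40

# The ratio computed on a scale with an absolute zero is much smaller.
kelvin_ratio = fahrenheit_to_kelvin(80) / fahrenheit_to_kelvin(40)

print("Difference in degrees Fahrenheit:", difference_f)
print("Naive ratio of Fahrenheit readings:", naive_ratio)
print("Ratio on the Kelvin scale:", round(kelvin_ratio, 3))
```

The difference of 40 degrees is meaningful, but the ratio of 2 is an artifact of where the Fahrenheit zero happens to be placed.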


Ratio Scale: A ratio scale possesses all the properties of the nominal, ordinal, and
interval scales and, in addition, an absolute zero point. Thus, in ratio
scales we can identify or classify objects, rank the objects, and compare intervals or
differences. It is also meaningful to compute ratios of scale values.

Example: Money and weight are measured on ratio scales because they possess an absolute
zero as well as interval properties.
Mathematical and Statistical Analysis of Scales

| Scale | Basic Characteristics | Common Examples | Marketing Examples | Numerical Operation | Permissible Statistics: Descriptive | Permissible Statistics: Inferential |
|---|---|---|---|---|---|---|
| Nominal | Numbers identify and classify objects | Social Security numbers, numbering of football players | Brand numbers, store types, sex classification | Counting | Percentages, mode | Chi-square, binomial test |
| Ordinal | Numbers indicate the relative positions of the objects but not the magnitude of differences between them | Quality rankings, rankings of teams in a tournament | Preference rankings, market position, social class | Rank ordering | Percentile, median | Rank-order correlation, Friedman ANOVA |
| Interval | Differences between objects can be compared; zero point is arbitrary | Temperature (Fahrenheit, Centigrade), IQ score | Attitudes, opinions, index numbers | Arithmetic operations on intervals between numbers | Range, mean, standard deviation | Product-moment correlations, t-tests, ANOVA, regression, factor analysis |
| Ratio | Zero point is fixed; ratios of scale values can be computed | Length, weight | Age, income, costs, sales, market shares | Arithmetic operations on actual quantities | Geometric mean, harmonic mean | Coefficient of variation |


Data Set: Data are systematically recorded values and facts about a
characteristic in any enquiry. When the data available to us are not systematic or
organized, they are known as raw data. Most data are given to us in raw form; arranging
them systematically, whether as a bar graph, pictograph, double bar graph, or any other
form of visual representation, is called the organization of raw data. For example, 15
people were asked about their favorite sport, and these are the answers they gave:
cricket, volleyball, tennis, cricket, cricket, tennis, badminton,
volleyball, badminton, badminton, cricket, tennis, volleyball, cricket, tennis.
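
The 15 raw answers listed above can be organized into a frequency count with Python's `collections.Counter`. This is only a sketch of one way to organize the raw data; the variable names are chosen for illustration.

```python
from collections import Counter

# The 15 raw answers from the example above, in the order given.
answers = [
    "cricket", "volleyball", "tennis", "cricket", "cricket",
    "tennis", "badminton", "volleyball", "badminton", "badminton",
    "cricket", "tennis", "volleyball", "cricket", "tennis",
]

# Organize the raw data: count how often each sport was named.
frequency = Counter(answers)

for sport, count in frequency.most_common():
    print(f"{sport:<12}{count}")

# The most frequently chosen sport can now be read off directly.
favorite, favorite_count = frequency.most_common(1)[0]
print("Most popular:", favorite, "with", favorite_count, "answers")
```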

Editing/Cleaning of Raw Data: Raw data is sometimes called primary data, and this
type of data may be affected by inconsistencies, duplications, omissions, etc. That is why
it is always necessary to edit/clean the raw data before any summarization or statistical
analysis.
Organization of Data: It is a critical process that involves structuring, categorizing, and
managing data to make it more accessible, usable, and analyzable. Whether in research,
business, or everyday applications, well-organized data can significantly enhance
efficiency and decision-making. The importance of data organization has grown
exponentially with the increasing volume of data generated in today’s digital age. By
organizing data, we can ensure it is clean, accurate, and ready for analysis, leading to more
informed insights and better outcomes. Key components of data organization include
classification, categorization, and structuring. For example, in a business setting, customer
data might be organized by demographics, purchase history, and engagement levels,
allowing for targeted marketing efforts and personalized customer service.

Necessity of Data Organization: Data organization is essential for proper and accurate
decision making. It has the following advantages:
• It saves a lot of time: It is very difficult to draw any conclusion or make a decision
from unorganized data, but if the data are organized, a conclusion can be reached
very quickly. Take the previous example and find out which sport is chosen by most
of the people. The answer can be obtained from both the raw data and the organized
data, but with the organized data the time needed to answer the question is much
less.

• It reduces possible errors: In unorganized data the possibility of error is not
zero; there can be errors or inconsistencies while gathering, tabulating, analyzing,
or representing the data. In organized data, by contrast, care is taken to ensure that
the data provided are correct and free of errors.

• It is easy to understand and memorize: Organized data are visually appealing and are
much easier to memorize and understand than raw data.


Methods of Summarizing Data: As discussed, it is necessary to summarize the raw data
before making any decision or conclusion. The following are the different methods
for summarizing raw data.

a) Classification
b) Tabulation
c) Frequency Distribution
d) Graphical Representation
e) Descriptive Measures
- Measures of Central Tendency (Mean, Median, Mode)
- Measures of Dispersion (Range, Quartile Deviation, Mean
Deviation, Variance and Standard Deviation)
- Trimmed Mean
- Shape Characteristics
- Correlation
- Regression
- Stem and Leaf Plot
- Box whisker Plot
- Dot Plot
Classification: The most common summarization technique is classification. This
technique is mostly used for qualitative data, for example to classify observations by
gender, area, region, literacy, etc.

Tabulation: The technique of tabulation is mostly used for quantitative data. In tabulation
we have two types of tables: univariate and bivariate. In a univariate
table we observe the trend, behavior, or pattern of the variable or characteristic of
interest. In a bivariate table we examine the relationship between two variables.
Frequency distribution: A frequency distribution is a table used to organize raw data.
There are two types of frequency distribution: the discrete frequency distribution and
the grouped (continuous) frequency distribution.
Discrete Frequency Distribution: A discrete frequency distribution lists each distinct
value observed in the data set together with the number of times it occurs. It is
constructed when the range of the data is small. The number of
observations corresponding to each value is referred to as its frequency. For example,
the following is the frequency distribution of the number of family members of 25 families.

Number of family members    Number of families
2                           3
3                           12
4                           7
5                           2
6                           1
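
As a sketch, the table above can be stored as a mapping from each value to its frequency and then used to obtain the total, the relative frequencies, and the mean family size. The values are those of the table; the variable names are illustrative.

```python
# Frequency table from the text: family size -> number of families.
family_sizes = {2: 3, 3: 12, 4: 7, 5: 2, 6: 1}

# Total number of families (the frequencies should add up to 25).
total_families = sum(family_sizes.values())

# Relative frequency of each family size.
relative_frequency = {
    size: count / total_families for size, count in family_sizes.items()
}

# Mean family size computed from the frequency table.
mean_size = sum(size * count for size, count in family_sizes.items()) / total_families

print("Total families:", total_families)
for size, count in family_sizes.items():
    print(f"{size}  {count:>2}  {relative_frequency[size]:.2f}")
print("Mean family size:", mean_size)
```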

Continuous Frequency Distribution: A continuous (grouped) frequency distribution is a table
used to organize data into numerically ordered classes (groups or categories). The number
of observations in each class is referred to as the frequency.
Constructing a Frequency Distribution: If the variation within the data set is not very
wide, it is wise to construct an ungrouped frequency distribution to summarize the data. As
the number of observations gets large, this method of summarizing data
becomes difficult and time-consuming. To summarize the data further into a grouped
frequency distribution table, the following steps should be taken:
i) select an appropriate number of non-overlapping class intervals
ii) determine the width of the class interval
iii) determine class limits (or boundaries) for each class interval to avoid
overlapping.

For constructing the frequency distribution, we need the following steps:


Step 1: Determine the range: From the given data set, find the lowest value and the
highest value. The range is the difference between the highest value and the lowest value.

Step 2: Determine the number of classes: If K denotes the number of classes and N the
total number of observations, then K is the smallest exponent of 2 such that
$2^K \geq N$.

Another way to find the value of K is Sturges' rule: $K = 1 + 3.322 \log_{10} N$,
where $\log_{10} N$ is the base-10 logarithm of the total number of observations.

Step 3: Determine the width or the interval of classes: For constructing the frequency
distribution, determine a suitable class interval $i$:
$$i = \frac{\text{Range}}{K} = \frac{\text{Range}}{1 + 3.322 \log_{10} N}$$
Generally we take $i$ to be a multiple of 5.
Step 4: Determine the class limits (boundaries): The limits of each class interval should
be clearly defined so that each observation (element) of the data set belongs to one and
only one class. Class intervals may be of the inclusive type, in which both limits belong
to the class and the classes do not overlap (such as 20-29, 30-39, etc.). Sometimes we
need classes of the exclusive type, in which the upper limit of each class is excluded
from that class (such as 20-30, 30-40, 40-50, etc.).

Step 5: Mid-point of class interval: The class mid-point is the point halfway between the
boundaries of each class. That is, it is the average of the upper limit and the lower limit
of the class.

Step 6: Tally marks: Each observation in the data set is matched with its class and a
tally mark is put against that class. After the whole data set has been processed, the
tallies of each class are counted and written against the corresponding class. These
counts are known as frequencies.
Step 7: Cumulative frequency: If the frequencies are added successively from the top of
the table, class by class, the running totals are known as "less than" cumulative
frequencies. If the same is done from the bottom, they are known as "more than"
cumulative frequencies. If we divide the frequency of each class by the total frequency,
the result is known as the relative frequency.
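
As a rough illustration of Step 7, the sketch below computes "less than" and "more than" cumulative frequencies and relative frequencies. It reuses the family-size frequencies from the discrete frequency distribution given earlier; the variable names are only illustrative and are not part of the original notes.

```python
from itertools import accumulate

# Frequencies from the family-size table earlier in this section.
values = [2, 3, 4, 5, 6]
freqs = [3, 12, 7, 2, 1]

total = sum(freqs)

# "Less than" type: running total from the top of the table.
cum_less = list(accumulate(freqs))

# "More than" type: running total from the bottom of the table.
cum_more = list(accumulate(reversed(freqs)))[::-1]

# Relative frequency: each frequency divided by the total frequency.
rel = [f / total for f in freqs]

print("value  freq  cum<  cum>  rel")
for v, f, cl, cm, r in zip(values, freqs, cum_less, cum_more, rel):
    print(f"{v:>5} {f:>5} {cl:>5} {cm:>5} {r:>5.2f}")
```

The last "less than" value and the first "more than" value both equal the total number of observations, which is a quick check on the arithmetic.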
Example 1: The following data show the total time (in hours) worked by 30 machinists.
Construct a frequency distribution.
90 88 90 89 90 84 86 90 84 89 93 84 90 94 91
94 93 93 92 92 85 88 86 91 87 94 89 85 90 95
Solution: Here the variation within the data set is not very wide, so we construct an
ungrouped frequency distribution as follows:
Table 1: Frequency distribution of total time (in hours) worked by 30 machinists.
Working hours       84  85  86  87  88  89  90  91  92  93  94  95
No. of employees     3   2   2   1   2   3   6   2   2   3   3   1
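
The tallying in Example 1 can be sketched in a few lines of Python. This is only one possible way to do it, assuming the standard-library `collections.Counter` is an acceptable stand-in for manual tally marks; the name `hours` is illustrative.

```python
from collections import Counter

# Working hours of the 30 machinists from Example 1.
hours = [
    90, 88, 90, 89, 90, 84, 86, 90, 84, 89, 93, 84, 90, 94, 91,
    94, 93, 93, 92, 92, 85, 88, 86, 91, 87, 94, 89, 85, 90, 95,
]

# Counter plays the role of the tally marks: one count per distinct value.
freq = Counter(hours)

print("Working hours  No. of employees")
for value in sorted(freq):
    print(f"{value:>13}  {freq[value]:>16}")
```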

Example 2: The following data show the weekly overtime (in hours) of 50 employees in a
reputed fashion design company. Construct a frequency distribution by taking a suitable
class interval.
22 77 79 82 65 50 65 73 60 33 75 66 65 30 63 41 55
65 67 62 45 49 75 59 55 54 51 28 39 25 50 48 68 55
81 35 65 65 79 61 45 53 81 49 37 57 78 27 87 77
Solution: Here $2^6 = 64 \geq 50$, so the number of classes is K = 6. The range of the
data is 87 - 22 = 65. Therefore the width of the class interval is 65/6, or about 10.8.
On the other hand, by Sturges' rule,
$$i = \frac{\text{Range}}{1 + 3.322 \log_{10} 50} = \frac{65}{1 + 3.322 \times 1.6990} \approx 9.78$$
Here the nearest convenient value of the width is 10 (we select a multiple of 5).
Table: Frequency distribution of weekly overtime (in hours) of 50 employees.
Overtime (in hours)    No. of employees
20-30                  4
30-40                  5
40-50                  6
50-60                  10
60-70                  13
70-80                  8
80-90                  4
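
The steps of Example 2 can be reproduced with a short script. This is a sketch under the choices made in the example (class width 10, first class starting at 20, exclusive upper limits); the variable names are illustrative and not taken from the notes.

```python
import math

# Weekly overtime (in hours) of the 50 employees from Example 2.
overtime = [
    22, 77, 79, 82, 65, 50, 65, 73, 60, 33, 75, 66, 65, 30, 63, 41, 55,
    65, 67, 62, 45, 49, 75, 59, 55, 54, 51, 28, 39, 25, 50, 48, 68, 55,
    81, 35, 65, 65, 79, 61, 45, 53, 81, 49, 37, 57, 78, 27, 87, 77,
]

n = len(overtime)
data_range = max(overtime) - min(overtime)

# Rule 1: the smallest K with 2**K >= N.
k_power = 0
while 2 ** k_power < n:
    k_power += 1

# Rule 2: Sturges' rule (usually not a whole number).
k_sturges = 1 + 3.322 * math.log10(n)
width_sturges = data_range / k_sturges

# Choices made in the example: width 10, first class starts at 20.
width, start = 10, 20
n_classes = (max(overtime) - start) // width + 1
counts = [0] * n_classes
for x in overtime:
    # Exclusive method: a value equal to an upper limit goes to the next class.
    counts[(x - start) // width] += 1

print(f"N = {n}, range = {data_range}, K (power of 2) = {k_power}")
print(f"Sturges K = {k_sturges:.3f}, width = {width_sturges:.3f}")
for j, c in enumerate(counts):
    lower = start + j * width
    print(f"{lower}-{lower + width}  mid-point {lower + width / 2}  frequency {c}")
```

Integer division by the class width assigns each observation to exactly one class, which is the requirement stated in Step 4.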

Example: The management of a factory wants to know the monthly working pattern of the
workers of the factory. For this purpose, a survey was conducted on 48 randomly selected
workers of the factory. The following data give the number of hours worked per month by
the 48 workers.

140 165 103 110 130 144 133 204 175 156 187 195
162 161 167 184 151 149 157 124 87 71 79 155
164 40 94 113 108 146 122 87 69 164 116 203
121 128 149 148 30 93 114 104 150 62 143 42
Construct a frequency distribution by using a suitable class interval.
Describing Data: Graphical: Once we carefully define a problem, we will need to collect
data for making decisions. Often the number of observations collected is so large that the
actual findings of the study are unclear. For this reason, it is necessary to summarize data
in such a way that a clear and accurate picture emerges. Unfortunately, there is no single
method or way to describe data. Rather, the appropriate method is typically
problem-specific, depending on two factors: the type of data and the purpose of the study.
Tables and graphs help us to gain a better understanding of data and provide visual
support for improved decision making.

Use of graphs: A graph has the following uses:
(i) It is helpful in explaining the main features of a set of data.
(ii) It is often valuable in suggesting an appropriate method of analysis and in
explaining the conclusions founded upon the analysis.
(iii) It can sometimes pinpoint gross errors in statistical records.
Basic principles of graphs:
(i) A graph should be clear and simple; a complicated graph defeats its own
purpose.
(ii) A graph should be completely self-explanatory.
(iii) The origin, the vertical and the horizontal scales should be so chosen that a
graph does not convey a false impression about the nature of the data.
Limitation of graphs:
(i) They may be misleading, unless drawn and studied with care.
(ii) The conclusions drawn from the graphs should normally be regarded as
tentative and therefore, the graphs are no substitute for more critical statistical
analysis.
Types of diagrams:
(i) bar diagram, (ii) pie diagram, (iii) histogram, (iv) frequency polygon, (v) line diagram,
(vi) ogive curve (vii) scatter diagram.

Bar diagram and pie diagram are mainly used for representing qualitative data. The former
is also frequently used for depicting numerical values of a given item over a period of time.
Histogram, frequency polygon and ogive curve are used to represent frequency
distributions. Line diagram is widely used to study the changes in the values of a variable
with the passage of time. Scatter diagram is very useful in studying the interrelationship of
two variables.
Bar Diagram: This diagram is drawn by constructing a series of blocks of equal width
whose heights are proportional to the values corresponding to different time periods or
categories. The following table shows the expenditure budget (in crore taka) of different
sectors of a country in the year 2012:

Sector of expenditure          Transport  Education  Agriculture  Industry  Others
Expenditure (in crore taka)    25         40         80           70        55

If we put the categories (sectors) on the x-axis and the expenditure on the y-axis, the
diagram will be a bar diagram in which the widths of the bars are equal but the heights
are proportional to the expenditure of the sectors.

[Figure: Bar diagram of expenditure by sector; x-axis: Transport, Education, Agriculture,
Industry, Others; y-axis: expenditure (in crore taka), 0 to 100]

An interesting and useful extension of the simple bar chart can be used when components
of individual categories are also of interest. For example, the following table shows the
number of students enrolled in three business majors in three academic years in the
business faculty of Jahangirnagar University.

Subject              2001-02  2002-03  2003-04
Accounting           40       50       70
Marketing            70       80       90
Finance & Banking    45       55       80

[Figure: Bar chart of enrollment by year (2001-02, 2002-03, 2003-04) with one series each
for Accounting, Marketing and Finance & Banking; y-axis: number of students, 0 to 100]

This information can be shown in a bar chart by breaking down the total number of students
for each year so that the three components are distinguished by differences in shading or
color; such a chart is called a component bar chart. This graph allows us to make visual
comparisons of totals and individual components. In this example it appears that the
increase in enrollment between 2001 and 2004 was almost uniform over the three majors.
Pie diagram: Pie charts are also used to describe categorical data. If we want to draw
attention to the proportion of frequencies in each category, then we will probably use a
pie chart to depict the division of a whole into its constituent parts. The circle
represents the total, and the segments cut from its center depict shares of that total.
The following table shows the distribution of monthly expenditure of the students of JU.

Item of expenditure    Expenditure (in taka)
Food                   6500
Clothing               2500
House rent             7000
Education              3500
Miscellaneous          3500

[Figure: Pie diagram of monthly expenses (in taka) of JU students; segments: Food,
Clothing, House rent, Education, Miscellaneous]
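
To draw the pie diagram by hand, each item's share of the total has to be converted into a central angle. The sketch below assumes the usual convention that the full circle of 360 degrees represents the total; the dictionary name and layout are illustrative only.

```python
# Monthly expenditure (in taka) of JU students, from the table above.
expenses = {
    "Food": 6500,
    "Clothing": 2500,
    "House rent": 7000,
    "Education": 3500,
    "Miscellaneous": 3500,
}

total = sum(expenses.values())

# Share of the total for each item, and the matching central angle,
# assuming the whole circle (360 degrees) represents the total.
shares = {item: amount / total for item, amount in expenses.items()}
angles = {item: share * 360 for item, share in shares.items()}

for item in expenses:
    print(f"{item:<14} {shares[item] * 100:5.1f}%  {angles[item]:6.1f} degrees")
```

The angles must add up to 360 degrees, which gives a simple check before drawing.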

Histogram: It is a graphical method for representing a frequency distribution. To construct
this diagram the horizontal axis (x-axis) is divided into segments corresponding to the class
boundaries of the frequency distribution. On each segment a rectangle with area
proportional to the frequency in the class is erected. The set of adjacent rectangles so
constructed constitutes a histogram. The following distribution shows the audit time of 20
clients, with classes formed by the exclusive method:

Audit time (in hours)    10-15  15-20  20-25  25-30  30-35
Number of clients        4      8      5      2      1

[Figure: Histogram; x-axis: audit time (10-15, 15-20, 20-25, 25-30, 30-35); y-axis:
number of clients, 0 to 9]

Frequency Polygon: It is a diagram used to represent a frequency distribution. The
mid-values of the class intervals are plotted along the x-axis and the corresponding
frequencies are plotted along the y-axis. These points are then joined by straight lines,
forming with the x-axis a polygon called the frequency polygon. The frequency polygon
should be brought down to the x-axis at each end by joining it to the mid-value (on the
base line) of the next outlying interval.

Audit time (in hours)    10-20  20-30  30-40  40-50  50-60  60-70
Number of clients        4      8      12     7      5      4

[Figure: Frequency polygon; x-axis: mid value (15, 25, 35, 45, 55, 65); y-axis:
frequency, 0 to 14]
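
The points of the frequency polygon for the audit-time table can be listed as follows. This is a minimal sketch, assuming equal class widths so that the "next outlying interval" on each side has a mid-value one class width away; names such as `vertices` are illustrative.

```python
# Class intervals and frequencies from the audit-time table above.
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
freqs = [4, 8, 12, 7, 5, 4]

# Assumes all classes have the same width.
width = classes[0][1] - classes[0][0]

# Mid-value of each class: average of the lower and upper limits.
mids = [(lo + hi) / 2 for lo, hi in classes]

# Close the polygon on the x-axis at the mid-values of the
# next outlying (empty) intervals on each side.
xs = [mids[0] - width] + mids + [mids[-1] + width]
ys = [0] + freqs + [0]

vertices = list(zip(xs, ys))
for x, y in vertices:
    print(f"mid value {x:>5.1f}  frequency {y}")
```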

Ogive (less than): On the x-axis we plot the upper limit of each class and on the y-axis
we plot the corresponding "less than" cumulative frequency.

Class interval          5-10  10-15  15-20  20-25  25-30  30-35
Frequency               5     7      11     8      4      2
Cumulative frequency    5     12     23     31     35     37

[Figure: Ogive curve; x-axis: upper limits (10, 15, 20, 25, 30, 35); y-axis: cumulative
frequency, 0 to 40]
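
The points plotted on the "less than" ogive can be generated directly from the frequency table. The sketch below is one way to do it with the standard library; the names `upper_limits` and `points` are illustrative.

```python
from itertools import accumulate

# Class intervals and frequencies from the ogive table above.
classes = [(5, 10), (10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]
freqs = [5, 7, 11, 8, 4, 2]

# x-coordinates: upper limit of each class.
upper_limits = [hi for _, hi in classes]

# y-coordinates: "less than" cumulative frequencies (running totals).
cum_less = list(accumulate(freqs))

points = list(zip(upper_limits, cum_less))
for x, y in points:
    print(f"less than {x:>2}: {y}")
```

Joining these points in order gives the ogive; the last point always sits at the total frequency.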

Graphs to Describe Time-series Data:


Line diagram: If we are given the values of a variable at different points of time, the
set of values is known as a time series. The line diagram is used to represent this type
of data. In this diagram time is represented along the x-axis and the variable is plotted
along the y-axis. Thus we get a point for each time period, and successive points, when
connected by straight lines, give the desired diagram.

Year of enrollment               2001  2002  2003  2004  2005  2006  2007  2008  2009  2010
No. of students (in hundreds)    12    14    15    17    18    15    16    19    20    21

[Figure: Line diagram titled "Enrollment"; x-axis: year (1 to 10); y-axis: number of
students (in hundreds), 0 to 25]

The situations in which the line diagram is particularly useful are:
a) When the emphasis is on the movement of a variable rather than on its actual
magnitude.
b) When several series are compared on the same chart.
c) When estimates or forecasts of a variable are to be obtained or displayed
graphically.
Scatter diagram: Sometimes the data consist of paired values of two related variables, and
the statistical problem is to investigate the inter-relationship between the variables.
Examples of such related variables are: height and weight, income and expenditure,
price and consumption, etc. When the given pairs of values are plotted on ordinary graph
paper, we get a scatter diagram. If the plotted points form an upward trend on the graph
paper, then the relationship between the variables is positive. If they form a downward
trend, then the relationship between the two variables is negative.

Expense of Ad.    10  12  15  20  23   9   6   7  11  12  13
Sales (in lac)    14  17  23  21  25  11   8   9  14  13  27
[Figure: Scatter diagram of sales (y-axis, 0 to 30) against expense of advertisement
(x-axis, 0 to 25)]
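
The direction of the relationship can also be checked numerically rather than by eye. The sketch below uses the sum of cross-products of deviations and the Pearson correlation coefficient; correlation is only listed in the outline of this chapter, so treating it as the numerical check here is an assumption, and the variable names are illustrative.

```python
# Paired observations from the table above.
ad_expense = [10, 12, 15, 20, 23, 9, 6, 7, 11, 12, 13]
sales = [14, 17, 23, 21, 25, 11, 8, 9, 14, 13, 27]

n = len(ad_expense)
mean_x = sum(ad_expense) / n
mean_y = sum(sales) / n

# Sum of cross-products of deviations from the means:
# its sign gives the direction of the relationship.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_expense, sales))
sxx = sum((x - mean_x) ** 2 for x in ad_expense)
syy = sum((y - mean_y) ** 2 for y in sales)

# Pearson correlation coefficient, always between -1 and +1.
r = sxy / (sxx * syy) ** 0.5

direction = "positive" if sxy > 0 else "negative" if sxy < 0 else "none"
print(f"r = {r:.2f}, relationship: {direction}")
```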

Stem-and-leaf display: The stem-and-leaf display is another form of presentation of
quantitative data. It allows us to condense the data while still retaining the individual
observations. This presentation shows the range, concentration, presence of outliers, if
any, and distribution of the data set at a glance.
The stem of an observation is the leading digit or digits and the leaf of an observation is
the trailing digit. All the stem values are listed in order in a column, a vertical line is
drawn beside them, and then all the corresponding leaf values are recorded for each stem
in a row, to the right of the vertical line.
Steps for construction of a stem-and-leaf plot or display:
a) Divide each observation into two parts: the stem and the leaf.
b) List the stems in a column, with a vertical line to their right.
c) For each observation, record the leaf portion in the same row as its corresponding stem.
d) Order the leaves from lowest to highest in each stem.
e) Mention the leaf unit so that the actual observations can be recovered.

Example: The prices (in taka) of 20 different brands of walking shoes are given below:

45 70 70 55 75 73 70 65 68 60 74 80 83 58 68 85 90 64 75 82

Construct a stem and leaf plot to display the distribution of the data.
Solution: The stem-and-leaf display of the data is as follows:

Stem | Leaf
4    | 5
5    | 5 8
6    | 5 8 0 8 4
7    | 0 0 5 3 0 4 5
8    | 0 3 5 2
9    | 0

Arranging the leaves of each stem in ascending order, we get:

Stem | Leaf
4    | 5
5    | 5 8
6    | 0 4 5 8 8
7    | 0 0 0 3 4 5 5
8    | 0 2 3 5
9    | 0

From the display it is seen that the lowest price of a walking shoe is 45 and the highest
is 90, and the most common price is 70.
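
The construction steps can be sketched in code as well. This example assumes two-digit prices, so the stem is the tens digit and the leaf is the units digit (leaf unit = 1); the names `prices` and `stems` are illustrative.

```python
from collections import defaultdict

# Prices (in taka) of the 20 brands of walking shoes.
prices = [45, 70, 70, 55, 75, 73, 70, 65, 68, 60,
          74, 80, 83, 58, 68, 85, 90, 64, 75, 82]

# Step a) and c): split each observation and record the leaf against its stem.
stems = defaultdict(list)
for p in prices:
    stem, leaf = divmod(p, 10)  # leading digit(s), trailing digit
    stems[stem].append(leaf)

# Step b), d) and e): stems in a column, ordered leaves, leaf unit stated.
print("Stem | Leaf   (leaf unit = 1)")
for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem:<4} | {leaves}")
```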

Dot Plot: A dot plot displays data as dots or small circles placed along a number line,
where each dot stands for a value, so that the plot shows the distribution of a numerical
variable.

A dot plot is similar to a simplified histogram or bar diagram, as the height of the
column formed by the dots represents the numerical value for each category. Dot plots are
used to represent small amounts of data. For example, a dot plot can be used to display
the vaccination report of newborns in an area, which is given in the following table.

Colony                         A  B  C  D
Number of babies vaccinated    7  3  5  1

Now let's see the number of newborn babies who got a vaccine in each colony. Colony A
has a total of 7 dots, which means that seven babies have been vaccinated. Similarly, colony
B has three babies, colony C has five babies, and colony D has one baby who has been
vaccinated. The other way to represent it through a dot plot is given below:

Types of Dot Plot

There are two types of dot plot: Wilkinson dot plot and Cleveland dot plot.

Wilkinson Dot Plot

The Wilkinson dot plot represents the distribution of continuous data in the form of
individual dots for each value. For example, if 10 students like math it is represented by 10
dots on a dot plot. In the above example of the number of kids vaccinated, the first graph
showing 7 dots for colony A, 3 dots for colony B, etc. is an example of a Wilkinson dot
plot.

Cleveland Dot Plot

The Cleveland dot plot is a good alternative to a simple bar chart if you have more than a
few categories, since a bar chart quickly looks cluttered. Many more values can be shown
in a dot plot in the same amount of space, and it is also simpler to read. This type of
plot is similar to a bar chart but uses the position of a single dot instead of the length
of a bar formed by multiple dots. Just as the height of a bar represents the number of
items, the position of the dot on the number line or on the graph represents the number of
items for that category. In the above example of vaccinated children in 4 colonies, the
second graph showing only one dot for each colony is an example of a Cleveland dot plot.
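
The difference between the two styles can be seen in a rough text rendering of the vaccination data. This is only a sketch with characters standing in for drawn dots: the "stacked" rows follow the description of the Wilkinson style and the "positioned" rows follow the description of the Cleveland style. All names are illustrative.

```python
# Number of babies vaccinated in each colony, from the table above.
vaccinated = {"A": 7, "B": 3, "C": 5, "D": 1}
top = max(vaccinated.values())

# Stacked style: one dot per baby, so the row length equals the count.
stacked = {c: "o" * n for c, n in vaccinated.items()}

# Position style: a single dot whose position on a 1..top scale
# marks the count for that colony.
positioned = {c: "." * (n - 1) + "o" + "." * (top - n)
              for c, n in vaccinated.items()}

print("Colony | stacked | positioned")
for c in vaccinated:
    print(f"{c:<6} | {stacked[c]:<{top}} | {positioned[c]}")
```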

Example 1: The following dot plot illustrates each student's essay score in Mr. Jhonson's
class. A different student is represented by each dot. What was the minimum essay score
earned by a student and what is the score earned by the maximum number of students?

Solution: The dot plot shows the number of students who received each score for their
essays on a 6-point scale.

• The minimum essay score that a student received is 2 points.


• Four students earned 3 marks, which is the score earned by the maximum number
of students.

Thus, the minimum essay score that a student received is 2 points and 3 is the marks earned
by the maximum number of students.

Example 2: The following dot plot shows the height of each toddler (a child who has just
started to walk) at Mrs. Bell's daycare. Each dot represents a different toddler. What is
the height of the shortest toddler?
