Introduction to Statistics
WHAT IS STATISTICS?
Sir Ronald Fisher: Father of modern statistics.
Father of Indian Statistics: Prof. Prasanta Chandra Mahalanobis.
The words ‘Statistics’ and ‘Statistical’ are derived from the Latin word ‘Status’, meaning
political state.
Statistics is a branch of mathematics.
Statistics is defined as the method of collecting, organizing, summarizing, analyzing
and interpreting numerical data.
USE & APPLICATION OF STATISTICS
► Industries & Business , Agriculture, Education, Ecological Studies , Medical Studies,
Sports, Engineering, Economics, Environment, Physics, Chemistry, Biology,
Astronomy, Psychology, Forestry and so on…
WHAT IS BIOSTATISTICS?
Father of Biostatistics – Sir Francis Galton
Biostatistics is the branch of statistics applied to biological or medical sciences.
Biostatistics comprises the statistical methods used in the health sciences, such as
biology, medicine, nursing and public health.
APPLICATION OF BIOSTATISTICS
In medicine
In community medicine & public health
In Physiology & Anatomy
In Pharmacology
Genomics, population genetics and statistical genetics, in order to link variation in
genotype with variation in phenotype
Biological sequence analysis
Ecology, ecological forecasting
BRANCHES OF STATISTICS
► Statistics can be divided into two branches
1. Descriptive Statistics
Descriptive statistics are used to organize and summarize the data in the form of tables,
graphs and numbers.
2. Inferential Statistics
Statistical methods used to draw conclusions about the whole population using a sample are
called Inferential Statistics. They include methods such as estimation and testing of hypotheses.
(To reach decisions about a large body of data by examining only a small part of the data.
This is done by using various tests of hypotheses, magnitude of associations etc.)
DATA
Data is a collection of information gathered from the population.
OR, a collection of numerical observations of known facts is called data.
Data collection
Data collection is the process of acquiring information from different sources about
the topic under research. The function is performed by the researcher himself/herself or by
his/her team.
SOURCES OF DATA
PRIMARY DATA
Primary data are first-hand data, collected by the researcher for the first time, and
are original in nature.
OR, Primary data refers to the data collected by the researcher, for the very first time,
from different sources, with a particular problem, question or specific purpose in mind.
It is useful for current studies as well as for future studies.
The sources of primary data are primary units such as basic experimental units,
individuals, households.
SECONDARY DATA
Secondary data are second-hand data which have been collected by someone else
and which are readily available from other sources.
OR, Secondary Data is the data collected by any person, organization or agency in the
past through surveys, experiments or study, for some other purpose, but used by the
researcher to deal with the problem at hand.
It involves less cost, time and effort.
These are usually found in journals, periodicals, research publications, official records etc.
Basis | Primary Data | Secondary Data
Nature | First-hand data, collected by the researcher for the first time and original in nature | Second-hand data which have been collected by someone else and which are readily available from other sources
Nature of data | Real-time data | Past data
Form | Pure and raw | Refined
Facts and figures | Freshly collected for the project | Already collected and recorded
Process | Time consuming | Quick and easy
Data | Related to the objective of the researcher | Adjusted or used according to the need
Cost | Expensive | Economical
Information | First-hand | Second-hand
Accuracy & Reliability | More | Comparatively less
Source | Survey, Experiment, Interview, Observation, Questionnaire | Books, Journals, Newspapers, Internal records, Government Publications, Websites, etc.
VARIABLES
A variable is any characteristic, number, or quantity that can be measured or
counted.
Examples: Age, Sex, Caste, Blood pressure, weight, height, etc.
Variables
  Quantitative Variables
    Discrete variables
    Continuous variables
  Qualitative Variables
Quantitative Variables (Quantitative = Quantity)
Data are measures of values or counts and written down with numbers.
It also can be known as numerical variables
Example:- height, weight, number of Facebook friends, etc.
a) Discrete variables assume exact values only and can be obtained by counting. Or a
quantitative variable that can assume a countable number of values.
Example: Total number of students in a class
b) Continuous variables assume infinite values within a specified interval and can be
obtained by measurement. Or a quantitative variable that can assume an uncountable
number of values.
Example: Height of the students.
Qualitative Variables (Qualitative = Quality)
Not measured numerically; they can only be categorized.
Qualitative variables are variables that can be placed into distinct categories,
according to the attribute or characteristic.
It also can be known as non-numerical variables
Example:- sex, religion, nationality, blood group, hair color, etc.
SCALES OF MEASUREMENT
There are four principal scales used to measure data.
1. Nominal Scale
2. Ordinal Scale
3. Ratio Scale
4. Interval Scale
1.Nominal Scale
Nominal scales are used for labelling variables, without any quantitative value.
“Nominal” scales could simply be called “labels”.
Simply, the data are alphabetic or numerical in name only and do not include any
notion of measurement.
(A nominal scale usually deals with non-numeric variables or with numbers that do not
have any value.)
Examples: Male, Female; Alive, Dead;
Presence or absence of disease.
Nominal data can be broken down again into three categories:
1. Nominal with order: Some nominal data can be sub-categorised in order, such as “cold,
warm, hot and very hot.”
2. Nominal without order: Nominal data can also be sub-categorised as nominal without
order, such as male and female.
3. Dichotomous: Dichotomous data is defined by having only two categories or levels, such
as “yes” and “no”.
2.Ordinal Scale
The ordinal scale defines data that is placed in a specific order. While each value is
ranked, there’s no information that specifies what differentiates the categories from each
other.
These values can’t be added to or subtracted from.
(A qualitative variable that incorporates an ordered position, or ranking. Involves data
that may be arranged in some order.)
Examples: size of t-shirt, education level,
Level of satisfaction (very satisfied, satisfied, and somewhat satisfied)
3.Ratio Scale
Data classified as the ratio of two numbers.
Quantitative classification.
Zero point of scale is absolute (data can be added, subtracted, multiplied, and divided)
Example: number of votes,
Weight. Less than 55 kgs
55 – 75 kgs
76 – 85 kgs
86 – 95 kgs
More than 95 kgs
4.Interval Scale
Data classified by ranking.
Quantitative classification
Zero point of scale is arbitrary (differences are meaningful).
(It is defined as a quantitative measurement scale in which the difference between the two
variables is meaningful. In other words, the variables are measured in an exact manner,
not as in a relative way in which the presence of zero is arbitrary.)
Example: Haemoglobin; 8-10, 10-12, 12-14
Fahrenheit temperature scale (on a temperature scale, the difference between 20 °C
and 30 °C is the same as the difference between 50 °C and 60 °C.)
Qualitative variables are measured on a nominal or ordinal scale.
Quantitative variables are measured on interval or ratio scale.
Properties Nominal Ordinal Interval Ratio
Labelled variables ✔ ✔ ✔ ✔
Meaningful order of variables ✖ ✔ ✔ ✔
Measurable difference ✖ ✖ ✔ ✔
The absolute value of zero ✖ ✖ ✖ ✔
PRESENTATION OF DATA
Data can be represented in countless ways.
The presentation of data means exhibition of the data in such a clear and attractive
manner that these are easily understood and analysed.
The three main forms of presentation of data are:
1. Textual presentation
2. Tabular form
3. Diagrammatic and Graphical presentation
1. TEXTUAL PRESENTATION OF DATA
Data can be presented using paragraphs or sentences.
It is a combination of texts and figures.
This includes enumeration of important characteristics, emphasizing the most
significant features and highlighting the most striking attributes of the set of data.
Example: Of the 150 people interviewed in the sample, the following complaints were noted: 27
for lack of books in the library, 25 for a dirty playground, 20 for lack of laboratory
equipment, 17 for poorly maintained university buildings.
2. TABULAR FORM
Method of presenting data using a statistical table.
Tables convey information that has been converted into words or numbers arranged in
rows and columns.
A more effective device of presenting data.
1. Stem and leaf plots
2. Frequency distribution table
3. Contingency table
Format of a Table
General principles in designing tables
The tables should be numbered.
A title must be given to each table.
The headings of columns & rows should be clear & concise.
The data must be presented according to size or importance.
If percentages or averages are to be compared, they should be placed as close as
possible.
No table should be too large.
Source notes may be given.
Table 1: Population of some states in India
Source: Census of India, 2001
FREQUENCY DISTRIBUTION TABLE
In the frequency distribution table, the data is first split up into convenient groups (class
interval) and the number of items (frequency) which occur in each group is shown in adjacent
columns.
Class interval
• It is defined as the difference between the upper class limit and the lower class limit.
• Class intervals should be equal.
• E.g.: (10-20), (20-30)
• The lowest value of a class is called the lower limit
• The highest value of a class is called the upper limit
INCLUSIVE & EXCLUSIVE CLASS
Inclusive class interval
• If the lower as well as upper limits are inclusive in the same class, such class is called
Inclusive class.
• The upper limit of the class and lower limit of the next class are not equal
Exclusive class interval
• If lower limit is included in the same class and upper limit is excluded from that class,
but included in the next class, such a class is called Exclusive class.
• The upper limit of the class and lower limit of the next class are equal
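As a minimal sketch of the exclusive method, the following Python snippet groups raw values into classes of equal width, where each class includes its lower limit and excludes its upper limit. The marks, the class width and the function name `frequency_table` are hypothetical and chosen only for illustration.

```python
def frequency_table(values, start, width, n_classes):
    """Count values in exclusive classes: lower limit included, upper limit excluded."""
    table = []
    for i in range(n_classes):
        lower = start + i * width
        upper = lower + width
        # A value equal to the upper limit is counted in the next class.
        freq = sum(1 for v in values if lower <= v < upper)
        table.append((lower, upper, freq))
    return table

# Hypothetical marks of 12 students (illustrative data only).
marks = [12, 25, 20, 37, 30, 18, 29, 10, 33, 21, 39, 15]
freq_table = frequency_table(marks, start=10, width=10, n_classes=3)

for lower, upper, freq in freq_table:
    print(f"{lower}-{upper}: {freq}")
```

Note that the value 20 falls in the class 20-30 and the value 30 falls in the class 30-40, as the exclusive method requires.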
Rules for construction of frequency table
The class interval should not be too large or too small.
The number of classes should be more than 8 and less than 15.
The class interval should be equal and uniform throughout the classification.
After construction of table, proper and clear heading should be given to it.
The base or source of data should be mentioned with the pattern of analysis in footnote at the
end of table.
Tally Marks
Tally marks are a form of numeral system in which vertical lines are used for counting.
A cross line (\) is placed over four vertical lines (|) to give a total of five.
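The grouping in fives can be sketched in Python as below. The function name `tally` is hypothetical, and the text form `||||\` is only an approximation of a hand-drawn tally, with the backslash standing for the cross line.

```python
def tally(count):
    """Return tally marks for a count, grouped in fives."""
    groups, rest = divmod(count, 5)
    # Each complete group of five is written as four bars and a cross line.
    parts = ["||||\\"] * groups
    if rest:
        parts.append("|" * rest)
    return " ".join(parts)

print(tally(7))  # one group of five followed by two single bars
```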
CONTINGENCY TABLE
A contingency table, also known as a cross-classification table, describes the relationships
between two or more categorical variables.
A table cross-classifying two variables is called a 2-way contingency table and forms a
rectangular table with rows for the R categories of the X variable and columns for the C
categories of a Y variable.
A contingency table having R rows and C columns is called an R x C table.
Example:
                          Smoke
Alcohol consumption    Yes    No
Low                    10     80
High                   50     40
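A short Python sketch of the same 2 x 2 table, stored as a nested dictionary, shows how the row totals, column totals and grand total can be obtained. The variable names are illustrative; the cell counts are those of the example.

```python
# Cell counts of the example: alcohol consumption (rows) by smoking (columns).
table = {
    "Low": {"Yes": 10, "No": 80},
    "High": {"Yes": 50, "No": 40},
}

# Row totals: sum across the columns of each row.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}

# Column totals: sum down the rows of each column.
col_totals = {col: sum(table[row][col] for row in table) for col in ("Yes", "No")}

grand_total = sum(row_totals.values())

print(row_totals)
print(col_totals)
print(grand_total)
```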
3. DIAGRAMMATIC AND GRAPHICAL PRESENTATION
A most effective device of presenting data.
Charts and diagrams are useful methods of presenting simple data.
They are helpful in understanding the relationship between variables.
Diagrams are better retained in memory than statistical tables.
Gives information at a glance.
Common Charts & Diagrams
• Bar diagram
• Histogram
• Pie diagram
• Line Diagram
• Frequency polygon and Frequency curve
• Scatter Diagram
• Stem & Leaf Diagram
BAR DIAGRAM
Used to display frequency distribution for nominal or ordinal data.
Data is presented in the form of rectangular bar of equal breadth.
Each bar represents one variant/attribute.
Suitable scale should be indicated and scale starts from zero.
The width of the bar and the gaps between the bars should be equal throughout.
The length of the bar is proportional to the magnitude/ frequency of the variable.
The bars may be vertical or horizontal.
Simple bar diagram: Horizontal or vertical bars with the same width, drawn with their bases
on the same horizontal or vertical line, with equal gaps in between.
Multiple bar diagram: More than one sub-attribute of a variable can be expressed.
Component bar diagram: When there are many categories on the X-axis (more than 5) and they
have further subcategories, then to accommodate the categories, the bars may be divided into
parts, each part representing a certain item and proportional to the magnitude of that
particular item.
HISTOGRAM
Used for quantitative, continuous variables.
It is used to present variables which have no gaps.
It consists of a series of blocks. The class intervals are given along the horizontal axis and
the frequency along the vertical axis.
Differences between bar diagram and histogram?
In a histogram no space is left in between two rectangles, but in a bar diagram some space
must be left between two consecutive bars.
We can have a bar diagram both for discrete and continuous variables, but the histogram is
drawn only for a continuous variable.
PIE CHARTS
The “pie chart” is also known as a “circle chart”.
Most common way of presenting data.
The value of each category is divided by the total of all values and then multiplied by 360;
each category is then allocated the resulting angle (in degrees) to represent its proportion.
It is often necessary to indicate percentages in the segments, as it is not always easy
to compare the areas of segments visually.
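The angle calculation can be sketched in Python as follows. The blood group counts and the function name `pie_angles` are hypothetical, used only to illustrate the arithmetic.

```python
def pie_angles(counts):
    """Angle in degrees for each category: value / total * 360."""
    total = sum(counts.values())
    return {category: value * 360 / total for category, value in counts.items()}

# Hypothetical counts of blood groups among 100 people (illustrative data only).
blood_groups = {"A": 20, "B": 30, "AB": 10, "O": 40}
angles = pie_angles(blood_groups)

for category, angle in angles.items():
    print(f"{category}: {angle} degrees")
```

The angles of all categories always add up to 360 degrees, which is a quick check on the calculation.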
LINE DIAGRAM
It is the simplest type of diagram. (line graph, line chart)
It is a chart that shows a line joining several points or a line that shows the relation
between the points.
A line graph is used for showing trends over a particular period of time.
The variable is taken on the X-axis and the frequency of the observations on the Y-axis.
Used to illustrate the relationship between continuous quantities.
Used to compare two or more groups
FREQUENCY POLYGON & FREQUENCY CURVE
Frequency polygon and frequency curve are alternative forms of the histogram.
A frequency polygon is obtained by connecting the midpoints at the top of the bars of a
histogram with line segments.
A frequency polygon is a graph in which line segments “connecting the dots” depict a
frequency distribution.
The class endpoints are scaled along the x axis and the frequency values along the y axis.
A dot is then plotted for the frequency value at the midpoint of each class interval (class
midpoint).
Connecting these midpoints by straight lines gives a frequency polygon.
Connecting these midpoints by a smooth line instead of straight lines gives a frequency
curve.
Frequency polygons can also be drawn independently without drawing histograms.
The midpoints of the class intervals known as class marks are used to plot the points.
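As a small sketch, the points to be plotted for a frequency polygon can be computed in Python by pairing each class mark with its frequency. The classes and frequencies below are hypothetical.

```python
# Hypothetical class intervals and their frequencies (illustrative data only).
classes = [(10, 20), (20, 30), (30, 40)]
frequencies = [4, 7, 3]

# Class mark = midpoint of the class = (lower limit + upper limit) / 2.
points = [
    ((lower + upper) / 2, freq)
    for (lower, upper), freq in zip(classes, frequencies)
]

print(points)  # (class mark, frequency) pairs to be joined by straight lines
```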
Difference between a frequency polygon and a frequency curve?
The only difference between a frequency curve and a frequency polygon is that:
A frequency polygon is drawn by joining the points with straight lines.
A frequency curve is drawn as a smooth freehand curve.
When frequency polygon is smoothed out then it is known as frequency curve.
SCATTER PLOTS
Scatter diagrams show the relationship between the two variables.
Also called: X-Y graph.
A scatter diagram is widely known as a correlation chart
It is used to summarize a set of discrete or continuous observations.
The x-axis represents the independent variable, while the y axis represents the dependent
variable.
Advantage:- Each observation is represented individually, no information is lost.
Disadvantage:- Difficult to read if many data points lie close together.
Positive correlation: In this case, as the value of X increases,
the value of Y will increase too, which means that the
correlation between the two variables is positive.
Negative correlation: In this case, as the value of X increases,
the value of Y will decrease.
No correlation: In this case, the data points are spread randomly,
with no apparent pattern.
STEM AND LEAF DISPLAY or STEM PLOT
• A Stem and Leaf Plot is a special table where each data value is split into a "stem"
(the first digit or digits) and a "leaf" (usually the last digit).
• "32" is split into "3" (stem) and "2" (leaf).
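A minimal Python sketch of this splitting is shown below, assuming non-negative whole numbers, with the last digit as the leaf. The data values and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each value into a stem (leading digits) and a leaf (last digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical data values (illustrative data only).
data = [32, 15, 27, 34, 21, 38, 19]
plot = stem_and_leaf(data)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```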
SUMMARIZING DATA
MEASURES OF AVERAGE & DISPERSION
Average / Central tendency
The “average” is a value in the distribution around which the other values are distributed.
The central tendency is a descriptive summary of a data set: through a single value,
it reflects the centre of the data distribution.
Measures of Average / Central tendency
1) Mean
2) Median
3) Mode
MEAN
The sum of the observations divided by the no. of observations.
The Greek letter μ (“mu”) is used as the symbol for the
population mean, and the symbol x̄ is used to represent the mean of a sample.
$$\text{Mean} = \frac{\text{Total sum of observations}}{\text{Number of observations}}, \qquad \bar{x} = \frac{\Sigma x}{n}$$
Here,
Σ represents the addition of values
x represents each value in the data set
x̄ represents the mean of the data set
n represents the number of data values
Arithmetic mean of grouped data:
$$\bar{x} = \frac{\Sigma f x}{\Sigma f}$$
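Both formulas can be checked with a few lines of Python. The numbers are hypothetical; in the grouped case, `f` is taken to be the frequency with which each value `x` occurs.

```python
# Ungrouped data: mean = sum of observations / number of observations.
values = [4, 8, 6, 2]
mean = sum(values) / len(values)  # 20 / 4

# Grouped data: mean = sum of (f * x) / sum of f.
x = [1, 2, 3]  # distinct values (or class midpoints)
f = [2, 3, 5]  # frequency of each value (hypothetical)
grouped_mean = sum(fi * xi for fi, xi in zip(f, x)) / sum(f)  # 23 / 10

print(mean)
print(grouped_mean)
```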
MEDIAN
Median is defined as the middle value of the observations, when the values are
arranged in ascending or descending order. (The median of a set of data is the “middle
element” when the data are arranged in ascending order.)
OR The median is the value that is in the middle when the data points are sorted from
smallest to largest.
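$$\text{Median} = \begin{cases} \left(\dfrac{n+1}{2}\right)^{th} \text{ observation}, & \text{if } n \text{ is odd} \\[2ex] \dfrac{\left(\dfrac{n}{2}\right)^{th} \text{ observation} + \left(\dfrac{n}{2}+1\right)^{th} \text{ observation}}{2}, & \text{if } n \text{ is even} \end{cases}$$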
Here, n represents the number of data values
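A minimal Python sketch of the median is given below. The function name and data are hypothetical. Python counts positions from 0, so the middle observation of an odd-sized data set sits at index n // 2, and for an even-sized data set the two middle observations sit at indices n // 2 - 1 and n // 2.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: a single middle observation.
        return ordered[mid]
    # Even n: average of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_median = median([7, 3, 9])       # sorted: 3, 7, 9
even_median = median([7, 3, 9, 1])   # sorted: 1, 3, 7, 9

print(odd_median)
print(even_median)
```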
MODE
The mode is the value with the highest frequency in the data. (The most commonly occurring
value in the given dataset)
Empirical relationship between the three measures of central tendency is
2 Mean = 3 Median – Mode
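The sketch below finds the mode by counting frequencies, and then rearranges the empirical relationship to estimate the mode from a given mean and median. All numbers and the function name `modes` are hypothetical; the function returns a list because a data set may have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

single_mode = modes([2, 3, 3, 5, 3, 2])  # 3 occurs three times
two_modes = modes([1, 1, 2, 2])          # 1 and 2 both occur twice

# Empirical relationship rearranged: Mode = 3 Median - 2 Mean.
mean, median = 50, 52  # hypothetical summary values
estimated_mode = 3 * median - 2 * mean

print(single_mode, two_modes, estimated_mode)
```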
Dispersion
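The measures of dispersion help to interpret the variability of data. There are two types of
measures of dispersion: absolute measures (such as range, variance and standard deviation)
and relative measures (such as the coefficient of variation).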
Measures of Dispersion
1. Range
2. Variance
3. Standard deviation
4. The coefficient of variation
RANGE
The range is the most straightforward measure of spread. It's the difference between the
largest observed value (the maximum) and the smallest observed value (the minimum).
Range = Highest value – Lowest value
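In Python the range takes one line; the observations below are hypothetical.

```python
# Hypothetical observations (illustrative data only).
values = [12, 7, 25, 18]

# Range = Highest value - Lowest value.
value_range = max(values) - min(values)

print(value_range)
```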
VARIANCE
It is based on the squared distances between the values of the individual cases and the mean.
To calculate the squared distance between a value and the mean, just subtract the mean from
the value and then square the difference.
$$\text{Variance} = \frac{\Sigma (x - \bar{x})^2}{n - 1}$$
Here,
Σ represents the addition of values
x represents each value in the data set
x̄ represents the mean of the data set
n represents the number of data values
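The steps of the variance calculation can be sketched in Python as below, using the n − 1 divisor of the formula. The data and the function name `sample_variance` are hypothetical.

```python
def sample_variance(values):
    """Sum of squared distances from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    # Subtract the mean from each value, square the difference, and add up.
    squared_distances = [(x - mean) ** 2 for x in values]
    return sum(squared_distances) / (n - 1)

# Hypothetical data: mean is 4, squared distances are 4, 0, 0, 4.
data = [2, 4, 4, 6]
variance = sample_variance(data)

print(variance)  # 8 / 3
```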
STANDARD DEVIATION
Take the square root of the variance and obtain what’s known as the standard deviation.
The Greek letter 𝝈 (“sigma”) is used as the symbol for population SD and the symbol ‘s’
(small letter ‘s’) used to represent the SD of a sample.
$$\text{S.D} = \sqrt{\frac{\Sigma (x - \bar{x})^2}{n - 1}}$$
Here,
Σ represents the addition of values
x represents each value in the data set
x̄ represents the mean of the data set
n represents the number of data values
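A short Python sketch of the standard deviation, taken as the square root of the variance with the n − 1 divisor, is given below. The heights and the function name `sample_sd` are hypothetical.

```python
import math

def sample_sd(values):
    """Square root of the variance (n - 1 divisor)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return math.sqrt(variance)

# Hypothetical heights in cm: mean is 160, squared distances are 100, 0, 100.
heights_cm = [150, 160, 170]
sd = sample_sd(heights_cm)

print(sd)  # square root of 200 / 2
```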
COEFFICIENT OF VARIATION
The coefficient of variation (CV) is a measure of relative variability. It is the ratio of the
standard deviation to the mean (average)
$$\text{Coefficient of Variation} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100$$
In symbols: $$CV = \frac{\sigma}{\bar{x}} \times 100$$
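Because the CV is a relative measure, it allows the variability of data sets with different means to be compared. The sketch below uses Python's standard `statistics` module; it assumes the sample standard deviation (`statistics.stdev`, n − 1 divisor) is the one intended, and the heights and weights are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CV = standard deviation / mean * 100 (sample SD assumed)."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements with the same SD (10) but different means.
heights_cm = [150, 160, 170]
weights_kg = [50, 60, 70]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)

print(round(cv_height, 2), round(cv_weight, 2))
```

Although both data sets have the same standard deviation, the weights vary more relative to their mean, so their CV is larger.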