BSN 6th Semester Notes Subject: Biostatistics
By: Meghraj Chatrani
Dated: Sep-5-2024
Unit-1 Introduction to Biostatistics
Statistics: As a discipline, statistics refers to the methodology, techniques and procedures
dealing with the design of experiments and the collection, classification, summarization,
organization, and interpretation of the information contained in a data set.
Statistics is the science and art of dealing with variation in such a way as to obtain reliable results.
Biostatistics: The application of statistics to a wide range of topics in biology. Biostatistics is the
science which deals with development and application of the most appropriate methods for the:
• Collection of data.
• Presentation of the collected data.
• Analysis and interpretation of the results.
• Making decisions on the basis of such analysis
Importance of statistics: A basic understanding of statistics is useful in conducting the
investigations for a research project and in presenting the results effectively. Statistics plays a
vital role in nursing, enabling evidence-based practice, quality improvement, and informed
decision-making. It helps nurses analyze patient data and predict outcomes, ultimately improving
patient care and safety.
It is helpful to distinguish between two major categories of statistics:
Descriptive statistics: Deals with the enumeration, organization, summarization and presentation
of data, for example through graphical representation and tables.
Inferential statistics: Uses a subset of the data (a sample) to make estimates, decisions, or
predictions about a larger set of data (the population). It consists of estimation and hypothesis
testing.
Statistical basic terms:
Population: The set of all measurements of interest to the investigator, e.g. the monthly income
of households in Pakistan or the number of TB patients in Pakistan.
Sample: Any subset of all measurements selected from the population.
Parameter: A descriptive characteristic that summarizes the set of measurements in a
population. E.g. average household size and percent of households with modern sanitation as
reported in the 1998 census of Karachi.
Statistic: A descriptive characteristic that summarizes the set of measurements in a
sample. E.g. average household size and percent of households with modern sanitation as
reported from a sample survey of 6,000 households in Karachi, 2001.
VARIABLE: A characteristic or attribute that can assume different values.
DATA: Measurements or observations for a variable. Or The values of observations recorded for
variables.
DATA ARRAY: A data set that has been ordered.
DATA SET: A collection of data values.
DATA VALUE/DATUM: A value in a data set.
Topic: Qualitative vs Quantitative Variables or Data
Qualitative variables: Those characteristics that yield observations which categorize
individuals according to some characteristic or attribute. Example: Occupation, sex, marital
status and educational level.
Quantitative Data (Measurement): Those characteristics for which the measurements convey
information regarding an amount or quantity. A variable that is numerical in nature and that can
be ordered or ranked.
Example: Age, Blood pressure
Scales of measurement: A type of classification that tells how variables are categorized,
counted, or measured. The four types of scale are nominal, ordinal, interval and ratio.
Topic: Types of Data
Nominal scale: A measurement level that classifies data into mutually exclusive
(non-overlapping) categories on which no order or ranking can be imposed. It is a qualitative
variable that categorizes an element of a population. Example: Telephone number, zip code,
smoking status.
Ordinal scale: A measurement level that classifies data into categories that can be ranked;
however, precise differences between the ranks do not exist. It is a qualitative variable that
incorporates an ordered position or ranking. Example: Educational level, socio-economic status,
etc.
Interval scale: A scale having equal units but an arbitrary zero point. Values can be added and
subtracted. Example: Temperature in degrees Celsius or Fahrenheit.
Ratio scale: A measurement level that possesses all the characteristics of interval measurement
and a true zero. Zero is the absence of the characteristic being measured. Values can be added,
subtracted, multiplied, and divided.
Discrete scale: Can assume a countable number of values. There is a gap between any two
values. Example: number of children, number of missing teeth etc.
Continuous scale: A variable that can assume all values between any two specific values; a
variable obtained by measuring. E.g: height, weight etc
Unit-2 ORGANIZING & DISPLAYING DATA
Variable: A characteristic or property of an individual population unit. The value of the
characteristic may vary among units in a population. Examples: weight, height, marital status,
smoking habit, etc.
Frequency distribution: A tabular summary of a set of data showing the frequency (or number)
of items in each of several non-overlapping groups, with each data value belonging to one and
only one group.
Class Frequency: Number of observations in a data set falling into a particular class.
Cumulative Frequency: Number of observations in a data set falling below or above a particular
class, inclusive of that particular class.
Class relative frequency: Class frequency divided by the total number of observations in the
data set.
Formula: Relative Frequency = Frequency / Total observations
Relative Cumulative Frequency: Cumulative frequency divided by the total number of
observations in the data set.
Formula: Relative Cumulative Frequency = Cumulative Frequency / Total observations
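The frequency formulas above can be sketched in Python. This is only an illustration: the function name frequency_table and the data (number of children in 10 families) are hypothetical, and the cumulative frequency is taken as "this class and all classes below it".

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (value, frequency, relative frequency,
    cumulative frequency, relative cumulative frequency)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    cumulative = 0
    for value in sorted(counts):
        frequency = counts[value]
        cumulative += frequency  # running total up to and including this class
        rows.append((value, frequency, frequency / total,
                     cumulative, cumulative / total))
    return rows

# Hypothetical data: number of children in 10 families
children = [2, 3, 2, 1, 3, 2, 4, 1, 2, 3]
table = frequency_table(children)
for row in table:
    print(row)
```

The relative frequencies always add up to 1, and the last relative cumulative frequency is always 1.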
GRAPHS: Graphs are geometrical designs that convey information at a glance and are
mathematically less sophisticated.
Graphical presentation of Quantitative data: Histogram, Relative Histogram, Frequency
Polygon, Cumulative Frequency Polygon, Dot Plot and Scatter Plot
Histogram: A common graphical presentation of quantitative data is a histogram. The variable of
interest is placed on the horizontal axis. Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent classes.
Frequency polygon: The frequency polygon is simpler than its histogram counterpart. It
sketches an outline of the data pattern more clearly. The polygon becomes increasingly smooth
and curve-like as we increase the number of classes and the number of observations.
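CUMULATIVE FREQUENCY POLYGON (OGIVE): A graph of cumulative frequencies. The stated class
limits leave gaps between adjacent classes (for example, between 59 and 60). These gaps are
eliminated by plotting points halfway between the upper limit of one class and the lower limit of
the next. Thus, 59.5 is used for the 50-59 class, 69.5 is used for the 60-69 class, and so on.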
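The halfway points described above can be computed as in the following sketch. The function name upper_class_boundaries and the class frequencies are hypothetical, and the code assumes at least two classes with the same gap between consecutive classes.

```python
from itertools import accumulate

def upper_class_boundaries(class_limits):
    """Upper boundary of each class: its upper limit plus half the gap
    to the next class (assumes equal gaps between all classes)."""
    gap = class_limits[1][0] - class_limits[0][1]
    return [upper + gap / 2 for _, upper in class_limits]

classes = [(50, 59), (60, 69), (70, 79)]   # class limits as in the notes
frequencies = [4, 7, 9]                    # hypothetical class frequencies

boundaries = upper_class_boundaries(classes)
cumulative = list(accumulate(frequencies))

# Each ogive point pairs an upper class boundary with its cumulative frequency
ogive_points = list(zip(boundaries, cumulative))
print(ogive_points)
```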
DOT PLOT: One of the simplest graphical summaries of data is a dot plot. A horizontal axis
shows the range of data values. Then each data value is represented by a dot placed above the
axis.
Graphical presentation of Qualitative data: Bar Graph (simple, multiple, component,
sliding) and Pie Chart
BAR GRAPH: A bar graph is a graphical device for depicting qualitative data. On the horizontal
axis we specify the labels that are used for each of the classes. A frequency, relative frequency, or
percent frequency scale can be used for the vertical axis. Using a bar of fixed width drawn above
each class label, we extend the height appropriately. The bars are separated to emphasize the fact
that each class is a separate category.
Types of Bar Chart: Nominal, ordinal, vertical or multiple bar, horizontal or component
bar, and sliding bar
PIE CHART: The pie chart is a commonly used graphical device for presenting relative
frequency distributions for qualitative data. First draw a circle; then use the relative frequencies
to subdivide the circle into sectors that correspond to the relative frequency for each class.
• Common device for displaying data arranged in categories
• Useful for variables with small number of categories
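Subdividing the circle by relative frequency amounts to giving each class a sector angle of relative frequency × 360 degrees. A minimal sketch, with a hypothetical function name and hypothetical smoking-status counts:

```python
def pie_sector_angles(frequencies):
    """Map each category to its sector angle in degrees
    (relative frequency multiplied by 360)."""
    total = sum(frequencies.values())
    return {label: 360 * count / total for label, count in frequencies.items()}

# Hypothetical counts for a qualitative variable
smoking_status = {"Non-smoker": 30, "Ex-smoker": 12, "Current smoker": 18}
angles = pie_sector_angles(smoking_status)
print(angles)
```

The angles of all sectors add up to 360 degrees, just as the relative frequencies add up to 1.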
Unit-3 Measures of Central Tendency & Measures of Dispersion
Measures of Central Tendency: Given a data set, a measure of the central tendency is a value
about which the observations tend to cluster. In other words it is a value around which a data set
is centered. The three most common measures of central tendency are the mean, the median, the
mode.
Mean: It is the arithmetic average of a set of numbers. Applicable for interval and ratio data and
not applicable for nominal or ordinal data. Computed by summing all values in the data set and
dividing the sum by the number of values in the data set.
E.g.: Sample mean of the ages of patients coming to a clinic: 8, 9, 10, 11, 12, 9
Mean (x̄) = (sum of all values) / (number of values) = (8 + 9 + 10 + 11 + 12 + 9) / 6
Mean (x̄) = 59 / 6 = 9.83
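The same calculation can be checked with a few lines of Python, using the clinic ages from the example (the variable names are illustrative only):

```python
import statistics

ages = [8, 9, 10, 11, 12, 9]

# Sum all values and divide by the number of values
mean_age = sum(ages) / len(ages)
print(round(mean_age, 2))  # 9.83

# The standard library gives the same result
library_mean = statistics.mean(ages)
```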
Median: The median is the middle value in an ordered array of numbers. Applicable for ordinal,
interval, and ratio data. Not applicable for nominal data.
Median (Computational Procedure): Arrange the observations in an ordered array. If there is
an odd number of terms, the median is the middle term of the ordered array. If there is an even
number of terms, the median is the average of the middle two terms.
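The computational procedure above can be sketched in Python. The function name median_of is hypothetical; the first data set is the list of clinic ages used in the mean example.

```python
def median_of(values):
    """Median by the procedure in the notes: order the data, then take the
    middle term (odd count) or the average of the two middle terms (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

even_case = median_of([8, 9, 10, 11, 12, 9])  # ordered: 8, 9, 9, 10, 11, 12
odd_case = median_of([8, 9, 10, 11, 12])
print(even_case, odd_case)  # 9.5 10
```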
Appropriate Measures of Central Tendency
Nominal variables: Mode
Ordinal variables: Median
Interval level variables: Mean, if the distribution is normal (the median is better with a
skewed distribution).
Unit-4 Normal Distribution
Normal Distribution: The normal distribution is the most important of all statistical
distributions. It was first discovered by the French mathematician Abraham de Moivre in 1733.
Sir Francis Galton first applied the normal distribution to medicine. The reason why the normal
distribution plays such a key role in statistics is that countless phenomena follow (or closely
approximate) the normal distribution. Examples are height, serum cholesterol, life span of light
bulbs, body temperature of healthy persons, size of oranges, etc.
Symmetric Distribution: When the data values are evenly distributed about the mean, the
distribution is said to be symmetric.
Skewed Distribution:
Negatively or Left Skewed: When the majority of data values fall to the right of the mean, with
the tail extending to the left, the distribution is said to be negatively (left) skewed.
Positively or Right Skewed: When the majority of data values fall to the left of the mean, with
the tail extending to the right, the distribution is said to be positively (right) skewed.
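The definitions above can be turned into a crude check that counts how many values fall on each side of the mean. This is only a rough sketch for small data sets, not a formal skewness measure; the function name and the data are hypothetical.

```python
def skew_direction(values):
    """Classify skew by counting values on each side of the mean
    (a rough heuristic that follows the definitions in the notes)."""
    mean = sum(values) / len(values)
    right = sum(1 for v in values if v > mean)
    left = sum(1 for v in values if v < mean)
    if right > left:
        return "negatively (left) skewed"
    if left > right:
        return "positively (right) skewed"
    return "roughly symmetric"

# Hypothetical data sets
print(skew_direction([1, 8, 9, 9, 10, 10]))  # one low value pulls the mean down
print(skew_direction([1, 1, 2, 2, 3, 10]))   # one high value pulls the mean up
print(skew_direction([1, 2, 3, 4, 5]))
```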
What is the Normal Distribution (Curve)? It is a theoretical model that plays a very important
role in statistical inference. It describes a frequency polygon or histogram that is
unimodal, smooth, and symmetrical (no empirical distribution has a shape that perfectly matches
this ideal model). Because the curve is unimodal and symmetrical, it is bell-shaped.
Unit-5 Sampling Distribution of Sample Mean & Central Limit Theorem
Key terms
Sampling Distribution: The probability (density) function of a statistic is called the sampling
distribution of the statistic.
Standard Error: The standard deviation of the sampling distribution of a statistic is called the
standard error.
Sampling Error: The difference between the sample measure and the corresponding
population measure due to the fact that the sample is not a perfect representation of the
population.
Sampling distribution of sample mean: The distribution of the means computed from all
possible random samples of a specific size taken from a population.
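For a very small population, the sampling distribution of the sample mean can be listed in full. The sketch below assumes a hypothetical population of four values and samples of size 2 drawn with replacement. It illustrates two standard results that the definitions above lead to: the mean of the sample means equals the population mean, and the standard error equals the population standard deviation divided by the square root of the sample size.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [2, 4, 6, 8]   # hypothetical population
n = 2                       # sample size

# All possible ordered samples of size n, drawn with replacement
samples = list(product(population, repeat=n))
sample_means = [sum(sample) / n for sample in samples]

mean_of_means = mean(sample_means)      # centre of the sampling distribution
standard_error = pstdev(sample_means)   # its standard deviation

print(len(samples), mean_of_means, standard_error)
print(pstdev(population) / sqrt(n))     # sigma / sqrt(n), for comparison
```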
Unit-6 Point Estimation
Unit-7 Hypothesis Testing
Hypothesis Testing: A decision-making process for evaluating claims about a population.
It allows us to use sample data to test a claim about a population, such as testing whether a
population mean equals some number.
Example: Does an average box of cereal contain 368 grams of cereal?
Method of testing hypothesis: The three methods used to test hypotheses are:
1. The traditional method
2. The P-value method
3. The Confidence Interval method
Hypothesis: A statement of belief used in the evaluation of a population parameter.
Ex: The mean balance score to assess muscle function is lower among rheumatoid arthritis
patients than among osteoarthritis patients.
Types of hypothesis:
NULL HYPOTHESIS (H0): A claim that there is no difference between the population
parameter and the hypothesized value.
ALTERNATIVE OR RESEARCHER HYPOTHESIS (Ha OR H1): A claim that disagrees
with the null hypothesis.
Example of Null Hypothesis: The mean balance score to assess muscle function among
rheumatoid arthritis (RA) patients is greater than or equal to that among osteoarthritis (OA)
patients (4).
Example of ALTERNATIVE OR RESEARCHER Hypothesis: The mean balance score to
assess muscle function among rheumatoid arthritis (RA) patients is lower than that among
osteoarthritis (OA) patients.
Directional and Non-directional Hypothesis: One-tailed hypotheses are directional; two-tailed
hypotheses are non-directional.
Basic Elements of Testing hypothesis:
• Null Hypothesis
• Alternative Hypothesis (Researcher Hypothesis)
• Choice of appropriate level of significance (α)
• Assumptions & Test Statistic (Formula)
• Rejection Region (Critical Region)
• Conclusion
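The elements listed above can be walked through with the cereal-box example. The following is a minimal sketch of the traditional (critical region) method for a two-tailed z test. It assumes the population standard deviation is known, and the sample size, sample mean and standard deviation are hypothetical values chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_tailed(sample_mean, hypothesized_mean, sigma, n, alpha=0.05):
    """Two-tailed z test: returns the test statistic, the critical value,
    and whether H0 is rejected at the given level of significance."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - hypothesized_mean) / standard_error   # test statistic
    critical = NormalDist().inv_cdf(1 - alpha / 2)           # edge of rejection region
    reject = abs(z) > critical
    return z, critical, reject

# H0: mean = 368 grams, Ha: mean != 368 grams
# Hypothetical sample: 25 boxes, sample mean 372.5 g, known sigma 15 g
z, critical, reject = z_test_two_tailed(372.5, 368, 15, 25)
print(z, critical, reject)
```

Here the test statistic does not fall in the rejection region, so the conclusion is that H0 is not rejected at the 5% level.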
Unit-8 Type I and Type II errors, power of the test and p-value
p-value approach: Another approach, and nowadays the most common one, is to report the
extent to which the test statistic disagrees with the null hypothesis and to compare it with the
value of α to decide whether to reject the null hypothesis. This measure of disagreement is called
the p-value.
P-Value: Statistical software commonly reports a p-value in hypothesis testing.
The p-value measures the strength of the evidence against H0: the smaller the p-value, the
stronger the evidence. The p-value is compared with the value of alpha (α) to decide whether to
reject the null hypothesis.
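As a sketch of the p-value approach, the snippet below computes a two-tailed p-value for a z test statistic and compares it with alpha, using the usual rule of rejecting H0 when the p-value is less than or equal to alpha. The test statistic of 1.5 is a hypothetical value, and the function name is illustrative.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """Probability, under H0, of a z statistic at least as extreme as the one observed."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

alpha = 0.05
z_observed = 1.5                       # hypothetical test statistic
p_value = two_tailed_p_value(z_observed)
reject_h0 = p_value <= alpha

print(round(p_value, 4), reject_h0)    # 0.1336 False
```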
Errors involved in Testing the Null Hypothesis H0:
Type I Error (Rejection error or Alpha (α) Error): The decision to reject H0 when in
fact H0 is true.
Type II Error (Non-rejection error or Beta (β) Error): The decision not to reject
H0 when in fact H0 is false.
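The probability of a Type II error (β) and the power of the test (1 − β) can be computed for a two-tailed z test once a true mean is assumed. The sketch below reuses the cereal-box setting; the true mean, standard deviation and sample size are hypothetical, and the population standard deviation is assumed known.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(mu0, true_mu, sigma, n, alpha=0.05):
    """Beta: probability that the sample mean lands in the non-rejection
    region of a two-tailed z test when the true mean is true_mu."""
    se = sigma / sqrt(n)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    lower = mu0 - z_crit * se          # non-rejection region for the sample mean
    upper = mu0 + z_crit * se
    sampling = NormalDist(mu=true_mu, sigma=se)   # sampling distribution under the true mean
    return sampling.cdf(upper) - sampling.cdf(lower)

# H0: mean = 368 g; hypothetical true mean 372 g, sigma 15 g, n = 25
beta = type_ii_error(368, 372, 15, 25)
power = 1 - beta
print(round(beta, 3), round(power, 3))
```

If the true mean equals the hypothesized mean, the same function returns 1 − α, which is the probability of correctly not rejecting a true H0.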
Unit-12 Correlation and Regression
Correlation: Correlation is a statistical method used to determine whether a relationship
between variables exists. Correlation is a numerical measure that is used to answer the following
questions:
1. Are two or more variables related?
2. If so, what is the strength of the relationship?
Regression: Regression is a statistical method used to describe the nature of the relationship
between variables, that is, positive or negative, linear or non-linear. Regression is used to
answer the following question:
What type of relationship exists?
Independent variable: Independent variable is the variable in regression that can be controlled
or manipulated.
Dependent variable: The dependent variable is the variable in regression that cannot be
controlled or manipulated.
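A minimal sketch of both ideas follows, assuming a linear relationship: the Pearson correlation coefficient measures the strength of the relationship, and the least-squares line describes its nature. The function names and the age and blood pressure values are hypothetical.

```python
from math import sqrt

def _sums_of_squares(x, y):
    """Helper: sums of squared deviations and cross-products about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return mean_x, mean_y, sxy, sxx, syy

def pearson_r(x, y):
    """Pearson correlation coefficient, between -1 and +1."""
    _, _, sxy, sxx, syy = _sums_of_squares(x, y)
    return sxy / sqrt(sxx * syy)

def least_squares_line(x, y):
    """Intercept and slope of the line y = intercept + slope * x."""
    mean_x, mean_y, sxy, sxx, _ = _sums_of_squares(x, y)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: age (independent) and systolic blood pressure (dependent)
age = [30, 40, 50, 60, 70]
systolic = [118, 125, 130, 138, 144]

r = pearson_r(age, systolic)
intercept, slope = least_squares_line(age, systolic)
print(round(r, 3), intercept, slope)
```

A value of r close to +1 indicates a strong positive relationship, and the positive slope shows that the dependent variable rises as the independent variable increases.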
Unit-15 One Way Analysis of Variance (ANOVA)
ANOVA: When an F test is used to test a hypothesis concerning the means of three or more
populations, the technique is called analysis of variance (ANOVA). The t-test compares only two
means at a time; running repeated t-tests across three or more samples inflates the chance of a
Type I error, so ANOVA is used instead.
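The F statistic for a one-way ANOVA is the ratio of the variation between group means to the variation within groups. The sketch below computes it by hand for three small hypothetical groups; the function name is illustrative, and the resulting F value would still need to be compared with a critical value from an F table.

```python
def one_way_anova_f(groups):
    """F statistic = mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)

    ms_between = ss_between / (k - 1)        # df between = k - 1
    ms_within = ss_within / (n_total - k)    # df within = N - k
    return ms_between / ms_within

# Hypothetical scores from three groups of patients
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_statistic = one_way_anova_f(groups)
print(round(f_statistic, 2))  # 13.0
```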