
Introduction to Biostatistics – Pharmacy Assoc. Prof. Pham The Hai (pham-the.hai@usth.edu.vn)

EXERCISE BOOK
1. Important concepts of descriptive statistics:
Measures of Central Tendency: These measures indicate the central or typical value of a dataset.
The commonly used measures are the mean, median, and mode.
Measures of Dispersion: These measures quantify the spread or variability of a dataset. Common
measures include the range, variance, standard deviation, and interquartile range.
Percentiles: Percentiles divide a dataset into hundredths and provide information about the relative
position of a particular value within the dataset. The median represents the 50th percentile.
Boxplots: Boxplots provide a visual summary of the dataset's distribution, including the median,
quartiles, range, and any potential outliers.
Histograms: Histograms display the frequency distribution of a continuous variable by dividing it
into intervals or bins. They provide insights into the shape and spread of the data.
Normal Distribution: The normal distribution, also known as the Gaussian distribution, is a
symmetrical probability distribution frequently encountered in biostatistics. It is characterized by its
mean and standard deviation.
Z-Score: The z-score measures the number of standard deviations a particular observation is from
the mean. It is used to compare and standardize values across different distributions.
Confidence Intervals: Confidence intervals provide a range of values within which a population
parameter is likely to fall. They account for sampling variability and provide a measure of the
uncertainty associated with the estimate.
Correlation: Correlation measures the strength and direction of the linear relationship between two
variables. It is often used to assess the association between variables in biostatistical studies.
Scatter Plots: Scatter plots visualize the relationship between two continuous variables. They help
identify patterns, trends, and the nature of the association between variables.
Formulas for calculating descriptive statistics
1.1. Mean: The mean is the sum of all values in a dataset divided by the number of observations.
Formula: Mean = (x₁ + x₂ + x₃ + ... + xₙ) / n
where x₁, x₂, x₃, ..., xₙ are the individual data points and n is the number of observations.
1.2. Median: The median is the middle value in an ordered dataset. If the dataset has an odd
number of observations, the median is the middle value. If the dataset has an even number
of observations, the median is the average of the two middle values.
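Formula: Position of the median = (n + 1) / 2
where n is the number of observations and the position is counted in the ordered dataset. When
(n + 1) / 2 is not a whole number, average the two values on either side of that position.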
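To make the role of the position (n + 1) / 2 concrete, the following minimal sketch locates the median by hand and can be compared with R's built-in median(). The helper name median_by_position and the two small vectors are assumptions chosen for illustration; they are not part of the exercises.

```r
# Hypothetical helper: find the median from its position in the ordered data
median_by_position <- function(v) {
  s <- sort(v)
  n <- length(s)
  pos <- (n + 1) / 2                       # position of the median
  if (n %% 2 == 1) {
    s[pos]                                 # odd n: the middle value
  } else {
    mean(s[c(floor(pos), ceiling(pos))])   # even n: average the two middle values
  }
}

odd_data  <- c(7, 3, 9, 1, 5)       # illustrative values, n = 5
even_data <- c(7, 3, 9, 1, 5, 11)   # illustrative values, n = 6

median_by_position(odd_data)    # agrees with median(odd_data)
median_by_position(even_data)   # agrees with median(even_data)
```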
1.3. Mode: The mode is a measure of central tendency that represents the most frequently
occurring value in a dataset. It is the value that has the highest frequency or probability density.
To calculate the mode in biostatistics, you can use the following concept and formula:

Concept: The mode corresponds to the value in the dataset that occurs with the highest frequency. It
represents the peak or most common value in the distribution.
Formula: For a discrete dataset, the mode can be found by simply identifying the value with the
highest frequency. If multiple values have the same highest frequency, the dataset is considered
multimodal, meaning it has multiple modes.
For a continuous dataset, the mode can be estimated by finding the peak of the probability density
function (PDF) or the highest point on the histogram.
It's important to note that not all datasets have a mode. Some datasets may have a uniform
distribution where each value occurs with equal frequency, resulting in no distinct mode.
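Base R has no built-in function for the statistical mode, so one possible approach for a discrete dataset is to count frequencies with table(). The sketch below is only an illustration; the helper name modes_of and the example vectors are assumptions. Note that for data in which every value occurs equally often, this helper returns all values, which corresponds to the "no distinct mode" case described above.

```r
# Hypothetical helper: return every value that attains the highest frequency
modes_of <- function(v) {
  counts <- table(v)                                  # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])    # keep the most frequent value(s)
}

modes_of(c(2, 3, 3, 5, 7))   # a single mode
modes_of(c(1, 1, 2, 2, 3))   # two modes: the data are multimodal
```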
1.4. Percentiles: Percentiles divide a dataset into hundredths and provide information about the
relative position of a particular value within the dataset.
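Formula: Position of the P-th percentile = (P/100) * (n + 1)
where P is the desired percentile and n is the number of observations. The position is counted in the
ordered dataset; when it is not a whole number, interpolate between the two neighbouring values.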
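As an illustration of the (n + 1) rule, the sketch below computes the 25th percentile of a small made-up vector by hand and compares it with quantile(). The data are an assumption for demonstration only. R's quantile() offers several definitions: type = 6 corresponds to the (P/100) * (n + 1) position, while the default (type = 7) uses a different rule and may return a slightly different value.

```r
scores <- c(12, 15, 18, 20, 22, 25, 28, 30, 35)   # illustrative data, n = 9
n <- length(scores)
P <- 25                                           # desired percentile

pos <- (P / 100) * (n + 1)    # position in the ordered data
s <- sort(scores)
lower <- floor(pos)

# Linear interpolation between the two neighbouring ordered values
by_hand <- s[lower] + (pos - lower) * (s[lower + 1] - s[lower])
by_hand

# Same definition through quantile(); type = 6 uses the (n + 1) rule
quantile(scores, probs = P / 100, type = 6)
```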
1.5. Variance: Variance measures the variability or dispersion of a dataset. It quantifies how
spread out the data points are from the mean.
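Formula: Variance = Σ(xi - μ)² / n
where xi represents each data point, μ is the mean, and n is the number of observations. This is the
population variance; for a sample, divide by n - 1 instead of n (this is what R's var() computes).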
1.6. Covariance: Covariance measures the directional relationship between two variables. It
indicates whether the variables move together (positive covariance) or in opposite directions
(negative covariance).
Formula: Cov(X,Y) = Σ((xi - μx) * (yi - μy)) / n
where xi and yi are data points, μx and μy are the means of the respective variables, and n is the number
of observations.
1.7. Standard Deviation: The standard deviation is the square root of the variance. It provides a
measure of the average distance between each data point and the mean.
Formula: Standard Deviation = √(Σ(xi - μ)² / n)
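The formulas above divide by n, whereas R's var() and sd() divide by n - 1 (the sample versions). The following sketch, using a small made-up vector, shows both calculations side by side so that hand results can be reconciled with R output. The variable names and data are assumptions for illustration.

```r
x <- c(4, 8, 6, 5, 7)      # illustrative data
n <- length(x)
mu <- mean(x)

pop_var  <- sum((x - mu)^2) / n         # divisor n, as in the formula above
samp_var <- sum((x - mu)^2) / (n - 1)   # divisor n - 1, as used by var()

pop_sd  <- sqrt(pop_var)
samp_sd <- sqrt(samp_var)

c(pop_var = pop_var, samp_var = samp_var, var_R = var(x))
c(pop_sd = pop_sd, samp_sd = samp_sd, sd_R = sd(x))
```

If the population version is needed from R output, var(x) * (n - 1) / n converts between the two.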
1.8. Z-Score: The z-score measures the number of standard deviations an observation is from the
mean. It is used to standardize values and compare them across different distributions.
Formula: Z = (x - μ) / σ
where x is the individual data point, μ is the mean, and σ is the standard deviation.
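A short sketch of the z-score calculation follows. All numbers are made-up values chosen for illustration, not data from the exercises.

```r
# Illustrative values (assumptions)
x <- 130      # individual observation
mu <- 120     # mean
sigma <- 8    # standard deviation

z <- (x - mu) / sigma
z   # 1.25: the observation lies 1.25 standard deviations above the mean

# Standardizing every value of a vector against its own mean and SD
values <- c(110, 120, 134, 128, 118)
z_values <- (values - mean(values)) / sd(values)
round(z_values, 2)
```

After standardization the z-values have mean 0 and standard deviation 1, which is what makes values from different scales comparable.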
1.9. Confidence Intervals: Confidence intervals (CI) provide a range of values within which a
population parameter is likely to fall. The formula for constructing a confidence interval
depends on the distribution of the data and the desired level of confidence (e.g., 95%, 99%).
Formula (for a sample mean with known population standard deviation): CI = x̄ ± (Z * σ / √n)
where x̄ is the sample mean, Z is the Z-score corresponding to the desired confidence level, σ is the
population standard deviation, and n is the sample size.
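The confidence interval formula can be evaluated in a few lines. In this sketch the sample mean, the known population standard deviation, and the sample size are assumed values for illustration; qnorm() supplies the Z value for the chosen confidence level.

```r
# Illustrative inputs (assumptions)
x_bar <- 120         # sample mean
sigma <- 15          # known population standard deviation
n <- 36              # sample size
conf_level <- 0.95

# Two-sided critical value: about 1.96 for a 95% interval
z_crit <- qnorm(1 - (1 - conf_level) / 2)

margin <- z_crit * sigma / sqrt(n)    # margin of error
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
ci
```

Changing conf_level to 0.99 widens the interval, because a larger Z value is needed to cover more of the sampling distribution.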

1.10. Correlation Coefficient (Pearson's Correlation): The correlation coefficient measures
the strength and direction of the linear relationship between two variables. The value ranges
from -1 to +1, where -1 indicates a perfect negative correlation, +1 indicates a perfect positive
correlation, and 0 indicates no linear correlation.
Formula: r = Σ((xi - x̄) * (yi - ȳ)) / √(Σ(xi - x̄)² * Σ(yi - ȳ)²)
where xi and yi are data points, x̄ and ȳ are the means of the respective variables.
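The covariance and correlation formulas can be checked against R's cov() and cor(). The paired values below (a made-up dose and response) are assumptions for illustration. Note that cov() divides by n - 1, while the covariance formula in 1.6 divides by n; the correlation coefficient is unaffected because the divisor cancels.

```r
dose     <- c(1, 2, 3, 4, 5)               # illustrative values
response <- c(2.1, 3.9, 6.2, 7.8, 10.1)    # illustrative values
n <- length(dose)

dx <- dose - mean(dose)            # deviations from the mean of x
dy <- response - mean(response)    # deviations from the mean of y

cov_pop   <- sum(dx * dy) / n                            # covariance with divisor n
r_by_hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))  # Pearson's r

r_by_hand
cor(dose, response)    # built-in Pearson correlation
cov(dose, response)    # built-in covariance (divisor n - 1)
```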
1.11. Discrete Probability Distribution: Binomial Distribution
Concept:
The binomial distribution describes the probability of obtaining a specific number of successes in a
fixed number of independent Bernoulli trials.
Formula:
P(X = x) = (nCx) * p^x * (1 - p)^(n - x)
R Code Example:
# Required library
library(ggplot2)

# Goal: probability that the drug is (or is not) effective in 7 of 10 patients

# Parameters
n <- 10 # Number of trials
p <- 0.5 # Probability of success

# Probability calculation (x = number of patients of interest)
x <- 0:n
prob <- dbinom(x, size = n, prob = p)

# Bar plot
binom_plot <- ggplot(data.frame(x, prob), aes(x, prob)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Number of Successes", y = "Probability") +
ggtitle("Binomial Distribution") +
theme_minimal()

binom_plot

1.12. Continuous Probability Distribution: Normal Distribution


Concept:
The normal distribution is a continuous probability distribution with a symmetric bell-shaped curve.
Formula:
PDF: f(x) = (1 / (σ * √(2π))) * exp(-(x - μ)² / (2σ²))
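As a quick check of the density formula, the sketch below evaluates it directly at a few points and compares the result with dnorm(). The evaluation points are arbitrary values chosen for illustration.

```r
mu <- 0       # mean
sigma <- 1    # standard deviation
x <- c(-1, 0, 1.5)   # arbitrary points at which to evaluate the density

# Density written out term by term from the formula
pdf_by_hand <- (1 / (sigma * sqrt(2 * pi))) * exp(-(x - mu)^2 / (2 * sigma^2))

pdf_by_hand
all.equal(pdf_by_hand, dnorm(x, mean = mu, sd = sigma))   # TRUE when both agree
```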
R Code Example:
# Required library
library(ggplot2)

# Parameters
mu <- 0 # Mean
sigma <- 1 # Standard deviation

# Probability calculation
x <- seq(-4, 4, by = 0.1)
density <- dnorm(x, mean = mu, sd = sigma)

# Density plot
normal_plot <- ggplot(data.frame(x, density), aes(x, density)) +
geom_line(color = "blue") +
labs(x = "x", y = "Density") +
ggtitle("Normal Distribution") +
theme_minimal()

normal_plot

1.13. Poisson Distribution


Concept:
The Poisson distribution models the probability of a certain number of events occurring within a fixed
interval of time or space.
Formula:
P(X = x) = (exp(-λ) * λ^x) / x!

R Code Example:
# Required library
library(ggplot2)
# Note (as for the binomial distribution): given a data set, examine how it is distributed

# Parameter
lambda <- 3 # Average rate of events

# Probability calculation
x <- 0:10
prob <- dpois(x, lambda)

# Bar plot
poisson_plot <- ggplot(data.frame(x, prob), aes(x, prob)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Number of Events", y = "Probability") +
ggtitle("Poisson Distribution") +
theme_minimal()

poisson_plot

Note: These R code examples demonstrate how to calculate and visualize the probability
distributions in biostatistics using the corresponding functions from the stats package in R. The
plots provide visual representations of the distributions to better understand the probabilities
associated with different outcomes.
_______________________________
____________Problems___________
Calculation of descriptive statistics
Problem 1:
A researcher is studying the heights of a sample of 51 individuals. The heights (in centimeters) are as
follows: 165, 170, 168, 172, 160, 175, 163, 169, 171, 166, 173, 167, 169, 160, 174, 168, 172, 167,
165, 170, 169, 171, 167, 170, 175, 170, 168, 165, 172, 166, 170, 171, 168, 173, 165, 172, 169, 160,
171, 173, 167, 172, 170, 169, 165, 168, 173, 166, 170, 174, 168.
Compute the mean, median, mode, range, variance, and standard deviation of the heights.
Suggestions (exam note: the exam is written, so practise writing the R commands out on paper):

• Mean: Add up all the heights and divide by the number of values listed (51 in this case).
• Median: Arrange the heights in ascending order and find the middle value. If there's an even
number of values, take the average of the two middle values.
• Mode: Identify the height(s) that appear(s) most frequently in the data.
• Range: Find the difference between the maximum and minimum heights.
• Variance: Calculate the average squared deviation from the mean. It measures the spread of
data.
• Standard Deviation: Take the square root of the variance. It provides a measure of the average
distance between each data point and the mean.
Problem 2:
A study examined the blood pressure readings (in mmHg) of 30 participants. The blood pressure
values are as follows: 120, 118, 122, 124, 130, 126, 128, 124, 120, 122, 124, 126, 128, 130, 132, 134,
136, 138, 140, 142, 144, 146, 148, 150, 152, 154, 156, 158, 160, 162.
Calculate the five-number summary (minimum, lower quartile, median, upper quartile, maximum)
and construct a box plot of the blood pressure data.
Suggestions:
• Five-Number Summary: The minimum value, lower quartile (25th percentile), median (50th
percentile), upper quartile (75th percentile), and maximum value.
• Box Plot: Construct a graphical representation that displays the five-number summary. It helps
visualize the distribution of the data, including outliers and skewness.
Problem 3:
A researcher is investigating the enzyme activity levels of a sample of 25 specimens. The enzyme
activity values (in units per minute) are as follows: 10, 12, 15, 8, 14, 9, 13, 11, 16, 12, 10, 11, 13, 15,
9, 8, 14, 12, 16, 11, 13, 10, 9, 14, 12.
Calculate the mean, median, and range of the enzyme activity levels. Also, compute the interquartile
range (IQR), which covers the middle 50% of the data from the first quartile to the third quartile,
and construct a box plot.
Suggestions:
• Interquartile Range (IQR): The difference between the upper quartile and the lower quartile.
It represents the spread of the middle 50% of the data.
• Box Plot: Similar to Problem 2, construct a box plot to visualize the data and observe any
potential outliers.
Problem 4:
A study measures the body mass index (BMI) of 40 participants. The BMI values are as follows: 22.5,
24.8, 25.2, 26.7, 27.1, 28.3, 29.6, 30.2, 31.4, 32.0, 25.9, 27.3, 28.1, 29.4, 30.8, 32.2, 33.0, 34.5, 35.1,
36.7, 25.1, 26.7, 27.5, 29.0, 30.4, 31.9, 33.1, 34.2, 35.7, 37.2, 26.1, 27.7, 28.9, 30.3, 31.6, 33.4, 34.8,
36.2, 37.9, 39.0.
Compute the mean, median, and standard deviation of the BMI values. Also, determine the z-score
for an individual with a BMI of 32.8.

Suggestions:
• Mean: Calculate the average of the BMI values.
• Median: Find the middle value when the BMI values are arranged in ascending order.
• Standard Deviation: Measure the spread of the BMI values around the mean.
• Z-Score: Determine the standardized value by subtracting the mean from an individual's BMI
and dividing by the standard deviation. It indicates how many standard deviations an
individual's BMI is away from the mean.
Problem 5:
A researcher measures the reaction times (in milliseconds) of a sample of 20 participants. The reaction
times are as follows: 250, 260, 255, 270, 275, 280, 290, 295, 305, 310, 320, 315, 300, 280, 275, 270,
265, 255, 250, 245.
Calculate the mean, median, and variance of the reaction times. Additionally, compute the coefficient
of variation (CV) and interpret its meaning in the context of the data.
Suggestions:
• Mean: Calculate the average of the reaction times.
• Median: Find the middle value when the reaction times are arranged in ascending order.
• Variance: Measure the spread of the reaction times around the mean.
• Coefficient of Variation (CV): Divide the standard deviation by the mean and multiply by 100.
It represents the relative variability of the data, allowing comparison between datasets with
different units of measurement.
Note: In each of these problems, you would apply various descriptive statistics measures to
summarize and analyze the given data. It's important to understand the context of the data and
interpret the results accordingly. Descriptive statistics provide summary measures that help describe
the central tendency, variability, and shape of the data distribution. These measures include mean,
median, mode, range, variance, standard deviation, quartiles, box plots, z-scores, and coefficient of
variation. Make sure to use the appropriate formulas and techniques to calculate these statistics
accurately.
Solutions
Problem 1:
# Heights data
heights <- c(165, 170, 168, 172, 160, 175, 163, 169, 171, 166, 173, 167,
169, 160, 174, 168, 172, 167, 165, 170, 169, 171, 167, 170, 175, 170,
168, 165, 172, 166, 170, 171, 168, 173, 165, 172, 169, 160, 171, 173,
167, 172, 170, 169, 165, 168, 173, 166, 170, 174, 168)

# Mean
mean_height <- mean(heights)

# Median
median_height <- median(heights)

# Mode (which.max() gives an index into the unique values, so index unique(heights))
mode_height <- unique(heights)[which.max(tabulate(match(heights,
unique(heights))))]

# Range
range_height <- max(heights) - min(heights)

# Variance (var() returns the sample variance, which divides by n - 1)
var_height <- var(heights)

# Standard Deviation (square root of the sample variance)
sd_height <- sd(heights)

# Visualize the heights data


hist(heights, main = "Height Distribution", xlab = "Height", ylab =
"Frequency")

Problem 2:
# Blood pressure data
blood_pressure <- c(120, 118, 122, 124, 130, 126, 128, 124, 120, 122,
124, 126, 128, 130, 132, 134, 136, 138, 140, 142, 144, 146, 148, 150,
152, 154, 156, 158, 160, 162)

# Five-Number Summary (summary() reports the five numbers plus the mean)
summary_stats <- summary(blood_pressure)

# Box Plot
boxplot(blood_pressure, main = "Blood Pressure Box Plot")

Problem 3:
# Enzyme activity data
enzyme_activity <- c(10, 12, 15, 8, 14, 9, 13, 11, 16, 12, 10, 11, 13,
15, 9, 8, 14, 12, 16, 11, 13, 10, 9, 14, 12)

# Mean
mean_activity <- mean(enzyme_activity)

# Median
median_activity <- median(enzyme_activity)

# Range
range_activity <- max(enzyme_activity) - min(enzyme_activity)

# Interquartile Range (IQR)


iqr_activity <- IQR(enzyme_activity)

# Box Plot
boxplot(enzyme_activity, main = "Enzyme Activity Box Plot")

Problem 4:
# BMI data
bmi <- c(22.5, 24.8, 25.2, 26.7, 27.1, 28.3, 29.6, 30.2, 31.4, 32.0,
25.9, 27.3, 28.1, 29.4, 30.8, 32.2, 33.0, 34.5, 35.1, 36.7, 25.1, 26.7,
27.5, 29.0, 30.4, 31.9, 33.1, 34.2, 35.7, 37.2, 26.1, 27.7, 28.9, 30.3,
31.6, 33.4, 34.8, 36.2, 37.9, 39.0)

# Mean
mean_bmi <- mean(bmi)

# Median
median_bmi <- median(bmi)

# Standard Deviation
sd_bmi <- sd(bmi)

# Z-Score
individual_bmi <- 32.8
z_score <- (individual_bmi - mean_bmi) / sd_bmi

# Visualize the BMI data


hist(bmi, main = "BMI Distribution", xlab = "BMI", ylab = "Frequency")
Problem 5:
# Reaction times data
reaction_times <- c(250, 260, 255, 270, 275, 280, 290, 295, 305, 310,
320, 315, 300, 280, 275, 270, 265, 255, 250, 245)

# Mean
mean_reaction <- mean(reaction_times)

# Median
median_reaction <- median(reaction_times)

# Variance
var_reaction <- var(reaction_times)

# Coefficient of Variation (CV)


cv_reaction <- (sd(reaction_times) / mean_reaction) * 100

# Visualize the reaction times data


hist(reaction_times, main = "Reaction Times Distribution", xlab =
"Reaction Time", ylab = "Frequency")

Calculation of distribution and probability


Problem 6: Body Mass Index (BMI)
The distribution of BMI values in a population follows a normal distribution with a mean of 25 and
a standard deviation of 3. Suppose we want to find the probability that a randomly selected individual
has a BMI greater than 30.
Solution:
We need to calculate the area under the normal distribution curve to the right of BMI = 30. To do
this, we first convert BMI = 30 to a z-score and then use the standard normal distribution table or
statistical software.
Using the z-score formula:
Z = (x - μ) / σ
Z = (30 - 25) / 3 ≈ 1.67
Now, we can find the probability using the z-score and the standard normal distribution table:
P(BMI > 30) = P(Z > 1.67)
From the table, we find that P(Z > 1.67) is approximately 0.0475 or 4.75%. Software that uses the
unrounded z-score gives approximately 0.0478.
Therefore, the probability that a randomly selected individual has a BMI greater than 30 is
approximately 4.8%.
R codes
# Required libraries
library(ggplot2)
library(patchwork)

# Parameters
mean_bmi <- 25
sd_bmi <- 3

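# Probability calculation (pnorm gives the area up to the upper bound 30, so subtract it from 1)
prob_bmi <- 1 - pnorm(30, mean = mean_bmi, sd = sd_bmi)

# Equivalent after converting to a z-score: 1 - pnorm((30 - mean_bmi) / sd_bmi)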


# Visualization
x <- seq(10, 40, by = 0.1)
density <- dnorm(x, mean = mean_bmi, sd = sd_bmi)

# Density plot
density_plot <- ggplot(data.frame(x), aes(x)) +
geom_line(aes(y = density), color = "blue") +
geom_area(aes(y = density, fill = (x >= 30)), alpha = 0.3) +
labs(x = "BMI", y = "Density") +
ggtitle("Normal Distribution of BMI") +
theme_minimal()

# Probability plot
prob_plot <- ggplot() +
geom_bar(stat = "identity", data = data.frame(x = 1, prob_bmi), aes(x =
"", y = prob_bmi), fill = "blue") +
coord_polar(theta = "y") +
labs(x = "", y = "Probability") +
ggtitle("Probability of BMI > 30") +
theme_minimal()

# Combine plots
density_plot + prob_plot + plot_layout(ncol = 2)

Note: This code calculates the probability of BMI > 30 using the pnorm function and visualizes the
normal distribution of BMI along with the probability as a density plot and a bar plot.

Problem 7: Drug Dosage


The distribution of blood plasma concentrations of a drug in a population follows a normal
distribution with a mean of 100 mg/L and a standard deviation of 10 mg/L. If the desired therapeutic
range for the drug concentration is between 90 mg/L and 110 mg/L, what percentage of the population
falls within this range?
Solution:
To find the percentage of the population within the desired therapeutic range, we need to calculate
the area under the normal distribution curve between 90 mg/L and 110 mg/L.
First, we need to standardize the values using the z-score formula:
Z1 = (90 - 100) / 10 = -1.0
Z2 = (110 - 100) / 10 = 1.0
Now, we can find the probability using the z-scores and the standard normal distribution table:
P(90 ≤ X ≤ 110) = P(-1.0 ≤ Z ≤ 1.0)
From the table, we find that P(-1.0 ≤ Z ≤ 1.0) is approximately 0.6826 or 68.26%.
Therefore, approximately 68.26% of the population falls within the desired therapeutic range of 90
mg/L to 110 mg/L for the drug concentration.
R codes:
# Required libraries
library(ggplot2)
library(patchwork) # needed to combine the two plots at the end

# Parameters
mean_concentration <- 100 # Mean (mg/L)
sd_concentration <- 10 # Standard deviation (mg/L)

# Probability calculation
prob_range <- diff(pnorm(c(90, 110), mean = mean_concentration, sd =
sd_concentration))

# Visualization
x <- seq(70, 130, by = 0.1)
density <- dnorm(x, mean = mean_concentration, sd = sd_concentration)

# Density plot
density_plot <- ggplot(data.frame(x), aes(x)) +
geom_line(aes(y = density), color = "blue") +
geom_area(aes(y = density, fill = (x >= 90 & x <= 110)), alpha = 0.3) +
labs(x = "Concentration (mg/L)", y = "Density") +
ggtitle("Normal Distribution of Drug Concentration") +
theme_minimal()

# Probability plot
prob_plot <- ggplot() +
geom_bar(stat = "identity", data = data.frame(x = 1, prob_range), aes(x
= "", y = prob_range), fill = "blue") +
coord_polar(theta = "y") +
labs(x = "", y = "Probability") +
ggtitle("Probability of Concentration in Range (90-110)") +
theme_minimal()

# Combine plots
density_plot + prob_plot + plot_layout(ncol = 2)

Note: This code calculates the probability of the drug concentration falling within the range of 90
mg/L to 110 mg/L using the pnorm function and visualizes the normal distribution of the drug
concentration along with the probability as a density plot and a bar plot.

These examples illustrate how normal distribution and probability concepts can be applied in
biostatistics to solve problems related to various variables and parameters of interest, such as
BMI and drug dosage.
The R code examples demonstrate how to solve the given problems using appropriate functions
and visualize the data using ggplot2 for clear and informative plots.
Problem 8: Binomial Distribution


A drug is known to cure a certain disease in 80% of cases. If a doctor treats 10 patients with the drug,
what is the probability that exactly 7 patients will be cured?
Solution:
This problem follows a binomial distribution with parameters n = 10 (number of patients) and p = 0.8
(probability of success, i.e., being cured). We need to calculate P(X = 7), where X represents the
number of cured patients. Using the binomial probability formula
P(X = x) = (nCx) * p^x * (1 - p)^(n - x)
P(X = 7) = (10C7) * (0.8^7) * (1 - 0.8)^(10 - 7) = 0.2013
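The hand calculation can be reproduced term by term in R, which is a useful way to check the arithmetic before relying on dbinom(). This is a minimal sketch; the variable name by_formula is an assumption for illustration.

```r
n <- 10    # number of patients
p <- 0.8   # probability that a patient is cured
x <- 7     # number of cured patients of interest

# choose(n, x) is the binomial coefficient nCx
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

round(by_formula, 4)                                   # 0.2013
all.equal(by_formula, dbinom(x, size = n, prob = p))   # TRUE when both agree
```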
R codes:
# Required library
library(ggplot2)

# Parameters
n <- 10 # Number of patients
p <- 0.8 # Probability of success (cured patients)

# Probability calculation
x <- 7
prob <- dbinom(x, size = n, prob = p)

# Print the probability


print(prob)

# Bar plot
binom_plot <- ggplot(data.frame(x, prob), aes(x, prob)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Number of Cured Patients", y = "Probability") +
ggtitle("Binomial Distribution") +
theme_minimal()

binom_plot

Problem 9: In a clinical trial, the success rate of a new treatment for a specific disease is 60%. If 100
patients are treated with the new drug, what is the probability that at least 70 patients will respond
positively to the treatment?
Solution:
This problem follows a binomial distribution with parameters n = 100 (number of patients) and p =
0.6 (probability of success, i.e., positive response). We need to calculate P(X >= 70), where X
represents the number of patients responding positively. Using the binomial cumulative probability
function
P(X >= x) = 1 - P(X < x)
P(X >= 70) = 1 - P(X < 70)
= 1 - sum(dbinom(0:69, size = 100, prob = 0.6))
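The same tail probability can also be obtained from the cumulative distribution function pbinom(). The sketch below shows three equivalent ways of writing the calculation; the variable names are assumptions for illustration. Note that the cut-off passed to pbinom() is 69, because P(X >= 70) = 1 - P(X <= 69).

```r
n <- 100   # number of patients
p <- 0.6   # probability of a positive response

by_sum   <- 1 - sum(dbinom(0:69, size = n, prob = p))             # complement of a sum
by_cdf   <- 1 - pbinom(69, size = n, prob = p)                    # complement of the CDF
by_upper <- pbinom(69, size = n, prob = p, lower.tail = FALSE)    # upper tail directly

c(by_sum = by_sum, by_cdf = by_cdf, by_upper = by_upper)
```

A common slip is to use 70 as the cut-off, which would give P(X > 70) and leave out the case X = 70.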
R codes:
# Required library
library(ggplot2)

# Parameters
n <- 100 # Number of patients
p <- 0.6 # Probability of success (positive response)

# Probability calculation
x <- 70:n
prob <- 1 - sum(dbinom(0:69, size = n, prob = p))

# Print the probability


print(prob)

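# Bar plot of P(X = k) for k = 70, ..., 100 (the printed probability is the sum of these bars)
binom_plot <- ggplot(data.frame(x, pmf = dbinom(x, size = n, prob = p)), aes(x, pmf)) +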
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Number of Positive Responses", y = "Probability") +
ggtitle("Binomial Distribution") +
theme_minimal()

binom_plot

Problem 10:
In a population, the prevalence of a certain disease is 10%. A diagnostic test for the disease has a
sensitivity of 80% and a specificity of 90%. If a randomly selected individual tests positive for the
disease, what is the probability that the individual actually has the disease?
Solution:
This problem requires applying conditional probability and Bayes' theorem. Let's denote the
following:
D: The individual has the disease (event D)
P: The individual tests positive for the disease (event P)
We need to calculate P(D | P), i.e., the probability that the individual has the disease given that they
tested positive. Using Bayes' theorem:
P(D | P) = (P(P | D) * P(D)) / P(P)
P(P | D) = Sensitivity = 0.80
P(D) = Prevalence = 0.10
P(P) = P(P | D) * P(D) + P(P | D') * P(D')
P(P | D') = 1 - Specificity = 1 - 0.90 = 0.10
P(D') = 1 - P(D) = 0.90
Plugging in the values: P(D | P) = (0.80 * 0.10) / ((0.80 * 0.10) + (0.10 * 0.90)) = 0.08 / 0.17 ≈ 0.4706
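Bayes' theorem is often easier to follow when expressed as counts in a hypothetical population. The sketch below assumes a population of 1000 people (an illustrative choice, not part of the problem) and counts true and false positives; the ratio reproduces the same conditional probability.

```r
population  <- 1000   # hypothetical population size (assumption for illustration)
prevalence  <- 0.10
sensitivity <- 0.80
specificity <- 0.90

diseased  <- population * prevalence        # people with the disease
healthy   <- population - diseased          # people without the disease
true_pos  <- diseased * sensitivity         # diseased people who test positive
false_pos <- healthy * (1 - specificity)    # healthy people who test positive

# Among all positive tests, the share that are true positives is P(D | P)
ppv <- true_pos / (true_pos + false_pos)
ppv
```

Even with a fairly accurate test, fewer than half of the positive results come from diseased individuals here, because healthy people greatly outnumber diseased people.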
R codes:
# Parameters
prevalence <- 0.10
sensitivity <- 0.80
specificity <- 0.90

# Probability calculation
p_positive <- (sensitivity * prevalence) / ((sensitivity * prevalence) +
(1 - specificity) * (1 - prevalence))

# Print the probability


print(p_positive)

Problem 11: Poisson Distribution


The number of bacteria in a water sample follows a Poisson distribution with an average rate of 5
bacteria per milliliter. What is the probability that there are exactly 3 bacteria in a 1-milliliter sample?
Solution:
This problem follows a Poisson distribution with parameter λ = 5 (average rate of bacteria per
milliliter). We need to calculate P(X = 3), where X represents the number of bacteria in a 1-milliliter
sample. Using the Poisson probability formula
P(X = x) = (exp(-λ) * λ^x) / x!
P(X = 3) = (exp(-5) * 5^3) / 3! = 0.1404
Therefore, the probability that there are exactly 3 bacteria in a 1-milliliter sample is approximately
0.1404.
R codes:
# Required library
library(ggplot2)

# Parameter
lambda <- 5 # Average rate of bacteria per milliliter

# Probability calculation
x <- 3
prob <- dpois(x, lambda)

# Print the probability


print(prob)

# Bar plot
poisson_plot <- ggplot(data.frame(x, prob), aes(x, prob)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Number of Bacteria", y = "Probability") +
ggtitle("Poisson Distribution") +
theme_minimal()

poisson_plot

Problem 12:
The number of heart attacks occurring in a particular city follows a Poisson distribution with an
average rate of 2 heart attacks per day. What is the probability that there are more than 3 heart attacks
in a given day?
Solution:
This problem follows a Poisson distribution with parameter λ = 2 (average rate of heart attacks per
day). We need to calculate P(X > 3), where X represents the number of heart attacks in a day. Using
the Poisson cumulative probability function
P(X > x) = 1 - P(X <= x)
P(X > 3) = 1 - sum(dpois(0:3, lambda = 2)) ≈ 0.1429
R codes:
# Required library
library(ggplot2)

# Parameter
lambda <- 2 # Average rate of heart attacks per day

# Probability calculation
x <- 4:20
prob <- 1 - sum(dpois(0:3, lambda = lambda))

# Print the probability


print(prob)

# Bar plot of P(X = k) for k = 4, ..., 20 (the tail beyond 20 is negligible)
poisson_plot <- ggplot(data.frame(x, pmf = dpois(x, lambda)), aes(x, pmf)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Number of Heart Attacks", y = "Probability") +
ggtitle("Poisson Distribution") +
theme_minimal()

poisson_plot

Problem 14: Probability


In a population, the prevalence of a certain genetic disorder is 0.02. If a random individual is selected,
what is the probability that they have the disorder?
Solution:
The problem involves calculating the probability of an event occurring, given the prevalence of the
disorder. The prevalence is the proportion of the population that has the disorder, so
P(Having the disorder) = 0.02
Therefore, the probability that a random individual has the genetic disorder is 0.02.
R codes:
# Probability calculation
probability <- 0.02

# Print the probability


print(probability)

Problem 15:
In a population, the prevalence of a certain genetic mutation is 0.05. If two individuals are randomly
selected, what is the probability that both individuals have the mutation?
Solution:
The problem involves calculating the probability of both events (individuals having the mutation)
occurring, given the prevalence of the mutation. Because the two individuals are selected
independently, the probabilities multiply:
P(Both have the mutation) = 0.05 * 0.05 = 0.0025.
R codes:
# Probability calculation
probability <- 0.05 * 0.05

# Print the probability


print(probability)
Problem 16:
A diagnostic test for a certain disease has a false positive rate of 5% and a false negative rate of 10%.
The prevalence of the disease in the population is 10%. If a randomly selected individual tests
positive for the disease, what is the probability that the individual does not have the disease?
Solution:
This problem requires applying conditional probability. Let's denote the following:
D: The individual has the disease (event D)
P: The individual tests positive for the disease (event P)
We need to calculate P(D' | P), i.e., the probability that the individual does not have the disease given
that they tested positive.
P(D' | P) = (P(P | D') * P(D')) / P(P)
P(P | D') = False Positive Rate = 0.05
P(D') = 1 - Prevalence = 1 - 0.10 = 0.90
P(P) = P(P | D) * P(D) + P(P | D') * P(D')
P(P | D) = 1 - False Negative Rate = 1 - 0.10 = 0.90
P(D) = Prevalence = 0.10
Plugging in the values: P(D' | P) = (0.05 * 0.90) / ((0.90 * 0.10) + (0.05 * 0.90)) = 0.045 / 0.135 ≈ 0.3333
R codes:
# Parameters
prevalence <- 0.10

false_positive <- 0.05


false_negative <- 0.10

# Probability calculation
# The denominator is P(P); P(P | D) = 1 - false negative rate
p_not_disease <- (false_positive * (1 - prevalence)) /
((false_positive * (1 - prevalence)) + ((1 - false_negative) *
prevalence))

# Print the probability


print(p_not_disease)

2. Statistical analysis
1. Student's t-test: The t-test is a parametric test that compares the means of two groups. It
calculates a t-value, which represents the difference between the means relative to the
variability within the groups. The t-value is compared to a critical value from the t-distribution
to determine if the difference is statistically significant. The independent samples t-test is used
when the two groups are independent, while the paired samples t-test is used when the samples
are related or matched.
a. Problem 1
A researcher wants to compare the effectiveness of two different cholesterol-lowering drugs (Drug A
and Drug B). They randomly assign 26 participants (13 per group) to receive either Drug A or Drug B for a period
of 12 weeks. After the treatment, they measure the participants' cholesterol levels. The data are as
follows:
Drug A: 180, 195, 200, 185, 190, 205, 195, 180, 200, 195, 190, 185, 200
Drug B: 175, 185, 190, 165, 170, 180, 185, 170, 190, 195, 180, 175, 185
Is there a significant difference between the mean cholesterol levels of the two drugs? Use a
significance level of 0.05.
Suggestion:
In this problem, the researcher wants to compare the effectiveness of two different cholesterol-
lowering drugs (Drug A and Drug B) by measuring the participants' cholesterol levels. The data
provided for Drug A and Drug B are the cholesterol level measurements for each group. To determine
if there is a significant difference between the mean cholesterol levels of the two drugs, you would
perform an independent samples t-test.
To conduct the t-test, you would calculate the t-value using the formula:
t = (mean of Drug A - mean of Drug B) / sqrt[(variance of Drug A / sample size of Drug A)
+ (variance of Drug B / sample size of Drug B)]
With equal group sizes, as here, this gives the same t-value as the pooled-variance form that R uses
when var.equal = TRUE.
Once you have calculated the t-value, you would compare its absolute value to the critical value from
the t-distribution for the given significance level (0.05 in this case) and the appropriate degrees of
freedom. If the absolute t-value exceeds the critical value, you would conclude that there is a
significant difference between the mean cholesterol levels of the two drugs.
R codes:
# Cholesterol data
drug_A <- c(180, 195, 200, 185, 190, 205, 195, 180, 200, 195, 190, 185,
200)
drug_B <- c(175, 185, 190, 165, 170, 180, 185, 170, 190, 195, 180, 175,
185)

# Independent samples t-test
result <- t.test(drug_A, drug_B, alternative = "two.sided", var.equal = TRUE)

# Print the results
print(result)
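The formula and the critical-value comparison described in the suggestion can also be worked through by hand. The following is a minimal sketch, assuming the pooled-variance (equal variances) form of the test; the names pooled_var, t_manual, dof, t_critical and p_manual are introduced here for illustration only.

```r
# Cholesterol data copied from Problem 1
drug_A <- c(180, 195, 200, 185, 190, 205, 195, 180, 200, 195, 190, 185, 200)
drug_B <- c(175, 185, 190, 165, 170, 180, 185, 170, 190, 195, 180, 175, 185)

n_A <- length(drug_A)
n_B <- length(drug_B)

# Pooled variance (equal-variance assumption, matching var.equal = TRUE)
pooled_var <- ((n_A - 1) * var(drug_A) + (n_B - 1) * var(drug_B)) / (n_A + n_B - 2)

# t-value: difference in means divided by its standard error
t_manual <- (mean(drug_A) - mean(drug_B)) / sqrt(pooled_var * (1 / n_A + 1 / n_B))

# Two-sided critical value at alpha = 0.05
alpha <- 0.05
dof <- n_A + n_B - 2
t_critical <- qt(1 - alpha / 2, dof)

# Two-sided p-value for the same statistic
p_manual <- 2 * pt(-abs(t_manual), dof)

# Decision: reject the null hypothesis if |t| exceeds the critical value
significant <- abs(t_manual) > t_critical
print(c(t = t_manual, critical = t_critical, p = p_manual))
print(significant)
```

Comparing the absolute t-value with the critical value and comparing the p-value with alpha always lead to the same decision, so either can be reported.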

b. Problem 2
A study aims to investigate the effect of a new exercise program on blood pressure. A sample of 20
participants (10 per group) is randomly assigned to either the exercise group or the control group. After 8 weeks,
their systolic blood pressure readings are recorded. The data are as follows:
Exercise group: 130, 125, 135, 140, 132, 128, 127, 130, 133, 135
Control group: 135, 140, 145, 138, 142, 140, 150, 136, 140, 143
Is there a significant difference in the mean systolic blood pressure between the exercise group and
the control group? Use a significance level of 0.01.
Suggestion:
In this problem, the study aims to investigate the effect of a new exercise program on systolic blood
pressure. The participants are divided into an exercise group and a control group, and their systolic
blood pressure readings are recorded. To determine if there is a significant difference in the mean
systolic blood pressure between the two groups, you would perform an independent samples t-test.
Similar to Problem 1, you would calculate the t-value using the formula mentioned earlier and
compare it to the critical value from the t-distribution for the given significance level (0.01 in this
case). If the calculated t-value exceeds the critical value, you would conclude that there is a significant
difference in the mean systolic blood pressure between the exercise group and the control group.
R codes:
# Blood pressure data
exercise_group <- c(130, 125, 135, 140, 132, 128, 127, 130, 133, 135)
control_group <- c(135, 140, 145, 138, 142, 140, 150, 136, 140, 143)

# Independent samples t-test
result <- t.test(exercise_group, control_group, alternative = "two.sided",
  var.equal = TRUE)

# Print the results
print(result)

c. Problem 3:
A researcher investigates the effect of a new drug on pain relief. They randomly assign 14 patients
(7 per group) to receive either the new drug or a placebo. After a specified period, the patients rate
their pain levels on a scale of 1-10 (higher values indicating more pain). The data are as follows:
New drug: 4, 5, 3, 6, 4, 5, 3
Placebo: 6, 7, 5, 7, 6, 8, 7
Is there a significant difference in the mean pain levels between the patients who received the new
drug and those who received the placebo? Use a significance level of 0.05.
Suggestions:
In this problem, the researcher investigates the effect of a new drug on pain relief. The patients are
randomly assigned to receive either the new drug or a placebo, and they rate their pain levels on a
scale of 1-10. To determine if there is a significant difference in the mean pain levels between the two
groups, you would perform an independent samples t-test.
Using the provided data, you would calculate the t-value and compare it to the critical value from the
t-distribution for the given significance level (0.05 in this case). If the calculated t-value exceeds the
critical value, you would conclude that there is a significant difference in the mean pain levels
between the patients who received the new drug and those who received the placebo.
These examples illustrate the application of the independent samples t-test in comparing the means
of two groups. By calculating the t-value and comparing it to the critical value, you can determine if
the observed differences in the data are statistically significant or if they could have occurred by
chance. The t-test is a commonly used statistical test for evaluating group differences in various
research studies.
R codes:
# Pain level data
new_drug <- c(4, 5, 3, 6, 4, 5, 3)
placebo <- c(6, 7, 5, 7, 6, 8, 7)

# Independent samples t-test
result <- t.test(new_drug, placebo, alternative = "two.sided", var.equal = TRUE)

# Print the results
print(result)

Important notes:
• Data Preparation:
Before running the t-test, make sure you have the data properly formatted. In the provided examples,
the data are stored in separate vectors (drug_A, drug_B, exercise_group, control_group, new_drug,
placebo).
Ensure that the data are numerical and correspond to the appropriate groups or conditions you want
to compare.


• t-test Function:
The t.test() function in R performs the independent samples t-test when it is given two data vectors.
The first argument of the function corresponds to the data for the first group, and the second argument
corresponds to the data for the second group.
The alternative argument specifies the alternative hypothesis and can be set to "two.sided" (default),
"less" (for a lower-tailed test), or "greater" (for an upper-tailed test).
The var.equal argument specifies whether to assume equal variances between the groups (TRUE) or
not (FALSE, the default). Setting it to TRUE performs the classical pooled-variance t-test, while setting
it to FALSE performs Welch's t-test, which does not assume equal variances.
• Result Interpretation:
The t.test() function returns a list of results that includes the t-value, degrees of freedom, p-value, and
confidence interval for the difference in means.
The t-value represents the test statistic, which measures the difference between the sample means
relative to the variability within the groups. A larger absolute t-value indicates stronger evidence
against the null hypothesis of equal means.
The p-value indicates the probability of obtaining the observed difference (or a more extreme
difference) assuming the null hypothesis is true. A p-value below the chosen significance level
indicates statistical significance.
The confidence interval provides a range of plausible values for the true difference in means, with the
chosen level of confidence.
• Result Printing:
The print() function is used to display the result of the t-test.
By default, the output includes the t-value, degrees of freedom, p-value, and the confidence interval.
You can customize the output by accessing the individual elements of the result list. For example,
result$p.value will give you only the p-value.
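As an illustration of the notes above, the sketch below runs the Problem 3 data through both the pooled-variance and Welch versions of the test and pulls out individual elements of the result. The object names pooled and welch are chosen here for illustration; the element names (statistic, parameter, p.value, conf.int, estimate) are those of the object that t.test() returns.

```r
# Pain level data copied from Problem 3
new_drug <- c(4, 5, 3, 6, 4, 5, 3)
placebo <- c(6, 7, 5, 7, 6, 8, 7)

# Pooled-variance t-test and Welch's t-test (Welch is R's default)
pooled <- t.test(new_drug, placebo, alternative = "two.sided", var.equal = TRUE)
welch <- t.test(new_drug, placebo, alternative = "two.sided", var.equal = FALSE)

# Individual elements of the result
print(pooled$statistic)  # t-value
print(pooled$parameter)  # degrees of freedom
print(pooled$p.value)    # p-value
print(pooled$conf.int)   # confidence interval for the difference in means
print(pooled$estimate)   # the two sample means

# The two versions differ in their degrees of freedom
print(c(pooled = unname(pooled$parameter), welch = unname(welch$parameter)))

# Decision at the chosen significance level
alpha <- 0.05
print(pooled$p.value < alpha)
```

When the two groups have the same size, as they do here, both versions give the same t-value and differ only in the degrees of freedom, and therefore in the p-value and confidence interval.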

2. Analysis of Variance (ANOVA): ANOVA is a parametric test used to compare the means of
three or more groups. It determines if there are significant differences among the means by
analyzing the variation between groups and within groups. ANOVA calculates an F-value,
which compares the between-group variation to the within-group variation. The F-value is
compared to a critical value from the F-distribution to determine if the group differences are
statistically significant.
Problem 4: ANOVA Example
A researcher wants to compare the mean blood pressure levels among three different treatment groups
(A, B, and C). The data collected are as follows:
Group A: 130, 135, 140, 145
Group B: 125, 130, 135, 140
Group C: 120, 125, 130, 135


Perform a one-way ANOVA to determine if there are any significant differences in the mean blood
pressure levels among the treatment groups. Use a significance level of 0.05.
Solution:
To perform a one-way ANOVA, we use the F-test to compare the variability between groups to the
variability within groups. The null hypothesis is that there are no differences in the means of the
treatment groups.
R Codes:
# Blood pressure data
groupA <- c(130, 135, 140, 145)
groupB <- c(125, 130, 135, 140)
groupC <- c(120, 125, 130, 135)

# Perform one-way ANOVA
result <- aov(c(groupA, groupB, groupC) ~ factor(rep(c("A", "B", "C"),
  each = 4)))

# Summary of ANOVA
print(summary(result))

# Post hoc test (Tukey's HSD)
posthoc <- TukeyHSD(result)
print(posthoc)
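To make the comparison of between-group and within-group variation concrete, the F-value for Problem 4 can be rebuilt from its sums of squares. This is a minimal sketch; the names values, group, ss_between, ss_within, f_manual and f_critical are introduced here for illustration.

```r
# Blood pressure data copied from Problem 4
groupA <- c(130, 135, 140, 145)
groupB <- c(125, 130, 135, 140)
groupC <- c(120, 125, 130, 135)

values <- c(groupA, groupB, groupC)
group <- factor(rep(c("A", "B", "C"), each = 4))

# Between-group sum of squares: group means around the grand mean
grand_mean <- mean(values)
group_means <- tapply(values, group, mean)
group_sizes <- tapply(values, group, length)
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: observations around their own group mean
ss_within <- sum((values - ave(values, group))^2)

# Degrees of freedom and F-value
df_between <- nlevels(group) - 1
df_within <- length(values) - nlevels(group)
f_manual <- (ss_between / df_between) / (ss_within / df_within)

# Critical value and p-value at alpha = 0.05
alpha <- 0.05
f_critical <- qf(1 - alpha, df_between, df_within)
p_manual <- pf(f_manual, df_between, df_within, lower.tail = FALSE)

print(c(F = f_manual, critical = f_critical, p = p_manual))
```

For these data the between-group mean square is 100 and the within-group mean square is about 41.67, so F = 2.4. That is below the critical value of about 4.26, so the null hypothesis of equal means is not rejected at the 0.05 level.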

Data Visualization (Box plots)
# Box plots Example
# Blood pressure data
groupA <- c(130, 135, 140, 145)
groupB <- c(125, 130, 135, 140)
groupC <- c(120, 125, 130, 135)

# Combine data into a single vector
data <- c(groupA, groupB, groupC)

# Create factor variable for groups
groups <- factor(rep(c("A", "B", "C"), each = 4))

# Box plot
boxplot(data ~ groups, xlab = "Groups", ylab = "Blood Pressure Levels",
  main = "Blood Pressure Levels by Group")
Problem 5: Drug Efficacy Study
A pharmaceutical company is testing the efficacy of three different drugs (A, B, and C) for treating a
specific condition. They randomly assign 50 patients into three groups: Group A receives Drug A,
Group B receives Drug B, and Group C receives Drug C. After a certain treatment period, the patients'
symptom scores are measured. The company wants to determine if there are any significant
differences in the mean symptom scores among the three drug groups.
Solution
After performing the one-way ANOVA, you will obtain a summary table that provides information
on the between-group variability (sums of squares, degrees of freedom, mean squares) and the within-
group variability (residual sum of squares, degrees of freedom, mean squares). The F-statistic and its
corresponding p-value are also reported. If the p-value is below the chosen significance level (e.g.,
0.05), you can conclude that there are significant differences among the groups.
Post hoc tests, such as Tukey's HSD, can be performed to determine which specific groups differ
significantly from each other. The post hoc test results will provide confidence intervals and p-values
for pairwise group comparisons.
R codes:
# Drug Efficacy Study Example
# Symptom scores (illustrative values for 10 patients per group)
groupA <- c(3, 4, 2, 5, 3, 4, 3, 2, 4, 5)
groupB <- c(2, 3, 2, 4, 3, 2, 1, 3, 4, 2)
groupC <- c(4, 5, 5, 4, 3, 5, 3, 4, 4, 3)

# Perform one-way ANOVA


result <- aov(c(groupA, groupB, groupC) ~ factor(rep(c("A", "B", "C"),
each = 10)))

# Summary of ANOVA
print(summary(result))

# Post hoc test (Tukey's HSD)
posthoc <- TukeyHSD(result)
print(posthoc)

Visualizing the data:


Data visualization is an essential aspect of data analysis.
Here's an example of how you can visualize the data using box plots in R:
# Box plot Example
# Symptom scores
groupA <- c(3, 4, 2, 5, 3, 4, 3, 2, 4, 5)
groupB <- c(2, 3, 2, 4, 3, 2, 1, 3, 4, 2)
groupC <- c(4, 5, 5, 4, 3, 5, 3, 4, 4, 3)

# Combine data into a single vector
data <- c(groupA, groupB, groupC)

# Create factor variable for groups
groups <- factor(rep(c("A", "B", "C"), each = 10))

# Box plot
boxplot(data ~ groups, xlab = "Groups", ylab = "Symptom Scores",
  main = "Symptom Scores by Group")

Problem 6: Drug Dosage Comparison


A pharmaceutical company is comparing the effectiveness of three different dosages (Low, Medium,
High) of a drug in reducing blood pressure. They randomly assign 60 patients into the three dosage
groups. After a certain treatment period, the patients' blood pressure levels are recorded. The company
wants to determine if there are any significant differences in the mean blood pressure levels among
the three dosage groups.
Solution:
In this problem, we have three dosage groups: Low, Medium, and High. We want to determine if there
are any significant differences in the mean blood pressure levels among these groups.
To solve this, we perform a one-way ANOVA and examine the p-value. If the p-value is below the
chosen significance level (e.g., 0.05), we can conclude that there are significant differences among
the dosage groups. Additionally, we can conduct post hoc tests, such as Tukey's HSD, to determine
which specific groups differ significantly from each other.
R codes:
# Drug Dosage Comparison Example
# Blood pressure levels (illustrative values for 10 patients per group)
low_dosage <- c(120, 122, 118, 125, 123, 120, 116, 118, 122, 119)
medium_dosage <- c(130, 128, 132, 135, 129, 136, 130, 133, 137, 131)
high_dosage <- c(140, 138, 142, 145, 139, 136, 140, 142, 141, 144)

# Perform one-way ANOVA


result <- aov(c(low_dosage, medium_dosage, high_dosage) ~
factor(rep(c("Low", "Medium", "High"), each = 10)))

# Summary of ANOVA
print(summary(result))

# Post hoc test (Tukey's HSD)


posthoc <- TukeyHSD(result)
print(posthoc)
Visualization of the data using box plots:
# Box plot Example
# Blood pressure levels
low_dosage <- c(120, 122, 118, 125, 123, 120, 116, 118, 122, 119)
medium_dosage <- c(130, 128, 132, 135, 129, 136, 130, 133, 137, 131)
high_dosage <- c(140, 138, 142, 145, 139, 136, 140, 142, 141, 144)

# Combine data into a single vector
data <- c(low_dosage, medium_dosage, high_dosage)

# Create factor variable for dosage groups
# (levels are set explicitly so the plot shows Low, Medium, High rather than alphabetical order)
groups <- factor(rep(c("Low", "Medium", "High"), each = 10),
  levels = c("Low", "Medium", "High"))

# Box plot
boxplot(data ~ groups, xlab = "Dosage Groups", ylab = "Blood Pressure Levels",
  main = "Blood Pressure Levels by Dosage Group")

Note:
This code will generate a box plot visualizing the distribution of blood pressure levels for each dosage
group. The x-axis represents the dosage groups (Low, Medium, High), while the y-axis represents the
blood pressure levels. The plot title and axis labels can be customized to suit your specific needs.


By examining the box plots, you can visually compare the central tendency and dispersion of blood
pressure levels among the dosage groups. This visualization can help identify any potential
differences or patterns in the data.
Problem 7: Two-Way ANOVA Example
A study is conducted to investigate the effects of two factors, diet type (A, B, C) and exercise intensity
(low, medium, high), on cholesterol levels. The data collected are as follows:
Diet A, Low Exercise: 150, 155, 160
Diet A, Medium Exercise: 140, 145, 150
Diet A, High Exercise: 130, 135, 140
Diet B, Low Exercise: 160, 165, 170
Diet B, Medium Exercise: 150, 155, 160
Diet B, High Exercise: 140, 145, 150
Diet C, Low Exercise: 170, 175, 180
Diet C, Medium Exercise: 160, 165, 170
Diet C, High Exercise: 150, 155, 160
Perform a two-way ANOVA to determine if there are any significant effects of diet type, exercise
intensity, or their interaction on cholesterol levels. Use a significance level of 0.05.
Solution:
To perform a two-way ANOVA, we examine the effects of two factors (diet type and exercise
intensity) and their interaction on the outcome (cholesterol levels).
R codes:
# Cholesterol data (listed by diet, then by exercise level within each diet)
diet <- factor(rep(c("A", "B", "C"), each = 9))
exercise <- factor(rep(c("Low", "Medium", "High"), each = 3, times = 3))
cholesterol <- c(150, 155, 160, 140, 145, 150, 130, 135, 140, 160, 165,
  170, 150, 155, 160, 140, 145, 150, 170, 175, 180, 160, 165, 170, 150, 155,
  160)

# Perform two-way ANOVA


result <- aov(cholesterol ~ diet + exercise + diet:exercise)

# Summary of ANOVA
print(summary(result))

# Post hoc test (Tukey's HSD)
posthoc <- TukeyHSD(result)
print(posthoc)

Problem 8: Drug Efficacy Study with Gender


A pharmaceutical company is testing the efficacy of a drug on a specific condition, considering both
the drug type (A, B, C) and gender (Male, Female) as factors. They randomly assign patients to
different drug groups and record their symptom scores. The company wants to determine if there are
any significant differences in the mean symptom scores considering both the drug type and gender.
Solution:
To solve this, we perform a two-way ANOVA to analyze the effects of both the drug type and gender
on the symptom scores. We examine the main effects of drug type and gender, as well as the
interaction effect between drug type and gender.
R codes:
# Drug Efficacy Study Example with Gender
# Symptom scores
drugA_male <- c(3, 4, 2, 5, 3)
drugB_male <- c(2, 3, 2, 4, 3)
drugC_male <- c(4, 5, 5, 4, 3)
drugA_female <- c(4, 3, 2, 3, 4)
drugB_female <- c(3, 2, 1, 3, 4)
drugC_female <- c(5, 3, 4, 4, 3)

# Combine data into a single vector


data <- c(drugA_male, drugB_male, drugC_male, drugA_female, drugB_female,
drugC_female)

# Create factor variables for drug type and gender
# (both must have one entry per observation, in the same order as data)
drug <- factor(rep(c("A", "B", "C"), each = 5, times = 2))
gender <- factor(rep(c("Male", "Female"), each = 15))

# Two-way ANOVA
result <- aov(data ~ drug * gender)
print(summary(result))

# Interaction plot
interaction.plot(drug, gender, data, xlab = "Drug", ylab = "Symptom Scores",
  legend = TRUE)


Note: This code will generate an interaction plot that shows the interaction effect of drug type and
gender on symptom scores. The x-axis represents the drug type, the lines represent the gender (Male,
Female), and the y-axis represents the symptom scores. The plot helps us visualize how the
relationship between drug type and symptom scores differs across gender groups.
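Because these two-way ANOVA examples build the factor vectors by hand with rep(), a quick cross-tabulation is one way to confirm that every observation carries the intended labels. The sketch below reuses the Problem 8 scores. The names scores, cell_counts and cell_means are introduced here for illustration, and the rep() arguments assume the data are concatenated male first and female second, with drugs A, B, C in that order within each gender.

```r
# Symptom scores copied from Problem 8
drugA_male <- c(3, 4, 2, 5, 3)
drugB_male <- c(2, 3, 2, 4, 3)
drugC_male <- c(4, 5, 5, 4, 3)
drugA_female <- c(4, 3, 2, 3, 4)
drugB_female <- c(3, 2, 1, 3, 4)
drugC_female <- c(5, 3, 4, 4, 3)

scores <- c(drugA_male, drugB_male, drugC_male,
            drugA_female, drugB_female, drugC_female)

# Factor labels in the same order as the concatenated scores
drug <- factor(rep(c("A", "B", "C"), each = 5, times = 2))
gender <- factor(rep(c("Male", "Female"), each = 15))

# Every vector must have one entry per observation
stopifnot(length(drug) == length(scores), length(gender) == length(scores))

# Each drug-by-gender cell should contain exactly 5 observations
cell_counts <- table(drug, gender)
print(cell_counts)

# Cell means recovered through the factors should match the raw vectors
cell_means <- tapply(scores, list(drug, gender), mean)
print(cell_means)
```

If a cell count is off, or a cell mean does not match the mean of the corresponding raw vector, the each and times arguments do not match the order in which the data were combined.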
Problem 9: Drug Dosage Study with Time Points
A pharmaceutical company is comparing the efficacy of different dosages (Low, Medium, High) of a
drug for treating a condition, considering multiple time points (Baseline, Week 4, Week 8) as factors.
They measure the symptom scores at each time point for patients in each dosage group. The company
wants to determine if there are any significant differences in the mean symptom scores considering
both the dosages and time points.
Solution:
To solve this, we perform a two-way ANOVA to analyze the effects of both the dosage and time points
on the symptom scores. We examine the main effects of dosage and time points, as well as the
interaction effect between dosage and time points.
R codes:
# Drug Dosage Study Example with Time Points
# Symptom scores
low_baseline <- c(3, 4, 2, 5, 3)
low_week4 <- c(2, 3, 2, 4, 3)
low_week8 <- c(4, 5, 5, 4, 3)
medium_baseline <- c(4, 3, 2, 3, 4)
medium_week4 <- c(3, 2, 1, 3, 4)
medium_week8 <- c(5, 3, 4, 4, 3)
high_baseline <- c(2, 3, 2, 4, 3)
high_week4 <- c(4, 5, 5, 4, 3)
high_week8 <- c(3, 2, 1, 3, 4)

# Combine data into a single vector


data <- c(low_baseline, low_week4, low_week8,
medium_baseline, medium_week4, medium_week8,
high_baseline, high_week4, high_week8)

# Create factor variables for dosages and time points
# (both must have one entry per observation, in the same order as data)
dosage <- factor(rep(c("Low", "Medium", "High"), each = 15),
  levels = c("Low", "Medium", "High"))
time <- factor(rep(c("Baseline", "Week 4", "Week 8"), each = 5, times = 3))

# Two-way ANOVA
result <- aov(data ~ dosage * time)
print(summary(result))

# Line plot
interaction.plot(dosage, time, data, xlab = "Dosage", ylab = "Symptom Scores",
  legend = TRUE, type = "b")

Note: This code will generate a line plot that shows the interaction effect of dosage and time points
on symptom scores. The x-axis represents the dosages, the lines represent the time points (Baseline,
Week 4, Week 8), and the y-axis represents the symptom scores. The plot helps us visualize how the
symptom scores change over time for each dosage group and how the relationship between dosage
and symptom scores differs across time points.
Remember to customize the plot labels and titles as needed to suit your specific analysis!
Problem 10: Vaccine Efficacy Study with Ethnicity
A research team is studying the efficacy of a new vaccine for preventing a certain disease, considering
both the vaccine type (A, B, C) and ethnicity (Asian, Black, White) as factors. They administer the
different vaccines to individuals from different ethnic backgrounds and record the presence or absence
of the disease. The team wants to determine if there are any significant differences in the disease
incidence considering both the vaccine type and ethnicity.
R codes:
# Vaccine Efficacy Study Example with Ethnicity
# Disease incidence
vaccineA_asian <- c(10, 5, 7, 12, 8)
vaccineB_asian <- c(8, 4, 6, 9, 7)
vaccineC_asian <- c(6, 3, 5, 7, 5)
vaccineA_black <- c(9, 6, 7, 11, 9)
vaccineB_black <- c(7, 5, 6, 8, 6)
vaccineC_black <- c(5, 4, 4, 6, 5)
vaccineA_white <- c(11, 7, 8, 14, 10)
vaccineB_white <- c(9, 6, 7, 10, 8)
vaccineC_white <- c(7, 5, 6, 8, 6)

# Combine data into a single vector


data <- c(vaccineA_asian, vaccineB_asian, vaccineC_asian, vaccineA_black,
vaccineB_black, vaccineC_black, vaccineA_white, vaccineB_white,
vaccineC_white)

# Create factor variables for vaccine type and ethnicity
# (data are listed by ethnicity, then by vaccine within each ethnicity)
vaccine <- factor(rep(c("A", "B", "C"), each = 5, times = 3))
ethnicity <- factor(rep(c("Asian", "Black", "White"), each = 15))

# Two-way ANOVA
result <- aov(data ~ vaccine * ethnicity)
print(summary(result))

# Interaction plot
interaction.plot(vaccine, ethnicity, data, xlab = "Vaccine",
  ylab = "Disease Incidence", legend = TRUE)

Problem 11: Clinical Trial with Treatment and Gender


A clinical trial is conducted to evaluate the effectiveness of two different treatments (Treatment A,
Treatment B) for a specific medical condition, considering both the treatment type and gender as
factors. Patients are randomly assigned to one of the treatment groups, and their treatment outcomes
are recorded. The researchers want to determine if there are any significant differences in the
treatment outcomes considering both the treatment type and gender.
Solution:
To solve this, we perform a two-way ANOVA to analyze the effects of both treatment type and gender
on the treatment outcomes. As in the previous problems, we examine the main effect of each factor
and the interaction effect between the two factors.
R codes:
# Clinical Trial Example with Treatment and Gender
# Treatment outcomes
treatmentA_male <- c(3, 4, 2, 5, 3)
treatmentB_male <- c(2, 3, 2, 4, 3)
treatmentA_female <- c(4, 3, 2, 3, 4)
treatmentB_female <- c(3, 2, 1, 3, 4)

# Combine data into a single vector


data <- c(treatmentA_male, treatmentB_male, treatmentA_female,
treatmentB_female)

# Create factor variables for treatment and gender
# (data are listed by gender, then by treatment within each gender)
treatment <- factor(rep(c("A", "B"), each = 5, times = 2))
gender <- factor(rep(c("Male", "Female"), each = 10))

# Two-way ANOVA
result <- aov(data ~ treatment * gender)
print(summary(result))

# Interaction plot
interaction.plot(treatment, gender, data, xlab = "Treatment",
  ylab = "Treatment Outcomes", legend = TRUE)

Note: In each example, you need to replace the data vectors with your actual data, adjust the factor
levels as needed, and customize the plot labels and titles according to your specific study.
The R code for visualizing these examples would depend on the specific nature of the data and the
desired visualization technique. Some common visualization options for two-way ANOVA include
interaction plots, bar plots, heatmaps, or scatter plots.
Here's a general template for visualizing a two-way ANOVA using an interaction plot in R:
# Load necessary libraries
library(ggplot2)

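# Create a data frame with your data
# (n and outcome_values are placeholders: outcome_values needs 3 * n entries, and n
# should be a multiple of 3, at least 6, so that the design is balanced and the
# interaction model has residual degrees of freedom)
df <- data.frame(
  Factor1 = factor(rep(c("Level1", "Level2", "Level3"), each = n)),
  Factor2 = factor(rep(c("LevelA", "LevelB", "LevelC"), times = n)),
  Outcome = c(outcome_values)
)

# Perform the two-way ANOVA
result <- aov(Outcome ~ Factor1 * Factor2, data = df)

# Create an interaction plot of the cell means
ggplot(df, aes(x = Factor1, y = Outcome, color = Factor2, group = Factor2)) +
  stat_summary(fun = mean, geom = "line") +
  stat_summary(fun = mean, geom = "point") +
  labs(x = "Factor 1", y = "Outcome", color = "Factor 2") +
  theme_bw()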

Remember to replace "Factor1", "Factor2", and "Outcome" with the appropriate variable names from
your dataset, and customize the plot labels and titles as needed.


These examples and the provided R code offer a starting point for analyzing and visualizing two-way
ANOVA data in biostatistics. You can adapt them to fit your own datasets and research questions,
exploring different visualization techniques based on the nature of your data and the specific
hypotheses you want to investigate.
3. Chi-square test: The chi-square test is a non-parametric test used to analyze the association
between two categorical variables. It compares the observed frequencies in each category to
the expected frequencies under the assumption of independence. The test calculates a chi-
square statistic, which measures the overall discrepancy between observed and expected
frequencies. The chi-square statistic is compared to a critical value from the chi-square
distribution to determine if the association is statistically significant.
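The comparison of observed and expected frequencies can be sketched as follows. The counts are hypothetical and were made up for this illustration, as are the row and column labels. The continuity correction is switched off (correct = FALSE) so that the result of chisq.test() matches the hand calculation.

```r
# Hypothetical 2 x 2 table of counts: treatment group by outcome
observed <- matrix(c(30, 20,
                     15, 35),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Treatment = c("Drug", "Placebo"),
                                   Outcome = c("Improved", "Not improved")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq_manual <- sum((observed - expected)^2 / expected)

# Degrees of freedom and critical value at alpha = 0.05
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
chi_critical <- qchisq(1 - 0.05, df = dof)

# The same test with the built-in function (no continuity correction)
result <- chisq.test(observed, correct = FALSE)
print(result)
print(c(manual = chi_sq_manual, critical = chi_critical))
```

If the statistic exceeds the critical value, or equivalently if result$p.value is below the significance level, the hypothesis of independence is rejected.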
4. Pearson correlation coefficient: The Pearson correlation coefficient, denoted as "r,"
measures the strength and direction of the linear relationship between two continuous
variables. It ranges from -1 to +1, where -1 represents a perfect negative linear relationship,
+1 represents a perfect positive linear relationship, and 0 represents no linear relationship. The
correlation coefficient is estimated by calculating the covariance between the variables
divided by the product of their standard deviations.
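The definition above, covariance divided by the product of the standard deviations, can be checked directly against R's built-in functions. The dose and response values below are hypothetical and serve only to illustrate the calculation.

```r
# Hypothetical paired measurements
dose <- c(10, 20, 30, 40, 50, 60)
response <- c(12, 19, 33, 38, 52, 58)

# Pearson r from its definition
r_manual <- cov(dose, response) / (sd(dose) * sd(response))

# The same coefficient from the built-in function
r_builtin <- cor(dose, response, method = "pearson")

# Test of the null hypothesis that the true correlation is zero
test <- cor.test(dose, response, method = "pearson")

print(c(manual = r_manual, builtin = r_builtin))
print(test$p.value)
```

cor.test() additionally reports a confidence interval for r, which is available as test$conf.int.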
5. Simple linear regression: Simple linear regression is a parametric model that examines the
relationship between a dependent variable and one independent variable. It assumes a linear
relationship and estimates the slope and intercept of the regression line. The regression line
represents the best-fit line that minimizes the sum of squared differences between the observed
and predicted values. The model can be used to predict the values of the dependent variable
based on the independent variable.
General solution:
For each problem, you would perform a simple linear regression analysis using appropriate statistical
software (such as R, Python, or SPSS). The analysis would involve fitting a regression line to the data
and assessing the significance of the relationship between the independent variable and dependent
variable.
Here's a general template for performing a simple linear regression analysis using R:
# Load necessary libraries
library(ggplot2)

# Create a data frame with your data
# (independent_variable_values and dependent_variable_values are placeholders)
df <- data.frame(
  Independent_Variable = c(independent_variable_values),
  Dependent_Variable = c(dependent_variable_values)
)

# Perform simple linear regression
model <- lm(Dependent_Variable ~ Independent_Variable, data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the regression line
ggplot(df, aes(x = Independent_Variable, y = Dependent_Variable)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Independent Variable", y = "Dependent Variable") +
  theme_bw()

Note: Remember to replace "Independent_Variable" and "Dependent_Variable" with the appropriate
variable names from your dataset, and customize the plot labels and titles as needed.
These examples and the provided R code offer a starting point for analyzing and visualizing simple
linear regression data in biostatistics. You can adapt them to fit your own datasets and research
questions, exploring different statistical software and techniques based on your specific study
objectives.
Problem 1: Height and Weight Relationship
A researcher wants to investigate the relationship between height (independent variable) and weight
(dependent variable) in a sample of individuals. The researcher collects data on the height and weight
of 50 participants. They want to determine if there is a significant linear relationship between height
and weight and estimate the weight based on the height of an individual.
Example of R codes:
# Height and Weight Relationship Example
# Height (independent variable)
height <- c(150, 160, 165, 170, 155, 175, 180, 158, 166, 172, 168, 162,
157, 169, 163, 171, 176, 159, 173, 167, 161, 164, 154, 177, 181, 156, 174,
179, 153, 178, 152)

# Weight (dependent variable)


weight <- c(50, 58, 60, 65, 52, 70, 75, 55, 62, 68, 66, 57, 54, 67, 58,
69, 72, 56, 71, 65, 59, 60, 51, 73, 77, 53, 68, 74, 49, 76, 48)

# Create a data frame
df <- data.frame(Height = height, Weight = weight)

# Perform simple linear regression
model <- lm(Weight ~ Height, data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the regression line (requires the ggplot2 package)
library(ggplot2)
ggplot(df, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Height", y = "Weight") +
  theme_bw()
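The problem also asks for an estimate of weight from a given height. One way to obtain it, sketched here under the assumption that heights of 160 and 170 are of interest (these two values are chosen only for illustration), is to pass new data to predict():

```r
# Height and weight data copied from the example above
height <- c(150, 160, 165, 170, 155, 175, 180, 158, 166, 172, 168, 162,
            157, 169, 163, 171, 176, 159, 173, 167, 161, 164, 154, 177, 181,
            156, 174, 179, 153, 178, 152)
weight <- c(50, 58, 60, 65, 52, 70, 75, 55, 62, 68, 66, 57, 54, 67, 58,
            69, 72, 56, 71, 65, 59, 60, 51, 73, 77, 53, 68, 74, 49, 76, 48)

df <- data.frame(Height = height, Weight = weight)
model <- lm(Weight ~ Height, data = df)

# Estimated intercept and slope of the regression line
coefficients <- coef(model)
print(coefficients)

# Predicted weights with 95% prediction intervals for the new heights
new_heights <- data.frame(Height = c(160, 170))
predicted <- predict(model, newdata = new_heights,
                     interval = "prediction", level = 0.95)
print(cbind(new_heights, predicted))
```

The fit column is the point estimate (intercept plus slope times height), and lwr and upr bound the range in which a new individual's weight is expected to fall. Predictions are only reliable for heights within the range covered by the sample.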

Problem 2: Blood Pressure and Age


A study is conducted to examine the association between blood pressure (dependent variable) and age
(independent variable) in a group of 100 participants. The researchers measure the blood pressure and
record the age of each participant. They aim to determine if age can be used as a predictor of blood
pressure and quantify the strength and direction of the relationship.
Example of R codes:
# Blood Pressure and Age Example
# Age (independent variable)
age <- c(25, 33, 42, 51, 37, 60, 45, 29, 48, 55, 40, 34, 39, 47, 52, 43,
36, 31, 56, 38, 44, 27, 50, 41, 57, 46, 32, 59, 28, 53, 30, 35, 49, 54,
58)

# Blood Pressure (dependent variable)


blood_pressure <- c(120, 130, 140, 150, 140, 160, 150, 130, 155, 165, 135,
125, 138, 150, 155, 142, 136, 128, 160, 137, 145, 125, 158, 140, 163, 147,
130, 162, 127, 156, 132, 134, 148, 152, 159)

# Create a data frame
df <- data.frame(Age = age, Blood_Pressure = blood_pressure)

# Perform simple linear regression
model <- lm(Blood_Pressure ~ Age, data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the regression line (requires the ggplot2 package)
library(ggplot2)
ggplot(df, aes(x = Age, y = Blood_Pressure)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Age", y = "Blood Pressure") +
  theme_bw()

Problem 3: Cholesterol Level and Dietary Fat Intake


A nutritionist is interested in understanding the relationship between cholesterol levels (dependent
variable) and dietary fat intake (independent variable) among a sample of 75 individuals. The
nutritionist assesses the cholesterol levels and dietary fat intake of each participant and wants to
examine if higher dietary fat intake is associated with increased cholesterol levels.
Example R code (the vectors below are an illustrative sample of 35 observations, not the full 75):
# Cholesterol Level and Dietary Fat Intake Example
# Load ggplot2 for the plot at the end
library(ggplot2)

# Dietary Fat Intake (independent variable)
fat_intake <- c(40, 50, 60, 70, 55, 65, 75, 45, 55, 65, 50, 55, 58, 62,
                68, 52, 58, 60, 70, 65, 45, 75, 40, 50, 55, 60, 70, 65, 55, 45, 68, 62,
                58, 50, 48)

# Cholesterol Level (dependent variable)
cholesterol <- c(180, 190, 200, 210, 195, 205, 215, 185, 195, 205, 190,
                 195, 198, 202, 208, 192, 198, 200, 210, 205, 185, 215, 180, 190, 195, 200,
                 210, 205, 195, 185, 208, 202, 198, 190, 188)

# Create a data frame
df <- data.frame(Fat_Intake = fat_intake, Cholesterol = cholesterol)

# Perform simple linear regression
model <- lm(Cholesterol ~ Fat_Intake, data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the regression line
ggplot(df, aes(x = Fat_Intake, y = Cholesterol)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Dietary Fat Intake", y = "Cholesterol Level") +
  theme_bw()
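
One caveat about the sample vectors in this example: every cholesterol value equals the fat intake plus 140, so the points lie exactly on a straight line. The short check below is a sketch that makes this visible. With an exact fit the residuals are zero, and summary() warns that the fit is essentially perfect and its statistics may be unreliable; real measurements would show scatter around the line.

```r
# Same illustrative data as in the example above
fat_intake <- c(40, 50, 60, 70, 55, 65, 75, 45, 55, 65, 50, 55, 58, 62,
                68, 52, 58, 60, 70, 65, 45, 75, 40, 50, 55, 60, 70, 65, 55, 45, 68, 62,
                58, 50, 48)
cholesterol <- c(180, 190, 200, 210, 195, 205, 215, 185, 195, 205, 190,
                 195, 198, 202, 208, 192, 198, 200, 210, 205, 185, 215, 180, 190, 195, 200,
                 210, 205, 195, 185, 208, 202, 198, 190, 188)

# Check whether the response is an exact linear function of the predictor
exact_line <- all(cholesterol == fat_intake + 140)
print(exact_line)

# The fitted coefficients reproduce that line (intercept 140, slope 1)
model <- lm(cholesterol ~ fat_intake)
print(coef(model))

# Largest absolute residual: zero up to floating-point rounding
print(max(abs(residuals(model))))
```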

6. Multiple linear regression: Multiple linear regression extends simple linear regression to
include multiple independent variables. It examines the relationship between a dependent
variable and several independent variables, assuming a linear relationship. Multiple regression
estimates the coefficients for each independent variable and allows for adjusting for
confounding factors. The model can be used for prediction, inference, and identifying
significant predictors.
Problem 1: Blood Pressure Prediction
A researcher wants to predict blood pressure (dependent variable) based on age (independent variable
1), body mass index (BMI) (independent variable 2), and cholesterol level (independent variable 3).
The researcher gathers data on 100 individuals, including their age, BMI, cholesterol level, and
corresponding blood pressure measurements. The goal is to develop a multiple linear regression
model to predict blood pressure using the three independent variables.
Problem 2: Disease Progression Prediction
A study aims to predict the progression of a particular disease (dependent variable) based on variables
such as age (independent variable 1), gender (independent variable 2), smoking status (independent
variable 3), and genetic marker (independent variable 4). The researchers collect data from 200
patients, recording their age, gender, smoking status, genetic marker status, and disease progression
scores. The objective is to build a multiple linear regression model to predict the disease progression
based on the given independent variables.
Problem 3: Drug Dosage Optimization
A pharmaceutical company is conducting a study to optimize the dosage of a new drug (dependent
variable) based on variables like body weight (independent variable 1), age (independent variable 2),
and liver function (independent variable 3). The company collects data on 50 patients, including their
body weight, age, liver function test results, and the corresponding optimal drug dosage. The aim is
to develop a multiple linear regression model to determine the optimal drug dosage based on the
independent variables.
Solution:
For each problem, you would perform a multiple linear regression analysis using appropriate
statistical software (such as R, Python, or SPSS). The analysis would involve fitting a regression
model to the data and assessing the significance of the relationships between the independent
variables and the dependent variable.
Here's a general template for performing a multiple linear regression analysis using R:
# Load necessary libraries
library(ggplot2)

# Create a data frame with your data
# (replace the placeholder names inside c() with your own values)
df <- data.frame(
  Dependent_Variable = c(dependent_variable_values),
  Independent_Variable_1 = c(independent_variable_1_values),
  Independent_Variable_2 = c(independent_variable_2_values),
  Independent_Variable_3 = c(independent_variable_3_values)
)

# Perform multiple linear regression
# (the "." on the right-hand side means "all other columns in df")
model <- lm(Dependent_Variable ~ ., data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the predicted vs. observed values
ggplot(df, aes(x = Dependent_Variable, y = fitted(model))) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Observed", y = "Predicted") +
  theme_bw()

Detailed R code for each problem (each example uses a small illustrative sample of 15 observations):


Problem 1:
# Blood Pressure Prediction Example
# Load ggplot2 for the plot at the end
library(ggplot2)

# Age (independent variable 1)
age <- c(35, 42, 50, 55, 60, 38, 45, 52, 58, 65, 40, 48, 54, 59, 63)

# Body Mass Index (independent variable 2)
bmi <- c(25, 28, 30, 31, 29, 26, 27, 29, 32, 33, 24, 27, 31, 30, 34)

# Cholesterol Level (independent variable 3)
cholesterol <- c(180, 195, 200, 210, 190, 185, 195, 205, 220, 230, 180,
                 195, 200, 210, 225)

# Blood Pressure (dependent variable)
blood_pressure <- c(120, 128, 135, 140, 130, 122, 126, 133, 142, 148, 118,
                    130, 138, 142, 150)

# Create a data frame
df <- data.frame(Age = age, BMI = bmi, Cholesterol = cholesterol,
                 Blood_Pressure = blood_pressure)

# Perform multiple linear regression
model <- lm(Blood_Pressure ~ Age + BMI + Cholesterol, data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the predicted vs. observed values
ggplot(df, aes(x = Blood_Pressure, y = fitted(model))) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Observed", y = "Predicted") +
  theme_bw()
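
Since the stated goal is prediction, the fitted model can also be applied to a new individual. The sketch below assumes the same illustrative data as above; the values in new_patient are hypothetical and chosen only to show the call. predict() with interval = "prediction" returns the point prediction together with a 95% prediction interval.

```r
# Same illustrative data as in the example above
age <- c(35, 42, 50, 55, 60, 38, 45, 52, 58, 65, 40, 48, 54, 59, 63)
bmi <- c(25, 28, 30, 31, 29, 26, 27, 29, 32, 33, 24, 27, 31, 30, 34)
cholesterol <- c(180, 195, 200, 210, 190, 185, 195, 205, 220, 230, 180,
                 195, 200, 210, 225)
blood_pressure <- c(120, 128, 135, 140, 130, 122, 126, 133, 142, 148, 118,
                    130, 138, 142, 150)
df <- data.frame(Age = age, BMI = bmi, Cholesterol = cholesterol,
                 Blood_Pressure = blood_pressure)

model <- lm(Blood_Pressure ~ Age + BMI + Cholesterol, data = df)

# Hypothetical new individual (values are assumptions for illustration);
# column names must match the predictor names used in the model formula
new_patient <- data.frame(Age = 50, BMI = 28, Cholesterol = 200)

# Point prediction with a 95% prediction interval (columns: fit, lwr, upr)
pred <- predict(model, newdata = new_patient,
                interval = "prediction", level = 0.95)
print(pred)
```

The prediction interval is wider than a confidence interval for the mean response because it also accounts for the variability of an individual observation.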

Problem 2:
# Disease Progression Prediction Example
# Load ggplot2 for the plot at the end
library(ggplot2)

# Age (independent variable 1)
age <- c(45, 52, 60, 62, 55, 48, 50, 57, 65, 68, 41, 49, 56, 63, 67)

# Gender (independent variable 2)
gender <- c("Male", "Female", "Male", "Female", "Male", "Female", "Male",
            "Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male")

# Smoking Status (independent variable 3)
smoking <- c("Non-Smoker", "Smoker", "Non-Smoker", "Smoker", "Non-Smoker",
             "Smoker", "Non-Smoker", "Smoker", "Non-Smoker", "Smoker", "Non-Smoker",
             "Smoker", "Non-Smoker", "Smoker", "Non-Smoker")

# Genetic Marker (independent variable 4)
genetic_marker <- c("Present", "Absent", "Present", "Absent", "Present",
                    "Absent", "Present", "Absent", "Present", "Absent", "Present", "Absent",
                    "Present", "Absent", "Present")

# Disease Progression (dependent variable)
disease_progression <- c(3, 5, 8, 10, 6, 4, 5, 7, 12, 15, 2, 5, 7, 9, 13)

# Create a data frame
df <- data.frame(Age = age, Gender = gender, Smoking = smoking,
                 Genetic_Marker = genetic_marker,
                 Disease_Progression = disease_progression)

# Perform multiple linear regression
# Note: in this sample data Gender, Smoking and Genetic_Marker alternate in
# lockstep (every male is a non-smoker with the marker present), so the three
# variables are perfectly collinear and lm() reports NA for two coefficients.
model <- lm(Disease_Progression ~ Age + Gender + Smoking + Genetic_Marker,
            data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the predicted vs. observed values
ggplot(df, aes(x = Disease_Progression, y = fitted(model))) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Observed", y = "Predicted") +
  theme_bw()
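
A point worth checking in this example is whether the categorical predictors carry separate information. In the sample vectors, gender, smoking status and genetic marker alternate in the same pattern, so they are perfectly collinear. The sketch below rebuilds the same vectors with rep() (an equivalent shorthand used here for brevity), cross-tabulates two of the predictors, and counts the coefficients that lm() cannot estimate. The reduced model at the end is only one possible way to proceed with this sample, not a recommendation for real data.

```r
# Same illustrative data as above; rep() reproduces the alternating pattern
age <- c(45, 52, 60, 62, 55, 48, 50, 57, 65, 68, 41, 49, 56, 63, 67)
gender <- rep(c("Male", "Female"), length.out = 15)
smoking <- rep(c("Non-Smoker", "Smoker"), length.out = 15)
genetic_marker <- rep(c("Present", "Absent"), length.out = 15)
disease_progression <- c(3, 5, 8, 10, 6, 4, 5, 7, 12, 15, 2, 5, 7, 9, 13)

df <- data.frame(Age = age, Gender = gender, Smoking = smoking,
                 Genetic_Marker = genetic_marker,
                 Disease_Progression = disease_progression)

# Cross-tabulation: all counts fall in two cells, so Gender determines Smoking
print(table(df$Gender, df$Smoking))

# Full model: coefficients of the redundant predictors come back as NA
full_model <- lm(Disease_Progression ~ Age + Gender + Smoking + Genetic_Marker,
                 data = df)
print(coef(full_model))
n_aliased <- sum(is.na(coef(full_model)))
print(n_aliased)

# Reduced model that keeps only one of the three collinear predictors
reduced_model <- lm(Disease_Progression ~ Age + Gender, data = df)
print(summary(reduced_model))
```

With real data the three variables would normally vary independently, and all four coefficients could be estimated.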

Problem 3:
# Drug Dosage Optimization Example
# Load ggplot2 for the plot at the end
library(ggplot2)

# Body Weight (independent variable 1)
weight <- c(65, 70, 75, 80, 85, 69, 75, 78, 82, 88, 72, 76, 79, 83, 87)

# Age (independent variable 2)
age <- c(45, 52, 60, 62, 55, 48, 50, 57, 65, 68, 41, 49, 56, 63, 67)

# Liver Function (independent variable 3)
liver_function <- c(85, 90, 92, 87, 88, 86, 88, 91, 95, 98, 84, 87, 89,
                    92, 96)

# Drug Dosage (dependent variable)
dosage <- c(150, 160, 170, 180, 190, 155, 165, 168, 175, 185, 158, 162,
            169, 173, 180)

# Create a data frame
df <- data.frame(Weight = weight, Age = age,
                 Liver_Function = liver_function, Dosage = dosage)

# Perform multiple linear regression
model <- lm(Dosage ~ Weight + Age + Liver_Function, data = df)

# Print the regression coefficients and statistical information
summary(model)

# Visualize the predicted vs. observed values
ggplot(df, aes(x = Dosage, y = fitted(model))) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Observed", y = "Predicted") +
  theme_bw()
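
To judge which predictors matter and how precisely their effects are estimated, the coefficient table can be complemented with confidence intervals, the adjusted R-squared and the correlations among the predictors. The following is a minimal sketch on the same illustrative data; the object names (ci, adj_r2, predictor_cor) are chosen here for illustration.

```r
# Same illustrative data as in the example above
weight <- c(65, 70, 75, 80, 85, 69, 75, 78, 82, 88, 72, 76, 79, 83, 87)
age <- c(45, 52, 60, 62, 55, 48, 50, 57, 65, 68, 41, 49, 56, 63, 67)
liver_function <- c(85, 90, 92, 87, 88, 86, 88, 91, 95, 98, 84, 87, 89,
                    92, 96)
dosage <- c(150, 160, 170, 180, 190, 155, 165, 168, 175, 185, 158, 162,
            169, 173, 180)
df <- data.frame(Weight = weight, Age = age,
                 Liver_Function = liver_function, Dosage = dosage)

model <- lm(Dosage ~ Weight + Age + Liver_Function, data = df)

# 95% confidence intervals for the intercept and each slope
ci <- confint(model, level = 0.95)
print(ci)

# Adjusted R-squared penalizes the number of predictors in the model
adj_r2 <- summary(model)$adj.r.squared
print(adj_r2)

# Pairwise correlations among predictors: values near 1 or -1 suggest
# multicollinearity, which inflates the standard errors of the coefficients
predictor_cor <- cor(df[, c("Weight", "Age", "Liver_Function")])
print(round(predictor_cor, 2))
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level in the summary() table.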

7. Logistic regression: Logistic regression is a statistical model used when the dependent
variable is binary; extensions such as multinomial and ordinal logistic regression handle
outcomes with more than two categories. It models the relationship between the independent
variables and the probability of a particular outcome through the log-odds (logit) of that
outcome. The exponentiated coefficients are odds ratios, which represent the multiplicative
change in the odds of the outcome for each one-unit increase in an independent variable,
holding the other variables constant. The model is widely used in medical and biological
research for predicting binary outcomes and assessing the impact of risk factors.
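
Example of R codes (a minimal sketch; the age and disease vectors are hypothetical values made up for illustration, not data from a study). It fits a logistic regression with glm() and converts the coefficients to odds ratios:

```r
# Hypothetical data: age in years and disease status (1 = present, 0 = absent)
age <- c(25, 30, 35, 40, 45, 50, 55, 60, 65, 70,
         28, 33, 38, 43, 48, 53, 58, 63, 68, 73)
disease <- c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1,
             0, 0, 1, 0, 0, 1, 1, 0, 1, 1)
df <- data.frame(Age = age, Disease = disease)

# Fit the logistic regression (binomial family uses the logit link by default)
model <- glm(Disease ~ Age, data = df, family = binomial)
print(summary(model))

# Odds ratios: exponentiate the coefficients (estimated on the log-odds scale)
odds_ratios <- exp(coef(model))
print(odds_ratios)

# Predicted probability of disease for a hypothetical 50-year-old
p_50 <- predict(model, newdata = data.frame(Age = 50), type = "response")
print(p_50)
```

An odds ratio above 1 for Age means the odds of disease increase with each additional year; a value below 1 would mean they decrease.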
8. Survival analysis: Survival analysis is a statistical method used to analyze time-to-event data,
where the event of interest could be death, disease recurrence, or any other event. It assesses
the survival rates over time and examines the impact of different variables on survival.
Kaplan-Meier survival curves are used to estimate the survival probability over time, and the
log-rank test is used to compare survival between groups. Cox proportional hazards regression
is a commonly used model in survival analysis to estimate hazard ratios and assess the effect
of covariates on survival.
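
Example of R codes (a minimal sketch, assuming the survival package that ships with standard R installations; the follow-up times, event indicators and group labels are hypothetical values made up for illustration). It estimates Kaplan-Meier curves, runs a log-rank test and fits a Cox model:

```r
library(survival)

# Hypothetical data: follow-up time in months, event indicator
# (1 = event observed, 0 = censored) and study group
time <- c(10, 15, 20, 24, 28, 30, 32, 36,
          4, 7, 9, 12, 14, 18, 22, 26)
status <- c(1, 0, 1, 1, 0, 1, 0, 0,
            1, 1, 1, 0, 1, 1, 1, 0)
group <- factor(rep(c("Treatment", "Control"), each = 8),
                levels = c("Control", "Treatment"))
df <- data.frame(Time = time, Status = status, Group = group)

# Kaplan-Meier estimate of the survival probability in each group
km_fit <- survfit(Surv(Time, Status) ~ Group, data = df)
print(summary(km_fit))
plot(km_fit, lty = 1:2, xlab = "Time (months)", ylab = "Survival probability")
legend("bottomleft", legend = levels(df$Group), lty = 1:2)

# Log-rank test comparing survival between the two groups
lr_test <- survdiff(Surv(Time, Status) ~ Group, data = df)
print(lr_test)

# Cox proportional hazards model; exp(coef) is the hazard ratio
# of Treatment relative to Control (the reference level)
cox_fit <- coxph(Surv(Time, Status) ~ Group, data = df)
hazard_ratio <- exp(coef(cox_fit))
print(hazard_ratio)
```

A hazard ratio below 1 indicates a lower event rate in the Treatment group than in the Control group at any given time, under the proportional hazards assumption.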
9. Non-parametric tests: Non-parametric tests are used when the data do not meet the
assumptions of parametric tests, such as normal distribution or equal variances. These tests
make fewer assumptions about the underlying distribution and rely on ranks or other
distribution-free methods. The Wilcoxon rank-sum test, which is equivalent to the
Mann-Whitney U test, compares two independent groups (it is often described as a comparison
of medians, which holds strictly only when the two distributions have the same shape), and
the Kruskal-Wallis test extends the comparison to three or more independent groups. These
tests are robust alternatives when the assumptions of parametric tests are not met.
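
Example of R codes (a minimal sketch; the three groups of measurements are hypothetical values made up for illustration). In R, wilcox.test() with two independent samples performs the Wilcoxon rank-sum (Mann-Whitney U) test, and kruskal.test() performs the Kruskal-Wallis test:

```r
# Hypothetical measurements in three independent groups
group_a <- c(12.1, 14.3, 11.8, 15.2, 13.7, 12.9)
group_b <- c(16.4, 15.9, 17.2, 14.8, 18.1, 16.7)
group_c <- c(19.0, 17.5, 20.3, 18.6, 19.8, 21.1)

# Wilcoxon rank-sum (Mann-Whitney U) test: two independent groups
wilcox_res <- wilcox.test(group_a, group_b)
print(wilcox_res)

# Kruskal-Wallis test: three or more independent groups (long format)
df <- data.frame(
  Value = c(group_a, group_b, group_c),
  Group = factor(rep(c("A", "B", "C"), each = 6))
)
kruskal_res <- kruskal.test(Value ~ Group, data = df)
print(kruskal_res)
```

A significant Kruskal-Wallis result only says that at least one group differs; pairwise comparisons with a multiplicity correction would be needed to identify which ones.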
