Q1) Load the built-in R dataset airquality.
Inspect the summary statistics of the dataset
along with the covariance and correlation matrices. (8 marks)
Ans) To load the built-in airquality dataset in R, use the command data("airquality"), which
loads daily air quality measurements from New York City. To inspect the summary statistics of
the dataset, use summary(airquality). This reports, for each variable, the minimum, first
quartile, median, mean, third quartile, and maximum, along with the number of missing values
(NA's) where any exist. It helps identify the distribution of the data and any potential
outliers or missing values. To examine the relationships between the numeric variables,
calculate the covariance matrix using
cov(airquality[, sapply(airquality, is.numeric)], use = "complete.obs"). The
use = "complete.obs" argument is needed because Ozone and Solar.R contain missing values;
without it, every entry involving those columns is returned as NA. Covariance indicates how
two variables vary together: positive values show that they tend to increase or decrease
together, and negative values indicate an inverse relationship. Its magnitude depends on the
units of the variables, so covariances are hard to compare across pairs. Compute the
correlation matrix with
cor(airquality[, sapply(airquality, is.numeric)], use = "complete.obs"), which rescales each
covariance by the standard deviations of the two variables to give a value between -1 and 1.
The correlation matrix reveals the strength and direction of the linear relationships
between numeric variables, helping identify patterns or dependencies within the data.
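The steps in this answer can be put together as a short script. This is a minimal sketch, not a model answer: the object names num_vars, cov_matrix and cor_matrix are chosen here for illustration, and use = "complete.obs" is passed on the assumption that rows with missing Ozone or Solar.R values should simply be dropped.

```r
# Load the built-in dataset (it ships with base R in the datasets package)
data("airquality")

# Minimum, quartiles, median, mean and maximum per column, plus NA counts
print(summary(airquality))

# Keep only the numeric columns (illustrative name: num_vars)
num_vars <- airquality[, sapply(airquality, is.numeric)]

# Ozone and Solar.R contain NAs, so use only rows with no missing values
cov_matrix <- cov(num_vars, use = "complete.obs")
cor_matrix <- cor(num_vars, use = "complete.obs")

# Round for readability when printing
print(round(cov_matrix, 2))
print(round(cor_matrix, 2))
```

Other choices for the use argument exist (for example "pairwise.complete.obs"), which keep more rows per pair of variables but can make the matrices harder to interpret as a whole.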
Q2) Visualize the data to understand the distribution and relationships of air quality
measurements. (4 marks)
Q2a. Load the airquality dataset using data("airquality") and display the first few
rows. (2 marks)
To load the airquality dataset, call data("airquality") in R. This command loads the
dataset, which contains air quality measurements in New York City. To display the first few
rows of the dataset and get an initial understanding of its structure, use
head(airquality), which shows the first six rows by default. This gives a snapshot of the
data, including the first few entries for each variable, helping you understand the
variables present and their initial values. To inspect the summary statistics of the
airquality dataset, use summary(airquality). This function provides essential descriptive
statistics for each variable: the minimum, first quartile, median, mean, third quartile, and
maximum values, plus the number of missing values where any exist. These statistics give an
overview of the distribution, central tendency, and range of the data, helping to identify
patterns, detect outliers, and understand the variability of the air quality measurements
across the dataset.
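A first look at the data along these lines might be sketched as follows. The names first_rows and na_counts are illustrative, and the calls to dim(), str() and colSums(is.na(...)) are additions not asked for in the question; they are included only because they make the structure and the missing values easy to see.

```r
# Load the built-in dataset
data("airquality")

# First six rows (the default for head)
first_rows <- head(airquality)
print(first_rows)

# Number of rows and columns, then the type of each column
print(dim(airquality))
str(airquality)

# Count of missing values per column (illustrative name: na_counts)
na_counts <- colSums(is.na(airquality))
print(na_counts)
```

To see a different number of rows, pass n explicitly, for example head(airquality, n = 10).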
To calculate and display the covariance and correlation matrices for the numeric
columns in the airquality dataset, use cov_matrix <- cov(airquality[,
sapply(airquality, is.numeric)], use = "complete.obs") to compute the covariance matrix
and cor_matrix <- cor(airquality[, sapply(airquality, is.numeric)], use = "complete.obs")
to compute the correlation matrix, then print each object by name. The use = "complete.obs"
argument drops rows with missing values; without it, the entries involving Ozone and
Solar.R are NA. The covariance matrix shows how numeric variables vary together,
indicating the degree to which two variables change in relation to each other, in the
units of those variables. The correlation matrix, on the other hand, reveals the strength
and direction of the linear relationships between the variables on a unit-free scale, with
values ranging from -1 to 1. Both matrices provide insight into the interdependencies and
associations among the numeric variables in the dataset.
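The link between the two matrices can be checked directly: a correlation is a covariance divided by the product of the two standard deviations. The sketch below uses base R's cov2cor() to perform that rescaling and compares the result with cor(). The name num_vars is illustrative, and use = "complete.obs" is assumed so that both matrices are computed on the same rows.

```r
data("airquality")

# Numeric columns only (illustrative name: num_vars)
num_vars <- airquality[, sapply(airquality, is.numeric)]

# Both matrices computed on rows with no missing values
cov_matrix <- cov(num_vars, use = "complete.obs")
cor_matrix <- cor(num_vars, use = "complete.obs")

# Rescale the covariance matrix into a correlation matrix
rescaled <- cov2cor(cov_matrix)

# TRUE if the rescaled covariances match cor() up to floating-point tolerance
print(isTRUE(all.equal(rescaled, cor_matrix)))

# Smallest and largest correlation, both within [-1, 1]
print(range(cor_matrix))
```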
Q2b. Use pairs() to create a scatter plot matrix to visualize relationships between the
numeric variables.