
Table of contents
• 1 Question 9
• 2 Question 10

Fall 2023-2024 IE 451 Homework 2 Solutions
Author

Deniz Şahin

1 Question 9
This exercise involves the Auto data set studied in the lab. Make sure that the
missing values have been removed from the data.

(If you check the data set description, the missing values have already been removed.)

Auto %>% head %>% pander

Table 1: The first six rows of Auto dataset


mpg  cylinders  displacement  horsepower  weight  acceleration  year  origin  name
18   8          307           130         3504    12            70    1       chevrolet chevelle malibu
15   8          350           165         3693    11.5          70    1       buick skylark 320
18   8          318           150         3436    11            70    1       plymouth satellite
16   8          304           150         3433    12            70    1       amc rebel sst
17   8          302           140         3449    10.5          70    1       ford torino
15   8          429           198         4341    10            70    1       ford galaxie 500

Auto %>% summary()

mpg cylinders displacement horsepower weight


Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140


acceleration year origin name


Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
(Other) :365

#?Auto

(a) Which of the predictors are quantitative, and which are qualitative?

If we check the Auto data set using ?Auto, we will see that mpg, cylinders,
displacement, horsepower, weight, acceleration and year are the quantitative
predictors. On the other hand, origin and name are the qualitative predictors.
Observe that even though origin takes integer values, each number codes the car's
origin, so it is a categorical predictor. We can convert it to a factor:
d1 <- as_tibble(Auto) %>%
mutate(origin=factor(origin))

(b) What is the range of each quantitative predictor? You can answer this
using the range() function.
d1 %>%
  summarise(range(mpg), range(cylinders), range(displacement), range(horsepower),
            range(weight), range(acceleration), range(year)) %>%
  pander()

Table 2: Ranges of variables


range(mpg)  range(cylinders)  range(displacement)  range(horsepower)  range(weight)  range(acceleration)  range(year)
9           3                 68                    46                 1613           8                     70
46.6        8                 455                   230                5140           24.8                  82

or
d1 %>%
  # calculate all relevant statistics for every numerical variable
  summarize(across(where(is.numeric), list(min = min, max = max))) %>%
  # collect variable names, statistic names, and values in three columns
  pivot_longer(cols = everything(), names_to = c("variable", "stat"),
               names_pattern = "(.*)_(.*)") %>%
  # place stat names across the columns
  pivot_wider(names_from = stat, values_from = value) %>%
  pander()

# Alternatively
d1 %>%
  # calculate all relevant statistics for every numerical variable
  summarize(across(where(is.numeric), list(min = min, max = max))) %>%
  # collect variable names, statistic names, and values in three columns
  pivot_longer(cols = everything(), names_to = c("variable", "stat"),
               names_pattern = "(.*)_(.*)") %>%
  # place variable names across the columns
  pivot_wider(names_from = variable, values_from = value) %>%
  pander()

Table 3: If the number of variables is large, we can apply the same function to all
variables at once


(a) Statistics are across the columns


variable min max
mpg 9 46.6
cylinders 3 8
displacement 68 455
horsepower 46 230
weight 1613 5140
acceleration 8 24.8
year 70 82
(b) Variables are across the columns
stat mpg cylinders displacement horsepower weight acceleration year
min 9 3 68 46 1613 8 70
max 46.6 8 455 230 5140 24.8 82

Let us extend the usage to several other statistics: calculate min, mean, median,
sd, and max for every numerical variable.
d1 %>%
  # calculate all relevant statistics for every numerical variable
  summarize(across(where(is.numeric),
                   list(min = min, mean = mean, median = median, sd = sd, max = max))) %>%
  # do not let more than two decimals appear in the table
  mutate(across(everything(), ~ round(.x, digits = 2))) %>%
  # collect variable names, statistic names, and values in three columns
  pivot_longer(cols = everything(), names_to = c("variable", "stat"),
               names_pattern = "(.*)_(.*)") %>%
  # place variable names across the columns
  pivot_wider(names_from = variable, values_from = value) %>%
  pander()

stat mpg cylinders displacement horsepower weight acceleration year


min 9 3 68 46 1613 8 70
mean 23.45 5.47 194.4 104.5 2978 15.54 75.98
median 22.75 4 151 93.5 2804 15.5 76
sd 7.81 1.71 104.6 38.49 849.4 2.76 3.68
max 46.6 8 455 230 5140 24.8 82

(c) What is the mean and standard deviation of each quantitative predictor?
d1 %>%
summarise(across(where(is.numeric),list(mean=mean,sd=sd))) %>%
pivot_longer(cols = everything(), names_to = c("variable","stat"), names_sep = "_") %>%
pivot_wider(names_from = variable, values_from = value) %>%
pander()

stat mpg cylinders displacement horsepower weight acceleration year


mean 23.45 5.472 194.4 104.5 2978 15.54 75.98
sd 7.805 1.706 104.6 38.49 849.4 2.759 3.684

## Alternatively

d1 %>%
  summarise(mean(mpg), mean(cylinders), mean(displacement), mean(horsepower),
            mean(weight), mean(acceleration), mean(year))


# A tibble: 1 x 7
`mean(mpg)` `mean(cylinders)` `mean(displace~` `mean(horsepow~` `mean(weight)`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23.4 5.47 194. 104. 2978.
# ... with 2 more variables: `mean(acceleration)` <dbl>, `mean(year)` <dbl>

d1 %>%
  summarise(sd(mpg), sd(cylinders), sd(displacement), sd(horsepower),
            sd(weight), sd(acceleration), sd(year))

# A tibble: 1 x 7
`sd(mpg)` `sd(cylinders)` `sd(displacement)` `sd(horsepower)` `sd(weight)`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 7.81 1.71 105. 38.5 849.
# ... with 2 more variables: `sd(acceleration)` <dbl>, `sd(year)` <dbl>

(d) Now remove the 10th through 85th observations. What is the range,
mean, and standard deviation of each predictor in the subset of the data
that remains?
d1_new <- d1 %>%
filter(!row_number() %in% 10:85)

d1_new %>%
summarise(across(where(is.numeric), list(min=min, max=max, mean=mean, sd=sd))) %>%
pivot_longer(cols = everything(), names_to = c("variable","stat"), names_sep = "_") %>%
pivot_wider(names_from=stat, values_from = value) %>%
pander()

variable min max mean sd


mpg 11 46.6 24.4 7.867
cylinders 3 8 5.373 1.654
displacement 68 455 187.2 99.68
horsepower 46 230 100.7 35.71
weight 1649 4997 2936 811.3
acceleration 8.5 24.8 15.73 2.694
year 70 82 77.15 3.106

(e) Using the full data set, investigate the predictors graphically, using
scatterplots or other tools of your choice. Create some plots highlighting
the relationships among the predictors. Comment on your findings.
my_fn <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(formula = y ~ x,method=method,se=F, ...)
p
}

select(d1,c(1:7)) %>%
ggpairs(progress = FALSE,
upper = list(continuous = my_fn),
lower = list(continuous = "cor"))

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  pseudoinverse used at 6

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  neighborhood radius 2

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  reciprocal condition number 9.6355e-017


Observe that the correlations between displacement, horsepower and weight are
quite high. Moreover, their pairwise relationships appear to be linear.

As displacement, horsepower and weight increase, mpg decreases.

mpg, displacement and horsepower have right-skewed distributions.
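
To make the first observation concrete, the pairwise correlations can be computed directly; a quick sketch:

d1 %>%
  select(displacement, horsepower, weight) %>%  # the three strongly related predictors
  cor() %>%                                     # pairwise correlation matrix
  round(2)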

• How does mpg change with respect to weight?


d1 %>%
ggplot(aes(x=weight,y=mpg))+
geom_point(aes(color=origin))


• Does the number of cylinders have an effect on mpg?


d1 %>%
mutate(cylinders=factor(cylinders)) %>%
ggplot(aes(cylinders,mpg))+
geom_boxplot()


• Does acceleration decrease as weight increases?


d1 %>%
ggplot(aes(weight,acceleration))+
geom_point(aes(color=origin))+
geom_smooth(se=FALSE)

`geom_smooth()` using method = 'loess' and formula 'y ~ x'


• Is mpg lower for older cars?


d1 %>% ggplot(aes(year,mpg))+
geom_point(aes(color=origin))+
geom_smooth(se=FALSE,color="black")

`geom_smooth()` using method = 'loess' and formula 'y ~ x'


• How does the proportion of the origins change over the years?


d1 %>%
ggplot()+
geom_bar(aes(x=year,fill=origin))


(f) Suppose that we wish to predict gas mileage (mpg) on the basis of the
other variables. Do your plots suggest that any of the other variables might
be useful in predicting mpg? Justify your answer.

Almost every variable appears to be related to mpg, as seen from the graphs. We can
check this with linear models. However, be careful: the correlations between some of
these predictors are too high for them to be used together in the same model. A
variance-inflation-factor check of this is sketched right below; after that, we check
the predictors one at a time.
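
A quick multicollinearity sketch, assuming the car package is available (it is not used elsewhere in this solution):

# VIFs for a model containing all five quantitative predictors; values well
# above 5-10 flag predictors that are too collinear to be used together.
car::vif(lm(mpg ~ displacement + horsepower + weight + acceleration + I(year - 70), data = d1))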

summary(lm(mpg~displacement,d1))

Call:
lm(formula = mpg ~ displacement, data = d1)

Residuals:
Min 1Q Median 3Q Max
-12.9170 -3.0243 -0.5021 2.3512 18.6128

Coefficients:


Estimate Std. Error t value Pr(>|t|)


(Intercept) 35.12064 0.49443 71.03 <2e-16 ***
displacement -0.06005 0.00224 -26.81 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.635 on 390 degrees of freedom


Multiple R-squared: 0.6482, Adjusted R-squared: 0.6473
F-statistic: 718.7 on 1 and 390 DF, p-value: < 2.2e-16

summary(lm(mpg~horsepower,d1))

Call:
lm(formula = mpg ~ horsepower, data = d1)

Residuals:
Min 1Q Median 3Q Max
-13.5710 -3.2592 -0.3435 2.7630 16.9240

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861 0.717499 55.66 <2e-16 ***
horsepower -0.157845 0.006446 -24.49 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom


Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16

summary(lm(mpg~weight,d1))

Call:
lm(formula = mpg ~ weight, data = d1)

Residuals:
Min 1Q Median 3Q Max
-11.9736 -2.7556 -0.3358 2.1379 16.5194

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 46.216524 0.798673 57.87 <2e-16 ***
weight -0.007647 0.000258 -29.64 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.333 on 390 degrees of freedom


Multiple R-squared: 0.6926, Adjusted R-squared: 0.6918
F-statistic: 878.8 on 1 and 390 DF, p-value: < 2.2e-16

summary(lm(mpg~acceleration,d1))

Call:
lm(formula = mpg ~ acceleration, data = d1)

Residuals:
Min 1Q Median 3Q Max
-17.989 -5.616 -1.199 4.801 23.239

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8332 2.0485 2.359 0.0188 *
acceleration 1.1976 0.1298 9.228 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 7.08 on 390 degrees of freedom


Multiple R-squared: 0.1792, Adjusted R-squared: 0.1771
F-statistic: 85.15 on 1 and 390 DF, p-value: < 2.2e-16

summary(lm(mpg~I(year-70),d1))

Call:
lm(formula = mpg ~ I(year - 70), data = d1)

Residuals:
Min 1Q Median 3Q Max
-12.0212 -5.4411 -0.4412 4.9739 18.2088

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.09081 0.61331 26.24 <2e-16 ***
I(year - 70) 1.23004 0.08736 14.08 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.363 on 390 degrees of freedom


Multiple R-squared: 0.337, Adjusted R-squared: 0.3353
F-statistic: 198.3 on 1 and 390 DF, p-value: < 2.2e-16

summary(lm(mpg~displacement+horsepower+weight+acceleration+I(year-70),d1))

Call:
lm(formula = mpg ~ displacement + horsepower + weight + acceleration +
I(year - 70), data = d1)

Residuals:
Min 1Q Median 3Q Max
-8.5211 -2.3920 -0.1036 2.0312 14.2874

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.3527596 2.0617039 18.117 <2e-16 ***
displacement 0.0027817 0.0054617 0.509 0.611
horsepower 0.0010201 0.0137631 0.074 0.941
weight -0.0068738 0.0006653 -10.333 <2e-16 ***
acceleration 0.0903236 0.1019070 0.886 0.376
I(year - 70) 0.7541153 0.0526118 14.334 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.435 on 386 degrees of freedom


Multiple R-squared: 0.8088, Adjusted R-squared: 0.8063
F-statistic: 326.5 on 5 and 386 DF, p-value: < 2.2e-16

Observe that even though the separate linear models suggest that each of these
variables is significant on its own, when we put them all together in one model,
only weight and year remain significant. As stated earlier, this may be caused by
the high correlations among the predictors. Let's fit a final model:

summary(lm(mpg~weight+I(year-70),d1))

Call:
lm(formula = mpg ~ weight + I(year - 70), data = d1)

Residuals:
Min 1Q Median 3Q Max
-8.8505 -2.3014 -0.1167 2.0367 14.3555


Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.6650267 0.8015347 48.24 <2e-16 ***
weight -0.0066321 0.0002146 -30.91 <2e-16 ***
I(year - 70) 0.7573183 0.0494727 15.31 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.427 on 389 degrees of freedom


Multiple R-squared: 0.8082, Adjusted R-squared: 0.8072
F-statistic: 819.5 on 2 and 389 DF, p-value: < 2.2e-16

This model seems adequate: with only weight and year it achieves essentially the same R-squared (0.8082) as the five-predictor model (0.8088), and both terms are significant.
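
As a quick sanity check of this final model, the standard residual diagnostics can be inspected with base R; a minimal sketch:

final_model <- lm(mpg ~ weight + I(year - 70), data = d1)  # refit the chosen model
par(mfrow = c(2, 2))                                       # 2 x 2 grid of diagnostic plots
plot(final_model)  # residuals vs fitted, Q-Q, scale-location, residuals vs leverage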

2 Question 10
This exercise involves the Boston housing data set.

(a) How many rows are in this data set? How many columns? What do the
rows and columns represent?

head(Boston)

crim zn indus chas nox rm age dis rad tax ptratio lstat medv
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 5.21 28.7

#?Boston

d2 <- as_tibble(Boston) %>% mutate(chas=factor(chas))

ncol(d2) #number of columns

[1] 13

nrow(d2) #number of rows

[1] 506

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.


tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

(b) Make some pairwise scatterplots of the predictors (columns) in this data set.
Describe your findings.
my_fn <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(formula = y ~ x,method=method,se=F, ...)
p
}

select(d2,-chas) %>%
ggpairs(progress = FALSE,
upper = list(continuous = my_fn),
lower = list(continuous = "cor"))

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  pseudoinverse used at -0.5

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  neighborhood radius 13

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  reciprocal condition number 2.9038e-031

Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
  There are other near singularities as well. 156.25


As the distance from employment areas increases, the nitrogen oxides concentration
and the crime rate decrease.


As the median value of owner-occupied homes increases, the percentage of
lower-status population decreases.

(c) Are any of the predictors associated with per capita crime rate? If so, explain
the relationship.

As the distance to the employment centres increases, the crime rate decreases.

As the median value of owner-occupied homes increases, the crime rate decreases.

Finally, as the percentage of lower-status population increases, the crime rate
also increases.
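
These claims can be backed up numerically; a quick sketch that sorts the correlation of every numeric predictor with crim:

crim_cor <- d2 %>%
  select(where(is.numeric)) %>%  # drop the factor chas
  cor()
sort(crim_cor[, "crim"], decreasing = TRUE)  # most positively correlated predictors first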

(d) Do any of the census tracts of Boston appear to have particularly high crime
rates? Tax rates? Pupil-teacher ratios? Comment on the range of each predictor.
par(mfrow=c(1,3))
boxplot(d2$crim, xlab="crim")
boxplot(d2$tax, xlab="tax")
boxplot(d2$ptratio, xlab="ptratio")


For crim: This variable has many outliers at the upper end. The data range from
about 0 to 89, while the outliers range from roughly 10 to 89; many of them lie
between 10 and 30, and a few extreme values exceed 70. This variable is strongly
right-skewed.

For tax: This variable has no extreme points. The data range from 187 to 711, and
the median is around 330. This variable is also right-skewed.

For ptratio: This variable has two outliers at the lower end. The data range from
12.6 to 22, and the median is around 19.
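
To put rough numbers on "particularly high", a small sketch that counts tracts above illustrative thresholds (the cut-offs 25, 600 and 21 are chosen here only for illustration):

d2 %>%
  summarise(n_crim_above_25    = sum(crim > 25),      # very high crime rate
            n_tax_above_600    = sum(tax > 600),      # very high property-tax rate
            n_ptratio_above_21 = sum(ptratio > 21))   # very high pupil-teacher ratio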

(e) How many of the census tracts in this data set bound the Charles river?

sum(d2$chas == 1)

[1] 35

(f) What is the median pupil-teacher ratio among the towns in this data set?


median(d2$ptratio)

[1] 19.05

(g) Which census tract of Boston has the lowest median value of owner-occupied
homes? What are the values of the other predictors for that census tract, and how
do those values compare to the overall ranges for those predictors? Comment on
your findings.
lwst <- d2 %>% filter(medv==min(medv))
lwst_ind <- which(d2$medv == min(d2$medv)) ##which index?
lwst %>% pander()

Census Tracts with the Lowest Median Value of Owner-occupied Homes


crim zn indus chas nox rm age dis rad tax ptratio lstat medv
38.35 0 18.1 0 0.693 5.453 100 1.49 24 666 20.2 30.59 5
67.92 0 18.1 0 0.693 5.683 100 1.425 24 666 20.2 22.98 5

d2 %>%
summarise(across(where(is.numeric), quantile)) %>%
pander()

Quantiles of variables
quantile  crim     zn    indus  nox    rm     age    dis    rad  tax  ptratio  lstat  medv
0%        0.00632  0     0.46   0.385  3.561  2.9    1.13   1    187  12.6     1.73   5
25%       0.08204  0     5.19   0.449  5.886  45.02  2.1    4    279  17.4     6.95   17.02
50%       0.2565   0     9.69   0.538  6.208  77.5   3.207  5    330  19.05    11.36  21.2
75%       3.677    12.5  18.1   0.624  6.623  94.07  5.188  24   666  20.2     16.96  25
100%      88.98    100   27.74  0.871  8.78   100    12.13  24   711  22       37.97  50

As seen above, two census tracts share the lowest median value of owner-occupied
homes (medv = 5).

If we check the quantiles:

In these tracts, the crime rate is higher than in 75% of all tracts.

These tracts are among the 25% of tracts closest to the employment centres.

The percentage of lower-status population in these tracts is higher than in 75% of
all tracts.

Neither of these tracts bounds the Charles River.

A direct percentile check is sketched below.
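
A quick sketch of that percentile check (numeric_cols and lowest are helper names introduced only for this illustration):

numeric_cols <- d2 %>% select(where(is.numeric))  # drop the factor chas
lowest <- numeric_cols %>% filter(d2$medv == min(d2$medv)) %>% slice(1)
# empirical percentile of each predictor value of this tract within the full data
sapply(names(numeric_cols), function(v) ecdf(numeric_cols[[v]])(lowest[[v]]))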

(h) In this data set, how many of the census tracts average more than seven rooms
per dwelling? More than eight rooms per dwelling? Comment on the census tracts
that average more than eight rooms per dwelling.

sum(d2$rm > 7) ## number of census tracts that average more than seven rooms per dwelling

[1] 64

sum(d2$rm > 8) ## number of census tracts that average more than eight rooms per dwelling


[1] 13

par(mfrow=c(2,3))
d2_upd <- d2 %>% filter(rm > 8)
boxplot(d2$crim,d2_upd$crim,xlab="crim", names=c("std","more_than_8"))
boxplot(d2$nox,d2_upd$nox,xlab="nox", names=c("std","more_than_8"))
boxplot(d2$dis,d2_upd$dis,xlab="dis", names=c("std","more_than_8"))
boxplot(d2$tax,d2_upd$tax,xlab="tax", names=c("std","more_than_8"))
boxplot(d2$lstat,d2_upd$lstat,xlab="lstat", names=c("std","more_than_8"))
boxplot(d2$medv,d2_upd$medv,xlab="medv", names=c("std","more_than_8"))

crim: As seen in the boxplot, crime rates in these census tracts are quite low
compared with the whole data set; no tract has a crime rate above 10.

nox: The median value of this variable is lower than in the whole data set.

dis: The median is similar to that of the whole data set; however, the largest dis
value is lower than in the original data.


tax: Tax is lower than in the original data, with the exception of two outliers;
the median values are similar.

lstat: The percentage of lower-status population is quite low in these census
tracts, with no extreme points, and the median value is lower than in the original
data.

medv: This variable is higher in this subset; the extreme values of the original
data set probably belong to these tracts.
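
A numerical counterpart to these boxplots; a quick sketch comparing the mean of each predictor in the rm > 8 tracts with the mean over all tracts:

bind_rows(
  d2 %>% summarise(across(where(is.numeric), mean)) %>% mutate(group = "all tracts"),
  d2 %>% filter(rm > 8) %>% summarise(across(where(is.numeric), mean)) %>% mutate(group = "rm > 8")
) %>%
  relocate(group) %>%  # put the group label in the first column
  pander()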

