Assignment 9: How much for that car?
Raymond Guo
2020-04-13
Exercise 1
i. The other continuous variable is mileage.
cars %>%
  gather(Mileage, Liter, key = "category", value = "value") %>%
  ggplot() +
  geom_point(mapping = aes(x = value, y = Price)) +
  facet_wrap(~category, scales = "free_x") +
  labs(title = "Relationship Between Price with Liter and Mileage")
[Faceted scatterplots: Price vs. Liter and Price vs. Mileage, with free x scales.]
Exercise 2
continuous_model <- lm(Price ~ Mileage + Liter, data = cars)
continuous_model %>%
tidy()
term estimate std.error statistic p.value
(Intercept) 9426.6014688 1095.0777745 8.608157 0.0e+00
Mileage -0.1600285 0.0349084 -4.584237 5.3e-06
Liter 4968.2781155 258.8011436 19.197280 0.0e+00
continuous_model %>%
glance() %>%
select(r.squared)
r.squared
0.3291279
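For intuition, r.squared is one minus the ratio of residual variation to total variation. A minimal sketch on simulated data (the data frame and variable names here are illustrative, not the actual cars data):

```r
set.seed(1)
# Simulated data standing in for the cars data
df <- data.frame(x1 = runif(100), x2 = runif(100))
df$y <- 2 * df$x1 + rnorm(100)

fit <- lm(y ~ x1 + x2, data = df)

# R^2 = 1 - SS_residual / SS_total
r2_by_hand <- 1 - sum(residuals(fit)^2) / sum((df$y - mean(df$y))^2)
all.equal(r2_by_hand, summary(fit)$r.squared)  # TRUE
```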
The r.squared value is closer to 0 than to 1, which means this model does a poor job of capturing the variability of Price.
Exercise 3
# predict model plane over sensible grid of values
lit <- unique(cars$Liter)
mil <- unique(cars$Mileage)
grid <- with(cars, expand.grid(lit, mil))
d <- setNames(data.frame(grid), c("Liter", "Mileage"))
vals <- predict(continuous_model, newdata = d)
# form surface matrix and give to plotly
m <- matrix(vals, nrow = length(unique(d$Liter)), ncol = length(unique(d$Mileage)))
p <- plot_ly() %>%
add_markers(
x = ~cars$Mileage,
y = ~cars$Liter,
z = ~cars$Price,
marker = list(size = 1)
) %>%
add_trace(
x = ~mil, y = ~lit, z = ~m, type="surface",
colorscale=list(c(0,1), c("yellow","yellow")),
showscale = FALSE
) %>%
layout(
scene = list(
xaxis = list(title = "mileage"),
yaxis = list(title = "liters"),
zaxis = list(title = "price")
)
)
if (!is_pdf) {p}
The fitted plane is consistent with the data from Exercise 1, but it is hard to check the three regression assumptions from this 3D view alone. The 2D diagnostic plots are much easier to interpret than the 3D surface.
Exercise 4
continuous_df <- cars %>%
add_predictions(continuous_model) %>%
add_residuals(continuous_model)
ggplot(continuous_df) +
geom_point(mapping = aes(x = pred, y = Price)) +
geom_abline(
slope = 1,
intercept = 0,
color = "red",
size = 1
) +
labs(title="Observed vs Predicted of Price",
x = "Predicted Price",
y = "Observed Price")
[Scatterplot of observed vs. predicted Price with the y = x reference line in red.]
This graph shows only a weak linear relationship between the predicted values and the observed response.
ggplot(continuous_df) +
  geom_point(mapping = aes(pred, resid)) +
  geom_ref_line(h = 0) +
  labs(title = "Residual vs Predicted", x = "Predicted", y = "Residual")
[Residual vs. predicted plot for the two-variable model.]
The plot looks a bit odd: most points cluster along the lower edge, while a few points near the top look like outliers. Overall, though, the variability is roughly constant.
ggplot(data = continuous_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid)) +
labs(title="Theoretical Residuals vs Actual Residuals")
[Normal Q-Q plot of the residuals with reference line.]
The residuals clearly do not follow a bell-shaped (normal) curve.
Exercise 5
cars %>%
ggplot() +
geom_boxplot(aes(x = reorder(Make, Price, FUN=median), y = Price)) +
labs(x = "Make of car", title = "Effect of make of car on price")
[Boxplots of Price by Make, ordered by median: Saturn, Chevrolet, Pontiac, Buick, SAAB, Cadillac.]
Based on these box plots, about half of the makes have outliers only on the high side, and those high-side outliers pull the upper end of the price distribution up significantly.
i. Cadillac
ii. Cadillac
iii. Chevrolet
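The outliers flagged in these boxplots follow the usual 1.5 × IQR rule. A small sketch of how the upper fence is computed, using simulated right-skewed prices (illustrative only, not the cars data):

```r
set.seed(1)
# Simulated right-skewed prices standing in for one make's Price column
price <- exp(rnorm(200, mean = 9.8, sd = 0.4))

q <- quantile(price, c(0.25, 0.75))
iqr <- q[2] - q[1]
upper_fence <- q[2] + 1.5 * iqr

# Points above the fence are the boxplot's high-side outliers
sum(price > upper_fence)
```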
Exercise 6
cars %>%
gather(Model:Cylinder, Doors:Leather, key="original_column", value="value") %>%
ggplot() +
geom_boxplot(aes(x = reorder(value, Price, FUN=median), y = Price)) +
facet_wrap(~original_column, scales = "free_x") +
labs(title = "Boxplot of All Categorical Variables") +
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, hjust = 1))
[Faceted boxplots of Price against each categorical variable (Cruise, Cylinder, Doors, Leather, Model, Sound, Trim, Type), with levels ordered by median Price.]
Exercise 7
cars_factor_df <- cars %>%
mutate(Cylinder = as.factor(Cylinder))
mixed_model <- lm(Price ~ Mileage + Liter + Cylinder + Make + Type,
                  data = cars_factor_df)
mixed_model %>%
tidy()
term estimate std.error statistic p.value
(Intercept) 1.885018e+04 892.4119413 21.122738 0.0000000
Mileage -1.861764e-01 0.0106433 -17.492387 0.0000000
Liter 5.697442e+03 342.7322419 16.623596 0.0000000
Cylinder6 -3.312544e+03 619.9683651 -5.343086 0.0000001
Cylinder8 -3.672597e+03 1246.2162662 -2.946998 0.0033032
MakeCadillac 1.450444e+04 517.9855224 28.001635 0.0000000
MakeChevrolet -2.270807e+03 355.9736337 -6.379145 0.0000000
MakePontiac -2.355468e+03 363.9063301 -6.472731 0.0000000
MakeSAAB 9.905074e+03 450.2011112 22.001443 0.0000000
MakeSaturn -2.090266e+03 470.8305609 -4.439529 0.0000103
TypeCoupe -1.163869e+04 464.7055454 -25.045297 0.0000000
TypeHatchback -1.172638e+04 545.3936364 -21.500769 0.0000000
TypeSedan -1.178618e+04 411.1021489 -28.669707 0.0000000
TypeWagon -8.156551e+03 500.6379995 -16.292312 0.0000000
Yes, there is an estimated coefficient for every non-baseline level of each categorical variable.
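These per-level terms come from R's default treatment (dummy) coding: each factor contributes one 0/1 indicator column per non-baseline level, and the first level alphabetically becomes the baseline. A minimal illustration with a made-up factor:

```r
# Treatment coding: the first level ("Buick" here) becomes the baseline,
# so only makeCadillac and makeSAAB get indicator columns
make <- factor(c("Buick", "Cadillac", "SAAB", "Buick"))
model.matrix(~ make)
```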
mixed_model %>%
  glance() %>%
  select(r.squared)
r.squared
0.9389165
Exercise 8
mixed_df <- cars_factor_df %>%
add_predictions(mixed_model) %>%
add_residuals(mixed_model)
ggplot(mixed_df) +
geom_point(mapping = aes(x = pred, y = Price)) +
geom_abline(
slope = 1,
intercept = 0,
color = "red",
size = 1
) +
labs(title="Observed vs Predicted of Price",
x = "Predicted Price",
y = "Observed Price")
[Observed vs. predicted Price for the mixed model, with the y = x reference line in red.]
ggplot(mixed_df) +
  geom_point(mapping = aes(pred, resid)) +
  geom_ref_line(h = 0) +
  labs(title = "Residual vs Predicted", x = "Predicted", y = "Residual")
[Residual vs. predicted plot for the mixed model.]
ggplot(data = mixed_df) +
geom_qq(mapping = aes(sample = resid)) +
geom_qq_line(mapping = aes(sample = resid)) +
labs(title="Theoretical Residuals vs Actual Residuals")
[Normal Q-Q plot of the mixed-model residuals with reference line.]
Exercise 9
i. The r.squared value is significantly closer to 1 than for the two-variable model. The observed vs. predicted graph shows a clear linear relationship, and the variability of points around the reference line is roughly constant. The two-variable model meets these conditions as well, but much more weakly. The Q-Q plot now clearly follows a bell-shaped (normal) pattern, whereas the first model's clearly did not.
ii. The second model is the best: there are three conditions that need to be satisfied for a model to be reliable, and the second model satisfies them far more convincingly than the first.