Statistics Introduction
Statistics is the science of analyzing, reviewing and conclude data.
Some basic statistical numbers include:
Mean, median and mode
Minimum and maximum value
Percentiles
Variance and Standard Deviation
Covariance and Correlation
Probability distributions
The R language was developed by two statisticians. It has many built-in
functionalities, in addition to libraries for the exact purpose of statistical analysis.
R Data Set
Data Set
A data set is a collection of data, often presented in a table.
There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road
Tests), which is retrieved from the 1974 Motor Trend US Magazine.
In the examples below (and for the next chapters), we will use the mtcars data set,
for statistical purposes:
Example
# Print the mtcars data set
mtcars
Result:
mpg cyl disp hp drat wt qsec vs am gear
carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4
4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4
4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4
1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3
1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3
2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3
1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3
4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4
2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4
2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4
4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4
4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3
3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3
3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3
3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3
4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3
4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3
4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4
1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4
2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4
1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3
1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3
2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3
2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3
4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3
2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4
1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5
2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5
2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5
4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5
6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5
8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4
2
Try it Yourself »
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data set:
Example
# Use the question mark to get information about the data set
?mtcars
Result:
mtcars {datasets} R Documentation
Motor Trend Car Road Tests
Description
The data was extracted from the 1974 Motor Trend US magazine, and comprises
fuel consumption and 10 aspects of automobile design and performance for 32
automobiles (1973-74 models).
Usage
mtcars
Format
A data frame with 32 observations on 11 (numeric) variables.
[, 1] mpg Miles/(US) gallon
[, 2] cyl Number of cylinders
[, 3] disp Displacement (cu.in.)
[, 4] hp Gross horsepower
[, 5] drat Rear axle ratio
[, 6] wt Weight (1000 lbs)
[, 7] qsec 1/4 mile time
[, 8] vs Engine (0 = V-shaped, 1 = straight)
[, 9] am Transmission (0 = automatic, 1 = manual)
[,10] gear Number of forward gears
[,11] carb Number of carburetors
Note
Henderson and Velleman (1981) comment in a footnote to Table 1: 'Hocking
[original transcriber]'s noncrucial coding of the Mazda's rotary engine as a straight
six-cylinder engine and the Porsche's flat engine as a V engine, as well as the
inclusion of the diesel Mercedes 240D, have been retained to enable direct
comparisons to be made with previous analyses.'
Source
Henderson and Velleman (1981), Building multiple regression models
interactively. Biometrics, 37, 391-411.
Examples
require(graphics)
pairs(mtcars, main = "mtcars data", gap = 1/4)
coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
panel = panel.smooth, rows = 1)
## possibly more meaningful, e.g., for summary() or bivariate plots:
mtcars2 <- within(mtcars, {
vs <- factor(vs, labels = c("V", "S"))
am <- factor(am, labels = c("automatic", "manual"))
cyl <- ordered(cyl)
gear <- ordered(gear)
carb <- ordered(carb)
})
summary(mtcars2)
Try it Yourself »
ADVERTISEMENT
Get Information
Use the dim() function to find the dimensions of the data set, and
the names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for better
organization
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Result:
[1] 32 11
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am"
"gear"
[11] "carb"
Try it Yourself »
Use the rownames() function to get the name of each row in the first column, which is
the name of each car:
Example
Data_Cars <- mtcars
rownames(Data_Cars)
Result:
[1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
[4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
[7] "Duster 360" "Merc 240D" "Merc 230"
[10] "Merc 280" "Merc 280C" "Merc 450SE"
[13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
[16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
[19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
[22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
[25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
[28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
[31] "Maserati Bora" "Volvo 142E"
Try it Yourself »
From the examples above, we have found out that the data set
has 32 observations (Mazda RX4, Mazda RX4 Wag, Datsun 710, etc)
and 11 variables (mpg, cyl, disp, etc).
A variable is defined as something that can be measured or counted.
Here is a brief explanation of the variables from the mtcars data set:
Variable Name Description
mpg Miles/(US) Gallon
cyl Number of cylinders
disp Displacement
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs Engine (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors
Print Variable Values
If you want to print all values that belong to a variable, access the data frame by
using the $ sign, and the name of the variable (for example cyl (cylinders)):
Example
Data_Cars <- mtcars
Data_Cars$cyl
Result:
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
Try it Yourself »
Sort Variable Values
To sort the values, use the sort() function:
Example
Data_Cars <- mtcars
sort(Data_Cars$cyl)
Result:
[1] 4 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8
Try it Yourself »
From the examples above, we see that most cars have 4 and 8 cylinders.
Analyzing the Data
Now that we have some information about the data set, we can start to analyze it
with some statistical numbers.
For example, we can use the summary() function to get a statistical summary of the
data:
Example
Data_Cars <- mtcars
summary(Data_Cars)
The summary() function returns six statistical numbers for each variable:
Min
First quantile (percentile)
Median
Mean
Third quantile (percentile)
Max
OutPut:
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data set:
Result: