Unit 3 – Data science
LECTURE BY
DR.A.SHANTHINI
ASSOCIATE PROFESSOR
DEPT OF DSBS
SRM IST
Introduction to R
The annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of
a comma-separated-value (CSV) file. The read.csv() function is used to import the CSV file.
This dataset is stored to the R variable sales using the assignment operator <-
# import a CSV file of the total annual sales for each customer (read the yearly_sales dataset into sales)
sales <- read.csv("c:/data/yearly_sales.csv")
# examine the imported dataset (head: first 6 rows, tail: last 6 rows, summary: summary statistics per column)
head(sales)
tail(sales)
summary(sales)
# plot num_of_orders vs. sales (scatter plot of 2 attributes from the dataset, then a histogram of sales_total)
plot(sales$num_of_orders, sales$sales_total, main="Number of Orders vs. Sales")
hist(sales$sales_total)
R Graphical User Interfaces
[Figure: histogram plot and scatter plot]
Data Import and Export
# the dataset was imported into R using read.csv()
> sales <- read.csv("c:/data/yearly_sales.csv")
# the setwd() function can be used to set the working directory for the subsequent
# import and export operations
> setwd("c:/data/")
> sales <- read.csv("yearly_sales.csv")
Creating a column for average sales per order and writing it to the sales table
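The code for this slide is not included in the text. A minimal sketch of the idea, assuming the sales data frame has the sales_total and num_of_orders columns used earlier. The toy values, the per_order column name, and the output file name are illustrative assumptions:

```r
# toy stand-in for the imported sales data (values are made up for illustration)
sales <- data.frame(cust_id = c(1, 2, 3),
                    sales_total = c(800.64, 217.53, 74.58),
                    num_of_orders = c(3, 3, 2))

# add a column holding the average sales per order
sales$per_order <- sales$sales_total / sales$num_of_orders

# export the modified table as tab-delimited text (temporary file, hypothetical name)
out_file <- file.path(tempdir(), "sales_modified.txt")
write.table(sales, out_file, sep = "\t", row.names = FALSE)
```

Reading out_file back with read.delim() should return the same four columns.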
jpeg() function
► Using the jpeg() function, the following R code creates a new JPEG file, adds a histogram
plot to the file, and then closes the file. The png(), bmp(), pdf(), and postscript() functions
can be used in the same way to save the image in other formats.
> jpeg(file="c:/Users/SHANTHINI A/Documents/sales_hist.jpeg")
> hist(sales$num_of_orders)
> dev.off()
► The sales_hist.jpeg file shows the histogram.
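The same open, plot, close pattern applies to the other graphics devices. A small sketch using pdf(), with made-up order counts and a temporary file path chosen only for illustration:

```r
# write the plot to a temporary PDF file (path and data are illustrative)
plot_file <- file.path(tempdir(), "orders_hist.pdf")
orders <- c(1, 2, 2, 3, 3, 3, 4, 4, 5, 8)   # made-up order counts

pdf(file = plot_file)                    # open the file device
hist(orders, main = "Number of Orders")  # draw into the file
dev.off()                                # close the device so the file is written

file.exists(plot_file)
```

Until dev.off() is called, the file may be incomplete, so the closing call should not be skipped.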
Attribute and Data Types
R Lab exercise
► Vectors
► Factors
► Array and matrices
► List
► Data frames
Vectors:
Creating Vectors
The c() function can be used to create vectors of objects by concatenating things together.
> x <- c(0.5, 0.6) ## numeric
> x <- c(TRUE, FALSE) ## logical
> x <- c(T, F) ## logical
> x <- c("a", "b", "c") ## character
> x <- 9:29 ## integer
> x <- c(1+0i, 2+4i) ## complex
You can also use the vector() function to initialize vectors.
>x <- vector("numeric", length = 10)
>x
[1] 0 0 0 0 0 0 0 0 0 0
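The type noted in each comment above can be checked with class(). A short sketch:

```r
class(c(0.5, 0.6))        # "numeric"
class(c(TRUE, FALSE))     # "logical"
class(c("a", "b", "c"))   # "character"
class(9:29)               # "integer"
class(c(1+0i, 2+4i))      # "complex"

x <- vector("numeric", length = 10)
length(x)                 # 10
```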
Factors:
Factors are used to represent categorical data and can be unordered or ordered. One can think of a factor as an integer
vector where each integer has a label. Factors are important in statistical modeling and are treated specially by modelling
functions like lm() and glm().
> x <- factor(c("yes", "yes", "no", "yes", "no"))
>x
[1] yes yes no yes no
Levels: no yes
> table(x)
x
 no yes
  2   3
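By default the levels are sorted alphabetically, which is why "no" comes first above. A sketch of setting the level order explicitly and of an ordered factor; the sizes example is an illustrative assumption:

```r
# set the level order explicitly with the levels argument
x <- factor(c("yes", "yes", "no", "yes", "no"),
            levels = c("yes", "no"))
levels(x)        # "yes" "no"
as.integer(x)    # underlying integer codes: 1 1 2 1 2

# an ordered factor supports comparisons between levels
sizes <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
sizes[1] < sizes[2]   # TRUE, because low < high
```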
Arrays and Matrix
Arrays
> A = array(1:10)
> A1 = array(1:8, c(2,4))
> A1
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> is.array(A1)
> as.array(A1)
Matrix
Matrices are vectors with a dimension attribute. The dimension attribute is itself an
integer vector of length 2 (number of rows, number of columns).
> m <- matrix(nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA
> dim(m)
[1] 2 3
> attributes(m)
$dim
[1] 2 3
► Matrices are constructed column-wise, so entries can be thought of as starting in the
"upper left" corner and running down the columns.
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
cbind() and rbind()
Matrices can be created by column-binding or row-binding with the cbind() and rbind()
functions.
> x <- 1:3
> y <- 10:12
> cbind(x, y)
     x  y
[1,] 1 10
[2,] 2 11
[3,] 3 12
> rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y   10   11   12
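To make the column-wise fill concrete, the sketch below contrasts it with the byrow = TRUE option of matrix() and builds the same matrix by attaching a dim attribute to a plain vector. The variable names are illustrative:

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: fill down the columns
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill across the rows instead

m_col[1, ]   # 1 3 5
m_row[1, ]   # 1 2 3

# a matrix is a vector plus a dim attribute
v <- 1:6
dim(v) <- c(2, 3)
all(v == m_col)   # TRUE
```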
List
Lists are a special type of vector that can contain elements of different classes. Lists are a very important data type in R and you should get to know them
well. Lists, in combination with the various “apply” functions discussed later, make for a powerful combination.
Lists can be explicitly created using the list() function, which takes an arbitrary number of
arguments.
> x <- list(1, "a", TRUE, 1 + 4i)
>x
[[1]]
[1] 1
[[2]]
[1] "a"
[[3]]
[1] TRUE
[[4]]
[1] 1+4i
We can also create an empty list of a prespecified length with the vector() function
> x <- vector("list", length = 5)
>x
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
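A brief sketch of working with list elements: sapply() reports the class of each element, and named elements can be extracted with [[ ]] or $. The person list and its values are illustrative assumptions:

```r
x <- list(1, "a", TRUE, 1 + 4i)
sapply(x, class)   # "numeric" "character" "logical" "complex"

# elements can be named and then extracted by name
person <- list(name = "Adam", married_year = 2016)   # illustrative values
person[["name"]]        # "Adam"
person$married_year     # 2016
```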
Data frames
Data frames are used to store tabular data in R. They are an important type of object in R and are used in a
variety of statistical modeling applications.
Binding in data frames:
> df1 = data.frame(name = c("Rahul","joe","Adam","Brendon"), married_year = c(2016,2015,2016,2008))
> df2 = data.frame(Birth_place = c("Delhi","Seattle","London","Moscow"), Birth_year = c(1988,1990,1989,1984))
> df1
name married_year
1 Rahul 2016
2 joe 2015
3 Adam 2016
4 Brendon 2008
> df2
Birth_place Birth_year
1 Delhi 1988
2 Seattle 1990
3 London 1989
4 Moscow 1984
> cbinded_df = cbind(df1,df2)
> cbinded_df
name married_year Birth_place Birth_year
1 Rahul 2016 Delhi 1988
2 joe 2015 Seattle 1990
3 Adam 2016 London 1989
4 Brendon 2008 Moscow 1984
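# Similarly for row binding: rbind() requires both data frames to have the same column names,
# so df1 is bound with df3 (an illustrative data frame with df1's columns) rather than with df2
> df3 = data.frame(name = c("Carlos"), married_year = c(2012))
> Rbinded_df = rbind(df1, df3)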
Contingency Tables
► Contingency tables condense a large number of observations into a smaller table of
counts, which makes the data easier to summarize and maintain.
► The following R code builds a contingency table based on the sales$gender and
sales$spender factors
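The code for this slide is not reproduced in the text. A minimal sketch of building such a table with table(), assuming a toy sales data frame. The values, the cut points, and the spender labels are illustrative assumptions:

```r
# toy stand-in for the sales data (made-up values)
sales <- data.frame(
  gender = factor(c("F", "M", "F", "M", "F", "M")),
  sales_total = c(50, 120, 400, 80, 250, 30)
)

# label each customer by spending level; the cut points are arbitrary assumptions
sales$spender <- cut(sales$sales_total,
                     breaks = c(0, 100, 300, Inf),
                     labels = c("small", "medium", "big"))

# cross-tabulate the two factors: rows are gender, columns are spender
sales_table <- table(sales$gender, sales$spender)
sales_table
```

Each cell counts the customers that fall into one gender and spender combination.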
Descriptive Statistics
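The slide content for this heading is not included in the text. As a hedged sketch, these are standard base R functions for descriptive statistics, applied to made-up sales totals:

```r
sales_total <- c(800.64, 217.53, 74.58, 498.60, 723.11)   # made-up values

summary(sales_total)   # minimum, quartiles, mean, maximum
mean(sales_total)
median(sales_total)    # 498.6
range(sales_total)     # smallest and largest value
sd(sales_total)        # sample standard deviation
var(sales_total)       # sample variance
IQR(sales_total)       # interquartile range
```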
Hypothesis testing
► A statistical hypothesis is an assumption made by the researcher about the population
data collected for an experiment.
► This assumption does not have to be true every time.
► Hypothesis testing is a formal process of validating the hypothesis made by the
researcher.
► Validating a hypothesis would ideally take the entire population into account.
However, this is not practically possible.
► Thus, to validate a hypothesis, random samples from the population are used.
► On the basis of the test results on the sample data, the hypothesis is either
retained or rejected.
[Figure: Distribution of two samples of data]
Categories of Hypothesis testing
► Null Hypothesis – Hypothesis testing is carried out in order to test the validity of
a claim or assumption that is made about the larger population.
► This claim, which is the one put on trial, is known as the Null Hypothesis.
► The null hypothesis is denoted by H0.
► Alternative Hypothesis – An alternative hypothesis is considered valid if
the null hypothesis is false.
► The evidence that is present in the trial is the data and the statistical
computations that accompany it.
► The alternative hypothesis is denoted by H1 or Ha.
Example - Null hypothesis and Alternate hypothesis
Hypothesis Testing in R
► Statisticians use hypothesis testing to formally check whether the hypothesis is
accepted or rejected. Hypothesis testing is conducted in the following manner:
► State the Hypotheses – Stating the null and alternative hypotheses.
► Formulate an Analysis Plan – Decide how the sample data will be used to evaluate the
null hypothesis, including the test to apply and the significance level.
► Analyze Sample Data – Calculation and interpretation of the test statistic, as
described in the analysis plan.
► Interpret Results – Application of the decision rule described in the analysis
plan.
p-Value
► Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence,
in other words, what the data are telling you about the population.
► The p-value ranges between 0 and 1.
It can be interpreted in the following way:
► A small p-value (typically ≤ 0.05) indicates strong evidence against the null
hypothesis, so you reject it.
► A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so
you fail to reject it.
► A p-value very close to the cutoff (0.05) is considered to be marginal and could go
either way.
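The decision rule above can be sketched in R. This example borrows the t.test() function covered later in this unit. The simulated data, the seed, and the variable names are assumptions for illustration:

```r
set.seed(1)   # fixed seed only so the sketch is reproducible
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

result <- t.test(x, y, var.equal = TRUE)
alpha <- 0.05                     # conventional cutoff from the text

result$p.value                    # a value between 0 and 1
if (result$p.value <= alpha) {
  decision <- "reject H0"
} else {
  decision <- "fail to reject H0"
}
decision
```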
Difference of means:
If the means of the different samples are approximately equal, the distributions
overlap as shown in the image and the null hypothesis is supported. Otherwise, the
alternative hypothesis is supported.
H0 Null hypothesis (means are equal)
HA Alternate hypothesis (difference in the means of the two samples)
The difference in means can be tested using Student’s t-test and Welch’s t-test.
Using the Student’s T-test in R
► The Student’s T-test is a method for comparing two samples.
► It can be implemented to determine whether the samples are different.
► This is a parametric test, and the data should be normally distributed.
Student’s t-test using R
x <- rnorm(10, mean=100, sd=5) # normal distribution centered at 100
y <- rnorm(20, mean=105, sd=5) # normal distribution centered at 105
t.test(x, y, var.equal=TRUE) # run the Student’s t-test
Output?
Two Sample t-test
data: x and y
t = -1.7828, df = 28, p-value = 0.08547
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.1611557 0.4271893
sample estimates:
mean of x mean of y
102.2136 105.0806
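To show where the t and df values in such output come from, the sketch below recomputes the pooled-variance statistic by hand and compares it with t.test(). The seed is an assumption, so the numbers will differ from the output shown above:

```r
set.seed(42)   # arbitrary seed for reproducibility
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)
nx <- length(x)
ny <- length(y)

# pooled variance: weighted average of the two sample variances
sp2 <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
t_manual <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / nx + 1 / ny))
df_manual <- nx + ny - 2
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)   # two-sided p-value

res <- t.test(x, y, var.equal = TRUE)
all.equal(unname(res$statistic), t_manual)   # TRUE
all.equal(res$p.value, p_manual)             # TRUE
```

With 10 and 20 observations the degrees of freedom are 10 + 20 - 2 = 28, matching the df in the output above.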
The commands used in the Student’s t-test
► Listed below are the commands used in the Student’s t-test and their explanation:
► t.test(data.1, data.2) – The basic method of applying a t-test is to compare two
vectors of numeric data.
► var.equal = FALSE – If the var.equal instruction is set to TRUE, the variance is
considered to be equal and the standard test is carried out. If the instruction is set
to FALSE (the default), the variance is considered unequal and the Welch
two-sample test is carried out.
► mu = 0 – If a one-sample test is carried out, mu indicates the mean against which
the sample should be tested.
► alternative = "two.sided" – It sets the alternative hypothesis. The default value for this
is "two.sided", but "greater" or "less" can also be specified. You can abbreviate the
value.
► conf.level = 0.95 – It sets the confidence level of the interval (default = 0.95).
► paired = FALSE – If set to TRUE, a matched pair T-test is carried out.
► t.test(y ~ x, data, subset) – The required data can be specified as a formula of the form
response ~ predictor. In this case, the data should be named and a subset of the predictor
variable can be specified.
► subset = predictor %in% c("sample.1", "sample.2") – If the data is in the form
response ~ predictor, the subset instruction specifies which two samples to select
from the predictor column of the data.
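The options listed above can be combined as in the following sketch. The data frame d, its column names, and the simulated values are assumptions that mirror the placeholder names used in the list:

```r
set.seed(7)   # arbitrary seed for reproducibility
d <- data.frame(
  response  = c(rnorm(8, 100, 5), rnorm(8, 105, 5), rnorm(8, 110, 5)),
  predictor = rep(c("sample.1", "sample.2", "sample.3"), each = 8)
)

# formula interface, restricted to two of the three groups (Welch test by default)
res_formula <- t.test(response ~ predictor, data = d,
                      subset = predictor %in% c("sample.1", "sample.2"))

# one-sample test against a hypothesized mean of 100
res_one <- t.test(d$response[d$predictor == "sample.1"], mu = 100)

# one-sided alternative with a 99% confidence interval
res_less <- t.test(response ~ predictor, data = d,
                   subset = predictor %in% c("sample.1", "sample.2"),
                   alternative = "less", conf.level = 0.99)

res_formula$method   # "Welch Two Sample t-test"
```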
Welch’s t-test
When the equal population variance assumption is not justified in performing Student’s t test for the
difference of means, Welch’s t-test can be used.
In Welch’s test, under the remaining assumptions of random samples from two normal populations with
the same mean, the distribution of T is approximated by the t distribution.
The following R code performs the Welch’s t-test on the same set of data analyzed in the earlier Student’s
t-test example
t.test(x, y, var.equal=FALSE) # run the Welch’s t-test
Output?
Welch Two Sample t-test
data: x and y
t = -1.6596, df = 15.118, p-value = 0.1176
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-6.546629 0.812663
sample estimates:
mean of x mean of y
102.2136 105.0806
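The non-integer df in Welch output comes from the Welch-Satterthwaite approximation. A sketch that recomputes the statistic and the degrees of freedom by hand and checks them against t.test(). The seed is an assumption, so the numbers differ from the output above:

```r
set.seed(42)   # arbitrary seed for reproducibility
x <- rnorm(10, mean = 100, sd = 5)
y <- rnorm(20, mean = 105, sd = 5)

# squared standard error of each sample mean
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

t_welch <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation for the degrees of freedom
df_welch <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

res <- t.test(x, y, var.equal = FALSE)
all.equal(unname(res$statistic), t_welch)    # TRUE
all.equal(unname(res$parameter), df_welch)   # TRUE
```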
Wilcoxon Rank-Sum test
► A t-test represents a parametric test in that it makes assumptions about the
population distributions from which the samples are drawn.
► If the populations cannot be assumed or transformed to follow a normal
distribution, a nonparametric test can be used.
► The Wilcoxon rank-sum test is a nonparametric hypothesis test that checks
whether two populations are identically distributed.
► The Wilcoxon rank-sum test determines the significance of the observed rank-sums. The following
R code performs the test on the same dataset used for the previous t-test.
>wilcox.test(x, y, conf.int = TRUE)
Output?
Wilcoxon rank sum test
data: x and y
W = 55, p-value = 0.04903
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
-6.2596774 -0.1240618
sample estimates:
difference in location
-3.417658
Decision Errors in R
► The two types of error that can occur from the hypothesis testing:
► Type I Error - the rejection of the null hypothesis when the null hypothesis is TRUE
► Type II Error - the acceptance of a null hypothesis when the null hypothesis is FALSE
Wilcoxon Rank-Sum Test
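► Example: two populations, pop1 and pop2, with sample sizes n1 and n2
► N = n1 + n2 observations in total
► Pool the observations from the two groups and rank them.
► The smallest observation receives rank 1.
► Regroup the ranked observations according to population.
► Sum the assigned ranks within each group.
► Compare the rank sums of pop1 and pop2.
► In R: wilcox.test(x, y)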
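The ranking steps above can be traced by hand and compared with wilcox.test(). In this sketch x and y stand in for samples from pop1 and pop2. The simulated data and the seed are assumptions:

```r
set.seed(42)   # arbitrary seed for reproducibility
x <- rnorm(10, mean = 100, sd = 5)   # sample from pop1
y <- rnorm(20, mean = 105, sd = 5)   # sample from pop2

ranks <- rank(c(x, y))                    # rank all N = n1 + n2 observations
rank_sum_x <- sum(ranks[seq_along(x)])    # rank sum for pop1
rank_sum_y <- sum(ranks[-seq_along(x)])   # rank sum for pop2

# wilcox.test() reports W: the pop1 rank sum minus its smallest possible value
w_manual <- rank_sum_x - length(x) * (length(x) + 1) / 2

res <- wilcox.test(x, y)
all.equal(unname(res$statistic), w_manual)   # TRUE
```

The two rank sums always add up to N(N + 1)/2, which is 465 for N = 30.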
Type I and Type II Errors
► A type I error is the rejection of the null hypothesis when
the null hypothesis is TRUE. The probability of the type I
error is denoted by the Greek letter α.
► A type II error is the acceptance of a null hypothesis when
the null hypothesis is FALSE. The probability of the type
II error is denoted by the Greek letter β .
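The meaning of α can be illustrated with a small simulation: when the null hypothesis is TRUE, a test at the 0.05 level should reject in roughly 5% of repeated experiments. The sample sizes, seed, and number of repetitions below are arbitrary assumptions:

```r
set.seed(123)    # arbitrary seed for reproducibility
alpha <- 0.05
n_sim <- 2000    # number of simulated experiments

# both samples come from the same population, so H0 (equal means) is TRUE
p_values <- replicate(n_sim, {
  x <- rnorm(10, mean = 100, sd = 5)
  y <- rnorm(20, mean = 100, sd = 5)
  t.test(x, y, var.equal = TRUE)$p.value
})

# fraction of experiments that wrongly reject H0 (type I errors)
type1_rate <- mean(p_values <= alpha)
type1_rate
```

Changing the mean of y so that the null hypothesis is FALSE turns the same loop into an estimate of 1 - β, the probability of correctly rejecting.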