Lecture 1


Msc EDCBA

Marc-Arthur Diaye
Full Professor
University Paris 1 Pantheon-Sorbonne

Data Analytics
Basic notions of R
CLASS 1A
• http://www.r-project.org/

• CRAN (Comprehensive R Archive Network)


• http://cran.r-project.org/

• Rstudio
• https://www.rstudio.com/
Data
• To find which folder R is currently looking in,
type :
• getwd()
• To change folder (directory) :
• setwd("path")
• Important: use / instead of \.
• For instance: setwd("C:/Marc/Master")
• Go directly to « Files » or « Session » (if you use
Rstudio)
• From « Files » or « Session », change folder.
Data
• R can read data sets in text format (ascii)
using the following functions :
• read.table
• scan
• read.fwf
Data
• R can also read Excel, SAS, SPSS, ... files.
• The functions for these formats are however not in the
base package.
Data
• The function read.table reads a dataset into a data frame.
• It is the main function used to read a dataset.

• Example: A « txt » file called « coi2006 »
• From this dataset, we can create a dataset
called mydata:
• mydata<-read.table("coi2006.txt" ,
header=TRUE)
Data
• sep: sep="\t" tells R that the file is tab-
delimited (use " " for space delimited and ","
for comma delimited; use "," for a .csv file).

• row.names: the names of the rows, given either as a
character vector or as the number (or the name) of a
variable of the file (by default: 1, 2, 3, ...)
Data
• col.names: a vector containing the names of the
dataset variables (by default: V1, V2, V3, ...).

• as.is: controls the conversion of character
variables into factors (if FALSE) or keeps them
as characters (if TRUE); as.is can be a logical,
numeric or character vector specifying the
variables to keep as characters.
Data
• The variants of read.table are useful because
they have different default values:
• read.csv(file, header = TRUE, sep = ",",
quote="\"", dec=".", fill = TRUE, ...)
• read.csv2(file, header = TRUE, sep = ";",
quote="\"", dec=",",fill = TRUE, ...)
• read.delim(file, header = TRUE, sep = "\t",
quote="\"", dec=".",fill = TRUE, ...)
• read.delim2(file, header = TRUE, sep = "\t",
quote="\"", dec=",",fill = TRUE, ...)
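As a rough illustration of how these defaults differ, the sketch below writes two small temporary files and reads each one back with the matching variant. The data frame toy and its values are hypothetical and are not part of the course dataset.

```r
# Toy data (hypothetical; not the coi2006 dataset)
toy <- data.frame(id = 1:3, wage = c(1500.5, 2300.25, 1800))

# Comma-separated file with "." as decimal mark: matches read.csv defaults
f1 <- tempfile(fileext = ".csv")
write.csv(toy, f1, row.names = FALSE)
d1 <- read.csv(f1)

# Semicolon-separated file with "," as decimal mark: matches read.csv2 defaults
f2 <- tempfile(fileext = ".csv")
write.csv2(toy, f2, row.names = FALSE)
d2 <- read.csv2(f2)

# Both imports recover the same numeric column
print(d1)
print(all.equal(d1$wage, d2$wage))
```

Reading the second file with read.csv instead would not split the columns correctly, which is why choosing the variant that matches the file matters.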
Data
• The dataset we will use is already in the .dta (Stata)
format.
• To read a dataset in this format, first install the
package called « foreign » if it is not already installed.
• Then load it by writing in the console:
library("foreign")
• Then create a dataset called « mydata »
from « coi2006.dta ».
• mydata<-read.dta("coi2006.dta")
Data
• If you use Rstudio, you can directly visualize
the data from the « Environment » part.
• If you use the standard R, you can go to
« Edit », then to « Edit data».

• If you want the list of objects currently in your
workspace, you can write in the console
• ls()
• Or
• objects()
• (To list the files in the working folder, use list.files().)
Data
• Quit R

• Enter R

• library("foreign")

• mydata<-
read.dta(file="f:/Coi2006/coi2006.dta")
Data
• To display the variable names together with the first rows
(names(mydata) returns the variable names only):
• head(mydata)

• To make the columns of the dataset available by name
for computation, the data can be attached:
• attach(mydata)
• The variables of the dataset can then be used directly
in computations.
Data
• You can compactly display the structure of all
variables from the dataset:
• str(mydata)

• You can compactly display the structure of a


specific variable z from the dataset:
• str(mydata$z)
• For instance : str(mydata$sexe)
Data
• Statistics:

• summary(mydata)
• Provide a summary of all variables from the dataset.

• summary(salnet)
• Provide a summary of variable salnet.

• For numerical variables, summary provides Min, Q1 (first
quartile), Median (Q2), Mean, Q3 (third quartile) and Max.

• Missing values are denoted NA.


Data
• Example of the NET WAGE variable: salnet.

• mean(salnet)
• var(salnet)
• sd(salnet)
• quantile(salnet)
• median(salnet)
• range(salnet)

• boxplot(salnet)
• boxplot(salnet, horizontal=TRUE)
Data

• Example of the NET WAGE variable: salnet.

• You can first define a list, before performing a


boxplot

• R> b=list("Man"=salnet[sex==1],
"Woman"=salnet[sex==2])
• R> boxplot(b)
Data

• BOXPLOT.

A boxplot, is a simple diagram that represents the


distribution of a variable.

This diagram is composed of:


- A rectangle that extends from the first to the third
quartile.
- The rectangle is divided by a line corresponding to the
median.

Data

• BOXPLOT.

- This rectangle is completed by two segments of lines.


- To draw them, we first calculate the bounds:
$b^- = x_{1/4} - 1.5\,IQ$
and $b^+ = x_{3/4} + 1.5\,IQ$
with $IQ$ the interquartile distance (i.e., the difference
between the 3rd quartile $x_{3/4}$ and the 1st quartile $x_{1/4}$).
Data
• BOXPLOT.
- The smallest and largest observation between
these boundaries is then identified. These
observations are called "adjacent values".
- We draw the line segments linking these
observations to the rectangle.
- Values that are not between adjacent values
are represented by dots and are called
"extreme values".
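The construction described in these slides can be sketched by hand as follows. The vector x is a hypothetical toy sample, and quantile() is used with its default settings; boxplot() itself computes the quartiles as "hinges", which may differ slightly on other data.

```r
# Toy wages (hypothetical values), with one unusually large observation
x <- c(12, 15, 16, 18, 20, 21, 23, 25, 60)

q1 <- quantile(x, 0.25, names = FALSE)  # first quartile
q3 <- quantile(x, 0.75, names = FALSE)  # third quartile
iq <- q3 - q1                           # interquartile distance

b_low  <- q1 - 1.5 * iq                 # lower bound b-
b_high <- q3 + 1.5 * iq                 # upper bound b+

# Adjacent values: smallest and largest observations inside the bounds
inside   <- x[x >= b_low & x <= b_high]
adjacent <- range(inside)

# Extreme values: observations outside the bounds, drawn as dots
extreme <- x[x < b_low | x > b_high]

print(adjacent)
print(extreme)
```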
Data : More about Boxplot
• boxplot(list(salnet,salbrut))
• By default, whiskers have a maximum length
equal to 1.5 times the size of the box.
• This coefficient can be modified with the range
option.
• You can also change the width of the box with
the width option.
• The names option makes it possible to specify
the labels to be displayed under each box. For
example here one could use names = c("x","y").
Data : More about Boxplot
• Example:
• boxplot(list(salnet,salbrut),names=c("Net
wage", "Gross wage"))
Data

• Example of the NET WAGE variable: salnet.

• Compute average net wage by gender:


• R> b=list(sex)
• R> aggregate(salnet,by=b,mean)
• Or :
• R> aggregate(mydata$salnet,by=b,mean)
Data

• Example of the NET WAGE variable: salnet.


• Mean net wage per gender and PCS (3 :
Managers, 4: Middle managers, 5:
Employees, 6: Blue-collars):
• R> b=list(sex,cscor)
• R> aggregate(salnet,by=b,mean)
• Or :
• R> aggregate(mydata$salnet,by=b,mean)
Data

• Example of the NET WAGE variable: salnet.


• Standard-deviation of net wage per gender
and PCS (3 : Managers, 4: Middle managers,
5: Employees, 6: Blue-collars):
• R> b=list(sex,cscor)
• R> aggregate(salnet,by=b,sd)
• Or :
• R> aggregate(mydata$salnet,by=b,sd)
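The same grouped statistics can also be written with the formula interface of aggregate, which labels the grouping columns automatically. The sketch below uses a small hypothetical data frame named toy rather than the course dataset.

```r
# Toy data (hypothetical values; sex and cscor coded as in the slides)
toy <- data.frame(
  sex    = c(1, 1, 2, 2, 1, 2),
  cscor  = c(3, 5, 3, 5, 3, 5),
  salnet = c(30, 18, 28, 16, 34, 20)
)

# Mean of salnet for each observed sex x cscor combination
m <- aggregate(salnet ~ sex + cscor, data = toy, FUN = mean)
print(m)

# Same call with the standard deviation
# (groups with a single observation give NA)
s <- aggregate(salnet ~ sex + cscor, data = toy, FUN = sd)
print(s)
```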
Data

• When a variable includes some missing


values:
• Example of variable r100 (profit).

• R> mean(r100)
• [1] NA
Data
• When a variable includes some missing
values:
• Example of variable r100 (profit).

• How to deal with it ?


• Answer : create a new database without
missing values:
• R> mydata2<-mydata[!is.na(mydata$r100),]
Data

• When a variable includes some missing values:


• Example of variable r100 (profit).

• Then compute the mean value of profit using the new


dataset.
• R> mean(mydata2$r100)
• [1] 17605.16

• A simpler solution is to use the option «na.rm »


directly on the original dataset:
• R> mean (r100, na.rm=TRUE)
Data
• When a variable includes some missing values:
• Suppose that you want to have the number of missing
observations of a given variable.
• For instance « salnet » and « r100 »
• We know that salnet includes no missing values, while r100
includes some missing values.
• R> sum(is.na(salnet))
• [1] 0
• R> sum(is.na(r100))
• [1] 1072

• You can also use « table »:


• R> table(is.na(r100))
• FALSE TRUE
• 11912 1072
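The handling of missing values shown in these slides can be tried on a short vector. The values of profit below are hypothetical and only stand in for r100.

```r
# Toy profit variable with missing values (hypothetical numbers)
profit <- c(100, NA, 250, 50, NA)

m_raw   <- mean(profit)                # NA: one missing value is enough
m_clean <- mean(profit, na.rm = TRUE)  # mean of the observed values only
n_miss  <- sum(is.na(profit))          # number of missing observations

# Equivalent to na.rm = TRUE: drop the missing entries first
observed <- profit[!is.na(profit)]
m_subset <- mean(observed)

print(c(m_raw, m_clean, m_subset))
print(table(is.na(profit)))            # observed (FALSE) vs missing (TRUE)
```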
Data

• Compute the mean profit (r100) per gender, using the dataset without missing values.

• R> b=list(mydata2$sex)
• R> aggregate(mydata2$r100,by=b,mean)
Data
• Categorical variable :
• R> table(diplome)
• diplome
• 1 10 2 3 4 5 6 7 8 9
• 677 597 903 1864 2961 1066 1228 2350 716 622

• If we want some proportions:


• R> tab<-table(diplome)
• R> prop.table(tab)
• 1 10 2 3 4
• 0.05214110 0.04597967 0.06954713 0.14356131 0.22804991
• 5 6 7 8 9
• 0.08210105 0.09457794 0.18099199 0.05514479 0.04790511
• We can have the same result directly with :
• R> prop.table(table(diplome))
Data
• Categorical variable:

• Graph of frequencies:
• R> barplot(table(diplome))
Data
• Categorical variable:

• R> plot(table(diplome))
Data
• Categorical variable:
• Pie:
• pie(table(diplome))
Data
• Categorical variable :
• Pie:
• pie(table(sex), main="Distribution
Man/Woman", labels=c("Man",
"Woman"),col=c("green", "yellow"))
Data
• Two categorical variables:
• sex, diplome
• R> xtabs(~sex+diplome)
• Provides a contingency table.
• diplome
• sex 1 10 2 3 4 5 6 7 8 9
• 1 421 479 553 1486 1909 551 785 1271 384 320
• 2 256 118 350 378 1052 515 443 1079 332 302
Data
• Two categorical variables:

• We can do also:
• R> table(sex,diplome)

• Or:
• R> x<-table(sex,diplome)
• R> x
Data
• Two categorical variables:
• R> plot(x)
• Provides a mosaic plot (for a two-way table, the area of each tile is proportional to the cell count).
Data
• Two categorical variables:
• R> summary(x)
• Provides the chi-square statistic (Chisq), the number
of degrees of freedom (df) and the p-value of the
independence test of the two variables (here: sex
and diploma).
Data
• Two categorical variables:
• R> summary(x)

• Number of cases in table: 12984


• Number of factors: 2
• Test for independence of all factors:
• Chisq = 504.4, df = 9, p-value = 6.532e-103
Data
• Chi-square Independence Test
• H0 (Null hypothesis): The two distributions are
independent / H1: The two distributions are not
independent.

• summary(x) provides the p-value of the test.


• Fix the probability α of the type 1 error (the rejection
of a true null hypothesis): 1%, 5% or 10%. If the
p-value ≤ α, then reject H0.
• In our example, p-value = 6.532e-103 < α = 1%. We
therefore reject H0: sex and diploma are not independent.
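The decision rule can be written out in a few lines. The sketch below uses a small hypothetical 2 x 3 contingency table (the counts and level names are invented) and reads the test results from the object returned by summary().

```r
# Toy 2 x 3 contingency table (hypothetical counts and labels)
x <- as.table(matrix(c(30, 10, 20, 20, 10, 30), nrow = 2,
                     dimnames = list(sex = c("1", "2"),
                                     diplome = c("a", "b", "c"))))

res   <- summary(x)   # chi-square test of independence
alpha <- 0.01         # chosen probability of the type 1 error

print(res$statistic)  # chi-square statistic
print(res$parameter)  # degrees of freedom
print(res$p.value)    # p-value

# Decision rule: reject H0 (independence) when p-value <= alpha
reject_h0 <- res$p.value <= alpha
print(reject_h0)
```

The function chisq.test(x) performs the same test and additionally returns the expected counts.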
Data
• Two numerical variables:
• Example : salnet (net wage); salbrut (gross
wage)
• R> cor(salnet,salbrut)
• Provides the correlation coefficient (in the
sense of Pearson) between the two variables.
• R> cor(salnet,salbrut,method="spearman")
provides the correlation coefficient in the
sense of Spearman.
Data
• Two numerical variables :
• Spearman Correlation
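• Spearman compares the order (the ranks) of the values taken by
the two variables.
• Let us assume variables x = (x1,…,xn) and y =
(y1,…,yn).
• Within each variable, the values are sorted in increasing
order and replaced by their rank (1 for the smallest, n for the largest).
• Spearman correlation coefficient (when there are no ties) =
• $1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$, where $d_i$ is the difference between the rank of $x_i$ and the rank of $y_i$.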
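As a quick check of the rank-based formula, the sketch below computes the Spearman coefficient by hand and compares it with cor(). The vectors x and y are hypothetical and contain no ties, which is the case where the closed-form formula applies exactly.

```r
# Toy data without ties (hypothetical values)
x <- c(10, 20, 30, 40, 50)
y <- c(1, 3, 2, 5, 4)
n <- length(x)

# d_i: difference between the rank of x_i and the rank of y_i
d <- rank(x) - rank(y)

# Closed-form Spearman coefficient (valid without ties)
rho_formula <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))

# Built-in computation
rho_builtin <- cor(x, y, method = "spearman")

print(c(rho_formula, rho_builtin))
```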
Data
• Two numerical variables :

• Link between two numerical variables:


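• R> plot(salnet~salbrut)
• R> plot(salnet, salbrut)
• Note: the formula version puts salnet on the vertical axis, whereas plot(salnet, salbrut) puts salnet on the horizontal axis.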
Data
• One numerical variable and one categorical
variables:
• Take « salnet » and « sex »:
• R> tapply(salnet,sex,mean)
• 1 2
• 26330.01 19942.62
Data

• R> tapply(salnet,sex,summary)

• $`1`
• Min. 1st Qu. Median Mean 3rd Qu. Max.
• 3210 16930 21010 26330 28490 100000

• $`2`
• Min. 1st Qu. Median Mean 3rd Qu. Max.
• 3444 13580 16930 19940 22780 100000
Data

• Graph:
• R> b=list("Man"=salnet[sex==1],
"Woman"=salnet[sex==2])
• R> boxplot(b)
Data
• Remark concerning attach()
• We said that we need to write
• attach(mydata)
• before doing any computation on the variables
of the dataset.

• Actually, it is not necessary to go through this step.


But then, you have to specify the dataset that you use.
• Example:
• mean(mydata$salnet)
• table(mydata$sexe)
Data
• Create a new variable from another:
• We want to create a variable denoted
« lsalnet » defined as the ln of net wage.
• lsalnet<-log(salnet)
• mean(lsalnet)

• We can do also:
• mydata$lsalnet<-log(mydata$salnet)
• mean(mydata$lsalnet)
Data
• Create a new variable from another:

• Recall that the PCS variable (denoted CSCOR) takes four
modalities: 3: Manager, 4: Middle Manager, 5: Employee,
6: (Blue-collar) Worker
• From CSCOR, we want to create a new variable called PCS2,
with:
• PCS2 = TRUE if cscor = 3 or 4
• PCS2 = FALSE otherwise
• pcs2<-cscor %in% c(3,4)

• We can do the same thing with
• pcs2<-(cscor %in% c(3,4))
Data
• Create a new variable from another:
• If we want to create a binary variable 0/1:
PCS3

• pcs3<-rep(0,length(cscor))
• pcs3[cscor %in% c(3,4)]=1
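The two-step construction above has shorter equivalents. The sketch below compares three ways of building the same 0/1 variable on a hypothetical toy vector cscor; the names pcs3_a, pcs3_b and pcs3_c are only illustrative.

```r
# Toy PCS codes (hypothetical values)
cscor <- c(3, 4, 5, 6, 3, 5)

# 1. As in the slides: start from zeros, then overwrite the matching positions
pcs3_a <- rep(0, length(cscor))
pcs3_a[cscor %in% c(3, 4)] <- 1

# 2. Convert the logical vector directly (TRUE -> 1, FALSE -> 0)
pcs3_b <- as.numeric(cscor %in% c(3, 4))

# 3. Use ifelse()
pcs3_c <- ifelse(cscor %in% c(3, 4), 1, 0)

print(table(pcs3_a))
print(identical(pcs3_a, pcs3_b) && identical(pcs3_a, pcs3_c))
```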
Data
• Create a new variable from another:
• Practice.

• Create a binary variable from the « sex »


variable.
• Recall: sex=1 if man; 2 if woman
Data

• man=rep(0,length(sex))
• table(man)

• man[sex=="1"]=1
• table(man)
Data

• Practice : Impliq variable


• Create from this variable a binary effort
variable.

• Answer:
• Effort (impliq)
• effort<-rep(0,length(impliq))
• effort[impliq =="3"]=1
Data
• More concerning the Plot function
• xlim, ylim : set the lower and upper limits of
the axes (two-element vectors).
• xlab, ylab : allow you to specify the axis
legend (character mode).
• main : allows to put a title above the graph
(character mode).
• pch : defines the symbol representing the
points; an integer from 1 to 25, or any
character in quotation marks.
Data
• More concerning the Plot function
• col: specifies the color of symbols
("blue","red" etc. the exhaustive list of
available colours can be get with colors()),

• bty: controls the shape of the frame; default


square ("o"), L ("l"), U ("u"), C ("c"), 7 ("7") or
square brackets ("]").
Data
• More concerning the Plot function

• We can play on the size of the symbols thanks to


the option cex.

• By default cex=1 ; however we can provide to


the software, a positive number that represents
a multiplicative coefficient relative to the default
size (a value between 0 and 1 to reduce the size,
or greater than 1 to increase it).
Data
• More concerning the Plot function

• In the same way the options cex.axis, cex.lab


and cex.main control the size of the graduations
of the axes, labels of the axes and the title.

• To change the style of the text, we use the font
option, which also comes in the forms font.axis,
font.lab and font.main (1 for normal, 2 for bold,
3 for italic and 4 for bold italic).
Data
• More concerning the Hist function

• We saw that to draw a histogram, the basic


command is the hist function :
• hist(x)

• Some options of the hist function:


• breaks: specifies the break points between the bars of
the histogram, either as a vector of break points or as a
single number of bars (R treats this number as a suggestion only).
• freq: chooses between counts (freq=TRUE, the default
option) and densities (freq=FALSE), in which case the total area of the bars is 1.
Data
• More concerning the Hist function

• Some options of the hist function :


• col : indicates the color to fill the bars.
• plot : if plot=FALSE, the histogram is not drawn
and the function returns the list of break points
and numbers.
• right: the intervals are of type ]a, b] if
right=TRUE (the default) and of type [a, b[ if right=FALSE.
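The effect of the right and plot options can be seen on a tiny example. The vector x and the break points below are hypothetical and chosen so that some values fall exactly on a break point.

```r
# Toy data: the value 2 lies exactly on a break point
x <- c(1, 2, 2, 3, 4)

# plot = FALSE returns the break points and counts without drawing
h_right <- hist(x, breaks = c(0, 2, 4), plot = FALSE)                 # ]a, b]
h_left  <- hist(x, breaks = c(0, 2, 4), right = FALSE, plot = FALSE)  # [a, b[

print(h_right$counts)  # the two 2s are counted in the first interval
print(h_left$counts)   # the two 2s are counted in the second interval
```

In both cases the extreme break point is still included, because include.lowest is TRUE by default.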
Data
• More concerning the Hist function

• Example:
• hist(salnet, freq = FALSE, main = "Histogram
NET Wage")
Data
• More concerning the Hist function

• Example:
• hist(salnet, freq = TRUE, main = "Histogram
Net Wage")
Data
• More concerning the Hist function

• Example:

• hist(salnet, freq = TRUE, col = "blue", main =


"Histogram Net Wage")
Data
• More concerning the Hist function

• Example: Add some cuts


• Suppose that you want two cuts:

• hist(salnet, breaks = 2, freq = FALSE, col =


"blue", main = "Histogram Net Wage")
• hist(salnet, breaks = 2, freq = TRUE, col =
"blue", main = "Histogram Net Wage")
Data
• More concerning the Hist function

• Transform a variable inside the hist function:

• hist(log(salnet), freq = FALSE, main =


"Histogram Net Wage")
Data
• More concerning the Hist function

• Transform a variable inside the hist function +


smooth:

• lines(density(log(salnet)), col= "red")


Data
# Remove some variables from the dataset: example with cscor, age and effl_corr
myvars <- names(mydata) %in% c("cscor", "age", "effl_corr")
newdata <- mydata[!myvars]

# We can check that the variable cscor no longer exists in newdata
table(newdata$cscor)
# < table of extent 0 >

# Remove the 3rd and 5th variables
newdata <- mydata[c(-3,-5)]
Data

# Remove variables typemploi and impliq from the original dataset mydata
mydata$typemploi <- mydata$impliq <- NULL

# We can also do
mydata$typemploi <- NULL
mydata$impliq <- NULL

# Note: the dataset must be specified. Writing
# typemploi <- impliq <- NULL
# only creates two NULL objects in the workspace and leaves mydata unchanged,
# even after attach(mydata).
Data
# Select some variables: example of cscor, age, effl_corr
myvars2 <- c("cscor", "age", "effl_corr")
newdata <- mydata[myvars2]
fix(newdata)

# Select the 1st variable and variables 5 to 10
newdata <- mydata[c(1,5:10)]
Data
# Select the first 20 observations of the dataset
newdata <- mydata[1:20,]
fix(newdata)

# Select observations that fulfill some conditions
# Example: select executive women aged 40 or under
newdata <- mydata[ which(mydata$sex=="2" & mydata$cscor=="3" & mydata$age <= 40), ]

# or, if we have first performed attach(mydata)
newdata <- mydata[ which(sex=="2" & cscor=="3" & age <= 40), ]
Data
A convenient way to select observations is to use
the subset function

Syntax: subset(x, subset, select)
Data
Example:
• Select executive women aged 40 or under.
• Keep variables : sex, cscor, age, salnet, siren,
depnaiss, effl_corr, couple.

#Use function subset


newdata <- subset(mydata, sex=="2" & cscor=="3"
& age <= 40, select=c(sex,cscor,age,salnet,siren,
depnaiss,effl_corr,couple))
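To see that subset() and bracket indexing select the same rows, the sketch below applies both to a small hypothetical data frame named toy, with codes chosen to mimic the course variables.

```r
# Toy data (hypothetical values; codes as character strings, as in the slides)
toy <- data.frame(
  sex    = c("1", "2", "2", "2"),
  cscor  = c("3", "3", "5", "3"),
  age    = c(35, 38, 30, 52),
  salnet = c(40, 38, 20, 45)
)

# subset(): condition on the rows, then choice of columns via select
sel <- subset(toy, sex == "2" & cscor == "3" & age <= 40,
              select = c(sex, age, salnet))

# Same selection with bracket indexing and which()
same <- toy[which(toy$sex == "2" & toy$cscor == "3" & toy$age <= 40),
            c("sex", "age", "salnet")]

print(sel)
print(all.equal(sel, same))
```

Inside subset(), variable names are looked up in the data frame itself, so neither mydata$ prefixes nor attach() are needed.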
Data
Practice 1:
• Create from COI2006 two datasets including
respectively:
• Stressed men less than 35.
• Stressed men at least 35.
• Compare the average net wage of the two
groups.
Data
Practice 2:
• COI2006 includes two gender variables : sexe
and sex. However these variables come from
two different statistical sources. As a
consequence, they do not agree all the times.
• Compare both variables.
• Is it possible to have a convincing answer about
the agents’ true gender, when the two variables
disagree ?
Introduction to GGPLOT 2
CLASS 1 B
Comparing densities between groups
• library("foreign")
• mydata=read.dta(file="F:/COI2006/coi2006.dta")
• attach(mydata)

• library (lattice)
• densityplot(~salnet|sex)

#If you need to specify the name of the dataset
• densityplot(~salnet|sex, data=mydata)
Practice 1
Comparing (Kernel) densities between groups

• In the same graph, draw the densities of net


wage (salnet variable) for managers and non
managers, for women and for men.
• Reproduce the following graph.
Solution 1
• library(lattice)

• v_manager<-rep("Non manager",length(cscor))
• v_manager[cscor %in% c(3,4)]="Manager"
• v_homme<-rep("Woman",length(sex))
• v_homme[sex=="1"]="Man"

• densityplot(~salnet|v_homme,
groups=v_manager, data=mydata,
auto.key=list(space="right"), main="Density Net
Wage Manager/Non manager, Woman/Man")
Data Visualization
• Visualisation is a fundamentally human
activity.

• A good visualization may show us things that


we did not expect, or raise new questions
about the data.
Datavisualization
• The majority of dataviz packages are part of the
so-called tidyverse.

• The packages in the tidyverse share a common


philosophy of data and R programming, and are
designed to work together naturally.

• We can install the complete tidyverse with a


single line of code:

• install.packages("tidyverse")
Datavisualization
• Then :
• library(tidyverse)

• You will see :

• -- Attaching packages --------------------------------------- tidyverse 1.2.1


• v ggplot2 3.1.0 v purrr 0.3.2
• v tibble 2.1.1 v dplyr 0.8.0.1
• v tidyr 0.8.3 v stringr 1.4.0
• v readr 1.3.1 v forcats 0.4.0
• -- Conflicts ------------------------------------------ tidyverse_conflicts() --
• x dplyr::filter() masks stats::filter()
• x dplyr::lag() masks stats::lag()
Datavisualization
• This tells us that tidyverse is loading the ggplot2,
tibble, tidyr, readr, purrr, dplyr, stringr, and forcats packages.

• These are considered to be the core of the tidyverse


because we will use them in almost every analysis.

• Packages in the tidyverse change fairly frequently.

• We can see if updates are available, and optionally


install them, by running :
• tidyverse_update()
Datavisualization
• Creating a ggplot

• With ggplot2, we begin a plot with the function


ggplot()

• ggplot() creates a coordinate system that we can add


layers to.

• The first argument of ggplot() is the dataset to use in


the graph.
• So ggplot(data = mydata) creates an empty graph.
Datavisualization
• Creating a ggplot

• We complete our graph by adding one or more


layers to ggplot().

• For instance, the function geom_point() adds a


layer of points to our plot, which creates a
scatterplot.

• ggplot2 comes with many geom functions that


each add a different type of layer to a plot.
Datavisualization
• Creating a ggplot

• Each geom function in ggplot2 takes a mapping


argument. This defines how variables in our dataset
are mapped to visual properties.

• The mapping argument is always paired with aes(),


and the x and y arguments of aes() specify which
variables to map to the x and y axes.

• ggplot2 looks for the mapped variables in the data


argument.
Datavisualization
• Creating a ggplot

• Example :
• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age))
Datavisualization
• Creating a ggplot

• We can convey information about our data by


mapping the aesthetics in our plot to the variables in
our dataset.

• For example, we can map the colors of our points to
the “sex” variable to reveal the “sex” of each individual.

• Example :
• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age, color=sex))
Datavisualization
• Creating a ggplot

• Example with shape :


• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age, shape=sex))
Datavisualization
• Creating a ggplot

• Example with alpha :


• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age, alpha=sex))
Datavisualization
• Creating a ggplot

• We can also set the aesthetic properties of


our geom manually.
• For example, we can make all of the points in
our plot blue:
• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue")
Datavisualization
• Creating a ggplot

• The color does not convey information about a


variable, but only changes the appearance of the plot.

• To set an aesthetic manually, set the aesthetic by


name as an argument of our geom function;

• We will need to pick a level that makes sense for that


aesthetic:
• The name of a color as a character string;
• The size of a point in mm;
• The shape of a point as a number (see the below figure).
Datavisualization
• Creating a ggplot

• Example:
• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), shape=23)

• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), color="blue", shape=11)

• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), color="blue", shape=19, size=3)
[Figures: the three scatterplots produced by the commands above, with shape=23; color="blue", shape=11; color="blue", shape=19]
Datavisualization
• Creating a ggplot

• Example: Add a title using ggtitle

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + ggtitle("Shalom to my niece
Rivka")
Datavisualization
• Creating a ggplot

• Example: Add a title using labs

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + labs(title="Shalom to my niece
Rivka")
Datavisualization
• Creating a ggplot

• Example: Change the name of the axis using


xlab and ylab

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + ggtitle("Shalom to my niece
Rivka") + xlab("Net wage") + ylab("Ageeeee")
Datavisualization
• Creating a ggplot

• Example: We can have the same result (i.e.,


change the name of the axis) using labs

• ggplot(data=mydata)+geom_point(mapping =
aes(x = salnet, y = age), color="blue",
shape=11) + labs(title="Shalom to my niece
Rivka", x="Net wage", y="Ageeeee")
Datavisualization
• Creating a ggplot

• Example: Change the color, size and type of the main


title and the axis titles, using theme

• ggplot(data=mydata)+geom_point(mapping = aes(x =
salnet, y = age), color="blue", shape=11) +
labs(title="Shalom to my niece Rivka", x="Net wage",
y="Ageeeee")+theme(plot.title =
element_text(color="red", size=14, face="bold.italic"),
• axis.title.x = element_text(color="blue", size=14,
face="bold"), axis.title.y =
element_text(color="#993333", size=14, face="bold"))
Hexadecimal color code chart
• Colors can be specified as a hexadecimal RGB (Red, Green,
Blue) triplet, such as "#0066CC".

• The first two digits : are the level of red,


• the next two digits : green,
• and the last two digits : blue.

• The value for each ranges from 00 to FF in hexadecimal
(base-16) notation, which is equivalent to 0 to 255 in
base-10.

• For example, in the table below (slide 30), “#FFFFFF” is


white and “#990000” is a deep red.
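The decomposition of a hexadecimal color code can be checked directly in R. The sketch below uses the example code "#0066CC" from the slide and only base R functions.

```r
# Decompose a hexadecimal color into its red, green and blue levels (0 to 255)
rgb_values <- col2rgb("#0066CC")
print(rgb_values)

# The same conversion by hand: each pair of hexadecimal digits read in base 16
by_hand <- strtoi(c("00", "66", "CC"), base = 16L)
print(by_hand)

# And back: build the hexadecimal code from base-10 levels
code <- rgb(0, 102, 204, maxColorValue = 255)
print(code)
```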
http://www.visibone.com
Datavisualization
• Creating a ggplot

• Example: Remove main title or axis titles


using theme

• theme(plot.title = element_blank(),
• axis.title.x = element_blank(), axis.title.y =
element_blank())
Datavisualization
• Creating a ggplot

• One way to add additional variables is with


aesthetics.

• Another way, particularly useful for categorical


variables, is to split our plot into facets, subplots
that each display one subset of the data.

• To facet our plot by a single variable, we can use


facet_wrap().
Datavisualization
• Creating a ggplot

• The first argument of facet_wrap() should be


a formula, which we create with ~ followed
by a variable name.

• The variable that we pass to facet_wrap()


should be discrete.
Datavisualization
• Creating a ggplot

• Example :

• ggplot(data = mydata) + geom_point(mapping =


aes(x = age, y = salnet)) + facet_wrap(~sex, nrow
= 2)

• ggplot(data = mydata) + geom_point(mapping =


aes(x = age, y = salnet)) + facet_wrap(~cscor,
nrow = 2)
Datavisualization
• Creating a ggplot

• To facet our plot on the combination of two


variables, we can add facet_grid() to our plot
call.

• The first argument of facet_grid() is also a


formula.

• This time the formula should contain two


variable names separated by a ~.
Datavisualization
• Creating a ggplot

• Example :
• ggplot(data = mydata) +
geom_point(mapping = aes(x = age, y =
salnet)) + facet_grid(sex ~ cscor)
Datavisualization
• Creating a ggplot

• If we prefer to not facet in the rows or columns


dimension, we can use a “.” instead of a variable
name.
• Example :
• ggplot(data = mydata) + geom_point(mapping =
aes(x = age, y = salnet)) + facet_grid(. ~ cscor)
• ggplot(data = mydata) + geom_point(mapping =
aes(x = age, y = salnet)) + facet_grid(sex ~ .)
Datavisualization
• Creating a ggplot : geometric objects

• So far, we have used geom_point.


• However for bar charts we can use bar geoms;
• For line charts, we can use line geoms;
• For boxplots, we can use boxplot geoms;
• (…)
• We can use the smooth geom to add a smooth
line fitted to the data.
• The data can be fitted using the so-called Loess method; Loess, short
for local regression, is a non-parametric approach that fits
multiple regressions in local neighborhoods;
• Or the so-called Gam method; Gam is short for generalized
additive model.
Datavisualization
• Creating a ggplot : geometric objects

• To change the geom in our plot, we have


simply to change the geom function that we
add to ggplot().

• Example: Smooth the plot


• ggplot(data = mydata) +
geom_smooth(mapping = aes(x = age, y =
salnet))
Datavisualization
• Creating a ggplot : geometric objects

• Every geom function in ggplot2 takes a mapping argument.

• However, not every aesthetic works with every geom.

• We can set the shape of a point, but we cannot set the “shape” of
a line.

• On the other hand, we can set the linetype of a line.

• geom_smooth() will draw a different line, with a different linetype,


for each unique value of the variable that we map to linetype.
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata) +
geom_smooth(mapping = aes(x = age, y =
salnet, linetype = sex))
Datavisualization
• Creating a ggplot : geometric objects

• geom_smooth() uses a single geometric
object to display multiple rows of data.
• Therefore we can set the group aesthetic to a
categorical variable to draw multiple objects.
• ggplot2 will draw a separate object for each
unique value of the grouping variable.
Datavisualization
• Creating a ggplot : geometric objects

• In practice, ggplot2 will automatically group the


data for these geoms whenever we map an
aesthetic to a discrete variable (as in the linetype
example).

• It is convenient to rely on this feature because


the group aesthetic by itself does not add a
legend or distinguishing features to the geoms.
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata) + geom_smooth(mapping =
aes(x = age, y = salnet, group = sex))

• ggplot(data = mydata) + geom_smooth(mapping =


aes(x = age, y = salnet, color = sex))

• ggplot(data = mydata) + geom_smooth(mapping =


aes(x = age, y = salnet, color = sex), show.legend =
FALSE)
Datavisualization
• Creating a ggplot : geometric objects

• To display multiple geoms in the same plot, we


can add multiple geom functions to ggplot().

• Example :
• ggplot(data = mydata) + geom_point(mapping =
aes(x = age, y = salnet)) +
geom_smooth(mapping = aes(x = age, y =
salnet))
Datavisualization
• Creating a ggplot : geometric objects

• This, however, introduces some duplication in


our code.

• Imagine if we wanted to change the y-axis to


display salbrut instead of salnet.

• We will need to change the variable in two


places, and we may forget to update one.
Datavisualization
• Creating a ggplot : geometric objects

• We can avoid this kind of repetition by


passing a set of mappings to ggplot().

• ggplot2 will treat these mappings as global


mappings that apply to each geom in the
graph.
Datavisualization
• Creating a ggplot : geometric objects

• Example :
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point() + geom_smooth()
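A self-contained version of this pattern is sketched below on simulated data, so that it can be run without the course dataset. The data frame toy and its values are hypothetical, and the smoothing method is set explicitly to loess as an assumption; the slides leave it at its default.

```r
library(ggplot2)

# Toy data (hypothetical values standing in for mydata)
set.seed(1)
toy <- data.frame(age = rep(20:59, each = 2))
toy$salnet <- 15000 + 300 * (toy$age - 20) + rnorm(nrow(toy), sd = 2000)

# Global mapping: declared once in ggplot(), inherited by both layers
p <- ggplot(data = toy, mapping = aes(x = age, y = salnet)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# Switching the y variable now needs a single change, in aes() above
print(p)
```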
Datavisualization
• Creating a ggplot : geometric objects

• If we place mappings in a geom function, ggplot2


will treat them as local mappings for the layer.
• It will use these mappings to extend or overwrite
the global mappings for that layer only.

• This makes it possible to display different


aesthetics in different layers.
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth()
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth(data =
filter(mydata, sex == "1"))
• #Here the smooth line is the one for men
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth(data =
filter(mydata, sex == "2"))
• #Here the smooth line is the one for women
Datavisualization
• Creating a ggplot : geometric objects

• Example:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet)) + geom_point(mapping =
aes(color = sex)) + geom_smooth(data =
filter(mydata, sex == "2"), se=FALSE)
• #Here the smooth line is the one for women
and we have removed the standard-error
representation
Datavisualization
• Creating a ggplot : geometric objects

• Practice 2

• Recreate the code necessary to have the


following graph:
Practice 2
Datavisualization
• Creating a ggplot : geometric objects

• A solution to Practice 2:
• ggplot(data = mydata, mapping = aes(x = age,
y = salnet, color=sex)) + geom_point() +
geom_smooth()
