
Introduction to Statistics

Joao Lourenço (joao.lourenco@sib.swiss) and Rachel Marcone (rachel.marcone@sib.swiss)

January 2024
Data analysis with R: An introduction
Data analysis workflow

Hadley Wickham

Garrett Grolemund
Prepare: make data available in a specific format

• Database
• Flat file
• Proprietary file
Which tool to use for data analysis?
Annoyances with spreadsheets
Microsoft Excel

• Many standard methods in statistics are not available. Other methods offer only basic options (e.g. linear regression)
• Different analyses require the user to reorganize the data
• Probably OK for simple calculations (basic summary statistics, simple regression)
• Add-ons can be used for missing functions (e.g. StatPlus for Excel)
LibreOffice

• Many types of graphics violate standards of good graphics

Annoyances with spreadsheets

“The date conversions affect at least 30 gene names; the floating-point conversions affect at least 2,000 if Riken identifiers are included. These conversions are irreversible; the original gene names cannot be recovered.”
Example of a dataset which is difficult to use with any statistical program
https://en.wikipedia.org/wiki/Comparison_of_statistical_packages
What is R?

• R is an open-source, complete and flexible software environment for statistical computing and graphics.
• It includes:
▪ Tools for data import and manipulation
▪ A large set of data analysis tools
▪ Graphical tools
▪ As a programming language, a simple development environment, with a text editor
• R itself is written primarily in C and Fortran, and is an implementation of the statistical language S
Why R?

• R has become the tool of choice for statistical analysis in several fields, including life sciences
• Two reasons for this success: it is free, and many contributed packages are available (they can be installed and run directly from R).
• Well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed.
• Many tools are implemented for bioinformatics
Advantages of R

• Advantages of R
▪ Availability and compatibility
▪ State-of-the-art graphics capabilities
▪ Can import files from other (statistical) programs
▪ New version every x months
▪ Interactive development environments (IDEs) available
▪ Large user community

• Advantages of learning R
▪ Learn to program and do reproducible research
▪ Speak the common language
Drawbacks of R

• «Expert friendly»
• Learn by example
• Not very (easily) interactive
• Command-based
• Documentation sometimes cryptic

• (Too) large amount of resources

• Constantly evolving
• Memory intensive and slow at times
Downloading and installing R: the R website

https://www.r-project.org/
R console

The prompt ">" indicates that R is waiting for you to type a command
RStudio interface

• Editor
• Workspace, history
• Console, terminal
• File explorer, plots, packages, help
R scripts and workspace

• R script (.R file)

▪ Very useful instead of typing commands directly in the console.
▪ Lets you keep track of what you are doing and makes modifications easier
▪ To execute commands, select the lines and run them

• Workspace (.RData file)

▪ The internal memory where R stores the objects you create during the session.
▪ To list what is in your workspace: ls()
▪ To remove all objects from the workspace: rm(list=ls())
▪ To save only specific R objects: save(object_name(s), file="name_of_file.RData")
▪ To save your entire workspace: save.image("name_of_file.RData")
▪ To load your workspace / specific R objects: load("name_of_file.RData")
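As a minimal sketch of the save/load cycle described above, the following example stores two objects in a temporary file and restores them. The object names (example_values, example_label) and the file name are hypothetical and chosen only for illustration.

```r
# Two illustrative objects (hypothetical names, not from the slides)
example_values <- c(1.3, 0.32, 10.5)
example_label  <- "hello"

# Write to a temporary location so no file in the working directory is touched
rdata_file <- file.path(tempdir(), "example_objects.RData")

# Note the file= argument: save() needs it to know where to write
save(example_values, example_label, file = rdata_file)

# Remove the objects, then restore them from the file
rm(example_values, example_label)
restored_names <- load(rdata_file)  # load() returns the names of the restored objects
restored_names
example_values
```

Using file.path(tempdir(), ...) is only a convenience for the example; in a real analysis you would normally save next to your script.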
R Markdown

• R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both:
▪ save and execute code
▪ generate high quality reports that can be shared with an audience
• R Markdown documents are fully reproducible and support dozens of static and dynamic output formats

https://rmarkdown.rstudio.com/lesson-1.html
Leaving R

• To leave R, use the q() command (or "quit" from the menu in RStudio):
> q()
Save workspace image? [y/n/c]:

Answers:
y save workspace image
n don't save workspace image
c cancel quitting
Functions, operators and variables

CIhigh <- mean(x) + 1.96*sd(x)/sqrt(n)

Variables: objects stored in memory (here CIhigh, x, n)

Functions: always followed by parentheses (here mean(), sd(), sqrt())
Operators: here <-, +, *, /
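The expression above only runs once x and n exist. The sketch below defines them first; the data values are made up for illustration, and CIlow is an added counterpart that is not on the slide.

```r
# Illustrative data (made-up values)
x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
n <- length(x)  # number of observations

# Upper bound of an approximate 95% confidence interval for the mean,
# using the normal quantile 1.96 as on the slide
CIhigh <- mean(x) + 1.96 * sd(x) / sqrt(n)

# Lower bound (assumed counterpart, same formula with a minus sign)
CIlow <- mean(x) - 1.96 * sd(x) / sqrt(n)

c(lower = CIlow, upper = CIhigh)
```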
R syntax

• Case sensitive: A is not a
• Variable names can include A-Z, a-z, 0-9, ., … but cannot start with a number
• Commands can be separated by ; or newline
> x <- 2; x+2
[1] 4
• # indicates comments:
> maxvalue <- 2 # Values above 2 are not relevant
R help

> ?sum # equivalent to help(sum)
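Besides ?sum, a few other base R helpers can be useful when you do not remember a function's exact name or arguments. This is a small sketch, not an exhaustive list; the search terms are arbitrary examples.

```r
# Show the arguments of a function without opening its help page
args(sum)

# List objects whose name contains "mean" (useful when the exact name is forgotten)
matching_names <- apropos("mean")
head(matching_names)

# Interactive alternatives (left as comments because they open a help viewer):
# help("sum")                 # same as ?sum
# ??"standard deviation"      # full-text search across installed help pages
# example("mean")             # run the examples from a help page
```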

Using R as a calculator

> 2*3
[1] 6
> log(6)/2^2
[1] 0.4479399
> exp(6)-4
[1] 399.4288
> pi-3
[1] 0.1415927
Using R as a programming language

> x <- 2.0

> x
[1] 2
> y = 3.0 # Equivalent to y <- 3.0
> y; x
[1] 3
[1] 2
> 1/x
[1] 0.5
Creating vectors using the c() command

> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
> x
[1]  1.30  0.32 10.50  5.90  6.30
> y <- c(x, 1.4, x, x); y
 [1]  1.30  0.32 10.50  5.90  6.30
 [6]  1.40  1.30  0.32 10.50  5.90
[11]  6.30  1.30  0.32 10.50  5.90
[16]  6.30
Vector operations

Vector operations work element by element:

> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)

> y <- x*2; y
[1] 2.60 0.64 21.00 11.80 12.60
> z <- x*y; z
[1]   3.3800   0.2048 220.5000  69.6200  79.3800
Recycling
• If a vector is too short, R recycles it (reuses it) as needed:
> x <- c(1.3, 0.32, 10.5, 5.9)
> y <- c(2, 10)
> x*y
[1] 2.6 3.2 21.0 59.0
# i.e. 1.3*2  0.32*10  10.5*2  5.9*10

• A warning message is displayed if the shortest vector can not be recycled entirely:
> x <- c(1.3, 0.32, 10.5, 5.9, 6.3)
> x*y
[1] 2.6 3.2 21.0 59.0 12.6
Warning message:
In x * y :
longer object length is not a multiple of shorter object length
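To make the recycling rule explicit, the sketch below builds the recycled vector by hand with rep_len() and checks that the result matches what R does implicitly. The variable names y_recycled and y_partial are introduced here for illustration.

```r
x <- c(1.3, 0.32, 10.5, 5.9)
y <- c(2, 10)

# What R does implicitly: repeat y until it is as long as x
y_recycled <- rep_len(y, length(x))   # 2 10 2 10

# The implicit and the explicit versions give the same result
implicit <- x * y
explicit <- x * y_recycled
all.equal(implicit, explicit)

# With a fifth element, y no longer fits a whole number of times:
# R still recycles (2 10 2 10 2) but issues a warning
x5 <- c(x, 6.3)
y_partial <- rep_len(y, length(x5))
y_partial
```

Writing the recycled vector explicitly can be a useful habit when the two lengths are not obviously compatible.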
Generating sequences of numbers
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10

This is equivalent to:

> c(1,2,3,4,5,6,7,8,9,10)
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
Beware of operator priority
> x <- 2*1:10 # equivalent to x <- 2*(1:10)
> x
[1] 2 4 6 8 10 12 14 16 18 20
> n <- 10
> 1:n-1 # equivalent to (1:n)-1
[1] 0 1 2 3 4 5 6 7 8 9
> 1:(n-1)
[1] 1 2 3 4 5 6 7 8 9
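The examples above follow from the fact that the : operator binds more tightly than arithmetic operators such as * and -, but less tightly than ^ and the unary minus. A short sketch with a few more cases (variable names are illustrative):

```r
n <- 5

shifted   <- 1:n - 1     # parsed as (1:n) - 1
shorter   <- 1:(n - 1)   # parentheses force the subtraction first
doubled   <- 2 * 1:n     # parsed as 2 * (1:n)
from_neg  <- -1:2        # unary minus binds tighter than ":", so (-1):2
power_seq <- 2^1:3       # "^" binds tighter than ":", so (2^1):3

shifted
shorter
doubled
from_neg
power_seq
```

When in doubt, adding parentheses costs nothing and makes the intent visible.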
The seq() function: the same, but more flexible
> seq(from=1, to=10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(from=1, to=5, by=0.5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
> x <- seq(from=1, to=5, length=17)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75
[9] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
[17] 5.00
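The calls above write length=17, which works because R matches the abbreviated argument name to length.out. The sketch below spells the argument out and also shows seq_len(), a related base R function that is safer than 1:n when n may be zero. Variable names are illustrative.

```r
# Full argument name instead of the abbreviation "length"
grid <- seq(from = 1, to = 5, length.out = 17)
length(grid)

# seq_len(n) gives 1, 2, ..., n
n <- 10
one_to_n <- seq_len(n)

# Difference for n = 0: 1:0 counts down, seq_len(0) is empty
counting_down <- 1:0
empty_seq     <- seq_len(0)
counting_down
length(empty_seq)
```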
Non-numeric vectors: boolean (logical) values
> x <- seq(from=1, to=5, length=17)
> x
[1] 1.00 1.25 1.50 1.75 2.00 2.25 2.50 2.75
[9] 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
[17] 5.00
> y <- x<5 # help("<") shows list of relational operators
> y
[1] TRUE TRUE TRUE TRUE TRUE TRUE
[7] TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE TRUE TRUE FALSE
> sum(x<5)
[1] 16
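sum() works on a logical vector because TRUE counts as 1 and FALSE as 0. The same idea gives proportions, and a few other base R functions summarise logical vectors directly. A brief sketch using the same x (the name below_five is illustrative):

```r
x <- seq(from = 1, to = 5, length.out = 17)
below_five <- x < 5

n_below    <- sum(below_five)     # how many values are below 5
prop_below <- mean(below_five)    # proportion of values below 5
not_below  <- which(!below_five)  # positions where the condition is FALSE

any_large  <- any(x > 4.9)        # is at least one value above 4.9?
all_ge_one <- all(x >= 1)         # are all values at least 1?

n_below
prop_below
not_below
```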
Missing values are designated by NA
> z <- c(1:3,NA)
> z
[1] 1 2 3 NA
> is.na(z)
[1] FALSE FALSE FALSE TRUE
> mean(z)
[1] NA
> mean(z, na.rm=TRUE)
[1] 2
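Two further points are worth trying out: counting or dropping missing values, and the fact that comparing with NA using == does not work as a test for missingness. A small sketch (variable names are illustrative):

```r
z <- c(1:3, NA)

n_missing  <- sum(is.na(z))   # number of missing values
z_complete <- z[!is.na(z)]    # keep only the non-missing values
mean(z_complete)              # same result as mean(z, na.rm = TRUE)

# Pitfall: == with NA yields NA, not TRUE/FALSE, so use is.na() instead
comparison <- z == NA
comparison
```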
Character strings
> char <- c("hello","world","!"); char
[1] "hello" "world" "!"

Vectors cannot combine numbers and characters; the numbers are converted to character strings:

> char <- c("hello",3:5,"world"); char
[1] "hello" "3" "4" "5" "world"
> char <- c(char, NA); char
[1] "hello" "3" "4" "5" "world" NA
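The conversion can be checked with class(), and reversed with as.numeric() for those elements that really are numbers. A minimal sketch (the names mixed and numbers_back are illustrative):

```r
mixed <- c("hello", 3:5, "world")
class(mixed)   # the whole vector is of type character

# Converting back: elements that are not numbers become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
numbers_back <- suppressWarnings(as.numeric(mixed))
numbers_back

# Keep only the values that could be converted
numbers_only <- numbers_back[!is.na(numbers_back)]
numbers_only
```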
Selecting subsets of vectors using [ ]
> x <- 10:30
> x[2]
[1] 11
> x[1:5]
[1] 10 11 12 13 14
Selecting subsets of vectors using [ ] and boolean vectors
> x <- 10:30
> x[x>25]
[1] 26 27 28 29 30
> x <- c(seq(from=5, to=10, by=0.5), NA,
        seq(from=11, to=15, by=0.5), NA,
        seq(from=16, to=20, by=0.5))
> x[!is.na(x)]
[1] 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5
[9] 9.0 9.5 10.0 11.0 11.5 12.0 12.5 13.0
[17] 13.5 14.0 14.5 15.0 16.0 16.5 17.0 17.5
[25] 18.0 18.5 19.0 19.5 20.0
Changing parts of vectors using [ ]
> x[32] <- 200
> x[c(10,29)] <- c(1,100)
> x[x>15] <- NA
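The three assignments above produce no output, so their effect is easy to miss. The sketch below repeats the same kinds of assignment on a short made-up vector and prints the result after each step; which() is used in the last step as a cautious way to ignore NA values in the condition.

```r
x <- c(5, 8, 12, 20)

x[2] <- 9                  # replace a single element
x

x[c(1, 4)] <- c(1, 100)    # replace several elements at once
x

x[6] <- 200                # index beyond the current length:
x                          # the vector grows and the gap is filled with NA

x[which(x > 15)] <- NA     # replace all values matching a condition
x
```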
Finding the length of a vector
> x <- 1:5
> length(x)
[1] 5

> y <- 1:16
> len <- length(y); len
[1] 16
Data analysis workflow
Importing data into R
• R can import flat files using e.g. the commands:
read.table()
read.csv()
read.delim()
(with many options – check the help).

• R can also:
▪ Read Excel spreadsheets
▪ Read plenty of other formats
▪ Directly access databases
▪ Access files over the web
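As a self-contained sketch of a flat-file import, the example below first writes a tiny CSV file to a temporary location and then reads it back with read.csv(). The file name, column names and values are all made up for illustration.

```r
# Create a small CSV file to read (made-up content)
csv_file <- file.path(tempdir(), "example_measurements.csv")
writeLines(c("id,weight,group",
             "1,61.5,A",
             "2,82.0,B",
             "3,NA,A"),
           csv_file)

# Import: the first line holds the column names, "NA" is read as a missing value
measurements <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)

# Always inspect what was imported
str(measurements)    # column types
head(measurements)   # first rows
```

Checking the result with str() right after import catches many problems early, such as numeric columns read as text.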
Data frames
• Data frames are made of columns that all have the same number of elements
• They look like matrices, except that the columns can hold different variable types
• They are typically used to store data, with
▪ Each row being an experimental unit
▪ Each column being a measurement

> data[,1] # access first column

> data[, "data1"] # access column "data1"
> data$data1 # … same
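The three access forms above can be compared on a small data frame. The column names data1 and data2 follow the slide's naming; the values are made up.

```r
data <- data.frame(data1 = c(2.5, 3.1, 4.8),
                   data2 = c("a", "b", "c"))

by_position <- data[, 1]         # first column, by position
by_name     <- data[, "data1"]   # same column, by name
by_dollar   <- data$data1        # same column, with the $ shortcut

# All three return the same vector
identical(by_position, by_name)
identical(by_name, by_dollar)

# Rows and single cells use the same [row, column] notation
second_row  <- data[2, ]
single_cell <- data[2, "data2"]
```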
Creating data frames
> x <- 1:10
> y <- seq(from=5,to=10,length=10)
> z <- c("A","B","B","A","A","A","B","A","B","B")
> df <- data.frame(d1=x, d2=y, fact=z)
> df
   d1        d2 fact
1   1  5.000000    A
2   2  5.555556    B
…
> names(df)
[1] "d1" "d2" "fact"
> dim(df)
[1] 10 3
Adding new columns
> df$d3 <- 10:1
> df
   d1        d2 fact d3
1   1  5.000000    A 10
2   2  5.555556    B  9
…
> summary(df)
       d1              d2            fact                 d3
 Min.   : 1.00   Min.   : 5.00   Length:10          Min.   : 1.00
 1st Qu.: 3.25   1st Qu.: 6.25   Class :character   1st Qu.: 3.25
 Median : 5.50   Median : 7.50   Mode  :character   Median : 5.50
 Mean   : 5.50   Mean   : 7.50                      Mean   : 5.50
 3rd Qu.: 7.75   3rd Qu.: 8.75                      3rd Qu.: 7.75
 Max.   :10.00   Max.   :10.00                      Max.   :10.00
Select data from a data frame
• Select all values of "d2" for which "fact" is "B"
> df[ df$fact == "B", "d2" ]
[1] 5.555556 6.111111 8.333333 9.444444 10.000000

• Select all values of "d1" for which "fact" is "B" and "d2" > 7
> df[ (df$fact == "B" & df$d2 > 7), "d1" ]
[1] 7 9 10

• Select all values of "d3" for which "fact" is "B" or "d2" < 6
> df[ (df$fact == "B" | df$d2 < 6), "d3" ]
[1] 10 9 8 4 2 1
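The same selections can also be written with subset(), a base R function that lets you refer to columns without the df$ prefix. The sketch below rebuilds the data frame from the previous slides and checks that both forms agree; the result variable names are illustrative.

```r
# Rebuild the example data frame from the previous slides
df <- data.frame(d1   = 1:10,
                 d2   = seq(from = 5, to = 10, length.out = 10),
                 fact = c("A","B","B","A","A","A","B","A","B","B"))
df$d3 <- 10:1

# Bracket notation, as on the slide
by_brackets <- df[df$fact == "B" & df$d2 > 7, "d1"]

# Equivalent selection with subset(): condition first, then pick the column
by_subset <- subset(df, fact == "B" & d2 > 7)$d1

by_brackets
identical(by_brackets, by_subset)

# "or" condition, again in both notations
or_brackets <- df[df$fact == "B" | df$d2 < 6, "d3"]
or_subset   <- subset(df, fact == "B" | d2 < 6)$d3
identical(or_brackets, or_subset)
```

subset() is convenient for interactive work; the bracket form is often preferred inside functions and scripts.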
Exercise
• Import students.csv into a variable (call it data)
• Extract the weight of women only in a new variable
• Extract the weights of the people who weigh more than 80 kilos
• Extract the entries of men who weigh more than 80 kg (you can use the "&" operator to include two conditions)
If you do not know what to do:

1. Extract the weight of women only in a new variable
2. Extract the weights of the people who weigh more than 80 kilos
3. Extract the entries of men who weigh more than 80 kg (you can use the "&" operator to include two conditions)
