CS636 Data Analytics with R
Programming
Instructor
David Li
Course Logistics
• Basic information
• Requirements
• Goal
CS636 Data Analytics with R Programming
• Class Schedule: Saturday 9:00 am - 11:50 am, Fenster Hall 160
• Instructor: David Li, email: dli@njit.edu, tel: 631-800-3381
• TA: Maggie Zhang, email: mz339@njit.edu, tel: 908-917-1528
• Office Hours: Sunday, 11:00 am - 12:00 pm, in the library, or by appointment.
• Textbooks
– R Programming for Data Science, by Roger D. Peng
– Using R for Introductory Statistics, by John Verzani, 2014, ISBN 1466590734
– Advanced R, by Hadley Wickham, ISBN 9781466586963
• Website
– https://njit.instructure.com/courses/10227
Requirements
• Homework & computing lab exercise (10%)
• Quiz (20%)
• Term Project (10%)
• Midterm (20%)
• Final (40%)
Sign the attendance sheet at the end of each class. A bonus based on
attendance is to be determined.
Homework (5 %)
• Homework assignments
– Work independently. Discussion is allowed, but copying is
forbidden.
• Homework Grading Policy
– There may be several homework assignments, but only one
(your worst one) will be graded. So if you miss one
assignment, you get 0.
• Late homework policy
– 25% penalty per late day
– Not accepted more than 3 days late
Lab exercise (5 %)
• Have a lab session every week
• Lab exercises
– Focus on R computing exercises
– 3 students per group. Please find your group mates as quickly as
possible.
– Some answers may be selected for discussion at the end of the lab
session.
– Lab exercise grade is based on the attendance sheet.
Two Term Projects (10%)
• Submit code and a report summarizing what you did
and the results you obtained.
• Prepare a presentation and demo.
• 1 to 4 students per group. It can be the same as your lab group.
• More details to be announced soon
• Cheating/copying is strictly prohibited. I will report it to the
Dean and you will get an F in this course.
• If you think your group members are not contributing,
talk to me.
Quiz (20%)
• Focus on course materials.
• 5 Quizzes
• Every other week
Two Exams (60%)
• One midterm and one Final (20%+40%)
– In-class
– closed book
– a cheat sheet is allowed
– Final is cumulative
Some tips
• Computers/smartphones are not allowed in quizzes/exams
• You should memorize the basic syntax and the usage of
functions
• Prior to a quiz/exam, review the slides and Jupyter sample code
• If I discover cheating, I will report the incident to the Dean of
Students office as an academic integrity violation. (TAs report
incidents to the course instructor.)
Goal
• Gain programming proficiency in R
• Become familiar with the analytical techniques commonly used
in data science
• Develop a data science way of thinking
– Learn how to preprocess, explore and interpret real data
– Learn how to model real problems using computational
techniques
Prerequisites
• Basic programming skills
• Linear algebra
• Probability
• Statistics
Tentative course topics
(Subject to change according to progress)
• R libraries for data science: the most common knowledge,
which you can easily apply to Python
• Visualization with probability and statistics basics
• Regular expressions and NLP for text processing
• Machine learning algorithms (a lot of math)
• Model/feature selection (more math)
• May cover advanced big data and deep learning
Intro to R
David Li
What is R?
• A statistical programming language similar to S-PLUS
• Interpreted language (like MATLAB)
• Has many built-in (statistical) functions
• Easy to build your own functions
• Good graphic displays
• Extensive help files
Strengths
• Many built-in functions
• Can get other functions from the internet by downloading
libraries (packages)
• Relatively easy data manipulation
Weaknesses
• Not as commonly used by non-statisticians
• Not a compiled language: the interpreter can be very slow,
but R allows you to call your own C/C++ code
R packages
• Packaging: a crucial infrastructure to efficiently produce, load and
keep consistent software libraries from (many) different sources /
authors
• Statistics
– most packages deal with statistics and data analysis
– State of the art: many statistical researchers provide their methods as R packages
A sample job opening
When to use R?
• When
– The task requires standalone computing or analysis on individual servers.
– The work is exploratory: R is handy for almost any type of data
analysis because of its huge number of packages and the tools
needed to get up and running quickly
– R can even be part of a big data solution.
How to use/learn R?
• How
– (optional) Install and use the RStudio IDE
– (optional) Install Jupyter with the R kernel
– Get started with R (basic syntax)
– Learn and use the popular packages
• dplyr, plyr and reshape2 for data manipulation
• stringr for string operations
• ggplot2 for data visualization
• …
– Practice (a lot), including on real projects
Install RStudio
• An integrated development environment (IDE) available for R
– a nice editor with syntax highlighting
– there is an R object viewer
– there are a number of other nice features that are integrated
• How to install
– https://www.youtube.com/watch?v=9-RrkJQQYqY
Install Jupyter with R kernel
1. Install R and RStudio
2. Download and install the latest Anaconda from
https://www.anaconda.com/download/
3. On Windows, add your R bin path and Anaconda3 Scripts path to your
"Path" environment variable
– On my computer the R bin path is C:\Program Files\R\R-3.5.1\bin
– The Anaconda3 Scripts path is C:\ProgramData\Anaconda3\Scripts; the paths on your
computer may vary.
– How to set the path and environment variables in Windows:
https://www.computerhope.com/issues/ch000549.htm
4. Install the R kernel for Jupyter (PLEASE DO THIS STEP IN THE R CONSOLE, not in RStudio or
RGui)
https://irkernel.github.io/installation/
https://stackoverflow.com/questions/44056164/jupyter-client-has-to-be-installed-but-jupyter-kernelspec-version-exited-wit
5. Then you can start "Jupyter Notebook" from the Start menu.
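The kernel installation step can be sketched as follows. This is a minimal sketch that follows the IRkernel installation page linked above; it assumes an internet connection, that Jupyter is already installed, and that R was started as a plain console from a terminal where the `jupyter` command is on the Path.

```r
# Assumption: run in a plain R console (not RStudio or RGui), with `jupyter` on the Path.
# Install the IRkernel package from CRAN only if it is not already present.
if (!requireNamespace("IRkernel", quietly = TRUE)) {
  install.packages("IRkernel")
}

# Register the R kernel with Jupyter for the current user.
IRkernel::installspec()
```

After this, "R" should appear in the list of kernels when creating a new notebook.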
Starting and stopping R
• Starting
– Windows: Double click on the R icon
– Unix/Linux: type R (or the appropriate path on your
machine)
• Stopping
– Type q()
– q() is a function call; everything that happens in R is a function call
– Typing q without the parentheses merely prints the definition of the function
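The difference between a function's name and a call to it can be checked without leaving the session. A minimal sketch; the variable names `quit_fn`, `is_fn` and `arg_names` are made up for illustration.

```r
# Using the name without parentheses gives the function object; nothing is executed.
quit_fn <- q
is_fn <- is.function(quit_fn)     # TRUE: q is an ordinary function object
arg_names <- names(formals(q))    # names of the arguments that q() accepts

print(is_fn)
print(arg_names)

# q()   # left commented out: calling the function would end the R session
```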
Writing R code
• Can input lines one at a time into R
• Can write many lines of code in your favorite text editor
(including RStudio) and run them all at once
– Simply paste the commands into R
– Use source("path/yourscript") to run in batch mode the
code saved in the file "yourscript" (use
source("path/yourscript", echo = TRUE) to have the commands echoed)
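Running a saved script can be tried without creating files by hand. A minimal sketch, assuming a temporary file stands in for "yourscript"; passing `echo = TRUE` to `source()` is one way to have each command echoed as it runs.

```r
# Write a small script to a temporary file (a stand-in for "yourscript").
script_path <- tempfile(fileext = ".R")
writeLines(c("x <- 3 + 5", "y <- x * 2", "y"), script_path)

# Run the whole file; echo = TRUE prints each command before its result.
source(script_path, echo = TRUE)

# Clean up the temporary file.
unlink(script_path)
```

By default `source()` evaluates the script in the global environment, so `x` and `y` remain available afterwards.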
R as a Calculator
> log2(32)
[1] 5
> sqrt(2)
[1] 1.414214
> seq(0, 5, length=6)
[1] 0 1 2 3 4 5
> plot(sin(seq(0, 2*pi, length=100)))
[Figure: sin(seq(0, 2 * pi, length = 100)) plotted against Index (0 to 100); y-axis from -1.0 to 1.0]
Recalling Previous Commands
• On Windows/Unix, use the up arrow key or the
history command in the menus
• From the history window, you can copy commands
and paste them into the console window
Language layout
• Three types of statement
– expression: it is evaluated, printed, and the value is lost (3+5)
– assignment: passes the value to a variable but the result is not
printed automatically (out<-3+5)
– comment: (#This is a comment)
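The three statement types can be seen side by side in a short sketch; `out2` is an extra name introduced here for illustration.

```r
3 + 5              # expression: evaluated and printed, but the value is not stored
out <- 3 + 5       # assignment: the value is stored in `out`, nothing is printed
out                # evaluating the name prints the stored value
(out2 <- out * 2)  # wrapping an assignment in parentheses also prints the value
# This whole line is a comment and is ignored by R
```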
Naming conventions
• Use letters, digits, underscore, and '.'; a name must start with a letter,
or with '.' not followed by a digit
• Avoid using system names: c, q, s, t, C, D, F, I, T, diff, mean, pi,
range, rank, tree, var
• These rules hold for variables, data and functions
• Variable names are case sensitive
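A short sketch of why these rules matter. All variable names below are made up for illustration; `make.names()` is the base R helper that turns an arbitrary string into a syntactically valid name.

```r
# Names are case sensitive: these are two distinct variables.
value <- 1
Value <- 2
same_object <- identical(value, Value)   # FALSE

# T is an ordinary variable holding TRUE, so it can be masked by accident.
T <- 0
masked <- isTRUE(T)      # FALSE while our own T exists
rm(T)                    # remove our variable; the built-in T is visible again
restored <- isTRUE(T)    # TRUE

# A name may not start with a digit or contain spaces; make.names() repairs it.
valid <- make.names("2nd value")

print(c(same_object, masked, restored))
print(valid)
```

Writing `TRUE` and `FALSE` in full avoids the masking problem, since those are reserved words.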
Arithmetic operations and functions
• Most operations in R are similar to Excel and calculators
• Basic: +(add), -(subtract), *(multiply), /(divide)
• Exponentiation: ^
• Remainder or modulo operator: %%
• Matrix multiplication: %*%
• sin(x), cos(x), cosh(x), tan(x), tanh(x), acos(x), acosh(x), asin(x),
asinh(x), atan(x), atan2(y, x), atanh(x)
• abs(x), ceiling(x), floor(x)
• exp(x), log(x, base=exp(1)), log10(x), sqrt(x), trunc(x) (the nearest integer
in the direction of zero)
• max(), min(), mean(), median()
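A few of the operators and functions above, in a minimal sketch. The names `r1` to `r6`, `A`, `elementwise` and `matprod` are illustrative only.

```r
r1 <- 7 %% 3             # remainder: 1
r2 <- 2 ^ 10             # exponentiation: 1024
r3 <- trunc(-2.7)        # rounds toward zero: -2
r4 <- floor(-2.7)        # rounds down: -3
r5 <- log(8, base = 2)   # logarithm with an explicit base: 3
r6 <- atan2(1, 1)        # angle of the point x = 1, y = 1: pi/4

# A matrix is filled column by column, so the rows of A are (1, 3) and (2, 4).
A <- matrix(1:4, nrow = 2)
elementwise <- A * A     # element-by-element product
matprod <- A %*% A       # true matrix product

print(elementwise)
print(matprod)
```

Note that `*` and `%*%` give different results on matrices, which is a common source of errors.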
Defining new variables
• Assignment symbol: use "<-" (RStudio shortcut: Alt+-) or "="
• Scalars
>scal<-6
>value<-7
• Vectors; using c() to enter data
>whales<-c(74,122,235,111,292,111,211,133,16,79)
>simpsons<-c("Homer", "Marge", "Bart", "Lisa", "Maggie")
• Factors
>pain<-c(0,3,2,2,1)
>fpain<-factor(pain,levels=0:3)
>levels(fpain)<-c("none", "mild", "medium", "severe")
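To see what the factor example produces, the sketch below repeats it and inspects the result. The helper names `fpain_labels`, `fpain_codes` and `counts` are introduced here for illustration.

```r
pain <- c(0, 3, 2, 2, 1)
fpain <- factor(pain, levels = 0:3)
levels(fpain) <- c("none", "mild", "medium", "severe")

fpain_labels <- as.character(fpain)  # the label attached to each observation
fpain_codes <- as.integer(fpain)     # the underlying integer codes (positions in levels)
counts <- table(fpain)               # how many observations fall in each level

print(fpain)
print(counts)
```

The integer codes start at 1, so the original value 0 is stored as code 1 with the label "none".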
Use functions on a vector
• Most functions work on vectors exactly as we would want
them to
>sum(whales)
>length(whales)
>mean(whales)
– sort(), min(), max(), range(), diff(), cumsum()
• Vectorization of (arithmetic) functions
>whales + whales
>whales - mean(whales)
– Other arithmetic funs: sin(), cos(), exp(), log(), ^, sqrt()
– Example: calculate the standard deviation of whales
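One possible solution to the standard deviation exercise, as a sketch built only from vectorized arithmetic. It assumes the sample standard deviation (divisor n - 1), which is also what the built-in `sd()` computes; `n`, `deviations` and `manual_sd` are illustrative names.

```r
whales <- c(74, 122, 235, 111, 292, 111, 211, 133, 16, 79)

n <- length(whales)
deviations <- whales - mean(whales)              # vectorized: one deviation per element
manual_sd <- sqrt(sum(deviations^2) / (n - 1))   # sample standard deviation

print(manual_sd)
print(all.equal(manual_sd, sd(whales)))          # compare with the built-in function
```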
Functions that create vectors
• Simple sequences
>1:10
>rev(1:10)
>10:1
>c(1:10, 10:1)
>library(MASS) #to have fractions()
>fractions(1/(2:10))
• Arithmetic sequence
– a+(n-1)*h: how to generate 1, 3, 5, 7, 9?
>a=1; h=2; n=5
>a+h*(0:(n-1))
OR
>seq(1,9,by=2)
>seq(1,9,length=5)
• Repeated numbers
>rep(1,10)
>rep(1:2, c(10,15))
– getting help: ?rep or help(rep)
– help.search("keyword") or ??keyword
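The sequence and repetition functions above can be compared in a short sketch. The variable names are illustrative; `times` and `each` are the standard arguments of `rep()`, and `length.out` is the full name of the `length` argument of `seq()`.

```r
ten_ones <- rep(1, 10)                 # the value 1 repeated ten times
unequal <- rep(1:2, c(10, 15))         # 1 ten times, then 2 fifteen times
cycled <- rep(1:2, times = 3)          # the whole vector repeated: 1 2 1 2 1 2
each3 <- rep(1:2, each = 3)            # each element repeated: 1 1 1 2 2 2

odd_by <- seq(1, 9, by = 2)            # fix the step size
odd_len <- seq(1, 9, length.out = 5)   # fix the number of points instead

print(table(unequal))
print(odd_by)
```

Both `seq()` calls give 1, 3, 5, 7, 9; which form is more convenient depends on whether the step or the number of points is known.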
Next week
• More data structures and R packages
• Homework 1
• Please find your lab group mates and sit together. I expect 13
groups of 3 (39 students in total).