BIA 5303 Big Data 2
Module 1
Introduction to R
Reference: Based on Prabhpreet Sidhu's slides
Lecturer: Anas Kuzechie
Objectives
✓ Understand the importance of big data in data science.
✓ See the evolution of big data.
✓ Look at the landscape of big data.
✓ Install R interpreter and RStudio IDE.
✓ Learn to program in R using RStudio.
Big Data
✓ Big Data is a discipline that studies four aspects:
✓ Application of Big Data analysis to enhance business performance.
Big Data History
✓ Enterprise Data Warehouse: for small, refined, and relational data storage.
✓ NoSQL database: stores data in a format other than relational tables.
✓ Enterprise Data Lake: for large, raw, and undefined data storage.
✓ Open-source framework that manages the storage and processing of large amounts of data for applications.
✓ Real-time data streaming.
✓ Data stored on company's premises.
✓ Cloud-based processing (Microsoft, Google, etc.).
✓ Hybrid processing (on premise and cloud).
✓ Business Intelligence.
✓ Machine Learning.
✓ Natural Language Processing.
Big Data Era 2
Streaming
✓ Stream Processing: continuously query and analyze data in real-time, as it arrives.
✓ Examples: Sensors, traffic, web events, health, social media, gaming.
Big Data Landscape Summary
✓ Big Data: data sets that are so large or complex that traditional software cannot
deal with them.
✓ Volume: terabytes to exabytes of data to store and process.
✓ Velocity: streaming data, milliseconds to respond.
✓ Variety: data in many forms.
✓ Data Storage:
✓ How? Data warehouse vs data lake.
✓ Where? On-premise vs cloud.
✓ Data Processing:
✓ Where? On-premise vs cloud.
✓ When? Batch vs streaming.
Structured vs Unstructured Data
✓ Structured Data is data that fits neatly into a table with columns and rows, e.g.,
transactional data, financial data, etc.
✓ Unstructured Data is data that does not fit into a table, e.g., images, videos,
audio, tweets, etc. To interact with such data, we need special tools and database
structures such as the Hadoop ecosystem.
Installing R and RStudio
https://posit.co/download/rstudio-desktop/
✓ R Interpreter
✓ R Integrated Development Environment (IDE)
RStudio
✓ R script: a file with extension .R containing R code.
✓ Environment shows objects in memory with assigned value(s).
✓ Console is where we can type commands and see output.
✓ Files shows all files and folders in your default workspace.
✓ Plots will show all the graphs.
✓ Packages will list a series of packages needed to run certain processes.
Example
Click on the dotted square to
see the data on the top left.
Statement to generate a matrix
having 2 rows and 3 columns
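The statement this callout points to appears only in the slide screenshot. A minimal sketch of such a statement, assuming the values 1 to 6 and the object name `m` (both illustrative, not taken from the slide), might look like this:

```r
# Illustrative values and object name; not the slide's own statement.
m <- matrix(1:6, nrow = 2, ncol = 3)  # values are filled column by column by default

print(m)  # show the matrix on the Console
dim(m)    # number of rows, then number of columns
```

After running it, `m` should appear in the Environment tab, where it can be opened in the data view.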
History Tab
✓ History tab keeps a record of all previous commands.
✓ We can select the commands we want and send them to an R script.
Click To Source to copy the
selected commands to the source file.
Setting Default Working Directory
Set default working directory that
will have all your R source files.
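The working directory can also be inspected and changed from the Console. The sketch below uses `tempdir()` only as a stand-in for your own folder of R source files; the name `old_dir` is illustrative.

```r
# Remember and show the current working directory.
old_dir <- getwd()
print(old_dir)

# Switch to another folder. tempdir() is only a placeholder here;
# replace it with the path of the folder that holds your R source files.
setwd(tempdir())
print(getwd())

# Return to the original directory.
setwd(old_dir)
```

Note that `setwd()` changes the directory for the current session only.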
R Script
✓ RStudio interface has four windows:
✓ Console.
✓ Environment and History.
✓ Files, Plots, Packages, and Help
✓ R Scripts and Data View.
✓ Creating an R script: choose File > New File > R Script, or use the New File button on the toolbar.
✓ Running an R script: select the commands to execute, then click Run to see the output on the Console.
Packages Tab
✓ Packages tab shows the list of installed add-ons (packages). If checked, the package is loaded into R.
✓ We can also install other add-ons by clicking on the Install icon.
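The same two actions can be done from the Console. In this sketch, `ggplot2` is only an example of an add-on name, and the install line is commented out because it needs internet access; `tools` is used for loading because it ships with R.

```r
# Installing an add-on (run once). "ggplot2" is just an example name.
# install.packages("ggplot2")

# Loading a package corresponds to ticking its checkbox in the Packages tab.
# "tools" is part of a standard R installation, so nothing needs downloading.
library(tools)

# Check that the package is now attached to the session.
loaded <- "package:tools" %in% search()
print(loaded)
```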
Plots Tab
R Programming Language
✓ R supports object-oriented programming (OOP). Everything we do in R can be
saved in an object, and functions operate on those objects.
✓ OOP is designed to reduce the amount of code required to accomplish a task. In
the case of R, this means the amount of code needed to perform statistical analysis.
✓ Numbers, datasets, or the output of a linear regression can all be stored in an
object (variable) using the <- operator.
✓ R is case sensitive! Check for this first when you get errors.
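A short sketch of assignment with `<-` and of the case-sensitivity pitfall. The object name `total` is illustrative; `tryCatch()` is used only so the script keeps running after the error.

```r
# Store a number in an object with the <- operator.
total <- 10
print(total)

# R is case sensitive: 'Total' is a different name from 'total'.
# Asking for 'Total' raises an error, which is captured here as a message.
result <- tryCatch(Total, error = function(e) conditionMessage(e))
print(result)
```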
✓ R Objects:
✓ Single entry
✓ Vector
✓ Matrix
✓ Dataframe
✓ List
Comments
In R, we can annotate our code with comments. Just preface the line with a hash mark (#), and anything that comes thereafter will be ignored by the interpreter.
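A minimal illustration of the hash mark rule for comments, with illustrative object names:

```r
# This whole line is a comment and is ignored by the interpreter.
x <- 5      # text after the hash mark on a code line is ignored too
y <- x * 2  # only the code before the hash mark is executed
print(y)
```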
Single Entry
✓ The most basic data class in R: a single number or a single string.
✓ Example
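A minimal sketch of single entries, with illustrative names and values (not the slide's own example):

```r
# Illustrative names and values.
age <- 25            # a single number
greeting <- "hello"  # a single string

class(age)       # "numeric"
class(greeting)  # "character"
```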
Vector
✓ A vector contains a series of numbers or strings of one consistent type. We create
a vector using the c() function.
✓ Example
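A minimal sketch of creating vectors with `c()`, using illustrative names and values. The last line shows what "one consistent type" means in practice: mixing a number and a string turns every element into a string.

```r
# Illustrative names and values.
scores <- c(90, 85, 72)          # numeric vector
people <- c("Ann", "Bob", "Cy")  # character vector

length(scores)  # number of elements
scores[2]       # second element (indexing starts at 1)

# Mixing types forces a single common type.
mixed <- c(1, "a")  # both elements become strings: "1" "a"
class(mixed)        # "character"
```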
Matrix
✓ A matrix is a series of equal-length vectors of the same type, arranged in rows and columns.
✓ Example
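One way to see a matrix as a series of vectors is to bind vectors together. This sketch uses `rbind()` and `cbind()` with illustrative names and values:

```r
# Two numeric vectors of the same length (illustrative values).
row1 <- c(1, 2, 3)
row2 <- c(4, 5, 6)

m_rows <- rbind(row1, row2)  # each vector becomes a row: 2 rows, 3 columns
m_cols <- cbind(row1, row2)  # each vector becomes a column: 3 rows, 2 columns

dim(m_rows)
dim(m_cols)
m_rows[2, 3]  # element in row 2, column 3
```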
Data Frame
✓ A data frame is a series of equal-length vectors that can be of different types.
✓ Example
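A minimal sketch of a data frame whose columns have different types. The column names and values are illustrative.

```r
# Each column is a vector; the columns may differ in type
# but must all have the same length.
df <- data.frame(
  name   = c("Ann", "Bob", "Cy"),  # strings
  score  = c(90, 85, 72),          # numbers
  passed = c(TRUE, TRUE, FALSE)    # logical values
)

str(df)   # compact summary of columns and their types
df$score  # extract one column as a vector
```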
List
✓ List can be a combination of the previous four types. For example, the output of a
regression is a list.
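A sketch of both points, with illustrative names and made-up values: a hand-built list that mixes object types, and a check that the result of `lm()` is itself a list.

```r
# A list can hold objects of different kinds (all values are illustrative).
info <- list(
  id    = 1,                                           # single entry
  tags  = c("a", "b"),                                 # vector
  table = data.frame(x = 1:4, y = c(2.1, 3.9, 6.2, 7.8))  # data frame
)
info$tags  # access an element by name

# The output of a linear regression is stored as a list.
fit <- lm(y ~ x, data = info$table)
is.list(fit)       # TRUE
names(fit)         # the components stored in the result
fit$coefficients   # one of those components
```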
R Pros & Cons
Pros
✓ Fast and free
✓ R is way ahead of SPSS and SAS
✓ Second only to Matlab for graphics
✓ Active user community
✓ Excellent for simulation, programming, computer intensive analysis, etc.
✓ Forces you to think about your analysis
✓ Interfaces with database systems such as MySQL.
Cons
✓ Steep learning curve
✓ No commercial support
✓ Easy to make mistakes and not know
✓ Working with large datasets is limited by RAM
✓ Data preparation and cleaning can be messier and more mistake prone in R vs SPSS or SAS