Data Analytic Using R - Advanced

MIS2502: Data Analytics
Advanced Analytics Using R

Zhe (Joe) Deng


deng@temple.edu
http://community.mis.temple.edu/zdeng

The Information Architecture of an Organization
Now we’re here…

[Figure: Data entry → Transactional Database → Data extraction → Analytical Data Store → Data analysis]
• Transactional Database: stores real-time transactional data
• Analytical Data Store: stores historical transactional and summary data
What is Advanced Data Analytics/Mining?
• The examination of data or content using sophisticated
techniques and tools, to
• discover deeper insights,
• make predictions, or
• generate recommendations.

• Goals:
• Extraction of implicit, previously unknown, and potentially useful information from data
• Exploration and analysis of large data sets to discover meaningful patterns
• Prediction of future events based on historical data
What data analytics/mining is not…
Sales analysis
• How do sales compare in two different stores in the same state?

Profitability analysis
• Which product lines are the highest revenue producers this year?

Sales force analysis
• Did salesperson X meet this quarter’s target?

If these aren’t data mining examples, then what are they?
Advanced data analytics/mining is about…
Sales analysis
• Why do sales differ between two stores in the same state?

Profitability analysis
• Which product lines will be the highest revenue producers next year?

Sales force analysis
• How likely is salesperson X to meet next quarter’s target?
Example: Smarter Customer Retention
• Consider a marketing manager for a brokerage company
• Problem: High churn (customers leave)
• Customers get an average reward of $160 to open an account
• 40% of customers leave after the 6-month introductory period
• Giving incentives to everyone who might leave is expensive
• Getting a customer back after they leave is expensive
Answer: Not all customers have the same value
• One month before the end of the introductory period, predict which customers will leave
• Offer those customers something based on their future value
• Ignore the ones that are not predicted to churn
Three Analytics Tasks We Will Be Doing in this Class
• Classification (Decision Tree Approach)
• Clustering Analysis
• Association Rule Learning
Decision Trees (To Realize Classification)
• Used to classify data according to a pre-defined outcome
• Based on characteristics of that data

Uses:
• Predict whether a customer should receive a loan
• Flag a credit card charge as legitimate
• Determine whether an investment will pay off
Cluster Analysis
• Used to determine distinct groups of data
• Based on data across multiple dimensions

Uses:
• Customer segmentation
• Identifying patient care groups
• Performance of business sectors
Association Rule Learning
• Find out which events predict the occurrence of other events
• Often used to see which products are bought together

Uses:
• What products are bought together?
• Amazon’s recommendation engine
• Telephone calling patterns
Introduction to R and RStudio
• R has become one of the dominant languages for data analysis
• A large user community
• Thousands of third-party packages that contribute functionality

Install both R and RStudio on your computer, following the installation instructions on our website.
http://www.kdnuggets.com/2015/05/poll-r-rapidminer-python-big-data-spark.html

R
• Software development platform and language
• Open source, free
• Many, many, many statistical add-on “packages” that perform data analysis

RStudio
• Integrated Development Environment (IDE) for R
• Nicer interface that makes R easier to use
• Requires R to run

• After installing both, you only need to interact with RStudio
• Mostly, you do not need to touch R directly
RStudio Interface
[Figure: RStudio window with the Script Panel, Environment Panel, Console Panel, and Utility Panel labeled]
•Script Panel
• This is where the R code is shown and edited
• When you open an R code file, its contents show up here
•Console Panel
• This is where R code is executed. Results will show up here
• If there is an error in your code, the error message will also show up here
•Environment Panel
• This is where the variables and data are displayed
• It helps to keep track of the variables and data you have
•Utility Panel
• This window includes several tabs
• Files: shows the path to your current file, not often used
• Plots: if you use R to plot a graph, it will show up here
• Packages: install/import packages, more on this later
• Help: manuals and documentation for every R function, very useful
Creating and opening a .R file
• The R script is where you keep a record of your work in R/RStudio.

• To create a .R file
• Click “File|New File|R Script” in the menu

• To save the .R file
• Click “File|Save”

• To open an existing .R file
• Click “File|Open File” to browse for the .R file
The Basics:
• Calculation
• Variable & Value
• Function & Argument(Parameter)
• Basic Data Types: Numeric, Character, Logical
• Advanced Data Types: Vector, Data Frame
• Packages
• Loading data to R
• Working Directory
The Basics: Calculations
• In its simplest form, R can be used as a calculator:

Type commands into the console and it will give you an answer
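As a small illustration of the calculator idea, the lines below can be typed into the console one at a time. The specific numbers are made up for this sketch; only standard arithmetic operators are used.

```r
# Each expression is evaluated as soon as it is entered at the console.
2 + 3          # addition: 5
10 / 4         # division: 2.5
2^3            # exponentiation: 8
(1 + 2) * 3    # parentheses control the order of operations: 9
17 %% 5        # remainder after division: 2
```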
The Basics: Variable & Value
• Variable & Value

An assignment such as x <- 5 is read from right to left as “Assign [value] 5 to [variable] x”.

Behind the scenes, R stores the value 5 in memory and binds the name “x” to it, so that later uses of x retrieve that value.
The Basics: Variable & Value
• Variables are named containers for data

• The assignment operator in R is: <- or =

• Variable names must start with a letter (or a dot that is not followed by a digit); digits can appear after the first character.
• A name cannot start with a digit or be a number by itself.
• Examples: result, x1, b2 (not 2b or 2)

• R is case-sensitive (i.e. Result is a different variable than result)

At the top level, <- and = do the same thing

x, y, and z are variables that can be manipulated
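The points about assignment, naming, and case sensitivity can be tried out with a short sketch like the one below. The values are arbitrary, and make.names() is used here only as a convenient way to show that a name starting with a digit is not valid as written.

```r
x <- 5              # assign the value 5 to the variable x
y = 10              # "=" also assigns when used at the top level
z <- x + y          # variables can be used in calculations

Result <- "upper"
result <- "lower"   # a different variable: R is case-sensitive

x1 <- 3             # digits are fine after the first character

# make.names() turns a string into a syntactically valid name.
# A valid name comes back unchanged; an invalid one is altered.
make.names("x1")    # returned unchanged, so "x1" is valid
make.names("2b")    # returned in altered form, so "2b" is not valid
```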
The Basics: Function & Argument (Parameter)
• Function & Argument

Function: rm(ARGUMENT). rm() here is a built-in function. You can also define your own functions.
Some functions are used to return a value, such as AVG() in SQL. Others are used to carry out an operation, such as rm(), which removes a variable. A function can take no arguments, a single argument, or multiple arguments.
The Basics: Function & Argument (Parameter)

sqrt(), log(), abs(), and exp() are functions.

Functions accept parameters (in parentheses) and return a value
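A minimal sketch of the ideas above: calling built-in functions, defining your own function, and using rm() for its effect rather than its return value. The names rectangle_area and scratch_value are hypothetical and chosen only for this illustration.

```r
sqrt(16)        # square root
abs(-3)         # absolute value
log(exp(2))     # exp() and log() (natural logarithm) undo each other

# A user-defined function that takes two arguments and returns a value.
rectangle_area <- function(width, height) {
  width * height
}
area <- rectangle_area(3, 4)

# rm() carries out an operation: it removes a variable.
scratch_value <- 1
rm(scratch_value)
exists("scratch_value")   # the variable is gone
```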
Simple statistics with R
• You can get descriptive statistics from a vector
> scores
[1] 65 75 80 88 82 99 100 100 50
> length(scores)
[1] 9
> min(scores)
[1] 50
> max(scores)
[1] 100
> mean(scores)
[1] 82.11111
> median(scores)
[1] 82
> sd(scores)
[1] 17.09857
> var(scores)
[1] 292.3611
> summary(scores)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  50.00   75.00   82.00   82.11   99.00  100.00

Again, length(), min(), max(), mean(), median(), sd(), var() and summary() are all functions.

These functions accept vectors as parameters.
The Basics: Basic Data Types

Type                Range            Assign a Value
Numeric             Numbers          X <- 1
                                     Y <- -2.5
Character           Text strings     name <- "Mark"
                                     color <- "red"
Logical (Boolean)   TRUE or FALSE    female <- TRUE
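To check which basic type a value has, the built-in class() function can be used. The sketch below reuses the assignments from the table above.

```r
X <- 1
Y <- -2.5
name <- "Mark"
female <- TRUE

class(X)        # "numeric"
class(Y)        # "numeric"
class(name)     # "character"
class(female)   # "logical"
```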


The Basics: Advanced Data Types – Vector & Data Frame
• Vectors
• Vector: a combination of elements (e.g. numbers, words) of the same basic type, usually created using c(), seq(), or rep()

• Data frames
• Data frame: a table consisting of one or more vectors of equal length
Vector Examples
> scores<-c(65,75,80,88,82,99,100,100,50)
> scores
[1] 65 75 80 88 82 99 100 100 50
> studentnum<-1:9
> studentnum
[1] 1 2 3 4 5 6 7 8 9
> ones<-rep(1,4)
> ones
[1] 1 1 1 1
> names<-c("Nikita","Dexter","Sherlock")
> names
[1] "Nikita" "Dexter" "Sherlock"

c() and rep() are functions
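seq() is mentioned as a way to create vectors but is not shown in the examples. The sketch below, with variable names invented for illustration, shows seq() and a second form of rep(), and also what happens when c() is given elements of different basic types.

```r
evens <- seq(2, 10, by = 2)               # 2 4 6 8 10
pattern <- rep(c("a", "b"), times = 2)    # "a" "b" "a" "b"

# A vector holds one basic type, so mixed elements are converted
# to a common type (here, everything becomes text).
mixed <- c(1, "two", TRUE)
class(mixed)                              # "character"
```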
Indexing Vectors
• We use brackets [ ] to pick specific elements in the vector.

• In R, the index of the first element is 1

> scores
[1] 65 75 80 88 82 99 100 100 50
> scores[1]
[1] 65
> scores[2:3]
[1] 75 80
> scores[c(1,4)]
[1] 65 88
Data Frames
• A data frame is a type of variable used for storing data tables
• It is a special type of list in which every element has the same length (i.e. a data frame is a “rectangular” list or table)
> BMI<-data.frame(
+ gender = c("Male","Male","Female"),
+ height = c(152,171.5,165),
+ weight = c(81,93,78),
+ Age = c(42,38,26)
+ )
> BMI
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
> nrow(BMI)
[1] 3
> ncol(BMI)
[1] 4
Identify elements of a data frame
• To retrieve cell values, rows, or columns
> BMI[1,3]
[1] 81
> BMI[1,]
  gender height weight Age
1   Male    152     81  42
> BMI[,3]
[1] 81 93 78

• More ways to retrieve columns as vectors


> BMI[[2]]
[1] 152.0 171.5 165.0
> BMI$height
[1] 152.0 171.5 165.0
> BMI[["height"]]
[1] 152.0 171.5 165.0
The Basics: Packages
• Packages (add-ons) are collections of R functions and code in a well-defined format.

• To install a package:
install.packages("psych")
This downloads and installs the package. Each package only needs to be installed once (once per R installation).

• For every new R session (i.e., every time you re-open RStudio), you must load the package before it can be used:
library(psych)
or
require(psych)
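One common way to combine "install once" with "load every session" is sketched below. The helper name load_package is hypothetical and not part of R. The sketch is demonstrated on "stats", which ships with R, so it runs without downloading anything.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
load_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)               # one-time step; needs internet access
  }
  library(pkg, character.only = TRUE)   # needed in every new R session
}

# "stats" is already installed with R, so no download is triggered here.
load_package("stats")
```

For the package used in this deck, the call would be load_package("psych").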
The Basics: Loading Data into R
• R can handle all kinds of data files. We will mostly deal with csv files

• Use the read.csv() function to import csv data
• You need to specify the path to the data file
• By default, a comma is used as the field delimiter
• The first row is used as variable names
Very Important!!!

• The simplest way to do this:
• Download the source file and the csv file into the same folder (e.g., C:\RFiles).
• Set that folder as the working directory by choosing the source file location as the working directory
Working directory
• The working directory is where RStudio will look first for scripts and files

• Keeping everything in a self-contained directory helps organize code and analyses

• Check your current working directory with getwd()
To change working directory
Use the Session | Set Working Directory Menu
• If you already have an .R file open, you can select “Set Working
Directory>To Source File Location”.
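The same thing can be done from code with getwd() and setwd(). The sketch below uses R's temporary folder as a stand-in for a course folder, which is an assumption made only so the example runs on any machine.

```r
original_dir <- getwd()     # remember the current working directory

# Switch to another folder. tempdir() stands in for your own folder here.
setwd(tempdir())
current_dir <- getwd()      # now reports the temporary folder

setwd(original_dir)         # switch back when done
```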
Loading data from a file
• Usually you won’t type in data manually,
you’ll get it from a file
• Example: 2009 Baseball Statistics
(http://www2.stetson.edu/~jrasp/data.htm)

read.csv() reads data from a CSV file and creates a data frame called teamData that stores the data table.

Reference the HomeRuns column in the data frame using teamData$HomeRuns (names are case-sensitive).
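A self-contained sketch of this step follows. The file name teams_example.csv and the three rows of values are invented for illustration and are not the course's baseball data; only the teamData and HomeRuns names come from the slide.

```r
# Write a tiny, made-up CSV file so the example does not need a download.
csv_path <- file.path(tempdir(), "teams_example.csv")
writeLines(c("Team,League,HomeRuns",
             "A,NL,150",
             "B,AL,180",
             "C,NL,165"), csv_path)

# read.csv() uses the first row as column names and returns a data frame.
teamData <- read.csv(csv_path)

teamData$HomeRuns         # one column, returned as a vector
mean(teamData$HomeRuns)   # columns can be passed to functions
```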
More On Loading Datasets
Suppose you want to load a dataset called “MIS2502”.

If the dataset is in
• an existing R package, load the package and type data(MIS2502)
• .RData format, type load("MIS2502.RData")
• .txt or other text formats, type read.table("MIS2502.txt")
• .csv format, type read.csv("MIS2502.csv")
• .dta (Stata) format, load the foreign package and type read.dta("MIS2502.dta")
• Remember “function & argument” in the first part

To save objects into these formats, use the equivalent write.table(), write.csv(), etc. commands.
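The save-then-load round trip can be sketched as follows for the .csv and .RData formats. The data frame contents and file names are made up, and the files are written to R's temporary folder so nothing on your machine is overwritten.

```r
grades <- data.frame(student = 1:3, score = c(65, 75, 80))

# CSV: write.csv() saves, read.csv() loads.
csv_file <- file.path(tempdir(), "MIS2502_example.csv")
write.csv(grades, csv_file, row.names = FALSE)
grades_csv <- read.csv(csv_file)

# .RData: save() stores the object, load() restores it under the same name.
rdata_file <- file.path(tempdir(), "MIS2502_example.RData")
save(grades, file = rdata_file)
rm(grades)            # remove the object from the session
load(rdata_file)      # grades is available again
```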
The Basics: Summary
• Calculation
• Variable & Value
• Function & Argument(Parameter)
• Basic Data Types: Numeric, Character, Logical
• Advanced Data Types: Vector, Data Frame
• Packages
• Loading data to R
• Working Directory
Analysis Examples
• Student t-Test: Compare means
• Histogram
• Plotting data
Analysis Example: [Student] t-Test
• Compare differences across groups:
• We want to know if National League (NL) teams scored more runs than American League (AL) teams
• And if that difference is statistically significant

• To do this, we need a package that will do this analysis
• In this case, it’s the “psych” package

Installing the package downloads and installs it (once per R installation)
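t-Test: Compare Differences Across Groups
describeBy(teamData$Runs, teamData$League)

• First argument: variable of interest (Runs)
• Second argument: broken up by group (League)

describeBy() (in the psych package; note the capital B) reports descriptive statistics for each group.

Results of t-test for differences in Runs by League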
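The comparison itself can also be run with base R's t.test(), which needs no extra package. The sketch below is an assumption about how the test could be carried out, not a reproduction of the slide's output, and the ten Runs values are made up purely for illustration.

```r
# Made-up data: five teams per league (not the real 2009 statistics).
teamData <- data.frame(
  League = rep(c("AL", "NL"), each = 5),
  Runs   = c(780, 810, 745, 790, 760,    # illustrative AL values
             720, 735, 700, 765, 710)    # illustrative NL values
)

# Mean of Runs within each league.
tapply(teamData$Runs, teamData$League, mean)

# Two-sample t-test: does mean Runs differ by League?
result <- t.test(Runs ~ League, data = teamData)
result$p.value    # a small p-value suggests a statistically significant difference
```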
Analysis Example: Histogram
hist(teamData$BattingAvg,
xlab="Batting Average",
main="Histogram: Batting Average")

hist()
first parameter – data values
xlab parameter – label for x axis
main parameter - sets title for chart
Analysis Example: Plotting data
plot(teamData$BattingAvg,teamData$WinningPct,
xlab="Batting Average",
ylab="Winning Percentage",
main="Do Teams With Better Batting Averages Win More?")

plot()
first parameter – x data values
second parameter – y data values
xlab parameter – label for x axis
ylab parameter – label for y axis
main parameter - sets title for chart
Execute the script
Use the Code | Run Region | Run All Menu

Commands can be entered one at a time, but usually they are all put into a single file that can be saved and run over and over again.
Getting help
help.start()                general help
help(mean)                  help about function mean()
?mean                       same: help about function mean()
example(mean)               show an example of function mean()
help.search("regression")   get help on a specific topic such as regression

Online Tutorials
• If you’d like to know more about R, check these out:
• Quick-R (http://www.statmethods.net/index.html)
• R Tutorial (http://www.r-tutor.com/r-introduction)
• Learn R Programming (https://www.tutorialspoint.com/r/index.htm)
• Programming with R (https://swcarpentry.github.io/r-novice-inflammation/)

• There is also an interactive tutorial to learn R basics, highly recommended! (http://tryr.codeschool.com/)
Time for our 9th ICA!