Data Analytics Using R (DAR)
JALDEEP BABARIYA
Unit 1: Introduction to Data Analysis
Unit 2: R Programming Basics
Unit 3: Data Visualization using R
Unit 4: Statistics with R
Unit 5: Prescriptive Analytics
Introduction to Data Analysis
Overview of Data Analytics
Need for Data Analytics
Nature of Data
Classification of Data
Characteristics of Data
Applications of Data Analytics
Overview of Data Analytics
Data Analysis: Data analysis is the process of inspecting, cleansing,
transforming, and modeling data with the goal of discovering useful
information, informing conclusions, and supporting decision-making.
Key components: Data Collection, Data Cleaning, Data Analysis, Data
Visualization, Interpretation
Types of Data Analytics
Descriptive Analytics: What has happened?
Diagnostic Analytics: Why did it happen?
Predictive Analytics: What could happen?
Prescriptive Analytics: What should we do?
Descriptive Analytics: Descriptive Analytics focuses on summarizing historical
data to identify trends and patterns. It answers the question, "What has
happened?" by providing insights into past events using data aggregation
and data mining techniques.
Examples
• Sales Reports: Generating monthly or quarterly sales reports to understand
revenue trends over time.
• Customer Surveys: Summarizing survey results to identify customer satisfaction
levels and common feedback themes.
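A sales report of this kind can be sketched in a few lines of R. This is a minimal illustration only: the `sales` data frame and its values are invented for the example, and `aggregate()` is just one of several ways to summarize data.

```r
# Hypothetical sales records (values invented for illustration)
sales <- data.frame(
  month   = c("Jan", "Jan", "Feb", "Feb", "Mar"),
  revenue = c(120, 80, 150, 50, 90)
)

# Descriptive summary: total revenue per month
monthly <- aggregate(revenue ~ month, data = sales, FUN = sum)
print(monthly)

# Overall average revenue per sale
avg_revenue <- mean(sales$revenue)
print(avg_revenue)
```

The result answers "What has happened?" by collapsing individual sales into per-month totals.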
Diagnostic Analytics: Diagnostic Analytics digs deeper into data to understand
the causes of past events. It answers the question, "Why did it happen?"
by identifying relationships and patterns that explain the reasons behind
specific outcomes.
Examples
• Healthcare Diagnostics: Analyzing patient records to understand the causes of a
spike in a particular illness or health issue.
• Manufacturing Defects: Studying production data to determine the root causes
of defects in manufactured goods, such as equipment malfunctions or
supply chain issues.
Predictive Analytics: Predictive Analytics uses statistical models and machine
learning techniques to forecast future events. It answers the question, "What
could happen?" by predicting outcomes based on historical data and
identifying patterns that suggest future trends.
Examples
• Customer Churn Prediction: Identifying customers who are likely to cancel a
subscription or stop using a service based on their past behavior.
• Sales Forecasting: Using historical sales data to predict future sales volumes
and trends.
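As a rough sketch of sales forecasting, the example below fits a straight-line trend with `lm()` and extrapolates one month ahead. The data are invented, and linear regression is assumed here only because it is the simplest model to show; real forecasts usually need more careful modeling.

```r
# Hypothetical monthly sales history (invented values with a steady upward trend)
history <- data.frame(
  month = 1:6,
  units = c(110, 120, 130, 140, 150, 160)
)

# Fit a simple linear trend: units as a function of month
fit <- lm(units ~ month, data = history)

# Predict sales for month 7, which is not in the historical data
forecast <- predict(fit, newdata = data.frame(month = 7))
print(forecast)
```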
Prescriptive Analytics: Prescriptive Analytics suggests actions to achieve desired
outcomes. It answers the question, "What should we do?" by providing
recommendations based on data analysis, optimization algorithms, and
simulations.
Examples
• Healthcare Treatment Plans: Providing treatment recommendations for patients
based on their medical history and predictive models of treatment
outcomes.
• Personalized Marketing: Suggesting specific marketing strategies to target
individual customers based on their predicted behavior.
Need for Data Analytics
Informed Decision Making: Using data to drive decisions rather than intuition or
observation alone.
Efficiency Improvements: Streamlining operations based on data-driven insights.
Competitive Advantage: Gaining an edge over competitors through data insights.
Cost Reduction: Identifying cost-saving opportunities through data analysis.
Innovation and Development: Creating new products or services based on data
trends and patterns.
Importance in today’s world
Decision-Making: Business, Healthcare, Government
Innovation and Development: Research, Product Development
Enhanced Customer Experience: Personalization, Feedback Analysis
Predictive Capabilities: Forecasting, Risk Management
Enhanced Personal Decisions: Health and Fitness, Financial Management
Social Impact: Education, Public Safety
Nature of Data
Definition of Data: Data is a collection of facts, such as numbers, words,
measurements, observations, or even just descriptions of things.
Forms of Data:
• Numbers: Quantitative data (e.g., sales figures).
• Text: Qualitative data (e.g., customer reviews).
• Images: Visual data (e.g., photographs, charts).
Importance of Data: Data is crucial for making informed decisions, identifying
trends, and driving business strategies.
Types of Data
Quantitative Data: Data that can be measured and counted.
• Discrete Data: Countable items (e.g., number of students in a class).
• Continuous Data: Measurable quantities (e.g., height, weight).
Qualitative Data: Data that describes qualities or characteristics.
• Nominal Data: Categories without a specific order (e.g., types of fruits, gender).
• Ordinal Data: Categories with a specific order (e.g., satisfaction rating, education
levels).
Sources
• Primary Data: Collected firsthand (e.g., surveys, experiments).
• Secondary Data: Collected by others (e.g., research articles, official records).
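The four types of data listed above map naturally onto R objects. The sketch below is illustrative only; the variable names and values are made up for this example.

```r
# Quantitative, discrete: whole-number counts (the L suffix makes an integer)
students_per_class <- c(28L, 31L, 25L)

# Quantitative, continuous: measured values
heights_cm <- c(162.5, 170.2, 158.9)

# Qualitative, nominal: categories with no order
fruit <- factor(c("apple", "banana", "apple"))

# Qualitative, ordinal: categories with a defined order
satisfaction <- factor(
  c("low", "high", "medium"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

print(is.ordered(fruit))         # nominal factors are unordered
print(is.ordered(satisfaction))  # ordinal factors are ordered
```

Because `satisfaction` is ordered, comparisons such as `satisfaction[1] < satisfaction[2]` are meaningful, whereas they are not for `fruit`.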
Quantitative Data: Quantitative data are data represented numerically,
including anything that can be counted, measured, or given a numerical
value. Quantitative data can be classified as discrete data that can be
counted in whole numbers (like the number of students in a class) or
continuous data that can take any value in a range (like height or
temperature). Categories or groups (like countries) are qualitative data,
even when they are stored as numeric codes. Quantitative data are typically
analyzed with statistics.
Examples
Examples of quantitative data are a spreadsheet of numbers or responses
collected from a survey question where the answer must be selected from a
pre-determined set of numeric values.
Qualitative Data: Qualitative Data refers to non-numeric information that describes qualities
or characteristics. It is often used to capture people's thoughts, feelings, or behaviors and is
typically expressed in words, images, or descriptions rather than numbers. This type of data is
subjective and can be categorized based on properties, attributes, labels, or other identifiers.
Example
Customer Feedback: Comments left by customers about their experience with a product or service. For
instance, feedback such as "The service was excellent and the staff were very friendly" is qualitative data
because it describes the quality of the service in descriptive terms.
Interview Transcripts: Responses from interviewees during a survey about their lifestyle choices,
opinions, or experiences. For example, an interview response like "I prefer online shopping because it is
more convenient and I can compare prices easily" is qualitative data.
Classification of Data
Structured Data
Semi-Structured Data
Unstructured Data
Structured Data: Structured Data refers to data that is organized in a predefined format or
model, making it easy to enter, store, query, and analyze. This type of data is typically found in
databases and spreadsheets where the data is arranged in rows and columns with clear,
identifiable labels for each field. Structured data is often quantitative and can be easily managed
and processed by algorithms and database systems.
Examples:
Relational Databases: Data stored in tables within a database. For instance, a customer database where
each record includes structured fields like Customer ID, Name, Address, Phone Number, and Email.
Spreadsheets: Data organized in rows and columns in tools like Microsoft Excel or Google Sheets. For
example, a sales spreadsheet where each row represents a sale and includes columns for Sale ID,
Product Name, Quantity Sold, Sale Date, and Total Amount.
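In R, structured data of this kind is usually held in a data frame. The sketch below assumes a small invented customer table; the column names and values are hypothetical.

```r
# A hypothetical customer table: rows are records, columns are labeled fields
customers <- data.frame(
  customer_id = c(101, 102, 103),
  name        = c("Asha", "Ravi", "Meera"),
  city        = c("Rajkot", "Surat", "Rajkot"),
  stringsAsFactors = FALSE
)

# Because the structure is fixed, querying is straightforward
rajkot_customers <- customers[customers$city == "Rajkot", ]
print(rajkot_customers)
print(nrow(rajkot_customers))
```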
Semi-Structured Data: Semi-Structured Data refers to data that does not conform to a rigid
structure like structured data but still contains tags or markers to separate data elements. This
type of data is more flexible than structured data and can accommodate variations in format,
making it suitable for diverse and dynamic information that may not fit neatly into traditional
database schemas.
Examples: JSON, XML
Unstructured Data: Unstructured Data refers to data that does not have a predefined data
model or structure. It is not organized in a manner that is easily readable by machines or
typically stored in traditional relational databases. Unstructured data can be text-heavy but may
also contain data such as dates, numbers, and facts that do not fit neatly into structured
databases. Analyzing unstructured data often requires more advanced techniques like natural
language processing, image recognition, or machine learning.
Examples:
Email Messages: Emails contain various types of information, such as the text body, attachments,
metadata (sender, receiver, date, etc.), and often follow no specific structure, making them unstructured
data.
Social Media Posts: Posts on platforms like Facebook, Twitter, or Instagram, which include text, images,
videos, hashtags, and user interactions (likes, comments, shares), are considered unstructured data.
Characteristics of Data – 5V’s
Volume
Variety
Value
Velocity
Veracity
Volume: Volume refers to the sheer quantity or amount of data generated and stored. It
highlights the scale of data, which can be extremely large, especially with the advent of big data
technologies. Managing and analyzing such vast amounts of data require advanced storage
solutions and powerful processing capabilities.
Example:
Social Media Data: Platforms like Facebook, Instagram, and Twitter generate massive volumes of data
daily. For instance, Facebook generates over 4 petabytes of data per day from user interactions such as
posts, comments, likes, shares, and uploaded media content.
Variety: Variety refers to the different types and sources of data. It highlights the diversity in
data formats, structures, and origins, which can include structured data, semi-structured data,
and unstructured data. This characteristic underscores the need for flexible and versatile tools to
process and integrate data from various sources.
Example:
Healthcare Data: In the healthcare industry, data comes from numerous sources and in multiple formats.
This includes structured data from electronic health records (EHRs), semi-structured data from laboratory
results or medical imaging reports, and unstructured data from doctors' notes, patient feedback, or medical
research articles. Integrating and analyzing this variety of data is crucial for comprehensive patient care and
medical research.
Value: Value refers to the usefulness and relevance of data in making decisions and driving
business outcomes. It emphasizes the importance of extracting meaningful insights and
actionable information from data. High-value data contributes to achieving strategic goals,
improving efficiency, and gaining competitive advantages.
Example:
Customer Purchase Data: Retail businesses collect and analyze customer purchase data to understand
buying patterns, preferences, and trends. This valuable data helps businesses tailor marketing strategies,
optimize inventory management, and improve customer experiences, ultimately driving sales and
profitability.
Velocity: Velocity refers to the speed at which data is generated, collected, and processed. It
emphasizes the importance of real-time or near-real-time data handling to enable timely
decision-making and actions. High-velocity data requires robust technologies to manage
continuous data streams efficiently.
Example:
Stock Market Data: In financial markets, stock prices, trading volumes, and other relevant information
change rapidly, often within milliseconds. Real-time data processing systems are required to capture,
analyze, and react to these rapid changes to facilitate trading decisions, risk management, and market
analysis.
Veracity: Veracity refers to the accuracy, trustworthiness, and reliability of data. It highlights
the importance of data quality and the need to ensure that data is truthful, consistent, and free
from errors. High veracity data is essential for making informed and accurate decisions.
Example:
Financial Transactions Data: In banking and financial services, accurate and reliable transaction data is
critical for detecting fraud, ensuring compliance, and making investment decisions. For example, a bank
uses accurate transaction records to monitor for unusual activities that might indicate fraudulent behavior.
Ensuring the veracity of this data helps prevent financial losses and maintains customer trust.
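Basic veracity checks can be sketched in base R. The transaction records below are invented, and the three checks shown (missing values, duplicate rows, negative amounts) are only examples of what a data quality review might include.

```r
# Hypothetical transaction records with deliberate quality problems
transactions <- data.frame(
  id     = c(1, 2, 2, 3, 4),
  amount = c(250, 99.5, 99.5, NA, -40)
)

# Count missing amounts
n_missing <- sum(is.na(transactions$amount))

# Count rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(transactions))

# Count negative amounts, ignoring missing values
n_negative <- sum(transactions$amount < 0, na.rm = TRUE)

print(c(missing = n_missing, duplicates = n_duplicates, negative = n_negative))
```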
Applications of Data Analytics
Business Intelligence and Decision-Making: Data analytics helps businesses make informed
decisions by transforming raw data into actionable insights. Business intelligence tools use data
to generate reports, dashboards, and visualizations.
Example: A retail chain analyzes sales data to identify peak shopping times, popular products, and
customer demographics. This information enables the company to optimize store layouts, tailor
promotions, and adjust inventory levels to increase sales and enhance customer satisfaction.
Healthcare: In healthcare, data analytics is used to improve patient outcomes, optimize
operations, and enhance overall care quality. It involves analyzing electronic health records
(EHRs), patient feedback, and clinical data.
Example: Hospitals use predictive analytics to forecast patient admissions, manage staffing levels,
and prevent readmissions. Analyzing patient data helps in early detection of diseases, personalized
treatment plans, and tracking the effectiveness of therapies.
Marketing: Marketing analytics involves analyzing customer behavior, campaign performance,
and market trends to improve marketing strategies. It helps in targeting specific customer
segments and measuring campaign effectiveness.
Example: A company uses web analytics to track user behavior on its website. This data helps in
understanding customer preferences, optimizing website design, and personalizing marketing messages
to increase engagement and conversion rates.
Education: Data analytics in education helps institutions enhance learning outcomes, tailor
educational content, and improve administrative processes. It involves analyzing student
performance data, attendance records, and feedback.
Example: Schools analyze student performance data to identify learning gaps and personalize
educational resources. Analytics help educators track progress, design targeted interventions, and
improve student engagement.
Unit 2: R Programming Basics
R Programming Basics
Overview of R programming
Environment setup with R Studio
R Commands
Variables and Data Types
Control Structures
Array, Matrix, Vectors, Factors
Functions
R packages
Overview of R programming
R is a programming language and free software environment, first developed by Ross Ihaka and
Robert Gentleman in 1993. R offers an extensive catalog of statistical and graphical methods,
including machine learning algorithms, linear regression, time series analysis, and statistical
inference, to name a few. Most R libraries are written in R itself, but C, C++, and Fortran are
preferred for computationally heavy tasks. R is trusted not only in academia; many large
companies, including Uber, Google, Airbnb, and Facebook, also use it.
Data analysis with R is done in a series of steps: programming, transforming, discovering,
modeling, and communicating the results.
What is R?
R is a popular programming language used for statistical computing and graphical presentation.
Its most common use is to analyze and visualize data.
R is a programming language and environment specifically designed for statistical computing and
data analysis. It is widely used in various fields, including data science, statistics, and research,
due to its powerful capabilities for data manipulation, statistical modeling, and visualization.
What is R used for?
Statistical inference
Data analysis
Machine learning algorithms
Core Features of R
Statistical Analysis: Comprehensive tools for various statistical methods and tests.
Data Visualization: Advanced capabilities for creating plots and charts, using packages like ‘ggplot2’.
Data Manipulation: Efficient data cleaning and transformation with packages like ‘dplyr’ and ‘tidyr’.
Open Source: Free and open-source software with a supportive community.
Extensive Packages: A vast collection of packages for specialized tasks available via CRAN and Bioconductor.
Integration: Ability to integrate with other programming languages and data formats.
Environment setup with RStudio
RStudio is a popular integrated development environment (IDE) for R, providing a user-friendly
interface for coding, debugging, and visualizing data.
Here is the step-by-step guide to setting up your environment with RStudio:
Step 1: Install R
Step 2: Install RStudio
Step 3: Launch RStudio
◦ Explore RStudio Interface:
• Console: Where you can type and execute R commands interactively.
• Script Editor: Write and edit R scripts here.
• Environment/History: Shows the variables and data in your workspace and command history.
• Files/Plots/Packages/Help: Manage files, view plots, install and load packages, and access help documentation.
Step 4: Install Packages
Open RStudio Console: Install packages by typing commands directly into the console.
Install Common Packages: To install a package, use the ‘install.packages()’ function.
For example: install.packages("ggplot2")
Load Packages: After installation, load packages using the ‘library()’ function.
For example: library(ggplot2)
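The install-then-load steps can also be combined so that a package is downloaded only when it is missing. The helper below, `load_or_install()`, is a hypothetical name introduced for this sketch and is not part of R; it relies only on the base functions `requireNamespace()`, `install.packages()`, and `library()`.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs internet access
  }
  library(pkg, character.only = TRUE)  # character.only lets us pass the name as a string
  invisible(pkg %in% loadedNamespaces())
}

# "stats" ships with R, so this call loads it without downloading anything
loaded <- load_or_install("stats")
print(loaded)
```

For a CRAN package such as ggplot2, the same call would be `load_or_install("ggplot2")`.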
R Commands
R commands are instructions given to the R interpreter to perform specific tasks, such as
arithmetic operations, variable assignments, calculations, data manipulations, and visualizations.
Arithmetic Operations: Perform basic mathematical operations.
Assignment: Assign values to variables.
Logical Operations: Evaluate logical expressions.
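The three kinds of commands listed above can be tried directly in the console. The values below are arbitrary and chosen only for illustration.

```r
# Arithmetic operations
total     <- 7 + 3   # addition
quotient  <- 7 / 2   # division keeps the fractional part
remainder <- 7 %% 2  # modulo (remainder after division)
power     <- 2^3     # exponentiation

# Assignment: store a result in a variable and reuse it
doubled <- total * 2

# Logical operations: comparisons return TRUE or FALSE
is_large <- total > 5
both     <- is_large & (remainder == 0)  # TRUE only if both conditions hold

print(c(total, quotient, remainder, power, doubled))
print(c(is_large, both))
```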
Variables and Data Types
Variables: Variables in R are used to store data values. They act as containers that hold different
types of data and can be used throughout your code.
R supports three ways of variable assignment:
• Using the equal operator (=): the value on the right is assigned to the variable on the left.
• Using the leftward operator (<-): data is copied from right to left.
• Using the rightward operator (->): data is copied from left to right.
# using equal to operator
var1 = "hello"
print(var1)
# using leftward operator
var2 <- "hello"
print(var2)
# using rightward operator
"hello" -> var3
print(var3)
The following rules need to be kept in mind while naming an R variable:
• A valid variable name consists of a combination of letters, numbers, dot (.), and underscore (_) characters.
• Example: var.1_ is valid
• Apart from the dot and underscore characters, no other special character is allowed.
• Example: var$1 and var#1 are both invalid
• Variable names can start with a letter or a dot.
• Example: .var and var are both valid
• A variable name must not start with a number or an underscore.
• Example: 2var and _var are both invalid
• If a variable name starts with a dot, the character after the dot cannot be a number.
• Example: .3var is invalid
• The variable name must not be a reserved keyword in R.
• Example: TRUE, FALSE, etc.
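The naming rules above can be checked with the base function `make.names()`, which rewrites a string into a syntactically valid name. A name that comes back unchanged is valid. This is a convenience check used here for illustration; `candidates` and `is_valid` are names introduced for the example.

```r
# Candidate names taken from the examples in the rules above
candidates <- c("var.1_", "var$1", ".var", "2var", "_var", ".3var", "TRUE")

# make.names() leaves valid names untouched and alters invalid or reserved ones
is_valid <- make.names(candidates) == candidates
names(is_valid) <- candidates
print(is_valid)
```

Only `var.1_` and `.var` come back unchanged, which matches the rules listed above.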
Important Methods for R Variables
1. class() function
This built-in function is used to determine the data type of the variable provided to it.
The R variable to be checked is passed to it as an argument, and it returns the data type.
var1 = "hello"
print(class(var1)) # Output: "character"
2. ls() function
This built-in function is used to list all the variables present in the workspace.
This is generally helpful when dealing with a large number of variables at once and helps
prevent overwriting any of them.
# using equal to operator
var1 = "hello"
# using leftward operator
var2 <- "hello"
# using rightward operator
"hello" -> var3
print(ls())
3. rm() function
This is again a built-in function, used to delete an unwanted variable from your workspace.
This helps free the memory allocated to variables that are no longer in use, thereby creating
more space for others. The name of the variable to be deleted is passed as an argument to it.
# using equal to operator
var1 = "hello"
# using leftward operator
var2 <- "hello"
# using rightward operator
"hello" -> var3
# Removing variable
rm(var3)
print(var3) # Error: object 'var3' not found
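Printing a removed variable stops the script with an error. A gentler way to confirm the removal, sketched below, is the base function `exists()`, which reports whether a name is currently defined. The variable name `tmp_var` is made up for this example.

```r
# Create a temporary variable
tmp_var <- "hello"
before <- exists("tmp_var")  # the variable is defined at this point

# Remove it from the workspace
rm(tmp_var)
after <- exists("tmp_var")   # the variable is no longer defined

print(c(before = before, after = after))
```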
Data Types
1. Numeric
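Numeric data types represent real numbers. By default R stores every number, including whole
numbers, as a double (floating-point value); an integer is created only with the L suffix (e.g., 42L).
num1 <- 42 # whole number, stored as a double
num2 <- 3.14 # floating-point number
print(num1) # Output: 42
print(num2) # Output: 3.14
print(is.integer(num1)) # Output: FALSE (num1 is a double, not an integer)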
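The difference between a plain number and a true integer can be seen with `class()` and `typeof()`. The sketch below uses two illustrative variables, `x` and `y`.

```r
x <- 42   # no suffix: stored as a double
y <- 42L  # L suffix: stored as an integer

print(class(x))       # "numeric"
print(typeof(x))      # "double"
print(class(y))       # "integer"
print(is.integer(x))  # FALSE
print(is.integer(y))  # TRUE
print(x == y)         # TRUE: the values are equal even though the types differ
```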
2. Character (strings)
Character data types represent text.
char1 <- "Hello, World!"
char2 <- 'R is great for data analysis.'
print(char1) # Output: "Hello, World!"
print(char2) # Output: "R is great for data analysis."
3. Logical
Logical data types represent boolean values: TRUE or FALSE.
num1 <- 10
num2 <- 20
# Greater than
is_greater <- num1 > num2
print(is_greater) # Output: FALSE
4. Complex
Complex data types represent complex numbers with real and imaginary parts.
# Complex data type
comp1 <- 2 + 3i
comp2 <- 5 - 4i
print(comp1) # Output: 2+3i
print(comp2) # Output: 5-4i
Control Structures
Control structures in R allow you to control the flow of execution in your programs based on
conditions or iterative processes.
Example: ‘if’, ‘else if’, ‘else’
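A minimal sketch of an ‘if’, ‘else if’, ‘else’ chain is shown below. The `score` value and the grade boundaries are arbitrary choices made for illustration.

```r
score <- 72  # hypothetical exam score

# Conditions are tested top to bottom; the first TRUE branch runs
if (score >= 80) {
  grade <- "A"
} else if (score >= 60) {
  grade <- "B"
} else {
  grade <- "C"
}

print(grade)
```

Note that each `else` sits on the same line as the preceding closing brace; at the top level of a script, R would otherwise treat the `if` statement as already finished.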
Example: ‘for’ Loop:
for (i in 1:5) {
  print(i)
}
Example: ‘while’ Loop:
y <- 1
while (y <= 5) {
  print(y)
  y <- y + 1
}
Arrays
Arrays are multi-dimensional data structures in R, allowing you to store data in more than two
dimensions.
# Create a 3-dimensional array
arr <- array(1:24, dim = c(3, 4, 2))
print(arr)
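Elements of the array are addressed with one index per dimension. The sketch below reuses the same 3 x 4 x 2 array; the variable names `one_value` and `second_layer` are introduced for the example.

```r
arr <- array(1:24, dim = c(3, 4, 2))

# Dimensions: 3 rows, 4 columns, 2 layers
print(dim(arr))

# Single element: row 2, column 3, layer 1
one_value <- arr[2, 3, 1]
print(one_value)

# Leaving an index blank selects everything along that dimension
second_layer <- arr[, , 2]  # a 3 x 4 matrix
print(second_layer)
```

R fills arrays column by column, so the first layer holds 1 to 12 and the second holds 13 to 24.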
Matrix
Matrices are two-dimensional arrays where each element has the same type.
# Create a matrix
mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)
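By default `matrix()` fills values column by column. The sketch below contrasts this with row-wise filling via the `byrow` argument and shows element access; `by_row` is a name introduced for the example.

```r
mat <- matrix(1:9, nrow = 3, ncol = 3)                   # filled column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row

# Element at row 2, column 3
print(mat[2, 3])
print(by_row[2, 3])

# Transposing the column-filled matrix gives the row-filled one
print(identical(t(mat), by_row))
```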
Vectors
Vectors are the simplest type of data structure in R and can hold elements of the same type.
# Numeric vector
num_vec <- c(1, 2, 3, 4, 5)
print(num_vec)
# Character vector
char_vec <- c("a", "b", "c")
print(char_vec)
# Accessing elements in a vector
print(num_vec[1])
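Two consequences of the same-type rule are worth seeing in code: operations apply to every element at once, and mixing types forces a conversion to a common type. The variable names below are illustrative.

```r
num_vec <- c(1, 2, 3, 4, 5)

# Vectorized arithmetic: applied to each element
doubled <- num_vec * 2
print(doubled)

# Logical indexing: keep only elements greater than 3
large <- num_vec[num_vec > 3]
print(large)

# Mixing types coerces every element to a common type (here, character)
mixed <- c(1, "a", TRUE)
print(mixed)
print(class(mixed))
```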
Factors
Factors are used to handle categorical data. A factor stores each value as an integer code that
points to a text label (its level), and can be created from both strings and integers.
# Create a factor
factor_vec <- factor(c("low", "medium", "high", "medium", "low"))
print(factor_vec)
# Output: [1] low medium high medium low
# Levels: high low medium
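The default levels above are sorted alphabetically (high, low, medium), which does not match the natural order of the categories. A sketch of how the order can be set explicitly with the `levels` and `ordered` arguments of `factor()` follows; `sizes` is a name introduced for the example.

```r
# Ordered factor with levels given in their natural order
sizes <- factor(
  c("low", "medium", "high", "medium", "low"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

print(levels(sizes))        # level order as specified
print(as.integer(sizes))    # underlying integer codes
print(table(sizes))         # count of observations per level
print(sizes[1] < sizes[3])  # comparisons respect the level order
```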