IDS Notes Unit 1
UNIT - I
Introduction: Definition of Data Science- Big Data and Data Science hype – and getting past
the hype - Datafication - Current landscape of perspectives - Statistical Inference -
Populations and samples - Statistical modeling, probability distributions, fitting a model –
Overfitting. Basics of R: Introduction, R Environment Setup, Programming with R, Basic
Data Types.
UNIT - II
Data Types & Statistical Description
Types of Data: Attributes and Measurement, What is an Attribute? The Type of an Attribute,
The Different Types of Attributes, Describing Attributes by the Number of Values,
Asymmetric Attributes, Binary Attribute, Nominal Attributes, Ordinal Attributes, Numeric
Attributes, Discrete versus Continuous Attributes. Basic Statistical Descriptions of Data:
Measuring the Central Tendency: Mean, Median, and Mode, Measuring the Dispersion of
Data: Range, Quartiles, Variance, Standard Deviation, and Interquartile Range, Graphic
Displays of Basic Statistical Descriptions of Data.
UNIT - III
Vectors: Creating and Naming Vectors, Vector Arithmetic, Vector Subsetting, Matrices:
Creating and Naming Matrices, Matrix Subsetting, Arrays, Class. Factors and Data
Frames: Introduction to Factors: Factor Levels, Summarizing a Factor, Ordered Factors,
Comparing Ordered Factors, Introduction to Data Frame, Subsetting of Data Frames,
Extending Data Frames, Sorting Data Frames.
Lists: Introduction, creating a List: Creating a Named List, Accessing List Elements,
Manipulating List Elements, Merging Lists, Converting Lists to Vectors
UNIT - IV
Conditionals and Control Flow: Relational Operators, Relational Operators and Vectors,
Logical Operators, Logical Operators and Vectors, Conditional Statements. Iterative
Programming in R:
Introduction, While Loop, For Loop, Looping Over List. Functions in R: Introduction,
writing a Function in R, Nested Functions, Function Scoping, Recursion, Loading an R
Package, Mathematical Functions in R.
UNIT - V
Data Reduction: Overview of Data Reduction Strategies, Wavelet Transforms, Principal
Components Analysis, Attribute Subset Selection, Regression and Log-Linear Models:
Parametric Data Reduction, Histograms, Clustering, Sampling, Data Cube Aggregation. Data
Visualization: Pixel-Oriented Visualization Techniques, Geometric Projection Visualization
Techniques, Icon-Based Visualization Techniques, Hierarchical Visualization Techniques,
Visualizing Complex Data and Relations.
Introduction
Unstructured Data
Unstructured data is not organized. We must organize the data for analysis purposes.
Structured Data
Structured data is organized and easier to work with.
How to Structure Data?
We can use an array or a database table to structure or present data.
Example of an array: [80, 85, 90, 95, 100, 105, 110, 115, 120, 125]
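As a minimal sketch in R, the language used later in these notes, the array above can be stored in a vector, and the same values can be arranged as a table with rows and columns. The names scores, records and record_id are made up for illustration.

```r
# Store the example values in a numeric vector (a one-dimensional array)
scores = c(80, 85, 90, 95, 100, 105, 110, 115, 120, 125)

# Arrange the same values as a table: one row per record, one column per attribute
# (record_id is a hypothetical label added for illustration)
records = data.frame(record_id = 1:10, score = scores)

length(scores)     # number of values: 10
mean(scores)       # average of the values: 102.5
head(records, 3)   # first three rows of the table
```

Once the data is structured this way, summary functions such as mean() can be applied to it directly.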
Big Data and Data Science hype
Big Data:
Big data refers to huge, voluminous collections of data, information, or relevant statistics
acquired by large organizations and ventures. Because data at this scale is difficult to
process manually, a great deal of software and storage infrastructure has been built to handle
it. Big data is used to discover patterns and trends and to make decisions related to human
behavior and interaction technology.
Advantages of Big Data:
• Able to handle and process large and complex data sets that cannot be easily managed
with traditional database systems
• Provides a platform for advanced analytics and machine learning applications
• Enables organizations to gain insights and make data-driven decisions based on large
amounts of data
• Offers potential for significant cost savings through efficient data management and
analysis
Disadvantages of Big Data:
• Requires specialized skills and expertise in data engineering, data management, and
big data tools and technologies
• Can be expensive to implement and maintain due to the need for specialized
infrastructure and software
• May face privacy and security concerns when handling sensitive data
• Can be challenging to integrate with existing systems and processes
Data Science:
Data science is a field that involves working with huge amounts of data and using it to
build predictive, descriptive, and prescriptive analytical models. It is about digging
into and capturing the data (building the model), analyzing it (validating the model),
and utilizing it (deploying the best model). It is an intersection of data and
computing, blending the fields of computer science, business, and statistics.
Advantages of Data Science:
• Provides a framework for extracting insights and knowledge from data through
statistical analysis, machine learning, and data visualization techniques
• Offers a wide range of applications in various fields such as finance, healthcare, and
marketing
• Helps organizations make informed decisions by extracting meaningful insights from
data
• Offers potential for significant cost savings through efficient data management and
analysis
Disadvantages of Data Science:
• Requires specialized skills and expertise in statistical analysis, machine learning, and
data visualization
• Can be time-consuming and resource-intensive due to the need for data cleaning and
preprocessing
• May face ethical concerns when dealing with sensitive data
• Can be challenging to integrate with existing systems and processes
Datafication
Datafication is the process of turning everything into data: taking something that was once
unquantified and turning it into quantitative data.
The Data Science Landscape
Data science is part of the computer sciences. It comprises the disciplines of i) analytics, ii)
statistics and iii) machine learning.
Analytics
Data analytics focuses on processing and performing statistical analysis of existing datasets.
Analysts concentrate on creating methods to capture, process, and organize data to uncover
actionable insights for current problems, and establishing the best way to present this data.
More simply, the field of data and analytics is directed toward solving problems for questions
we know we don’t know the answers to. More importantly, it’s based on producing results
that can lead to immediate improvements.
Data analytics also encompasses a few different branches of broader statistics and analysis
which help combine diverse sources of data and locate connections while simplifying the
results.
Statistics
In many instances, analytics may be sufficient to address a given problem. In other instances,
the issue is more complex and requires a more sophisticated approach to provide an answer,
especially if there is a high-stakes decision to be made under uncertainty. This is when
statistics comes into play. Statistics provides a methodological approach to answer questions
raised by the analysts with a certain level of confidence.
Sometimes simple descriptive statistics are sufficient to provide the necessary insight. Yet, on
other occasions, more sophisticated inferential statistics such as regression analysis are
required to reveal relationships between cause and effect for a certain phenomenon. The
limitation of statistics is that it is traditionally conducted with software packages, such as
SPSS and SAS, which require a distinct calculation for a specific problem by a statistician or
trained professional. The degree of automation is rather limited.
Machine Learning
Artificial intelligence refers to the broad idea that machines can perform tasks normally
requiring human intelligence, such as visual perception, speech recognition, decision-making
and translation between languages. In the context of data science, machine learning can be
considered as a sub-field of artificial intelligence that is concerned with decision making. In
fact, in its most essential form, machine learning is decision making at scale. Machine
learning is the field of study of computer algorithms that allow computer programs to identify
and extract patterns from data. A common purpose of machine learning algorithms is
therefore to generalize and learn from data in order to perform certain tasks.
Statistical Inference
Using data analysis and statistics to make conclusions about a population is called statistical
inference.
The main types of statistical inference are:
Estimation
Hypothesis testing
Estimation
Statistics from a sample are used to estimate population parameters. The most likely value is
called a point estimate. There is always uncertainty when estimating. The uncertainty is often
expressed as confidence intervals defined by a likely lowest and highest value for the
parameter.
An example could be a confidence interval for the number of bicycles a Dutch person owns:
"The average number of bikes a Dutch person owns is between 3.5 and 6."
Hypothesis Testing
Hypothesis testing is a method to check whether a claim about a population is supported by
the data. More precisely, it checks how likely the observed sample data would be if the
hypothesis were true.
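Both kinds of inference described above can be sketched in R with the built-in t.test() function. The sample below is made-up data, and the claimed population mean of 3 is an arbitrary value chosen for illustration; this is a sketch under those assumptions, not a real survey.

```r
# Hypothetical sample (made-up data): number of bikes owned by 10 surveyed people
bikes = c(3, 5, 4, 6, 2, 5, 4, 7, 3, 5)

# Estimation: point estimate and 95% confidence interval for the population mean
point_estimate = mean(bikes)
interval = t.test(bikes, conf.level = 0.95)$conf.int

# Hypothesis testing: is the sample consistent with a population mean of 3?
# (3 is an arbitrary claimed value chosen for illustration)
test_result = t.test(bikes, mu = 3)

point_estimate        # 4.4
interval              # lower and upper limit of the confidence interval
test_result$p.value   # a small p-value counts as evidence against the claimed mean
```

The confidence interval expresses the uncertainty around the point estimate, while the p-value measures how surprising the sample would be if the claimed mean were correct.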
Populations and Samples
For good statistical analysis, the sample needs to be as "similar" as possible to the population.
If they are similar enough, we say that the sample is representative of the population.
The sample is used to make conclusions about the whole population. If the sample is not
similar enough to the whole population, the conclusions could be useless.
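The effect of a representative versus an unrepresentative sample can be illustrated with a small simulation. The population below is simulated (heights in cm, an assumption made purely for illustration), and the variable names are hypothetical.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: heights (cm) of 10,000 people, simulated for illustration
population = rnorm(10000, mean = 170, sd = 10)

# Simple random sample of 200 members, each equally likely to be chosen
random_sample = sample(population, size = 200)

# A deliberately unrepresentative sample: only the 200 tallest members
biased_sample = sort(population, decreasing = TRUE)[1:200]

mean(population)
mean(random_sample)   # expected to be close to the population mean
mean(biased_sample)   # far above the population mean
```

Conclusions drawn from the biased sample would overstate the average height of the whole population, which is why the sample must resemble the population.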
Statistical Modeling
What is statistical modeling?
The statistical modeling process is a way of applying statistical analysis to datasets in
data science. The statistical model involves a mathematical relationship between
random and non-random variables.
A statistical model can provide intuitive visualizations that aid data scientists in
identifying relationships between variables and making predictions by applying
statistical models to raw data.
Examples of common data sets for statistical analysis include census data, public
health data, and social media data.
Statistical modeling techniques
Data gathering is the foundation of statistical modeling. The data may come from the
cloud, spreadsheets, databases, or other sources. There are two categories of statistical
modeling methods used in data analysis. These are:
Supervised learning
In the supervised learning model, the algorithm uses a labeled data set for learning,
with an answer key the algorithm uses to determine accuracy as it trains on the
data. Supervised learning techniques in statistical modeling include:
Regression model: A predictive model designed to analyze the relationship between
independent and dependent variables. The most common regression models are
logistic, polynomial, and linear. These models are used to determine the relationship
between variables, to forecast, and to model.
Classification model: An algorithm that analyzes a large and complex set of data
points and assigns each point to a class. Common models include decision trees,
Naive Bayes, k-nearest neighbors, random forests, and neural networks.
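As a minimal sketch of the regression model described above, a linear regression can be fitted in R with the built-in lm() function. The data (hours studied and exam scores) is made up for illustration.

```r
# Hypothetical data: hours studied (independent) and exam score (dependent)
hours = c(1, 2, 3, 4, 5, 6, 7, 8)
score = c(52, 55, 61, 64, 70, 73, 79, 82)

# Fit a simple linear regression: score = intercept + slope * hours
fit = lm(score ~ hours)

coef(fit)                                       # estimated intercept and slope
predict(fit, newdata = data.frame(hours = 9))   # forecast for an unseen value
```

The labeled scores play the role of the answer key: the model is fitted so that its predictions match them as closely as possible.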
Unsupervised learning
In the unsupervised learning model, the algorithm is given unlabeled data and
attempts to extract features and determine patterns independently. Clustering
algorithms and association rules are examples of unsupervised learning. A common
example:
K-means clustering: The algorithm partitions the data points into a specified number
of groups based on similarity.
Reinforcement learning is often listed alongside these techniques, although it is
usually treated as a separate category rather than a form of unsupervised learning.
It trains the algorithm over many attempts, often using deep learning, rewarding
moves that result in favorable outcomes and penalizing actions that produce
undesired effects.
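K-means clustering, mentioned above, is available in R as the built-in kmeans() function. The sketch below uses simulated, unlabeled points that form two loose groups; the data and the variable names are assumptions made for illustration.

```r
set.seed(1)  # k-means starts from random centers, so fix the seed

# Hypothetical unlabeled data: two loose groups of 20 points in two dimensions
pts = rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

# Ask for 2 clusters; nstart = 10 reruns the algorithm from 10 random starts
clusters = kmeans(pts, centers = 2, nstart = 10)

table(clusters$cluster)   # number of points assigned to each cluster
clusters$centers          # coordinates of the two cluster centers
```

No labels are supplied: the algorithm recovers the two groups from the positions of the points alone.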
Probability distributions
What is a Probability Distribution?
A probability distribution of a random variable is a list of all possible outcomes with
their corresponding probability values.
Note: The value of a probability always lies between 0 and 1.
What is an example of Probability Distribution?
Let’s understand the probability distribution by an example:
When two six-sided dice are rolled, let each possible outcome be denoted by (a, b),
where
a : number on the top of the first die
b : number on the top of the second die
Then the possible sums a + b, with the outcomes that produce them, are:
Sum a + b    Outcomes (a, b)
2            (1,1)
3            (1,2), (2,1)
4            (1,3), (2,2), (3,1)
5            (1,4), (2,3), (3,2), (4,1)
6            (1,5), (2,4), (3,3), (4,2), (5,1)
7            (1,6), (2,5), (3,4), (4,3), (5,2), (6,1)
8            (2,6), (3,5), (4,4), (5,3), (6,2)
9            (3,6), (4,5), (5,4), (6,3)
10           (4,6), (5,5), (6,4)
11           (5,6), (6,5)
12           (6,6)
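Assuming fair dice, each of the 36 outcomes (a, b) is equally likely, so the probability of a sum is the number of outcomes that produce it divided by 36. A minimal sketch in R that turns the listing above into a probability distribution:

```r
# All 36 equally likely outcomes (a, b), represented by their sums a + b
sums = outer(1:6, 1:6, "+")

# Probability of each sum = number of outcomes giving that sum / 36
distribution = table(sums) / 36

distribution        # one probability for each sum from 2 to 12
sum(distribution)   # the probabilities add up to 1
```

For example, the sum 7 is produced by 6 of the 36 outcomes, so its probability is 6/36 = 1/6, the largest in the distribution.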
Basics of R
Introduction:
R is a popular programming language used for statistical computing and graphical
presentation.
Its most common use is to analyze and visualize data.
Why Use R?
• It is a great resource for data analysis, data visualization, data science and machine
learning
• It provides many statistical techniques (such as statistical tests, classification,
clustering and data reduction)
• It is easy to draw graphs in R, like pie charts, histograms, box plots, scatter plots, etc.
• It works on different platforms (Windows, Mac, Linux)
• It is open-source and free
• It has large community support
• It has many packages (libraries of functions) that can be used to solve different
problems
R Environment Setup
Downloading and Installing R
• R is freely available from the Comprehensive R Archive Network (CRAN) at
http://cran.r-project.org
• Precompiled binaries are available for Linux, Mac OS X and Windows.
• The latest R release when these notes were prepared was R-3.4.0.
• Installing R on Windows and Mac is just like installing any other program.
• Install RStudio, a free IDE for R, from http://www.rstudio.com/
• If both R and RStudio are installed, we only need to run RStudio.
• R is case-sensitive.
• R scripts are simply text files with a .R extension.
Programming with R
Once we are inside the R session, we can directly execute R language commands by
typing them line by line. Pressing the Enter key ends the command and brings back the
> prompt. In the example session below, we declare two variables 'a' and 'b' with the
values 5 and 6 respectively, and assign their sum to another variable called 'c':
> a = 5
> b = 6
> c = a + b
> c
[1] 11
In an R session, typing a variable name prints its value on the screen; here the value
of the variable 'c' is printed as 11.
Get help inside R session
To get help on any function of R, type help(function-name) at the R prompt. For example, if
we need help on the "if" statement, we type:
> help("if")
The help page for the "if" statement is then displayed.
Exit the R session
To exit the R session, type quit() at the R prompt, and answer 'n' (no) when asked whether to
save the workspace image. This means we do not want to keep what was created in the
current session:
> quit()
Save workspace image? [y/n/c]: n
Saving the R session
Note that by not saving the current session, we lose the commands of the current session
and the variables and objects created in it when we exit the R prompt. When we work in
R, the objects we create and load are stored in a portion of memory called the workspace.
When we say 'no' to saving the workspace, all these objects are wiped from the
workspace memory. If we say 'yes', they are saved into a file called ".RData", which is
written to the present working directory.
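Objects can also be saved and restored by hand, without quitting R. The sketch below uses the built-in save() and load() functions; the file path is a temporary file chosen for illustration, not the ".RData" file in the working directory. Calling save.image() with no arguments writes the whole workspace to ".RData", which is what answering 'yes' at exit does.

```r
# Two objects to save
a = 5
b = 6

# Temporary file used as a stand-in for a workspace file (assumption for illustration)
workspace_file = tempfile(fileext = ".RData")

save(a, b, file = workspace_file)   # write the selected objects to disk
rm(a, b)                            # remove them from the workspace
load(workspace_file)                # restore them from the file

a + b                               # the objects are available again
```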
Listing the objects in the current R session
We can list the names of the objects in the current R session with the ls() command. For
example, start a fresh R session and proceed as follows:
> a = 5
> b = 6
> c = 8
> sum = a + b + c
> sum
[1] 19
> ls()
[1] "a" "b" "c" "sum"
Here, the objects we created have been listed.
Removing objects from the current R session
Specific objects created in the current session can be removed using the rm() command. If
we specify the name of an object, it will be removed. If we say rm(list = ls()), all objects
created so far will be removed. See below:
> a = 5
> b = 6
> c = 8
> sum = a + b + c
> sum
[1] 19
> ls()
[1] "a" "b" "c" "sum"
> rm(list = c("sum"))
> ls()
[1] "a" "b" "c"
> rm(list = ls())
> ls()
character(0)
Getting and setting the current working directories
From the R prompt, we can get the current working directory using the getwd() command:
> getwd()
[1] "/home/user"
Similarly, we can set the current working directory by calling the setwd() function:
> setwd("/home/user/prog")
After this, "/home/user/prog" will be the working directory.
Comments
Comments are explanatory text in an R program; the interpreter ignores them while
executing the program. A single-line comment is written using # at the beginning of the
statement, as follows:
# My first program in R Programming
R does not support multi-line comments.
R Reserved Words
Reserved words in R programming are a set of words that have special
meaning and cannot be used as an identifier (variable name, function name,
etc.). Here is a list of the reserved words in R's parser.
Reserved words in R
if            else       repeat        while           function
for           in         next          break           TRUE
FALSE         NULL       Inf           NaN             NA
NA_integer_   NA_real_   NA_complex_   NA_character_   ...
Variables in R
Variables are used to store data, and their values can be changed as needed. The unique
name given to a variable (or to a function or object) is called an identifier.
Rules for writing Identifiers in R
1. Identifiers can be a combination of letters, digits, period (.) and underscore (_).
2. It must start with a letter or a period. If it starts with a period, it cannot be followed by a
digit.
3. Reserved words in R cannot be used as identifiers.
Valid identifiers in R
total, Sum, .fine.with.dot, this_is_acceptable, Number5
Invalid identifiers in R
tot@l, 5um, _fine, TRUE, .0ne
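The rules above can be checked in R itself. The built-in make.names() function rewrites a string into a syntactically valid name and leaves names that are already valid unchanged, so comparing its output with its input gives a simple validity check. The helper is_valid_identifier is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: a name is valid if make.names() does not need to change it
is_valid_identifier = function(x) make.names(x) == x

valid_names = c("total", "Sum", ".fine.with.dot", "this_is_acceptable", "Number5")
invalid_names = c("tot@l", "5um", "_fine", "TRUE", ".0ne")

is_valid_identifier(valid_names)     # TRUE for every name
is_valid_identifier(invalid_names)   # FALSE for every name
```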
Constants in R
Constants, as the name suggests, are entities whose value cannot be altered. The basic types
of constants are numeric constants and character constants.
Numeric Constants
All numbers fall under this category. They can be of type integer, double or complex. The
type can be checked with the typeof() function. Numeric constants followed by L are
regarded as integer and those followed by i are regarded as complex.
> typeof(5)
[1] "double"
> typeof(5L)
[1] "integer"
> typeof(5i)
[1] "complex"
Numeric constants preceded by 0x or 0X are interpreted as hexadecimal numbers.
> 0xff
[1] 255
> 0XF + 1
[1] 16
Character Constants
Character constants can be represented using either single quotes (') or double quotes (") as
delimiters.
> 'example'
[1] "example"
> typeof("5")
[1] "character"
Built-in Constants
Some of the built-in constants defined in R, along with their values, are shown below.
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
> pi
[1] 3.141593
> month.name
 [1] "January" "February" "March" "April" "May" "June"
 [7] "July" "August" "September" "October" "November" "December"
> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
But it is not good to rely on these, as they are implemented as variables whose values can
be changed.
> pi
[1] 3.141593
> pi = 56
> pi
[1] 56
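A short sketch of how to recover from the situation above: the assignment creates a new variable named pi in the workspace that hides the built-in one, so the original value stays reachable through the base package and comes back once the workspace variable is removed.

```r
pi = 56    # creates a workspace variable named pi that hides the built-in constant
pi         # 56

base::pi   # the built-in value is still reachable through the base package

rm(pi)     # remove the workspace variable
pi         # the name refers to the built-in constant again
```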
Example: Hello World Program
> # We can use the print() function
> print("Hello World!")
[1] "Hello World!"
Basic Data Types
The class() function shows the data type of a value:
# integer
x = 1000L
class(x)
# complex
x = 9i + 3
class(x)
# character/string
x = "R is exciting"
class(x)
# logical/boolean
x = TRUE
class(x)
Numbers
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x = 10.5 # numeric
y = 10L # integer
z = 1i # complex
Numeric
A numeric data type is the most common type in R, and contains any number with or without
a decimal, like: 10.5, 55, 787:
Example
x = 10.5
y = 55
Integer
Integers are numeric data without decimals. This is used when you are certain that you will
never create a variable that should contain decimals. To create an integer variable, you must
use the letter L after the integer value:
Example
x = 1000L
y = 55L
# Print values of x and y
x
y
# Print the class name of x and y
class(x)
class(y)
Complex
A complex number is written with an "i" as the imaginary part:
Example
x = 3+5i
y = 5i
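Type Conversion
A value can be converted from one number type to another with the following functions:
as.numeric()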
as.integer()
as.complex()
Example
x = 1L # integer
y = 2 # numeric
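A minimal sketch of the conversion functions listed above, applied to the two variables from the example. The extra value 2.9 is added here for illustration, to show that as.integer() drops the fractional part instead of rounding.

```r
x = 1L   # integer
y = 2    # numeric

a = as.numeric(x)   # integer to numeric
b = as.integer(y)   # numeric to integer
d = as.complex(x)   # integer to complex

class(a)   # "numeric"
class(b)   # "integer"
class(d)   # "complex"

as.integer(2.9)   # 2: the fractional part is dropped, not rounded
```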
Simple Math
In R, you can use operators to perform common mathematical operations on numbers.
Example
10 - 5
Built-in Math Functions
R also has many built-in math functions that allow you to perform mathematical tasks on
numbers.
For example, the min() and max() functions can be used to find the lowest or highest number
in a set:
Example
max(5, 10, 15)
min(5, 10, 15)
sqrt()
The sqrt() function returns the square root of a number:
Example
sqrt(16)
abs()
The abs() function returns the absolute (positive) value of a number:
Example
abs(-4.7)
ceiling() and floor()
The ceiling() function rounds a number upwards to its nearest integer, and the floor() function
rounds a number downwards to its nearest integer:
Example
ceiling(1.4)
floor(1.4)
String Literals
Strings are used for storing text. A string is surrounded by either single or double
quotation marks:
Example
"hello"
'hello'
Assign a String to a Variable
Assigning a string to a variable is done with the variable followed by an assignment operator
(= or <-) and the string:
Example
str = "Hello"
str # print the value of str
String Length
There are many useful string functions in R.
For example, to find the number of characters in a string, use the nchar() function:
Example
str = "Hello World!"
nchar(str)
Combine Two Strings
Use the paste() function to merge/concatenate two strings:
Example
str1 = "Hello"
str2 = "World"
paste(str1, str2)
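By default paste() puts a single space between the strings. As a brief sketch, the separator can be changed with the sep argument, and paste0() joins strings with no separator at all:

```r
str1 = "Hello"
str2 = "World"

paste(str1, str2)               # "Hello World" (default separator is a space)
paste(str1, str2, sep = ", ")   # "Hello, World"
paste0(str1, str2)              # "HelloWorld"
```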
Check a String
Use the grepl() function to check whether a character or a sequence of characters is present
in a string:
Example
str = "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
Booleans (Logical Values)
A comparison in R evaluates to one of two logical answers, TRUE or FALSE.
When you compare two values, the expression is evaluated and R returns the logical answer:
Example
a = 10
b = 9
a > b
You can also test a condition in an if statement. Conditional statements are covered in more
detail in Unit IV.
Example
a = 200
b = 33
if (b > a) {
  print("b is greater than a")
} else {
  print("b is not greater than a")
}