
DIGITAL NOTES

ON
DATA VISUALIZATION

B. TECH II YEAR - II SEM


(2020-21)

DEPARTMENT OF INFORMATION TECHNOLOGY

MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY


(Autonomous Institution – UGC, Govt. of India)
(Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified)
Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, INDIA.

(R18A0555) DATA VISUALIZATION

Course Objectives:
 To learn different statistical methods for data visualization.
 To understand the basics of R and Python.
 To learn the usage of Watson Studio.
 To understand the usage of packages like NumPy, Pandas and Matplotlib.
 To know the functionalities and usage of Seaborn.
UNIT I
Introduction to Statistics : Introduction to Statistics, Difference between inferential statistics
and Descriptive statistics, Inferential Statistics- Drawing Inferences from Data, Random
Variables, Normal Probability Distribution, Sampling, Sample Statistics and Sampling
Distributions.
R overview and Installation- Overview and About R, R and R studio Installation, Descriptive
Data analysis using R, Description of basic functions used to describe data in R.

UNIT II
Data manipulation with R: Data manipulation packages - dplyr, data.table, reshape2, tidyr,
lubridate; Data visualization with R.
Data visualization in Watson Studio: Adding data to data refinery, Visualization of data on
Watson Studio.

UNIT III
Python: Introduction to Python, How to Install, Introduction to Jupyter Notebook, Python
Scripting basics, Numpy and Pandas-Creating and Accessing Numpy Arrays, Introduction to
pandas, read and write csv, Descriptive statistics using pandas, Working with text data and
datetime columns, Indexing and selecting data, groupby, Merge / Join datasets

UNIT IV
Data Visualization Tools in Python - Introduction to Matplotlib, Basic plots using Matplotlib,
Specialized Visualization Tools using Matplotlib, Advanced Visualization Tools using
Matplotlib: Waffle Charts, Word Clouds.

UNIT V
Introduction to Seaborn: Seaborn functionalities and usage, Spatial Visualizations and
Analysis in Python with Folium, Case Study.

TEXT BOOKS:
1. Core Python Programming - Second Edition, R. Nageswara Rao, Dreamtech Press.
2. Hands-On Programming with R by Garrett Grolemund, Shroff/O'Reilly; First edition.
3. Fundamentals of Mathematical Statistics by S.C. Gupta, Sultan Chand & Sons.

REFERENCE BOOKS:
1. Learn R for Applied Statistics: With Data Visualizations, Regressions, and Statistics by
Eric Goh Ming Hui, Apress.
2. Python for Data Analysis by William McKinney, Second Edition, O'Reilly Media Inc.
3. The Comprehensive R Archive Network-https://cran.r-project.org

INDEX

S. No   Unit   Topic
1       I      Introduction to Statistics
2       I      Normal Distribution
3       I      Sampling
4       I      R and R Studio Installation
5       I      Descriptive Data Analysis using R
6       II     Data Manipulation Packages - dplyr
7       II     data.table
8       II     reshape2
9       II     tidyr
10      II     lubridate
11      II     IBM Watson Studio
12      III    Introduction to Python, Jupyter
13      III    NumPy - Creating and Accessing NumPy Arrays
14      III    Introduction to Pandas
15      III    Descriptive Statistics using Pandas
16      III    Groupby
17      III    Merge/Join Data Sets
18      IV     Introduction to Matplotlib
19      IV     Basic Plots using Matplotlib
20      IV     Specialized Data Visualization Tools
21      IV     Advanced Data Visualization Tools
22      V      Introduction to Seaborn
23      V      Spatial Visualizations and Analysis using Folium


UNIT-1
Introduction to Statistics
Statistics is a mathematical science that includes methods for collecting, organizing, analyzing
and visualizing data in such a way that meaningful conclusions can be drawn.
Statistics is also a field of study that summarizes data, interprets it, and supports decisions
based on the data.
Statistics is composed of two broad categories:
1. Descriptive Statistics
2. Inferential Statistics

1. Descriptive Statistics
Descriptive statistics describes the characteristics or properties of the data. It helps to
summarize the data in a meaningful way and allows important patterns to emerge from the
data. Data summarization techniques are used to identify the properties of data and are
helpful in understanding its distribution. Descriptive statistics does not involve generalizing
beyond the data at hand.

1.1 Two types of descriptive statistics

1. Measures of Central Tendency: (Mean , Median , Mode)


2. Measures of data spread or dispersion (range, quartiles, variance and standard deviation)

1.1.1 Measures of Central Tendency: (Mean , Median , Mode)

A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. The mean, median and mode are all valid
measures of central tendency.

Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data.
The mean is equal to the sum of all the values in the data set divided by the number of values
in the data set. So, if a data set has n values x1, x2, ..., xn, the sample mean, usually denoted
by x̄, is

x̄ = (x1 + x2 + ... + xn) / n
An important property of the mean is that it includes every value in the data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
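A small R sketch (with made-up values) shows the calculation and this zero-sum property:

x <- c(7, 10, 21, 33, 43)          # illustrative sample values
mean(x)                            # same as sum(x) / length(x)
sum(x - mean(x))                   # deviations from the mean sum to (numerically) zero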

Median:
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data. It is a holistic measure
and an easy way to approximate the central value of a large data set.

Mode
The mode is the most frequent score in the data set. The mode is used for categorical data
where we want to know which category occurs most often in the population. The greatest
frequency can correspond to several different values, giving a data set with one, two or more
modes; such data sets are called unimodal, bimodal and multimodal respectively. If every
value occurs only once, the data set has no mode.

For a unimodal frequency curve with a symmetric data distribution, the mean, median and
mode are all the same.

In real applications the data are often not symmetric but skewed, either positively or
negatively. In a positively skewed distribution the mode is smaller than the median, while in
a negatively skewed distribution the mode occurs at a value greater than the median.
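A short R sketch with illustrative values; base R has no built-in mode function, so a common workaround using table() is shown:

x <- c(2, 3, 3, 5, 7, 7, 7, 9)     # illustrative, already sorted
median(x)                          # (5 + 7)/2 = 6 for this even-sized set
names(which.max(table(x)))         # most frequent value, i.e. the mode ("7")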

1.1.2 Measures of spread:

Measures of spread are the ways of summarizing a group of data by describing how scores are
spread out. To describe this spread, a number of statistics are available to us, including the
range, quartiles, absolute deviation, variance and standard deviation.
• The degree to which numerical data tend to spread is called the dispersion, or variance of
the data. The common measures of data dispersion: Range, Quartiles, Outliers, and
Boxplots.
Range : Range of the set is the difference between the largest (max()) and smallest (min()) values.
Ex: Step 1: Sort the numbers in order, from smallest to largest: 7, 10, 21, 33, 43, 45,
45, 65, 67, 87, 98, 99

Step 2: Subtract the smallest number in the set from the largest number in the set:
99 – 7 = 92

The range is 92

Quartiles : Percentile : kth percentile of a set of data in numerical order is the value xi
having the property that k percent of the data entries lie at or below xi

• The first quartile (Q1) is the 25th percentile;


• The third quartile (Q3) is the 75th percentile
• The distance between the first and third quartiles is the range covered by the middle
half of the data.
• The interquartile range (IQR) is defined as IQR = Q3 - Q1.
• Outliers are commonly identified as values falling at least 1.5 * IQR above the third quartile
or below the first quartile.
• Five-number summary: median, the quartiles Q1 and Q3, and the smallest and largest
individual observations comprise the five number summary: Minimum; Q1; Median;
Q3; Maximum
Example : Quartiles
• Start with the following data set:
• 1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20
• There are a total of twenty data points in the set. There is an even number of data
values, hence the median is the mean of the tenth and eleventh values.
• the median is: (7 + 8)/2 = 7.5.
• The median of the first half of the set is found between the fifth and sixth values of:
• 1, 2, 2, 3, 4, 6, 6, 7, 7, 7
• Thus the first quartile is found to equal Q1 = (4 + 6)/2 = 5
• To find the third quartile, examine the top half of the original data set. The median of
• 8, 11, 12, 15, 15, 15, 17, 17, 18, 20
• is (15 + 15)/2 = 15. Thus the third quartile Q3 = 15.
A small interquartile range indicates data that is clumped about the median. A larger
interquartile range shows that the data is more spread out
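The same kind of quartile calculation can be reproduced in R with quantile() and IQR(); note that quantile() uses an interpolation rule that may differ slightly from the hand calculation above:

x <- c(1, 2, 2, 3, 4, 6, 6, 7, 7, 7, 8, 11, 12, 15, 15, 15, 17, 17, 18, 20)
quantile(x)        # minimum, Q1, median, Q3, maximum
IQR(x)             # Q3 - Q1
fivenum(x)         # Tukey's five-number summary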
Variance and Standard Deviation
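The variance is the average squared deviation of the values from their mean, and the standard deviation is its square root. A minimal R sketch with illustrative values (base R's var() and sd() compute the sample versions, dividing by n - 1):

x <- c(7, 10, 21, 33, 43, 45)
var(x)    # sample variance
sd(x)     # sample standard deviation, equal to sqrt(var(x))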

Inferential Statistics – Definition and Types
Inferential statistics is generally used when the user needs to make a conclusion about the whole
population at hand, and this is done using the various types of tests available. It is a technique
which is used to understand trends and draw the required conclusions about a large population
by taking and analyzing a sample from it. Descriptive statistics, on the other hand, is only about
the smaller sized data set at hand – it usually does not involve large populations. Using
variables and the relationships between them from the sample, we will be able to make
generalizations and predict other relationships within the whole population, regardless of how
large it is.

With inferential statistics, data is taken from samples and generalizations are made about
a population. Inferential statistics use statistical models to compare sample data to other
samples or to previous research.

There are two main areas of inferential statistics:

1. Estimating parameters:
This means taking a statistic from the sample data (for example the sample mean) and using it
to infer a population parameter (i.e. the population mean). There may be sampling
variations because of chance fluctuations, variations in sampling techniques, and other
sampling errors, and estimates of population characteristics may be influenced by such factors.
Therefore, the important point in estimation is how close our estimate is to the true value.
Characteristics of Good Estimator: A good statistical estimator should have the following
characteristics, (i) Unbiased (ii) Consistent (iii) Accuracy
i) Unbiased
An unbiased estimator is one for which, if we were to obtain an infinite number of random
samples of a certain size, the mean of the statistic would be equal to the parameter. The sample
mean x̄ is an unbiased estimate of the population mean (μ) because, over all possible random
samples of size N from a population, the mean of the sample means would equal μ.
ii) Consistent
A consistent estimator is one for which, as the sample size increases, the probability that the
estimate has a value close to the parameter also increases. Because the sample mean is a
consistent estimator, a sample mean based on 20 scores has a greater probability of being
closer to (μ) than does a sample mean based upon only 5 scores.
iii) Accuracy
The sample mean is an unbiased and consistent estimator of the population mean (μ). But we
should not overlook the fact that an estimate is just a rough or approximate calculation. It is
unlikely in any estimate that x̄ will be exactly equal to the population mean (μ). Whether or not
x̄ is a good estimate of (μ) depends upon the representativeness of the sample, the sample size,
and the variability of scores in the population.

2. Hypothesis tests. This is where sample data can be used to answer research questions.
For example, we might be interested in knowing if a new cancer drug is effective. Or if
breakfast helps children perform better in schools.

Inferential statistics is closely tied to the logic of hypothesis testing. We hypothesize that a
value characterises the population of observations, and the question is whether that hypothesis
is reasonable given the evidence from the sample. Sometimes hypothesis testing is referred to
as the statistical decision-making process. In day-to-day situations we are required to take
decisions about the population on the basis of sample information.

2.6.1 Statement of Hypothesis


A statistical hypothesis is defined as a statement, which may or may not be true about the
population parameter or about the probability distribution of the parameter that we wish to
validate on the basis of sample information. Most times, experiments are performed with
random samples instead of the entire population and inferences drawn from the observed results
are then generalised over to the entire population. But before drawing inferences about the
population, it should always be kept in mind that the observed results might have arisen due to
a chance factor. In order to have an accurate or more precise inference, the chance factor should
be ruled out.
Null Hypothesis
The probability of chance occurrence of the observed results is examined by the null hypothesis
(H0 ). Null hypothesis is a statement of no differences. The other way to state null hypothesis
is that the two samples came from the same population. Here, we assume that population is
normally distributed and both the groups have equal means and standard deviations.
Since the null hypothesis is a testable proposition, there is a counter-proposition to it known as
the alternative hypothesis, denoted by H1. In contrast to the null hypothesis, the alternative
hypothesis (H1) proposes that
i) the two samples belong to two different populations,
ii) their means are estimates of two different parametric means of the respective
population, and
iii) there is a significant difference between their sample means.
The alternative hypothesis (H1 ) is not directly tested statistically; rather its acceptance or
rejection is determined by the rejection or retention of the null hypothesis. The probability ‘p’
of the null hypothesis being correct is assessed by a statistical test. If probability ‘p’ is too low,
H0 is rejected and H1 is accepted.
It is inferred that the observed difference is significant. If probability ‘p’ is high, H0 is accepted
and it is inferred that the difference is due to the chance factor and not due to the variable factor.

2.6.2 Level of Significance
The level of significance is defined as the probability of rejecting a null hypothesis by the test
when it is really true, which is denoted as α. That is, P (Type I error) = α.

Confidence level:
Confidence level refers to the probability that a parameter lies within a specified range of
values, and it is denoted by c. The confidence level is connected with the level of
significance: the relationship between the level of significance and the confidence level is c = 1 − α.
The common level of significance and the corresponding confidence level are given below:

• The level of significance 0.10 is related to the 90% confidence level.


• The level of significance 0.05 is related to the 95% confidence level.
• The level of significance 0.01 is related to the 99% confidence level.
The rejection rule is as follows:

Rejection region:
The rejection region is the values of test statistic for which the null hypothesis is rejected.

Non-rejection region:

The set of all possible values of the test statistic for which the null hypothesis is not rejected is
called the non-rejection region.
For a two-tailed test, the rejection region lies in both tails of the distribution of the test statistic.
For a one-tailed test, the rejection region lies in a single tail: in a left-tailed test the rejection
region is in the left tail, and in a right-tailed test it is in the right tail.
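As an illustrative R sketch, the critical values bounding these rejection regions for a normal (z) test statistic can be obtained with qnorm(), assuming α = 0.05:

alpha <- 0.05
qnorm(1 - alpha/2)   # two-tailed critical value, about 1.96: reject H0 if |z| exceeds this
qnorm(1 - alpha)     # right-tailed critical value, about 1.645
qnorm(alpha)         # left-tailed critical value, about -1.645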

2.6.3 One-tail and Two-tail Test


Depending upon the statement in the alternative hypothesis (H1), either a one-tailed or a
two-tailed test is chosen for assessing statistical significance. A one-tailed test is a directional
test: it is formulated to test both the magnitude and the direction (algebraic sign) of the observed
difference between two statistics. Thus, in a one-tailed test the researcher is interested in testing
whether one sample mean is significantly higher (or, alternatively, lower) than the other sample
mean, whereas a two-tailed test considers a difference in either direction.

Types of Inferential Statistics Tests

There are many tests in this field, of which some of the most important are mentioned below.
1. Linear Regression Analysis
In this test, a linear algorithm is used to understand the relationship between two variables from
the data set. One of those variables is the dependent variable, while there can be one or more
independent variables used. In simpler terms, we try to predict the value of the dependent
variable based on the available values of the independent variables. This is usually represented
using a scatter plot, although other types of graphs can also be used.

2. Analysis of Variance
This is another statistical method which is extremely popular in data science. It is used to test
and analyse the differences between two or more means from the data set. The significant
differences between the means are obtained, using this test.

3. Analysis of Co-variance
This is only a development on the Analysis of Variance method and involves the inclusion of
a continuous co-variance in the calculations. A co-variate is an independent variable which is
continuous, and is used as regression variables. This method is used extensively in statistical
modelling, in order to study the differences present between the average values of dependent
variables.

4. Statistical Significance (T-Test)


A relatively simple test in inferential statistics, this is used to compare the means of two groups
and understand whether they are different from each other. The magnitude of the difference,
and how significant it is, can be obtained from this test.
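A minimal sketch of such a comparison in R with the built-in t.test() function; the two groups below are invented for illustration:

group_a <- c(5.1, 4.9, 5.6, 5.0, 5.3)
group_b <- c(6.0, 5.8, 6.3, 5.9, 6.1)
t.test(group_a, group_b)   # Welch two-sample t-test; a small p-value suggests the means differ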

5. Correlation Analysis
Another extremely useful test, this is used to understand the extent to which two variables are
dependent on each other. The strength of any relationship, if they exist, between the two
variables can be obtained from this. You will be able to understand whether the variables have
a strong correlation or a weak one. The correlation can also be negative or positive, depending
upon the variables. A negative correlation means that the value of one variable decreases while
the value of the other increases, and a positive correlation means that the values of both
variables decrease or increase together.
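A short R sketch using cor(); the height and weight values are borrowed from the student table used later in these notes, and the coefficient lies between -1 and +1:

height <- c(151, 174, 138, 186, 128)
weight <- c(63, 81, 56, 91, 47)
cor(height, weight)                        # Pearson correlation, close to +1 here
cor(height, weight, method = "spearman")   # rank-based alternative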

Differences between Descriptive and Inferential Statistics

Descriptive Statistics | Inferential Statistics
Concerned with describing the target population | Makes inferences from the sample and generalizes them to the population
Organises, analyses and presents the data in a meaningful way | Compares, tests and predicts future outcomes
The analysed results are in the form of graphs, charts etc. | The analysed results are probability scores
Describes the data which is already known | Tries to draw conclusions about the population beyond the data available
Tools: measures of central tendency and measures of spread | Tools: hypothesis tests, analysis of variance etc.

Random Variables

A random variable, X, is a variable whose possible values are numerical outcomes of a random
phenomenon. There are two types of random variables, discrete and continuous.

Example of Random variable

- A person’s blood type


- Number of leaves on a tree
- Number of times a user visits LinkedIn in a day
- Length of a tweet.

Discrete Random Variables :


A discrete random variable is one which may take on only a countable number of distinct
values such as 0,1,2,3,4,........ Discrete random variables are usually counts. If a random
variable can take only a finite number of distinct values, then it must be discrete. Examples of
discrete random variables include the number of children in a family, the Friday night
attendance at a cinema, the number of patients in a doctor's surgery, the number of defective
light bulbs in a box of ten.
The probability distribution of a discrete random variable is a list of probabilities associated
with each of its possible values. It is also sometimes called the probability function or the
probability mass function
Suppose a random variable X may take k different values, with the probability that X =
xi defined to be P(X = xi) = pi. The probabilities pi must satisfy the following:
1: 0 < pi < 1 for each i
2: p1 + p2 + ... + pk = 1.
Example

Suppose a variable X can take the values 1, 2, 3, or 4. The probabilities associated with each
outcome are described by the following table:

Outcome      1    2    3    4
Probability  0.1  0.3  0.4  0.2

The probability that X is equal to 2 or 3 is the sum of the two probabilities:
P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is
greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1 = 0.9, by the complement rule.
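The same calculation can be written in R by storing the outcomes and their probabilities in vectors:

outcomes <- c(1, 2, 3, 4)
probs <- c(0.1, 0.3, 0.4, 0.2)
sum(probs)                          # the probabilities must sum to 1
sum(probs[outcomes %in% c(2, 3)])   # P(X = 2 or X = 3) = 0.7
1 - probs[outcomes == 1]            # P(X > 1) = 0.9 by the complement rule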

Continuous Random Variables


A continuous random variable is one which takes an infinite number of possible values.
Continuous random variables are usually measurements. Examples include height, weight, the
amount of sugar in an orange, the time required to run a mile.
A continuous random variable is not defined at specific values. Instead, it is defined over
an interval of values, and is represented by the area under a curve (known as an integral). The
probability of observing any single value is equal to 0, since the number of values which may
be assumed by the random variable is infinite.
Suppose a random variable X may take all values over an interval of real numbers. Then the
probability that X is in the set of outcomes A, P(A), is defined to be the area above A and
under a curve. The curve, which represents a function p(x), must satisfy the following:
1: The curve has no negative values (p(x) > 0 for all x)
2: The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.
All random variables (discrete and continuous) have a cumulative distribution function. It is
a function giving the probability that the random variable X is less than or equal to x, for
every value x. For a discrete random variable, the cumulative distribution function is found
by
summing up the probabilities.
Normal Probability Distribution

The Bell-Shaped Curve


The Bell-shaped Curve is commonly called the normal curve and is mathematically referred
to as the Gaussian probability distribution. Unlike Bernoulli trials which are based on discrete
counts, the normal distribution is used to determine the probability of a continuous random
variable.

The normal or Gaussian probability distribution is the most popular and important because of its
unique mathematical properties, which facilitate its application to practically any physical
problem in the real world. Its density is

f(x) = (1 / (σ√(2π))) e^(-(x - μ)² / (2σ²))

where the constants μ and σ² are the parameters;
 "μ" is the population true mean (or expected value) of the subject phenomenon
characterized by the continuous random variable, X,
 "σ²" is the population true variance characterized by the continuous random
variable, X.
 Hence, "σ" is the population standard deviation characterized by the continuous random
variable X;
 the points located at μ−σ and μ+σ are the points of inflection; that is, where the graph
changes from concave up to concave down.
The normal curve (the graph of the normal probability distribution) is symmetric with
respect to the mean μ as the central position. That is, the area between μ and κ units to
the left of μ is equal to the area between μ and κ units to the right of μ.

There is not a unique normal probability distribution but a whole family of them. For a fixed
value of σ², varying μ shifts the curve along the horizontal axis without changing its shape,
while for a fixed μ, varying σ² changes how spread out or peaked the curve is.
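In R the normal density and its probabilities are available through dnorm() and pnorm(); a small sketch assuming μ = 100 and σ = 15:

mu <- 100; sigma <- 15
dnorm(mu, mean = mu, sd = sigma)                              # height of the density at the mean
pnorm(mu + sigma, mu, sigma) - pnorm(mu - sigma, mu, sigma)   # P(μ - σ < X < μ + σ), about 0.68
curve(dnorm(x, mu, sigma), from = 55, to = 145)               # draws the bell-shaped curve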

SAMPLING and SAMPLING DISTRIBUTION

Sampling is a process used in statistical analysis in which a predetermined number of


observations are taken from a larger population. It helps us to make statistical inferences
about the population. A population can be defined as a whole that includes all items and
characteristics of the research taken into study. However, gathering all this information is
time consuming and costly. We therefore make inferences about the population with the
help of samples.
Random sampling:
In data collection, every individual observation has equal probability to be selected into a
sample. In random sampling, there should be no pattern when drawing a sample.
Probability sampling:
It is the sampling technique in which every individual unit of the population has greater
than zero probability of getting selected into a sample.
Non-probability sampling:
It is the sampling technique in which some elements of the population have no probability
of getting selected into a sample.
Cluster samples:
It divides the population into groups (clusters). Then a random sample is chosen from the
clusters.
Systematic sampling : select sample elements from an ordered frame. A sampling
frame is just a list of participants that we want to get a sample from.
Stratified sampling : sample each subpopulation independently. First, divide the
population into homogeneous (very similar) subgroups before getting the sample. Each
population member only belongs to one group. Then apply simple random or a systematic
method within each group to choose the sample.
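A small R sketch of drawing such samples from an assumed frame of 1000 numbered units, using the base sample() function:

population <- 1:1000                                  # assumed ordered sampling frame
srs <- sample(population, size = 50)                  # simple random sample without replacement
start <- sample(1:20, 1)                              # random starting point
systematic <- population[seq(start, 1000, by = 20)]   # systematic sample: every 20th unit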

Sampling Distribution
A sampling distribution is a probability distribution of a statistic. It is obtained through a large
number of samples drawn from a specific population. It is the distribution of all possible values
taken by the statistic when all possible samples of a fixed size n are taken from the population.

Sampling Distributions and Inferential Statistics
Sampling distributions are important for inferential statistics. In theory, a population is specified
and the sampling distribution of the mean and of the range can be determined. In practice, the
process proceeds the other way: the sample data are collected and from these data we estimate
parameters of the sampling distribution. This knowledge of the sampling distribution can be
very useful. Knowing the degree to which means from different samples would differ from each
other and from the population mean gives an idea of how close a particular sample mean is
likely to be to the population mean.
The most common measure of how much sample means differ from each other is the
standard deviation of the sampling distribution of the mean. This standard deviation is
called the standard error of the mean.
If all the sample means were very close to the population mean, then the standard error of
the mean would be small. On the other hand, if the sample means varied considerably, then
the standard error of the mean would be large.

Sampling distribution of the sample mean

1. We take many random samples of a given size n from a population with mean µ and
standard deviation σ.
2. Some sample means will be above the population mean µ and some will be below, making
up the sampling distribution.
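This can be illustrated with a small simulation sketch in R, assuming a normal population with µ = 50 and σ = 10:

set.seed(1)                       # for reproducibility
n <- 25
sample_means <- replicate(1000, mean(rnorm(n, mean = 50, sd = 10)))
sd(sample_means)                  # close to the standard error sigma/sqrt(n) = 2
hist(sample_means)                # roughly normal and centred on the population mean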

R overview and Installation
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team.
The core of R is an interpreted computer language which allows branching and looping as
well as modular programming using functions. R allows integration with the procedures
written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions
are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft licence, and an official part of the
GNU project, called GNU S.

Features of R

 R is a well-developed, simple and effective programming language which includes


conditionals, loops, user defined recursive functions and input and output facilities.
 R has an effective data handling and storage facility,
 R provides a suite of operators for calculations on arrays, lists, vectors and matrices.

 R provides a large, coherent and integrated collection of tools for data analysis.
 R provides graphical facilities for data analysis and display, either directly on the
computer screen or as printed output.

To Install R:
1. Open an internet browser and go to www.r-project.org.
2. Click the "download R" link in the middle of the page under "Getting Started."
3. Select a CRAN location (a mirror site) and click the corresponding link.
4. Click on the "Download R for Windows" link at the top of the page.
5. Click on the "install R for the first time" link at the top of the page.
6. Click "Download R for Windows" and save the executable file somewhere on computer. Run
the .exe file and follow the installation instructions.
7. Now that R is installed, next step is to download and install RStudio.

To Install RStudio

1. Go to www.rstudio.com and click on the "Download RStudio" button.


2. Click on "Download RStudio Desktop."
3. Click on the version recommended for your system, or the latest Windows version, and save
the executable file. Run the .exe file and follow the installation instructions.

R Command Prompt

Once R environment setup is done, then it’s easy to start R command prompt by just typing
the following command at command prompt – “$ R”
This will launch R interpreter and will get a prompt > where we can start typing your program
as follows −
> myString <- "Hello, World!"
> print ( myString)

[1] "Hello, World!"


Here first statement defines a string variable myString, where we assign a string "Hello,
World!" and then next statement print() is being used to print the value stored in variable
myString.

R Script File

Scripts can be executed at the command prompt with the help of the R interpreter called Rscript.
# My first program in R Programming
myString <- "Hello, World!"
print ( myString)
Save the above code in a file test.R and execute it at command prompt as given below.
$ Rscript test.R
When we run the above program, it produces the following result.
"Hello, World!"
Comments

Comments are like helping text in your R program and they are ignored by the interpreter
while executing actual program. Single comment is written using # in the beginning of the
statement as follows −
# My first program in R Programming
R does not support multi-line comments but they can be written as follows:
"This is a demo for multi-line comments and it should be put inside either a
single OR double quote"

myString <- "Hello, World!"


print ( myString)
Result for above code is:
"Hello, World!"

R data types:

The variables are assigned with R-Objects and the data type of the R-object becomes the data
type of the variable. There are many types of R-objects. The frequently used ones are −

 Vectors
 Lists
 Matrices
 Arrays
 Factors
 Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic
vectors.
Data Type   Example                              Verify
Logical     TRUE, FALSE                          v <- TRUE; print(class(v))                # [1] "logical"
Numeric     12.3, 5, 999                         v <- 23.5; print(class(v))                # [1] "numeric"
Integer     2L, 34L, 0L                          v <- 2L; print(class(v))                  # [1] "integer"
Complex     3 + 2i                               v <- 2+5i; print(class(v))                # [1] "complex"
Character   'a', "good", "TRUE", '23.4'          v <- "TRUE"; print(class(v))              # [1] "character"
Raw         "Hello" is stored as 48 65 6c 6c 6f  v <- charToRaw("Hello"); print(class(v))  # [1] "raw"
In R programming, the very basic data type is the R-object called a vector, whose elements
belong to one of the atomic classes shown above.

Vectors

When you want to create vector with more than one element, you should use c() function
which means to combine the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result −
"red" "green" "yellow"
"character"

Lists

A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to
the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"

Arrays

While matrices are confined to two dimensions, arrays can be of any number of dimensions.
The array function takes a dim attribute which creates the required number of dimension. In
the below example we create an array with two elements which are 3x3 matrices each.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
,,1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
,,2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"

Data Frames

Data frames are tabular data objects. Unlike a matrix in data frame each column can contain
different modes of data. The first column can be numeric while the second column can be
character and third column can be logical. It is a list of vectors of equal length. Data Frames
are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame( gender = c("Male", "Male","Female"), height = c(152, 171.5, 165),
weight = c(81,93, 78), Age = c(42,38,26) )
print(BMI)
Result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26

R - Variables

A variable provides us with named storage that our programs can manipulate. A variable in R
can store an atomic vector, group of atomic vectors or a combination of many R objects. A
valid variable name consists of letters, numbers and the dot or underline characters. The
variable name starts with a letter or the dot not followed by a number.
Variable Name Validity Reason

var_name2. valid Has letters, numbers, dot and underscore

var_name% Invalid Has the character '%'. Only dot(.) and underscore allowed.

2var_name invalid Starts with a number

.var_name, var.name valid Can start with a dot(.) but the dot(.) should not be followed by a number.

.2var_name invalid The starting dot is followed by a number making it invalid.

_var_name invalid Starts with _ which is not valid

R - Operators

An operator is a symbol that tells the compiler to perform specific mathematical or logical
manipulations. R language is rich in built-in operators and provides following types of
operators.

Types of Operators

types of operators in R programming −

 Arithmetic Operators
 Relational Operators
 Logical Operators
 Assignment Operators
 Miscellaneous Operators
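A brief sketch with arbitrary values illustrating each group of operators:

a <- c(2, 4, 6); b <- c(1, 4, 9)   # <- is the assignment operator
a + b; a * b                       # arithmetic operators (element-wise)
a > b; a == b                      # relational operators
(a > 1) & (b < 5)                  # logical operators
4 %in% a                           # %in% is one of the miscellaneous operators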

Descriptive Data analysis using R:
R provides a wide range of functions for obtaining summary statistics. One method of obtaining
descriptive statistics is to use the sapply( ) function with a specified summary statistic.
sapply(mydata, mean, na.rm=TRUE)

Possible functions used in sapply include mean, sd, var, min, max, median, range, and
quantile.
Check your data

You can inspect your data using the functions head() and tail(), which display the first
and the last part of the data, respectively.
# Print the first 6 rows
head(my_data, 6)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa

R functions for computing descriptive statistics

Some R functions for computing descriptive statistics:

Description R function
Mean mean()
Standard deviation sd()
Variance var()
Minimum min()
Maximum max()
Median median()
Range of values (minimum and maximum) range()
Sample quantiles quantile()
Generic function summary()
Interquartile range IQR()

Descriptive statistics for a single group

Measure of central tendency: mean, median, mode

Roughly speaking, the central tendency measures the “average” or the “middle” of your data.
The most commonly used measures include:

DATA VISUALIZATION 19
 the mean: the average value. It’s sensitive to outliers.
 the median: the middle value. It’s a robust alternative to mean.
 and the mode: the most frequent value
In R,

 The function mean() and median() can be used to compute the mean and the median,
respectively;
 The function mfv() [in the modeest R package] can be used to compute the mode of a
variable.
The R code below computes the mean, median and the mode of the
variable Sepal.Length [in my_data data set]:

# Compute the mean value


mean(my_data$Sepal.Length)
[1] 5.843333

# Compute the median value


median(my_data$Sepal.Length)
[1] 5.8

# Compute the mode


# install.packages("modeest")
require(modeest)
mfv(my_data$Sepal.Length)
[1] 5

Measure of variability

Measures of variability gives how “spread out” the data are.

Range: minimum & maximum

 Range corresponds to biggest value minus the smallest value. It gives you the full spread
of the data.
# Compute the minimum value
min(my_data$Sepal.Length)
[1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
[1] 7.9
# Range
range(my_data$Sepal.Length)
[1] 4.3 7.9

Interquartile range

The interquartile range (IQR) - corresponding to the difference between the first and third
quartiles - is sometimes used as a robust alternative to the standard deviation.
 R function:
quantile(x, probs = seq(0, 1, 0.25))
 x: numeric vector whose sample quantiles are wanted.
 probs: numeric vector of probabilities with values in [0,1].
 Example:
quantile(my_data$Sepal.Length)
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
To compute deciles (0.1, 0.2, 0.3, …., 0.9), use this:

quantile(my_data$Sepal.Length, seq(0, 1, 0.1))


To compute the interquartile range, type this:

IQR(my_data$Sepal.Length)
[1] 1.3
Variance and standard deviation

The variance represents the average squared deviation from the mean. The standard deviation
is the square root of the variance. It measures the average deviation of the values, in the data,
from the mean value.

# Compute the variance


var(my_data$Sepal.Length)

# Compute the standard deviation =


# square root of the variance
sd(my_data$Sepal.Length)

Computing an overall summary of a variable and an entire data frame


summary() function
 Summary of a single variable. Six values are returned: the minimum, 1st quartile, median,
mean, 3rd quartile and maximum, in one single line call:
summary(my_data$Sepal.Length)
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.300 5.100 5.800 5.843 6.400 7.900
 Summary of a data frame. In this case, the function summary() is automatically applied
to each column. The format of the result depends on the type of the data contained in the
column. For example:
o If the column is a numeric variable, mean, median, min, max and quartiles are returned.
o If the column is a factor variable, the number of observations in each group is returned.
summary(my_data, digits = 1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4 Min. :2 Min. :1 Min. :0.1 setosa :50
1st Qu.:5 1st Qu.:3 1st Qu.:2 1st Qu.:0.3 versicolor:50
Median :6 Median :3 Median :4 Median :1.3 virginica :50
Mean :6 Mean :3 Mean :4 Mean :1.2
3rd Qu.:6 3rd Qu.:3 3rd Qu.:5 3rd Qu.:1.8
Max. :8 Max. :4 Max. :7 Max. :2.5

sapply() function

# Compute the mean of each column


sapply(my_data[, -5], mean)
Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
# Compute quartiles
sapply(my_data[, -5], quantile)
Sepal.Length Sepal.Width Petal.Length Petal.Width
0% 4.3 2.0 1.00 0.1
25% 5.1 2.8 1.60 0.3
50% 5.8 3.0 4.35 1.3
75% 6.4 3.3 5.10 1.8
100% 7.9 4.4 6.90 2.5
Description of Basic Functions used to Describe Data in R

builtins() # List all built-in functions


help() or ? or ?? #i.e. help(boxplot)
getwd() and setwd() # working with a file directory
q() #To close R
ls() #Lists all user defined objects.
rm() #Removes objects from an environment.
demo() #Lists the demonstrations in the packages that are loaded.
demo(package = .packages(all.available = TRUE)) #Lists the demonstrations in all installed packages.
?NA # Help page on handling of missing data values
abs(x) # The absolute value of "x"
append() # Add elements to a vector
cat(x) # Prints the arguments
cbind() # Combine vectors by row/column (cf. "paste" in Unix)
grep() # Pattern matching
identical() # Test if 2 objects are *exactly* equal
length(x) # Return no. of elements in vector x
ls() # List objects in current environment
mat.or.vec() # Create a matrix or vector
paste(x) # Concatenate vectors after converting to character
range(x) # Returns the minimum and maximum of x
rep(1,5) # Repeat the number 1 five times
rev(x) # List the elements of "x" in reverse order
seq(1,10,0.4) # Generate a sequence (1 -> 10, spaced by 0.4)
sequence() # Create a vector of sequences
sign(x) # Returns the signs of the elements of x
sort(x) # Sort the vector x
order(x) # list sorted element numbers of x
tolower(),toupper() # Convert string to lower/upper case letters
unique(x) # Remove duplicate entries from vector
vector() # Produces a vector of given length and mode
formatC(x) # Format x using 'C' style formatting specifications
floor(x), ceiling(x), # rounding functions
round(x), signif(x), trunc(x)
Sys.time() # Return system time
Sys.Date() # Return system date
getwd() # Return working directory
setwd() # Set working directory

Inferential statistics using R


Simple linear regression analysis
• Regression analysis is a very widely used statistical tool to establish a relationship
model between two variables
• One of these variable is called predictor variable
• The other variable is called response variable
• The general mathematical equation for a linear regression is y = mx + b

Register_no Name Dept CGPA Height Weight


1 18N312001 JOHN IT 8.5 151 63
2 18N312005 SIM CSE 9.2 174 81
3 18N312011 TIM IT 9.5 138 56
4 18N312061 LILLY IT 9.34 186 91
5 18N312099 CARL MECH 8.12 128 47

lm() Function
• This function creates the relationship model between the predictor and the response variable.
• The basic syntax for the lm() function in linear regression is lm(formula, data).

# Apply the lm() function.
relation <- lm(stud.data$weight ~ stud.data$height)
print(relation)

Output:
Coefficients:
(Intercept)          x
   -38.4551     0.6746

UNIT-II

Introduction

Data Manipulation is an important phase of predictive modeling. A robust predictive model
cannot be built using machine learning algorithms alone; it also requires an approach to
understand the business problem and the underlying data, and business insights are extracted by
performing the required data manipulations. Among the several phases of model building, most
of the time is usually spent in understanding the underlying data and performing the required
manipulations.

Data Manipulation

It involves ‘manipulating’ data using available set of variables. This is done to enhance
accuracy and precision associated with data. Actually, the data collection process can
have many loopholes. There are various uncontrollable factors which lead to inaccuracy in data
such as mental situation of respondents, personal biases, difference / error in readings of
machines etc. To lessen these inaccuracies, data manipulation is done to increase the possible
(highest) accuracy in data. This stage is also known as data wrangling or data cleaning.

Different Ways to Manipulate / Treat Data:

 Manipulating data using inbuilt base R functions. This is the first step, but is often
repetitive and time consuming. Hence, it is a less efficient way to solve the problem.
 Use of packages for data manipulation. CRAN has more than 8000 packages available
today. These packages are collections of pre-written, commonly used pieces of code. They
help to perform repetitive tasks fast, reduce errors in coding and take advantage of code
written by experts (across the open-source eco-system for R) to make code more
efficient. This is usually the most common way of performing data manipulation.
 Use of Machine Learning (ML) algorithms for data manipulation. ML algorithms such as
tree-based boosting algorithms can take care of missing data and outliers, and are
less time consuming.

Note: Install packages using:

install.packages('package name')

List of Packages

1. dplyr
2. data.table
3. ggplot2
4. reshape2
5. readr
6. tidyr

7. lubridate

dplyr Package

This package is created and maintained by Hadley Wickham. This package has everything
(almost) to accelerate data manipulation efforts. It is known best for data exploration and
transformation. Its chaining syntax makes it highly adaptive to use. It includes 5 major data
manipulation commands:

1. filter – It filters the data based on a condition


2. select – It is used to select columns of interest from a data set
3. arrange – It is used to arrange data set values on ascending or descending order
4. mutate – It is used to create new variables from existing variables
5. summarise (with group_by) – It is used to perform analysis with commonly used operations
such as min, max, mean, count etc.

Note : 2 pre-installed R data sets namely mtcars and iris.

> library(dplyr)

> data("mtcars")
> data('iris')

> mydata <- mtcars


#read data
> head(mydata)

#creating a local dataframe.

Local data frames are easier to read

> mynewdata <- tbl_df(mydata)

> myirisdata <- tbl_df(iris)

#now data will be in tabular structure

> mynewdata

> myirisdata

#use filter to filter data with required condition


> filter(mynewdata, cyl > 4 & gear > 4 )

> filter(mynewdata, cyl > 4)

> filter(myirisdata, Species %in% c('setosa', 'virginica'))

#use select to pick columns by name


> select(mynewdata, cyl,mpg,hp)

#here you can use (-) to hide columns


> select(mynewdata, -cyl, -mpg )

#hide a range of columns
> select(mynewdata, -c(cyl,mpg))

#select series of columns

> select(mynewdata, cyl:gear)

#chaining or pipelining - a way to perform multiple operations #in one line


> mynewdata %>% select(cyl, wt, gear)%>% filter(wt > 2)

#arrange can be used to reorder rows
> mynewdata%>% select(cyl, wt, gear)%>% arrange(wt)

> mynewdata%>% select(cyl, wt, gear)%>% arrange(desc(wt))

#mutate - create new variables

> mynewdata %>% select(mpg, cyl)%>% mutate(newvariable = mpg*cyl)

> newvariable <- mynewdata %>% mutate(newvariable = mpg*cyl)

#summarise - this is used to find insights from data

> myirisdata%>% group_by(Species)%>% summarise(Average = mean(Sepal.Length,


na.rm = TRUE))

#summarise each

> myirisdata%>% group_by(Species)%>% summarise_each(funs(mean, n()),


Sepal.Length, Sepal.Width)

#rename variables using the rename command

> mynewdata %>% rename(miles = mpg)

data.table Package
This package allows faster manipulation of a data set. A data table expression has 3 parts,
namely DT[i, j, by]: we tell R to subset the rows using 'i', to calculate 'j', grouped
by 'by'. Most of the time, 'by' relates to a categorical variable.
#load data

> data("airquality")
> mydata <- airquality
> head(airquality,6)

#load package

> library(data.table)

> mydata <- data.table(mydata)
> mydata

> myiris <- data.table(myiris)

> myiris

#subset rows - select 2nd to 4th row

> mydata[2:4,]

#select columns with particular values

> myiris[Species == 'setosa']

#select columns with multiple values. This will give you columns with Setosa #and virginica
species

> myiris[Species %in% c('setosa', 'virginica')]

#select columns. Returns a vector

> mydata[,Temp]

> mydata[,.(Temp,Month)]

#returns sum of selected column

> mydata[,sum(Ozone, na.rm = TRUE)]

[1]4887

#returns sum and standard deviation

> mydata[,.(sum(Ozone, na.rm = TRUE), sd(Ozone, na.rm = TRUE))]

#print and plot

> myiris[, {print(Sepal.Length); plot(Sepal.Width); NULL}]

#grouping by a variable

> myiris[,.(sepalsum = sum(Sepal.Length)), by=Species]

#select a column for computation, hence need to set the key on column

> setkey(myiris, Species)

#selects all the rows associated with this data point

> myiris['setosa']
> myiris[c('setosa', 'virginica')]
ggplot2 Package

ggplot offers a whole new world of colors and patterns. Plotting 3 graphs: Scatter Plot, Bar
Plot, Histogram. ggplot is enriched with customized features to make visualization better. It
becomes even more powerful when grouped with other packages like cowplot, gridExtra.

Scatter Plot :
A Scatter Plot is a graph in which the values of two variables are plotted along two axes,
the pattern of the resulting points revealing any correlation present.

With scatter plots we can explain how the variables relate to each other. Which is defined
as correlation. Positive, Negative, and None (no correlation) are the three types of
correlation.
Limitations of a Scatter Diagram
Below are the few limitations of a scatter diagram:
• With scatter diagrams we cannot get the exact extent of correlation.
• A quantitative measure of the relationship between the variables cannot be obtained; the
diagram only shows the relationship qualitatively.
• The relationship can only be shown for two variables.
Advantages of a Scatter Diagram
Below are the few advantages of a scatter diagram:
• Relationship between two variables can be viewed.
• For non-linear pattern, this is the best method.
• Maximum and minimum value, can be easily determined.
• Observation and reading is easy to understand
• Plotting the diagram is very simple.

Bar Plot
A barplot (or barchart) is one of the most common type of graphic. It shows the
relationship between a numeric variable and a categoric variable.
Bar Plot are classified into four types of graphs - bar graph or bar chart, line graph, pie
chart, and diagram.
Limitations of Bar Plot:
Bar graphs do not help when we try to display continuous changes such as acceleration.
Advantages of Bar plot:
• Bar charts are easy to understand and interpret.
• The relationship between bar size and value makes comparison easy.
• They're simple to create.
• They can help in presenting very large or very small values easily.

Histogram
A histogram represents the frequency distribution of continuous variables, while a bar
graph is a diagrammatic comparison of discrete variables.
A histogram presents numerical data whereas a bar graph shows categorical data.
The histogram is drawn in such a way that there is no gap between the bars.

Limitations of Histogram:
A histogram can present data in a misleading way, since the picture depends on the number of
bars used.
Only two sets of data are used, but to analyze certain types of statistical data, more than two
sets of data are necessary.

Advantages of Histogram:
Histogram helps to identify different data, the frequency of the data occurring in the dataset
and categories which are difficult to interpret in a tabular form. It helps to visualize the
distribution of the data.

> library(ggplot2)
> library(gridExtra)
> library(cowplot)   # provides background_grid() and plot_grid() used below
> df <- ToothGrowth
> df$dose <- as.factor(df$dose)
> head(df)

BOX PLOT

> bp <- ggplot(df, aes(x = dose, y = len, color = dose)) + geom_boxplot() +


theme(legend.position = 'none')

> bp

#add gridlines

> bp + background_grid(major = "xy", minor = 'none')

SCATTER PLOT

> sp <- ggplot(mpg, aes(x = cty, y = hwy, color = factor(cyl)))+geom_point(size = 2.5)

> sp

BAR PLOT

> bp <- ggplot(diamonds, aes(clarity, fill = cut)) + geom_bar() +theme(axis.text.x =


element_text(angle = 70, vjust = 0.5))

> bp

#compare two plots

> plot_grid(sp, bp, labels = c("A","B"), ncol = 2, nrow = 1)

#histogram
> ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.25, fill =
'steelblue')+scale_x_continuous(breaks=seq(0,3, by=0.5))

reshape2 Package
As the name suggests, this package is useful in reshaping data. Data come in many forms, and
we are often required to reshape them according to our needs. Usually, the process of reshaping
data in R is tedious. Base R offers 'aggregation' functions with which data can be reduced
and rearranged into smaller forms, but with a reduction in the amount of information.
Aggregation includes the tapply, by and aggregate base functions. The reshape2 package
overcomes these problems. It has 2 main functions, namely melt and cast.

melt : This function converts data from wide format to long format. It’s a form of
restructuring where multiple categorical columns are ‘melted’ into unique rows.
#create a data
> ID <- c(1,2,3,4,5)
> Names <- c('Joseph','Matrin','Joseph','James','Matrin')
> DateofBirth <- c(1993,1992,1993,1994,1992)
> Subject<- c('Maths','Biology','Science','Psycology','Physics')
> thisdata <- data.frame(ID, Names, DateofBirth, Subject)
> data.table(thisdata)

#load package
> install.packages('reshape2')
> library(reshape2)
#melt
> mt <- melt(thisdata, id=(c('ID','Names')))
> mt

cast : This function converts data from long format to wide format. It starts with melted data
and reshapes it back into wide format; it is just the reverse of the melt function. There are two
variants, namely dcast and acast. dcast returns a data frame as output, while acast returns a
vector/matrix/array as the output.
> mcast <- dcast(mt, DateofBirth + Subject ~ variable)
> mcast

tidyr Package

This package can make the data look ‘tidy’. It has 4 major functions to accomplish this task.
The 4 functions are:

 gather() – it ‘gathers’ multiple columns. Then, it converts them into key:value pairs.
This function transforms data from wide form to long form. You can use it as an
alternative to 'melt' in the reshape2 package.
 spread() – It does reverse of gather. It takes a key:value pair and converts it into
separate columns.
 separate() – It splits a column into multiple columns.

unite() – It does reverse of separate. It unites multiple columns into single column
#load package
> library(tidyr)
#create a dummy data set
> names <- c('A','B','C','D','E','A','B')
> weight <- c(55,49,76,71,65,44,34)
> age <- c(21,20,25,29,33,32,38)

> Class <- c('Maths','Science','Social','Physics','Biology','Economics','Accounts')

#create data frame


> tdata <- data.frame(names, age, weight, Class)

> tdata

#using gather function


> long_t <- tdata %>% gather(Key, Value, weight:Class)
> long_t

Separate Command
#create a data set
#Humidity and Rain are not shown in the original notes; the values below are assumed for illustration
> Humidity <- c(45, 60, 55, 70, 65, 50)
> Rain <- c(0, 12, 5, 20, 8, 0)
> Time <- c("27/01/2015 15:44", "23/02/2015 23:24", "31/03/2015 19:15", "20/01/2015 20:52", "23/02/2015 07:46", "31/01/2015 01:55")
#build a data frame
> d_set <- data.frame(Humidity, Rain, Time)
#using the separate function we can split Time into date, month and year
> separate_d <- d_set %>% separate(Time, c('Date', 'Month','Year'))

> separate_d

Unite Command
#using unite function - reverse of separate
> unite_d <- separate_d%>% unite(Time, c(Date, Month, Year), sep = "/")
> unite_d

Spread Function ( reverse of gather command)


#using spread function - reverse of gather
> wide_t <- long_t %>% spread(Key, Value)

> wide_t

readr Package

‘readr’ helps in reading various forms of data into R, with up to 10x faster speed than the corresponding base functions. Here, characters are never converted to factors. This package can replace the traditional read.csv() and read.table() base R functions. It helps in reading the following data:

 Delimited files with read_delim(), read_csv(), read_tsv(), and read_csv2().


 Fixed width files with read_fwf(), and read_table().
 Web log files with read_log()
If the data loading time is more than 5 seconds, these functions will show you a progress bar too.

> install.packages('readr')
> library(readr)

> read_csv('test.csv',col_names = TRUE)

#specify the data type of every column loaded in data
> read_csv("iris.csv", col_types = list(
    Sepal.Length = col_double(),
    Sepal.Width = col_double(),
    Petal.Length = col_double(),
    Petal.Width = col_double(),
    Species = col_factor(c("setosa", "versicolor", "virginica"))
  ))

#choose to omit unimportant columns
> read_csv("iris.csv", col_types = list(
    Species = col_factor(c("setosa", "versicolor", "virginica"))
  ))

Lubridate Package

The lubridate package reduces the pain of working with date-time variables in R. The inbuilt functions of this package offer a nice way to parse dates and times easily. This package is frequently used with data containing time-stamped records.
> install.packages('lubridate')
> library(lubridate)

#current date and time
> now()
[1] "2015-12-11 13:23:48 IST"

#assigning current date and time to a variable
> n_time <- now()
> n_time

#using update function
> n_update <- update(n_time, year = 2013, month = 10)
> n_update
[1] "2013-10-11 13:24:28 IST"

#add days, weeks, years, hours, minutes, seconds
> d_time <- now()
> d_time + ddays(1)
[1] "2015-12-12 13:24:54 IST"

> d_time + dweeks(2)
[1] "2015-12-25 13:24:54 IST"

> d_time + dyears(3)
[1] "2018-12-10 13:24:54 IST"

> d_time + dhours(2)
[1] "2015-12-11 15:24:54 IST"

> d_time + dminutes(50)
[1] "2015-12-11 14:14:54 IST"

> d_time + dseconds(60)
[1] "2015-12-11 13:25:54 IST"

#extract hour, minute, second, month, year
> n_time$hour <- hour(now())
> n_time$minute <- minute(now())
> n_time$second <- second(now())
> n_time$month <- month(now())
> n_time$year <- year(now())

#check the extracted values in separate columns
> new_data <- data.frame(n_time$hour, n_time$minute, n_time$second,
                         n_time$month, n_time$year)
> new_data

WATSON STUDIO

Watson Studio provides you with the environment and tools to solve your business problems
by collaboratively working with data. You can choose the tools you need to analyze and
visualize data, to cleanse and shape data, to ingest streaming data, or to create and train
machine learning models.

The architecture of Watson Studio is centered around the project. A project is where you organize your resources and work with data.

Visualizing information in graphical ways can give you insights into your data. By enabling
you to look at and explore data from different perspectives, visualizations can help you
identify patterns, connections, and relationships within that data as well as understand large
amounts of information very quickly.

Create a project -
To create a project :
Click New project on the Watson Studio home page or your My Projects page.

Choose whether to create an empty project or to create a project based on an exported project
file or a sample project.

If you chose to create a project from a file or a sample, upload a project file or select a sample
project. See Importing a project.

On the New project screen, add a name and optional description for the
project.

Select the Restrict who can be a collaborator check box to restrict collaborators to members
of your organization or integrate with a catalog. The check box is selected by default if you
are a member of a catalog. You can’t change this setting after you create the project.

If prompted, choose or add any required services.

Choose an existing object storage service instance or create a new one.

Click Create. You can start adding resources if your project is empty or begin working with
the resources you imported.

To add data files to a project:

From your project’s Assets page, click Add to project > Data or click the Find and add data icon. You can also click the Find and add data icon from within a notebook or canvas.

In the Load pane that opens, browse for the files or drag them onto the pane. You must stay
on the page until the load is complete. You can cancel an ongoing load process if you want to
stop loading a file.

Case Study:

Let us take the Iris Data set to see how we can visualize the data in Watson studio.

Adding Data to Data Refinery

Visualizing information in graphical ways can give you insights into your data. By enabling
you to look at and explore data from different perspectives, visualizations can help you
identify patterns, connections, and relationships within that data as well as understand large
amounts of information very quickly. You can also visualize your data with these same charts
in an SPSS Modeler flow. Right-click a node and select Profile.

To visualize your data:

From Data Refinery, click the Visualizations tab.

Start with a chart or select columns.

1. Click any of the available charts. Then add columns in the DETAILS panel that opens on the
left side of the page.

2. Select the columns that you want to work with. Suggested charts will be indicated with a dot
next to the chart name. Click a chart to visualize your data.

For the Iris data set, the same steps apply: click Refine to open the data set in Data Refinery, click the Visualizations tab, and add the columns by selecting them.

UNIT – III

Introduction to Anaconda -

Anaconda is a package manager, an environment manager, and a Python distribution that contains a collection of many open source packages.

Anaconda Installation -

Go to the Anaconda website and choose a Python 3.x graphical installer or a Python 2.x graphical installer.

When the screen below appears, click on Next

Click on Next.

Note your installation location and then click Next.

Choose whether to add Anaconda to your PATH environment variable. We recommend not
adding Anaconda to the PATH environment variable, since this can interfere with other
software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda
Prompt from the Start Menu.

After that click on next.

Click Finish.

We may also need to add the Anaconda path to the system environment variables.

Open a Command Prompt. Check if you already have Anaconda added to your path.
Enter the commands below into your Command Prompt.

conda --version
python --version
This is checking if you already have Anaconda added to your path. If you get a "command is not recognized" error, then we need to set the Anaconda path.

If you don't know where your conda and/or python is, open an Anaconda Prompt and type in the following commands (on Windows, for example, where conda and where python). This tells you where conda and python are located on your computer.

Add conda and python to your PATH. You can do this by going to your System Environment Variables and adding the directory paths reported by the commands above.

Open a new Command Prompt. Try typing conda --version and python --version into the Command Prompt to check whether everything went well.

If both commands return version numbers, the conda installation is successful.

Introduction to Jupyter Notebook

What is Jupyter

The Jupyter Notebook is an open source web application that you can use to create and share
documents that contain live code, equations, visualizations, and text. Jupyter Notebook is
maintained by the people at Project Jupyter.

Jupyter Notebooks are a spin-off project from the IPython project, which used to have an IPython Notebook project itself. The name Jupyter comes from the core programming languages it supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to write your programs in Python, but there are currently over 100 other kernels that you can also use.

How to access Jupyter Notebook

Installing the Anaconda distribution will also include Jupyter Notebook.
To access Jupyter Notebook, go to the Anaconda Prompt and run the command jupyter notebook.

Or go to the Command Prompt, first activate the root (base) environment with activate root, and then run jupyter notebook.

Then you'll see the application opening in the web browser on the following address:
http://localhost:8888.

Python Scripting Basics
First Program in Python

A statement or expression is an instruction the computer will run or execute. Perhaps the
simplest program you can write is a print statement. When you run the print statement, Python
will simply display the value in the parentheses. The value in the parentheses is called the
argument.

If you are using a Jupyter notebook, you will see a small rectangle with the statement. This is
called a cell. If you select this cell with your mouse, then click the run cell button. The statement
will execute. The result will be displayed beneath the cell.

It’s customary to comment your code. This tells other people what your code does. You simply put a hash symbol preceding your comment. When you run the code, Python will ignore the comment.
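For example, a minimal first program might look like this (the text being printed is just an illustration):

# my first Python program - the hash symbol starts a comment, which Python ignores
print("Hello, Python!")   # displays the argument: Hello, Python!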

Data Types

A type is how Python represents different types of data. You can have different types in Python.
They can be integers like 11, real numbers like 21.213. They can even be words.

Summarizing the last examples: the expression 11 has the data type int, which stands for integer; 21.213 has the type float, essentially a real number; and a quoted sequence of characters has the type str (string). We can see the actual data type in Python by using the type command.
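As a quick sketch (the string used here is only an assumed example):

type(11)         # <class 'int'>
type(21.213)     # <class 'float'>
type("hello")    # <class 'str'>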

Integers can be negative or positive. It should be noted that there is a finite range of integers,
but it is quite large. Floats are real numbers; they include the integers but also numbers in
between the integers. Consider the numbers between 0 and 1. We can select numbers in
between them; these numbers are floats. Similarly, consider the numbers between 0.5 and 0.6.
We can select numbers in-between them; these are floats as well.

If you cast an integer to a float, nothing about its value really changes. If you cast a float to an integer, you must be careful: for example, if you cast the float 1.1 to an int you get 1, so you lose some information. If a string contains an integer value, you can convert it to an int. If we convert a string that contains a non-integer value, we get an error. You can also convert an int to a string or a float to a string.
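A short sketch of these casts; apart from 1.1, the particular values are assumed examples:

float(2)      # 2.0  - casting an int to a float keeps the same value
int(1.1)      # 1    - casting a float to an int drops the fractional part
int("1")      # 1    - a string containing an integer value converts cleanly
# int("one") would raise a ValueError, since the string holds a non-integer value
str(1)        # '1'
str(1.2)      # '1.2'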

Boolean is another important type in Python. A Boolean can take on two values. The first
value is true, just remember we use an uppercase T. Boolean values can also be false, with an
uppercase F. Using the type command on a Boolean value, we obtain the term bool, this is
short for Boolean. If we cast a Boolean true to an integer or float, we will get a 1.

If we cast a Boolean false to an integer or float, we get a zero. If you cast a 1 to a Boolean,
you get a true. Similarly, if you cast a 0 to a Boolean, you get a false.
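A brief sketch of the Boolean casts described above:

type(True)     # <class 'bool'>
int(True)      # 1
float(False)   # 0.0
bool(1)        # True
bool(0)        # False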

String Operations In Python

In Python, a string is a sequence of characters. A string is contained within two double quotes; you could also use single quotes. A string can contain spaces or digits, and it can also contain special characters. We can bind or assign a string to a variable. It is helpful to think of a string as an ordered sequence: each element in the sequence can be accessed using an index. For example, in a 15-character string the first element is at index 0, and we can also access index 6 or index 13. We can also use negative indexing with strings: the last element is given by the index -1, and the first element can be obtained by index -15, and so on.
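A sketch of string indexing, assuming a hypothetical 15-character string (any other string works the same way):

name = "data visualizer"   # an assumed example string with 15 characters
name[0]      # 'd'  - the first element
name[6]      # 'i'
name[13]     # 'e'
name[-1]     # 'r'  - the last element
name[-15]    # 'd'  - negative indexing back to the first element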

We can bind a string to another variable. It is helpful to think of a string as a list or tuple: we can treat the string as a sequence and perform sequence operations. We can also use a stride when slicing; a stride of 2 indicates that we select every second element, and we can combine this with slicing, for example returning every second value up to index four. We can use the len command to obtain the length of the string. As there are 15 elements, the result is 15.

We can concatenate or combine strings. We use the addition symbols. The result is a new string
that is a combination of both.

We can replicate the values of a string. We simply multiply the string by the number of times we would like to replicate it, in this case three. The result is a new string consisting of three copies of the original string. Note that strings are immutable: you cannot change the value of the string, but you can create a new string.
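Continuing with the assumed example string, a brief sketch of striding, slicing, length, concatenation and replication:

name = "data visualizer"
name[::2]              # 'dt iulzr' - every second character
name[0:5:2]            # 'dt '      - every second character up to index four
len(name)              # 15
name + " in Python"    # concatenation creates a new string
3 * "ab"               # 'ababab'   - replication also creates a new string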

Python Collections (Arrays)

There are four collection data types in the Python programming language:
Tuple is a collection which is ordered and unchangeable. Allows duplicate members.

List is a collection which is ordered and changeable. Allows duplicate members.


Set is a collection which is unordered and unindexed. No duplicate members.

Dictionary is a collection which is unordered, changeable and indexed. No duplicate members.

Tuple:
Tuples are expressed as comma-separated elements within parentheses.

In Python, there are different types: strings, integers, floats. They can all be contained in a tuple, but the type of the variable itself is tuple.

Each element of a tuple can be accessed via an index. The element in the tuple can be accessed
by the name of the tuple followed by a square bracket with the index number. Use the square
brackets for slicing along with the index or indices to obtain value available at that index.
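A small sketch of creating and indexing a tuple; the values are assumed examples:

ratings_tuple = ("disco", 10, 1.2)   # a string, an integer and a float in one tuple
type(ratings_tuple)    # <class 'tuple'>
ratings_tuple[0]       # 'disco'
ratings_tuple[-1]      # 1.2
ratings_tuple[0:2]     # ('disco', 10) - slicing returns a new tuple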

Tuples are immutable, which means we can't change them.

To see why this is important, let's see what happens when we set the variable Ratings1 equal to Ratings. Each variable does not contain its own copy of the tuple; both reference the same immutable tuple object.

Let's say we want to change the element at index 2. Because tuples are immutable, we can't. Therefore, Ratings1 will not be affected by a change in Ratings, because the tuple is immutable, i.e., we can't change it.

We can assign a different tuple to the Ratings variable. The variable Ratings now references
another tuple.
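A sketch of this behaviour, using assumed example values for the Ratings tuple:

Ratings = (10, 9, 6, 5, 10)    # assumed example ratings
Ratings1 = Ratings             # both names reference the same tuple object
# Ratings[2] = 4               # would raise a TypeError: tuples are immutable
Ratings = (2, 10, 1)           # Ratings now references a different tuple
Ratings1                       # still (10, 9, 6, 5, 10)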

There are many built-in functions that take a tuple as a parameter and perform some task. For example, we can find the length of the tuple with the len() function, the minimum value with the min() function, and so on.

If we would like to sort a tuple, we use the function sorted. The input is the original tuple; the output is a new sorted list (not a tuple), because the original tuple cannot be changed.
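For example, with assumed values:

Ratings = (10, 9, 6, 5, 10)
len(Ratings)       # 5
min(Ratings)       # 5
max(Ratings)       # 10
sorted(Ratings)    # [5, 6, 9, 10, 10] - a new sorted list; the tuple is unchanged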

A tuple can contain other tuples as well as other complex data types; this is called nesting.

For Example: NestedTuple = (5,2, ("A","B"),(1,2),(8896,("x","y","z")))


We can access these elements using the standard indexing methods.

For example, we could access the second element. We can apply this indexing directly to the tuple variable NestedTuple. It is helpful to visualize this nesting as a tree. The tuple at index 2 contains a tuple with two elements, and we can access those elements with a second index; the same convention applies to index 3. We can continue the process and access deeper levels of the tree by adding another square bracket, as shown below.
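Using the NestedTuple defined above, the indexing looks roughly like this:

NestedTuple = (5, 2, ("A", "B"), (1, 2), (8896, ("x", "y", "z")))
NestedTuple[2]          # ('A', 'B') - the tuple at index 2
NestedTuple[2][0]       # 'A'
NestedTuple[3][1]       # 2
NestedTuple[4][1][2]    # 'z' - another square bracket reaches a deeper level of the tree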
List:
A list is a collection which is ordered and changeable. A list is represented with square brackets. In many respects lists are like tuples; one key difference is that they are mutable. Lists can contain strings, floats and integers, and we can nest other lists.

We can also nest tuples and other data structures; the same indexing conventions apply for nesting. Like tuples, each element of a list can be accessed via an index.

The relationship between the index and the elements in the list is the same as for strings and tuples. The first element can be accessed by the name of the list followed by a square bracket with the index number, in this case zero. We can access the second element with index one, and we can also access the last element; in Python, we can use a negative index.

The index conventions for lists and tuples are identical for accessing and slicing the elements.
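A small sketch with assumed example values:

L = ["data", 10, 1.2, [1, 2], ("a", "b")]   # a list containing a nested list and a tuple
L[0]       # 'data'
L[-1]      # ('a', 'b')   - negative indexing works as for tuples
L[3][1]    # 2            - the same nesting conventions apply
L[0:2]     # ['data', 10] - slicing returns a new list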

We can concatenate or combine lists by adding them. Lists are mutable; therefore, we can also change them in place. For example, we apply the method extend by adding a dot followed by the name of the method, then parentheses. The argument inside the parentheses is a new list that we are going to concatenate to the original list. In this case, instead of creating a new list, the original list List1 is modified by adding four new elements.

Another similar method is append. If we apply append instead of extend, we add only one element to the list. If we look at the index, there is only one more element: index 4 contains the list we appended.

Every time we apply one of these methods, the list changes.

As lists are mutable, we can change them. For example, we can change the second element as follows; the list then becomes [1, "CHANGED", 3, 4].

We can delete an element of a list using the del command; we simply indicate the list item we would like to remove as an argument. For example, if we would like to remove the second element, we perform the del List1[1] command. This operation removes the second element of the list, and the result becomes [1, 3, 4].
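A sketch of these operations, assuming List1 starts as [1, 2, 3, 4] (consistent with the results quoted above):

List1 = [1, 2, 3, 4]
List1.extend(["A", "B", 3.5, 40])   # four new elements: [1, 2, 3, 4, 'A', 'B', 3.5, 40]

List1 = [1, 2, 3, 4]
List1.append(["A", "B"])            # one new element; List1[4] is the appended list

List1 = [1, 2, 3, 4]
List1[1] = "CHANGED"                # lists are mutable: [1, 'CHANGED', 3, 4]
del List1[1]                        # removes the second element: [1, 3, 4]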

LISTS: Aliasing

When we set one variable, B, equal to A, both A and B are referencing the same list. Multiple names referring to the same object is known as aliasing.

If we change the first element in A to "banana" we get a side effect: the value of B will change as a consequence. A and B are referencing the same list, therefore if we change A, list B also changes. If we check the first element of B after changing list A, we get "banana" instead of "hard rock".

You can clone list A by using slicing, for example B = A[:]. Variable A references one list; variable B references a new copy or clone of the original list.

Now if you change A, B will not change. We can get more info on lists, tuples and many other objects in Python using the help command.

Simply pass in the list, tuple or any other Python object, for example: help(list), help(tuple), etc.
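A sketch of the aliasing and cloning behaviour described above, plus the help command; apart from the "hard rock" and "banana" values, the list contents are assumed:

A = ["hard rock", 10, 1.2]
B = A              # aliasing: both names reference the same list
A[0] = "banana"
B[0]               # 'banana' - the side effect described above

A = ["hard rock", 10, 1.2]
B = A[:]           # cloning: B references a new copy of the list
A[0] = "banana"
B[0]               # 'hard rock' - the clone is unaffected

help(list)         # prints the documentation for the list type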

Set:

Sets are a type of collection. Unlike lists and tuples, they are unordered. You cannot access items in a set by referring to an index; since sets are unordered, the items have no index. To define a set, you use curly brackets, and you place the elements of the set within the curly brackets.

If you list duplicate items when defining a set, the duplicates will not be present when the actual set is created.

To add one item to a set, use the add() method.

To add more than one item to a set, use the update() method with a list of values.

To remove an item from the set we can use the pop() method. Remember sets are unordered, so pop() removes an arbitrary item rather than a predictable "first" one.

To remove a specific item from the set, use the remove() method; we simply indicate the set item we would like to remove as an argument.

There are lots of useful mathematical operations we can do between sets, like union, intersection, difference and symmetric difference of two sets.
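A short sketch of the set operations described in this section; all values are assumed examples:

music = {"rock", "pop", "jazz", "rock"}   # the duplicate "rock" is stored only once
music.add("disco")                        # add one item
music.update(["folk", "blues"])           # add several items from a list
music.remove("jazz")                      # remove a specific item

A = {1, 2, 3}
B = {3, 4, 5}
A | B    # union: {1, 2, 3, 4, 5}            (or A.union(B))
A & B    # intersection: {3}                 (or A.intersection(B))
A - B    # difference: {1, 2}                (or A.difference(B))
A ^ B    # symmetric difference: {1, 2, 4, 5}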

DICTIONARIES:

A Python dictionary is an unordered collection of items. While other compound data types have only values as elements, a dictionary has key: value pairs. Dictionaries are optimized to retrieve values when the key is known. Creating a dictionary is as simple as placing items inside curly braces {} separated by commas. An item has a key and the corresponding value expressed as a pair, key: value. While values can be of any data type and can repeat, keys must be of an immutable type (string, number or tuple with immutable elements) and must be unique.

We can access the elements from the dictionary using keys.

We can get the value using keys, either inside square brackets or with the get() method.

Dictionary is mutable. We can add new items or change the value of existing items using
assignment operator. If the key is already present, value gets updated, else a new key: value
pair is added to the dictionary.

We can delete an entry as follows. This gets rid of the key "address" and its value from my_dict
dictionary.

We can verify if an element is in the dictionary using the in command as follows.


Syntax: ‘KEY_NAME’ in DictionaryName

The command checks the keys: if the key is in the dictionary, the expression returns True. If we try the same command with a key that is not in the dictionary, we get False.

In order to see all the keys in a dictionary, we can use the keys method. The output is a list-like object with all the keys. In the same way, we can obtain the values with the values method.
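A sketch of these operations; the dictionary name my_dict and the "address" key follow the text above, while the actual values are assumed:

my_dict = {"name": "Ravi", "age": 21, "address": "Hyderabad"}
my_dict["name"]               # 'Ravi' - access a value with the key in square brackets
my_dict.get("age")            # 21     - access a value with get()
my_dict["age"] = 22           # existing key: the value gets updated
my_dict["college"] = "MRCET"  # new key: a key: value pair is added
del my_dict["address"]        # removes the key 'address' and its value
"name" in my_dict             # True
"salary" in my_dict           # False
my_dict.keys()                # dict_keys(['name', 'age', 'college'])
my_dict.values()              # dict_values(['Ravi', 22, 'MRCET'])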

Conditional Statements
What is Control or Conditional Statements -

In programming languages, most of the time we have to control the flow of execution of a program: we want to execute some set of statements only if a given condition is satisfied, and a different set of statements when it is not satisfied. These are also called control statements or decision-making statements.

Conditional statements are also known as decision-making statements. We use these statements
when we want to execute a block of code when the given condition is true or false.

Usually the condition will be an expression built with relational operators such as ==, !=, <, >, <= and >=.

In Python we achieve decision-making by using the statements below -

If statements

If-else statements

Elif statements

Nested if and if-else statements

If statements -

The if statement is one of the most commonly used conditional statements in most programming languages. It decides whether certain statements need to be executed or not. An if statement checks a given condition; if the condition is true, then the set of code present inside the if block will be executed.

The if condition evaluates a Boolean expression and executes the block of code only when the Boolean expression becomes TRUE. In terms of flow, the controller comes to the if condition and evaluates it: if it is true, the statements inside the block are executed; otherwise, the code present outside the block is executed.

Let's take an example to implement the if statement. In this example we have a variable name which stores the string "Srikar", and we also have a names list with some names.

We can use an if statement to check whether the name is present in the names list or not. If the condition is true, it will print the block of statements inside the ‘if’ block; if the condition is false, it will skip the execution of the ‘if’ block statements. A sketch is shown below.
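The name variable and the string "Srikar" come from the text; the contents of the names list are assumed:

name = "Srikar"
names = ["Srikar", "Anil", "Kiran"]    # assumed example names

if name in names:
    # this block runs only when the condition is True
    print(name, "is present in the names list")
print("This line runs whether or not the condition was true")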

If-else statements:

The statement itself tells that if a given condition is true then execute the statements present
inside if block and if the condition is false then execute the else block.

The else block will execute only when the condition becomes false; this is the block where you will perform some actions when the condition is not true.

If-else statement evaluates the Boolean expression and executes the block of code present
inside the if block if the condition becomes TRUE and executes a block of code present in the
else block if the condition becomes FALSE.

Let's take an example to implement the if-else statement. In this example, the if block will get executed if the given condition is true; otherwise the else block will be executed, as in the sketch below.
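A minimal sketch, with an assumed pass mark of 35 and an assumed score:

marks = 40

if marks >= 35:
    print("Pass")    # executed when the condition is true
else:
    print("Fail")    # executed when the condition is false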

elif statements:

In Python, we have one more conditional statement called the elif statement. The elif statement is used to check multiple conditions, and each elif condition is evaluated only if the preceding if (or elif) condition is false. It is like an if-else statement; the only difference is that the else branch does not check a condition, whereas an elif branch does.

Elif statements are therefore similar to if-else statements, but they let us evaluate multiple conditions, as in the sketch below.
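A minimal sketch of an elif ladder, with assumed grade boundaries:

marks = 75

if marks >= 75:
    print("Distinction")
elif marks >= 60:
    print("First class")
elif marks >= 35:
    print("Pass")
else:
    print("Fail")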

