DATA VISUALIZATION
Course Objectives:
To learn different statistical methods for data visualization.
To understand the basics of R and Python.
To learn the usage of Watson Studio.
To understand the usage of packages like NumPy, pandas and Matplotlib.
To know the functionalities and usage of Seaborn.
UNIT I
Introduction to Statistics : Introduction to Statistics, Difference between inferential statistics
and Descriptive statistics, Inferential Statistics- Drawing Inferences from Data, Random
Variables, Normal Probability Distribution, Sampling, Sample Statistics and Sampling
Distributions.
R overview and Installation- Overview and About R, R and R studio Installation, Descriptive
Data analysis using R, Description of basic functions used to describe data in R.
UNIT II
Data manipulation with R: Data manipulation packages - dplyr, data.table, reshape2, tidyr,
lubridate; Data visualization with R.
Data visualization in Watson Studio: Adding data to Data Refinery, Visualization of data on
Watson Studio.
UNIT III
Python: Introduction to Python, How to Install, Introduction to Jupyter Notebook, Python
Scripting basics, NumPy and pandas - Creating and Accessing NumPy Arrays, Introduction to
pandas, read and write CSV, Descriptive statistics using pandas, Working with text data and
datetime columns, Indexing and selecting data, groupby, Merge/Join datasets
UNIT IV
Data Visualization Tools in Python - Introduction to Matplotlib, Basic plots using Matplotlib,
Specialized Visualization Tools using Matplotlib, Advanced Visualization Tools using
Matplotlib: Waffle Charts, Word Clouds.
UNIT V
Introduction to Seaborn: Seaborn functionalities and usage, Spatial Visualizations and
Analysis in Python with Folium, Case Study.
TEXT BOOKS:
1. Core Python Programming - Second Edition, R. Nageswara Rao, Dreamtech Press.
2. Hands-On Programming with R by Garrett Grolemund, Shroff/O'Reilly; First edition.
3. Fundamentals of Mathematical Statistics by S.C. Gupta, Sultan Chand & Sons.
REFERENCE BOOKS:
1. Learn R for Applied Statistics: With Data Visualizations, Regressions, and Statistics by
Eric Goh Ming Hui, Apress.
2. Python for Data Analysis by Wes McKinney, Second Edition, O'Reilly Media Inc.
3. The Comprehensive R Archive Network-https://cran.r-project.org
MALLA REDDY COLLEGE OF ENGINEERING & TECHNOLOGY
DEPARTMENT OF INFORMATION TECHNOLOGY
INDEX
S. No   Unit   Topic
1       I      Introduction to Statistics
2       I      Normal Distribution
3       I      Sampling
7       II     data.table
8       II     reshape2
9       II     tidyr
10      II     lubridate
16      III    Groupby
18      IV     Introduction to Matplotlib
19      IV     Basic Plots using Matplotlib
UNIT-1
Introduction to Statistics
Statistics is a mathematical science that includes methods for collecting, organizing, analyzing
and visualizing data in such a way that meaningful conclusions can be drawn.
Statistics is also a field of study that summarizes data, interprets it, and supports making
decisions based on it.
Statistics is composed of two broad categories:
1. Descriptive Statistics
2. Inferential Statistics
1. Descriptive Statistics
Descriptive statistics describes the characteristics or properties of the data. It helps to
summarize the data in a meaningful way and allows important patterns to emerge from the
data. Data summarization techniques are used to identify the properties of data and are
helpful in understanding the distribution of data. Descriptive statistics does not involve
generalizing beyond the data at hand.
A measure of central tendency is a single value that attempts to describe a set of data by
identifying the central position within that set of data. The mean, median and mode are all valid
measures of central tendency.
Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of central tendency. It can
be used with both discrete and continuous data, although its use is most often with continuous
data.
The mean is equal to the sum of all the values in the data set divided by the number of values
in the data set. So, if a data set has n values x1, x2, …, xn, the sample mean, usually denoted
by x̄, is

x̄ = (x1 + x2 + … + xn) / n
An important property of the mean is that it includes every value in the data set as part of the
calculation. In addition, the mean is the only measure of central tendency where the sum of the
deviations of each value from the mean is always zero.
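As a minimal R sketch (with made-up numbers), both the mean and this zero-sum property can be checked directly:
> x <- c(7, 10, 21, 33, 43, 45)
> mean(x)
[1] 26.5
> sum(x - mean(x))   # deviations from the mean always sum to zero
[1] 0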
Median:
The median is the middle score for a set of data that has been arranged in order of
magnitude. The median is less affected by outliers and skewed data than the mean. It is a
holistic measure, and the median of a large data set is easy to approximate.
Mode
The mode is the most frequent score in our data set. The mode is used for categorical data
where we want to know which is the most common category occurring in the population. The
greatest frequency can correspond to several different values, so a data set may have one,
two, or more modes; such data sets are called unimodal, bimodal and multimodal respectively.
If every value occurs only once, the data set has no mode.
For a unimodal frequency curve with a symmetric data distribution, the mean, median and mode
are all the same. In real applications the data is often not symmetrical: it might be
positively skewed or negatively skewed. In a positively skewed distribution the mode is
smaller than the median, and in a negatively skewed distribution the mode occurs at a value
greater than the median.
Measures of spread are the ways of summarizing a group of data by describing how scores are
spread out. To describe this spread, a number of statistics are available to us, including the
range, quartiles, absolute deviation, variance and standard deviation.
• The degree to which numerical data tend to spread is called the dispersion, or variance of
the data. The common measures of data dispersion: Range, Quartiles, Outliers, and
Boxplots.
Range : Range of the set is the difference between the largest (max()) and smallest (min()) values.
Ex: Step 1: Sort the numbers in order, from smallest to largest: 7, 10, 21, 33, 43, 45,
45, 65, 67, 87, 98, 99
Step 2: Subtract the smallest number in the set from the largest number in the set:
99 – 7 = 92
The range is 92
Quartiles: Quartiles are defined through percentiles. The kth percentile of a set of data in
numerical order is the value xi having the property that k percent of the data entries lie at
or below xi. The quartiles Q1, Q2 and Q3 are the 25th, 50th and 75th percentiles respectively.
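A minimal R sketch, reusing the sorted numbers from the range example above:
> x <- c(7, 10, 21, 33, 43, 45, 45, 65, 67, 87, 98, 99)
> quantile(x)          # quartiles: 0%, 25%, 50%, 75%, 100%
> quantile(x, 0.90)    # the 90th percentile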
Inferential Statistics – Definition and Types
Inferential statistics is generally used when the user needs to make a conclusion about the whole
population at hand, and this is done using the various types of tests available. It is a technique
which is used to understand trends and draw the required conclusions about a large population
by taking and analyzing a sample from it. Descriptive statistics, on the other hand, is only about
the smaller sized data set at hand – it usually does not involve large populations. Using
variables and the relationships between them from the sample, we will be able to make
generalizations and predict other relationships within the whole population, regardless of how
large it is.
With inferential statistics, data is taken from samples and generalizations are made about
a population. Inferential statistics use statistical models to compare sample data to other
samples or to previous research.
1. Estimating parameters:
This means taking a statistic from the sample data (for example the sample mean) and using it
to infer about a population parameter (i.e. the population mean). There may be sampling
variations because of chance fluctuations, variations in sampling techniques, and other
sampling errors, and estimates of population characteristics may be influenced by such
factors. Therefore, the important point in estimation is how close our estimate is to the
true value.
Characteristics of Good Estimator: A good statistical estimator should have the following
characteristics, (i) Unbiased (ii) Consistent (iii) Accuracy
i) Unbiased
An unbiased estimator is one for which, if we were to obtain an infinite number of random
samples of a certain size, the mean of the statistic across those samples would be equal to
the parameter. The sample mean (x̄) is an unbiased estimate of the population mean (μ)
because, across all possible random samples of size N from a population, the average of the
sample means equals μ.
ii) Consistent
A consistent estimator is one for which, as the sample size increases, the probability that
the estimate has a value close to the parameter also increases. Because the sample mean is a
consistent estimator, a sample mean based on 20 scores has a greater probability of being
close to μ than does a sample mean based upon only 5 scores.
iii) Accuracy
The sample mean is an unbiased and consistent estimator of the population mean (μ). But we
should not overlook the fact that an estimate is just a rough or approximate calculation. It
is unlikely in any estimate that x̄ will be exactly equal to the population mean (μ). Whether
or not x̄ is a good estimate of μ depends upon the representativeness of the sample, the
sample size, and the variability of scores in the population.
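Both properties can be illustrated with a small R simulation (a sketch assuming a standard normal population, so μ = 0):
> set.seed(42)
> sample_means <- replicate(10000, mean(rnorm(20)))
> mean(sample_means)                       # close to μ = 0: unbiasedness
> sd(sample_means)                         # spread of the means for n = 20
> sd(replicate(10000, mean(rnorm(100))))   # smaller spread for n = 100: consistency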
2. Hypothesis tests. This is where sample data can be used to answer research questions.
For example, we might be interested in knowing if a new cancer drug is effective. Or if
breakfast helps children perform better in schools.
Inferential statistics is closely tied to the logic of hypothesis testing. We hypothesize that
a value characterises the population of observations; the question is whether that hypothesis
is reasonable given the evidence from the sample. Sometimes hypothesis testing is referred to
as the statistical decision-making process. In day-to-day situations we are required to take
decisions about the population on the basis of sample information.
2.6.2 Level of Significance
The level of significance is defined as the probability of rejecting a null hypothesis by the test
when it is really true, which is denoted as α. That is, P (Type I error) = α.
Confidence level:
Confidence level refers to the possibility of a parameter that lies within a specified range of
values, which is denoted as c. Moreover, the confidence level is connected with the level of
significance. The relationship between level of significance and the confidence level is c=1−α.
The common levels of significance and the corresponding confidence levels are given below:
α = 0.10 → c = 0.90;  α = 0.05 → c = 0.95;  α = 0.01 → c = 0.99
Rejection region:
The rejection region is the values of test statistic for which the null hypothesis is rejected.
There are many tests in this field, of which some of the most important are mentioned below.
1. Linear Regression Analysis
In this test, a linear algorithm is used to understand the relationship between two variables from
the data set. One of those variables is the dependent variable, while there can be one or more
independent variables used. In simpler terms, we try to predict the value of the dependent
variable based on the available values of the independent variables. This is usually represented
by using a scatter plot, although we can also use other types of graphs too.
2. Analysis of Variance
This is another statistical method which is extremely popular in data science. It is used to test
and analyse the differences between two or more means from the data set. The significant
differences between the means are obtained, using this test.
3. Analysis of Co-variance
This is a development of the Analysis of Variance method that involves the inclusion of a
continuous covariate in the calculations. A covariate is an independent variable which is
continuous and is used as a regression variable. This method is used extensively in
statistical modelling, in order to study the differences present between the average values
of dependent variables.
4. Correlation Analysis
Another extremely useful test, this is used to understand the extent to which two variables are
dependent on each other. The strength of any relationship, if they exist, between the two
variables can be obtained from this. You will be able to understand whether the variables have
a strong correlation or a weak one. The correlation can also be negative or positive, depending
upon the variables. A negative correlation means that the value of one variable decreases
while the value of the other increases; a positive correlation means that the values of both
variables decrease or increase together.
Descriptive Statistics                               Inferential Statistics
Organises, analyses and presents the data            Compares, tests and predicts future outcomes
in a meaningful way
The analysed results are in the form of              The analysed results are probability scores
graphs, charts etc.
Describes the data which is already known            Tries to draw conclusions about the population
                                                     beyond the data available
Tools: measures of central tendency and              Tools: hypothesis tests, analysis of variance etc.
measures of spread
Random Variables
A random variable, X, is a variable whose possible values are numerical outcomes of a random
phenomenon. There are two types of random variables, discrete and continuous.
Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are described
by the following table:
Outcome 1 2 3 4
Probability 0.1 0.3 0.4 0.2
The probability that X is equal to 2 or 3 is the sum of the two
probabilities: P(X = 2 or X = 3) = P(X = 2) + P(X = 3) = 0.3 +
0.4 = 0.7. Similarly, the probability that X is greater than 1 is
equal to 1 - P(X = 1) = 1 - 0.1 = 0.9, by the complement rule.
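A quick R sketch mirroring the table above:
> p <- c(0.1, 0.3, 0.4, 0.2)   # P(X = 1), P(X = 2), P(X = 3), P(X = 4)
> p[2] + p[3]                  # P(X = 2 or X = 3)
[1] 0.7
> 1 - p[1]                     # P(X > 1), by the complement rule
[1] 0.9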
The normal or Gaussian probability distribution is the most popular and important
distribution because of its unique mathematical properties, which facilitate its application
to practically any physical problem in the real world. Its density is

f(x) = (1 / (σ√(2π))) e^(−(x−μ)² / (2σ²))

where the constants μ and σ2 are the parameters;
“μ” is the population true mean (or expected value) of the subject phenomenon
characterized by the continuous random variable, X,
“σ2” is the population true variance characterized by the continuous random
variable, X.
Hence, “σ” the population standard deviation characterized by the continuous random
variable X;
the points located at μ−σ and μ+σ are the points of inflection; that is, where the graph
changes from concave up to concave down.
The normal curve (the graph of the normal probability distribution) is symmetric with
respect to the mean μ as the central position. That is, the area between μ and κ units to
the left of μ is equal to the area between μ and κ units to the right of μ.
There is not a unique normal probability distribution. The figure below is a graphical
representation of the normal distribution for a fixed value of σ2 with μ varying.
The figure below is a graphical representation of the normal distribution for a fixed value
of μ with varying σ2.
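These properties can be checked with R's built-in normal-distribution functions (a minimal sketch for the standard normal, μ = 0 and σ = 1):
> pnorm(1) - pnorm(-1)   # area within one standard deviation of the mean
[1] 0.6826895
> dnorm(0)               # density of the standard normal at its mean
[1] 0.3989423
> curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 4)   # plot the bell curve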
SAMPLING and SAMPLING DISTRIBUTION
Sampling Distribution
A sampling distribution is a probability distribution of a statistic. It is obtained through a large
number of samples drawn from a specific population. It is the distribution of all possible values
taken by the statistic when all possible samples of a fixed size n are taken from the population.
Sampling Distributions and Inferential Statistics
Sampling distributions are important for inferential statistics. In theory, a population is
specified and the sampling distribution of a statistic such as the mean or the range is
determined from it. In practice, the process proceeds the other way: the sample data are
collected, and from these data we estimate parameters of the sampling distribution. This
knowledge of the sampling distribution can be very useful: knowing the degree to which means
from different samples would differ from each other, and from the population mean, gives an
idea of how close a particular sample mean is likely to be to the population mean.
The most common measure of how much sample means differ from each other is the
standard deviation of the sampling distribution of the mean. This standard deviation is
called the standard error of the mean.
If all the sample means were very close to the population mean, then the standard error of
the mean would be small. On the other hand, if the sample means varied considerably, then
the standard error of the mean would be large.
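A short R simulation sketch (assumed population: normal with mean 50 and standard deviation 10, samples of size n = 25, so theory gives SE = σ/√n = 2):
> set.seed(1)
> means <- replicate(5000, mean(rnorm(25, mean = 50, sd = 10)))
> sd(means)       # empirical standard error of the mean, close to 2
> 10 / sqrt(25)   # theoretical standard error
[1] 2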
R overview and Installation
R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team.
The core of R is an interpreted computer language which allows branching and looping as
well as modular programming using functions. R allows integration with the procedures
written in the C, C++, .Net, Python or FORTRAN languages for efficiency.
R is freely available under the GNU General Public License, and pre-compiled binary versions
are provided for various operating systems like Linux, Windows and Mac.
R is free software distributed under a GNU-style copyleft licence and is an official part of
the GNU project; R is sometimes referred to as "GNU S".
Features of R
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display either directly at the
computer or printing at the papers.
To Install R:
1. Open an internet browser and go to www.r-project.org.
2. Click the "download R" link in the middle of the page under "Getting Started."
3. Select a CRAN location (a mirror site) and click the corresponding link.
4. Click on the "Download R for Windows" link at the top of the page.
5. Click on the "install R for the first time" link at the top of the page.
6. Click "Download R for Windows" and save the executable file somewhere on computer. Run
the .exe file and follow the installation instructions.
7. Now that R is installed, next step is to download and install RStudio.
To Install RStudio:
Go to www.rstudio.com, download the RStudio Desktop installer for your operating system, run
it, and follow the installation instructions.
R Command Prompt
Once R environment setup is done, then it’s easy to start R command prompt by just typing
the following command at command prompt – “$ R”
This will launch R interpreter and will get a prompt > where we can start typing your program
as follows −
> myString <- "Hello, World!"
> print(myString)
R Script File
execute scripts at command prompt with the help of R interpreter called Rscript.
# My first program in R Programming
myString <- "Hello, World!"
print(myString)
Save the above code in a file test.R and execute it at command prompt as given below.
$ Rscript test.R
When we run the above program, it produces the following result.
"Hello, World!"
Comments
Comments are like helping text in your R program and they are ignored by the interpreter
while executing actual program. Single comment is written using # in the beginning of the
statement as follows −
# My first program in R Programming
R does not support multi-line comments but they can be written as follows:
"This is a demo for multi-line comments and it should be put inside either a
single OR double quote"
R data types:
The variables are assigned with R-Objects and the data type of the R-object becomes the data
type of the variable. There are many types of R-objects. The frequently used ones are −
Vectors
Lists
Matrices
Arrays
Factors
Data Frames
The simplest of these objects is the vector object and there are six data types of these atomic
vectors, also termed as six classes of vectors. The other R-Objects are built upon the atomic
vectors.
Data Type    Example                Verify
Logical      TRUE, FALSE            v <- TRUE; print(class(v))                [1] "logical"
Numeric      12.3, 5, 999           v <- 23.5; print(class(v))                [1] "numeric"
Integer      2L, 34L, 0L            v <- 2L; print(class(v))                  [1] "integer"
Complex      3 + 2i                 v <- 2+5i; print(class(v))                [1] "complex"
Character    'a', "good", "TRUE"    v <- "TRUE"; print(class(v))              [1] "character"
Raw          "Hello" as bytes       v <- charToRaw("Hello"); print(class(v))  [1] "raw"
Vectors
When you want to create vector with more than one element, you should use c() function
which means to combine the elements into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
# Get the class of the vector.
print(class(apple))
When we execute the above code, it produces the following result −
"red" "green" "yellow"
"character"
Lists
A list is an R-object which can contain many different types of elements inside it like vectors,
functions and even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
When we execute the above code, it produces the following result −
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to
the matrix function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
When we execute the above code, it produces the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions.
The array function takes a dim attribute which creates the required number of dimensions. In
the example below we create an array with two elements, each of which is a 3x3 matrix.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
When we execute the above code, it produces the following result −
, , 1
[,1] [,2] [,3]
[1,] "green" "yellow" "green"
[2,] "yellow" "green" "yellow"
[3,] "green" "yellow" "green"
, , 2
[,1] [,2] [,3]
[1,] "yellow" "green" "yellow"
[2,] "green" "yellow" "green"
[3,] "yellow" "green" "yellow"
Data Frames
Data frames are tabular data objects. Unlike a matrix in data frame each column can contain
different modes of data. The first column can be numeric while the second column can be
character and third column can be logical. It is a list of vectors of equal length. Data Frames
are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame( gender = c("Male", "Male","Female"), height = c(152, 171.5, 165),
weight = c(81,93, 78), Age = c(42,38,26) )
print(BMI)
Result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
R - Variables
A variable provides us with named storage that our programs can manipulate. A variable in R
can store an atomic vector, group of atomic vectors or a combination of many R objects. A
valid variable name consists of letters, numbers and the dot or underline characters. The
variable name starts with a letter or the dot not followed by a number.
Variable Name          Validity   Reason
var_name%              Invalid    Contains the character '%'. Only dot(.) and underscore are allowed.
.var_name, var.name    Valid      Can start with a dot(.), but the dot must not be followed by a number.
R - Operators
An operator is a symbol that tells the compiler to perform specific mathematical or logical
manipulations. R language is rich in built-in operators and provides following types of
operators.
Types of Operators
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
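A short R sketch (with made-up vectors) touching each operator family:
> v <- c(2, 4, 6)
> t <- c(1, 4, 3)
> v + t              # arithmetic: 3 8 9
> v > t              # relational: TRUE FALSE TRUE
> (v > 2) & (t < 4)  # logical, element-wise
> x <- 5; x -> y     # leftward and rightward assignment
> v %in% t           # miscellaneous: membership test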
Descriptive Data analysis using R:
R provides a wide range of functions for obtaining summary statistics. One method of obtaining
descriptive statistics is to use the sapply( ) function with a specified summary statistic.
sapply(mydata, mean, na.rm=TRUE)
Possible functions used in sapply include mean, sd, var, min, max, median, range, and
quantile.
Check your data
You can inspect your data using the functions head() and tail(), which display the first
and the last part of the data, respectively.
# Print the first 6 rows
head(my_data, 6)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Description                              R function
Mean                                     mean()
Standard deviation                       sd()
Variance                                 var()
Minimum                                  min()
Maximum                                  max()
Median                                   median()
Range of values (minimum and maximum)    range()
Sample quantiles                         quantile()
Generic function                         summary()
Interquartile range                      IQR()
Roughly speaking, the central tendency measures the “average” or the “middle” of your data.
The most commonly used measures include:
the mean: the average value. It’s sensitive to outliers.
the median: the middle value. It’s a robust alternative to mean.
and the mode: the most frequent value
In R,
The function mean() and median() can be used to compute the mean and the median,
respectively;
The function mfv() [in the modeest R package] can be used to compute the mode of a
variable.
The R code below computes the mean, median and the mode of the
variable Sepal.Length [in my_data data set]:
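A sketch of that computation (assuming my_data holds the iris data set and that the modeest package is installed):
> mean(my_data$Sepal.Length)
[1] 5.843333
> median(my_data$Sepal.Length)
[1] 5.8
> library(modeest)
> mfv(my_data$Sepal.Length)   # most frequent value
[1] 5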
Measure of variability
Range corresponds to biggest value minus the smallest value. It gives you the full spread
of the data.
# Compute the minimum value
min(my_data$Sepal.Length)
[1] 4.3
# Compute the maximum value
max(my_data$Sepal.Length)
[1] 7.9
# Range
range(my_data$Sepal.Length)
[1] 4.3 7.9
Interquartile range
The interquartile range (IQR) - corresponding to the difference between the first and third
quartiles - is sometimes used as a robust alternative to the standard deviation.
R function:
quantile(x, probs = seq(0, 1, 0.25))
x: numeric vector whose sample quantiles are wanted.
probs: numeric vector of probabilities with values in [0,1].
Example:
quantile(my_data$Sepal.Length)
0% 25% 50% 75% 100%
4.3 5.1 5.8 6.4 7.9
To compute deciles (0.1, 0.2, 0.3, …, 0.9), pass an explicit probs vector:
quantile(my_data$Sepal.Length, probs = seq(0.1, 0.9, by = 0.1))
The IQR() function returns the interquartile range directly:
IQR(my_data$Sepal.Length)
[1] 1.3
Variance and standard deviation
The variance represents the average squared deviation from the mean. The standard deviation
is the square root of the variance. It measures the average deviation of the values, in the data,
from the mean value.
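For example, on the same variable:
> var(my_data$Sepal.Length)
[1] 0.6856935
> sd(my_data$Sepal.Length)
[1] 0.8280661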
sapply() function: as shown earlier, sapply() can apply any of these measures across every
column of a data frame at once.
lm() Function
This function creates the relationship model between the predictor and the response
variable. The basic syntax for the lm() function in linear regression is:
lm(formula, data)
# Apply the lm() function.
relation <- lm(stud.data$weight ~ stud.data$height)
print(relation)
Output
Coefficients:
(Intercept)  stud.data$height
   -38.4551            0.6746
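A self-contained sketch (with hypothetical height/weight data, since stud.data is not defined in these notes):
> height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
> weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
> relation <- lm(weight ~ height)
> coef(relation)
> predict(relation, data.frame(height = 170))   # predicted weight for a height of 170 cm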
UNIT-II
Introduction
Data Manipulation
It involves 'manipulating' data using the available set of variables. This is done to enhance
the accuracy and precision associated with data. The data collection process can have many
loopholes: various uncontrollable factors lead to inaccuracy in data, such as the mental
state of respondents, personal biases, and differences or errors in machine readings. To
lessen these inaccuracies, data manipulation is done to bring the data to the highest
possible accuracy. This stage is also known as data wrangling or data cleaning.
Manipulating data using inbuilt base R functions. This is the first step, but is often
repetitive and time consuming. Hence, it is a less efficient way to solve the problem.
Use of packages for data manipulation. CRAN has more than 8000 packages available
today. These packages are collections of pre-written, commonly used pieces of code.
They help to perform repetitive tasks quickly, reduce errors in coding, and take advantage
of code written by experts (across the open-source ecosystem for R) to make code more
efficient. This is usually the most common way of performing data manipulation.
Use of Machine Learning (ML) algorithms for data manipulation, for example tree-based
boosting algorithms that take care of missing data and outliers. These algorithms are
less time consuming.
install.packages('package name')
List of Packages
1. dplyr
2. data.table
3. ggplot2
4. reshape2
5. readr
6. tidyr
7. lubridate
dplyr Package
This package was created and is maintained by Hadley Wickham. It has (almost) everything
to accelerate data manipulation efforts. It is known best for data exploration and
transformation, and its chaining syntax makes it highly adaptive to use. It includes 5 major
data manipulation commands: filter(), select(), arrange(), mutate() and summarise().
> library(dplyr)
> data("mtcars")
> data('iris')
> mynewdata <- mtcars   # assumed: working copies, printed below but not defined in the notes
> myirisdata <- iris
> mynewdata
> myirisdata
> filter(mynewdata, cyl > 4)
#hide a range of columns
> select(mynewdata, -c(cyl,mpg))
#arrange can be used to reorder rows
> mynewdata%>% select(cyl, wt, gear)%>% arrange(wt)
#summarise - collapse the data to summary statistics
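A sketch of summarise() in a dplyr chain (assuming the mynewdata copy of mtcars created above):
> mynewdata %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg), avg_hp = mean(hp))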
data.table Package
This package allows faster manipulation of a data set. A data table query has 3 parts,
namely DT[i, j, by]: we tell R to subset the rows using 'i' and to calculate 'j', grouped
by 'by'. Most of the time, 'by' relates to a categorical variable.
#load data
> data("airquality")
> mydata <- airquality
> head(airquality,6)
#load package
> library(data.table)
> mydata <- data.table(mydata)
> mydata
> myiris <- data.table(iris)   # assumed: the notes print myiris without defining it
> myiris
> mydata[2:4,]
#select a column using 'j'. This returns Temp as a vector
> mydata[,Temp]
#wrap the column names in .() to get a data table back
> mydata[,.(Temp,Month)]
#print and plot inside 'j'
> myiris[, {print(Sepal.Length); plot(Sepal.Length); NULL}]
#to select rows by a column's values, first set the key on that column
> setkey(myiris, Species)
#select rows with the key values. This gives the rows for the setosa and virginica species
> myiris['setosa']
> myiris[c('setosa', 'virginica')]
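The 'by' part of DT[i, j, by] can be sketched on the airquality copy, grouping the mean temperature per month:
> mydata[, .(avg_temp = mean(Temp)), by = Month]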
ggplot2 Package
ggplot offers a whole new world of colors and patterns. Plotting 3 graphs: Scatter Plot, Bar
Plot, Histogram. ggplot is enriched with customized features to make visualization better. It
becomes even more powerful when grouped with other packages like cowplot, gridExtra.
Scatter Plot :
A Scatter Plot is a graph in which the values of two variables are plotted along two axes,
the pattern of the resulting points revealing any correlation present.
With scatter plots we can explain how the variables relate to each other. Which is defined
as correlation. Positive, Negative, and None (no correlation) are the three types of
correlation.
Limitations of a Scatter Diagram
Below are the few limitations of a scatter diagram:
• A scatter diagram does not give the exact extent of correlation.
• It does not provide a quantitative measure of the relationship between the variables; it
only shows the direction and rough strength of the association.
• The relationship can only be shown for two variables.
Advantages of a Scatter Diagram
Below are the few advantages of a scatter diagram:
• The relationship between two variables can be viewed.
• For non-linear patterns, this is the best method.
• Maximum and minimum values can be easily determined.
• Observations are easy to read and understand.
• Plotting the diagram is very simple.
Bar Plot
A barplot (or barchart) is one of the most common types of graphic. It shows the
relationship between a numeric variable and a categorical variable. Related basic chart
families include the bar graph or bar chart, the line graph, the pie chart, and the diagram.
Limitations of Bar Plot:
When we try to display changes in speed, such as acceleration, bar graphs won't help us.
Advantages of Bar plot:
• Bar charts are easy to understand and interpret.
• The relationship between bar size and value makes comparison easy.
• They're simple to create.
• They can present very large or very small values easily.
Histogram
A histogram represents the frequency distribution of continuous variables, while a bar
graph is a diagrammatic comparison of discrete variables: a histogram presents numerical
data, whereas a bar graph shows categorical data. The histogram is drawn in such a way that
there is no gap between the bars.
Limitations of Histogram:
A histogram can present data in a misleading way if an unsuitable number of bars is used.
Only a limited number of data sets can be shown, but to analyze certain types of statistical
data, more sets of data are necessary.
Advantages of Histogram:
Histogram helps to identify different data, the frequency of the data occurring in the dataset
and categories which are difficult to interpret in a tabular form. It helps to visualize the
distribution of the data.
> library(ggplot2)
> library(gridExtra)
> library(cowplot)   # assumed: provides background_grid() and plot_grid() used below
> df <- ToothGrowth
> df$dose <- as.factor(df$dose)
> head(df)
BOX PLOT
> bp <- ggplot(df, aes(x = dose, y = len, fill = dose)) + geom_boxplot()   # assumed definition
> bp
#add gridlines
> bp + background_grid(major = "xy", minor = "none")
SCATTER PLOT
> sp <- ggplot(df, aes(x = len, y = dose, colour = dose)) + geom_point()   # assumed definition
> sp
BAR PLOT
> bar <- ggplot(df, aes(x = dose, fill = dose)) + geom_bar()   # assumed definition
> bar
#arrange two plots side by side
> plot_grid(sp, bp, labels = c("A","B"), ncol = 2, nrow = 1)
#histogram
> ggplot(diamonds, aes(x = carat)) + geom_histogram(binwidth = 0.25, fill =
'steelblue')+scale_x_continuous(breaks=seq(0,3, by=0.5))
reshape2 Package
As the name suggests, this package is useful in reshaping data. The data come in many forms.
Hence, we are required to shape it according to our need. Usually, the process of reshaping
data in R is tedious. R base functions consist of ‘Aggregation’ option using which data can
be reduced and rearranged into smaller forms, but with reduction in amount of information.
Aggregation includes tapply, by and aggregate base functions. The reshape package
overcomes these problems. It has 2 functions namely melt and cast.
melt : This function converts data from wide format to long format. It’s a form of
restructuring where multiple categorical columns are ‘melted’ into unique rows.
#create a data
> ID <- c(1,2,3,4,5)
> Names <- c('Joseph','Martin','Joseph','James','Martin')
> DateofBirth <- c(1993,1992,1993,1994,1992)
> Subject <- c('Maths','Biology','Science','Psychology','Physics')
> thisdata <- data.frame(ID, Names, DateofBirth, Subject)
> data.table(thisdata)
#load package
> install.packages('reshape2')
> library(reshape2)
#melt
> mt <- melt(thisdata, id=(c('ID','Names')))
> mt
cast: This function converts data from long format to wide format. It starts with melted
data and reshapes it back into wide format; it is just the reverse of the melt function.
There are two variants, namely dcast and acast: dcast returns a data frame as output, while
acast returns a vector/matrix/array as output.
> mcast <- dcast(mt, DateofBirth + Subject ~ variable)
> mcast
tidyr Package
This package can make the data look ‘tidy’. It has 4 major functions to accomplish this task.
The 4 functions are:
gather() – it 'gathers' multiple columns and converts them into key:value pairs. This
function transforms the wide form of data into the long form; you can use it as an
alternative to 'melt' in the reshape2 package (see the sketch after the dummy data set below).
spread() – It does reverse of gather. It takes a key:value pair and converts it into
separate columns.
separate() – It splits a column into multiple columns.
unite() – It does reverse of separate. It unites multiple columns into single column
#load package
> library(tidyr)
#create a dummy data set
> names <- c('A','B','C','D','E','A','B')
> weight <- c(55,49,76,71,65,44,34)
> age <- c(21,20,25,29,33,32,38)
> tdata <- data.frame(names, weight, age)   # assumed: the notes print tdata without defining it
> tdata
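A sketch of gather() on this data, melting the measure columns into key:value pairs:
> long_t <- tdata %>% gather(key = 'measure', value = 'value', weight, age)
> long_t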
Separate Command
#create a data set
> Humidity <- c(37.8, 42.3, 52.2, 44.6, 43.8, 44.6)   # assumed illustrative values
> Rain <- c(0.97, 1.85, 0.72, 0.55, 0.22, 0.39)       # assumed illustrative values
> Time <- c("27/01/2015 15:44","23/02/2015 23:24", "31/03/2015 19:15", "20/01/2015 20:52",
"23/02/2015 07:46", "31/01/2015 01:55")
#build a data frame
> d_set <- data.frame(Humidity, Rain, Time)
#using separate function we can separate date, month, year
> separate_d <- d_set %>% separate(Time, c('Date', 'Month','Year'))
> separate_d
Unite Command
#using unite function - reverse of separate
> unite_d <- separate_d%>% unite(Time, c(Date, Month, Year), sep = "/")
> unite_d
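A self-contained sketch of spread(), the reverse of gather() (a hypothetical unique id
column is added, which spread() needs to identify rows):
> wt <- data.frame(id = 1:3, weight = c(55, 49, 76), age = c(21, 20, 25))
> long_w <- wt %>% gather(key = 'measure', value = 'value', weight, age)
> wide_t <- long_w %>% spread(measure, value)
> wide_t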
readr Package
'readr' helps in reading various forms of data into R, up to 10 times faster than the base
equivalents, and characters are never converted to factors. This package can replace the
traditional read.csv() and read.table() base R functions. It provides functions such as
read_csv(), read_tsv(), read_delim() and read_fwf() for delimited and fixed-width files.
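A minimal sketch (the file name is hypothetical):
> library(readr)
> my_data <- read_csv('iris.csv')   # faster than read.csv(); strings stay character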
Lubridate Package
The lubridate package reduces the pain of working with date-time variables in R. Its built-in
functions offer a nice way to parse dates and times easily, and the package is frequently
used with data that contains time-based variables.
> install.packages('lubridate')
> library(lubridate)
#current date and time
> now()
[1] "2015-12-11 13:23:48 IST"
#add days, months, years, seconds
> d_time <- now()
> d_time + ddays(1)
[1] "2015-12-12 13:24:54 IST"
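A small parsing sketch with lubridate's helpers:
> ymd("2015-12-11")    # parse a year-month-day string into a date
> dmy("11/12/2015")    # parse a day-month-year string
> month(now()); year(now())   # extract components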
WATSON STUDIO
Watson Studio provides you with the environment and tools to solve your business problems
by collaboratively working with data. You can choose the tools you need to analyze and
visualize data, to cleanse and shape data, to ingest streaming data, or to create and train
machine learning models.
This illustration shows how the architecture of Watson Studio is centered around the project.
A project is where you organize your resources and work with data.
Visualizing information in graphical ways can give you insights into your data. By enabling
you to look at and explore data from different perspectives, visualizations can help you
identify patterns, connections, and relationships within that data as well as understand large
amounts of information very quickly.
Create a project -
To create a project :
Click New project on the Watson Studio home page or your My Projects page.
Choose whether to create an empty project or to create a project based on an exported project
file or a sample project.
If you chose to create a project from a file or a sample, upload a project file or select a sample
project. See Importing a project.
On the New project screen, add a name and optional description for the
project.
Select the Restrict who can be a collaborator check box to restrict collaborators to members
of your organization or integrate with a catalog. The check box is selected by default if you
are a member of a catalog. You can’t change this setting after you create the project.
Click Create. You can start adding resources if your project is empty or begin working with
the resources you imported.
From your project's Assets page, click Add to project > Data, or click the Find and add data
icon. You can also click the Find and add data icon from within a notebook or canvas.
In the Load pane that opens, browse for the files or drag them onto the pane. You must stay
on the page until the load is complete. You can cancel an ongoing load process if you want to
stop loading a file.
Case Study:
Let us take the Iris Data set to see how we can visualize the data in Watson studio.
Adding Data to Data Refinery
You can also visualize your data with these same charts in an SPSS Modeler flow. Right-click
a node and select Profile.
1. Click any of the available charts. Then add columns in the DETAILS panel that opens on the
left side of the page.
2. Select the columns that you want to work with. Suggested charts will be indicated with a dot
next to the chart name. Click a chart to visualize your data.
Click on refine
UNIT – III
Introduction to Anaconda -
Anaconda Installation -
Go to the Anaconda website and choose a Python 3.x graphical installer (or a Python 2.x
graphical installer).
Click on Next.
Note your installation location and then click Next.
Choose whether to add Anaconda to your PATH environment variable. We recommend not
adding Anaconda to the PATH environment variable, since this can interfere with other
software. Instead, use Anaconda software by opening Anaconda Navigator or the Anaconda
Prompt from the Start Menu.
Click Finish.
Open a Command Prompt. Check if you already have Anaconda added to your path.
Enter the commands below into your Command Prompt.
conda --version
python --version
This checks whether you already have Anaconda added to your path. If you get a "command not
recognized" error, then the Anaconda path needs to be set.
If you don't know where your conda and/or python is, open an Anaconda Prompt and type in
the following commands; they tell you where conda and python are located on your computer.
where conda
where python
Add conda and python to your PATH. You can do this by going to your System Environment
Variables and adding the directories returned by the commands above.
Open a new Command Prompt. Try typing conda --version and python --version into
the Command Prompt to check to see if everything went well.
What is Jupyter
The Jupyter Notebook is an open source web application that you can use to create and share
documents that contain live code, equations, visualizations, and text. Jupyter Notebook is
maintained by the people at Project Jupyter.
Jupyter Notebooks are a spin-off from the IPython project, which used to have an IPython
Notebook project itself. The name Jupyter comes from the core programming languages it
supports: Julia, Python, and R. Jupyter ships with the IPython kernel, which allows you to
write your programs in Python, but there are currently over 100 other kernels that you can
also use.
Installing Anaconda Distribution will also include Jupyter Notebook.
To access the Jupyter Notebook, go to the Anaconda Prompt and run the command below:
jupyter notebook
Or go to the Command Prompt and first activate the root conda environment before launching
Jupyter Notebook.
Then you'll see the application opening in the web browser on the following address:
http://localhost:8888.
Python Scripting Basics
First Program in Python
A statement or expression is an instruction the computer will run or execute. Perhaps the
simplest program you can write is a print statement. When you run the print statement, Python
will simply display the value in the parentheses. The value in the parentheses is called the
argument.
If you are using a Jupyter notebook, you will see a small rectangle with the statement. This is
called a cell. If you select this cell with your mouse, then click the run cell button. The statement
will execute. The result will be displayed beneath the cell.
It's customary to comment your code. This tells other people what your code does. You simply
put a hash symbol preceding your comment. When you run the code, Python will ignore the
comment.
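A minimal sketch of such a cell:
# This comment is ignored by Python when the cell runs
print('Hello, Python!')   # displays the argument beneath the cell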
Data Types
A type is how Python represents different types of data. You can have different types in Python.
They can be integers like 11, real numbers like 21.213. They can even be words.
The following chart summarizes three data types for the last examples. The first column
indicates the expression; the second column indicates the data type. We can see the actual
data type in Python by using the type() command. We can have int, which stands for an
integer, and float, which stands for floating point, essentially a real number. The type
string (str) is a sequence of characters.
Integers can be negative or positive. It should be noted that there is a finite range of integers,
but it is quite large. Floats are real numbers; they include the integers but also numbers in
between the integers. Consider the numbers between 0 and 1. We can select numbers in
between them; these numbers are floats. Similarly, consider the numbers between 0.5 and 0.6.
We can select numbers in-between them; these are floats as well.
Casting an integer to a float does not really change its value. If you cast a float to an
integer, however, you must be careful: for example, if you cast the float 1.1 to an integer
you get 1 and lose some information. If a string contains an integer value, you can convert
it to an int; if we convert a string that contains a non-integer value, we get an error. You
can convert an int to a string or a float to a string.
Boolean is another important type in Python. A Boolean can take on two values. The first
value is true, just remember we use an uppercase T. Boolean values can also be false, with an
uppercase F. Using the type command on a Boolean value, we obtain the term bool, this is
short for Boolean. If we cast a Boolean true to an integer or float, we will get a 1.
If we cast a Boolean false to an integer or float, we get a zero. If you cast a 1 to a Boolean,
you get a true. Similarly, if you cast a 0 to a Boolean, you get a false.
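A sketch of types and casting:
print(type(11))                # <class 'int'>
print(type(21.213))            # <class 'float'>
print(int(1.1))                # 1 - the fractional part is lost
print(int("1"), float("1.2"))  # 1 1.2
print(int(True), bool(0))      # 1 False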
In Python, a string is a sequence of characters. A string is contained within two quotes: You
could also use single quotes. A string can be spaces, or digits. A string can also be special
characters. We can bind or assign a string to a variable. It is helpful to think of a string
as an ordered sequence in which each element can be accessed using an index. The first
element is at index 0, index 6 gives the seventh character, and index 12 gives the
thirteenth character. We can also use negative indexing with strings: the last element is
given by the index -1 and, for a 15-character string, the first element can be obtained by
index -15, and so on.
We can treat the string as a sequence and perform sequence operations such as slicing. A
stride can also be given: the value 2 indicates that we select every second character, and
slicing can be combined with a stride, for example returning every second character up to
index four. We can use the len() function to obtain the length of the string; for a string
of 15 elements, the result is 15.
We can concatenate or combine strings using the addition symbol; the result is a new string
that is a combination of both. We can also replicate values of a string by multiplying the
string by the number of times we would like to replicate it, in this case three. The result
is a new string consisting of three copies of the original string. Strings are immutable:
you cannot change the value of a string, but you can create a new one.
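A sketch of these string operations (the example string is hypothetical):
name = "data science"
print(name[0], name[-1])   # 'd' 'e' - indexing, including negative indexing
print(name[0:4])           # 'data' - slicing
print(name[::2])           # every second character - a stride of 2
print(len(name))           # 12
print(name + "!")          # concatenation creates a new string
print("ab" * 3)            # 'ababab' - replication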
There are four collection data types in the Python programming language: list, tuple, set
and dictionary. A tuple is a collection which is ordered and unchangeable, and it allows
duplicate members.
Tuple:
tuples are expressed as comma-separated elements within parentheses.
In Python, there are different types: strings, integer, float. They can all be contained in a tuple,
but the type of the variable is tuple
Each element of a tuple can be accessed via an index. The element in the tuple can be accessed
by the name of the tuple followed by a square bracket with the index number. Use the square
brackets for slicing along with the index or indices to obtain value available at that index.
To see why immutability is important, let's see what happens when we set the variable
Ratings1 equal to Ratings. Each variable does not contain its own tuple; both reference the
same immutable tuple object.
Let's say we want to change the element at index 2. Because tuples are immutable, we can't.
Therefore, Ratings1 will not be affected by a change in Ratings, because the tuple cannot
be changed. We can, however, assign a different tuple to the Ratings variable; the variable
Ratings then references another tuple.
There are many built-in functions that take a tuple as a parameter and perform some task:
for example, we can find the length of a tuple with the len() function, the minimum value
with the min() function, and so on. If we would like to sort a tuple, we use the function
sorted(); the input is the original tuple and the output is a new sorted list.
A tuple can contain other tuples as well as other complex data types; this is called
nesting. For example, we can access the second element of a nested tuple variable NT by
applying indexing directly to it. It is helpful to visualize this nesting as a tree. If we
consider indexes holding other tuples, say the tuple at index 2 contains a tuple with two
elements, we can access those elements with a second index; the same convention applies to
index 3. We can continue the process and access even deeper levels of the tree by adding
another square bracket, for example NT[2][1][0].
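A sketch of tuples and nesting (NT is a hypothetical nested tuple):
NT = (1, 2, ("pop", "rock"), (3, 4))
print(NT[2])         # ('pop', 'rock')
print(NT[2][1])      # 'rock'
print(NT[2][1][0])   # 'r' - indexing into the nested string
print(len(NT), min((3, 1, 2)), sorted((3, 1, 2)))   # 4 1 [1, 2, 3]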
List:
A list is a collection which is ordered and changeable. A list is represented with square
brackets. In many respects lists are like tuples; one key difference is that lists are
mutable. Lists can contain strings, floats and integers, and we can nest other lists. We can
also nest tuples and other data structures; the same indexing conventions apply for nesting,
and, like tuples, each element of a list can be accessed via an index.
The following table represents the relationship between the index and the elements in the list.
The first element can be accessed by the name of the list followed by a square bracket with the
index number, in this case zero. We can access the second element as follows. We can also
access the last element. In Python, we can use a negative index.
The index conventions for lists and tuples are identical for accessing and slicing the
elements. We can concatenate or combine lists by adding them. Lists are mutable; therefore,
we can change them. For example, we apply the method extend by adding a "dot" followed by
the name of the method, then parentheses.
The argument inside the parentheses is a new list that we are going to concatenate to the
original list. In this case, instead of creating a new list, the original list List1 is
modified by adding four new elements.
Another similar method is append. If we apply append instead of extend, we add only one
element to the list: if we look at the index, there is only one more element, and index 4
contains the entire list we appended.
Every time we apply a method, the list changes.
As lists are mutable, we can change them. For example, we can change the second element as
follows; the list now becomes [1, "CHANGED", 3, 4].
We can delete an element of a list using the del command; we simply indicate the list item
we would like to remove as an argument. For example, to remove the second element we perform
the del List[1] command; this operation removes the second element of the list, and the
result becomes [1, 3, 4].
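A sketch of list mutation:
L = [1, 2, 3, 4]
L.extend([5, 6])    # L is now [1, 2, 3, 4, 5, 6]
L.append([7, 8])    # append adds ONE element - the whole list: [..., [7, 8]]
L[1] = "CHANGED"    # lists are mutable
del L[1]            # remove the second element
print(L)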
LISTS: Aliasing
When we set one variable, B equal to A, both A and B are referencing the same list. Multiple
names referring to the same object is known as aliasing.
If we change the first element in A to "banana", we get a side effect: the value of B
changes as a consequence. A and B reference the same list, therefore if we change A, list B
also changes. If we check the first element of B after changing list A, we get "banana"
instead of "hard rock".
You can clone list A by using the syntax A[:]. Variable A references one list; variable B
then references a new copy, or clone, of the original list. Now if you change A, B will not
change. We can get more information on lists, tuples and many other objects in Python using
the help command.
Simply pass in the list, tuple or any other Python object, for example help(list) or
help(tuple).
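A sketch of aliasing versus cloning, using the values from the text:
A = ["hard rock", 10, 1.2]
B = A            # aliasing: B references the same list
C = A[:]         # cloning: C references a new copy
A[0] = "banana"
print(B[0])      # 'banana' - the side effect of aliasing
print(C[0])      # 'hard rock' - the clone is unaffected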
Set:
Sets are a type of collection. Unlike lists and tuples, they are unordered, so you cannot
access items in a set by referring to an index: the items have no index. To define a set,
you use curly brackets and place the elements of the set within them. If you include
duplicate items, they will not be present when the actual set is created.
To add more than one item to a set use the update () method with list of values.
To remove an item from the set we can use the pop() method; remember sets are unordered, so
pop() removes an arbitrary item rather than a "first" one.
To remove an item from the set, use the remove method, we simply indicate the set item we
would like to remove as an argument.
There are lots of useful mathematical operations we can do between sets. like union,
intersection, difference, symmetric difference from two sets.
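A sketch of set creation and the set operations:
A = {"rock", "pop", "soul", "pop"}   # the duplicate "pop" disappears
A.update(["jazz", "blues"])          # add more than one item
A.remove("soul")
B = {"rock", "jazz", "disco"}
print(A & B)   # intersection
print(A | B)   # union
print(A - B)   # difference
print(A ^ B)   # symmetric difference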
DICTIONARIES:
Python dictionary is an unordered collection of items. While other compound data types have
only value as an element, a dictionary has a key: value pair. Dictionaries are optimized to
retrieve values when the key is known. Creating a dictionary is as simple as placing items inside
curly braces {} separated by comma. An item has a key and the corresponding value expressed
as a pair, key: value. While values can be of any data type and can repeat, keys must be of
immutable type (string, number or tuple with immutable elements) and must be unique.
We can get the value using keys either inside square brackets or with get( ) method.
Dictionary is mutable. We can add new items or change the value of existing items using
assignment operator. If the key is already present, value gets updated, else a new key: value
pair is added to the dictionary.
We can delete an entry as follows. This gets rid of the key "address" and its value from my_dict
dictionary.
We can verify whether a key is in a dictionary using the in operator: it returns True if
the key is present, and False otherwise.
In order to see all the keys in a dictionary, we can use the method keys to get the keys. The
output is a list like object with all keys. In the same way, we can obtain the values.
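A sketch of dictionary creation and manipulation (the "address" key mirrors the deletion example above; the other values are hypothetical):
my_dict = {"name": "Jack", "age": 26, "address": "Downtown"}
print(my_dict["name"], my_dict.get("age"))   # access values by key
my_dict["age"] = 27             # update an existing value
my_dict["city"] = "Hyderabad"   # add a new key:value pair
del my_dict["address"]          # delete an entry
print("name" in my_dict)        # True
print(my_dict.keys(), my_dict.values())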
Conditional Statements
What is Control or Conditional Statements -
In programming languages, most of the time we have to control the flow of execution of a
program: we want to execute some set of statements only if a given condition is satisfied,
and a different set of statements when it is not. Such statements are called control
statements or decision-making statements.
Conditional statements are also known as decision-making statements. We use these statements
when we want to execute a block of code when the given condition is true or false.
Usually the condition will be an expression built with relational operators such as ==
(equal), != (not equal), > (greater than), < (less than), >= and <=.
In Python we achieve the decision-making statements by using below statements -
If statements
If-else statements
Elif statements
If statements -
If statement is one of the most commonly used conditional statement in most of the
programming languages. It decides whether certain statements need to be executed or not. If
statement checks for a given condition, if the condition is true, then the set of code present
inside the if block will be executed.
The if condition evaluates a Boolean expression and executes the block of code only when the
Boolean expression becomes TRUE. In terms of control flow, the interpreter reaches the if
condition and evaluates it: if it is true, the statements inside the block are executed;
otherwise, the code present after the block is executed.
Let's take an example to implement the if statement. In this example we have a variable
name which stores the string "Srikar", and we also have a names list with some names.
We can use an if statement to check whether the name is present in the names list: if the
condition is true, the block of statements inside the 'if' block is executed; if the
condition is false, execution of the 'if' block statements is skipped. A sketch is shown
below.
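A sketch of that example (the list contents are assumed):
name = "Srikar"
names = ["Srikar", "Anil", "Kiran"]   # assumed contents of the names list
if name in names:
    print(name, "is present in the names list")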
If-else statements:
The statement itself tells that if a given condition is true then execute the statements present
inside if block and if the condition is false then execute the else block.
Else block will execute only when the condition becomes false, this is the block where you will
perform some actions when the condition is not true.
If-else statement evaluates the Boolean expression and executes the block of code present
inside the if block if the condition becomes TRUE and executes a block of code present in the
else block if the condition becomes FALSE.
Let's take an example to implement the if-else statement: in the sketch below, the if block
is executed if the given condition is true; otherwise the else block is executed.
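A minimal sketch (the value of num is hypothetical):
num = 7
if num % 2 == 0:
    print("even")   # runs only when the condition is true
else:
    print("odd")    # runs when the condition is false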
elif statements:
In Python, we have one more conditional statement, called the elif statement. An elif
statement is used to check further conditions when the preceding if condition is false. It
is like an if-else statement, with one difference: an else block does not check a condition,
whereas an elif block does.
Elif statements are similar to if-else statements but elif statements evaluate multiple conditions.
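A sketch of elif evaluating multiple conditions in turn (the marks value is hypothetical):
marks = 72
if marks >= 75:
    print("Distinction")
elif marks >= 60:
    print("First class")   # checked only because the first condition was false
else:
    print("Pass or below")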