Introduction To Data Science - II-I
(DATA SCIENCE)
(2022-2023)
Data science combines math and statistics, specialized programming, advanced analytics,
artificial intelligence (AI), and machine learning with specific subject matter expertise to
uncover actionable insights hidden in an organization’s data. These insights can be used to
guide decision making and strategic planning.
The accelerating volume of data sources, and subsequently data, has made data science one
of the fastest growing fields across every industry. It's increasingly critical to businesses: The
insights that data science generates help organizations increase operational efficiency, identify
new business opportunities and improve marketing and sales programs, among other benefits.
Ultimately, they can lead to competitive advantages over business rivals.
So, what is eyebrow-raising about Big Data and data science? Let’s count the ways:
1. There’s a lack of definitions around the most basic terminology. What is “Big Data”
anyway? What does “data science” mean? What is the relationship between Big Data and data
science? Is data science the science of Big Data? Is data science only the stuff going on in
companies like Google and Facebook and tech companies? Why do many people refer to Big
Data as crossing disciplines (astronomy, finance, tech, etc.) and to data science as only taking
place in tech? Just how big is big? Or is it just a relative term? These terms are so ambiguous,
they’re well-nigh meaningless.
2. There’s a distinct lack of respect for the researchers in academia and industry labs who
have been working on this kind of stuff for years, and whose work is based on decades (in some
cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and
scientists of all types. From the way the media describes it, machine learning algorithms were
just invented last week and data was never “big” until Google came along. This is simply not
the case. Many of the methods and techniques we’re using—and the challenges we’re facing
now—are part of the evolution of everything that’s come before. This doesn’t mean that there’s
not new and exciting stuff going on, but we think it’s important to show some basic respect for
everything that came before.
3. The hype is crazy—people throw around tired phrases straight out of the height of the
pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that
doesn’t bode well. In general, hype masks reality and increases the noise-to-signal ratio. The
longer the hype goes on, the more many of us will get turned off by it, and the harder it will be
to see what’s good underneath it all, if anything.
4. Statisticians already feel that they are studying and working on the “Science of Data.”
That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but
imagine that for the statistician, this feels a little bit like how identity theft might feel for you.
Although we will make the case that data science is not just a rebranding of statistics or machine
learning but rather a field unto itself, the media often describes data science in a way that makes
it sound as if it's simply statistics or machine learning in the context of the tech industry.
People have said to us, “Anything that has to call itself a science isn’t.” Although there might
be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of
course what it represents may not be science but more of a craft.
Rachel’s experience going from getting a PhD in statistics to working at Google is a great
example to illustrate why we thought, in spite of the aforementioned reasons to be dubious,
there might be some meat in the data science sandwich. In her words:
It was clear to me pretty quickly that the stuff I was working on at Google was different than
anything I had learned at school when I got my PhD in statistics. This is not to say that my
degree was useless; far from it—what I’d learned in school provided a framework and way of
thinking that I relied on daily, and much of the actual content provided a solid theoretical and
practical foundation necessary to do my work.
But there were also many skills I had to acquire on the job at Google that I hadn’t learned in
school. Of course, my experience is specific to me in the sense that I had a statistics background
and picked up more computation, coding, and visualization skills, as well as domain expertise
while at Google. Another person coming in as a computer scientist or a social scientist or a
physicist would have different gaps and would fill them in accordingly. But what is important
here is that, as individuals, we each had different strengths and gaps, yet we were able to solve
problems by putting ourselves together into a data team well-suited to solve the data problems
that came our way.
Here’s a reasonable response you might have to this story. It’s a general truism that, whenever
you go from school to a real job, you realize there’s a gap between what you learned in school
and what you do on the job. In other words, you were simply facing the difference between
academic statistics and industry statistics.
Sure, there is a difference between industry and academia. But does it really have to
be that way? Why do many courses in school have to be so intrinsically out of touch with
reality?
Even so, the gap doesn’t represent simply a difference between industry statistics and
academic statistics. The general experience of data scientists is that, at their job, they have
access to a larger body of knowledge and methodology, as well as a process, which we now
define as the data science process (details in Chapter 2), that has foundations in both statistics
and computer science.
Around all the hype, in other words, there is a ring of truth: this is something new. But at the
same time, it’s a fragile, nascent idea at real risk of being rejected prematurely. For one thing,
it’s being paraded around as a magic bullet, raising unrealistic expectations that will surely be
disappointed.
Rachel gave herself the task of understanding the cultural phenomenon of data science and how
others were experiencing it. She started meeting with people at Google, at startups and tech
companies, and at universities, mostly from within statistics departments.
From those meetings she started to form a clearer picture of the new thing that’s emerging. She
ultimately decided to continue the investigation by giving a course at Columbia called
“Introduction to Data Science,” which Cathy covered on her blog.
Datafication
In the May/June 2013 issue of Foreign Affairs, Kenneth Neil Cukier and Viktor Mayer-
Schoenberger wrote an article called “The Rise of Big Data”. In it they discuss the concept of
datafication, and their example is how we quantify friendships with “likes”: it’s the way
everything we do, online or otherwise, ends up recorded for later examination in someone’s
data storage units. Or maybe multiple storage units, and maybe also for sale.
They define datafication as a process of “taking all aspects of life and turning them into data.”
As examples, they mention that “Google’s augmented-reality glasses datafy the gaze. Twitter
datafies stray thoughts. LinkedIn datafies professional networks.”
Datafication is an interesting concept and led us to consider its importance with respect to
people’s intentions about sharing their own data. We are being datafied, or rather our actions
are, and when we “like” someone or something online, we are intending to be datafied, or at
least we should expect to be. But when we merely browse the Web, we are unintentionally, or
at least passively, being datafied through cookies that we might or might not be aware of. And
when we walk around in a store, or even on the street, we are being datafied in a completely
unintentional way, via sensors, cameras, or Google glasses.
This spectrum of intentionality ranges from us gleefully taking part in a social media
experiment we are proud of, to all-out surveillance and stalking. But it’s all datafication. Our
intentions may run the gamut, but the results don’t.
They follow up their definition in the article with a line that speaks volumes about their
perspective:
Once we datafy things, we can transform their purpose and turn the information into new forms
of value.
Here’s an important question that we will come back to throughout the book: who is “we” in
that case? What kinds of value do they refer to? Mostly, given their examples, the “we” is the
modelers and entrepreneurs making money from getting people to buy stuff, and the “value”
translates into something like increased efficiency through automation.
This is an ongoing discussion, but one way to understand what’s going on in this industry is to
look online and see what current discussions are taking place. This doesn’t necessarily tell us
what data science is, but it at least tells us what other people think it is, or how they’re
perceiving it. For example, on Quora there’s a discussion from 2010 about “What is Data
Science?” to which Metamarket CEO Mike Driscoll posted an answer. In it, Driscoll refers to
Drew Conway’s Venn diagram of data science from 2010, shown in Figure 1-1. He also
mentions the sexy skills of data geeks from Nathan Yau’s 2009 post, “Rise of the Data
Scientist”.
But wait, is data science just a bag of tricks? Or is it the logical extension of other fields like
statistics and machine learning? For one argument, see Cosma Shalizi’s posts here and here,
and Cathy’s posts here and here, which constitute an ongoing discussion of the difference
between a statistician and a data scientist. Cosma basically argues that any statistics department
worth its salt does all the stuff in the descriptions of data science that he sees, and therefore
data science is just a rebranding and unwelcome takeover of statistics. For a slightly different
perspective, see ASA President Nancy Geller’s 2011 Amstat News article, “Don’t shun the ‘S’
word”, in which she defends statistics:
We need to tell people that Statisticians are the ones who make sense of the data deluge
occurring in science, engineering, and medicine; that statistics provides methods for data
analysis in all fields, from art history to zoology; that it is exciting to be a Statistician in the
21st century because of the many challenges brought about by the data explosion in all of these
fields.
Though we get her point—the phrase “art history to zoology” is supposed to represent the
concept of A to Z—she’s kind of shooting herself in the foot with these examples because they
don’t correspond to the high-tech world where much of the data explosion is coming from.
Much of the development of the field is happening in industry, not academia. That is, there are
people with the job title data scientist in companies, but no professors of data science in
academia. (Though this may be changing.)
Not long ago, DJ Patil described how he and Jeff Hammerbacher— then at LinkedIn and
Facebook, respectively—coined the term “data scientist” in 2008. So that is when “data
scientist” emerged as a job title. (Wikipedia finally gained an entry on data science in 2012.) It
makes sense to us that once the skill set required to thrive at Google —working with a team on
problems that required a hybrid skill set of stats and computer science paired with personal
characteristics including curiosity and persistence—spread to other Silicon Valley tech
companies, it required a new job title. Once it became a pattern, it deserved a name. And once
it got a name, everyone and their mother wanted to be one. It got even worse when Harvard
Business Review declared data scientist to be the “Sexiest Job of the 21st Century”.
Both LinkedIn and Facebook are social network companies. Oftentimes a description or
definition of data scientist includes hybrid statistician, software engineer, and social scientist.
This made sense in the context of companies where the product was a social product and still
makes sense when we’re dealing with human or user behavior. But if you think about Drew
Conway’s Venn diagram, data science problems cross disciplines—that’s what the substantive
expertise is referring to. In other words, it depends on the context of the problems you’re trying
to solve.
If they’re social science-y problems like friend recommendations or people you know or user
segmentation, then by all means, bring on the social scientist! Social scientists also do tend to
be good question askers and have other good investigative qualities, so a social scientist who
also has the quantitative and programming chops makes a great data scientist. But it’s almost
a “historical” (historical is in quotes because 2008 isn’t that long ago) artifact to limit your
conception of a data scientist to someone who works only with online user behavior data.
There’s another emerging field out there called computational social sciences, which could be
thought of as a subset of data science. But we can go back even further. In 2001, William
Cleveland wrote a position paper about data science called “Data Science: An action plan to
expand the field of statistics.” So data science existed before data scientists? Is this semantics,
or does it make sense? This all begs a few questions: can you define data science by what data
scientists do? Who gets to define the field, anyway? There’s lots of buzz and hype—does the
media get to define it, or should we rely on the practitioners, the self-appointed data scientists?
Or is there some actual authority? Let’s leave these as open questions for now, though we will
return to them throughout the book.
Statistical Inference
The world we live in is complex, random, and uncertain. At the same time, it’s one big data-
generating machine. As we commute to work on subways and in cars, as our blood moves
through our bodies, as we’re shopping, emailing, procrastinating at work by browsing the
Internet and watching the stock market, as we’re building things, eating things, talking to our
friends and family about things, while factories are producing products, this all at least
potentially produces data.
Imagine spending 24 hours looking out the window, and for every minute, counting and
recording the number of people who pass by. Or gathering up everyone who lives within a mile
of your house and making them tell you how many email messages they receive every day for
the next year. Imagine heading over to your local hospital and rummaging around in the blood
samples looking for patterns in the DNA. That all sounded creepy, but it wasn’t supposed to.
The point here is that the processes in our lives are actually data-generating processes.
We’d like ways to describe, understand, and make sense of these processes, in part because as
scientists we just want to understand the world better, but many times, understanding these
processes is part of the solution to problems we’re trying to solve. Data represents the traces of
the real-world processes, and exactly which traces we gather are decided by our data collection
or sampling method. You, the data scientist, the observer, are turning the world into data, and
this is an utterly subjective, not objective, process. After separating the process from the data
collection, we can see clearly that there are two sources of randomness and uncertainty.
Namely, the randomness and uncertainty underlying the process itself, and the uncertainty
associated with your underlying data collection methods. Once you have all this data, you have
somehow captured the world, or certain traces of the world. But you can’t go walking around
with a huge Excel spreadsheet or database of millions of transactions and look at it and, with a
snap of a finger, understand the world and process that generated it. So you need a new idea,
and that’s to simplify those captured traces into something more comprehensible, to something
that somehow captures it all in a much more concise way, and that something could be
mathematical models or functions of the data, known as statistical estimators.
This overall process of going from the world to the data, and then from the data back to the
world, is the field of statistical inference. More precisely, statistical inference is the discipline
that concerns itself with the development of procedures, methods, and theorems that allow us
to extract meaning and information from data that has been generated by stochastic (random)
processes.
Populations and Samples
Let’s get some terminology and concepts in place to make sure we’re all talking about the same
thing. In classical statistical literature, a distinction is made between the population and the
sample. The word population immediately makes us think of the entire US population of 300
million people, or the entire world’s population of 7 billion people. But put that image out of
your head, because in statistical inference population isn’t used to simply describe only people.
It could be any set of objects or units, such as tweets or photographs or stars. If we could
measure the characteristics or extract characteristics of all those objects, we’d have a complete
set of observations, and the convention is to use N to represent the total number of observations
in the population. Suppose your population was all emails sent last year by employees at a huge
corporation, BigCorp. Then a single observation could be a list of things: the sender’s name,
the list of recipients, date sent, text of email, number of characters in the email, number of
sentences in the email, number of verbs in the email, and the length of time until first reply.
When we take a sample, we take a subset of the units of size n in order to examine the
observations to draw conclusions and make inferences about the population. There are different
ways you might go about getting this subset of data, and you want to be aware of this sampling
mechanism because it can introduce biases into the data, and distort it, so that the subset is not
a “mini-me” shrunk-down version of the population. Once that happens, any conclusions you
draw will simply be wrong and distorted.
In the BigCorp email example, you could make a list of all the employees and select 1/10th of
those people at random and take all the email they ever sent, and that would be your sample.
Alternatively, you could sample 1/10th of all email sent each day at random, and that would be
your sample. Both these methods are reasonable, and both methods yield the same sample size.
But if you took them and counted how many email messages each person sent, and used that
to estimate the underlying distribution of emails sent by all individuals at BigCorp, you might
get entirely different answers. So if even getting a basic thing down like counting can get
distorted when you’re using a reasonable-sounding sampling method, imagine what can happen
to more complicated algorithms and models if you haven’t taken into account the process that
got the data into your hands.
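To make this concrete, here is a minimal R sketch under made-up assumptions (1,000 employees, Poisson-distributed yearly email counts, and method B simplified to sampling a tenth of all email at once rather than per day). It is only an illustration of how two reasonable-sounding sampling schemes can paint different pictures of "emails sent per person":

set.seed(1)
n_emp <- 1000
true_counts <- rpois(n_emp, lambda = rexp(n_emp, rate = 1/200)) # yearly emails per employee
# Method A: pick 1/10th of the employees and keep all of their email
people_a <- sample(n_emp, n_emp / 10)
counts_a <- true_counts[people_a]
# Method B: pick 1/10th of all individual emails, then count how many each employee sent
senders <- rep(seq_len(n_emp), times = true_counts)
sampled <- sample(senders, length(senders) %/% 10)
counts_b <- tabulate(sampled, nbins = n_emp) * 10 # rescale back to a yearly total
summary(counts_a)
summary(counts_b) # the two estimated distributions differ noticeably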
Modeling
In the next chapter, we’ll look at how we build models from the data we collect, but first we
want to discuss what we even mean by this term. Rachel had a recent phone conversation with
someone about a modeling workshop, and several minutes into it she realized the word
“model” meant completely different things to the two of them. He was using it to mean data models—the
representation one is choosing to store one’s data, which is the realm of database managers—
whereas she was talking about statistical models, which is what much of this book is about.
One of Andrew Gelman’s blog posts on modeling was recently tweeted by people in the fashion
industry, but that’s a different issue. Even if you’ve used the terms statistical model or
mathematical model for years, is it even clear to yourself and to the people you’re talking to
what you mean? What makes a model a model? Also, while we’re asking fundamental
questions like this, what’s the difference between a statistical model and a machine learning
algorithm? Before we dive deeply into that, let’s add a bit of context with this deliberately
provocative Wired magazine piece, “The End of Theory: The Data Deluge Makes the Scientific
Method Obsolete,” published in 2008 by Chris Anderson, then editor-in-chief. Anderson
equates massive amounts of data to complete information and argues no models are necessary
and “correlation is enough”; e.g., that in the context of massive amounts of data, “they [Google]
don’t have to settle for models at all.”
Really? We don’t think so, and we don’t think you’ll think so either by the end of the book.
But the sentiment is similar to the Cukier and Mayer-Schoenberger article we just discussed
about N=ALL, so you might already be getting a sense of the profound confusion we’re
witnessing all around us. To their credit, it’s the press that’s currently raising awareness of
these questions and issues, and someone has to do it. Even so, it’s hard to take when the opinion
makers are people who don’t actually work with data. Think critically about whether you buy
what Anderson is saying; where you agree, disagree, or where you need more information to
form an opinion. Given that this is how the popular press is currently describing and influencing
public perception of data science and modeling, it’s incumbent upon us as data scientists to be
aware of it and to chime in with informed comments. With that context, then, what do we mean
when we say models? And how do we use them as data scientists? To get at these questions,
let’s dive in.
What is a model?
Humans try to understand the world around them by representing it in different ways.
Architects capture attributes of buildings through blueprints and three-dimensional, scaled-
down versions. Molecular biologists capture protein structure with three-dimensional
visualizations of the connections between amino acids. Statisticians and data scientists capture
the uncertainty and randomness of data-generating processes with mathematical functions that
express the shape and structure of the data itself. A model is our attempt to understand and
represent the nature of reality through a particular lens, be it architectural, biological, or
mathematical.
A model is an artificial construction where all extraneous detail has been removed or
abstracted. Attention must always be paid to these abstracted details after a model has been
analyzed to see what might have been overlooked. In the case of proteins, a model of the protein
backbone with side chains by itself is removed from the laws of quantum mechanics that govern
the behavior of the electrons, which ultimately dictate the structure and actions of proteins. In
the case of a statistical model, we may have mistakenly excluded key variables, included
irrelevant ones, or assumed a mathematical structure divorced from reality.
Statistical modelling
Before you get too involved with the data and start coding, it’s useful to draw a picture of what
you think the underlying process might be with your model. What comes first? What influences
what? What causes what? What’s a test of that? But different people think in different ways.
Some prefer to express these kinds of relationships in terms of math. The mathematical
expressions will be general enough that they have to include parameters, but the values of these
parameters are not yet known. In mathematical expressions, the convention is to use Greek
letters for parameters and Latin letters for data. So, for example, if you have two columns of
data, x and y, and you think there’s a linear relationship, you’d write down y = β0 +β1x. You
don’t know what β0 and β1 are in terms of actual numbers yet, so they’re the parameters.
Other people prefer pictures and will first draw a diagram of data flow, possibly with arrows,
showing how things affect other things or what happens over time. This gives them an abstract
picture of the relationships before choosing equations to express them.
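To tie the two views together, here is a minimal R sketch (with made-up parameter values) that treats the equation y = β0 + β1x as a data-generating process: pick the parameters, generate data, and the resulting scatterplot is what you would later try to recover the parameters from.

set.seed(42)
beta0 <- 2.0 # hypothetical intercept
beta1 <- 0.5 # hypothetical slope
x <- runif(100, 0, 10)
y <- beta0 + beta1 * x + rnorm(100, sd = 1) # linear relationship plus random noise
plot(x, y) # the points scatter around the line y = beta0 + beta1 * x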
Probability distributions
Probability distributions are the foundation of statistical models. When we get to linear
regression and Naive Bayes, you will see how this happens in practice. One can take multiple
semesters of courses on probability theory, and so it’s a tall challenge to condense it down for
you in a small section.
Back in the day, before computers, scientists observed real-world phenomena, took
measurements, and noticed that certain mathematical shapes kept reappearing. The classical
example is the height of humans, following a normal distribution—a bell-shaped curve, also
called a Gaussian distribution, named after Gauss. Other common shapes have been named
after their observers as well (e.g., the Poisson distribution and the Weibull distribution), while
other shapes such as Gamma distributions or exponential distributions are named after
associated mathematical objects.
In addition to denoting distributions of single random variables with functions of one variable,
we use multivariate functions called joint distributions to do the same thing for more than one
random variable. So in the case of two random variables, for example, we could denote our
distribution by a function p(x, y), and it would take values in the plane and give us nonnegative
values. In keeping with its interpretation as a probability, its (double) integral over the whole
plane would be 1.
We also have what is called a conditional distribution, p(x|y) , which is to be interpreted as the
density function of x given a particular value of y.
When we observe data points, i.e., (x1, y1), (x2, y2), . . ., (xn, yn ), we are observing realizations
of a pair of random variables. When we have an entire dataset with n rows and k columns, we
are observing n realizations of the joint distribution of those k random variables.
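As a small illustration (a sketch, not a derivation), the following R code simulates realizations (x1, y1), ..., (xn, yn) of a pair of dependent random variables and looks at an empirical slice of the conditional distribution p(y | x):

set.seed(7)
n <- 10000
x <- rnorm(n) # X ~ Normal(0, 1)
y <- 2 * x + rnorm(n) # Y depends on X, so p(x, y) is a genuine joint distribution
y_slice <- y[abs(x - 1) < 0.05] # observations with x close to 1
hist(y_slice, main = "Approximate p(y | x = 1)", xlab = "y")
mean(y_slice) # close to 2, the conditional mean of Y given X = 1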
Fitting a model
Fitting a model means that you estimate the parameters of the model using the observed data.
You are using your data as evidence to help approximate the real-world mathematical process
that generated the data. Fitting the model often involves optimization methods and algorithms,
such as maximum likelihood estimation, to help get the parameters.
Fitting the model is when you start actually coding: your code will read in the data, and you’ll
specify the functional form that you wrote down on the piece of paper. Then R or Python will
use built-in optimization methods to give you the most likely values of the parameters given
the data. As you gain sophistication, or if this is one of your areas of expertise, you’ll dig around
in the optimization methods yourself. Initially you should have an understanding that
optimization is taking place and how it works, but you don’t have to code this part yourself—
it underlies the R or Python functions.
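For example, here is a minimal R sketch (with simulated data and made-up parameter values) of exactly this workflow: write down a functional form, let lm() run the optimization, and read off the most likely parameter values.

set.seed(123)
x <- runif(200, 0, 10)
y <- 2.0 + 0.5 * x + rnorm(200, sd = 1) # hypothetical "true" process: beta0 = 2.0, beta1 = 0.5
fit <- lm(y ~ x) # least squares, which for this model is also the maximum likelihood fit
coef(fit) # fitted intercept and slope, close to 2.0 and 0.5
summary(fit) # standard errors and other diagnostics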
Overfitting
Throughout the book you will be cautioned repeatedly about overfitting, possibly to the point
you will have nightmares about it. Overfitting is the term used to mean that you used a dataset
to estimate the parameters of your model, but your model isn’t that good at capturing reality
beyond your sampled data. You might know this because you have tried to use it to predict
labels for another set of data that you didn’t use to fit the model, and it doesn’t do a good job,
as measured by an evaluation metric such as accuracy.
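A minimal R sketch of the idea (entirely made-up data): a flexible degree-10 polynomial can fit a small training set almost perfectly and still predict held-out data worse than a simple straight line.

set.seed(99)
x <- runif(30, 0, 10)
y <- 2 + 0.5 * x + rnorm(30, sd = 1)
train <- data.frame(x = x[1:20], y = y[1:20]) # data used to fit the models
test <- data.frame(x = x[21:30], y = y[21:30]) # held-out data the models never see
simple <- lm(y ~ x, data = train)
complex <- lm(y ~ poly(x, 10), data = train)
rmse <- function(model, d) sqrt(mean((d$y - predict(model, newdata = d))^2))
c(simple = rmse(simple, train), complex = rmse(complex, train)) # training error favors the complex model
c(simple = rmse(simple, test), complex = rmse(complex, test)) # test error typically favors the simple one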
Basics of R
Introduction
Features of R
As stated earlier, R is a programming language and software environment for statistical
analysis, graphics representation and reporting. The following are the important features of R
−
R is a well-developed, simple and effective programming language which includes
conditionals, loops, user defined recursive functions and input and output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display, either directly on the
computer screen or as printed output.
R - Environment Setup
In Windows:
Install R on Windows by downloading the installer from CRAN (https://cran.r-project.org) and running it.
Programming with R
R - Basic Syntax
To output text in R, use single or double quotes:
"Hello World!"
Output:5
R Print Output
Print
Unlike many other programming languages, you can output code in R without using a print
function:
Example
"Hello!"
However, R does have a print() function available if you want to use it.
Example
print("Hello!")
And there are times you must use the print() function to output values, for example inside a
for loop:
for (x in 1:5) {
  print(x)
}
R Comments
Comments
Comments can be used to explain R code, and to make it more readable. It can also be used to
prevent execution when testing alternative code.
Comments start with a #. When executing code, R will ignore anything that starts with #.
Example
# This is a comment
"Hello!"
Example
"Hello World!" # This is a comment
Multiline Comments
Unlike other programming languages, such as Java, there is no syntax in R for multiline
comments. However, we can just insert a # on each line to create a multiline comment:
Example
# This is a comment
# written in
# more than just one line
"Hello!"
R Variables
Creating Variables in R
R does not have a command for declaring a variable. A variable is created the moment you first
assign a value to it. To assign a value to a variable, use the <- sign. To output (or print) the
variable value, just type the variable name:
Example
name <- "John"
age <- 40
You can also assign values with the = operator, but <- is preferred in most cases because =
cannot be used for assignment in every context in R (inside a function call, for example, it
names an argument instead).
Compared to many other programming languages, you do not have to use a function to
print/output variables in R. You can just type the name of the variable:
Example
name <- "John"
name # just typing the variable name outputs "John"
Concatenate Elements
You can also concatenate, or join, two or more elements, by using the paste() function.
Example
text1 <- "R is"
text2 <- "awesome"
paste(text1, text2)
Example
num1 <- 5
num2 <- 10
num1 + num2
If you try to combine a string (text) and a number, R will give you an error:
Example
num <- 5
text <- "Some text"
num + text
Result:
Error in num + text : non-numeric argument to binary operator
Multiple Variables
R allows you to assign the same value to multiple variables in one line:
Example
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Orange"
Variable Names
A variable can have a short name (like x and y) or a more descriptive name (age, carname,
total_volume). Rules for R variables are:
A variable name must start with a letter or a period and can be a combination of letters,
digits, period (.) and underscore (_).
If it starts with a period (.), the period cannot be followed by a digit.
A variable name cannot start with a number or underscore (_)
Variable names are case-sensitive (age, Age and AGE are three different variables)
Reserved words cannot be used as variables (TRUE, FALSE, NULL, if...)
Variables can store data of different types, and different types can do different things.
In R, variables do not need to be declared with any particular type, and can even change type
after they have been set:
Example
my_var <- 30
my_var <- "raghul"
We can use the class() function to check the data type of a variable:
Example
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
R Numbers
Numbers
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example
x <- 10.5 # numeric
y <- 10L # integer
z <- 1i # complex
Numeric
A numeric data type is the most common type in R, and contains any number with or without
a decimal, like: 10.5, 55, 787:
Example
x <- 10.5
y <- 55
Integer
Integers are numeric data without decimals. This is used when you are certain that you will
never create a variable that should contain decimals. To create an integer variable, you must
use the letter L after the integer value:
Example
x <- 1000L
y <- 55L
Complex
A complex number is written with an "i" as the imaginary part:
Example
x <- 3+5i
y <- 5i
Type Conversion
You can convert from one type to another with the following functions:
as.numeric()
as.integer()
as.complex()
Example
x <- 1L # integer
y <- 2 # numeric
a <- as.numeric(x) # convert from integer to numeric
b <- as.integer(y) # convert from numeric to integer
A data set can often be viewed as a collection of data objects. Other names for a data object are
record, point, vector, pattern, event, case, sample, observation, or entity. In turn, data objects
are described by a number of attributes that capture the basic characteristics of an object, such
as the mass of a physical object or the time at which an event occurred. Other names for an
attribute are variable, characteristic, field, feature, or dimension.
Example (Student Information). Often, a data set is a file, in which the objects are records (or
rows) in the file and each field (or column) corresponds to an attribute. For example, Table 2.1
shows a data set that consists of student information. Each row corresponds to a student and
each column is an attribute that describes some aspect of a student, such as grade point average
(GPA) or identification number (ID).
Although record-based data sets are common, either in flat files or relational database systems,
there are other important types of data sets and systems for storing data.
What is an attribute?
Definition 1. An attribute is a property or characteristic of an object that may vary, either from
one object to another or from one time to another.
For example, eye color varies from person to person, while the temperature of an object varies
over time. Note that eye color is a symbolic attribute with a small number of possible values
{brown, black, blue, green, hazel, etc.}, while temperature is a numerical attribute with a
potentially unlimited number of values.
At the most basic level, attributes are not about numbers or symbols. However, to discuss and
more precisely analyze the characteristics of objects, we assign numbers or symbols to them.
To do this in a well-defined way, we need a measurement scale.
In other words, the values used to represent an attribute may have properties that are not
properties of the attribute itself, and vice versa. This is illustrated with two examples.
Example 1 (Employee Age and ID Number). Two attributes that might be associated with an
employee are ID and age (in years). Both of these attributes can be represented as integers.
However, while it is reasonable to talk about the average age of an employee, it makes no sense
to talk about the average employee ID.
Indeed, the only aspect of employees that we want to capture with the ID attribute is that they
are distinct. Consequently, the only valid operation for employee IDs is to test whether they
are equal. There is no hint of this limitation, however, when integers are used to represent the
employee ID attribute. For the age attribute, the properties of the integers used to represent age
are very much the properties of the attribute. Even so, the correspondence is not complete since,
for example, ages have a maximum, while integers do not.
Consider below Figure, which shows some objects-line segments and how the length attribute
of these objects can be mapped to numbers in two different ways. Each successive line segment,
going from the top to the bottom, is formed by appending the topmost line segment to itself.
Thus, the second line segment from the top is formed by appending the topmost line segment
to itself twice, the third line segment from the top is formed by appending the topmost line
segment to itself three times, and so forth. In a very real (physical) sense, all the line segments
are multiples of the first. This fact is captured by the measurements on the right-hand side of
the figure, but not by those on the left hand-side.
More specifically, the measurement scale on the left-hand side captures only the ordering of
the length attribute, while the scale on the right-hand side captures both the ordering and
additivity properties. Thus, an attribute can be measured in a way that does not capture all the
properties of the attribute. The type of an attribute should tell us what properties of the attribute
are reflected in the values used to measure it. Knowing the type of an attribute is important
because it tells us which properties of the measured values are consistent with the underlying
properties of the attribute, and therefore, it allows us to avoid foolish actions, such as
computing the average employee ID.
Note that it is common to refer to the type of an attribute as the type of a measurement scale.
Given these properties, we can define four types of attributes: nominal, ordinal, interval, and
ratio. Table 2.2 gives the definitions of these types, along with information about the statistical
operations that are valid for each type. Each attribute type possesses all of the properties and
operations of the attribute types above it. Consequently, any property or operation that is valid
for nominal, ordinal, and interval attributes is also valid for ratio attributes. In other words, the
definition of the attribute types is cumulative. However, this does not mean that the operations
appropriate for one attribute type are appropriate for the attribute types above it.
Nominal and ordinal attributes are collectively referred to as categorical or qualitative
attributes. As the name suggests, qualitative attributes, such as employee ID, lack most of the
properties of numbers. Even if they are represented by numbers, i.e., integers, they should be
treated more like symbols. The remaining two types of attributes, interval and ratio, are
collectively referred to as quantitative or numeric attributes. Quantitative attributes are
represented by numbers and have most of the properties of numbers. Note that quantitative
attributes can be integer-valued or continuous. The types of attributes can also be described in
terms of transformations that do not change the meaning of an attribute. Indeed, S. S.
Stevens, the psychologist who originally defined the types of attributes shown in Table 2.2,
defined them in terms of these permissible transformations.
For example, the meaning of a length attribute is unchanged if it is measured in meters instead
of feet.
The statistical operations that make sense for a particular type of attribute are those that will
yield the same results when the attribute is transformed using a transformation that preserves
the attribute's meaning. To illustrate, the average length of a set of objects is different when
measured in meters rather than in feet, but both averages represent the same length. Table 2.3
shows the permissible (meaning-preserving) transformations for the four attribute types of
Table 2.2.
Asymmetric Attributes
For asymmetric attributes, only presence (a non-zero attribute value) is regarded as important.
Consider a data set where each object is a student and each attribute records whether or not a
student took a particular course at a university. For a specific student, an attribute has a value
of 1 if the student took the course associated with that attribute and a value of 0 otherwise.
Because students take only a small fraction of all available courses, most of the values in such
a data set would be 0. Therefore, it is more meaningful and more efficient to focus on the non-
zero values.
To illustrate, if students are compared on the basis of the courses they don't take, then most
students would seem very similar, at least if the number of courses is large. Binary attributes
where only non-zero values are important are called asymmetric binary attributes. This type of
attribute is particularly important for association analysis. It is also possible to have discrete or
continuous asymmetric features. For instance, if the number of credits associated with each
course is recorded, then the resulting data set will consist of asymmetric discrete or continuous
attributes.
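As a small R illustration (with a made-up enrollment matrix), comparing two students by simple matching counts the many shared zeros and makes them look alike, while a similarity that uses only non-zero values does not:

# Each row is a student, each column a course; 1 = took the course, 0 = did not
enroll <- rbind(s1 = c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0),
                s2 = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 0))
matches <- sum(enroll["s1", ] == enroll["s2", ]) # counts the shared 0s too
both <- sum(enroll["s1", ] == 1 & enroll["s2", ] == 1)
either <- sum(enroll["s1", ] == 1 | enroll["s2", ] == 1)
matches / ncol(enroll) # simple matching similarity: 0.8, the students look very similar
both / either # Jaccard-style similarity over non-zero values: about 0.33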
Binary Attributes
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present. Binary attributes are
referred to as Boolean if the two states correspond to true and false.
Example 1 Binary attributes. Given the attribute smoker describing a patient object, 1
indicates that the patient smokes, while 0 indicates that the patient does not. Similarly, suppose
the patient undergoes a medical test that has two possible outcomes. The attribute medical test
is binary, where a value of 1 means the result of the test for the patient is positive, while 0
means the result is negative.
A binary attribute is symmetric if both of its states are equally valuable and carry the same
weight; that is, there is no preference on which outcome should be coded as 0 or 1. One such
example could be the attribute gender having the states male and female.
A binary attribute is asymmetric if the outcomes of the states are not equally important, such
as the positive and negative outcomes of a medical test for HIV. By convention, we code the
most important outcome, which is usually the rarest one, by 1 (e.g., HIV positive) and the other
by 0 (e.g., HIV negative).
Nominal Attributes
Nominal means “relating to names.” The values of a nominal attribute are symbols or names
of things. Each value represents some kind of category, code, or state, and so nominal attributes
are also referred to as categorical. The values do not have any meaningful order. In computer
science, the values are also known as enumerations.
Example 1 Nominal attributes. Suppose that hair color and marital status are two attributes
describing person objects. In our application, possible values for hair color are black, brown,
blond, red, auburn, gray, and white. The attribute marital status can take on the values single,
married, divorced, and widowed. Both hair color and marital status are nominal attributes.
Another example of a nominal attribute is occupation, with the values teacher, dentist,
programmer, farmer, and so on.
Although we said that the values of a nominal attribute are symbols or “names of things,” it is
possible to represent such symbols or “names” with numbers. With hair color, for instance, we
can assign a code of 0 for black, 1 for brown, and so on.
Another example is customer ID, with possible values that are all numeric. However, in such
cases, the numbers are not intended to be used quantitatively. That is, mathematical operations
on values of nominal attributes are not meaningful. It makes no sense to subtract one customer
ID number from another, unlike, say, subtracting an age value from another (where age is a
numeric attribute). Even though a nominal attribute may have integers as values, it is not
considered a numeric attribute because the integers are not meant to be used quantitatively.
Because nominal attribute values do not have any meaningful order about them and are not
quantitative, it makes no sense to find the mean (average) value or median (middle) value for
such an attribute, given a set of objects. One thing that is of interest, however, is the attribute’s
most commonly occurring value. This value, known as the mode, is one of the measures of
central tendency.
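In R, a nominal attribute is usually stored as a factor, and its mode can be read off a frequency table. A minimal sketch with made-up hair-color data:

hair <- factor(c("black", "brown", "brown", "blond", "black", "brown"))
table(hair) # frequency of each category
names(which.max(table(hair))) # the mode: "brown"
mean(hair) # meaningless for a nominal attribute; R returns NA with a warning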
Ordinal Attributes
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Example 1 Ordinal attributes. Suppose that drink size corresponds to the size of drinks
available at a fast-food restaurant. This ordinal attribute has three possible values: small,
medium, and large. The values have a meaningful sequence (which corresponds to increasing
drink size); however, we cannot tell from the values how much bigger, say, a large is than a
medium. Other examples of ordinal attributes include grade and professional rank. Professional
ranks can be enumerated in a sequential order: for example, assistant, associate, and full for
professors, and private, private first class, specialist, corporal, and sergeant for army ranks.
Ordinal attributes are useful for registering subjective assessments of qualities that cannot be
measured objectively; thus ordinal attributes are often used in surveys for ratings. In one
survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied,
2: neutral, 3: satisfied, and 4: very satisfied.
Ordinal attributes may also be obtained from the discretization of numeric quantities by
splitting the value range into a finite number of ordered categories as described in Chapter 3
on data reduction. The central tendency of an ordinal attribute can be represented by its mode
and its median (the middle value in an ordered sequence), but the mean cannot be defined.
Note that nominal, binary, and ordinal attributes are qualitative. That is, they describe a feature
of an object without giving an actual size or quantity. The values of such qualitative attributes
are typically words representing categories. If integers are used, they represent computer codes
for the categories, as opposed to measurable quantities (e.g., 0 for small drink size, 1 for
medium, and 2 for large).
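In R, an ordinal attribute can be represented as an ordered factor, for which the mode and (with a little care) the median are meaningful while the mean is not. A small sketch using the customer-satisfaction categories above:

satisfaction <- factor(c(3, 4, 2, 3, 3, 1, 4, 0, 3), levels = 0:4, ordered = TRUE,
                       labels = c("very dissatisfied", "somewhat dissatisfied",
                                  "neutral", "satisfied", "very satisfied"))
names(which.max(table(satisfaction))) # mode: "satisfied"
levels(satisfaction)[median(as.integer(satisfaction))] # median category (n is odd here)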
Numeric Attributes
Interval-Scaled Attributes
Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-
scaled attributes have order and can be positive, 0, or negative. Thus, in addition to providing
a ranking of values, such attributes allow us to compare and quantify the difference between
values.
Temperatures in Celsius and Fahrenheit do not have a true zero-point; that is, neither 0°C nor
0°F indicates “no temperature.” (On the Celsius scale, for example, the unit of measurement
is 1/100 of the difference between the melting temperature and the boiling temperature of water
at atmospheric pressure.) Although we can compute the difference between temperature values,
we cannot talk of one temperature value as being a multiple of another. Without a true zero, we
cannot say, for instance, that 10°C is twice as warm as 5°C. That is, we cannot speak of the
values in terms of ratios. Similarly, there is no true zero-point for calendar dates. (The year 0
does not correspond to the beginning of time.) This brings us to ratio-scaled attributes, for
which a true zero-point exists.
Because interval-scaled attributes are numeric, we can compute their mean value, in addition
to the median and mode measures of central tendency.
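A quick R check (just the standard Celsius-to-Fahrenheit conversion) shows why differences are meaningful on an interval scale but ratios are not:

celsius <- c(5, 10)
fahrenheit <- celsius * 9 / 5 + 32 # the same two temperatures on another interval scale
diff(celsius); diff(fahrenheit) # differences 5 and 9, related by the fixed scale factor 9/5
celsius[2] / celsius[1] # a "ratio" of 2 ...
fahrenheit[2] / fahrenheit[1] # ... becomes about 1.22, so the ratio carries no physical meaning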
Ratio-Scaled Attributes
Example: Ratio-scaled attributes. Unlike temperatures in Celsius and Fahrenheit, the Kelvin
(K) temperature scale has what is considered a true zero-point. It is the
point at which the particles that comprise matter have zero kinetic energy. Other examples of
ratio-scaled attributes include count attributes such as years of experience (e.g., the objects are
employees) and number of words (e.g., the objects are documents).
Additional examples include attributes to measure weight, height, latitude and longitude
coordinates (e.g., when clustering houses), and monetary quantities (e.g., you are 100 times
richer with $100 than with $1).
The attributes are organized as nominal, binary, ordinal, and numeric types. The types are not
mutually exclusive. Classification algorithms developed from the field of machine learning
often talk of attributes as being either discrete or continuous. Each type may be processed
differently.
A discrete attribute has a finite or countably infinite set of values, which may or may not be
represented as integers. The attributes hair color, smoker, medical test, and drink size each
have a finite number of values, and so are discrete. Note that discrete attributes may have
numeric values, such as 0 and 1 for binary attributes, or the values 0 to 110 for the attribute
age. An attribute is countably infinite if the set of possible values is infinite but the values can
be put in a one-to-one correspondence with natural numbers. For example, the attribute
customer ID is countably infinite. The number of customers can grow to infinity, but in reality,
the actual set of values is countable (where the values can be put in one-to-one correspondence
with the set of integers). Zip codes are another example.
If an attribute is not discrete, it is continuous. The terms numeric attribute and continuous
attribute are often used interchangeably in the literature. (This can be confusing because, in the
classic sense, continuous values are real numbers, whereas numeric values can be either
integers or real numbers.) In practice, real values are represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Basic Statistical Descriptions of Data
For data pre-processing to be successful, it is essential to have an overall picture of your data.
Basic statistical descriptions can be used to identify properties of the data and highlight which
data values should be treated as noise or outliers.
This section discusses three areas of basic statistical descriptions. We start with measures of
central tendency, which measure the location of the middle or center of a data distribution.
Intuitively speaking, given an attribute, where do most of its values fall? In particular, we
discuss the mean, median, mode, and midrange. In addition to assessing the central tendency
of our data set, we also would like to have an idea of the dispersion of the data. That is, how
are the data spread out? The most common data dispersion measures are the range, quartiles,
and interquartile range; the five-number summary and boxplots; and the variance and standard
deviation of the data. These measures are useful for identifying outliers.
Finally, we can use many graphic displays of basic statistical descriptions to visually inspect
our data. Most statistical or graphical data presentation software packages include bar charts,
pie charts, and line graphs. Other popular displays of data summaries and distributions include
quantile plots, quantile–quantile plots, histograms, and scatter plots.
In this section, we look at various ways to measure the central tendency of data. Suppose that
we have some attribute X, like salary, which has been recorded for a set of objects. Let x1,
x2,… ,xN be the set of N observed values or observations for X. Here, these values may also
be referred to as the data set (for X). If we were to plot the observations for salary, where would
most of the values fall? This gives us an idea of the central tendency of the data. Measures of
central tendency include the mean, median, mode, and midrange.
The most common and effective numeric measure of the “center” of a set of data is the
(arithmetic) mean. Let x1,x2, … ,xN be a set of N values or observations, such as for some
numeric attribute X, like salary. The mean of this set of values is

x̄ = (x1 + x2 + ... + xN) / N

This corresponds to the built-in aggregate function, average (avg() in SQL), provided in
relational database systems.
Example 1 Mean. Suppose we have the following values for salary (in thousands of dollars),
shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Using the mean
equation, we have

x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696 / 12 = 58

Thus, the mean salary is $58,000.
Sometimes, each value xi in a set may be associated with a weight wi for i = 1, ..., N. The
weights reflect the significance, importance, or occurrence frequency attached to their
respective values. In this case, we can compute the weighted arithmetic mean

x̄ = (w1 x1 + w2 x2 + ... + wN xN) / (w1 + w2 + ... + wN)
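In R, both the ordinary and the weighted mean are one-liners; a short sketch using the Example 1 salaries and some made-up weights:

salary <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110) # in thousands of dollars
mean(salary) # arithmetic mean: 58
w <- c(rep(1, 11), 3) # hypothetical weights giving the last value extra importance
weighted.mean(salary, w) # weighted mean, pulled upward by the heavily weighted 110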
Although the mean is the single most useful quantity for describing a data set, it is not always
the best way of measuring the center of the data. A major problem with the mean is its
sensitivity to extreme (e.g., outlier) values. Even a small number of extreme values can corrupt
the mean. For example, the mean salary at a company may be substantially pushed up by that
of a few highly paid managers.
Similarly, the mean score of a class in an exam could be pulled down quite a bit by a few very
low scores. To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at the high and
low extremes. For example, we can sort the values observed for salary and remove the top and
bottom 2% before computing the mean. We should avoid trimming too large a portion (such
as 20%) at both ends, as this can result in the loss of valuable information. For skewed
(asymmetric) data, a better measure of the center of data is the median, which is the middle
value in a set of ordered data values. It is the value that separates the higher half of a data set
from the lower half.
In probability and statistics, the median generally applies to numeric data; however, we may
extend the concept to ordinal data. Suppose that a given data set of N values for an attribute X
is sorted in increasing order. If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any value in
between. If X is a numeric attribute in this case, by convention, the median is taken as the
average of the two middlemost values.
Example 2: Median. Let’s find the median of the data from Example 1. The data are already
sorted in increasing order. There is an even number of observations (i.e., 12); therefore, the
median is not unique. It can be any value within the two middlemost values of 52 and 56 (that
is, within the sixth and seventh values in the list). By convention, we assign the average of the
two middlemost values as the median; that is, (52 + 56) / 2 = 54. Thus, the median is $54,000.
Suppose that we had only the first 11 values in the list. Given an odd number of values, the
median is the middlemost value. This is the sixth value in this list, which has a value of $52,000.
The median is expensive to compute when we have a large number of observations. For
numeric attributes, however, we can easily approximate the value. Assume that data are
grouped in intervals according to their xi data values and that the frequency (i.e., number of
data values) of each interval is known. For example, employees may be grouped according to
their annual salary in intervals such as $10–20,000, $20–30,000, and so on. Let the interval that
contains the median frequency be the median interval.
We can approximate the median of the entire data set (e.g., the median salary) by interpolation
using the formula

median ≈ L1 + ((N/2 - sum_freq_lower) / freq_median) × width

where L1 is the lower boundary of the median interval, N is the number of values in the entire
data set, sum_freq_lower is the sum of the frequencies of all of the intervals that are lower than
the median interval, freq_median is the frequency of the median interval, and width is the width
of the median interval.
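A sketch of this approximation in R, using hypothetical grouped salary data (the interval boundaries and frequencies are made up):

lower <- c(10, 20, 30, 40, 50) # lower boundary of each salary interval, in $1000s
freq <- c(200, 450, 300, 1500, 700) # number of employees in each interval
width <- 10
N <- sum(freq)
med <- which(cumsum(freq) >= N / 2)[1] # index of the median interval
below <- if (med > 1) sum(freq[1:(med - 1)]) else 0
lower[med] + (N / 2 - below) / freq[med] * width # approximate median salary (about 44.2)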
The mode is another measure of central tendency. The mode for a set of data is the value that
occurs most frequently in the set. Therefore, it can be determined for qualitative and
quantitative attributes. It is possible for the greatest frequency to correspond to several different
values, which results in more than one mode. Data sets with one, two, or three modes are
respectively called unimodal, bimodal, and trimodal. In general, a data set with two or more
modes is multimodal. At the other extreme, if each data value occurs only once, then there is
no mode.
Example 3: Mode. The data from Example 1 are bimodal. The two modes are $52,000 and
$70,000. For unimodal numeric data that are moderately skewed (asymmetrical), we have the
following empirical relation:

mean - mode ≈ 3 × (mean - median)

This implies that the mode for unimodal frequency curves that are moderately skewed can
easily be approximated if the mean and median values are known. The midrange can also be
used to assess the central tendency of a numeric data set. It is the average of the largest and
smallest values in the set. This measure is easy to compute using the SQL aggregate functions,
max() and min().
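All of these measures are easy to compute in R; a short sketch using the Example 1 salary data:

salary <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
mean(salary) # 58
mean(salary, trim = 0.1) # trimmed mean: drops 10% of the values at each end first
median(salary) # 54
tab <- table(salary)
as.numeric(names(tab)[tab == max(tab)]) # modes: 52 and 70 (the data are bimodal)
(max(salary) + min(salary)) / 2 # midrange: 70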
In a unimodal frequency curve with perfect symmetric data distribution, the mean, median,
and mode are all at the same center value, as shown in Figure 2.1(a). Data in most real
applications are not symmetric. They may instead be either positively skewed, where the mode
occurs at a value that is smaller than the median (Figure 2.1b), or negatively skewed, where
the mode occurs at a value greater than the median (Figure 2.1c).
Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation, and
Interquartile Range
We now look at measures to assess the dispersion or spread of numeric data. The measures
include range, quantiles, quartiles, percentiles, and the interquartile range. The five-number
summary, which can be displayed as a boxplot, is useful in identifying outliers. Variance and
standard deviation also indicate the spread of a data distribution.
To start off, let’s study the range, quantiles, quartiles, percentiles, and the interquartile range
as measures of data dispersion. Let x1,x2, : : : ,xN be a set of observations for some numeric
attribute, X. The range of the set is the difference between the largest (max()) and smallest
(min()) values. Suppose that the data for attribute X are sorted in increasing numeric order.
Imagine that we can pick certain data points so as to split the data distribution into equal-size
consecutive sets, as in below Figure. These data points are called quantiles.
Quantiles are points taken at regular intervals of a data distribution, dividing it into essentially
equal size consecutive sets. (We say “essentially” because there may not be data values of X
that divide the data into exactly equal-sized subsets. For readability, we will refer to them as
equal.) The kth q-quantile for a given data distribution is the value x such that at most k/q of
the data values are less than x and at most (q - k)/q of the data values are more than x, where k
is an integer such that 0 < k < q. There are q - 1 q-quantiles. The 2-quantile is the data point
dividing the lower and upper halves of the data distribution. It corresponds to the median. The
4-quantiles are the three data points that split the data distribution into four equal parts; each
part represents one-fourth of the data distribution. They are more commonly referred to as
quartiles. The 100-quantiles are more commonly referred to as percentiles; they divide the
data distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles
are the most widely used forms of quantiles.
The quartiles give an indication of a distribution’s center, spread, and shape. The first quartile,
denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third quartile,
denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or highest 25%) of the data.
The second quartile is the 50th percentile. As the median, it gives the center of the data
distribution.
The distance between the first and third quartiles is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the interquartile range
(IQR) and is defined as IQR = Q3 - Q1.
Example 1: Interquartile range. The quartiles are the three values that split the sorted data set
into four equal parts. The data contain 12 observations, already sorted in increasing order. Thus,
the quartiles for this data are the third, sixth, and ninth values, respectively, in the sorted list.
Therefore, Q1 = $47,000 and Q3 = $63,000. Thus, the interquartile range is IQR = 63,000 - 47,000 =
$16,000. (Note that the sixth value, $52,000, is one of the two middle values; because the number
of observations is even, the median was reported earlier as the average of the two middle values, $54,000.)
Because Q1, the median, and Q3 together contain no information about the endpoints (e.g.,
tails) of the data, a fuller summary of the shape of a distribution can be obtained by providing
the lowest and highest data values as well. This is known as the five-number summary. The
five-number summary of a distribution consists of the median (Q2), the quartiles Q1 and Q3,
and the smallest and largest individual observations, written in the order of Minimum, Q1,
Median, Q3, Maximum.
Boxplots are a popular way of visualizing a distribution. A boxplot incorporates the five-
number summary as follows:
Typically, the ends of the box are at the quartiles so that the box length is the
interquartile range.
The median is marked by a line within the box.
Two lines (called whiskers) outside the box extend to the smallest (Minimum) and
largest (Maximum) observations.
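A short R sketch of the five-number summary, IQR, and boxplot, using the same assumed salary vector as above (note that quantile() interpolates by default, so its quartiles may differ slightly from the "third/sixth/ninth value" convention used in the text):
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)   # assumed salaries in $1000s
fivenum(x)        # Minimum, Q1, Median, Q3, Maximum
quantile(x)       # 0%, 25%, 50%, 75%, 100% quantiles
IQR(x)            # Q3 - Q1
boxplot(x, main = "Boxplot of x")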
Variance and standard deviation are measures of data dispersion. They indicate how spread out
a data distribution is. A low standard deviation means that the data observations tend to be very
close to the mean, while a high standard deviation indicates that the data are spread out over a
large range of values. The variance of N observations, x1,x2, … ,xN, for a numeric attribute X
is
σ² = (1/N) Σ (xi - x̄)²,  for i = 1, ..., N,
where x̄ is the mean value of the observations, as defined in Eq. (2.1). The standard deviation,
σ, of the observations is the square root of the variance, σ².
Example 3: Variance and standard deviation. From the previous example, we found the mean
x̄ = $58,000 using the mean equation. The variance and standard deviation of the data from that
example follow directly from the formula above.
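A minimal sketch of the computation in R, using the same assumed salary vector as before (values consistent with the quoted mean of $58,000 but not reproduced from the original example):
x <- c(30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)   # assumed salaries in $1000s
N <- length(x)
pop_var <- sum((x - mean(x))^2) / N   # variance with the 1/N definition above
pop_sd  <- sqrt(pop_var)
print(pop_var); print(pop_sd)
# Note: R's var() and sd() use the sample definition (division by N - 1) instead.
var(x); sd(x)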
The basic properties of the standard deviation, σ, as a measure of spread are as follows:
σ measures spread about the mean and should be considered only when the mean is chosen as
the measure of center. σ = 0 only when there is no spread, that is, when all observations have
the same value. Otherwise, σ > 0.
Importantly, an observation is unlikely to be more than several standard deviations away from
the mean. Mathematically, using Chebyshev's inequality, it can be shown that at least
(1 - 1/k²) × 100% of the observations are no more than k standard deviations from the mean.
Therefore, the standard deviation is a good indicator of the spread of a data set. The
computation of the variance and standard deviation is scalable in large databases.
In this section, we study graphic displays of basic statistical descriptions. These include
quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are helpful
for the visual inspection of data, which is useful for data pre-processing. The first three of these
show univariate distributions (i.e., data for one attribute), while scatter plots show bivariate
distributions (i.e., involving two attributes).
Quantile Plot
In this and the following subsections, we cover common graphic displays of data distributions.
A quantile plot is a simple and effective way to have a first look at a univariate data
distribution. First, it displays all of the data for the given attribute (allowing the user to assess
both the overall behavior and unusual occurrences). Second, it plots quantile information. Let
xi , for i = 1 to N, be the data sorted in increasing order so that x1 is the smallest observation
and xN is the largest for some ordinal or numeric attribute X. Each observation, xi , is paired
with a percentage, fi, which indicates that approximately fi × 100% of the data are below the
value xi. We say “approximately” because there may not be a value with exactly a fraction fi
of the data below xi. Note that the 0.25 quantile corresponds to quartile Q1, the 0.50
quantile is the median, and the 0.75 quantile is Q3.
These numbers increase in equal steps of 1/N, ranging from 1/(2N) (which is slightly above
0) to 1 - 1/(2N) (which is slightly below 1); that is, fi = (i - 0.5)/N. On a quantile plot, xi is graphed against fi. This allows
us to compare different distributions based on their quantiles. For example, given the quantile
plots of sales data for two different time periods, we can compare their Q1, median, Q3, and
other fi values at a glance.
Example 2.13 Quantile plot. Figure 2.4 shows a quantile plot for the unit price data of Table
2.1.
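The construction is easy to reproduce in base R; the unit-price values below are illustrative, not the values of Table 2.1:
x <- sort(c(40, 43, 47, 74, 75, 78, 115, 117, 120))   # illustrative unit prices, sorted
N <- length(x)
f <- (seq_len(N) - 0.5) / N        # f-values, from 1/(2N) up to 1 - 1/(2N)
plot(f, x, xlab = "f-value", ylab = "Unit price ($)", main = "Quantile plot")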
Quantile–Quantile Plot
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another. It is a powerful visualization tool in that it
allows the user to view whether there is a shift in going from one distribution to another.
Suppose that we have two sets of observations for the attribute or variable unit price, taken
from two different branch locations. Let x1, … ,xN be the data from the first branch, and y1,
…. , yM be the data from the second, where each data set is sorted in increasing order. If M =
N (i.e., the number of points in each set is the same), then we simply plot yi against xi , where
yi and xi are both (i - 0.5)/N quantiles of their respective data sets. If M < N (i.e., the second
branch has fewer observations than the first), there can be only M points on the q-q plot. Here,
yi is the (i - 0.5)/M quantile of the y data, which is plotted against the (i - 0.5)/M quantile of
the x data. This computation typically involves interpolation.
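Base R's qqplot() performs this construction (including the interpolation) directly. A minimal sketch with two illustrative samples (not the AllElectronics data):
branch1 <- c(40, 43, 47, 55, 60, 62, 74, 75, 78, 115, 117, 120)   # illustrative unit prices
branch2 <- c(42, 48, 52, 60, 64, 68, 80, 85, 90, 120, 125, 132)
qqplot(branch1, branch2,
       xlab = "Branch 1 unit price ($)", ylab = "Branch 2 unit price ($)")
abline(0, 1)   # reference line: identical distributions would fall on this line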
Example 1: Quantile–quantile plot. Below figure shows a quantile–quantile plot for unit price
data of items sold at two branches of AllElectronics during a given time period. Each point
corresponds to the same quantile for each data set and shows the unit price of items sold at
branch 1 versus branch 2 for that quantile. (To aid in comparison, the straight line represents
the case where, for each given quantile, the unit price at each branch is the same.
The darker points correspond to the data for Q1, the median, and Q3, respectively.) We see,
for example, that at Q1, the unit price of items sold at branch 1 was slightly less than that at
branch 2. In other words, 25% of items sold at branch 1 were less than or equal to $60, while
25% of items sold at branch 2 were less than or equal to $64. At the 50th percentile (marked
by the median, which is also Q2), we see that 50% of items sold at branch 1 were less than $78,
while 50% of items at branch 2 were less than $85.
In general, we note that there is a shift in the distribution of branch 1 with respect to branch 2
in that the unit prices of items sold at branch 1 tend to be lower than those at branch 2.
Histograms
Histograms (or frequency histograms) are at least a century old and are widely used. “Histos”
means pole or mast, and “gram” means chart, so a histogram is a chart of poles. Plotting
histograms is a graphical method for summarizing the distribution of a given attribute, X. If X
is nominal, such as automobile model or item type, then a pole or vertical bar is drawn for each
known value of X. The height of the bar indicates the frequency (i.e., count) of that X value.
The resulting graph is more commonly known as a bar chart.
If X is numeric, the term histogram is preferred. The range of values for X is partitioned into
disjoint consecutive subranges. The subranges, referred to as buckets or bins, are disjoint
subsets of the data distribution for X. The range of a bucket is known as the width. Typically,
the buckets are of equal width. For example, a price attribute with a value range of $1 to $200
(rounded up to the nearest dollar) can be partitioned into subranges 1 to 20, 21 to 40, 41 to 60,
and so on. For each subrange, a bar is drawn with a height that represents the total count of
items observed within the subrange. Histograms and partitioning rules are further discussed in
Chapter 3 on data reduction.
Example 1: Histogram. Below figure shows a histogram for the data set of Table 2.1, where
buckets (or bins) are defined by equal-width ranges representing $20 increments and the
frequency is the count of items sold.
Although histograms are widely used, they may not be as effective as the quantile plot, q-q
plot, and boxplot methods in comparing groups of univariate observations.
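An equal-width histogram of this kind can be drawn with hist(); the price values below are illustrative stand-ins for Table 2.1:
price <- c(42, 44, 49, 58, 63, 71, 74, 79, 92, 98, 115, 117, 121, 133, 160)  # illustrative
hist(price,
     breaks = seq(0, 200, by = 20),     # equal-width buckets of $20
     xlab = "Unit price ($)", ylab = "Count of items sold",
     main = "Histogram of unit price")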
A scatter plot is one of the most effective graphical methods for determining if there appears
to be a relationship, pattern, or trend between two numeric attributes. To construct a scatter
plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as
points in the plane. Below figure shows a scatter plot for the set of data in Table 2.1.
The scatter plot is a useful method for providing a first look at bivariate data to see clusters of
points and outliers, or to explore the possibility of correlation relationships. Two attributes, X,
and Y, are correlated if one attribute implies the other. Correlations can be positive, negative,
or null (uncorrelated). Below figure shows examples of positive and negative correlations
between two attributes.
If the plotted points pattern slopes from lower left to upper right, this means that the values of
X increase as the values of Y increase, suggesting a positive correlation. If the pattern of plotted
points slopes from upper left to lower right, the values of X increase as the values of Y decrease,
suggesting a negative correlation. A line of best fit can be drawn to study the correlation
between the variables. The above figure shows three cases for which there is no correlation
relationship between the two attributes in each of the given data sets.
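A short sketch of a scatter plot with a line of best fit in R, using simulated (assumed) data rather than Table 2.1; cor() returns the Pearson correlation coefficient:
set.seed(1)
x <- rnorm(100)                      # illustrative attribute X
y <- 0.8 * x + rnorm(100, sd = 0.5)  # attribute Y, positively correlated with X
plot(x, y, main = "Scatter plot of X versus Y")
abline(lm(y ~ x))                    # line of best fit
cor(x, y)                            # Pearson correlation coefficient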
In conclusion, basic data descriptions (e.g., measures of central tendency and measures of
dispersion) and graphic statistical displays (e.g., quantile plots, histograms, and scatter plots)
provide valuable insight into the overall behavior of your data. By helping to identify noise and
outliers, they are especially useful for data cleaning.
R - Vectors
Vectors in R are the same as the arrays in C language which are used to hold multiple data
values of the same type. One major key point is that in R the indexing of the vector will start
from ‘1’ and not from ‘0’. We can create numeric vectors and character vectors as well.
Vector Creation
Single Element Vector
Even when you write just one value in R, it becomes a vector of length 1 and belongs to one of
the above vector types.
print("abc");
print(12.5)
print(63L)
print(TRUE)
print(2+3i)
print(charToRaw('hello'))
Multiple Elements Vector
Using colon operator with numeric data
v <- 5:13
print(v)
v <- 6.6:12.6
print(v)
# If the final element specified does not belong to the sequence then it is discarded.
v <- 3.8:11.4
print(v)
# Two numeric vectors of equal length (illustrative values)
v1 <- c(3, 8, 4, 5, 0, 11)
v2 <- c(4, 11, 0, 8, 1, 2)
# Vector addition.
add.result <- v1+v2
print(add.result)
# Vector subtraction.
sub.result <- v1-v2
print(sub.result)
# Vector multiplication.
multi.result <- v1*v2
print(multi.result)
# Vector division.
divi.result <- v1/v2
print(divi.result)
Types of vectors
Vectors are of different types which are used in R. Following are some of the types of
vectors:
Numeric vectors
Numeric vectors are those which contain numeric values such as integer, float, etc. For example (illustrative):
v1 <- c(4, 5.5, 6)
v2 <- c(1L, 2L, 3L)
typeof(v1)
typeof(v2)
Output:
[1] "double"
[1] "integer"
Character vectors
Character vectors contain alphanumeric values and special characters. For example (illustrative):
v1 <- c("abc", "R 3.1", "!")
typeof(v1)
Output:
[1] "character"
Logical vectors
Logical vectors contain boolean values such as TRUE, FALSE and NA for Null values. For example (illustrative):
v1 <- c(TRUE, FALSE, NA)
typeof(v1)
Output:
[1] "logical"
Modifying a vector
Modification of a Vector is the process of applying some operation on an individual element
of a vector to change its value in the vector. There are different ways through which we can
modify a vector:
X <- c(2, 7, 9, 7, 8, 2)
# Modify a specific element with the subscript operator
X[3] <- 1
X[2] <- 9
cat('subscript operator', X, '\n')
# Modify using logical indexing
X[X > 5] <- 0
cat('Logical indexing', X, '\n')
# Modify by specifying the positions to keep with the combine function
X <- X[c(3, 2, 1)]
cat('combine() function', X)
Output
subscript operator 2 9 1 7 8 2
Logical indexing 2 0 1 0 0 2
combine() function 1 0 2
Deleting a vector
Deletion of a Vector is the process of deleting all of the elements of the vector. This can be
done by assigning it a NULL value, for example (illustrative):
A <- c(8, 10, 2, 5)
A <- NULL
cat('Output vector', class(A))
Output:
Output vector NULL
Sorting a vector
# Creation of a vector (values chosen to match the output below)
X <- c(8, 2, 7, 1, 11, 2)
A <- sort(X)                                  # ascending order (the default)
cat('ascending order', A, '\n')
cat('descending order', sort(X, decreasing = TRUE))
Output:
ascending order 1 2 2 7 8 11
descending order 11 8 7 2 2 1
Named vectors
A named vector can be created by giving each element a name in the combine function, e.g. xc <- c(a = 5, b = 6, c = 7, d = 8), which prints as:
> xc
a b c d
5 6 7 8
With the setNames function, two vectors of the same length can be used to create a named
vector:
x <- 5:8
y <- letters[1:4]
xy <- setNames(x, y)
which results in a named integer vector:
> xy
a b c d
5 6 7 8
You may also use the names function to get the same result:
xy <- 5:8
names(xy) <- letters[1:4]
#With such a vector it is also possibly to select elements by name:
xy["a"]
Vector sub-setting
In R Programming Language, subsetting allows the user to access elements from an object. It
takes out a portion from the object based on the condition provided.
Method 1: Subsetting in R Using [ ] Operator
Using the ‘[ ]’ operator, elements of vectors and observations from data frames can be
accessed. To neglect some indexes, ‘-‘ is used to access all other indexes of vector or data
frame.
x <- 1:15
# Print vector
cat("Original vector: ", x, "\n")
# Subsetting vector
cat("First 5 values of vector: ", x[1:5], "\n")
cat("Without values present at index 1, 2 and 3: ", x[-c(1, 2, 3)], "\n")
Matrices
Matrix is a rectangular arrangement of numbers in rows and columns. In a matrix, as we
know rows are the ones that run horizontally and columns are the ones that run vertically.
Creating and Naming a Matrix
To create a matrix in R you need to use the function called matrix(). The arguments to
this matrix() are the set of elements in the vector. You have to pass how many numbers of
rows and how many numbers of columns you want to have in your matrix.
A = matrix(
# Data elements (illustrative): the numbers 1 to 9
c(1:9),
# No of rows
nrow = 3,
# No of columns
ncol = 3,
# By default matrices are filled column-wise;
# byrow = TRUE arranges the data row-wise instead
byrow = TRUE
)
# Naming rows
rownames(A) = c("r1", "r2", "r3")
# Naming columns
colnames(A) = c("c1", "c2", "c3")
Matrix where all rows and columns are filled by a single constant ‘k’:
To create such a matrix the syntax is given below:
Syntax: matrix(k, m, n)
Parameters:
k: the constant
m: no of rows
n: no of columns
print(matrix(5, 3, 3))
Diagonal matrix:
A diagonal matrix is a matrix in which the entries outside the main diagonal are all zero. To
create such a matrix the syntax is given below:
print(diag(c(5, 3, 3), 3, 3))
Identity matrix:
A square matrix in which all the elements of the principal diagonal are ones and all other
elements are zeros. To create such a matrix the syntax is given below:
print(diag(1, 3, 3))
Matrix metrics
Once a matrix is created, we can inspect its dimensions: the number of rows, the number of columns, and the total number of elements.
cat("Number of rows:\n")
print(nrow(A))
cat("Number of columns:\n")
print(ncol(A))
cat("Number of elements:\n")
print(length(A))
# OR
print(prod(dim(A)))
Matrix subsetting
A matrix is subset with two arguments within single brackets, [], and separated by a comma.
The first argument specifies the rows, and the second the columns.
M_new<-matrix(c(25,23,25,20,15,17,13,19,25,24,21,19,20,12,30,17),ncol=4)
#M_new<-matrix(1:16,4)
M_new
colnames(M_new)<-c("C1","C2","C3","C4")
rownames(M_new)<-c("R1","R2","R3","R4")
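A few illustrative subsetting operations on M_new (using the row and column names assigned above):
M_new[2, 3]               # element in row 2, column 3
M_new["R1", ]             # all columns of row R1
M_new[, c("C2", "C4")]    # all rows of columns C2 and C4
M_new[1:2, -1]            # first two rows, all columns except the first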
Arrays
Arrays are the R data objects which can store data in more than two dimensions. For example
− If we create an array of dimension (2, 3, 4) then it creates 4 rectangular matrices each with 2
rows and 3 columns. Arrays can store only one data type.
An array is created using the array() function. It takes vectors as input and uses the values in
the dim parameter to create an array.
Example
The following example creates an array of two 3x3 matrices each with 3 rows and 3 columns.
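The example code itself is not reproduced here; a sketch consistent with the printed result shown further below (the input vectors are assumed from that output):
# Two input vectors (values chosen to match the printed result below)
vector1 <- c(5, 9, 3)
vector2 <- c(10, 11, 12, 13, 14, 15)
# Combine and recycle the values into a 3 x 3 x 2 array (two 3x3 matrices)
new.array <- array(c(vector1, vector2), dim = c(3, 3, 2))
print(new.array)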
Accessing arrays
The arrays can be accessed by using indices for different dimensions separated by commas.
Different components can be specified by any combination of elements’ names or positions.
# accessing elements
# (for an ordinary vector vec, the third element is accessed as vec[3])
# Print the element in the 1st row and 3rd column of the 1st matrix.
print(new.array[1, 3, 1])
# Use apply() to calculate the sum of each row across both matrices.
result <- apply(new.array, c(1), sum)
print(result)
When we execute the above code, it produces the following result −
,,1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
,,2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
[1] 56 68 60
Accessing subset of array elements
A smaller subset of the array elements can be accessed by defining a range of row or column
limits.
c(vector, values): c() function allows us to append values to the end of the array. Multiple
values can also be added together.
append(vector, values): This method allows the values to be appended at any position in the
vector. By default, this function adds the element at end.
append(vector, values, after=length(vector)) adds new values after specified length of the
array specified in the last argument of the function.
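A minimal sketch of c() and append() on a vector (illustrative values):
vec <- c(1, 2, 3)
vec <- c(vec, 4, 5)                      # append 4 and 5 at the end: 1 2 3 4 5
vec <- append(vec, 0)                    # append 0 at the end: 1 2 3 4 5 0
vec <- append(vec, c(9, 9), after = 2)   # insert two 9s after position 2
print(vec)                               # 1 2 9 9 3 4 5 0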
Class in R
Class is the blueprint that helps to create an object and contains its member variable along with
the attributes. As discussed earlier in the previous section, there are two classes of R, S3, and
S4.
S3 Class
S3 class is somewhat primitive in nature. It lacks a formal definition, and an object of this class
can be created simply by adding a class attribute to an existing object.
This simplicity accounts for the fact that it is widely used in R programming language. In
fact most of the R built-in classes are of this type.
Example 1: S3 class
# create a list with required components
s <- list(name = "John", age = 21, GPA = 3.5)
# name the class appropriately
class(s) <- "student"
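The object s can then be used like a list, and an S3 method can be attached to the class; the print method below is illustrative, not part of the original example:
s$name        # "John"
class(s)      # "student"
# Define a print method for the "student" class (illustrative)
print.student <- function(obj, ...) {
  cat(obj$name, "is", obj$age, "years old with a GPA of", obj$GPA, "\n")
}
print(s)      # dispatches to print.student()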
S4 Class
S4 classes are an improvement over S3 classes. They have a formally defined structure,
which helps make objects of the same class look more or less similar.
Class components are properly defined using the setClass() function and objects are created
using the new() function.
Example 2: S4 class
setClass("student", slots=list(name="character", age="numeric", GPA="numeric"))
Reference Class
Reference classes were introduced later, compared to the other two. They are more similar to the
object-oriented programming we are used to seeing in other major programming languages.
Reference classes are basically S4 classes with an environment added to them.
Example 3: Reference class
setRefClass("student")
Factors
Introduction to Factors:
Factors in R Programming Language are data structures that are implemented to categorize the
data or represent categorical data and store it on multiple levels.
They can be stored as integers with a corresponding label to every unique integer. Though
factors may look similar to character vectors, they are integers and care must be taken while
using them as strings. The factor accepts only a restricted number of distinct values. For
example, a data field such as gender may contain values only from female, male.
Creating a Factor in R Programming Language
The command used to create or modify a factor in R language is – factor() with a vector as
input.
The two steps to creating a factor are:
Creating a vector
Converting the vector created into a factor using function factor()
Example (the input vector below is illustrative):
data <- c("East", "West", "East", "North", "North", "East", "West", "West")
print(data)
print(is.factor(data))
factor_data <- factor(data)
print(factor_data)
print(is.factor(factor_data))
Example:
v <- gl(3, 4, labels = c("A", "B","C"))
print(v)
Summarizing a Factor
The summary function in R returns the results of basic statistical calculations (minimum, 1st
quartile, median, mean, 3rd quartile, and maximum) for a numerical vector. The general way
to write the R summary function is summary(x, na.rm=FALSE/TRUE). Again, X refers to a
numerical vector, while na.rm=FALSE/TRUE specifies whether to remove empty values from
the calculation.
Example:
v <- gl(3, 4, labels = c("A", "B","C"))
print(v)
summary(v)
x <- c("Pen", "Pencil", "Brush", "Pen", "Brush", "Brush", "Pencil", "Pen")  # illustrative vector
x <- factor(x)
print(x)
print(is.factor(x))
In the above code, x is a vector with 8 elements. To convert it to a factor, the function factor()
is used. Here there are 8 elements and 3 levels. Levels are the unique values in the data and can
be found using the levels() function.
Ordered factors are an extension of factors. They arrange the levels in increasing order. We use
the factor() function with the argument ordered = TRUE.
Example (illustrative):
v <- c(2, 5, 3, 5, 2)
ordered_v <- factor(v, ordered = TRUE)
print(ordered_v)
Data Frames
A data frame is a two-dimensional, table-like structure in which each column contains values of one variable and each row contains one set of values from each column. Data frames are created with the data.frame() function. For example (illustrative):
friends.data <- data.frame(
friend_id = c(1:5),
friend_name = c("Sachin", "Sourav", "Dravid", "Sehwag", "Dhoni")
)
print(friends.data)
Output:
friend_id friend_name
1 1 Sachin
2 2 Sourav
3 3 Dravid
4 4 Sehwag
5 5 Dhoni
The examples that follow operate on an employee data frame, emp.data, created along the following lines (illustrative values):
emp.data <- data.frame(
emp_id = c(1:5),
emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11","2015-03-27"))
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
Extract 3rd and 5th row with 2nd and 4th column
result <- emp.data[c(3,5),c(2,4)]
Add Column
Just add the column vector using a new column name.
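For example, adding a hypothetical dept column to emp.data (column values are illustrative):
emp.data$dept <- c("IT", "Operations", "IT", "HR", "Finance")
print(emp.data)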
Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows in the
same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing data frame
to create the final data frame.
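A sketch of this pattern, assuming the emp.data frame (including the dept column added above); the new employee values are hypothetical:
new.rows <- data.frame(
emp_id = c(6, 7),
emp_name = c("Rasmi", "Pranab"),
salary = c(578.0, 722.5),
start_date = as.Date(c("2013-05-21", "2013-07-30")),
dept = c("IT", "Operations")
)
emp.data <- rbind(emp.data, new.rows)
print(emp.data)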
Rows satisfying a condition can be extracted with the subset() function, for example:
subset(emp.data, emp_id == 3)
Sorting Data
To sort a data frame in R, use the order( ) function. By default, sorting is ASCENDING.
Prepend the sorting variable by a minus sign to indicate DESCENDING order. Here are some
examples.
data = data.frame(
rollno = c(1, 5, 4, 2, 3),
subjects = c("java", "python", "php", "sql", "c"))
print(data)
print("sort the data in decreasing order based on subjects ")
print(data[order(data$subjects, decreasing = TRUE), ] )
Output:
rollno subjects
1 1 java
2 5 python
3 4 php
4 2 sql
5 3 c
[1] "sort the data in decreasing order based on subjects "
rollno subjects
4 2 sql
2 5 python
3 4 php
1 1 java
5 3 c
[1] "sort the data in decreasing order based on rollno "
rollno subjects
2 5 python
3 4 php
5 3 c
4 2 sql
1 1 java
Lists
Lists are one-dimensional, heterogeneous data structures. The list can be a list of vectors, a
list of matrices, a list of characters and a list of functions, and so on.
A list is a vector but with heterogeneous data elements. A list in R is created with the use of
list() function. R allows accessing elements of a list with the use of the index value. In R, the
indexing of a list starts with 1 instead of 0 like other programming languages.
Creating a List
To create a List in R you need to use the function called “list()”. In other words, a list is a
generic vector containing other objects. To illustrate how a list looks, we take an example
here. We want to build a list of employees with the details. So for this, we want attributes
such as ID, employee name, and the number of employees.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(empId, empName, numberOfEmp)
print(empList)
or
list_data <- list("Red", "Green", c(21,32,11), TRUE, 51.23, 119.1)
print(list_data)
Access components by names: All the components of a list can be named and we can use
those names to access the components of the list using the dollar command.
empId = c(1, 2, 3, 4)
empName = c("Debi", "Sandeep", "Subham", "Shiba")
numberOfEmp = 4
empList = list(
"ID" = empId,
"Names" = empName,
"Total Staff" = numberOfEmp
)
print(empList)
Merging list
We can merge lists by combining them into a single list, for example (illustrative):
list1 <- list(1, 2, 3)
list2 <- list("Sun", "Mon", "Tue")
merged.list <- c(list1, list2)
print(merged.list)
Unit-4
Conditionals and control flow
R - Operators
An operator is a symbol that tells the compiler to perform specific mathematical or logical
manipulations. R language is rich in built-in operators and provides following types of
operators.
Types of Operators
We have the following types of operators in R programming −
Arithmetic Operators
Relational Operators
Logical Operators
Assignment Operators
Miscellaneous Operators
Arithmetic Operators
Following table shows the arithmetic operators supported by R language. The operators act on
each element of the vector.
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
it produces the following result −
[1] 10.0 8.5 10.0
Relational Operators
Following table shows the relational operators supported by R language. Each element of the
first vector is compared with the corresponding element of the second vector. The result of
comparison is a Boolean value.
== : Checks if each element of the first vector is equal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
it produces the following result −
[1] FALSE FALSE FALSE TRUE
!= : Checks if each element of the first vector is unequal to the corresponding element of the second vector.
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v != t)
it produces the following result −
[1] TRUE TRUE TRUE FALSE
Logical Operators
Following table shows the logical operators supported by R language. They are applicable only to
vectors of type logical, numeric or complex. All nonzero numbers are treated as the logical
value TRUE.
Each element of the first vector is compared with the corresponding element of the second
vector. The result of comparison is a Boolean value.
! : Called the Logical NOT operator. Takes each element of the vector and gives the opposite logical value.
v <- c(3,0,TRUE,2+2i)
print(!v)
it produces the following result −
[1] FALSE TRUE FALSE FALSE
The logical operators && and || consider only the first element of each vector and give a single-element logical vector as output.
|| : Called the Logical OR operator. Takes the first element of both vectors and gives TRUE if at least one of them is TRUE.
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v||t)
it produces the following result −
[1] FALSE
Assignment Operators
These operators are used to assign values to vectors.
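For reference, a minimal sketch of the left (<- and =) and right (->) assignment operators, with illustrative values:
v1 <- c(3, 1, TRUE, 2+3i)      # left assignment with <-
v2 = c(3, 1, TRUE, 2+3i)       # left assignment with =
c(3, 1, TRUE, 2+3i) -> v3      # right assignment with ->
print(v1)
print(v2)
print(v3)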
Miscellaneous Operators
%in% : This operator is used to identify if an element belongs to a vector.
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
it produces the following result −
[1] TRUE
[1] FALSE
%*% : This operator is used to multiply a matrix with its transpose.
M = matrix( c(2,6,5,1,10,4), nrow = 2, ncol = 3, byrow = TRUE)
t = M %*% t(M)
print(t)
it produces the following result −
[,1] [,2]
[1,] 65 82
[2,] 82 117
R provides the following types of decision making statements. Click the following links to
check their detail.
1 if statement
An if statement consists of a Boolean expression followed by one or more statements.
2 if...else statement
An if statement can be followed by an optional else statement, which executes when the
Boolean expression is false.
3 switch statement
A switch statement allows a variable to be tested for equality against a list of values.
R - If Statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax
The basic syntax for creating an if statement in R is −
if(boolean_expression) {
// statement(s) will execute if the boolean expression is true.
}
If the Boolean expression evaluates to be true, then the block of code inside the if statement
will be executed. If Boolean expression evaluates to be false, then the first set of code after the
end of the if statement (after the closing curly brace) will be executed.
Flow Diagram
Example
x <- 30L
if(is.integer(x)) {
print("X is an Integer")
}
When the above code is compiled and executed, it produces the following result −
[1] "X is an Integer"
R - If...Else Statement
An if statement can be followed by an optional else statement which executes when the boolean
expression is false.
Syntax
The basic syntax for creating an if...else statement in R is −
if(boolean_expression) {
// statement(s) will execute if the boolean expression is true.
} else {
// statement(s) will execute if the boolean expression is false.
}
If the Boolean expression evaluates to be true, then the if block of code will be executed,
otherwise else block of code will be executed.
Flow Diagram
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found")
} else {
print("Truth is not found")
}
When the above code is compiled and executed, it produces the following result −
[1] "Truth is not found"
Here "Truth" and "truth" are two different strings.
The if...else if...else Statement
An if statement can be followed by an optional else if...else statement, which is very useful to
test various conditions using single if...else if statement.
When using if, else if, else statements there are few points to keep in mind.
An if can have zero or one else and it must come after any else if's.
An if can have zero to many else if's and they must come before the else.
Once an else if succeeds, none of the remaining else if's or else's will be tested.
Syntax
The basic syntax for creating an if...else if...else statement in R is −
if(boolean_expression 1) {
// Executes when the boolean expression 1 is true.
} else if( boolean_expression 2) {
// Executes when the boolean expression 2 is true.
} else if( boolean_expression 3) {
// Executes when the boolean expression 3 is true.
} else {
// executes when none of the above condition is true.
}
Example
x <- c("what","is","truth")
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
When the above code is compiled and executed, it produces the following result −
[1] "truth is found the second time"
Nested If Statements
You can also have if statements inside if statements, this is called nested if statements.
Example
x <- 41
if (x > 10) {
print("Above ten")
if (x > 20) {
print("and also above 20!")
} else {
print("but not above 20.")
}
} else {
print("below 10.")
}
AND
The & symbol (and) is a logical operator, and is used to combine conditional statements:
Example
a <- 200
b <- 33
c <- 500
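The comparison itself is not reproduced above; a minimal sketch using the values just defined:
if (a > b & c > a) {
print("Both conditions are true")
}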
OR
The | symbol (or) is a logical operator, and is used to combine conditional statements:
Example
a <- 200
b <- 33
c <- 500
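Again, a minimal sketch of the test using the values just defined:
if (a > b | a > c) {
print("At least one of the conditions is true")
}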
R - Switch Statement
A switch statement allows a variable to be tested for equality against a list of values. Each
value is called a case, and the variable being switched on is checked for each case.
Syntax
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3....)
The following rules apply to a switch statement −
If the value of expression is not a character string it is coerced to integer.
You can have any number of case statements within a switch.
If the value of the integer is between 1 and nargs() - 1 (the maximum number of
arguments), then the corresponding element of the case condition is evaluated and the result
returned.
If expression evaluates to a character string then that string is matched (exactly) to the
names of the elements.
If there is more than one match, the first matching element is returned.
No Default argument is available.
In the case of no match, if there is a unnamed element of ... its value is returned.
Flow Diagram
Example
x <- switch(
3,
"first",
"second",
"third",
"fourth"
)
print(x)
When the above code is compiled and executed, it produces the following result −
[1] "third"
Example 2:
# Mathematical calculation
val1 = 6
val2 = 7
val3 = "s"
result = switch(
val3,
"a"= cat("Addition =", val1 + val2),
"d"= cat("Subtraction =", val1 - val2),
"r"= cat("Division = ", val1 / val2),
"s"= cat("Multiplication =", val1 * val2),
"m"= cat("Modulus =", val1 %% val2),
"p"= cat("Power =", val1 ^ val2)
)
print(result)
Iterative Programming in R
R - Loops
Introduction:
There may be a situation when you need to execute a block of code several number
of times. In general, statements are executed sequentially. The first statement in a
function is executed first, followed by the second, and so on.
Programming languages provide various control structures that allow for more
complicated execution paths.
A loop statement allows us to execute a statement or group of statements multiple
times and the following is the general form of a loop statement in most of the
programming languages −
1 repeat loop
Executes a sequence of statements multiple times and abbreviates the code that
manages the loop variable.
2 while loop
Repeats a statement or group of statements while a given condition is true. It
tests the condition before executing the loop body.
3 for loop
Executes a group of statements once for each element in a vector or sequence.
R - For Loop
A For loop is a repetition control structure that allows you to efficiently write a loop
that needs to execute a specific number of times.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
statements
}
Flow Diagram
R’s for loops are particularly flexible in that they are not limited to integers, or even
numbers in the input. We can pass character vectors, logical vectors, lists or
expressions.
Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
Example
for (x in 1:10) {
print(x)
}
Example
week <- c('Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday')
for (day in week)
{
print(day)
}
R - While Loop
The While loop executes the same code again and again until a stop condition is met.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
statement
}
Flow Diagram
The key point of the while loop is that the loop might not ever run. When the condition
is tested and the result is false, the loop body will be skipped and the first statement
after the while loop will be executed.
Example1
val = 1
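The body of this loop is not reproduced above; a minimal sketch, assuming the intent is to print val while it is at most 5 (mirroring the repeat-loop example later in this section):
while (val <= 5)
{
print(val)
val = val + 1
}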
Example2
n<-5
factorial <- 1
i<-1
while (i <= n)
{
factorial = factorial * i
i=i+1
}
print(factorial)
R - Repeat Loop
It is a simple loop that will run the same statement or a group of statements
repeatedly until the stop condition has been encountered. The repeat loop has no
built-in condition to terminate it; the programmer must place a condition within
the loop’s body and use a break statement to end the loop. If no such condition is
present in the body of the repeat loop, it will iterate infinitely.
Syntax
The basic syntax for creating a repeat loop in R is −
repeat
{
statement
if( condition )
{
break
}
}
Flow Diagram
Example1
val = 1
repeat
{
print(val)
val = val + 1
if(val > 5)
{
break
}
}
Example 2:
i<-0
repeat
{
print("Geeks 4 geeks!")
i=i+1
if (i == 5)
{
break
}
}
1 break statement
Terminates the loop statement and transfers execution to the statement
immediately following the loop.
2 Next statement
The next statement skips the remainder of the current iteration and moves control to the next iteration of the loop.
R - Break Statement
The break statement in R programming language has the following two usages −
When the break statement is encountered inside a loop, the loop is immediately
terminated and program control resumes at the next statement following the
loop.
It can be used to terminate a case in the switch statement
Syntax
The basic syntax for creating a break statement in R is −break
Flow Diagram
Example
for (val in 1: 5)
{
# checking condition
if (val == 3)
{
# using break keyword
break
}
# printing values until the condition is met
print(val)
}
R - Next Statement
The next statement in R programming language is useful when we want to skip the
current iteration of a loop without terminating it. On encountering next, the R parser
skips further evaluation and starts next iteration of the loop.
Syntax
The basic syntax for creating a next statement in R is −next
Flow Diagram
Example
for (val in 1: 5)
{
# checking condition
if (val == 3)
{
# using next keyword to skip this iteration
next
}
# printing every value except the skipped one
print(val)
}
R - Functions
Functions are useful when you want to perform a certain task multiple times. A function accepts
input arguments and produces the output by executing valid R commands that are inside the
function. In R Programming Language when you are creating a function the function name and
the file in which you are creating the function need not be the same and you can have one or
more function definitions in a single R file.
Built-in functions: examples of built-in functions in R are sqrt(), mean(), and max(); these
functions can be called directly in a program.
Functions in R Language
Functions are created in R by using the command function(). The general structure of the
function file is as follows:
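The general structure itself is not reproduced above; a minimal sketch (the names arg1, arg2 and the body are placeholders):
function_name <- function(arg1, arg2) {
  # body of the function: valid R statements
  result <- arg1 + arg2
  return(result)
}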
Built-in Function in R Programming Language
Here we will use built-in function like sum(), max() and min().
print(sum(4:6))
print(evenOdd(4))
print(evenOdd(3))
print(areaOfCircle(2))
resultList = Rectangle(2, 3)
print(resultList["Area"])
print(resultList["Perimeter"])
print(f(4))
# Case 1:
print(Rectangle(2, 3))
# Case 2:
print(Rectangle(width = 8, length = 4))
# Case 3:
print(Rectangle())
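The calls above rely on user-defined functions whose definitions do not appear in this text. One plausible, purely illustrative set of definitions under which they run (the function f is left unspecified):
evenOdd <- function(x) {
  if (x %% 2 == 0) return("even") else return("odd")
}
areaOfCircle <- function(radius) {
  pi * radius ^ 2
}
Rectangle <- function(length = 5, width = 4) {   # defaults so Rectangle() also works
  area <- length * width
  perimeter <- 2 * (length + width)
  list("Area" = area, "Perimeter" = perimeter)
}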
Example1:
Cal= function(a,b,c){
v = a*b
return(v)
}
print(Cal(5, 10))
Example2:
Cal= function(a,b,c){
v = a*b*c
return(v)
}
print(Cal(5, 10))
Adding Arguments in R
We can pass an argument to a function while calling the function by simply giving the value
as an argument inside the parenthesis. Below is an implementation of a function with a single
argument.
divisbleBy5 <- function(n){
if(n %% 5 == 0)
{
return("number is divisible by 5")
}
else
{
return("number is not divisible by 5")
}
}
# Function call
divisbleBy5(100)
Adding Multiple Arguments in R
A function in R programming can have multiple arguments too. Below is an
implementation of a function with multiple arguments.
divisible <- function(a, b){
if(a %% b == 0)
{
return(paste(a, "is divisible by", b))
}
else
{
return(paste(a, "is not divisible by", b))
}
}
# Function call
divisible(7, 3)
# Function call
divisible(10, 5)
divisible(12)
Dots Argument
Dots argument (…), also known as the ellipsis, allows a function to accept an arbitrary,
undefined number of arguments.
Below is an example of a function with an arbitrary number of arguments.
fun <- function(n, ...){
l <- list(n, ...)
paste(l, collapse = " ")
}
# Function call
fun(5, 1L, 6i, 15.2, TRUE)
Recursive functions use the concept of recursion to perform iterative tasks: they call
themselves, again and again, which acts as a loop. These kinds of functions need a stopping
condition so that they do not loop forever.
Recursive functions call themselves. They break down the problem into smaller components.
The function() calls itself within the original function() on each of the smaller components.
After this, the results will be put together to solve the original problem.
Example1:
fac <- function(x){
if(x==0 || x==1)
{
return(1)
}
else
{
return(x*fac(x-1))
}
}
fac(3)
Nested Functions
Example: calling a function within another function. The definition of Nested_function is not shown here; an illustrative definition under which the call below works:
Nested_function <- function(x, y) {
a <- x + y
return(a)
}
Nested_function(Nested_function(2, 2), Nested_function(3, 3))
Write a function within a function.
Example
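The code for this example is not reproduced here; a minimal sketch of a function defined inside another function (illustrative names):
Outer_func <- function(x) {
  Inner_func <- function(y) {
    a <- x + y
    return(a)
  }
  return(Inner_func)
}
output <- Outer_func(3)   # Outer_func returns Inner_func
output(5)                 # 3 + 5 = 8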
Loading an R package
Packages
Packages are collections of R functions, data, and compiled code in a well-defined format. The
directory where packages are stored is called the library. R comes with a standard set of
packages. Others are available for download and installation. Once installed, they have to be
loaded into the session to be used.
Adding Packages
You can expand the types of analyses you do by adding other packages. A complete list of
contributed packages is available from CRAN.
Load an R Package
There are basically two extremely important functions when it comes to R packages:
install.packages(), which downloads and installs a package (typically from CRAN), and
library(), which loads an installed package into the current session.
To install packages, you need administrator privileges. This means that install.packages() will
thus not work in the DataCamp interface. However, almost all CRAN packages are installed
on our servers. You can load them with library().
In this exercise, you'll be learning how to load the ggplot2 package, a powerful package for
data visualization. You'll use it to create a plot of two variables of the mtcars data frame. The
data has already been prepared for you in the workspace.
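A minimal sketch of this exercise (it assumes ggplot2 has already been installed, e.g. via install.packages("ggplot2")):
library(ggplot2)
# Plot two variables of the mtcars data frame: weight against horsepower
ggplot(mtcars, aes(x = wt, y = hp)) +
  geom_point()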
Mathematical Functions in R
R provides the various mathematical functions to perform the mathematical calculation. These
mathematical functions are very helpful to find absolute value, square value and much more
calculations. In R, there are the following functions which are used:
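A few of the commonly used mathematical functions, shown with small illustrative calls:
abs(-4.7)          # absolute value: 4.7
sqrt(16)           # square root: 4
ceiling(4.2)       # smallest integer not less than the value: 5
floor(4.8)         # largest integer not greater than the value: 4
round(3.14159, 2)  # round to 2 decimal places: 3.14
exp(1)             # exponential function
log(10)            # natural logarithm
log10(100)         # base-10 logarithm: 2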
Numerosity reduction techniques replace the original data volume by alternative, smaller
forms of data representation.
These techniques may be parametric or nonparametric.
For parametric methods, a model is used to estimate the data, so that typically only the
data parameters need to be stored, instead of the actual data. (Outliers may also be
stored.) Regression and log-linear models are examples.
Nonparametric methods for storing reduced representations of the data include
histograms, clustering, sampling, and data cube aggregation.
The discrete wavelet transform (DWT) is a linear signal processing technique that, when applied
to a data vector, transforms it into a numerically different vector of wavelet coefficients; a
compressed approximation of the data can be retained by storing only a small fraction of the
strongest coefficients. The technique also works to remove noise without smoothing out the main
features of the data, making it effective for data cleaning as well. Given a set of coefficients,
an approximation of the original data can be constructed by applying the inverse of the
DWT used.
The DWT is closely related to the discrete Fourier transform (DFT), a signal
processing technique involving sines and cosines. In general, however, the DWT
achieves better lossy compression.
Unlike the DFT, wavelets are quite localized in space, contributing to the conservation
of local detail.
There is only one DFT, yet there are several families of DWTs. Figure 3.4 shows some wavelet
families. Popular wavelet transforms include the Haar-2, Daubechies-4, and Daubechies-6. The
general procedure for applying a discrete wavelet transform uses a hierarchical pyramid
algorithm that halves the data at each iteration, resulting in fast computational speed.
Wavelet transforms can be applied to multidimensional data such as a data cube. This
is done by first applying the transform to the first dimension, then to the second, and so
on.
Lossy compression by wavelets is reportedly better than JPEG compression, the current
commercial standard.
Wavelet transforms have many real world applications, including the compression of
fingerprint images, computer vision, analysis of time-series data, and data cleaning.
Attribute subset selection reduces the data set size by removing irrelevant or redundant
attributes (or dimensions). The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the data classes is as close as
possible to the original distribution obtained using all attributes. Mining on a reduced set of
attributes has an additional benefit: It reduces the number of attributes appearing in the
discovered patterns, helping to make the patterns easier to understand.
For n attributes, there are 2^n possible attribute subsets, so an exhaustive search for the optimal
subset can be prohibitively expensive. Therefore, heuristic methods that explore a reduced search
space are commonly used for attribute subset selection. These methods are typically greedy in that, while searching through
attribute space, they always make what looks to be the best choice at the time. Their strategy
is to make a locally optimal choice in the hope that this will lead to a globally optimal solution.
Such greedy methods are effective in practice and may come close to estimating an optimal
solution.
The “best” (and “worst”) attributes are typically determined using tests of statistical
significance, which assume that the attributes are independent of one another. Many other
attribute evaluation measures can be used such as the information gain measure used in
building decision trees for classification.
Basic heuristic methods of attribute subset selection include the techniques that follow, some
of which are illustrated in Figure 3.6.
1. Stepwise forward selection: The procedure starts with an empty set of attributes as the
reduced set. The best of the original attributes is determined and added to the reduced set. At
each subsequent iteration or step, the best of the remaining original attributes is added to the
set.
2. Stepwise backward elimination: The procedure starts with the full set of attributes. At each
step, it removes the worst attribute remaining in the set.
3. Combination of forward selection and backward elimination: The stepwise forward
selection and backward elimination methods can be combined so that, at each step, the
procedure selects the best attribute and removes the worst from among the remaining attributes.
4. Decision tree induction: Decision tree algorithms (e.g., ID3, C4.5, and CART) were
originally intended for classification. Decision tree induction constructs a flowchart like
structure where each internal (nonleaf) node denotes a test on an attribute, each branch
corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
At each node, the algorithm chooses the “best” attribute to partition the data into individual
classes.
When decision tree induction is used for attribute subset selection, a tree is constructed from
the given data. All attributes that do not appear in the tree are assumed to be irrelevant. The set
of attributes appearing in the tree form the reduced subset of attributes.
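As a loose illustration of greedy forward/backward selection in R, the step() function adds and drops predictors of a linear model one at a time; note that it uses the AIC criterion rather than the significance tests described above, so this is only a sketch of the same greedy idea:
full  <- lm(mpg ~ ., data = mtcars)            # model using all attributes
empty <- lm(mpg ~ 1, data = mtcars)            # model using no attributes
# Stepwise forward selection
forward  <- step(empty, scope = formula(full), direction = "forward", trace = 0)
# Stepwise backward elimination
backward <- step(full, direction = "backward", trace = 0)
# Combination of forward selection and backward elimination
both     <- step(empty, scope = formula(full), direction = "both", trace = 0)
formula(both)                                   # the selected subset of attributes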
Regression and log-linear models can be used to approximate the given data. In (simple) linear
regression, the data are modeled to fit a straight line. For example, a random variable, y (called
a response variable), can be modeled as a linear function of another random variable, x (called
a predictor variable), with the equation
where the variance of y is assumed to be constant. In the context of data mining, x and y are
numeric database attributes. The coefficients, w and b (called regression coefficients), specify
the slope of the line and the y-intercept, respectively.
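A minimal sketch of fitting such a line by least squares in R, with illustrative data:
x <- c(1, 2, 3, 4, 5)                 # illustrative predictor values
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)       # illustrative response values
fit <- lm(y ~ x)
coef(fit)                             # (Intercept) gives b, the x coefficient gives w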
Histograms
Histograms use binning to approximate data distributions and are a popular form of data
reduction. A histogram for an attribute, A, partitions the data distribution of A into disjoint
subsets, referred to as buckets or bins. If each bucket represents only a single attribute–
value/frequency pair, the buckets are called singleton buckets. Often, buckets instead represent
continuous ranges for the given attribute.
Clustering
Clustering techniques consider data tuples as objects.
They partition the objects into groups, or clusters, so that objects within a cluster are
“similar” to one another and “dissimilar” to objects in other clusters.
Similarity is commonly defined in terms of how “close” the objects are in space, based
on a distance function.
The “quality” of a cluster may be represented by its diameter, the maximum distance
between any two objects in the cluster.
Centroid distance is an alternative measure of cluster quality and is defined as the
average distance of each cluster object from the cluster centroid.
In data reduction, the cluster representations of the data are used to replace the actual
data.
Sampling
Sampling can be used as a data reduction technique because it allows a large data set to
be represented by a much smaller random data sample (or subset). Suppose that a large data
set, D, contains N tuples. Let’s look at the most common ways that we could sample D for data
reduction, as illustrated in Figure 3.9.
Simple random sample without replacement (SRSWOR) of size s: This is created by drawing s of
the N tuples from D (s < N), where the probability of drawing any tuple in D is 1/N, that is, all
tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This is similar to SRSWOR,
except that each time a tuple is drawn from D, it is recorded and then replaced. That is, after a
tuple is drawn, it is placed back in D so that it may be drawn again.
Cluster sample: If the tuples in D are grouped into M mutually disjoint “clusters,” then an SRS
of s clusters can be obtained, where s < M. For example, tuples in a database are usually
retrieved a page at a time, so that each page can be considered a cluster. A reduced data
representation can be obtained by applying, say, SRSWOR to the pages, resulting in a cluster
sample of the tuples. Other clustering criteria conveying rich semantics can also be explored.
For example, in a spatial database, we may choose to define clusters geographically based on
how closely different areas are located.
Stratified sample: If D is divided into mutually disjoint parts called strata, a stratified sample
of D is generated by obtaining an SRS at each stratum. This helps ensure a representative
sample, especially when the data are skewed. For example, a stratified sample may be obtained
from customer data, where a stratum is created for each customer age group. In this way, the
age group having the smallest number of customers will be sure to be represented.
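These sampling schemes are easy to sketch in R with sample(); the data set and strata below are assumed, illustrative stand-ins for D:
D <- 1:1000                                   # pretend tuple IDs of data set D
s <- 50
srswor <- sample(D, s)                        # simple random sample without replacement
srswr  <- sample(D, s, replace = TRUE)        # simple random sample with replacement
# Stratified sample (illustrative strata): take an SRS within each age group
ages  <- sample(c("young", "mid", "senior"), length(D), replace = TRUE)
strat <- unlist(lapply(split(D, ages), function(ids) sample(ids, 10)))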
An advantage of sampling for data reduction is that the cost of obtaining a sample is
proportional to the size of the sample, s, as opposed to N, the data set size. Hence, sampling
complexity is potentially sublinear to the size of the data. Other data reduction techniques can
require at least one complete pass through D. For a fixed sample size, sampling complexity
increases only linearly as the number of data dimensions, n, increases, whereas techniques
using histograms, for example, increase exponentially in n.
When applied to data reduction, sampling is most commonly used to estimate the answer to an
aggregate query. It is possible (using the central limit theorem) to determine a sufficient sample
size for estimating a given function within a specified degree of error. This sample size, s, may
be extremely small in comparison to N. Sampling is a natural choice for the progressive
refinement of a reduced data set. Such a set can be further refined by simply increasing the
sample size.
Data cube aggregation stores multidimensional aggregated information. For example, the above Figure shows a data
cube for multidimensional analysis of sales data with respect to annual sales per item type for
each AllElectronics branch. Each cell holds an aggregate data value, corresponding to the data
point in multidimensional space. (For readability, only some cell values are shown.) Concept
hierarchies may exist for each attribute, allowing the analysis of data at multiple abstraction
levels. For example, a hierarchy for branch could allow branches to be grouped into regions,
based on their address. Data cubes provide fast access to precomputed, summarized data,
thereby benefiting online analytical processing as well as data mining.
The cube created at the lowest abstraction level is referred to as the base cuboid. The
base cuboid should correspond to an individual entity of interest such as sales or customer. In
other words, the lowest level should be usable, or useful for the analysis. A cube at the highest
level of abstraction is the apex cuboid. For the sales data in Figure 3.11, the apex cuboid would
give one total—the total sales for all three years, for all item types, and for all branches. Data
cubes created for varying levels of abstraction are often referred to as cuboids, so that a data
cube may instead refer to a lattice of cuboids. Each higher abstraction level further reduces the
resulting data size. When replying to data mining requests, the smallest available cuboid
relevant to the given task should be used.
Data Visualization
Data visualization aims to communicate data clearly and effectively through graphical
representation. Data visualization has been used extensively in many applications—for
example, at work for reporting, managing business operations, and tracking progress of tasks.
More popularly, we can take advantage of visualization techniques to discover data
relationships that are otherwise not easily observable by looking at the raw data. Nowadays,
people also use data visualization to create fun and interesting graphics.
This section start with multidimensional data such as those stored in relational
databases. We discuss several representative approaches, including pixel-oriented techniques,
geometric projection techniques, icon-based techniques, and hierarchical and graph-based
techniques. We then discuss the visualization of complex data and relations.
A simple way to visualize the value of a dimension is to use a pixel where the color of
the pixel reflects the dimension’s value. For a data set of m dimensions, pixel-oriented
techniques create m windows on the screen, one for each dimension. The m dimension values
of a record are mapped to m pixels at the corresponding positions in the windows.
The colors of the pixels reflect the corresponding values. Inside a window, the data
values are arranged in some global order shared by all windows. The global order may be
obtained by sorting all data records in a way that’s meaningful for the task at hand.
Pixel-oriented visualization. AllElectronics maintains a customer information table,
which consists of four dimensions: income, credit limit, transaction volume, and age. Can we
analyze the correlation between income and the other attributes by visualization? We can sort
all customers in income-ascending order, and use this order to lay out the customer data in the
four visualization windows, as shown in Figure 2.10. The pixel colors are chosen so that the
smaller the value, the lighter the shading. Using pixel-based visualization, we can easily observe
the following: credit limit increases as income increases; customers whose income is in the
middle range are more likely to purchase more from AllElectronics; there is no clear correlation
between income and age.
In pixel-oriented techniques, data records can also be ordered in a query-dependent
way. For example, given a point query, we can sort all records in descending order of similarity
to the point query.
Filling a window by laying out the data records in a linear way may not work well for
a wide window. The first pixel in a row is far away from the last pixel in the previous row,
though they are next to each other in the global order. Moreover, a pixel is next to the one
above it in the window, even though the two are not next to each other in the global order. To
solve this problem, we can lay out the data records in a space-filling curve to fill the windows.
A space-filling curve is a curve with a range that covers the entire n-dimensional unit
hypercube. Since the visualization windows are 2-D, we can use any 2-D space-filling curve.
Figure 2.11 shows some frequently used 2-D space-filling curves. Note that the
windows do not have to be rectangular. For example, the circle segment technique uses
windows in the shape of segments of a circle, as illustrated in Figure 2.12. This technique can
ease the comparison of dimensions because the dimension windows are located side by side
and form a circle.
Geometric Projection Visualization Techniques
A scatter plot displays 2-D data points using Cartesian coordinates. A third dimension
can be added using different colors or shapes to represent different data points. Figure 2.13
shows an example, where X and Y are two spatial attributes and the third dimension is
represented by different shapes. Through this visualization, we can see that points of types “+”
and “_” tend to be colocated.
A 3-D scatter plot uses three axes in a Cartesian coordinate system. If it also uses color,
it can display up to 4-D data points (Figure 2.14).
For data sets with more than four dimensions, scatter plots are usually ineffective. The
scatter-plot matrix technique is a useful extension to the scatter plot. For an n dimensional
data set, a scatter-plot matrix is an n × n grid of 2-D scatter plots that provides a visualization of
each dimension with every other dimension. Figure 2.15 shows an example, which visualizes
the Iris data set. The data set consists of 150 samples, 50 from each of three species of Iris flowers.
There are five dimensions in the data set: length and width of sepal and petal, and species. The
scatter-plot matrix becomes less effective as the dimensionality increases. Another popular
technique, called parallel coordinates, can handle higher dimensionality. To visualize n-
dimensional data points, the parallel coordinates technique draws n equally spaced axes, one
for each dimension, parallel to one of the display axes.
A data record is represented by a polygonal line that intersects each axis at the point
corresponding to the associated dimension value.
A major limitation of the parallel coordinates technique is that it cannot effectively
show a data set of many records. Even for a data set of several thousand records, visual clutter
and overlap often reduce the readability of the visualization and make the patterns hard to find.
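Both the scatter-plot matrix and parallel coordinates are available out of the box in pandas. The sketch below assumes scikit-learn is installed only to supply the Iris data; with five dimensions (four measurements plus species) it mirrors the example above.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.drop(columns="target")            # 150 rows, 4 numeric columns
df["species"] = iris.target_names[iris.target]

# Scatter-plot matrix: every numeric dimension plotted against every other.
pd.plotting.scatter_matrix(df[iris.feature_names], figsize=(8, 8), diagonal="hist")
plt.show()

# Parallel coordinates: one polygonal line per record, one axis per dimension.
pd.plotting.parallel_coordinates(df, class_column="species", colormap="viridis")
plt.show()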
Icon-Based Visualization Techniques
Icon-based visualization techniques use small icons to represent multidimensional data values. A well-known example is Chernoff faces, which display data of up to 18 dimensions as a cartoon human face, with facial features such as nose length, mouth curvature, and eye size encoding the dimension values. Viewing large tables of data can be tedious; by condensing the data, Chernoff faces make it easier for users to digest. In this way, they facilitate the visualization of regularities and irregularities present in the data, although their power in relating multiple relationships is limited. Another limitation is that specific data values are not shown.
Furthermore, facial features vary in perceived importance. This means that the
similarity of two faces (representing two multidimensional data points) can vary depending on
the order in which dimensions are assigned to facial characteristics. Therefore, this mapping
should be carefully chosen. Eye size and eyebrow slant have been found to be important.
Asymmetrical Chernoff faces were proposed as an extension to the original technique.
Since a face has vertical symmetry (along the y-axis), the left and right side of a face
are identical, which wastes space. Asymmetrical Chernoff faces double the number of facial
characteristics, thus allowing up to 36 dimensions to be displayed. The stick figure
visualization technique maps multidimensional data to five-piece stick figures, where each
figure has four limbs and a body. Two dimensions are mapped to the display (x and y) axes
and the remaining dimensions are mapped to the angle and/or length of the limbs. Figure 2.18
shows census data, where age and income are mapped to the display axes, and the remaining
dimensions (gender, education, and so on) are mapped to stick figures. If the data items are
relatively dense with respect to the two display dimensions, the resulting visualization shows
texture patterns, reflecting data trends.
Visualizing Complex Data and Relations
In the early days, visualization techniques were designed mainly for numeric data. Recently, more and more non-numeric data, such as text and social networks, have become available. Visualizing and analyzing such data has attracted a lot of interest, and many new visualization techniques are dedicated to these kinds of data.
For example, many people on the Web tag various objects such as pictures, blog entries,
and product reviews. A tag cloud is a visualization of statistics of user-generated tags. Often,
in a tag cloud, tags are listed alphabetically or in a user-preferred order. The importance of a
tag is indicated by font size or color. Figure 2.21 shows a tag cloud for visualizing the popular
tags used in a Web site. Tag clouds are often used in two ways. First, in a tag cloud for a single
item, we can use the size of a tag to represent the number of times that the tag is applied to this
item by different users.
Second, when visualizing the tag statistics on multiple items, we can use the size of a tag to represent the number of items that the tag has been applied to, that is, the popularity of the tag.
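A tag cloud can be produced with the third-party wordcloud package (one possible choice, not the only one); the tag counts below are invented purely to illustrate mapping popularity to font size.

import matplotlib.pyplot as plt
from wordcloud import WordCloud   # third-party package: pip install wordcloud

# Hypothetical tag -> popularity counts (number of items carrying each tag).
tag_counts = {"python": 120, "visualization": 95, "data": 80,
              "statistics": 60, "machine-learning": 45, "sql": 30}

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate_from_frequencies(tag_counts)

plt.imshow(cloud, interpolation="bilinear")   # font size reflects tag popularity
plt.axis("off")
plt.show()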
In addition to complex data, complex relations among data entries also raise challenges for visualization. For example, Figure 2.22 uses a disease influence graph to visualize the
correlations between diseases. The nodes in the graph are diseases, and the size of each node
is proportional to the prevalence of the corresponding disease. Two nodes are linked by an edge
if the corresponding diseases have a strong correlation.
The width of an edge is proportional to the strength of the correlation pattern of the two
corresponding diseases.
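A graph of this kind can be drawn with NetworkX; the diseases, prevalences, and correlation strengths below are invented solely to illustrate the node-size and edge-width encoding.

import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical prevalence per disease (node size) and correlations (edge width).
prevalence = {"diabetes": 0.09, "hypertension": 0.30, "obesity": 0.20, "asthma": 0.08}
correlations = [("diabetes", "hypertension", 0.6),
                ("diabetes", "obesity", 0.7),
                ("hypertension", "obesity", 0.5)]

G = nx.Graph()
G.add_nodes_from(prevalence)
G.add_weighted_edges_from(correlations)

pos = nx.spring_layout(G, seed=42)
nx.draw_networkx_nodes(G, pos, node_size=[prevalence[n] * 5000 for n in G.nodes])
nx.draw_networkx_edges(G, pos, width=[G[u][v]["weight"] * 5 for u, v in G.edges])
nx.draw_networkx_labels(G, pos, font_size=9)
plt.axis("off")
plt.show()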
In summary, visualization provides effective tools to explore data. We have introduced
several popular methods and the essential ideas behind them. There are many existing tools
and methods. Moreover, visualization can be applied to many aspects of data mining. In addition
to visualizing data, visualization can be used to represent the data mining process, the patterns
obtained from a mining method, and user interaction with the data. Visual data mining is an
important research and development direction.