
Cleaning Data in R

mohammad nasir abdullah


1/29/2018

Contents
Exploring Raw Data
    Getting a feel for your data
    Viewing the Structure of your data
Exploring raw data - 2
    Visualizing your data
Introduction to tidy data
    Introduction to Tidyr
    Gathering columns into key-value pairs
    Spreading key-value pairs into columns

Exploring Raw Data

The first step in the data cleaning process is exploring your raw data. We can think of data exploration itself as a three-step process: understanding the structure of your data, looking at your data, and visualizing your data. To understand the structure of your data you have several tools at your disposal in R. Here we read in a simple dataset called lunch, which contains information on free, reduced-price, and full-price school lunches served in the US from 1969 through 2014. First, we check the class of the lunch object to verify that it is in fact a data frame, that is, a two-dimensional table consisting of rows and columns, in which each column is a single data type such as numeric, character, and so on.
We then view the dimensions of the dataset with the dim() function; this particular dataset has 46 rows and 7 columns. dim() always displays the number of rows first, followed by the number of columns.
Next, we take a look at the column names of lunch with the names() function; each of the seven columns has a name (year, average free, average reduced, and so on).
#load the lunch data
lunch <- read.csv("datasets/lunch_clean.csv")

# View its class
class(lunch)

# View its dimensions
dim(lunch)

# Look at the column names
names(lunch)

Ok, we are starting to get a feel for things, but let's dig a little deeper.
The str() function (short for structure) is one of the most versatile and useful functions in the R language, because it can be called on any object and will normally provide a useful and compact summary of its internal structure. When passed a data frame, as in this case, str() tells us how many rows and columns we have. Strictly speaking, the function refers to rows as observations and columns as variables, which is true of a tidy dataset but not always the case, as you will see in the next chapter.
str(lunch)

In addition, you see the name of each column followed by its data type and a preview of the data contained in it. The lunch dataset happens to consist entirely of integers and numerics. We will take a closer look at these data types in the next chapter.
The dplyr package offers a slightly different flavor of str() called glimpse(), which shows the same information but attempts to preview as much of each column as will fit neatly on your screen.
#load dplyr
library(dplyr)

# View the structure of lunch, the dplyr way
glimpse(lunch)

So here, we first load dplyr with the library() command, then call glimpse() with a single argument, lunch.
Another extremely helpful function is summary(), which, when applied to a data frame, provides a useful summary of each column. Since the lunch data are entirely integers and numerics, we see a summary of the distribution of each column, including the minimum and maximum, the mean, and the 25th, 50th, and 75th percentiles, also referred to as the first quartile, the median, and the third quartile, respectively.

#view a summary
summary(lunch)

As you will soon see, when faced with character or factor variables, summary() will produce different summaries.
To review, you have seen how you can use the class() function to see the class of a dataset, the dim() function to view its dimensions, the names() function to view its column names, str() to view its structure, glimpse() to do the same in a slightly enhanced format, and summary() to see a helpful summary of each column.

Getting a feel for your data

The first thing to do when you get your hands on a new dataset is to understand its structure. There are
several ways to go about this in R, each of which may reveal different issues with your data that require
attention.
In this course, we are only concerned with data that can be expressed in table format (i.e. two dimensions,
rows and columns). As you may recall from earlier courses, tables in R often have the type data.frame. You
can check the class of any object in R with the class() function.
Once you know that you are dealing with tabular data, you may also want to get a quick feel for the contents
of your data. Before printing the entire dataset to the console, it’s probably worth knowing how many rows
and columns there are. The dim() command tells you this.

Instructions:
We’ve loaded a dataset called bmi into your workspace. The data, which give the (age standardized) mean
body mass index (BMI) among males in each country for the years 1980-2008, come from the School of Public
Health, Imperial College London.
1) Check the class of bmi

2) Find the dimensions of bmi

3) Print the bmi column names


Answer:
# Check the class of bmi
class(bmi)

# Check the dimensions of bmi
dim(bmi)

# View the column names of bmi
names(bmi)

Viewing the Structure of your data

Since bmi doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str()
(for structure) command. In addition to the class and dimensions of your entire dataset, str() will tell you
the class of each variable and give you a preview of its contents.
Although we won’t go into detail on the dplyr package in this lesson (see the Data Manipulation in R with
dplyr course), the glimpse() function from dplyr is a slightly cleaner alternative to str(). str() and glimpse() give you a preview of your data, which may reveal issues with the way columns are labelled, how variables
are encoded, etc.
You can use the summary() command to get a better feel for how your data are distributed, which may reveal
unusual or extreme values, unexpected missing data, etc. For numeric variables, this means looking at means,
quartiles (including the median), and extreme values. For character or factor variables, you may be curious
about the number of times each value appears in the data (i.e. counts), which summary() also reveals.

Instructions:
1) View the structure of bmi using the traditional method

2) Load the dplyr package

3) View the structure of bmi using dplyr

4) Look at a summary() of bmi
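Answer (a minimal sketch, assuming the bmi data frame is already loaded in your workspace):
# View the structure of bmi
str(bmi)

# Load the dplyr package
library(dplyr)

# View the structure of bmi, the dplyr way
glimpse(bmi)

# View a summary of bmi
summary(bmi)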

Exploring raw data - 2.

So we have seen some useful summaries of our data, but there is no substitute for just looking at it. The head() function shows us the first 6 rows by default. If you add the additional argument n, you can control how many rows to display; for example, head(lunch, n=15) will display the first 15 rows of the data.
We can also view the bottom of lunch with the tail() function, which displays the last six rows by default, but again that behavior can be altered in the same way with the n argument.
#View the top of the data
head(lunch)

# View the first 15 rows of the data
head(lunch, n=15)

# View the bottom of the data
tail(lunch)

Viewing the top and bottom of your data only gets you so far. Sometimes the easiest way to identify issues with the data is to plot them. Here, we use hist() to plot a histogram of the percent free and reduced-price lunch column, which quickly gives us a sense of the distribution of this variable.
#view histogram
hist(lunch$perc_free_red)

It looks like the value of this variable falls between 50 and 60 for 20 out of the 46 years contained in the lunch dataset. Finally, we produce a scatterplot with the plot() function to look at the relationship between two variables.
#view plot of two variables
plot(lunch$year, lunch$perc_free_red)

In this case, we clearly see that the percentage of lunches that are either free or reduced-price has been steadily rising over the years, going from roughly 15 to 70 percent between 1969 and 2014.
To review, head() and tail() let you view the top and bottom of your data, respectively. Of course, you can also just print your data to the console, which may be fine when working with small datasets like lunch, but it is definitely not recommended when working with much larger datasets. Lastly, hist() will show you a histogram of a single variable, and plot() can be used to produce a scatterplot showing the relationship between two variables.

Looking at your data
You can look at all the summaries you want, but at the end of the day, there is no substitute for looking at
your data – either in raw table form or by plotting it.
The most basic way to look at your data in R is by printing it to the console. As you may know from
experience, the print() command is not even necessary; you can just type the name of the object. The
downside to this option is that R will attempt to print the entire dataset, which can be a nuisance if the
dataset is too large.
One way around this is to use the head() and tail() commands, which only display the first and last 6 rows of
data, respectively. You can view more (or fewer) rows by providing as a second argument to the function the
number of rows you wish to view. These functions provide a useful method for quickly getting a sense of your
data without overly cluttering the console.

Instructions:
1) Print the full dataset to the console (you don’t need print() to do this)

2) View the first 6 rows of bmi

3) View the first 15 rows of bmi

4) View the last 6 rows of bmi

5) View the last 10 rows of bmi


Answer:
# Print bmi to the console
print(bmi)

# View the first 6 rows
head(bmi, n=6)

# View the first 15 rows
head(bmi, n=15)

# View the last 6 rows
tail(bmi, n=6)

# View the last 10 rows
tail(bmi, 10)

Visualizing your data

There are many ways to visualize data. Since this is not a course about data visualization, we will only touch
on two types of plots that may be useful for quickly identifying extreme or suspicious values in your data:
histograms and scatter plots.
A histogram, created with the hist() function, takes a vector (i.e. column) of data, breaks it up into intervals,
then plots as a vertical bar the number of instances within each interval. A scatter plot, created with the
plot() function, takes two vectors (i.e. columns) of data and plots them as a series of (x, y) coordinates on a
two-dimensional plane.
Let’s look at a quick example of each.

Instructions:
For the bmi dataset:
1) Use hist() to look at the distribution of average BMI across all countries in 2008

2) Use plot() to see how each country’s average BMI in 1980 (x-axis) compared with its BMI in 2008
(y-axis)
Answer:
# Histogram of BMIs from 2008
hist(bmi$Y2008)

# Scatter plot comparing BMIs from 1980 to those from 2008
plot(bmi$Y1980, bmi$Y2008)

Introduction to tidy data

The concepts underlying tidy data have been around for decades and may seem familiar if you have ever worked with relational databases. In 2014, Hadley Wickham published a paper in the Journal of Statistical Software called "Tidy Data", which summarizes these concepts in a clear and concise way.
In this note, we will introduce these concepts and review some simple and practical methods of implementing them in R. Here is an example of tidy data: for each of four people, we have a name, age, eye color, and height (in feet and inches).
Looking across the first row, we see that Jake is 34 years old, has hazel eyes, and is 6 feet 1 inch tall. This row is called an observation. Looking down the second column, we see the distribution of ages among the four subjects. This column is called a variable or attribute, and each individual age is considered a value of the age variable.
If we were to give this table a name, we might call it something like people, since each observation describes characteristics of a single person, also known as an observational unit. If some rows instead described characteristics of their pets, or perhaps their favorite movies, then the table would be said to contain more than one type of observational unit, which never happens in tidy data. In tidy data, observations are represented as rows, variables are represented as columns, and there is exactly one type of observational unit per table. Said differently, a dataset is a collection of values, and each value belongs to both a variable and an observation. A variable contains all values that measure the same attribute across units, and an observation contains all values measured on the same unit across attributes.
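As a concrete sketch, a tidy version of this table could be built as follows; Jake's row matches the description above, while the remaining names and values are made up purely for illustration.
# A small tidy table: one row per person (observation),
# one column per variable (name, age, eye color, height).
# Only Jake's values come from the text; the rest are illustrative.
people <- data.frame(
  name      = c("Jake", "Alice", "Tom", "Maria"),
  age       = c(34, 55, 41, 28),
  eye_color = c("hazel", "blue", "brown", "brown"),
  height    = c("6'1\"", "5'4\"", "5'11\"", "5'6\""),
  stringsAsFactors = FALSE
)
people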
Now that we have seen the principles of tidy data, let's look at a simple example of messy data and try to figure out what's wrong with it.
Observations are still in rows, and we only have one type of observational unit (people) in this table. It even appears that each column is a variable. Notice, however, that the columns brown, blue, and other are actually values of what was previously the eye color variable. This is a common symptom of messy data: column headers are values, not variable names. When talking about datasets, it is often convenient to refer to them as either wide or long. Although these definitions are somewhat imprecise, they generally refer to situations where you have more columns than rows, or more rows than columns, respectively.
However, a less strict interpretation is simply that wide data tends to represent key attributes of the data horizontally in a table rather than vertically, as in the messy example just described; the opposite is true for long datasets.
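A plausible version of this messy table (its exact layout is assumed here purely for illustration) spreads the eye-color values across brown, blue, and other columns, so that the column headers are values rather than variable names:
# A hypothetical messy (wide) layout: eye-color values have been
# promoted to column headers instead of living in one variable.
people_messy <- data.frame(
  name  = c("Jake", "Alice", "Tom", "Maria"),
  brown = c(FALSE, FALSE, TRUE, TRUE),
  blue  = c(FALSE, TRUE, FALSE, FALSE),
  other = c(TRUE, FALSE, FALSE, FALSE)  # Jake's hazel eyes fall under "other"
)
people_messy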

Introduction to Tidyr

tidyr is a wonderfully simple package written by Hadley Wickham for the purpose of helping you apply the principles of tidy data. We won't be covering every feature of the package in this note, but will rather focus on a subset of functions that allow you to accomplish some of the most common data cleaning tasks.
In this first example, we start with a wide dataset called wide_df. We wish to make it long by turning the column names A, B, and C into values of a new variable called my_key. Using the gather() function, no information is lost in the process: we still have a value for each combination of x and y with A, B, and C, but these values are now represented vertically in a column that we have labeled my_val. We refer to this process as gathering the columns A, B, and C into key-value pairs. We use the -col argument to make it clear that we want to gather all columns except the first column, labeled col.
wide_df <- read.csv("wide_df.csv")

#look at wide_df
wide_df

##   col A B C
## 1   x 1 2 3
## 2   y 4 5 6
#gather the columns of wide_df
library(tidyr)

long_df <- gather(wide_df, my_key, my_val, -col)
long_df

##   col my_key my_val
## 1   x      A      1
## 2   y      A      4
## 3   x      B      2
## 4   y      B      5
## 5   x      C      3
## 6   y      C      6
In general, the gather() function takes four arguments: gather(data, key, value, ...)
• data: a data frame
• key: bare name of the new key column
• value: bare name of the new value column
• ...: bare names of the columns to gather, or, each prefaced with a minus sign, the columns to ignore
Note that none of these arguments require quotes around the variable names.
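For example, on wide_df the two calls below produce the same long result: the first names the columns to gather explicitly, while the second gathers every column except col.
# Name the columns to gather explicitly ...
gather(wide_df, my_key, my_val, A, B, C)

# ... or gather every column except "col"
gather(wide_df, my_key, my_val, -col)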

Spreading key-value pairs into columns


In this example, we start with the long dataset from the last example and do the opposite, spreading the key-value pairs represented in the my_key and my_val columns back out into separate columns using the spread() function:
spread(data, key, value)
The first argument to spread() is the name of the dataset, long_df. The second argument is the name of the key column, my_key, and the third argument is the name of the value column, my_val. You can see that the result is the original wide dataset, which we referred to as wide_df in the previous example.

#look at long_df
long_df

##   col my_key my_val
## 1   x      A      1
## 2   y      A      4
## 3   x      B      2
## 4   y      B      5
## 5   x      C      3
## 6   y      C      6
#spread the key-value pairs of long_df
spread(long_df, my_key, my_val)

##   col A B C
## 1   x 1 2 3
## 2   y 4 5 6
Again, note that all arguments are unquoted.

Gathering columns into key-value pairs

The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.
The easiest way to visualize the effect of gather() is that it makes wide datasets long. As you saw in the video, running the following command on wide_df will make it long:
gather(wide_df, my_key, my_val, -col)
Experiment with this in the console before attempting the exercise.

Instructions:
1) Apply the gather() function to bmi, saving the result to bmi_long. This will create two new columns:
• year, containing as values what are currently column headers
• bmi_val, containing the actual BMI values

2) View the first 20 rows of bmi_long


Answer:
# Apply gather() to bmi and save the result as bmi_long
bmi_long <- gather(bmi, year, bmi_val, -Country)

# View the first 20 rows of the result
head(bmi_long, 20)

Spreading key-value pairs into columns

The opposite of gather() is spread(), which takes key-value pairs and spreads them across multiple columns.
This is useful when values in a column should actually be column names (i.e. variables). It can also make
data more compact and easier to read.
The easiest way to visualize the effect of spread() is that it makes long datasets wide. As you saw in the
video, running the following command will make long_df wide:
spread(long_df, my_key, my_val)
Experiment with this in the console before attempting the exercise.

Instructions:
1) Use spread() to reverse the operation that you performed in the last exercise with gather(). In other
words, make bmi_long wide again, saving the result to bmi_wide

2) View the head of bmi_wide


Answer:
# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)

# View the head of bmi_wide
head(bmi_wide)
