
Cleaning Data in R

mohammad nasir abdullah


1/29/2018

Contents
Exploring Raw Data
    Getting a feel for your data
    Viewing the Structure of your data
Exploring raw data - 2
    Visualizing your data
Introduction to tidy data
    Introduction to Tidyr
    Gathering columns into key-value pairs
    Spreading key-value pairs into columns

Exploring Raw Data

The first step in the data cleaning process is exploring your raw data. We can think of data exploration itself as a three-step process: understanding the structure of your data, looking at your data, and visualizing your data. To understand the structure of your data you have several tools at your disposal in R. Here we read in a simple dataset called lunch, which contains information on free, reduced-price, and full-price school lunches served in the US from 1969 through 2014. First, we check the class of the lunch object to verify that it is in fact a data frame, that is, a two-dimensional table consisting of rows and columns, in which each column is a single data type such as numeric, character, and so on.
We then view the dimensions of the dataset with the dim() function; this particular dataset has 46 rows and 7 columns. dim() always displays the number of rows first, followed by the number of columns.
Next, we take a look at the column names of lunch with the names() function; each of the seven columns has a name (year, average free, average reduced, and so on).
#load the lunch data
lunch <- read.csv("datasets/lunch_clean.csv")

# View its class
class(lunch)

# View its dimensions
dim(lunch)

# Look at the column names
names(lunch)

Ok, we are starting to get a feel for things, but let's dig a little deeper.
The str() function (short for structure) is one of the most versatile and useful functions in the R language, because it can be called on any object and will normally provide a useful and compact summary of its internal structure. When passed a data frame, as in this case, str() tells us how many rows and columns we have. Strictly speaking, the function refers to rows as observations and columns as variables, which is true of a tidy dataset but not always the case, as you will see in the next chapter.
str(lunch)

In addition, you see the name of each column followed by its data type and a preview of the data contained in it. The lunch dataset happens to consist entirely of integers and numerics. We will take a closer look at these data types in the next chapter.
The dplyr package offers a slightly different flavor of str() called glimpse(), which shows the same information but attempts to preview as much of each column as will fit neatly on your screen.
#load dplyr
library(dplyr)

# View the structure of lunch, the dplyr way
glimpse(lunch)

So here, we first load dplyr with the library() command, then call glimpse() with a single argument, lunch.
Another extremely helpful function is summary(), which, when applied to a data frame, provides a useful summary of each column. Since the lunch data are entirely integers and numerics, we see a summary of the distribution of each column, including the minimum and maximum, the mean, and the 25th, 50th, and 75th percentiles, also referred to as the first quartile, the median, and the third quartile, respectively.

#view a summary
summary(lunch)

As you will soon see, when faced with character or factor variables, summary() will produce different summaries.
To review, you have seen how you can use the class() function to see the class of a dataset, the dim() function to view its dimensions, the names() function to view its column names, str() to view its structure, glimpse() to do the same in a slightly enhanced format, and summary() to see a helpful summary of each column.

Getting a feel for your data

The first thing to do when you get your hands on a new dataset is to understand its structure. There are
several ways to go about this in R, each of which may reveal different issues with your data that require
attention.
In this course, we are only concerned with data that can be expressed in table format (i.e. two dimensions,
rows and columns). As you may recall from earlier courses, tables in R often have the type data.frame. You
can check the class of any object in R with the class() function.
Once you know that you are dealing with tabular data, you may also want to get a quick feel for the contents
of your data. Before printing the entire dataset to the console, it’s probably worth knowing how many rows
and columns there are. The dim() command tells you this.

Instructions:
We’ve loaded a dataset called bmi into your workspace. The data, which give the (age standardized) mean
body mass index (BMI) among males in each country for the years 1980-2008, come from the School of Public
Health, Imperial College London.
1) Check the class of bmi

2) Find the dimensions of bmi

3) Print the bmi column names


Answer:
# Check the class of bmi
class(bmi)

# Check the dimensions of bmi
dim(bmi)

# View the column names of bmi
names(bmi)

Viewing the Structure of your data

Since bmi doesn’t have a huge number of columns, you can view a quick snapshot of your data using the str()
(for structure) command. In addition to the class and dimensions of your entire dataset, str() will tell you
the class of each variable and give you a preview of its contents.
Although we won’t go into detail on the dplyr package in this lesson (see the Data Manipulation in R with
dplyr course), the glimpse() function from dplyr is a slightly cleaner alternative to str(). str() and glimpse() give you a preview of your data, which may reveal issues with the way columns are labelled, how variables
are encoded, etc.
You can use the summary() command to get a better feel for how your data are distributed, which may reveal
unusual or extreme values, unexpected missing data, etc. For numeric variables, this means looking at means,
quartiles (including the median), and extreme values. For character or factor variables, you may be curious
about the number of times each value appears in the data (i.e. counts), which summary() also reveals.

Instructions:
1) View the structure of bmi using the traditional method

2) Load the dplyr package

3) View the structure of bmi using dplyr

4) Look at a summary() of bmi
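Answer (a minimal sketch, assuming the bmi data frame is already loaded in your workspace):
# View the structure of bmi
str(bmi)

# Load the dplyr package
library(dplyr)

# View the structure of bmi, the dplyr way
glimpse(bmi)

# View a summary of bmi
summary(bmi)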

Exploring raw data - 2.

So we have seen some useful summaries of our data, but there is no substitute for just looking at it. The head() function shows us the first 6 rows by default. If you add the additional argument n, you can control how many rows to display; for example, head(lunch, n=15) will display the first 15 rows of the data.
We can also view the bottom of lunch with the tail() function, which displays the last six rows by default, but again that behavior can be altered in the same way with the n argument.
#View the top of the data
head(lunch)

# View the first 15 rows of the data
head(lunch, n=15)

# View the bottom of the data
tail(lunch)

Viewing the top and bottom of your data only gets you so far. Sometimes the easiest way to identify issues with the data is to plot them. Here, we use hist() to plot a histogram of the percent free and reduced-price lunch column, which quickly gives us a sense of the distribution of this variable.
#view histogram
hist(lunch$perc_free_red)

It looks like the value of this variable falls between 50 and 60 for 20 out of the 46 years contained in the lunch dataset. Finally, we produce a scatterplot with the plot() function to look at the relationship between two variables.
#view plot of two variables
plot(lunch$year, lunch$perc_free_red)

In this case, we clearly see that the percentage of lunches that are either free or reduced-price has been steadily rising over the years, going from roughly 15 to 70 percent between 1969 and 2014.
To review, head() and tail() let you view the top and bottom of your data, respectively. Of course, you can also just print your data to the console, which may be fine when working with small datasets like lunch, but it is definitely not recommended when working with much larger datasets. Lastly, hist() will show you a histogram of a single variable, and plot() can be used to produce a scatterplot showing the relationship between two variables.

Looking at your data
You can look at all the summaries you want, but at the end of the day, there is no substitute for looking at
your data – either in raw table form or by plotting it.
The most basic way to look at your data in R is by printing it to the console. As you may know from
experience, the print() command is not even necessary; you can just type the name of the object. The
downside to this option is that R will attempt to print the entire dataset, which can be a nuisance if the
dataset is too large.
One way around this is to use the head() and tail() commands, which only display the first and last 6 rows of
data, respectively. You can view more (or fewer) rows by providing as a second argument to the function the
number of rows you wish to view. These functions provide a useful method for quickly getting a sense of your
data without overly cluttering the console.

Instructions:
1) Print the full dataset to the console (you don’t need print() to do this)

2) View the first 6 rows of bmi

3) View the first 15 rows of bmi

4) View the last 6 rows of bmi

5) View the last 10 rows of bmi


Answer:
# Print bmi to the console
print(bmi)

# View the first 6 rows
head(bmi, n=6)

# View the first 15 rows
head(bmi, n=15)

# View the last 6 rows
tail(bmi, n=6)

# View the last 10 rows
tail(bmi, 10)

Visualizing your data

There are many ways to visualize data. Since this is not a course about data visualization, we will only touch
on two types of plots that may be useful for quickly identifying extreme or suspicious values in your data:
histograms and scatter plots.
A histogram, created with the hist() function, takes a vector (i.e. column) of data, breaks it up into intervals,
then plots as a vertical bar the number of instances within each interval. A scatter plot, created with the
plot() function, takes two vectors (i.e. columns) of data and plots them as a series of (x, y) coordinates on a
two-dimensional plane.
Let’s look at a quick example of each.

Instructions:
For the bmi dataset:
1) Use hist() to look at the distribution of average BMI across all countries in 2008

2) Use plot() to see how each country’s average BMI in 1980 (x-axis) compared with its BMI in 2008
(y-axis)
Answer:
# Histogram of BMIs from 2008
hist(bmi$Y2008)

# Scatter plot comparing BMIs from 1980 to those from 2008
plot(bmi$Y1980, bmi$Y2008)

Introduction to tidy data

The concepts underlying tidy data have been around for decades and may seem familiar if you have ever worked with relational databases. In 2014, Hadley Wickham published a paper in the Journal of Statistical Software called "Tidy Data", which summarizes these concepts in a clear and concise way.
In this note, we will introduce these concepts and review some simple and practical methods of implementing them in R. Here is an example of tidy data: for each of four people, we have a name, age, eye color, and height (in feet and inches).
Looking across the first row, we see that Jake is 34 years old, has hazel eyes, and is 6 feet 1 inch tall. This row is called an observation. Looking down the second column, we see the distribution of ages among the four subjects. This column is called a variable or attribute, and each individual age is considered a value of the age variable.
If we were to give this table a name, we might call it something like people, since each observation describes characteristics of a single person, also known as an observational unit. If some rows instead described characteristics of their pets, or perhaps their favorite movies, then the table would be said to contain more than one type of observational unit, which never happens in tidy data. In tidy data, observations are represented as rows, variables are represented as columns, and there is exactly one type of observational unit per table. Said differently, a dataset is a collection of values, and each value belongs to both a variable and an observation. A variable contains all values that measure the same attribute across units, and an observation contains all values measured on the same unit across attributes.
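As a concrete sketch, a tidy version of this table could be built as follows; Jake's row matches the description above, while the remaining names and values are made up purely for illustration.
# A small tidy table: one row per person (observation),
# one column per variable (name, age, eye color, height).
# Only Jake's values come from the text; the rest are illustrative.
people <- data.frame(
  name      = c("Jake", "Alice", "Tom", "Maria"),
  age       = c(34, 55, 41, 28),
  eye_color = c("hazel", "blue", "brown", "brown"),
  height    = c("6'1\"", "5'4\"", "5'11\"", "5'6\""),
  stringsAsFactors = FALSE
)
people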
Now that we have seen the principles of tidy data, let's look at a simple example of messy data and try to figure out what's wrong with it.
Observations are still in rows, and we only have one type of observational unit (people) in this table. It even appears that each column is a variable. Notice, however, that the columns brown, blue, and other are actually values of what was previously the eye color variable. This is a common symptom of messy data: column headers are values, not variable names. When talking about datasets, it is often convenient to refer to them as either wide or long. Although these definitions are somewhat imprecise, they generally refer to situations where you have more columns than rows, or more rows than columns, respectively.
However, a less strict interpretation is simply that wide data tends to represent key attributes of the data horizontally in a table rather than vertically, as in the messy example just described; the opposite is true for long datasets.
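A plausible version of this messy table (its exact layout is assumed here purely for illustration) spreads the eye-color values across brown, blue, and other columns, so that the column headers are values rather than variable names:
# A hypothetical messy (wide) layout: eye-color values have been
# promoted to column headers instead of living in one variable.
people_messy <- data.frame(
  name  = c("Jake", "Alice", "Tom", "Maria"),
  brown = c(FALSE, FALSE, TRUE, TRUE),
  blue  = c(FALSE, TRUE, FALSE, FALSE),
  other = c(TRUE, FALSE, FALSE, FALSE)  # Jake's hazel eyes fall under "other"
)
people_messy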

Introduction to Tidyr

tidyr is a wonderfully simple package written by Hadley Wickham for the purpose of helping you apply the principles of tidy data. We won't be covering every feature of the package in this note, but will rather focus on a subset of functions that allow you to accomplish some of the most common data cleaning tasks.
In this first example, we start with a wide dataset called wide_df. We wish to make it long by turning the column names A, B, and C into values of a new variable called my_key. Using the gather() function, no information is lost in the process: we still have a value for each combination of x and y with A, B, and C, but these values are now represented vertically in a column that we have labeled my_val. We refer to this process as gathering the columns A, B, and C into key-value pairs. We use the -col argument to make it clear that we want to gather all columns except the first column, labeled col.
wide_df <- read.csv("wide_df.csv")

#look at wide_df
wide_df

##   col A B C
## 1   x 1 2 3
## 2   y 4 5 6
#gather the columns of wide_df
library(tidyr)

long_df <- gather(wide_df, my_key, my_val, -col)
long_df

##   col my_key my_val
## 1   x      A      1
## 2   y      A      4
## 3   x      B      2
## 4   y      B      5
## 5   x      C      3
## 6   y      C      6
In general, the gather() function takes four arguments: gather(data, key, value, ...)
• data: a data frame
• key: bare name of the new key column
• value: bare name of the new value column
• ...: bare names of the columns to gather, or, each prefaced with a minus sign, the columns to ignore
Note that none of these arguments require quotes around the variable names.
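For example, on wide_df the two calls below produce the same long result: the first names the columns to gather explicitly, while the second gathers every column except col.
# Name the columns to gather explicitly ...
gather(wide_df, my_key, my_val, A, B, C)

# ... or gather every column except "col"
gather(wide_df, my_key, my_val, -col)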

Spreading key-value pairs into columns


In this example, we start with the long dataset from the last example and do the opposite, spreading the key-value pairs represented in the my_key and my_val columns back out into separate columns using the spread() function:
spread(data, key, value)
The first argument to spread() is the name of the dataset, long_df. The second argument is the name of the key column, my_key, and the third argument is the name of the value column, my_val. You can see that the result is the original wide dataset, which we referred to as wide_df in the previous example.

#look at long_df
long_df

##   col my_key my_val
## 1   x      A      1
## 2   y      A      4
## 3   x      B      2
## 4   y      B      5
## 5   x      C      3
## 6   y      C      6
#spread the key-value pairs of long_df
spread(long_df, my_key, my_val)

##   col A B C
## 1   x 1 2 3
## 2   y 4 5 6
Again, note that all arguments are unquoted.

Gathering columns into key-value pairs

The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.
The easiest way to visualize the effect of gather() is that it makes wide datasets long. As you saw in the video, running the following command on wide_df will make it long:
gather(wide_df, my_key, my_val, -col)
Experiment with this in the console before attempting the exercise.

Instructions:
1) Apply the gather() function to bmi, saving the result to bmi_long. This will create two new columns:
• year, containing as values what are currently column headers
• bmi_val, containing the actual BMI values

2) View the first 20 rows of bmi_long


Answer:
# Apply gather() to bmi and save the result as bmi_long
bmi_long <- gather(bmi, year, bmi_val, -Country)

# View the first 20 rows of the result
head(bmi_long, 20)

Spreading key-value pairs into columns

The opposite of gather() is spread(), which takes key-value pairs and spreads them across multiple columns.
This is useful when values in a column should actually be column names (i.e. variables). It can also make
data more compact and easier to read.
The easiest way to visualize the effect of spread() is that it makes long datasets wide. As you saw in the
video, running the following command will make long_df wide:
spread(long_df, my_key, my_val)
Experiment with this in the console before attempting the exercise.

Instructions:
1) Use spread() to reverse the operation that you performed in the last exercise with gather(). In other
words, make bmi_long wide again, saving the result to bmi_wide

2) View the head of bmi_wide


Answer:
# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_val)

# View the head of bmi_wide
head(bmi_wide)
