Data manipulation in R involves a series of tasks to clean, transform, and analyze data. Some of the most
common techniques include filtering, sorting, summarizing, and reshaping data. Below are some key
techniques and functions in R for data manipulation, with examples:
1. Loading Required Libraries
To manipulate data in R, you often need packages like dplyr, tidyr, and data.table. Here's how to install and
load them:
# Install the packages once (skip this step if they are already installed)
install.packages(c("dplyr", "tidyr", "data.table"))
# Load dplyr for data manipulation
library(dplyr)
# Load tidyr for reshaping data
library(tidyr)
# Load data.table for fast manipulation
library(data.table)
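Because install.packages() downloads a package again each time it is called, scripts that are run repeatedly often install only what is missing. The following is a minimal sketch of that idea, assuming the three packages named above; the variable names pkgs and missing_pkgs are illustrative only.

```r
# Packages used in this guide (the name `pkgs` is just for illustration)
pkgs <- c("dplyr", "tidyr", "data.table")

# Find the packages that are not yet installed
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]

# Install only what is missing
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}

# Attach each package; character.only = TRUE lets library() accept a string
invisible(lapply(pkgs, library, character.only = TRUE))
```

Note that dplyr and data.table export a few functions with the same names (for example first() and last()), so the package loaded last masks the other's version.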
2. Creating Data
You can create data frames in R using data.frame() or tibble() (from the tibble package). Example:
# Example data frame
data <- data.frame(
  ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 45),
  Score = c(85, 90, 88, 95, 89)
)
# View the data
print(data)
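The text mentions tibble() as an alternative constructor. As a rough sketch, the same example data could be built as a tibble like this; the object name data_tbl is an assumption made for illustration.

```r
library(tibble)

# Same example data as above, built as a tibble instead of a data.frame
data_tbl <- tibble(
  ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 45),
  Score = c(85, 90, 88, 95, 89)
)

# Printing a tibble also shows its dimensions and column types
print(data_tbl)

# A tibble is still a data frame, so data-frame functions keep working
is.data.frame(data_tbl)
```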
3. Selecting Columns and Rows
You can select specific rows and columns using various techniques:
a. Select Columns
# Select specific columns
data %>%
  select(Name, Score)
b. Select Rows
# Filter rows based on conditions
data %>%
  filter(Age > 30)
c. Select Both
# Select specific rows and columns
data %>%
  filter(Age > 30) %>%
  select(Name, Age)
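For comparison, the same row and column selection can also be written with base R indexing. The sketch below rebuilds the example data so it runs on its own; the result names dplyr_result and base_result are illustrative.

```r
library(dplyr)

data <- data.frame(
  ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 45),
  Score = c(85, 90, 88, 95, 89)
)

# dplyr: filter rows, then select columns
dplyr_result <- data %>%
  filter(Age > 30) %>%
  select(Name, Age)

# Base R: rows by logical condition, columns by name
base_result <- data[data$Age > 30, c("Name", "Age")]

print(dplyr_result)
print(base_result)
```

Both results contain the same rows, but base R keeps the original row names (3, 4, 5) while dplyr renumbers them from 1.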
4. Adding/Modifying Columns
You can add or modify columns with mutate():
# Add a new logical column based on an existing one
# (the comparison already returns TRUE/FALSE, so ifelse() is not needed)
data <- data %>%
  mutate(Score_Above_90 = Score > 90)
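mutate() can also overwrite an existing column and derive a multi-level category in the same call. The sketch below uses case_when() from dplyr; the 5-point bonus, the grade thresholds, and the Grade column are hypothetical values chosen only for illustration.

```r
library(dplyr)

data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Score = c(85, 90, 88, 95, 89)
)

graded <- data %>%
  mutate(
    # Modify an existing column: add a hypothetical 5-point bonus, capped at 100
    Score = pmin(Score + 5, 100),
    # Create a new column; conditions are checked in order, first match wins.
    # Score here already refers to the updated value from the line above.
    Grade = case_when(
      Score >= 95 ~ "A",
      Score >= 90 ~ "B",
      TRUE ~ "C"
    )
  )

print(graded)
```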
5. Summarizing Data
You can summarize your data using summarize() (or summarise() in British spelling).
# Get the average score by Age
data %>%
  group_by(Age) %>%
  summarize(Average_Score = mean(Score))
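Grouping is most informative when a group contains more than one row. The following sketch computes several statistics per group at once; the Team column and its values are hypothetical and not part of the example data above.

```r
library(dplyr)

# Hypothetical data with a grouping column that repeats
scores <- data.frame(
  Team = c("A", "A", "B", "B", "B"),
  Score = c(85, 90, 88, 95, 89)
)

team_summary <- scores %>%
  group_by(Team) %>%
  summarize(
    N = n(),                      # number of rows in each group
    Average_Score = mean(Score),  # mean score per group
    Max_Score = max(Score),       # highest score per group
    .groups = "drop"              # return an ungrouped result
  )

print(team_summary)
```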
6. Arranging (Sorting) Data
Use arrange() to sort data by one or more variables:
# Sort data by Score in descending order
data %>%
  arrange(desc(Score))
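When arrange() receives several variables, later ones break ties in earlier ones. A small sketch, using hypothetical ages that repeat so that ties actually occur:

```r
library(dplyr)

# Hypothetical data: ages repeat so the second sort key matters
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(30, 25, 30, 25, 40),
  Score = c(85, 90, 88, 95, 89)
)

# Sort by Age ascending, then by Score descending within each age
sorted <- people %>%
  arrange(Age, desc(Score))

print(sorted)
```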
7. Reshaping Data
You can reshape data using pivot_longer() and pivot_wider() from tidyr.
a. Pivot Longer
Converting wide format data into long format:
# Example: pivoting data from wide to long format
wide_data <- data.frame(
  ID = 1:3,
  Math = c(90, 85, 80),
  Science = c(88, 92, 79)
)
# Store the long-format result so it can be pivoted back to wide format
long_data <- wide_data %>%
  pivot_longer(cols = c(Math, Science), names_to = "Subject", values_to = "Score")
long_data
b. Pivot Wider
Converting long format data back to wide format:
# Example: pivoting data from long to wide format
long_data %>%
  pivot_wider(names_from = "Subject", values_from = "Score")
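The two pivots are inverses of each other, which can be checked with a round trip. The sketch below is self-contained; the object names wide, long, and wide_again are chosen for illustration.

```r
library(dplyr)  # provides %>%
library(tidyr)

wide <- data.frame(
  ID = 1:3,
  Math = c(90, 85, 80),
  Science = c(88, 92, 79)
)

# Wide to long: one row per ID and subject
long <- wide %>%
  pivot_longer(cols = c(Math, Science), names_to = "Subject", values_to = "Score")

nrow(long)  # 3 IDs x 2 subjects = 6 rows

# Long back to wide: one column per subject again
wide_again <- long %>%
  pivot_wider(names_from = Subject, values_from = Score)

print(wide_again)
```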
8. Handling Missing Data
You can handle missing data by dropping incomplete rows with na.omit(), or by replacing NA values, for example with base R indexing or with mutate() and ifelse().
# Removing rows with missing values
clean_data <- na.omit(data)
# Impute missing values (example: replace NAs with 0)
data$Score[is.na(data$Score)] <- 0
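The mutate() with ifelse() approach mentioned above might look like the following sketch. It uses a small hypothetical data frame that actually contains an NA, and also shows replace_na() from tidyr as an alternative; the column names Score_Mean and Score_Zero are illustrative.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one missing score
scores <- data.frame(ID = 1:4, Score = c(80, NA, 90, 100))

# Count missing values before deciding how to handle them
sum(is.na(scores$Score))

imputed <- scores %>%
  mutate(
    # Replace NA with the mean of the observed scores
    Score_Mean = ifelse(is.na(Score), mean(Score, na.rm = TRUE), Score),
    # Replace NA with a fixed value using tidyr::replace_na()
    Score_Zero = replace_na(Score, 0)
  )

print(imputed)
```

Which replacement is appropriate depends on the analysis: filling with 0 changes the mean of the column, while filling with the mean preserves it but understates the variance.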
9. Merging Data Frames
To merge data frames, use left_join(), right_join(), or inner_join() from dplyr:
# Example: Merging two data frames by a common column
data1 <- data.frame(ID = 1:3, Name = c("Alice", "Bob", "Charlie"))
data2 <- data.frame(ID = 1:3, Score = c(85, 90, 88))
merged_data <- left_join(data1, data2, by = "ID")
print(merged_data)
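The join types only differ when some keys have no match in the other table. The sketch below uses hypothetical data in which ID 1 appears only in the first table and ID 4 only in the second; it also includes full_join(), which is part of dplyr but not listed above.

```r
library(dplyr)

# Hypothetical tables with partially overlapping IDs
students <- data.frame(ID = 1:3, Name = c("Alice", "Bob", "Charlie"))
results  <- data.frame(ID = 2:4, Score = c(90, 88, 70))

# Keep every row of students; unmatched scores become NA
left <- left_join(students, results, by = "ID")

# Keep only IDs present in both tables
inner <- inner_join(students, results, by = "ID")

# Keep every row of results; unmatched names become NA
right <- right_join(students, results, by = "ID")

# Keep all IDs from either table
full <- full_join(students, results, by = "ID")

print(left)
print(inner)
print(right)
print(full)
```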
10. Data Table for Fast Manipulation
You can use data.table for faster operations, especially on large datasets:
# Convert data frame to data.table
dt <- as.data.table(data)
# Example: Filtering data
dt[Age > 30]
# Example: Summarizing data
dt[, .(Average_Score = mean(Score)), by = Age]
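data.table also supports adding columns by reference, ordering, and counting rows inside the same bracket syntax. A minimal self-contained sketch, reusing the example data; the names top and counts are illustrative.

```r
library(data.table)

dt <- data.table(
  ID = 1:5,
  Name = c("Alice", "Bob", "Charlie", "David", "Eva"),
  Age = c(25, 30, 35, 40, 45),
  Score = c(85, 90, 88, 95, 89)
)

# Add a column by reference with := (modifies dt in place, no reassignment)
dt[, Score_Above_90 := Score > 90]

# Filter rows, then order by Score descending and keep two columns
top <- dt[Age > 30][order(-Score), .(Name, Score)]

# Count rows per group with the special symbol .N
counts <- dt[, .N, by = Score_Above_90]

print(top)
print(counts)
```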
11. Other Useful Functions
arrange(): Sort data.
group_by() and summarize(): Group data and calculate summary statistics.
mutate(): Create new columns or modify existing ones.
filter(): Subset rows based on conditions.
spread() and gather() (superseded by pivot_wider() and pivot_longer() in tidyr): Reshape data.
Example Workflow:
library(dplyr)
# Example data manipulation pipeline
data %>%
  filter(Age > 30) %>%
  mutate(Score_Category = ifelse(Score > 90, "High", "Low")) %>%
  group_by(Score_Category) %>%
  summarize(Average_Age = mean(Age), Average_Score = mean(Score))
Conclusion
These are some basic techniques for manipulating data in R. You can combine these operations to
perform more complex data cleaning and transformation tasks. The dplyr and tidyr packages provide a
powerful, readable, and consistent syntax for these operations.