Ex 5 Detailed explanations
2. Setting File Path
R
file_path <- paste0(dirname(rstudioapi::getSourceEditorContext()$path), "/")
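● Builds the path to the folder that contains the current R script and stores it in file_path. It does not change the working directory; the path is prepended to file names when reading data.
● Keeps the path pointing at the script's folder wherever the project is moved. It relies on rstudioapi, so it works only when the script is open in RStudio and has been saved.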
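Because the call above depends on RStudio, a fallback can be useful when the script runs elsewhere. The following is a minimal sketch, not part of the exercise: get_script_dir is a hypothetical helper name, and falling back to the working directory is an assumption about what is acceptable.

```r
# Sketch: resolve the script folder, with a fallback outside RStudio.
# get_script_dir is a hypothetical helper, not part of the exercise.
get_script_dir <- function() {
  in_rstudio <- requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()
  if (in_rstudio) {
    script_path <- rstudioapi::getSourceEditorContext()$path
    # An unsaved script has an empty path, so fall through in that case.
    if (!is.null(script_path) && nzchar(script_path)) {
      return(paste0(dirname(script_path), "/"))
    }
  }
  # Fallback (an assumption): use the current working directory.
  paste0(getwd(), "/")
}

file_path <- get_script_dir()
```

Either way, file_path ends with "/" so that paste0(file_path, "BagsOfApples.csv") yields a valid path.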
3. Reading Data
R
BOA <- read.csv(paste0(file_path, "BagsOfApples.csv"), sep=";")
BOO <- read_excel(paste0(file_path, "BagsOfOrangesNA.xlsx"))
Geo <- read_excel(paste0(file_path, "Geo_dim.xlsx"))
● Reads three datasets into R:
○ BOA (Bags of Apples):
■ Reads a semicolon-delimited CSV file using read.csv().
■ Assumes the file is located in the directory set by file_path.
○ BOO (Bags of Oranges):
■ Reads an Excel file using read_excel() from the readxl package.
○ Geo (Geo-dimensions):
■ Reads an Excel file with geographic mapping information; its Country column is the join key used in step 7.
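The effect of sep=";" is easy to see on a small file. This sketch writes a made-up semicolon-delimited file to a temporary location (the column names and values are invented for illustration) and reads it with and without the separator.

```r
# Write a tiny semicolon-delimited file to a temporary location.
# The column names and values are made up for illustration.
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("bagNo;weight;prize", "1;2.5;4", "2;1.5;3"), tmp_csv)

# With the default sep = "," each line lands in a single column.
wrong <- read.csv(tmp_csv)
ncol(wrong)

# With sep = ";" the three columns are split correctly.
right <- read.csv(tmp_csv, sep = ";")
str(right)
```

If a file also uses a comma as the decimal mark, read.csv2() may be a better fit, since its defaults are sep=";" and dec=",".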
4. Adding Columns
R
BOA$fruits <- "Apples"
BOO$fruits <- "Oranges"
● Adds a new column fruits to each dataframe:
○ For BOA, all rows are labeled as "Apples".
○ For BOO, all rows are labeled as "Oranges".
5. Combining Data (Elaborated)
R
BOF <- rbind(BOO, BOA)
How It Works:
● rbind():
○ Combines two dataframes (BOO and BOA) by stacking rows.
○ Requires both dataframes to have the same set of column names. Columns are matched by name, so their order may differ, but a missing or extra column causes an error.
Additional Use Cases:
1. If Columns Don’t Match Exactly:
○ bind_rows(BOO, BOA) from the dplyr package can be used to allow differing columns. Missing columns will be filled with NA.
2. Adding an Identifier for Data Source:
By using the .id parameter in bind_rows, you can add an extra column indicating the
source of each row:
R
BOF <- bind_rows(BOO = BOO, BOA = BOA, .id = "source")
○ Here, rows from BOO would have source = "BOO", and rows from BOA
would have source = "BOA".
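The difference between rbind() and bind_rows() shows up as soon as the column sets differ. A small sketch with toy stand-ins for BOO and BOA (the columns and values are invented, not taken from the exercise files):

```r
library(dplyr)

# Toy stand-ins for BOO and BOA; the columns are invented for illustration.
oranges <- data.frame(weight = c(1.2, 0.9), origin = c("Spain", "Italy"))
apples  <- data.frame(weight = 2.0, origin = "France", variety = "Gala")

# rbind(oranges, apples) would stop with an error here because the
# column sets differ. bind_rows() matches columns by name and fills
# the gaps with NA.
combined <- bind_rows(BOO = oranges, BOA = apples, .id = "source")
combined
```

The source column takes its labels from the argument names, and the orange rows get NA in variety.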
6. Replacing Values
R
BOF$origin <- str_replace_all(BOF$origin, "California", "United States")
● Uses str_replace_all() from the stringr package to replace "California"
with "United States" in the origin column.
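One detail worth knowing: str_replace_all() matches the pattern anywhere inside a string, not only whole values. The sketch below uses an invented vector to show the difference between a substring replacement and an exact-value replacement.

```r
library(stringr)

# Invented values for illustration.
origin <- c("California", "Spain", "Northern California")

# The pattern matches anywhere inside a string, so the third value changes too.
replaced <- str_replace_all(origin, "California", "United States")
replaced

# To change only values that are exactly "California", compare whole strings.
origin[origin == "California"] <- "United States"
origin
```

For a column that holds only plain place names, as the exercise data appears to, both approaches give the same result.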
7. Merging Datasets
R
BOF <- left_join(BOF, Geo, by = c("origin" = "Country"))
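● Joins BOF with Geo by matching the origin column in BOF to the Country column in Geo.
● As a left join, it keeps every row of BOF; rows whose origin has no match in Geo get NA in the columns added from Geo.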
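The behaviour of the join on unmatched keys can be checked on toy tables. The data below is invented; only the column names origin, Country, and Region follow the exercise.

```r
library(dplyr)

# Toy tables; the values are invented for illustration.
fruit <- data.frame(origin = c("Spain", "United States", "Atlantis"),
                    weight = c(1.2, 2.0, 0.8))
geo <- data.frame(Country = c("Spain", "United States"),
                  Region  = c("Europe", "North America"))

joined <- left_join(fruit, geo, by = c("origin" = "Country"))
joined

# Rows whose origin has no match in geo keep their data but get NA in Region.
filter(joined, is.na(Region))
```

Listing the unmatched rows this way is a quick check for spelling differences between the two tables before missing values are removed.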
8. Renaming Columns
R
BOF <- rename(BOF, price = prize)
● Renames the prize column to price.
9. Removing Columns
R
BOF <- select(BOF, -bagNo)
● Removes the bagNo column using select().
10. Handling Missing Data
R
anyNA(BOF)
BOF <- na.omit(BOF)
● anyNA(BOF):
○ Checks for the presence of NA (missing values) in the dataframe.
● na.omit(BOF):
○ Removes all rows with any NA values.
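Before dropping rows it can help to see where the missing values are and how many rows will be lost. A base R sketch on an invented data frame:

```r
# Toy data frame with missing values; the values are invented.
df <- data.frame(price  = c(4, NA, 3),
                 weight = c(2, 1.5, NA),
                 origin = c("Spain", "Italy", "France"))

anyNA(df)             # TRUE if at least one value is missing
colSums(is.na(df))    # number of missing values per column

cleaned <- na.omit(df)
nrow(df) - nrow(cleaned)   # how many rows were dropped
```

Here two of the three rows are removed, because na.omit() drops a row as soon as any one of its columns is NA.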
11. Filtering Data
R
BOF_europe <- filter(BOF, Region == "Europe")
● Creates a subset of BOF where the Region column equals "Europe".
12. Adding Calculated Columns
R
BOF <- mutate(BOF, ppk = price / weight)
● mutate():
○ Adds a new column ppk (price per kilo), calculated as price divided by
weight.
13. Sorting Data
R
arrange(BOF, desc(ppk))
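● arrange():
○ Sorts the dataframe based on the ppk column in descending order.
○ The result is only printed here; BOF itself is unchanged unless the result is assigned, e.g. BOF <- arrange(BOF, desc(ppk)).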
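The two steps above, adding ppk and sorting by it, can be tried on a small invented data frame. The sketch assumes weight is measured in kilograms, which is what "price per kilo" implies.

```r
library(dplyr)

# Invented prices and weights (weight assumed to be in kilograms).
bags <- data.frame(price = c(4, 3, 6), weight = c(2, 1, 4))
bags <- mutate(bags, ppk = price / weight)   # ppk is 2, 3, 1.5

# arrange() returns a sorted copy; bags itself keeps its original order.
arrange(bags, desc(ppk))

# Assign the result to keep the sorted version.
bags_sorted <- arrange(bags, desc(ppk))
```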
14. Saving Processed Data (Elaborated)
R
write.table(BOF, file="bagsoffruits_price.txt", sep="\t",
row.names=FALSE)
How It Works:
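● write.table():
○ Writes the dataframe BOF to a file named bagsoffruits_price.txt. Because the name is a relative path, the file is created in the current working directory, which is not necessarily the folder stored in file_path.
○ The file uses tab (\t) as the delimiter.
○ row.names=FALSE leaves out the row names.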
Additional Use Cases:
1. Changing the Delimiter:
You can change sep to any other delimiter, such as commas for a CSV file:
R
write.table(BOF, file="bagsoffruits_price.csv", sep=",",
row.names=FALSE)
2. Adding Row Names:
○ row.names=TRUE (the default for write.table()) writes the row names as an extra first column.
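3. Controlling Quotes:
Text fields are wrapped in double quotes by default (quote=TRUE); the call below states this explicitly. Use quote=FALSE to write them without quotes: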
R
write.table(BOF, file="bagsoffruits_price.txt", sep="\t",
row.names=FALSE, quote=TRUE)
4. Using write.csv() for Simplicity:
For CSV files, write.csv() can be used as a shortcut:
R
write.csv(BOF, file="bagsoffruits_price.csv", row.names=FALSE)
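A quick way to confirm that a saved file has the expected layout is to write it to a temporary location and read it back. This sketch uses an invented data frame and a temporary file rather than the exercise's output file.

```r
# Invented data for illustration.
bags <- data.frame(origin = c("Spain", "United States"),
                   price  = c(4, 3))

out_file <- tempfile(fileext = ".txt")
write.table(bags, file = out_file, sep = "\t", row.names = FALSE)

# Inspect the raw lines: headers and character values are quoted by default.
readLines(out_file)

# Read the file back; read.delim() expects tab-separated input with a header.
bags_back <- read.delim(out_file)
bags_back
```

The quotes keep values such as "United States" intact when the file is read again, even if a text field were to contain the delimiter.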
15. Counting and Grouping Data
R
count(BOF, foodLabel)
● Counts the number of rows for each unique value in the foodLabel column.
R
arrange(count(BOF, foodLabel), desc(n))
● Arranges the counts in descending order by frequency.
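The count-then-sort pattern can also be written in one call with the sort argument of count(). A sketch on invented labels:

```r
library(dplyr)

# Invented labels for illustration.
bags <- data.frame(foodLabel = c("organic", "standard", "organic", "organic"))

# sort = TRUE orders the result by n, largest first,
# which matches arrange(count(...), desc(n)).
counts <- count(bags, foodLabel, sort = TRUE)
counts
```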
16. Summarizing Data
R
BOFg <- group_by(BOF, foodLabel)
BOFgn <- summarise(BOFg, meanppk = mean(ppk))
● group_by():
○ Groups data by the foodLabel column.
● summarise():
○ Calculates the mean ppk for each group.
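The same grouping and summarising can be written as a single pipeline, which avoids the intermediate BOFg object. This sketch uses invented data and adds a group size column n, which is not part of the exercise.

```r
library(dplyr)

# Invented data for illustration.
bags <- data.frame(foodLabel = c("organic", "organic", "standard"),
                   ppk = c(2, 4, 3))

summary_tbl <- bags %>%
  group_by(foodLabel) %>%
  summarise(meanppk = mean(ppk), n = n())
summary_tbl
```

Showing n next to the mean makes it easier to judge how much data each group average rests on.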
17. Troubleshooting and Assistance
● The document recommends using tools like ChatGPT or Copilot to troubleshoot
errors or get clarification on concepts.