library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tidyr)
Jun Ryu
February 5, 2023
For today, we import two powerful packages (dplyr
and tidyr
) under tidyverse that will help us clean and manipulate data easily.
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Our first task is to create a dataset that we can work with. We intend to simulate a dataset that resembles a gradebook. Specifically, we want the following:
The simulated gradebook should contain the grades for 100 students and scores (out of 100) for 5 homework and 5 quizzes:
set.seed(605574052) # set seed for reproducibility
gradebook <- data.frame()
for (i in 1:100) {
UID <- round(runif(1, min = 100000000, max = 999999999), 0)
assignments <- round(runif(10, min = 0, max = 100), 0)
data <- c(UID, assignments)
gradebook <- rbind(gradebook, data)
}
colnames(gradebook) <- c("UID", "Homework_1", "Homework_2", "Homework_3", "Homework_4",
"Homework_5", "Quiz_1", "Quiz_2", "Quiz_3", "Quiz_4", "Quiz_5")
head(gradebook)
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
1 532521854 45 17 75 22 13 80
2 486614393 20 50 82 78 41 53
3 576992987 53 54 85 52 29 35
4 143843463 33 2 6 39 39 49
5 773280766 99 25 52 51 17 42
6 486851405 58 38 76 49 86 39
Quiz_2 Quiz_3 Quiz_4 Quiz_5
1 87 83 82 11
2 10 55 73 42
3 74 28 27 46
4 55 97 84 67
5 66 37 100 10
6 7 80 2 16
Now, we will modify the dataset to randomly replace 10% of Homework_4
and Quiz_4
with NA
, respectively.
set.seed(605574052)
gradebook[sample(100, 10), "Homework_4"] <- NA
gradebook[sample(100, 10), "Quiz_4"] <- NA
sum(is.na(gradebook["Homework_4"])) # should output 10
[1] 10
[1] 10
With this, our dataset is all ready to go. In the later parts, we will perform imputation on the data using two different approaches. Here, imputation is the process of replacing missing values with estimated values. The simplest (far from preferred) method to impute values is to replace missing values with the most typical value, say the mean or the median.
In this part, we will try to impute the NA
values WITHOUT using any of the tools provided by dplyr
or tidyr
.
In order to achieve this, we will write a function messy_impute()
with at least three arguments:
NA
values occur may be different)1
, the function imputes the missing values by row, if 2
, by column. if choosing by column, the function should process homework and quizzes separately)messy_impute()
Algorithm:
We first determine how to impute based on the measure of center (mean or median).
We call the helper function, which first establishes the indices of the NA
values in the data frame.
The helper function determines how to impute based on the margin value (row or column).
If we are imputing by row, then we iterate through each NA
value and impute the missing values.
If we are imputing by column, then we iterate through each NA
value but we check what category (homework or quiz) each NA
value falls under.
Based on which category the NA
value falls under, we impute by that category separately.
We return the imputed data frame.
messy_impute_data <- function(df, center_fn, margin, ...) {
# This helper function essentially sets up the imputation process based on the center method (mean or median) and margin (row or column)
# This will throw an error if the margin is any input other than 1 or 2
# This will also throw an error if the score does not belong in either the homework or quiz category
# Args:
# df: the gradebook data frame
# center_fn: center function (either mean or margin)
# margin: the margin (either row or column)
# ...: extra args to pass onto the center_fn
# Return:
# the imputed data frame
index <- which(is.na(df), arr.ind = TRUE)
if (margin == 1) {
for (i in 1:nrow(index)) {
df[index[i,1], index[i,2]] <- center_fn(df[ ,index[i,2]], na.rm = T, ...)
}
} else if (margin == 2) {
for (i in 1:nrow(index)) {
if (grepl("Homework", colnames(gradebook)[index[i,2]])) {
df[index[i,1], index[i,2]] <- apply(gradebook[index[i,1], grepl("Homework", names(gradebook))], 1, center_fn, na.rm = T, ...)
} else if (grepl("Quiz", colnames(gradebook)[index[i,2]])) {
df[index[i,1], index[i,2]] <- apply(gradebook[index[i,1], grepl("Quiz", names(gradebook))], 1, center_fn, na.rm = T, ...)
} else {
stop("This score does not belong to either the homework or quiz category.")
}
}
} else {
stop("The margin must be either 1 (by row) or 2 (by column).")
}
df
}
messy_impute <- function(df, center = "Mean", margin, ...) {
# This function uses the above helper function to actually impute the data frame
# This will throw an error if the measure of center is not mean or median
# Args:
# df: the gradebook data frame
# center: center function with default as mean (either mean or margin)
# margin: the margin (either row or column)
# ...: extra args to pass onto the center_fn in the helper function
# Return:
# the imputed data frame
center <- tolower(center)
if (center == "mean") {
df <- messy_impute_data(df, mean, margin, ...)
} else if (center == "median") {
df <- messy_impute_data(df, median, margin, ...)
} else {
stop("The measure of center must be either 'mean' or 'median'.")
}
df
}
Let’s demonstrate the above function! We will select two students missing Homework_4
and two students missing Quiz_4
from our simulated gradebook and perform imputations.
Test Cases:
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
3 576992987 53 54 85 52 29 35
Quiz_2 Quiz_3 Quiz_4 Quiz_5
3 74 28 NA 46
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
4 143843463 33 2 6 39 39 49
Quiz_2 Quiz_3 Quiz_4 Quiz_5
4 55 97 NA 67
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
8 473921063 84 82 23 NA 53 63
Quiz_2 Quiz_3 Quiz_4 Quiz_5
8 99 10 68 88
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
22 642770743 73 19 41 NA 15 89
Quiz_2 Quiz_3 Quiz_4 Quiz_5
22 52 63 62 57
messy_impute(gradebook, "mean", 1)[3,] #apply row imputing by mean for 1st student missing Quiz_4; expected imputed value: 48.25556
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
3 576992987 53 54 85 52 29 35
Quiz_2 Quiz_3 Quiz_4 Quiz_5
3 74 28 48.25556 46
messy_impute(gradebook, "median", 2)[4,] #apply column imputing by median for 2nd student missing Quiz_4; expected imputed value: 61
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
4 143843463 33 2 6 39 39 49
Quiz_2 Quiz_3 Quiz_4 Quiz_5
4 55 97 61 67
messy_impute(gradebook, "mean", 2)[8, ] #apply column imputing by mean for 1st student missing Homework_4; expected imputed value: 60.5
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
8 473921063 84 82 23 60.5 53 63
Quiz_2 Quiz_3 Quiz_4 Quiz_5
8 99 10 68 88
messy_impute(gradebook, "mean", 1, trim = 0.25)[22,] #apply row imputing by mean (with trim) for 2nd student missing Homework_4; expected imputed value: 54.54348
UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
22 642770743 73 19 41 54.54348 15 89
Quiz_2 Quiz_3 Quiz_4 Quiz_5
22 52 63 62 57
The function definitely works as intended, but the code looks quite messy. What happens when we make use of tidyr
and dplyr
?
First, we will convert our simulated dataset into a tidy format.
gradebook_tidy <- as_tibble(gradebook) %>% pivot_longer(names(gradebook)[-1], names_to = c("Assignment_Type", "Assignment_Number"), values_to = "Score", names_sep = "_")
gradebook_tidy
# A tibble: 1,000 × 4
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 532521854 Homework 1 45
2 532521854 Homework 2 17
3 532521854 Homework 3 75
4 532521854 Homework 4 22
5 532521854 Homework 5 13
6 532521854 Quiz 1 80
7 532521854 Quiz 2 87
8 532521854 Quiz 3 83
9 532521854 Quiz 4 82
10 532521854 Quiz 5 11
# … with 990 more rows
To perform tidy imputation, we will write a function called tidy_impute()
. The trick here is to make use of group_by()
in order to pull the appropriate data needed for each imputation method. The tidy_impute()
function should have the same arguments as the messy_impute()
function.
tidy_impute()
Algorithm:
We first determine how to impute based on the measure of center (mean or median).
We create a center function that reflects the measure of center.
We now determine how to impute based on margin (row or column).
If imputing by row, we first group by assignment_type and assignment_number.
Then, we mutate the Score column of the tidy data using an if_else statement to see where the NA values are.
If imputing by column, we group by student IDs and assignment_type, then do step 5 as described above.
We return the imputed tidy data.
tidy_impute <- function(tidy_df, center = "Mean", margin, ...) {
# This function imputes a tidy data
# This will throw an error if the measure of center is not mean or median
# This will also throw an error if the margin is any input other than 1 or 2
# Args:
# tidy_df: the tidied gradebook data
# center: center function with default as mean (either mean or margin)
# margin: the margin (either row or column)
# ...: extra args to pass onto the center_fn
# Return:
# the imputed tidy data
center <- tolower(center)
if (center == "mean") {
center_fn <- function(x) mean(x, na.rm = T, ...)
} else if (center == "median") {
center_fn <- function(x) median(x, na.rm = T, ...)
} else {
stop("The measure of center must be either 'mean' or 'median'.")
}
if (margin == 1) {
tidy_df <- tidy_df %>% group_by(Assignment_Type, Assignment_Number) %>% mutate(Score=if_else(is.na(Score), center_fn(Score), Score))
} else if (margin == 2) {
tidy_df <- tidy_df %>% group_by(UID, Assignment_Type) %>% mutate(Score=if_else(is.na(Score), center_fn(Score), Score))
} else {
stop("The margin must be either 1 (by row) or 2 (by column).")
}
tidy_df
}
We use the same cases from d) to demonstrate our new function, tidy_impute()
.
# A tibble: 1 × 4
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 576992987 Quiz 4 NA
# A tibble: 1 × 4
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 143843463 Quiz 4 NA
# A tibble: 1 × 4
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 473921063 Homework 4 NA
# A tibble: 1 × 4
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 642770743 Homework 4 NA
#tidy_impute should result in the same numbers as messy_impute
tidy_impute(gradebook_tidy, "mean", 1)[29,] #expected imputed value: 48.25556
# A tibble: 1 × 4
# Groups: Assignment_Type, Assignment_Number [1]
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 576992987 Quiz 4 48.3
# A tibble: 1 × 4
# Groups: UID, Assignment_Type [1]
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 143843463 Quiz 4 61
# A tibble: 1 × 4
# Groups: UID, Assignment_Type [1]
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 473921063 Homework 4 60.5
# A tibble: 1 × 4
# Groups: Assignment_Type, Assignment_Number [1]
UID Assignment_Type Assignment_Number Score
<dbl> <chr> <chr> <dbl>
1 642770743 Homework 4 54.5
Great! The new function also works as intended and is much more concise than messy_impute()
. With this, we discover the power of using tidyverse
for any data-related work.