Intro to Data Manipulation

R
tidyverse
Author

Jun Ryu

Published

February 5, 2023

For today, we import two powerful packages (dplyr and tidyr) under tidyverse that will help us clean and manipulate data easily.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr) 

a) Simulate a Dataset

Our first task is to create a dataset that we can work with. We intend to simulate a dataset that resembles a gradebook. Specifically, we want the following:

  • Each row of the gradebook should contain all measurements for a single student.
  • Each column should contain scores for one assignment.
  • There should be 11 columns (1 for 9-digit student IDs, 5 for homework, 5 for quizzes).

The simulated gradebook should contain the grades for 100 students and scores (out of 100) for 5 homework and 5 quizzes:

set.seed(605574052) # set seed for reproducibility 
gradebook <- data.frame()

for (i in 1:100) {
  UID <- round(runif(1, min = 100000000, max = 999999999), 0)
  assignments <- round(runif(10, min = 0, max = 100), 0)
  data <- c(UID, assignments)
  gradebook <- rbind(gradebook, data)
}

colnames(gradebook) <- c("UID", "Homework_1", "Homework_2", "Homework_3", "Homework_4", 
                         "Homework_5", "Quiz_1", "Quiz_2", "Quiz_3", "Quiz_4", "Quiz_5")
head(gradebook)
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
1 532521854         45         17         75         22         13     80
2 486614393         20         50         82         78         41     53
3 576992987         53         54         85         52         29     35
4 143843463         33          2          6         39         39     49
5 773280766         99         25         52         51         17     42
6 486851405         58         38         76         49         86     39
  Quiz_2 Quiz_3 Quiz_4 Quiz_5
1     87     83     82     11
2     10     55     73     42
3     74     28     27     46
4     55     97     84     67
5     66     37    100     10
6      7     80      2     16

b) Modify the Dataset

Now, we will modify the dataset to randomly replace 10% of Homework_4 and Quiz_4 with NA, respectively.

set.seed(605574052)

gradebook[sample(100, 10), "Homework_4"] <- NA
gradebook[sample(100, 10), "Quiz_4"] <- NA

sum(is.na(gradebook["Homework_4"])) # should output 10
[1] 10
sum(is.na(gradebook["Quiz_4"])) # should output 10
[1] 10

With this, our dataset is all ready to go. In the later parts, we will perform imputation on the data using two different approaches. Here, imputation is the process of replacing missing values with estimated values. The simplest (far from preferred) method to impute values is to replace missing values with the most typical value, say the mean or the median.

c) Messy Imputation

In this part, we will try to impute the NA values WITHOUT using any of the tools provided by dplyr or tidyr.

In order to achieve this, we will write a function messy_impute() with at least three arguments:

  • df: the gradebook dataframe (note that the number of rows and where NA values occur may be different)
  • center: a character object indicating the impute function (Mean or Median) with the default as “Mean”
  • margin: an integer (1 or 2) indicating the imputation method (if 1, the function imputes the missing values by row, if 2, by column. if choosing by column, the function should process homework and quizzes separately)

messy_impute()

Algorithm:

  1. We first determine how to impute based on the measure of center (mean or median).

  2. We call the helper function, which first establishes the indices of the NA values in the data frame.

  3. The helper function determines how to impute based on the margin value (row or column).

  4. If we are imputing by row, then we iterate through each NA value and impute the missing values.

  5. If we are imputing by column, then we iterate through each NA value but we check what category (homework or quiz) each NA value falls under.

  6. Based on which category the NA value falls under, we impute by that category separately.

  7. We return the imputed data frame.

messy_impute_data <- function(df, center_fn, margin, ...) {
  # This helper function essentially sets up the imputation process based on the center method (mean or median) and margin (row or column)
  # This will throw an error if the margin is any input other than 1 or 2
  # This will also throw an error if the score does not belong in either the homework or quiz category
  # Args:
  # df: the gradebook data frame
  # center_fn: center function (either mean or margin)
  # margin: the margin (either row or column)
  # ...: extra args to pass onto the center_fn
  # Return:
  # the imputed data frame
  
  index <- which(is.na(df), arr.ind = TRUE)
  if (margin == 1) {
    for (i in 1:nrow(index)) {
      df[index[i,1], index[i,2]] <- center_fn(df[ ,index[i,2]], na.rm = T, ...)
    }
  } else if (margin == 2) {
    for (i in 1:nrow(index)) {
      if (grepl("Homework", colnames(gradebook)[index[i,2]])) {
        df[index[i,1], index[i,2]] <- apply(gradebook[index[i,1], grepl("Homework", names(gradebook))], 1, center_fn, na.rm = T, ...)
      } else if (grepl("Quiz", colnames(gradebook)[index[i,2]])) {
        df[index[i,1], index[i,2]] <- apply(gradebook[index[i,1], grepl("Quiz", names(gradebook))], 1, center_fn, na.rm = T, ...)
      } else {
        stop("This score does not belong to either the homework or quiz category.")
      }
    }
  } else {
    stop("The margin must be either 1 (by row) or 2 (by column).")
  }
  df
}

messy_impute <- function(df, center = "Mean", margin, ...) {
  # This function uses the above helper function to actually impute the data frame 
  # This will throw an error if the measure of center is not mean or median
  # Args:
  # df: the gradebook data frame
  # center: center function with default as mean (either mean or margin)
  # margin: the margin (either row or column)
  # ...: extra args to pass onto the center_fn in the helper function
  # Return:
  # the imputed data frame
  
  center <- tolower(center)
  if (center == "mean") {
    df <- messy_impute_data(df, mean, margin, ...)
  } else if (center == "median") {
    df <- messy_impute_data(df, median, margin, ...)
  } else {
    stop("The measure of center must be either 'mean' or 'median'.")
  }
  
  df
}

d) Demo 1

Let’s demonstrate the above function! We will select two students missing Homework_4 and two students missing Quiz_4 from our simulated gradebook and perform imputations.

Test Cases:

gradebook[3,] # 1st student missing Quiz_4
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
3 576992987         53         54         85         52         29     35
  Quiz_2 Quiz_3 Quiz_4 Quiz_5
3     74     28     NA     46
gradebook[4,] # 2nd student missing Quiz_4
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
4 143843463         33          2          6         39         39     49
  Quiz_2 Quiz_3 Quiz_4 Quiz_5
4     55     97     NA     67
gradebook[8,] # 1st student missing Homework_4
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
8 473921063         84         82         23         NA         53     63
  Quiz_2 Quiz_3 Quiz_4 Quiz_5
8     99     10     68     88
gradebook[22,] # 2nd student missing Homework_4
         UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
22 642770743         73         19         41         NA         15     89
   Quiz_2 Quiz_3 Quiz_4 Quiz_5
22     52     63     62     57
messy_impute(gradebook, "mean", 1)[3,] #apply row imputing by mean for 1st student missing Quiz_4; expected imputed value: 48.25556
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
3 576992987         53         54         85         52         29     35
  Quiz_2 Quiz_3   Quiz_4 Quiz_5
3     74     28 48.25556     46
messy_impute(gradebook, "median", 2)[4,] #apply column imputing by median for 2nd student missing Quiz_4; expected imputed value: 61
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
4 143843463         33          2          6         39         39     49
  Quiz_2 Quiz_3 Quiz_4 Quiz_5
4     55     97     61     67
messy_impute(gradebook, "mean", 2)[8, ] #apply column imputing by mean for 1st student missing Homework_4; expected imputed value: 60.5
        UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
8 473921063         84         82         23       60.5         53     63
  Quiz_2 Quiz_3 Quiz_4 Quiz_5
8     99     10     68     88
messy_impute(gradebook, "mean", 1, trim = 0.25)[22,] #apply row imputing by mean (with trim) for 2nd student missing Homework_4; expected imputed value: 54.54348
         UID Homework_1 Homework_2 Homework_3 Homework_4 Homework_5 Quiz_1
22 642770743         73         19         41   54.54348         15     89
   Quiz_2 Quiz_3 Quiz_4 Quiz_5
22     52     63     62     57

The function definitely works as intended, but the code looks quite messy. What happens when we make use of tidyr and dplyr?

e) Convert the Data

First, we will convert our simulated dataset into a tidy format.

gradebook_tidy <- as_tibble(gradebook) %>% pivot_longer(names(gradebook)[-1], names_to = c("Assignment_Type", "Assignment_Number"), values_to = "Score", names_sep = "_")

gradebook_tidy
# A tibble: 1,000 × 4
         UID Assignment_Type Assignment_Number Score
       <dbl> <chr>           <chr>             <dbl>
 1 532521854 Homework        1                    45
 2 532521854 Homework        2                    17
 3 532521854 Homework        3                    75
 4 532521854 Homework        4                    22
 5 532521854 Homework        5                    13
 6 532521854 Quiz            1                    80
 7 532521854 Quiz            2                    87
 8 532521854 Quiz            3                    83
 9 532521854 Quiz            4                    82
10 532521854 Quiz            5                    11
# … with 990 more rows

f) Tidy Imputation

To perform tidy imputation, we will write a function called tidy_impute(). The trick here is to make use of group_by() in order to pull the appropriate data needed for each imputation method. The tidy_impute() function should have the same arguments as the messy_impute() function.

tidy_impute()

Algorithm:

  1. We first determine how to impute based on the measure of center (mean or median).

  2. We create a center function that reflects the measure of center.

  3. We now determine how to impute based on margin (row or column).

  4. If imputing by row, we first group by assignment_type and assignment_number.

  5. Then, we mutate the Score column of the tidy data using an if_else statement to see where the NA values are.

  6. If imputing by column, we group by student IDs and assignment_type, then do step 5 as described above.

  7. We return the imputed tidy data.

tidy_impute <- function(tidy_df, center = "Mean", margin, ...) {
  # This function imputes a tidy data
  # This will throw an error if the measure of center is not mean or median
  # This will also throw an error if the margin is any input other than 1 or 2
  # Args:
  # tidy_df: the tidied gradebook data
  # center: center function with default as mean (either mean or margin)
  # margin: the margin (either row or column)
  # ...: extra args to pass onto the center_fn
  # Return:
  # the imputed tidy data
  
  center <- tolower(center)
  if (center == "mean") {
    center_fn <- function(x) mean(x, na.rm = T, ...)
  } else if (center == "median") {
    center_fn <- function(x) median(x, na.rm = T, ...)
  } else {
    stop("The measure of center must be either 'mean' or 'median'.")
  }
  
  if (margin == 1) {
    tidy_df <- tidy_df %>% group_by(Assignment_Type, Assignment_Number) %>% mutate(Score=if_else(is.na(Score), center_fn(Score), Score))
  } else if (margin == 2) {
    tidy_df <- tidy_df %>% group_by(UID, Assignment_Type) %>% mutate(Score=if_else(is.na(Score), center_fn(Score), Score))
  } else {
    stop("The margin must be either 1 (by row) or 2 (by column).")
  }
  
  tidy_df
}

g) Demo 2

We use the same cases from d) to demonstrate our new function, tidy_impute().

gradebook_tidy[29,] # 1st student missing Quiz_4
# A tibble: 1 × 4
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 576992987 Quiz            4                    NA
gradebook_tidy[39,] # 2nd student missing Quiz_4
# A tibble: 1 × 4
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 143843463 Quiz            4                    NA
gradebook_tidy[74,] # 1st student missing Homework_4
# A tibble: 1 × 4
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 473921063 Homework        4                    NA
gradebook_tidy[214,] # 2nd student missing Homework_4
# A tibble: 1 × 4
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 642770743 Homework        4                    NA
#tidy_impute should result in the same numbers as messy_impute
tidy_impute(gradebook_tidy, "mean", 1)[29,] #expected imputed value: 48.25556
# A tibble: 1 × 4
# Groups:   Assignment_Type, Assignment_Number [1]
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 576992987 Quiz            4                  48.3
tidy_impute(gradebook_tidy, "median", 2)[39,] #expected imputed value: 61
# A tibble: 1 × 4
# Groups:   UID, Assignment_Type [1]
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 143843463 Quiz            4                    61
tidy_impute(gradebook_tidy, "mean", 2)[74, ] #expected imputed value: 60.5
# A tibble: 1 × 4
# Groups:   UID, Assignment_Type [1]
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 473921063 Homework        4                  60.5
tidy_impute(gradebook_tidy, "mean", 1, trim = 0.25)[214,] #expected imputed value: 54.54348
# A tibble: 1 × 4
# Groups:   Assignment_Type, Assignment_Number [1]
        UID Assignment_Type Assignment_Number Score
      <dbl> <chr>           <chr>             <dbl>
1 642770743 Homework        4                  54.5

Great! The new function also works as intended and is much more concise than messy_impute(). With this, we discover the power of using tidyverse for any data-related work.