Chapter 6 Data Wrangling and Reshaping

6.1 pivot

Pivoting can seem abstract at first, but it’s incredibly useful when you have a wide data and need to conduct further analysis. RStudio often works best with data in a long format. Let’s start with an example:

Name Height (cm) Age
Lily 150 20
Mary 160 18
John 156 14
Steve 180 15
Amy 174 16

This is straightforward: new rows can be added as more people are recorded. But sometimes, you’ll encounter wide data, such as:

Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3
Kelly 90 95 80 90 80 97
Liam 70 80 94 80 89 79
Henry 79 80 85 96 92 90

Here, students’ scores in English and Math from Chapter 1 (C1) to Chapter 3 (C3) are spread across multiple columns. This wide format can be tricky for analysis, and pivoting the data into a long format is often more effective.

6.1.1 example: converting wide to long format

Let’s first create a data first.

performance_original <- data.frame(
   Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
   Eng_C1 = c(90, 70, 79, 98, 81, 79),
   Eng_C2 = c(95, 80, 80, 93, 93, 70),
   Eng_C3 = c(80, 94, 85, 89, 73, 91),
   Math_C1 = c(90, 80, 96, 83, 92, 68),
   Math_C2 = c(80, 89, 92, 72, 84, 83),
   Math_C3 = c(97, 79, 90, 74, 70, 92))
performance_original
##    Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3
## 1 Kelly     90     95     80      90      80      97
## 2  Liam     70     80     94      80      89      79
## 3 Henry     79     80     85      96      92      90
## 4 Alice     98     93     89      83      72      74
## 5  Jake     81     93     73      92      84      70
## 6   Dan     79     70     91      68      83      92

To convert this data into a long format, where each row represents one exam score, we can use pivot_longer(). The goal is to get data that looks like this:

Name Subject Score
Kelly Eng_C1 90
Kelly Eng_C2 95
Kelly Eng_C3 80
Kelly Math_C1 90
Kelly Math_C2 80
Kelly Math_C3 97
Liam Eng_C1 70
Liam Eng_C2 80
Liam Eng_C3 94
Liam Math_C1 80
Liam Math_C2 89
Liam Math_C3 79
Henry Eng_C1 79
Henry Eng_C2 80
… and so on

Here’s the code to achieve that:

performance_original |>
  pivot_longer(!Name, names_to = "Subject", values_to = "Score") 
## # A tibble: 36 × 3
##    Name  Subject Score
##    <chr> <chr>   <dbl>
##  1 Kelly Eng_C1     90
##  2 Kelly Eng_C2     95
##  3 Kelly Eng_C3     80
##  4 Kelly Math_C1    90
##  5 Kelly Math_C2    80
##  6 Kelly Math_C3    97
##  7 Liam  Eng_C1     70
##  8 Liam  Eng_C2     80
##  9 Liam  Eng_C3     94
## 10 Liam  Math_C1    80
## # ℹ 26 more rows

We can see that the data format is now what we intended. Let’s break down the key parts of the code used:

  • !Name: The ! symbol tells RStudio to exclude the Name column from the pivot operation, meaning that we still want to keep the Name column intact in the dataset.

  • names_to = "Subject": This argument specifies that we want to gather the multiple collumn names (e.g., Eng_C1, Math_C1, etc.) into a new column called Subject.

  • values_to = "Score": This indicates that the values from the original columns (the actual exam scores) should be placed into a new column named Score.

These two arguments (names_to and values_to) allow us to compress multiple columns into one, transforming the data from a wide to a long format.

There are other ways to achieve the same result. The approach we used works well when the dataset doesn’t have too many columns to exclude (in this case, just the Name column). However, if you have a large dataset with many columns to exclude, or if it’s easier to specify the columns you want to compress directly, you can use a different method like cols =.

For example, you could list out the columns you want to gather explicitly, or use a range of column names depending on the situation. This flexibility allows you to choose the most efficient method based on your dataset’s structure.

performance_original |>
  pivot_longer(cols = c("Eng_C1", "Eng_C2", "Eng_C3", "Math_C1", "Math_C2", "Math_C3"),
               names_to = "Subject", 
               values_to = "Score") -> performance
performance
## # A tibble: 36 × 3
##    Name  Subject Score
##    <chr> <chr>   <dbl>
##  1 Kelly Eng_C1     90
##  2 Kelly Eng_C2     95
##  3 Kelly Eng_C3     80
##  4 Kelly Math_C1    90
##  5 Kelly Math_C2    80
##  6 Kelly Math_C3    97
##  7 Liam  Eng_C1     70
##  8 Liam  Eng_C2     80
##  9 Liam  Eng_C3     94
## 10 Liam  Math_C1    80
## # ℹ 26 more rows

Now, let’s dive a bit deeper into a more abstract concept.

If we want to further split our data, for instance, breaking down the Subject column into two new columns, like Subject_em (for subject name) and Chapter (for chapter number), we can use the separate() function to accomplish this.

performance |>
  separate(
    col = Subject, 
    into = c("Subject_em", "Chapter"),
    sep = "[^[:alnum:]]+",
    remove = TRUE,
    convert = FALSE) -> performance
performance
## # A tibble: 36 × 4
##    Name  Subject_em Chapter Score
##    <chr> <chr>      <chr>   <dbl>
##  1 Kelly Eng        C1         90
##  2 Kelly Eng        C2         95
##  3 Kelly Eng        C3         80
##  4 Kelly Math       C1         90
##  5 Kelly Math       C2         80
##  6 Kelly Math       C3         97
##  7 Liam  Eng        C1         70
##  8 Liam  Eng        C2         80
##  9 Liam  Eng        C3         94
## 10 Liam  Math       C1         80
## # ℹ 26 more rows

I know it looks scary, but please hear me out– this function will become one of your best friends once you understand it😊.

As mentioned eariler, we want to separate our Subject column into Subject_em and Chapter. So, we use col = to specify the target column (in this case, Subject), and into = to define the names of the two new columns we want to create.

Now, let’s demystify the seemingly bizarre-looking sep = "[^[:alnum:]]+" part. In general, this means “one or more non-alphabet or non-numeric characters”. Let’s break it down:

  • [:alnum:] stands for alphabet (A-Z, a-z), and numbers (0-9).

  • [^] means not include.

  • + means one or more.

When we combine these, it means we’re telling RStudio to look for one or more things that are not alphabet nor number. In this case, we want to find the underscore (_) and use it as the separation point.

We also see the parameters remove = TRUE and convert = FALSE. These are defaults in the separate() function, but let’s briefly touch on them to understand what they do:

  • remove = TRUE means to delete the original column Subject column after separating it.

  • convert = FALSE ensures that RStudio doesn’t automatically change the types of the new columns. For exapmle, if the separation produces numbers (e.g., separating Eng_1 into Eng and 1), setting convert = TRUE would make the new 1 column numeric. We can change the column type later if needed, but for now, we keep it FALSE.

Congratulations! You’ve now learned how to pivot wide-format data into long-format, and how to separate columns. (Even after 5 years of learning and using RStuido, I’m still amazed by its power. EVERY SINGLE TIME.)

If there’s a pivot_longer(), there must be a pivot_wider(), right? Well, yes! But as I mentioned earlier, RStudio prefers long-format data for analysis and visualization. For that reason (and for I am not proficient with pivot_wider()), I’m focusing on pivot_longer() here since it’s generally more useful for data analysis. (Maybe I’ll dive into pivot_wider() when I’m more proficient with it, haha.)

6.2 join

Another useful function when working with multiple datasets or files is combining them into one. Let’s continue with our performance dataset, and imagine that we have documented students’ Biology exam scores in another dataset.

performance_biology <- data.frame(
   Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
   Bio_C1 = c(80, 80, 83, 93, 80, 88),
   Bio_C2 = c(73, 76, 75, 77, 79, 83),
   Bio_C3 = c(90, 93, 94, 90, 81, 82))

We pull the original performance data and want to add the Biology columns to it.

performance_original |>
  left_join(performance_biology, by = "Name") 
##    Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3 Bio_C1 Bio_C2 Bio_C3
## 1 Kelly     90     95     80      90      80      97     80     73     90
## 2  Liam     70     80     94      80      89      79     80     76     93
## 3 Henry     79     80     85      96      92      90     83     75     94
## 4 Alice     98     93     89      83      72      74     93     77     90
## 5  Jake     81     93     73      92      84      70     80     79     81
## 6   Dan     79     70     91      68      83      92     88     83     82

Here, we’ve successfully added the Biology columns to the original dataset. When dealing with multiple independent variables (columns), you can use by = c("column1", "column2", …) to specify the join basis.

Now, you might have guessed – there is also a right_join() in RStudio. While I don’t use it as often as left_join() in my usual data analysis, it’s still important to understand the difference.

  • left_join(): keeps the data from the left dataset, even if there is no matching data on the right. If no match is found, it fills the right columns with NA.

  • right_join(): keeps the data from the right dataset, even if there is no matching data on the left. Again, unmatched data is filled with NA.

While these functions are simple, they’re extremely powerful for combining datasets, and – dare I say – quite magical in what they can accomplish.

6.3 bind

Besides left and right joins, there’s another way to combine datasets: binding.

dataset1 <- data.frame(
  brand = c("1","2","3"),
  price = c(200, 300, 250),
  profit = c(100, 200, 200))
dataset2 <- data.frame(
  brand = c("1","2","3"),
  sales = c(20, 40, 60))

We have created two separate datassets. One contains a product’s brand, price, and profit, and the other contains sales numbers. We can use cbind() (column bind) to merge these datasets side by side.

cbind(dataset1, dataset2) -> merged_data
merged_data
##   brand price profit brand sales
## 1     1   200    100     1    20
## 2     2   300    200     2    40
## 3     3   250    200     3    60

However, as you might notice, the column brand is repeated in the merged data. If this repetition doesn’t affect your analysis, you can leave it. But if you prefer a cleaner result, you could opt to use left_join() instead, as it binds the data without duplicating columns.

dataset1 |>
  left_join(dataset2, by = "brand")
##   brand price profit sales
## 1     1   200    100    20
## 2     2   300    200    40
## 3     3   250    200    60