Chapter 6 Data Wrangling and Reshaping
6.1 pivot
Pivoting can seem abstract at first, but it’s incredibly useful when you have a wide data and need to conduct further analysis. RStudio often works best with data in a long format. Let’s start with an example:
Name | Height (cm) | Age |
---|---|---|
Lily | 150 | 20 |
Mary | 160 | 18 |
John | 156 | 14 |
Steve | 180 | 15 |
Amy | 174 | 16 |
This is straightforward: new rows can be added as more people are recorded. But sometimes, you’ll encounter wide data, such as:
Name | Eng_C1 | Eng_C2 | Eng_C3 | Math_C1 | Math_C2 | Math_C3 |
---|---|---|---|---|---|---|
Kelly | 90 | 95 | 80 | 90 | 80 | 97 |
Liam | 70 | 80 | 94 | 80 | 89 | 79 |
Henry | 79 | 80 | 85 | 96 | 92 | 90 |
Here, students’ scores in English and Math from Chapter 1 (C1) to Chapter 3 (C3) are spread across multiple columns. This wide format can be tricky for analysis, and pivoting the data into a long format is often more effective.
6.1.1 example: converting wide to long format
Let’s first create a data first.
performance_original <- data.frame(
Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
Eng_C1 = c(90, 70, 79, 98, 81, 79),
Eng_C2 = c(95, 80, 80, 93, 93, 70),
Eng_C3 = c(80, 94, 85, 89, 73, 91),
Math_C1 = c(90, 80, 96, 83, 92, 68),
Math_C2 = c(80, 89, 92, 72, 84, 83),
Math_C3 = c(97, 79, 90, 74, 70, 92))
performance_original
## Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3
## 1 Kelly 90 95 80 90 80 97
## 2 Liam 70 80 94 80 89 79
## 3 Henry 79 80 85 96 92 90
## 4 Alice 98 93 89 83 72 74
## 5 Jake 81 93 73 92 84 70
## 6 Dan 79 70 91 68 83 92
To convert this data into a long format, where each row represents one exam score, we can use pivot_longer()
. The goal is to get data that looks like this:
Name | Subject | Score |
---|---|---|
Kelly | Eng_C1 | 90 |
Kelly | Eng_C2 | 95 |
Kelly | Eng_C3 | 80 |
Kelly | Math_C1 | 90 |
Kelly | Math_C2 | 80 |
Kelly | Math_C3 | 97 |
Liam | Eng_C1 | 70 |
Liam | Eng_C2 | 80 |
Liam | Eng_C3 | 94 |
Liam | Math_C1 | 80 |
Liam | Math_C2 | 89 |
Liam | Math_C3 | 79 |
Henry | Eng_C1 | 79 |
Henry | Eng_C2 | 80 |
… and so on |
Here’s the code to achieve that:
## # A tibble: 36 × 3
## Name Subject Score
## <chr> <chr> <dbl>
## 1 Kelly Eng_C1 90
## 2 Kelly Eng_C2 95
## 3 Kelly Eng_C3 80
## 4 Kelly Math_C1 90
## 5 Kelly Math_C2 80
## 6 Kelly Math_C3 97
## 7 Liam Eng_C1 70
## 8 Liam Eng_C2 80
## 9 Liam Eng_C3 94
## 10 Liam Math_C1 80
## # ℹ 26 more rows
We can see that the data format is now what we intended. Let’s break down the key parts of the code used:
!Name
: The!
symbol tells RStudio to exclude theName
column from the pivot operation, meaning that we still want to keep theName
column intact in the dataset.names_to = "Subject"
: This argument specifies that we want to gather the multiple collumn names (e.g.,Eng_C1
,Math_C1
, etc.) into a new column calledSubject
.values_to = "Score"
: This indicates that the values from the original columns (the actual exam scores) should be placed into a new column namedScore
.
These two arguments (names_to
and values_to
) allow us to compress multiple columns into one, transforming the data from a wide to a long format.
There are other ways to achieve the same result. The approach we used works well when the dataset doesn’t have too many columns to exclude (in this case, just the Name
column). However, if you have a large dataset with many columns to exclude, or if it’s easier to specify the columns you want to compress directly, you can use a different method like cols =
.
For example, you could list out the columns you want to gather explicitly, or use a range of column names depending on the situation. This flexibility allows you to choose the most efficient method based on your dataset’s structure.
performance_original |>
pivot_longer(cols = c("Eng_C1", "Eng_C2", "Eng_C3", "Math_C1", "Math_C2", "Math_C3"),
names_to = "Subject",
values_to = "Score") -> performance
performance
## # A tibble: 36 × 3
## Name Subject Score
## <chr> <chr> <dbl>
## 1 Kelly Eng_C1 90
## 2 Kelly Eng_C2 95
## 3 Kelly Eng_C3 80
## 4 Kelly Math_C1 90
## 5 Kelly Math_C2 80
## 6 Kelly Math_C3 97
## 7 Liam Eng_C1 70
## 8 Liam Eng_C2 80
## 9 Liam Eng_C3 94
## 10 Liam Math_C1 80
## # ℹ 26 more rows
Now, let’s dive a bit deeper into a more abstract concept.
If we want to further split our data, for instance, breaking down the Subject
column into two new columns, like Subject_em
(for subject name) and Chapter
(for chapter number), we can use the separate()
function to accomplish this.
performance |>
separate(
col = Subject,
into = c("Subject_em", "Chapter"),
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE) -> performance
performance
## # A tibble: 36 × 4
## Name Subject_em Chapter Score
## <chr> <chr> <chr> <dbl>
## 1 Kelly Eng C1 90
## 2 Kelly Eng C2 95
## 3 Kelly Eng C3 80
## 4 Kelly Math C1 90
## 5 Kelly Math C2 80
## 6 Kelly Math C3 97
## 7 Liam Eng C1 70
## 8 Liam Eng C2 80
## 9 Liam Eng C3 94
## 10 Liam Math C1 80
## # ℹ 26 more rows
I know it looks scary, but please hear me out– this function will become one of your best friends once you understand it😊.
As mentioned eariler, we want to separate our Subject
column into Subject_em
and Chapter
. So, we use col =
to specify the target column (in this case, Subject
), and into =
to define the names of the two new columns we want to create.
Now, let’s demystify the seemingly bizarre-looking sep = "[^[:alnum:]]+"
part. In general, this means “one or more non-alphabet or non-numeric characters”. Let’s break it down:
[:alnum:]
stands for alphabet (A-Z, a-z), and numbers (0-9).[^]
means not include.+
means one or more.
When we combine these, it means we’re telling RStudio to look for one or more things that are not alphabet nor number. In this case, we want to find the underscore (_
) and use it as the separation point.
We also see the parameters remove = TRUE
and convert = FALSE
. These are defaults in the separate()
function, but let’s briefly touch on them to understand what they do:
remove = TRUE
means to delete the original columnSubject
column after separating it.convert = FALSE
ensures that RStudio doesn’t automatically change the types of the new columns. For exapmle, if the separation produces numbers (e.g., separatingEng_1
intoEng
and1
), settingconvert = TRUE
would make the new1
column numeric. We can change the column type later if needed, but for now, we keep itFALSE
.
Congratulations! You’ve now learned how to pivot wide-format data into long-format, and how to separate columns. (Even after 5 years of learning and using RStuido, I’m still amazed by its power. EVERY SINGLE TIME.)
If there’s a pivot_longer()
, there must be a pivot_wider()
, right? Well, yes! But as I mentioned earlier, RStudio prefers long-format data for analysis and visualization. For that reason (and for I am not proficient with pivot_wider()
), I’m focusing on pivot_longer()
here since it’s generally more useful for data analysis. (Maybe I’ll dive into pivot_wider()
when I’m more proficient with it, haha.)
6.2 join
Another useful function when working with multiple datasets or files is combining them into one. Let’s continue with our performance
dataset, and imagine that we have documented students’ Biology exam scores in another dataset.
performance_biology <- data.frame(
Name = c("Kelly", "Liam", "Henry", "Alice", "Jake", "Dan"),
Bio_C1 = c(80, 80, 83, 93, 80, 88),
Bio_C2 = c(73, 76, 75, 77, 79, 83),
Bio_C3 = c(90, 93, 94, 90, 81, 82))
We pull the original performance data and want to add the Biology columns to it.
## Name Eng_C1 Eng_C2 Eng_C3 Math_C1 Math_C2 Math_C3 Bio_C1 Bio_C2 Bio_C3
## 1 Kelly 90 95 80 90 80 97 80 73 90
## 2 Liam 70 80 94 80 89 79 80 76 93
## 3 Henry 79 80 85 96 92 90 83 75 94
## 4 Alice 98 93 89 83 72 74 93 77 90
## 5 Jake 81 93 73 92 84 70 80 79 81
## 6 Dan 79 70 91 68 83 92 88 83 82
Here, we’ve successfully added the Biology columns to the original dataset. When dealing with multiple independent variables (columns), you can use by = c("column1", "column2", …)
to specify the join basis.
Now, you might have guessed – there is also a right_join()
in RStudio. While I don’t use it as often as left_join()
in my usual data analysis, it’s still important to understand the difference.
left_join()
: keeps the data from the left dataset, even if there is no matching data on the right. If no match is found, it fills the right columns withNA
.right_join()
: keeps the data from the right dataset, even if there is no matching data on the left. Again, unmatched data is filled withNA
.
While these functions are simple, they’re extremely powerful for combining datasets, and – dare I say – quite magical in what they can accomplish.
6.3 bind
Besides left and right joins, there’s another way to combine datasets: binding.
dataset1 <- data.frame(
brand = c("1","2","3"),
price = c(200, 300, 250),
profit = c(100, 200, 200))
dataset2 <- data.frame(
brand = c("1","2","3"),
sales = c(20, 40, 60))
We have created two separate datassets. One contains a product’s brand, price, and profit, and the other contains sales numbers. We can use cbind()
(column bind) to merge these datasets side by side.
## brand price profit brand sales
## 1 1 200 100 1 20
## 2 2 300 200 2 40
## 3 3 250 200 3 60
However, as you might notice, the column brand
is repeated in the merged data. If this repetition doesn’t affect your analysis, you can leave it. But if you prefer a cleaner result, you could opt to use left_join()
instead, as it binds the data without duplicating columns.
## brand price profit sales
## 1 1 200 100 20
## 2 2 300 200 40
## 3 3 250 200 60