Chapter 4 Overview of Tidyverse and Core Functions
I really love tidyverse. It’s a game changer. The three packages I use most often from tidyverse are:
ggplot2
dplyr
tidyr
But first, let’s clarify what tidyverse is.
Tidyverse is a collection of packages that makes data transformation and visualization (and therefore, my life) easier. Under the tidyverse umbrella, we can handle data like pros.
We’ll dive into ggplot2 in a later chapter. For now, let’s focus on dplyr and tidyr.
If you haven’t used tidyverse before, you need to install the package first.
Next, don’t forget to load your packages using the library() function.
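The two setup steps above might look like the sketch below. It assumes you are installing from CRAN with the default settings. The install line is commented out because it only needs to run once per machine.

```r
# One-time setup: download and install the tidyverse packages from CRAN.
# Uncomment and run this line if tidyverse is not installed yet.
# install.packages("tidyverse")

# Every new R session: attach the core tidyverse packages
# (ggplot2, dplyr, tidyr, and several others).
library(tidyverse)
```

Installing happens once, but library() has to be called again each time you start a fresh R session.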
Now, you’re all set.
Before jumping into the functions, let’s meet a funny-looking friend: the pipe, written as %>% (from the tidyverse) or |> (built into R).
In my humble opinion, this is one of the greatest analogies of this century. Although it doesn’t look like a real pipe, it functions like one.
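To make the pipe idea concrete, here is a minimal sketch. The vector x and the variable total are made up for illustration and are not part of the chapter’s data. The sketch assumes R 4.1.0 or later for the native |> pipe.

```r
library(magrittr)  # provides %>%; dplyr also makes %>% available

# A toy vector, invented for this illustration.
x <- c(1, 4, 9)

# Without a pipe: the data sits inside the function call.
sqrt(x)

# With the magrittr pipe: the left-hand side becomes the first argument.
x %>% sqrt()

# With the native pipe (R 4.1.0 and later).
x |> sqrt()

# Pipes chain left to right: take x, square-root it, then sum the result.
total <- x %>% sqrt() %>% sum()
total
```

All three sqrt() calls return the same values. The pipe only changes where the data is written, not what the function does.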
Let’s use an example to see how it works.
4.1 Pipe example
We use the babynames package mentioned above to demonstrate the use of the pipe.
We can take a look at the data again.
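The code for this step is not shown here. A call along the following lines would print the data, assuming the babynames package is already installed:

```r
library(babynames)  # makes the babynames dataset available

# Typing the name of a tibble prints its first ten rows and its column types.
babynames
```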
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ℹ 1,924,655 more rows
Now, let’s say we want to see the names of babies born in the year 2000.
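A sketch of the code for this step, based on the description that follows the output: the dataset is piped into filter() with a condition on year. The variable name born_2000 is my own addition so that the result can be reused.

```r
library(dplyr)      # provides filter() and the %>% pipe
library(babynames)  # provides the babynames dataset

# Send babynames into filter() and keep only the rows where year equals 2000.
# Note the double equals sign: == tests for equality, while = assigns.
born_2000 <- babynames %>%
  filter(year == 2000)

born_2000
```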
## # A tibble: 29,769 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130
##  2  2000 F     Hannah    23080 0.0116
##  3  2000 F     Madison   19967 0.0100
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 29,759 more rows
Here you’ll see all the baby names from 2000!
In the code, I used the pipe (%>%) to direct my dataset (babynames) into the next function, filter().
Inside the filter() function, I specified that I wanted to see only babies born in the year 2000. In other words, I kept only the rows whose year column has the value 2000 and dropped all the others.
The great thing about the pipe (%>% / |>) is that you can chain multiple operations together. Just like water flows from east to west, you can direct your data from one operation to the next.
Let’s try it out.
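The original code is not shown, so this is a reconstruction. One way to chain the operations, assuming two filter() calls in a row, is sketched here. The name girls_2000 is hypothetical.

```r
library(dplyr)
library(babynames)

# Step 1: keep rows from the year 2000.
# Step 2: pipe that result into a second filter() that keeps only sex "F".
girls_2000 <- babynames %>%
  filter(year == 2000) %>%
  filter(sex == "F")

girls_2000
```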
## # A tibble: 17,653 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130
##  2  2000 F     Hannah    23080 0.0116
##  3  2000 F     Madison   19967 0.0100
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 17,643 more rows
Here, we put "F" inside quotes because it’s a character string, not a number. Without the quotes, R would look for an object named F instead of the text value, so we need them for R to understand what we mean.
Now, the data shows only female babies born in the year 2000.
We can also achieve this result in another way.
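The alternative is not shown, so what follows is my guess: a single filter() call with both conditions separated by a comma. dplyr treats comma-separated conditions as "and". The name girls_2000_alt is hypothetical.

```r
library(dplyr)
library(babynames)

# One filter() call with two conditions: a row is kept only if
# year is 2000 AND sex is "F".
girls_2000_alt <- babynames %>%
  filter(year == 2000, sex == "F")

girls_2000_alt
```

Writing filter(year == 2000 & sex == "F") would give the same rows. The comma form is simply a little easier to read.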
## # A tibble: 17,653 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130
##  2  2000 F     Hannah    23080 0.0116
##  3  2000 F     Madison   19967 0.0100
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 17,643 more rows
Let’s explore more data transformations in the next chapter.