Chapter 4 Overview of Tidyverse and Core Functions

I really love tidyverse. It’s a game changer. The three packages I use most often from tidyverse are:

  1. ggplot2
  2. dplyr
  3. tidyr

But first, let’s clarify what tidyverse is.

Tidyverse is a collection of packages that makes data transformation and visualization (and therefore, my life) easier. Under the tidyverse umbrella, we can handle data like pros.

We’ll dive into ggplot2 in a later chapter. For now, let’s focus on dplyr and tidyr.

If you haven’t used tidyverse before, you need to install the package first.

install.packages("tidyverse")
install.packages("dplyr")
install.packages("tidyr")

Next, don’t forget to load your packages using the library() function. Loading tidyverse already attaches dplyr and tidyr, so the last two lines are optional.

library("tidyverse")
library("dplyr")
library("tidyr")

Now, you’re all set.

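Before jumping into the functions, let’s meet a funny-looking friend: the pipe (%>% / |>).

It comes in two forms: %>% from the magrittr package (loaded with tidyverse) and |>, the native pipe built into R since version 4.1.0. They look like this: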

%>%
|>

In my humble opinion, this is one of the greatest analogies of this century. Although it doesn’t look like a real pipe, it functions like one: it takes whatever comes out on its left and feeds it into the function on its right.
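To make the pipe idea concrete before we touch real data, here is a minimal sketch with plain numbers. The vector x and the variable names are made up for illustration. It assumes the magrittr package is installed (it comes along with tidyverse) and, for the |> lines, R 4.1.0 or later.

```r
library(magrittr)  # provides %>%; also attached by library(tidyverse)

x <- c(1, 4, 9)  # made-up numbers for illustration

# Nested call: you have to read it from the inside out
nested <- sum(sqrt(x))

# Piped call: read from left to right, one step at a time
piped <- x %>% sqrt() %>% sum()

# The native pipe |> behaves the same way in this simple case
native <- x |> sqrt() |> sum()

# Extra arguments go after the piped value: this is round(3.14159, 2)
rounded <- 3.14159 %>% round(2)
```

All three versions compute the same thing. The piped ones just let you read the steps in the order they happen.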

Let’s use an example to see how it works.

4.1 Pipe example

We use the babynames package mentioned above to demonstrate the use of the pipe.

We can take a look at the data again.

babynames
## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ℹ 1,924,655 more rows

Now, let’s say we want to see the names of babies born in the year 2000.

babynames %>%
    filter(year == 2000)
## # A tibble: 29,769 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130 
##  2  2000 F     Hannah    23080 0.0116 
##  3  2000 F     Madison   19967 0.0100 
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 29,759 more rows

Here you’ll see all the baby names from 2000!

In the code, I used the pipe (%>%) to direct my dataset (babynames) into the next function, filter().

Inside the filter() function, I specified that I wanted to see only babies born in the year 2000. In other words, filter() kept only the rows whose year column equals 2000 and dropped all the others.
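If you are curious what happens under the hood, the sketch below repeats the same step on a tiny table. The name sample_names is made up; its four rows are copied from the babynames printouts above so the example runs without loading the full dataset. It assumes dplyr is installed.

```r
library(dplyr)

# Four rows copied from the printed babynames output above
sample_names <- tibble(
  year = c(1880, 1880, 2000, 2000),
  sex  = c("F", "F", "F", "F"),
  name = c("Mary", "Anna", "Emily", "Hannah"),
  n    = c(7065L, 2604L, 25953L, 23080L)
)

# The condition is checked for every row and gives TRUE or FALSE
is_2000 <- sample_names$year == 2000

# filter() keeps the rows where the condition is TRUE
piped <- sample_names %>% filter(year == 2000)

# The pipe hands sample_names to filter() as its first argument,
# so this direct call does the same thing
direct <- filter(sample_names, year == 2000)
```

Here piped and direct both hold the Emily and Hannah rows. The pipe only changes how the code reads, not what it does.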

The great thing about the pipe (%>% / |>) is that you can chain multiple operations together. Just like water flows from east to west, you can direct your data from one operation to the next.

Let’s try it out.

babynames %>%
    filter(year == 2000) %>%
    filter(sex == "F")
## # A tibble: 17,653 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130 
##  2  2000 F     Hannah    23080 0.0116 
##  3  2000 F     Madison   19967 0.0100 
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 17,643 more rows

Here, we put "F" inside quotes because it’s a character string, not a number. We need to specify this so that R understands what we mean.

Now, the data shows only female babies born in the year 2000.

We can also achieve this result with a single filter() call by joining the two conditions with & (which means "and").

babynames %>%
    filter(year == 2000 & sex == "F")
## # A tibble: 17,653 × 5
##     year sex   name          n    prop
##    <dbl> <chr> <chr>     <int>   <dbl>
##  1  2000 F     Emily     25953 0.0130 
##  2  2000 F     Hannah    23080 0.0116 
##  3  2000 F     Madison   19967 0.0100 
##  4  2000 F     Ashley    17997 0.00902
##  5  2000 F     Sarah     17697 0.00887
##  6  2000 F     Alexis    17629 0.00884
##  7  2000 F     Samantha  17266 0.00866
##  8  2000 F     Jessica   15709 0.00787
##  9  2000 F     Elizabeth 15094 0.00757
## 10  2000 F     Taylor    15078 0.00756
## # ℹ 17,643 more rows
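There is a third spelling worth knowing: filter() also accepts several conditions separated by commas and treats them like &. The sketch below compares all three versions on a toy table. The table toy and its rows are invented for illustration and are not taken from babynames. It assumes dplyr is installed.

```r
library(dplyr)

# Toy rows invented for illustration (not real babynames data)
toy <- tibble(
  year = c(1999, 2000, 2000, 2000),
  sex  = c("F", "F", "M", "F"),
  name = c("Ann", "Bea", "Cal", "Dee")
)

# Two conditions joined with &
with_and <- toy %>% filter(year == 2000 & sex == "F")

# Two conditions separated by a comma (treated like &)
with_comma <- toy %>% filter(year == 2000, sex == "F")

# Two filter() calls chained with the pipe
two_steps <- toy %>%
    filter(year == 2000) %>%
    filter(sex == "F")
```

All three keep the same two rows (Bea and Dee), so which one you use is mostly a matter of taste and readability.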

Let’s explore more data transformations in the next chapter.