Chapter 5 Data Cleaning and Transformation

Here is a dataset for superheros that I downloaded from here.

library(readr)
setwd  ("~/Documents/R_tutorials")
heroes <- read_delim("example files/heroes.csv", 
    delim = ";", escape_double = FALSE, trim_ws = TRUE)
View(heroes)

Note that the spreadsheet is in .csv format. Therefore, when importing data, we should choose From Text(readr) instead of From Text(Excel).

However, as you may notice from the Preview, the data is not ideal – it appears as a single row, with all the information packed into one.

To resolve this, we can click on Delimiter and select Semicolon, telling RStudio that the columns in this dataset are separated by semicolons (;) rather than commas or tabs.

5.1 select(), filter(), & mutate()

Once the data is properly imported, we can see that it contains information about many superheroes. If we want a clearer view to focus on the data that we need for analysis, we can use select() function to choose specific columns.

heroes |>
  select(c("Name", "Height", "Weight", "Gender", "Eye color", "Hair color", "Strength", "Intelligence"))
## # A tibble: 735 × 8
##    Name        Height Weight Gender `Eye color` `Hair color` Strength Intelligence
##    <chr>        <dbl>  <dbl> <chr>  <chr>       <chr>           <dbl> <chr>       
##  1 A-Bomb        203.  442.  M      Yellow      No Hair           100 moderate    
##  2 Abraxas        NA    NA   M      Blue        Black             100 high        
##  3 Abomination   203.  442.  M      Green       No Hair            80 good        
##  4 Adam Monroe    NA    NA   M      Blue        Blond              10 good        
##  5 Agent 13      173.   61.0 F      Blue        Blond              NA <NA>        
##  6 Air-Walker    189.  108.  M      Blue        White              85 average     
##  7 Agent Bob     178.   81.4 M      Brown       Brown              10 low         
##  8 Abe Sapien    191.   65.4 M      Blue        No Hair            30 high        
##  9 Abin Sur      186.   90.9 M      Blue        No Hair            90 average     
## 10 Angela         NA    NA   F      <NA>        <NA>              100 high        
## # ℹ 725 more rows

Now, if we want to filter for superheroes with high intelligence, we can use the filter() function.

This function allows us to extract rows that meet specified conditions.

heroes |>
  filter(Intelligence == "high")
## # A tibble: 144 × 12
##    Name   Identity `Birth place` Publisher Height Weight Gender `First appearance`
##    <chr>  <chr>    <chr>         <chr>      <dbl>  <dbl> <chr>               <dbl>
##  1 Abrax… Abraxas  Within Etern… Marvel C…   NA     NA   M                      NA
##  2 Abe S… Abraham… <NA>          Dark Hor…  191.    65.4 M                    1993
##  3 Angela <NA>     <NA>          Image Co…   NA     NA   F                      NA
##  4 Yoda   Yoda     <NA>          George L…   66.3   17.0 M                    1980
##  5 Zatan… Zatanna… <NA>          DC Comics  170.    57.8 F                      NA
##  6 Yello… Hank Pym Elmsford, Ne… Marvel C…  183.    83.7 M                      NA
##  7 X-Man  Nate Gr… American Nor… Marvel C…  176.    61.8 M                    1995
##  8 Wonde… Diana P… Themyscira    DC Comics  183.    74.7 F                    1941
##  9 Watch… Uatu     <NA>          Marvel C…   NA     NA   M                    1963
## 10 Warlo… Adam Wa… The Beehive,… Marvel C…  188.   108.  M                      NA
## # ℹ 134 more rows
## # ℹ 4 more variables: `Eye color` <chr>, `Hair color` <chr>, Strength <dbl>,
## #   Intelligence <chr>

In fact, select() and filter() can be combined.

Let’s look for female superheroes with strength over 50 and high intelligence levels.

heroes |>
  select(c("Name", "Gender", "Strength", "Intelligence")) |>
  filter(Gender == "F" & Strength >= 50 & Intelligence == "high")
## # A tibble: 18 × 4
##    Name            Gender Strength Intelligence
##    <chr>           <chr>     <dbl> <chr>       
##  1 Angela          F           100 high        
##  2 Wonder Woman    F           100 high        
##  3 Valkyrie        F            95 high        
##  4 Supergirl       F           100 high        
##  5 Silk Spectre II F            55 high        
##  6 She-Hulk        F           100 high        
##  7 Power Girl      F           100 high        
##  8 Phoenix         F           100 high        
##  9 Lady Bullseye   F            75 high        
## 10 Jean Grey       F            80 high        
## 11 Granny Goodness F           100 high        
## 12 Giganta         F            90 high        
## 13 Faora           F            95 high        
## 14 Donna Troy      F            95 high        
## 15 Cheetah III     F           100 high        
## 16 Cat             F            90 high        
## 17 Captain Marvel  F            90 high        
## 18 Big Barda       F           100 high

If we want to adjust this filter to include superheroes with either good or high intelligence, there are two ways to do it.

The first method is to use the | operator (which means “or”). The second method is to use the %in% operator, which selects data that fits either of the specified conditions.

# method 1
heroes |>
  select(c("Name", "Gender", "Strength", "Intelligence")) |>
  filter(Gender == "F" & Strength >= 50) |>
  filter(Intelligence == "high" | Intelligence == "good")
## # A tibble: 47 × 4
##    Name         Gender Strength Intelligence
##    <chr>        <chr>     <dbl> <chr>       
##  1 Angela       F           100 high        
##  2 Wonder Girl  F            90 good        
##  3 Wonder Woman F           100 high        
##  4 Valkyrie     F            95 high        
##  5 Vindicator   F            65 good        
##  6 Thor Girl    F            85 good        
##  7 T-X          F            65 good        
##  8 Supergirl    F           100 high        
##  9 Stargirl     F            80 good        
## 10 Spider-Gwen  F            55 good        
## # ℹ 37 more rows
# method 2
heroes |>
  select(c("Name", "Gender", "Strength", "Intelligence")) |>
  filter(Gender == "F" & Strength >= 50) |>
  filter(Intelligence %in% c("high", "good"))
## # A tibble: 47 × 4
##    Name         Gender Strength Intelligence
##    <chr>        <chr>     <dbl> <chr>       
##  1 Angela       F           100 high        
##  2 Wonder Girl  F            90 good        
##  3 Wonder Woman F           100 high        
##  4 Valkyrie     F            95 high        
##  5 Vindicator   F            65 good        
##  6 Thor Girl    F            85 good        
##  7 T-X          F            65 good        
##  8 Supergirl    F           100 high        
##  9 Stargirl     F            80 good        
## 10 Spider-Gwen  F            55 good        
## # ℹ 37 more rows

Both methods give us the same result. While the first method is more straightforward, I personally feel cooler using the second one😎.

Now, let’s calculate the superheroes’ Body Mass Index (BMI). The formula is:

\[ BMI = weight(kg)/height(m)^2 \]

So, we need to :

  • Add a new column that converts height from centimeters to meters.

  • Square the height in meters.

  • Divide weight by the square of the height.

heroes |>
  mutate(Height_m = Height / 100) |> # convert heights from cm to m
  mutate(Height_m2 = Height_m*Height_m) |> # square of the heights
  mutate(BMI = Weight / Height_m2) # divide weights by the squrare of the heights
## # A tibble: 735 × 15
##    Name   Identity `Birth place` Publisher Height Weight Gender `First appearance`
##    <chr>  <chr>    <chr>         <chr>      <dbl>  <dbl> <chr>               <dbl>
##  1 A-Bomb Richard… Scarsdale, A… Marvel C…   203.  442.  M                    2008
##  2 Abrax… Abraxas  Within Etern… Marvel C…    NA    NA   M                      NA
##  3 Abomi… Emil Bl… Zagreb, Yugo… Marvel C…   203.  442.  M                      NA
##  4 Adam … <NA>     <NA>          NBC - He…    NA    NA   M                      NA
##  5 Agent… Sharon … <NA>          Marvel C…   173.   61.0 F                      NA
##  6 Air-W… Gabriel… Xandar, a pl… Marvel C…   189.  108.  M                      NA
##  7 Agent… Bob      <NA>          Marvel C…   178.   81.4 M                    2007
##  8 Abe S… Abraham… <NA>          Dark Hor…   191.   65.4 M                    1993
##  9 Abin … <NA>     Ungara        DC Comics   186.   90.9 M                    1959
## 10 Angela <NA>     <NA>          Image Co…    NA    NA   F                      NA
## # ℹ 725 more rows
## # ℹ 7 more variables: `Eye color` <chr>, `Hair color` <chr>, Strength <dbl>,
## #   Intelligence <chr>, Height_m <dbl>, Height_m2 <dbl>, BMI <dbl>

We can either break this down into multiple steps or, for simplicity, combine everything into one line of code.

heroes |>
  mutate(BMI = Weight / (Height/100)^2) 
## # A tibble: 735 × 13
##    Name   Identity `Birth place` Publisher Height Weight Gender `First appearance`
##    <chr>  <chr>    <chr>         <chr>      <dbl>  <dbl> <chr>               <dbl>
##  1 A-Bomb Richard… Scarsdale, A… Marvel C…   203.  442.  M                    2008
##  2 Abrax… Abraxas  Within Etern… Marvel C…    NA    NA   M                      NA
##  3 Abomi… Emil Bl… Zagreb, Yugo… Marvel C…   203.  442.  M                      NA
##  4 Adam … <NA>     <NA>          NBC - He…    NA    NA   M                      NA
##  5 Agent… Sharon … <NA>          Marvel C…   173.   61.0 F                      NA
##  6 Air-W… Gabriel… Xandar, a pl… Marvel C…   189.  108.  M                      NA
##  7 Agent… Bob      <NA>          Marvel C…   178.   81.4 M                    2007
##  8 Abe S… Abraham… <NA>          Dark Hor…   191.   65.4 M                    1993
##  9 Abin … <NA>     Ungara        DC Comics   186.   90.9 M                    1959
## 10 Angela <NA>     <NA>          Image Co…    NA    NA   F                      NA
## # ℹ 725 more rows
## # ℹ 5 more variables: `Eye color` <chr>, `Hair color` <chr>, Strength <dbl>,
## #   Intelligence <chr>, BMI <dbl>

Either approach will give the same result, so you can shoose whichever suits you best.

Tip: By default, new columns appear on the right, but if you prefer them on the left (to make it easier to check when there are many columns), you can specify .after = "column_name" in mutate() to position the new column immediately after a chosen column.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .before = 1) 
## # A tibble: 735 × 13
##      BMI Name        Identity         `Birth place` Publisher Height Weight Gender
##    <dbl> <chr>       <chr>            <chr>         <chr>      <dbl>  <dbl> <chr> 
##  1 107.  A-Bomb      Richard Milhous… Scarsdale, A… Marvel C…   203.  442.  M     
##  2  NA   Abraxas     Abraxas          Within Etern… Marvel C…    NA    NA   M     
##  3 107.  Abomination Emil Blonsky     Zagreb, Yugo… Marvel C…   203.  442.  M     
##  4  NA   Adam Monroe <NA>             <NA>          NBC - He…    NA    NA   M     
##  5  20.3 Agent 13    Sharon Carter    <NA>          Marvel C…   173.   61.0 F     
##  6  30.4 Air-Walker  Gabriel Lan      Xandar, a pl… Marvel C…   189.  108.  M     
##  7  25.6 Agent Bob   Bob              <NA>          Marvel C…   178.   81.4 M     
##  8  17.9 Abe Sapien  Abraham Sapien   <NA>          Dark Hor…   191.   65.4 M     
##  9  26.4 Abin Sur    <NA>             Ungara        DC Comics   186.   90.9 M     
## 10  NA   Angela      <NA>             <NA>          Image Co…    NA    NA   F     
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>

If, for example, I want the new BMI column to appear immediately after the Name column, I can write .after = "Name" to achieve that.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .after = "Name")
## # A tibble: 735 × 13
##    Name          BMI Identity         `Birth place` Publisher Height Weight Gender
##    <chr>       <dbl> <chr>            <chr>         <chr>      <dbl>  <dbl> <chr> 
##  1 A-Bomb      107.  Richard Milhous… Scarsdale, A… Marvel C…   203.  442.  M     
##  2 Abraxas      NA   Abraxas          Within Etern… Marvel C…    NA    NA   M     
##  3 Abomination 107.  Emil Blonsky     Zagreb, Yugo… Marvel C…   203.  442.  M     
##  4 Adam Monroe  NA   <NA>             <NA>          NBC - He…    NA    NA   M     
##  5 Agent 13     20.3 Sharon Carter    <NA>          Marvel C…   173.   61.0 F     
##  6 Air-Walker   30.4 Gabriel Lan      Xandar, a pl… Marvel C…   189.  108.  M     
##  7 Agent Bob    25.6 Bob              <NA>          Marvel C…   178.   81.4 M     
##  8 Abe Sapien   17.9 Abraham Sapien   <NA>          Dark Hor…   191.   65.4 M     
##  9 Abin Sur     26.4 <NA>             Ungara        DC Comics   186.   90.9 M     
## 10 Angela       NA   <NA>             <NA>          Image Co…    NA    NA   F     
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## #   `Hair color` <chr>, Strength <dbl>, Intelligence <chr>

Tip: Instead of using select() to manually keep only the columns you’ve used during mutate() process, you can use .keep = "used". This tells RStudio to retain only the columns involved in the mutation process.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .keep = "used") 
## # A tibble: 735 × 3
##    Height Weight   BMI
##     <dbl>  <dbl> <dbl>
##  1   203.  442.  107. 
##  2    NA    NA    NA  
##  3   203.  442.  107. 
##  4    NA    NA    NA  
##  5   173.   61.0  20.3
##  6   189.  108.   30.4
##  7   178.   81.4  25.6
##  8   191.   65.4  17.9
##  9   186.   90.9  26.4
## 10    NA    NA    NA  
## # ℹ 725 more rows

Now, I want to reflect on a personal experience from elementary school. After annual health exams, students were categorized based on their BMI, and teachers would tell those who were categorized as “overweight”, “obese”, or “underweight” to be cautious about their BMI.

I’m not going to dive into the mental toll this caused me as a child who often got labeled “overweight”, but let’s use this as an example to categorize numerical data into groups in RStudio.

According to Wikipedia, BMI categories are (simply) defined as follows:

BMI Category
< 18.5 underweight (UW)
18.5 - 24.9 (inclusive) normal weight (NW)
25.0 - 29.9 (inclusive) overweight (OW)
>= 30 obese (OB)

Now, let’s categorize the superheroes based on their BMI values.

We will use mutate() to create a new column for their BMI category.

heroes |>
  mutate(BMI = Weight / (Height/100)^2, .after = "Name") |> 
  mutate(BMI_status = case_when(BMI < 18.5 ~ "underweight",
                                BMI >= 18.5 & BMI <= 24.9 ~ "normal_weight",
                                BMI >= 25 & BMI <= 29.9 ~ "overweight",
                                BMI >= 30 ~ "obese"), .after = "BMI") -> heroes2
heroes2 |>
  count(BMI_status)
## # A tibble: 5 × 2
##   BMI_status        n
##   <chr>         <int>
## 1 normal_weight   201
## 2 obese           127
## 3 overweight      114
## 4 underweight      41
## 5 <NA>            252

Once this is done, we can see how many superheroes are categorized as obese.

5.2 dealing with omission

You may notice a row with <NA> values, which occurs when the data is incomplete. If we don’t want to include rows with missing BMI values in our analysis, we can remove them by using na.omit(). This makes further analysis much easier.

heroes2 |>
  na.omit() |>
  count(BMI_status) 
## # A tibble: 4 × 2
##   BMI_status        n
##   <chr>         <int>
## 1 normal_weight    48
## 2 obese            40
## 3 overweight       40
## 4 underweight      13

You can also update your dataset to exclude missing values. Here, I assigned the cleaner version of my data: heroes2.

heroes2 |>
  na.omit() -> heroes2

If you want to further categorize the data, such as determining how many male and female superheroes are categorized as obese or underweight, you can do so by adding another column to the count() function.

heroes2 |>
  count(Gender, BMI_status)
## # A tibble: 8 × 3
##   Gender BMI_status        n
##   <chr>  <chr>         <int>
## 1 F      normal_weight    26
## 2 F      obese             1
## 3 F      overweight        1
## 4 F      underweight      12
## 5 M      normal_weight    22
## 6 M      obese            39
## 7 M      overweight       39
## 8 M      underweight       1

Alternatively, you can use group_by(), which we will cover in a future chapters, to achieve the same result.

heroes2 |>
  group_by(Gender) |>
  count(BMI_status)
## # A tibble: 8 × 3
## # Groups:   Gender [2]
##   Gender BMI_status        n
##   <chr>  <chr>         <int>
## 1 F      normal_weight    26
## 2 F      obese             1
## 3 F      overweight        1
## 4 F      underweight      12
## 5 M      normal_weight    22
## 6 M      obese            39
## 7 M      overweight       39
## 8 M      underweight       1

5.3 change column names

Lastly, let’s discuss renaming columns. You don’t need to go back to the original .csv or .xlsx files to change column names!

colnames()

If we want to change the Hair color column to Hair_color and Eye color colun to Eye_color, we can do this directly in RStudio using the colnames() function.

We can use colnames() to modify column names in our datasets by assigning new names to the existing ones.

colnames(data)[colnames(data) == "old column name"] <- "new column name"
colnames(heroes2)[colnames(heroes2) == "Hair color"] <- "Hair_color"
colnames(heroes2)[colnames(heroes2) == "Eye color"] <- "Eye_color"

Let’s check if we achieved our goal.

heroes2 |>
  select(c(Hair_color, Eye_color))
## # A tibble: 141 × 2
##    Hair_color Eye_color
##    <chr>      <chr>    
##  1 No Hair    Yellow   
##  2 Black      Brown    
##  3 No Hair    Blue     
##  4 Brown      Blue     
##  5 Blond      Blue     
##  6 Black      Blue     
##  7 Black      Brown    
##  8 Brown      Brown    
##  9 Black      Blue     
## 10 Black      Amber    
## # ℹ 131 more rows

From the results, we can see that the column names have been changed as we desired.

rename()

rename() is another way to achieve the same result. It is a function in dplyr. Therefore, make sure to have library(dplyr) before using this.

library(dplyr)

The function should be written as follows:

data |>
  rename("New column name 1" = "Old column name 1",
         "New column name 2" = "Old column name 2")

Make sure to type the old column names accurately, so that R can identify and rename the correct columns.

heroes |>
  rename("Birth_place" = "Birth place", "First_app" = "First appearance") -> heroes

Let us select both columns to make sure we did the right move.

heroes |>
  select(c(Birth_place, First_app))
## # A tibble: 735 × 2
##    Birth_place                                             First_app
##    <chr>                                                       <dbl>
##  1 Scarsdale, Arizona                                           2008
##  2 Within Eternity                                                NA
##  3 Zagreb, Yugoslavia                                             NA
##  4 <NA>                                                           NA
##  5 <NA>                                                           NA
##  6 Xandar, a planet in the Tranta system, Andromeda galaxy        NA
##  7 <NA>                                                         2007
##  8 <NA>                                                         1993
##  9 Ungara                                                       1959
## 10 <NA>                                                           NA
## # ℹ 725 more rows

We can see from the result that the original column Birth place was changed into Birth_place, and the First appearance became First_app.

You might wonder, if we can change column names using colnames(), why not also learn how to change row names? The good news is that you can! Instead of introducing a new function like rownames(), why not use something we already know? For example, mutate() can be a great way to change row names.

In my opinion, it’s not always necessary to keep adding new functions; by mastering a few that you like, you can create many different possibilities.

However, modifying row names can be a bit risky since you’re altering the entries of your data. It’s important to clearly understand the purpose of this change and to avoid making arbitrary modifications.