Chapter 5 Data Cleaning and Transformation
Here is a dataset for superheros that I downloaded from here.
library(readr)
setwd ("~/Documents/R_tutorials")
heroes <- read_delim("example files/heroes.csv",
delim = ";", escape_double = FALSE, trim_ws = TRUE)
View(heroes)
Note that the spreadsheet is in .csv
format. Therefore, when importing data, we should choose From Text(readr)
instead of From Text(Excel)
.
However, as you may notice from the Preview
, the data is not ideal – it appears as a single row, with all the information packed into one.
To resolve this, we can click on Delimiter
and select Semicolon
, telling RStudio that the columns in this dataset are separated by semicolons (;
) rather than commas or tabs.
5.1 select(), filter(), & mutate()
Once the data is properly imported, we can see that it contains information about many superheroes. If we want a clearer view to focus on the data that we need for analysis, we can use select()
function to choose specific columns.
heroes |>
select(c("Name", "Height", "Weight", "Gender", "Eye color", "Hair color", "Strength", "Intelligence"))
## # A tibble: 735 × 8
## Name Height Weight Gender `Eye color` `Hair color` Strength Intelligence
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 A-Bomb 203. 442. M Yellow No Hair 100 moderate
## 2 Abraxas NA NA M Blue Black 100 high
## 3 Abomination 203. 442. M Green No Hair 80 good
## 4 Adam Monroe NA NA M Blue Blond 10 good
## 5 Agent 13 173. 61.0 F Blue Blond NA <NA>
## 6 Air-Walker 189. 108. M Blue White 85 average
## 7 Agent Bob 178. 81.4 M Brown Brown 10 low
## 8 Abe Sapien 191. 65.4 M Blue No Hair 30 high
## 9 Abin Sur 186. 90.9 M Blue No Hair 90 average
## 10 Angela NA NA F <NA> <NA> 100 high
## # ℹ 725 more rows
Now, if we want to filter for superheroes with high intelligence, we can use the filter()
function.
This function allows us to extract rows that meet specified conditions.
## # A tibble: 144 × 12
## Name Identity `Birth place` Publisher Height Weight Gender `First appearance`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 Abrax… Abraxas Within Etern… Marvel C… NA NA M NA
## 2 Abe S… Abraham… <NA> Dark Hor… 191. 65.4 M 1993
## 3 Angela <NA> <NA> Image Co… NA NA F NA
## 4 Yoda Yoda <NA> George L… 66.3 17.0 M 1980
## 5 Zatan… Zatanna… <NA> DC Comics 170. 57.8 F NA
## 6 Yello… Hank Pym Elmsford, Ne… Marvel C… 183. 83.7 M NA
## 7 X-Man Nate Gr… American Nor… Marvel C… 176. 61.8 M 1995
## 8 Wonde… Diana P… Themyscira DC Comics 183. 74.7 F 1941
## 9 Watch… Uatu <NA> Marvel C… NA NA M 1963
## 10 Warlo… Adam Wa… The Beehive,… Marvel C… 188. 108. M NA
## # ℹ 134 more rows
## # ℹ 4 more variables: `Eye color` <chr>, `Hair color` <chr>, Strength <dbl>,
## # Intelligence <chr>
In fact, select()
and filter()
can be combined.
Let’s look for female superheroes with strength over 50 and high intelligence levels.
heroes |>
select(c("Name", "Gender", "Strength", "Intelligence")) |>
filter(Gender == "F" & Strength >= 50 & Intelligence == "high")
## # A tibble: 18 × 4
## Name Gender Strength Intelligence
## <chr> <chr> <dbl> <chr>
## 1 Angela F 100 high
## 2 Wonder Woman F 100 high
## 3 Valkyrie F 95 high
## 4 Supergirl F 100 high
## 5 Silk Spectre II F 55 high
## 6 She-Hulk F 100 high
## 7 Power Girl F 100 high
## 8 Phoenix F 100 high
## 9 Lady Bullseye F 75 high
## 10 Jean Grey F 80 high
## 11 Granny Goodness F 100 high
## 12 Giganta F 90 high
## 13 Faora F 95 high
## 14 Donna Troy F 95 high
## 15 Cheetah III F 100 high
## 16 Cat F 90 high
## 17 Captain Marvel F 90 high
## 18 Big Barda F 100 high
If we want to adjust this filter to include superheroes with either good
or high
intelligence, there are two ways to do it.
The first method is to use the |
operator (which means “or”). The second method is to use the %in%
operator, which selects data that fits either of the specified conditions.
# method 1
heroes |>
select(c("Name", "Gender", "Strength", "Intelligence")) |>
filter(Gender == "F" & Strength >= 50) |>
filter(Intelligence == "high" | Intelligence == "good")
## # A tibble: 47 × 4
## Name Gender Strength Intelligence
## <chr> <chr> <dbl> <chr>
## 1 Angela F 100 high
## 2 Wonder Girl F 90 good
## 3 Wonder Woman F 100 high
## 4 Valkyrie F 95 high
## 5 Vindicator F 65 good
## 6 Thor Girl F 85 good
## 7 T-X F 65 good
## 8 Supergirl F 100 high
## 9 Stargirl F 80 good
## 10 Spider-Gwen F 55 good
## # ℹ 37 more rows
# method 2
heroes |>
select(c("Name", "Gender", "Strength", "Intelligence")) |>
filter(Gender == "F" & Strength >= 50) |>
filter(Intelligence %in% c("high", "good"))
## # A tibble: 47 × 4
## Name Gender Strength Intelligence
## <chr> <chr> <dbl> <chr>
## 1 Angela F 100 high
## 2 Wonder Girl F 90 good
## 3 Wonder Woman F 100 high
## 4 Valkyrie F 95 high
## 5 Vindicator F 65 good
## 6 Thor Girl F 85 good
## 7 T-X F 65 good
## 8 Supergirl F 100 high
## 9 Stargirl F 80 good
## 10 Spider-Gwen F 55 good
## # ℹ 37 more rows
Both methods give us the same result. While the first method is more straightforward, I personally feel cooler using the second one😎.
Now, let’s calculate the superheroes’ Body Mass Index (BMI). The formula is:
\[ BMI = weight(kg)/height(m)^2 \]
So, we need to :
Add a new column that converts height from centimeters to meters.
Square the height in meters.
Divide weight by the square of the height.
heroes |>
mutate(Height_m = Height / 100) |> # convert heights from cm to m
mutate(Height_m2 = Height_m*Height_m) |> # square of the heights
mutate(BMI = Weight / Height_m2) # divide weights by the squrare of the heights
## # A tibble: 735 × 15
## Name Identity `Birth place` Publisher Height Weight Gender `First appearance`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 A-Bomb Richard… Scarsdale, A… Marvel C… 203. 442. M 2008
## 2 Abrax… Abraxas Within Etern… Marvel C… NA NA M NA
## 3 Abomi… Emil Bl… Zagreb, Yugo… Marvel C… 203. 442. M NA
## 4 Adam … <NA> <NA> NBC - He… NA NA M NA
## 5 Agent… Sharon … <NA> Marvel C… 173. 61.0 F NA
## 6 Air-W… Gabriel… Xandar, a pl… Marvel C… 189. 108. M NA
## 7 Agent… Bob <NA> Marvel C… 178. 81.4 M 2007
## 8 Abe S… Abraham… <NA> Dark Hor… 191. 65.4 M 1993
## 9 Abin … <NA> Ungara DC Comics 186. 90.9 M 1959
## 10 Angela <NA> <NA> Image Co… NA NA F NA
## # ℹ 725 more rows
## # ℹ 7 more variables: `Eye color` <chr>, `Hair color` <chr>, Strength <dbl>,
## # Intelligence <chr>, Height_m <dbl>, Height_m2 <dbl>, BMI <dbl>
We can either break this down into multiple steps or, for simplicity, combine everything into one line of code.
## # A tibble: 735 × 13
## Name Identity `Birth place` Publisher Height Weight Gender `First appearance`
## <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
## 1 A-Bomb Richard… Scarsdale, A… Marvel C… 203. 442. M 2008
## 2 Abrax… Abraxas Within Etern… Marvel C… NA NA M NA
## 3 Abomi… Emil Bl… Zagreb, Yugo… Marvel C… 203. 442. M NA
## 4 Adam … <NA> <NA> NBC - He… NA NA M NA
## 5 Agent… Sharon … <NA> Marvel C… 173. 61.0 F NA
## 6 Air-W… Gabriel… Xandar, a pl… Marvel C… 189. 108. M NA
## 7 Agent… Bob <NA> Marvel C… 178. 81.4 M 2007
## 8 Abe S… Abraham… <NA> Dark Hor… 191. 65.4 M 1993
## 9 Abin … <NA> Ungara DC Comics 186. 90.9 M 1959
## 10 Angela <NA> <NA> Image Co… NA NA F NA
## # ℹ 725 more rows
## # ℹ 5 more variables: `Eye color` <chr>, `Hair color` <chr>, Strength <dbl>,
## # Intelligence <chr>, BMI <dbl>
Either approach will give the same result, so you can shoose whichever suits you best.
Tip: By default, new columns appear on the right, but if you prefer them on the left (to make it easier to check when there are many columns), you can specify
.after = "column_name"
inmutate()
to position the new column immediately after a chosen column.
## # A tibble: 735 × 13
## BMI Name Identity `Birth place` Publisher Height Weight Gender
## <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 107. A-Bomb Richard Milhous… Scarsdale, A… Marvel C… 203. 442. M
## 2 NA Abraxas Abraxas Within Etern… Marvel C… NA NA M
## 3 107. Abomination Emil Blonsky Zagreb, Yugo… Marvel C… 203. 442. M
## 4 NA Adam Monroe <NA> <NA> NBC - He… NA NA M
## 5 20.3 Agent 13 Sharon Carter <NA> Marvel C… 173. 61.0 F
## 6 30.4 Air-Walker Gabriel Lan Xandar, a pl… Marvel C… 189. 108. M
## 7 25.6 Agent Bob Bob <NA> Marvel C… 178. 81.4 M
## 8 17.9 Abe Sapien Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 9 26.4 Abin Sur <NA> Ungara DC Comics 186. 90.9 M
## 10 NA Angela <NA> <NA> Image Co… NA NA F
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>
If, for example, I want the new BMI
column to appear immediately after the Name
column, I can write .after = "Name"
to achieve that.
## # A tibble: 735 × 13
## Name BMI Identity `Birth place` Publisher Height Weight Gender
## <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 A-Bomb 107. Richard Milhous… Scarsdale, A… Marvel C… 203. 442. M
## 2 Abraxas NA Abraxas Within Etern… Marvel C… NA NA M
## 3 Abomination 107. Emil Blonsky Zagreb, Yugo… Marvel C… 203. 442. M
## 4 Adam Monroe NA <NA> <NA> NBC - He… NA NA M
## 5 Agent 13 20.3 Sharon Carter <NA> Marvel C… 173. 61.0 F
## 6 Air-Walker 30.4 Gabriel Lan Xandar, a pl… Marvel C… 189. 108. M
## 7 Agent Bob 25.6 Bob <NA> Marvel C… 178. 81.4 M
## 8 Abe Sapien 17.9 Abraham Sapien <NA> Dark Hor… 191. 65.4 M
## 9 Abin Sur 26.4 <NA> Ungara DC Comics 186. 90.9 M
## 10 Angela NA <NA> <NA> Image Co… NA NA F
## # ℹ 725 more rows
## # ℹ 5 more variables: `First appearance` <dbl>, `Eye color` <chr>,
## # `Hair color` <chr>, Strength <dbl>, Intelligence <chr>
Tip: Instead of using
select()
to manually keep only the columns you’ve used duringmutate()
process, you can use.keep = "used"
. This tells RStudio to retain only the columns involved in the mutation process.
## # A tibble: 735 × 3
## Height Weight BMI
## <dbl> <dbl> <dbl>
## 1 203. 442. 107.
## 2 NA NA NA
## 3 203. 442. 107.
## 4 NA NA NA
## 5 173. 61.0 20.3
## 6 189. 108. 30.4
## 7 178. 81.4 25.6
## 8 191. 65.4 17.9
## 9 186. 90.9 26.4
## 10 NA NA NA
## # ℹ 725 more rows
Now, I want to reflect on a personal experience from elementary school. After annual health exams, students were categorized based on their BMI, and teachers would tell those who were categorized as “overweight”, “obese”, or “underweight” to be cautious about their BMI.
I’m not going to dive into the mental toll this caused me as a child who often got labeled “overweight”, but let’s use this as an example to categorize numerical data into groups in RStudio.
According to Wikipedia, BMI categories are (simply) defined as follows:
BMI | Category |
---|---|
< 18.5 | underweight (UW) |
18.5 - 24.9 (inclusive) | normal weight (NW) |
25.0 - 29.9 (inclusive) | overweight (OW) |
>= 30 | obese (OB) |
Now, let’s categorize the superheroes based on their BMI values.
We will use mutate()
to create a new column for their BMI category.
heroes |>
mutate(BMI = Weight / (Height/100)^2, .after = "Name") |>
mutate(BMI_status = case_when(BMI < 18.5 ~ "underweight",
BMI >= 18.5 & BMI <= 24.9 ~ "normal_weight",
BMI >= 25 & BMI <= 29.9 ~ "overweight",
BMI >= 30 ~ "obese"), .after = "BMI") -> heroes2
## # A tibble: 5 × 2
## BMI_status n
## <chr> <int>
## 1 normal_weight 201
## 2 obese 127
## 3 overweight 114
## 4 underweight 41
## 5 <NA> 252
Once this is done, we can see how many superheroes are categorized as obese.
5.2 dealing with omission
You may notice a row with <NA>
values, which occurs when the data is incomplete. If we don’t want to include rows with missing BMI values in our analysis, we can remove them by using na.omit()
. This makes further analysis much easier.
## # A tibble: 4 × 2
## BMI_status n
## <chr> <int>
## 1 normal_weight 48
## 2 obese 40
## 3 overweight 40
## 4 underweight 13
You can also update your dataset to exclude missing values. Here, I assigned the cleaner version of my data: heroes2
.
If you want to further categorize the data, such as determining how many male and female superheroes are categorized as obese or underweight, you can do so by adding another column to the count()
function.
## # A tibble: 8 × 3
## Gender BMI_status n
## <chr> <chr> <int>
## 1 F normal_weight 26
## 2 F obese 1
## 3 F overweight 1
## 4 F underweight 12
## 5 M normal_weight 22
## 6 M obese 39
## 7 M overweight 39
## 8 M underweight 1
Alternatively, you can use group_by()
, which we will cover in a future chapters, to achieve the same result.
## # A tibble: 8 × 3
## # Groups: Gender [2]
## Gender BMI_status n
## <chr> <chr> <int>
## 1 F normal_weight 26
## 2 F obese 1
## 3 F overweight 1
## 4 F underweight 12
## 5 M normal_weight 22
## 6 M obese 39
## 7 M overweight 39
## 8 M underweight 1
5.3 change column names
Lastly, let’s discuss renaming columns. You don’t need to go back to the original .csv
or .xlsx
files to change column names!
colnames()
If we want to change the Hair color
column to Hair_color
and Eye color
colun to Eye_color
, we can do this directly in RStudio using the colnames()
function.
We can use colnames()
to modify column names in our datasets by assigning new names to the existing ones.
colnames(heroes2)[colnames(heroes2) == "Hair color"] <- "Hair_color"
colnames(heroes2)[colnames(heroes2) == "Eye color"] <- "Eye_color"
Let’s check if we achieved our goal.
## # A tibble: 141 × 2
## Hair_color Eye_color
## <chr> <chr>
## 1 No Hair Yellow
## 2 Black Brown
## 3 No Hair Blue
## 4 Brown Blue
## 5 Blond Blue
## 6 Black Blue
## 7 Black Brown
## 8 Brown Brown
## 9 Black Blue
## 10 Black Amber
## # ℹ 131 more rows
From the results, we can see that the column names have been changed as we desired.
rename()
rename()
is another way to achieve the same result. It is a function in dplyr
. Therefore, make sure to have library(dplyr) before using this.
The function should be written as follows:
data |>
rename("New column name 1" = "Old column name 1",
"New column name 2" = "Old column name 2")
Make sure to type the old column names accurately, so that R can identify and rename the correct columns.
Let us select both columns to make sure we did the right move.
## # A tibble: 735 × 2
## Birth_place First_app
## <chr> <dbl>
## 1 Scarsdale, Arizona 2008
## 2 Within Eternity NA
## 3 Zagreb, Yugoslavia NA
## 4 <NA> NA
## 5 <NA> NA
## 6 Xandar, a planet in the Tranta system, Andromeda galaxy NA
## 7 <NA> 2007
## 8 <NA> 1993
## 9 Ungara 1959
## 10 <NA> NA
## # ℹ 725 more rows
We can see from the result that the original column Birth place
was changed into Birth_place
, and the First appearance
became First_app
.
You might wonder, if we can change column names using
colnames()
, why not also learn how to change row names? The good news is that you can! Instead of introducing a new function likerownames()
, why not use something we already know? For example, mutate() can be a great way to change row names.In my opinion, it’s not always necessary to keep adding new functions; by mastering a few that you like, you can create many different possibilities.
However, modifying row names can be a bit risky since you’re altering the entries of your data. It’s important to clearly understand the purpose of this change and to avoid making arbitrary modifications.