Chapter 3 Data Import and Reading
Let’s import our data!
3.1 Import Your Data
Note: If this is your first time importing data and your data is in an Excel file, remember to run install.packages("readxl") in your coding space first.
There are, of course, tons of ways to import your data, but the following is what I use most often.
I click “Import Dataset” in the top-right corner of the Environment box. Then I choose “Import”, select “From Excel” (or “From Text” if I have a .csv file), and use “Browse” in the top-right corner to pick the file I want to import.
You can either double-click the file or click it once and then click “Open” at the bottom-right.
The middle part is the Data Preview section, where you can see your data. Don’t import the wrong file! (I’ve done that multiple times, especially when I’m lazy about giving files proper names 😅)
Next, in the bottom-right corner, you’ll see a section called Code Preview.
The code may look something like this (but not identical, because, hopefully, we’re using different file names).
library(readxl)
weather_math <- read_excel("~/Documents/R_tutorials/example files/weather_math.xlsx")
View(weather_math)
Once you’ve found the Code Preview section, look at the end of it: there’s a tiny icon that looks like a clipboard. What do you do? CLICK IT!
Great Job!!
Now you’ve copied the code necessary to import your file. Press Import in the bottom-right.
Go back to the coding section (the top-left section of the interface) and paste the code there (Ctrl/Command + V).
(Let’s hope RStudio never changes its default interface, or I’ll have to rewrite this part. 😅)
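If the file is a .csv rather than an Excel workbook, the copied code calls a text reader instead of read_excel(). As a rough sketch, the following writes a tiny made-up CSV to a temporary file and reads it back with base R's read.csv(). The temporary file and the name demo_csv are placeholders for illustration only, and the code RStudio generates for you may use a different reader function.

```r
# Write a small, made-up CSV to a temporary file (a stand-in for your own file).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,day,score", "1,sunny,9", "2,sunny,8", "3,sunny,7"), csv_path)

# Read the file into a data frame. With your own data, replace csv_path
# with the path to your file.
demo_csv <- read.csv(csv_path)

# Check the result: column names and number of rows.
names(demo_csv)
nrow(demo_csv)
```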
Also, you can use head(), for example head(weather_math), to see the first few rows of your data, just to make sure everything was imported correctly.
## # A tibble: 6 × 3
## id day score
## <dbl> <chr> <dbl>
## 1 1 sunny 9
## 2 2 sunny 8
## 3 3 sunny 7
## 4 4 sunny 8
## 5 5 sunny 8
## 6 6 sunny 8
Or you could just type the data’s name (here, weather_math) and run it.
## # A tibble: 20 × 3
## id day score
## <dbl> <chr> <dbl>
## 1 1 sunny 9
## 2 2 sunny 8
## 3 3 sunny 7
## 4 4 sunny 8
## 5 5 sunny 8
## 6 6 sunny 8
## 7 7 sunny 6
## 8 8 sunny 5
## 9 9 sunny 3
## 10 10 sunny 4
## 11 1 rainy 1
## 12 2 rainy 3
## 13 3 rainy 5
## 14 4 rainy 7
## 15 5 rainy 6
## 16 6 rainy 5
## 17 7 rainy 6
## 18 8 rainy 7
## 19 9 rainy 5
## 20 10 rainy 3
3.1.1 Weather Math Data Explanation
Let’s take a peek at this data, so you’ll have a better grasp of our future examples.
## # A tibble: 20 × 3
## id day score
## <dbl> <chr> <dbl>
## 1 1 sunny 9
## 2 2 sunny 8
## 3 3 sunny 7
## 4 4 sunny 8
## 5 5 sunny 8
## 6 6 sunny 8
## 7 7 sunny 6
## 8 8 sunny 5
## 9 9 sunny 3
## 10 10 sunny 4
## 11 1 rainy 1
## 12 2 rainy 3
## 13 3 rainy 5
## 14 4 rainy 7
## 15 5 rainy 6
## 16 6 rainy 5
## 17 7 rainy 6
## 18 8 rainy 7
## 19 9 rainy 5
## 20 10 rainy 3
This is made-up data. Ten imaginary people participated in this imaginary research. They took a math test on both a rainy day and a sunny day, and their scores were recorded. The first column represents the participant number, the second column shows the weather (either sunny or rainy), and the third column contains their scores.
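If you don't have the Excel file at hand, you can rebuild the same data directly in R and still follow along. This is a minimal sketch using a base R data.frame(); the values are copied from the printout above, and the result prints slightly differently from the tibble that read_excel() returns.

```r
# Rebuild the example data by hand (values copied from the printout above).
weather_math <- data.frame(
  id    = rep(1:10, times = 2),                 # participants 1-10, twice
  day   = rep(c("sunny", "rainy"), each = 10),  # 10 sunny rows, then 10 rainy
  score = c(9, 8, 7, 8, 8, 8, 6, 5, 3, 4,       # sunny-day scores
            1, 3, 5, 7, 6, 5, 6, 7, 5, 3)       # rainy-day scores
)

# Each weather condition should have one row per participant.
table(weather_math$day)
```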
3.2 Dataset Packages
3.2.1 babynames
There are various data packages available in R, and we can easily access them by installing the appropriate package. For example, we can install a package called babynames.
This package contains baby names in the US from 1880 to 2017.
Let’s take a look.
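The code behind the output below is not shown, but it was presumably something along these lines. This is a sketch that assumes the package is installed from CRAN; the requireNamespace() check is an added safeguard so the snippet does not fail when the package is missing.

```r
# install.packages("babynames")  # run once if the package is not installed yet

if (requireNamespace("babynames", quietly = TRUE)) {
  library(babynames)
  print(babynames)    # print the first rows in the Console
  # View(babynames)   # or open the data in a separate tab (RStudio only)
} else {
  message("Package 'babynames' is not installed.")
}
```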
## # A tibble: 1,924,665 × 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # ℹ 1,924,655 more rows
RStudio creates a separate page for the data when we use View(). We can see that there are 5 columns: year, sex, name, n, and prop.
Wait, what is prop?
When we encounter questions like this, we can type ?dataset or ?function in the Console. In this case, we type ?babynames.
After hitting Enter, the Help pane on the right side will display some information. Upon reading the explanation of this dataset, we learn that prop is n divided by the total number of babies of that sex born in that year. In other words, it is the proportion of babies of that sex who were given that name in that year.
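Since prop is a proportion, dividing n by prop recovers the total it was computed from. Here is a rough check using the first row of the printout (Mary, 1880); the result is only approximate because the printed prop is rounded.

```r
n_mary    <- 7065     # babies named Mary (F) in 1880, from the printout
prop_mary <- 0.0724   # printed proportion, rounded to three significant digits

# Implied total behind the proportion: n / prop
implied_total <- n_mary / prop_mary
round(implied_total)
```

The result is close to 100,000, far more than the 7,065 Marys, which suggests the denominator covers all female babies recorded that year rather than only those named Mary.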
3.2.2 starwars
There are also built-in datasets within the dplyr package. For example, there is a dataset called starwars.
We can access it with just a few lines of code, simply by loading the dplyr package, where the starwars dataset is stored.
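The original code is not shown, but a sketch along these lines should produce the output below, assuming dplyr is already installed. The requireNamespace() check is an added safeguard, not part of what you strictly need.

```r
# install.packages("dplyr")  # run once if the package is not installed yet

if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)     # the starwars dataset ships with dplyr
  print(starwars)    # print the first rows in the Console
} else {
  message("Package 'dplyr' is not installed.")
}
```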
## # A tibble: 87 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Skyw… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
## 3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
## 4 Darth Vad… 202 136 none white yellow 41.9 male mascu…
## 5 Leia Orga… 150 49 brown light brown 19 fema… femin…
## 6 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 7 Beru Whit… 165 75 brown light blue 47 fema… femin…
## 8 R5-D4 97 32 <NA> white, red red NA none mascu…
## 9 Biggs Dar… 183 84 black light brown 24 male mascu…
## 10 Obi-Wan K… 182 77 auburn, w… fair blue-gray 57 male mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
3.3 Ways to Check Data
These functions are commonly used for inspecting datasets.
View() opens a new page to display the dataset.
glimpse(), head(), tail(), dim(), and slice() provide different ways to explore data.
You can also check the dimensions of your dataset using ncol() and nrow(). Both can be displayed at once by using dim().
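To see these functions side by side, here is a short sketch. It rebuilds the weather math data as a plain data.frame so that it runs on its own. glimpse() and slice() come from dplyr, so those two calls only run if that package is installed, and View() is left commented out because it needs RStudio or another interactive viewer.

```r
# Stand-in for weather_math so the snippet runs on its own.
weather_math <- data.frame(
  id    = rep(1:10, times = 2),
  day   = rep(c("sunny", "rainy"), each = 10),
  score = c(9, 8, 7, 8, 8, 8, 6, 5, 3, 4,
            1, 3, 5, 7, 6, 5, 6, 7, 5, 3)
)

head(weather_math)      # first 6 rows (the default)
tail(weather_math, 3)   # last 3 rows
nrow(weather_math)      # number of rows
ncol(weather_math)      # number of columns
dim(weather_math)       # rows and columns together

# View(weather_math)    # opens the data in a separate tab (interactive use)

# glimpse() and slice() are dplyr functions.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(weather_math)       # one line per column, with its type
  dplyr::slice(weather_math, 11:13)  # rows 11 to 13, selected by position
}
```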