Chapter 3 Data Import and Reading

Let’s import our data!

3.1 Import Your Data

Note: If this is your first time importing data and your data is in an Excel file, remember to run install.packages("readxl") first. You only need to install a package once, but you need to load it with library() in every new R session.

install.packages("readxl")

library(readxl)
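If you’re not sure whether readxl is already on your computer, you can check before installing. Here is a minimal sketch; needs_install() is a helper name I made up for illustration, not something that comes with R or readxl. The install line is commented out so nothing gets downloaded by accident.

```r
# Hypothetical helper (not part of base R or readxl):
# returns TRUE when a package still has to be installed.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

if (needs_install("readxl")) {
  # Uncomment the next line to download readxl from CRAN.
  # install.packages("readxl")
  message("readxl is not installed yet.")
} else {
  # Already installed, so we only need to load it.
  library(readxl)
}
```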

There are, of course, tons of ways to import your data, but the following is what I use most often.

I click “Import Dataset” in the top-right corner of the Environment box. Then choose “Import”, select “From Excel” (or “From Text” if I have a .csv file), and use “Browse” in the top-right corner to pick the file I want to import.

You can either double-click the file or click it once and then click “Open” at the bottom-right.

The middle part is the Data Preview section, where you can see your data. Don’t import the wrong file! (I’ve done that multiple times, especially when I’m lazy about giving files proper names 😅)

Next, in the bottom-right corner, you’ll see a section called Code Preview.

The code may look something like this (but not identical, because, hopefully, we’re using different file names).

library(readxl)
weather_math <- read_excel("~/Documents/R_tutorials/example files/weather_math.xlsx")
View(weather_math)

Now that you’ve found the Code Preview section, look at the end of it: there’s a tiny icon that looks like a clipboard. What do you do? CLICK IT!

Great Job!!

Now you’ve copied the code necessary to import your file. Press “Import” in the bottom-right.

Go back to the coding section (the top-left section of the interface) and paste the code there (Ctrl/Command + V). Keeping the code in your script means you can re-import the file later without clicking through the menus again.

(Let’s hope RStudio never changes its default interface, or I’ll have to rewrite this part. 😅)
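If your file is a .csv instead of an Excel file, the copied code will call a text reader rather than read_excel(). Here is a minimal sketch using base R’s read.csv(), which needs no extra package. The temporary file and the name weather_sample are stand-ins I made up so the example runs anywhere; RStudio may generate a different function depending on which “From Text” option you pick.

```r
# Write a tiny stand-in .csv file (three rows borrowed from weather_math).
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,day,score",
    "1,sunny,9",
    "2,sunny,8",
    "1,rainy,1"),
  csv_path
)

# read.csv() ships with R; the first line of the file becomes the column names.
weather_sample <- read.csv(csv_path, stringsAsFactors = FALSE)
head(weather_sample)
```

In your own script, replace csv_path with the path to your file, just like the path inside read_excel() above.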

Also, you can use head() to see the first few rows of your data (six by default), just to make sure everything was imported correctly.

head(weather_math)

## # A tibble: 6 × 3
##      id day   score
##   <dbl> <chr> <dbl>
## 1     1 sunny     9
## 2     2 sunny     8
## 3     3 sunny     7
## 4     4 sunny     8
## 5     5 sunny     8
## 6     6 sunny     8

Or you could just type the dataset’s name. A tibble with 20 rows or fewer is printed in full; a longer one shows only its first 10 rows.

weather_math

## # A tibble: 20 × 3
##       id day   score
##    <dbl> <chr> <dbl>
##  1     1 sunny     9
##  2     2 sunny     8
##  3     3 sunny     7
##  4     4 sunny     8
##  5     5 sunny     8
##  6     6 sunny     8
##  7     7 sunny     6
##  8     8 sunny     5
##  9     9 sunny     3
## 10    10 sunny     4
## 11     1 rainy     1
## 12     2 rainy     3
## 13     3 rainy     5
## 14     4 rainy     7
## 15     5 rainy     6
## 16     6 rainy     5
## 17     7 rainy     6
## 18     8 rainy     7
## 19     9 rainy     5
## 20    10 rainy     3

3.1.1 Weather Math Data Explanation

Let’s take a peek at this data, so you’ll have a better grasp of our future examples.

weather_math

## # A tibble: 20 × 3
##       id day   score
##    <dbl> <chr> <dbl>
##  1     1 sunny     9
##  2     2 sunny     8
##  3     3 sunny     7
##  4     4 sunny     8
##  5     5 sunny     8
##  6     6 sunny     8
##  7     7 sunny     6
##  8     8 sunny     5
##  9     9 sunny     3
## 10    10 sunny     4
## 11     1 rainy     1
## 12     2 rainy     3
## 13     3 rainy     5
## 14     4 rainy     7
## 15     5 rainy     6
## 16     6 rainy     5
## 17     7 rainy     6
## 18     8 rainy     7
## 19     9 rainy     5
## 20    10 rainy     3

This is a made-up dataset. Ten imaginary people participated in this imaginary study. They took a math test on both a rainy day and a sunny day, and their scores were recorded. The first column (id) is the participant number, the second column (day) shows the weather (either sunny or rainy), and the third column (score) contains their scores.
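If you don’t have the Excel file, you can rebuild the same 20 rows by hand and confirm the structure described above. This is only a sketch: it uses a plain data.frame as a stand-in for the imported tibble, with the values copied from the printed output.

```r
# Rebuild weather_math from the printed values (data.frame stand-in).
weather_math <- data.frame(
  id    = rep(1:10, times = 2),
  day   = rep(c("sunny", "rainy"), each = 10),
  score = c(9, 8, 7, 8, 8, 8, 6, 5, 3, 4,
            1, 3, 5, 7, 6, 5, 6, 7, 5, 3)
)

# How many rows belong to each kind of weather?
table(weather_math$day)

# How many different participants are there?
length(unique(weather_math$id))

# How many times does each participant appear?
table(weather_math$id)
```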

3.2 Dataset packages

3.2.1 babynames

There are various data packages available in R, and we can easily access their datasets by installing the appropriate package. For example, we can install a package called babynames.

This package contains baby names in the US from 1880 to 2017.

Let’s take a look.

install.packages("babynames")

library(babynames)
babynames

## # A tibble: 1,924,665 × 5
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # ℹ 1,924,655 more rows

If we call View(babynames) instead of just printing the data, RStudio opens it in a separate tab. Either way, we can see that there are 5 columns: year, sex, name, n, and prop.

Wait, what is prop?

When we run into questions like this, we can type ?dataset or ?function in the Console. In this case, we type ?babynames.

After hitting Enter, the Help pane on the right side will display some information. Reading the description of this dataset, we learn that prop is n divided by the total number of applicants of that sex in that year. In other words, it is the proportion of babies of that sex born in that year who were given that name.
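Here is a quick sanity check of what prop means, using only numbers copied from the printed output above. If prop is n divided by one yearly total, then n / prop should give roughly the same total for every female name in 1880. The variable names are my own, and the printed prop values are rounded, so the results only agree approximately.

```r
# n and prop for three female names in 1880, copied from the output above.
n    <- c(Mary = 7065, Anna = 2604, Emma = 2003)
prop <- c(Mary = 0.0724, Anna = 0.0267, Emma = 0.0205)

# Rearranging prop = n / total gives total = n / prop.
implied_total <- n / prop

# The three values should land close to each other.
round(implied_total)
```

All three come out at around 97,500 to 97,700, which fits the idea that each name is divided by the same 1880 total.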

3.2.2 starwars

There are also built-in datasets within the dplyr package. For example, there is a dataset called starwars.

We can access it with just a few lines of code, simply by loading the dplyr package, where the starwars dataset is stored.

install.packages("dplyr")

library(dplyr)

starwars

## # A tibble: 87 × 14
##    name       height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>       <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
##  1 Luke Skyw…    172    77 blond      fair       blue            19   male  mascu…
##  2 C-3PO         167    75 <NA>       gold       yellow         112   none  mascu…
##  3 R2-D2          96    32 <NA>       white, bl… red             33   none  mascu…
##  4 Darth Vad…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia Orga…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen Lars     178   120 brown, gr… light      blue            52   male  mascu…
##  7 Beru Whit…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4          97    32 <NA>       white, red red             NA   none  mascu…
##  9 Biggs Dar…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-Wan K…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
## # ℹ 77 more rows
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>
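If you’re curious which datasets a package ships with, base R’s data() function can list them. Below is a small sketch; list_datasets() is a helper name I made up for illustration. The dplyr part only runs when dplyr is installed, so the example doesn’t fail on a fresh setup.

```r
# Hypothetical helper: names of the datasets that come with a package.
list_datasets <- function(pkg) {
  unname(data(package = pkg)$results[, "Item"])
}

# The "datasets" package is installed together with R itself.
head(list_datasets("datasets"))

# Only look inside dplyr when it is actually installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(list_datasets("dplyr"))
  # pkg::name reaches a dataset without calling library() first.
  print(dim(dplyr::starwars))
}
```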

3.3 Ways to Check Data

These functions are commonly used for inspecting datasets.

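View() opens the dataset in a separate, spreadsheet-like tab in RStudio.
glimpse() lists every column with its type and its first few values.
head() and tail() show the first and last rows (six by default).
slice() picks rows by position, for example slice(starwars, 1:3).
dim() returns the number of rows and columns.

glimpse() and slice() come from the dplyr package, so load it first with library(dplyr); the others are part of base R.

You can also check the dimensions of your dataset using nrow() and ncol(). dim() displays both at once, with rows first.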
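Here is a short sketch that tries the base R functions on the weather_math data. The data is rebuilt by hand as a plain data.frame (a stand-in for the imported tibble) so the example runs without the Excel file. I use str() as a base R substitute for glimpse(), since glimpse() needs dplyr.

```r
# Stand-in for the imported data, with values copied from the printed output.
weather_math <- data.frame(
  id    = rep(1:10, times = 2),
  day   = rep(c("sunny", "rainy"), each = 10),
  score = c(9, 8, 7, 8, 8, 8, 6, 5, 3, 4,
            1, 3, 5, 7, 6, 5, 6, 7, 5, 3)
)

head(weather_math)          # first six rows (the default)
tail(weather_math, n = 3)   # last three rows
nrow(weather_math)          # number of rows
ncol(weather_math)          # number of columns
dim(weather_math)           # rows and columns at once
str(weather_math)           # compact overview of every column
```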