Import and Tidy Data

rm(list = ls()) # clean-up workspace
sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.27     R6_2.5.0          jsonlite_1.7.2    magrittr_2.0.1   
##  [5] evaluate_0.14     rlang_0.4.11      stringi_1.7.3     jquerylib_0.1.4  
##  [9] bslib_0.2.5.1     rmarkdown_2.10    tools_4.1.1       stringr_1.4.0    
## [13] xfun_0.25         yaml_2.2.1        compiler_4.1.1    htmltools_0.5.1.1
## [17] knitr_1.33        sass_0.4.0

Announcement

Poll on HW1 extension
Profiling example with pre-allocation: you do not know unless you profile.
- What I forgot last time: the size of data is different in the last two cases.
- The example with pre-allocated memory ran too fast that I had to increase the size by 10 fold to accommodate for the timer.

read.csv(), read.table(), read.delim() functions import data from files.

(oringp <- read.table("https://raw.githubusercontent.com/tulane-math-7360-2021/tulane-math-7360-2021.github.io/main/HW/HW1/oringp.dat"))

##      V1       V2 V3 V4 V5
## 1     1  4/12/81  6  0 66
## 2     2 11/12/81  6  1 70
## 3     3  3/22/82  6  0 69
## 4     5 11/11/82  6  0 68
## 5     6  4/04/83  6  0 67
## 6     7  6/18/83  6  0 72
## 7     8  8/30/83  6  0 73
## 8     9 11/28/83  6  0 70
## 9  41-B  2/03/84  6  1 57
## 10 41-C  4/06/84  6  1 63
## 11 41-D  8/30/84  6  1 70
## 12 41-G 10/05/84  6  0 78
## 13 51-A 11/08/84  6  0 67
## 14 51-C  1/24/85  6  3 53
## 15 51-D  4/12/85  6  0 67
## 16 51-B  4/29/85  6  0 75
## 17 51-G  6/17/85  6  0 70
## 18 51-F  7/29/85  6  0 81
## 19 51-I  8/27/85  6  0 76
## 20 51-J 10/03/85  6  0 79
## 21 61-A 10/30/85  6  2 75
## 22 61-B 11/26/85  6  0 76
## 23 61-C  1/12/86  6  1 58
## 24 61-I  1/28/86  6 NA 31

(oringp <- read.table("../../HW/HW1/oringp.dat"))

##      V1       V2 V3 V4 V5
## 1     1  4/12/81  6  0 66
## 2     2 11/12/81  6  1 70
## 3     3  3/22/82  6  0 69
## 4     5 11/11/82  6  0 68
## 5     6  4/04/83  6  0 67
## 6     7  6/18/83  6  0 72
## 7     8  8/30/83  6  0 73
## 8     9 11/28/83  6  0 70
## 9  41-B  2/03/84  6  1 57
## 10 41-C  4/06/84  6  1 63
## 11 41-D  8/30/84  6  1 70
## 12 41-G 10/05/84  6  0 78
## 13 51-A 11/08/84  6  0 67
## 14 51-C  1/24/85  6  3 53
## 15 51-D  4/12/85  6  0 67
## 16 51-B  4/29/85  6  0 75
## 17 51-G  6/17/85  6  0 70
## 18 51-F  7/29/85  6  0 81
## 19 51-I  8/27/85  6  0 76
## 20 51-J 10/03/85  6  0 79
## 21 61-A 10/30/85  6  2 75
## 22 61-B 11/26/85  6  0 76
## 23 61-C  1/12/86  6  1 58
## 24 61-I  1/28/86  6 NA 31

One-page project description due today.
Facilities email: please move furniture back after class.

Acknowledgement

Dr. Hua Zhou’s slides

Source

We follow R for Data Science by Garrett Grolemund and Hadley Wickham for the next couple of lectures.

Workflow

A typical data science project:

Data wrangling.
- The art of getting your data into R in a useful form for visualisation and modelling.
- Data import
- Data transformation.
  - select important variables
  - filter out key observations
  - create new variables
  - compute summaries
Visualisation.
- Make elegant and informative plots that help you understand data.

Data wrangling

There are three main parts to data wrangling:

We will proceed with:

tibbles | r4ds chapter 10
- A data frame variant
- how to construct them
data import | r4ds chapter 11
- how to get data from disk into R
tidy data | r4ds chapter 12
- a consistent way of storing your data that makes everything easier (transformation, visualisation, and modelling)

Tidyverse

tidyverse is a collection of R packages that make data wrangling easy.
Install tidyverse from RStudio menu Tools -> Install Packages... or
```
install.packages("tidyverse")
```

After installation, load tidyverse by

library("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

environment(filter)

## <environment: namespace:dplyr>

environment(stats::filter)

## <environment: namespace:stats>

Tibbles

Tibbles are one of the unifying features of the tidyverse.

create tibbles

coerce a data frame to a tibble as_tibble()

Most other R packages use regular data frames.
iris is a data frame available in base R:

# a regular data frame
iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1            5.1         3.5          1.4         0.2     setosa
## 2            4.9         3.0          1.4         0.2     setosa
## 3            4.7         3.2          1.3         0.2     setosa
## 4            4.6         3.1          1.5         0.2     setosa
## 5            5.0         3.6          1.4         0.2     setosa
## 6            5.4         3.9          1.7         0.4     setosa
## 7            4.6         3.4          1.4         0.3     setosa
## 8            5.0         3.4          1.5         0.2     setosa
## 9            4.4         2.9          1.4         0.2     setosa
## 10           4.9         3.1          1.5         0.1     setosa
## 11           5.4         3.7          1.5         0.2     setosa
## 12           4.8         3.4          1.6         0.2     setosa
## 13           4.8         3.0          1.4         0.1     setosa
## 14           4.3         3.0          1.1         0.1     setosa
## 15           5.8         4.0          1.2         0.2     setosa
## 16           5.7         4.4          1.5         0.4     setosa
## 17           5.4         3.9          1.3         0.4     setosa
## 18           5.1         3.5          1.4         0.3     setosa
## 19           5.7         3.8          1.7         0.3     setosa
## 20           5.1         3.8          1.5         0.3     setosa
## 21           5.4         3.4          1.7         0.2     setosa
## 22           5.1         3.7          1.5         0.4     setosa
## 23           4.6         3.6          1.0         0.2     setosa
## 24           5.1         3.3          1.7         0.5     setosa
## 25           4.8         3.4          1.9         0.2     setosa
## 26           5.0         3.0          1.6         0.2     setosa
## 27           5.0         3.4          1.6         0.4     setosa
## 28           5.2         3.5          1.5         0.2     setosa
## 29           5.2         3.4          1.4         0.2     setosa
## 30           4.7         3.2          1.6         0.2     setosa
## 31           4.8         3.1          1.6         0.2     setosa
## 32           5.4         3.4          1.5         0.4     setosa
## 33           5.2         4.1          1.5         0.1     setosa
## 34           5.5         4.2          1.4         0.2     setosa
## 35           4.9         3.1          1.5         0.2     setosa
## 36           5.0         3.2          1.2         0.2     setosa
## 37           5.5         3.5          1.3         0.2     setosa
## 38           4.9         3.6          1.4         0.1     setosa
## 39           4.4         3.0          1.3         0.2     setosa
## 40           5.1         3.4          1.5         0.2     setosa
## 41           5.0         3.5          1.3         0.3     setosa
## 42           4.5         2.3          1.3         0.3     setosa
## 43           4.4         3.2          1.3         0.2     setosa
## 44           5.0         3.5          1.6         0.6     setosa
## 45           5.1         3.8          1.9         0.4     setosa
## 46           4.8         3.0          1.4         0.3     setosa
## 47           5.1         3.8          1.6         0.2     setosa
## 48           4.6         3.2          1.4         0.2     setosa
## 49           5.3         3.7          1.5         0.2     setosa
## 50           5.0         3.3          1.4         0.2     setosa
## 51           7.0         3.2          4.7         1.4 versicolor
## 52           6.4         3.2          4.5         1.5 versicolor
## 53           6.9         3.1          4.9         1.5 versicolor
## 54           5.5         2.3          4.0         1.3 versicolor
## 55           6.5         2.8          4.6         1.5 versicolor
## 56           5.7         2.8          4.5         1.3 versicolor
## 57           6.3         3.3          4.7         1.6 versicolor
## 58           4.9         2.4          3.3         1.0 versicolor
## 59           6.6         2.9          4.6         1.3 versicolor
## 60           5.2         2.7          3.9         1.4 versicolor
## 61           5.0         2.0          3.5         1.0 versicolor
## 62           5.9         3.0          4.2         1.5 versicolor
## 63           6.0         2.2          4.0         1.0 versicolor
## 64           6.1         2.9          4.7         1.4 versicolor
## 65           5.6         2.9          3.6         1.3 versicolor
## 66           6.7         3.1          4.4         1.4 versicolor
## 67           5.6         3.0          4.5         1.5 versicolor
## 68           5.8         2.7          4.1         1.0 versicolor
## 69           6.2         2.2          4.5         1.5 versicolor
## 70           5.6         2.5          3.9         1.1 versicolor
## 71           5.9         3.2          4.8         1.8 versicolor
## 72           6.1         2.8          4.0         1.3 versicolor
## 73           6.3         2.5          4.9         1.5 versicolor
## 74           6.1         2.8          4.7         1.2 versicolor
## 75           6.4         2.9          4.3         1.3 versicolor
## 76           6.6         3.0          4.4         1.4 versicolor
## 77           6.8         2.8          4.8         1.4 versicolor
## 78           6.7         3.0          5.0         1.7 versicolor
## 79           6.0         2.9          4.5         1.5 versicolor
## 80           5.7         2.6          3.5         1.0 versicolor
## 81           5.5         2.4          3.8         1.1 versicolor
## 82           5.5         2.4          3.7         1.0 versicolor
## 83           5.8         2.7          3.9         1.2 versicolor
## 84           6.0         2.7          5.1         1.6 versicolor
## 85           5.4         3.0          4.5         1.5 versicolor
## 86           6.0         3.4          4.5         1.6 versicolor
## 87           6.7         3.1          4.7         1.5 versicolor
## 88           6.3         2.3          4.4         1.3 versicolor
## 89           5.6         3.0          4.1         1.3 versicolor
## 90           5.5         2.5          4.0         1.3 versicolor
## 91           5.5         2.6          4.4         1.2 versicolor
## 92           6.1         3.0          4.6         1.4 versicolor
## 93           5.8         2.6          4.0         1.2 versicolor
## 94           5.0         2.3          3.3         1.0 versicolor
## 95           5.6         2.7          4.2         1.3 versicolor
## 96           5.7         3.0          4.2         1.2 versicolor
## 97           5.7         2.9          4.2         1.3 versicolor
## 98           6.2         2.9          4.3         1.3 versicolor
## 99           5.1         2.5          3.0         1.1 versicolor
## 100          5.7         2.8          4.1         1.3 versicolor
## 101          6.3         3.3          6.0         2.5  virginica
## 102          5.8         2.7          5.1         1.9  virginica
## 103          7.1         3.0          5.9         2.1  virginica
## 104          6.3         2.9          5.6         1.8  virginica
## 105          6.5         3.0          5.8         2.2  virginica
## 106          7.6         3.0          6.6         2.1  virginica
## 107          4.9         2.5          4.5         1.7  virginica
## 108          7.3         2.9          6.3         1.8  virginica
## 109          6.7         2.5          5.8         1.8  virginica
## 110          7.2         3.6          6.1         2.5  virginica
## 111          6.5         3.2          5.1         2.0  virginica
## 112          6.4         2.7          5.3         1.9  virginica
## 113          6.8         3.0          5.5         2.1  virginica
## 114          5.7         2.5          5.0         2.0  virginica
## 115          5.8         2.8          5.1         2.4  virginica
## 116          6.4         3.2          5.3         2.3  virginica
## 117          6.5         3.0          5.5         1.8  virginica
## 118          7.7         3.8          6.7         2.2  virginica
## 119          7.7         2.6          6.9         2.3  virginica
## 120          6.0         2.2          5.0         1.5  virginica
## 121          6.9         3.2          5.7         2.3  virginica
## 122          5.6         2.8          4.9         2.0  virginica
## 123          7.7         2.8          6.7         2.0  virginica
## 124          6.3         2.7          4.9         1.8  virginica
## 125          6.7         3.3          5.7         2.1  virginica
## 126          7.2         3.2          6.0         1.8  virginica
## 127          6.2         2.8          4.8         1.8  virginica
## 128          6.1         3.0          4.9         1.8  virginica
## 129          6.4         2.8          5.6         2.1  virginica
## 130          7.2         3.0          5.8         1.6  virginica
## 131          7.4         2.8          6.1         1.9  virginica
## 132          7.9         3.8          6.4         2.0  virginica
## 133          6.4         2.8          5.6         2.2  virginica
## 134          6.3         2.8          5.1         1.5  virginica
## 135          6.1         2.6          5.6         1.4  virginica
## 136          7.7         3.0          6.1         2.3  virginica
## 137          6.3         3.4          5.6         2.4  virginica
## 138          6.4         3.1          5.5         1.8  virginica
## 139          6.0         3.0          4.8         1.8  virginica
## 140          6.9         3.1          5.4         2.1  virginica
## 141          6.7         3.1          5.6         2.4  virginica
## 142          6.9         3.1          5.1         2.3  virginica
## 143          5.8         2.7          5.1         1.9  virginica
## 144          6.8         3.2          5.9         2.3  virginica
## 145          6.7         3.3          5.7         2.5  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 147          6.3         2.5          5.0         1.9  virginica
## 148          6.5         3.0          5.2         2.0  virginica
## 149          6.2         3.4          5.4         2.3  virginica
## 150          5.9         3.0          5.1         1.8  virginica

Convert a regular data frame to tibble:

as_tibble(iris)

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Convert a tibble to data frame:

as.data.frame(tb)

Create tibble from individual vectors.

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

Note values for y are recycled

We know that scalars are just length-1 vectors, how does R perform operations between vectors of different length?
- If the longer object length is multiple of the shorter object length, the shorter object is recycled
- If the longer object length is not multiple of the shorter object length, R outputs a warning

long.vec <- 1:10
short.vec.1 <- 1:2
short.vec.2 <- 1:3
(long.vec * short.vec.1)

##  [1]  1  4  3  8  5 12  7 16  9 20

(long.vec + short.vec.1)

##  [1]  2  4  4  6  6  8  8 10 10 12

(long.vec * short.vec.2)

## Warning in long.vec * short.vec.2: longer object length is not a multiple of
## shorter object length

##  [1]  1  4  9  4 10 18  7 16 27 10

(long.vec + short.vec.2)

## Warning in long.vec + short.vec.2: longer object length is not a multiple of
## shorter object length

##  [1]  2  4  6  5  7  9  8 10 12 11

Only length one vectors are recycled

tibble(
  x = 1:5, 
  y = 1:2, 
  z = x ^ 2 + y
)

## Error: Tibble columns must have compatible sizes.
## * Size 5: Existing data.
## * Size 2: Column `y`.
## ℹ Only values of size one are recycled.

tibble() does less than data.frame():
1. never changes the type of the inputs (e.g. it never converts strings to factors)
2. never changes the names of variables
3. never creates row names

Transposed tibbles:

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)

## # A tibble: 2 × 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

Printing of a tibble

By default, tibble prints the first 10 rows and all columns that fit on screen.

you don’t accidentally overwhelm your console when printing large data frames

nycflights13::flights

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

To change number of rows and columns to display:

nycflights13::flights %>% 
  print(n = 10, width = Inf)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
##    arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
##        <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
##  1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
##  2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
##  3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
##  4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
##  5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
##  6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
##  7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
##  8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
##  9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
## 10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
##    time_hour          
##    <dttm>             
##  1 2013-01-01 05:00:00
##  2 2013-01-01 05:00:00
##  3 2013-01-01 05:00:00
##  4 2013-01-01 05:00:00
##  5 2013-01-01 06:00:00
##  6 2013-01-01 05:00:00
##  7 2013-01-01 06:00:00
##  8 2013-01-01 06:00:00
##  9 2013-01-01 06:00:00
## 10 2013-01-01 06:00:00
## # … with 336,766 more rows

Here we see the pipe operator %>% pipes the output from previous command to the (first) argument of the next command.

To change the default print setting:
- options(tibble.print_max = n, tibble.print_min = m): if more than n rows, print only m rows.
- options(tibble.print_min = Inf): print all rows.
- options(tibble.width = Inf): print all columns.

Subsetting

df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
df

## # A tibble: 5 × 2
##       x       y
##   <dbl>   <dbl>
## 1 0.757 -0.476 
## 2 0.872 -0.360 
## 3 0.476 -2.10  
## 4 0.828  0.0421
## 5 0.753  0.978

Extract by name:

df$x

## [1] 0.7565059 0.8716250 0.4760646 0.8283121 0.7528014

df[["x"]]

## [1] 0.7565059 0.8716250 0.4760646 0.8283121 0.7528014

Extract by position：

df[[1]]

## [1] 0.7565059 0.8716250 0.4760646 0.8283121 0.7528014

Pipe:

df %>% .$x

## [1] 0.7565059 0.8716250 0.4760646 0.8283121 0.7528014

df %>% .[["x"]]

## [1] 0.7565059 0.8716250 0.4760646 0.8283121 0.7528014

. is a special placeholder.

Data import | r4ds chapter 11

readr

readr package implements functions that turn flat files into tibbles.
- read_csv() (comma delimited files), read_csv2() (semicolon seperated files), read_tsv() (tab delimited files), read_delim() (any delimiter).
- read_fwf() (fixed width files), read_table().
- read_log() (Apache style log files).

An example file heights.csv:

head heights.csv

## "earn","height","sex","ed","age","race"
## 50000,74.4244387818035,"male",16,45,"white"
## 60000,65.5375428255647,"female",16,58,"white"
## 30000,63.6291977374349,"female",16,29,"white"
## 50000,63.1085616752971,"female",16,91,"other"
## 51000,63.4024835710879,"female",17,39,"white"
## 9000,64.3995075440034,"female",15,26,"white"
## 29000,61.6563258264214,"female",12,49,"white"
## 32000,72.6985437364783,"male",17,46,"white"
## 2000,72.0394668497611,"male",15,21,"hispanic"

Read from a local file heights.csv:

(heights <- read_csv("heights.csv"))

## Rows: 1192 Columns: 6

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): sex, race
## dbl (4): earn, height, ed, age

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1,192 × 6
##     earn height sex       ed   age race    
##    <dbl>  <dbl> <chr>  <dbl> <dbl> <chr>   
##  1 50000   74.4 male      16    45 white   
##  2 60000   65.5 female    16    58 white   
##  3 30000   63.6 female    16    29 white   
##  4 50000   63.1 female    16    91 other   
##  5 51000   63.4 female    17    39 white   
##  6  9000   64.4 female    15    26 white   
##  7 29000   61.7 female    12    49 white   
##  8 32000   72.7 male      17    46 white   
##  9  2000   72.0 male      15    21 hispanic
## 10 27000   72.2 male      12    26 white   
## # … with 1,182 more rows

I’m curious about relation between earn and height and sex

ggplot(data = heights) + 
  geom_point(mapping = aes(x = height, y = earn, color = sex))

Read from inline csv file:

read_csv("a,b,c
  1,2,3
  4,5,6")

## Rows: 2 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): a, b, c

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Skip first n lines:

read_csv("The first line of metadata
  The second line of metadata
  x,y,z
  1,2,3", skip = 2)

## Rows: 1 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

Skip comment lines:

read_csv("# A comment I want to skip
  x,y,z
  1,2,3", comment = "#")

## Rows: 1 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3

No header line:

read_csv("1,2,3\n4,5,6", col_names = FALSE)

## Rows: 2 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): X1, X2, X3

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##      X1    X2    X3
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

No header line and specify colnames:

read_csv("1,2,3\n4,5,6", col_names = c("x", "y", "z"))

## Rows: 2 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (3): x, y, z

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 2 × 3
##       x     y     z
##   <dbl> <dbl> <dbl>
## 1     1     2     3
## 2     4     5     6

Specify the symbol representing missing values:

read_csv("a,b,c\n1,2,.", na = ".")

## Rows: 1 Columns: 3

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): a, b
## lgl (1): c

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

## # A tibble: 1 × 3
##       a     b c    
##   <dbl> <dbl> <lgl>
## 1     1     2 NA

Writing to a file

Write to csv:
```
write_csv(challenge, "challenge.csv")
```

Write (and read) RDS files:

write_rds(challenge, "challenge.rds")
read_rds("challenge.rds")

Excel files

readxl package (part of tidyverse) reads both xls and xlsx files:

library(readxl)
# xls file
read_excel("datasets.xls")

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

# xls file
read_excel("datasets.xlsx")

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

List the sheet name:

excel_sheets("datasets.xlsx")

## [1] "iris"     "mtcars"   "chickwts" "quakes"

Read in a specific sheet by name or number:

read_excel("datasets.xlsx", sheet = "mtcars")

## # A tibble: 32 × 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # … with 22 more rows

read_excel("datasets.xlsx", sheet = 4)

## # A tibble: 1,000 × 5
##      lat  long depth   mag stations
##    <dbl> <dbl> <dbl> <dbl>    <dbl>
##  1 -20.4  182.   562   4.8       41
##  2 -20.6  181.   650   4.2       15
##  3 -26    184.    42   5.4       43
##  4 -18.0  182.   626   4.1       19
##  5 -20.4  182.   649   4         11
##  6 -19.7  184.   195   4         12
##  7 -11.7  166.    82   4.8       43
##  8 -28.1  182.   194   4.4       15
##  9 -28.7  182.   211   4.7       35
## 10 -17.5  180.   622   4.3       19
## # … with 990 more rows

Control subset of cells to read:

# first 3 rows
read_excel("datasets.xlsx", n_max = 3)

## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa

Excel type range

read_excel("datasets.xlsx", range = "C1:E4")

## # A tibble: 3 × 3
##   Petal.Length Petal.Width Species
##          <dbl>       <dbl> <chr>  
## 1          1.4         0.2 setosa 
## 2          1.4         0.2 setosa 
## 3          1.3         0.2 setosa

# first 4 rows
read_excel("datasets.xlsx", range = cell_rows(1:4))

## # A tibble: 3 × 5
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
## 1          5.1         3.5          1.4         0.2 setosa 
## 2          4.9         3            1.4         0.2 setosa 
## 3          4.7         3.2          1.3         0.2 setosa

# columns B-D
read_excel("datasets.xlsx", range = cell_cols("B:D"))

## # A tibble: 150 × 3
##    Sepal.Width Petal.Length Petal.Width
##          <dbl>        <dbl>       <dbl>
##  1         3.5          1.4         0.2
##  2         3            1.4         0.2
##  3         3.2          1.3         0.2
##  4         3.1          1.5         0.2
##  5         3.6          1.4         0.2
##  6         3.9          1.7         0.4
##  7         3.4          1.4         0.3
##  8         3.4          1.5         0.2
##  9         2.9          1.4         0.2
## 10         3.1          1.5         0.1
## # … with 140 more rows

# sheet
read_excel("datasets.xlsx", range = "mtcars!B1:D5")

## # A tibble: 4 × 3
##     cyl  disp    hp
##   <dbl> <dbl> <dbl>
## 1     6   160   110
## 2     6   160   110
## 3     4   108    93
## 4     6   258   110

Specify NAs:

read_excel("datasets.xlsx", na = "setosa")

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>  
##  1          5.1         3.5          1.4         0.2 <NA>   
##  2          4.9         3            1.4         0.2 <NA>   
##  3          4.7         3.2          1.3         0.2 <NA>   
##  4          4.6         3.1          1.5         0.2 <NA>   
##  5          5           3.6          1.4         0.2 <NA>   
##  6          5.4         3.9          1.7         0.4 <NA>   
##  7          4.6         3.4          1.4         0.3 <NA>   
##  8          5           3.4          1.5         0.2 <NA>   
##  9          4.4         2.9          1.4         0.2 <NA>   
## 10          4.9         3.1          1.5         0.1 <NA>   
## # … with 140 more rows

Writing Excel files: openxlsx and writexl packages.

Other types of data

haven reads SPSS, Stata, and SAS files.
DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.
jsonlite reads json files.
xml2 reads XML files.
tidyxl reads non-tabular data from Excel.

Tidy data | r4ds chapter 12

“Happy families are all alike; every unhappy family is unhappy in its own way.” –– Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

Tidy data

There are three interrelated rules which make a dataset tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Example table1

table1

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

is tidy.

Example table2

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

is not tidy.

Example table3

table3

## # A tibble: 6 × 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

is not tidy.

Example table4a

table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

is not tidy.

Example table4b

table4b

## # A tibble: 3 × 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

is not tidy.

Pivoting

Typical issues:

One variable might be spread across multiple columns
One variable might be scattered across multiple rows

Longer

pivot_longer() columns into a new pair of variables.

table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

## # A tibble: 6 × 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

We can pivot table4b longer too and then join them

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)

## Joining, by = c("country", "year")

## # A tibble: 6 × 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

Wider

pivot_wider() is the opposite of pivot_longer().

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table2 %>%
  pivot_wider(names_from = type, values_from = count)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Separating

table3 %>% 
  separate(rate, into = c("cases", "population"))

## # A tibble: 6 × 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Seperate into numeric values:

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Separate at a fixed position:

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)

## # A tibble: 6 × 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Unite

table5

## # A tibble: 6 × 4
##   country     century year  rate             
## * <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

unite() is the inverse of separate().

table5 %>% 
  unite(new, century, year, sep = "")

## # A tibble: 6 × 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Import and Tidy Data

MATH-7360 Data Analysis

Dr. Xiang Ji @ Tulane University

Sep 29, 2021

Announcement

Acknowledgement

Source

Workflow

Data wrangling

Tidyverse

Tibbles

create tibbles

Printing of a tibble

Subsetting

Data import | r4ds chapter 11

readr

Writing to a file

Excel files

Other types of data

Tidy data | r4ds chapter 12

Tidy data

Pivoting

Longer

Wider

Separating

Unite