rm(list = ls()) # clean-up workspace
library("tidyverse")

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4    
## [5] readr_2.0.1     tidyr_1.1.4     tibble_3.1.5    ggplot2_3.3.5  
## [9] tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.1  xfun_0.25         bslib_0.2.5.1     haven_2.4.3      
##  [5] colorspace_2.0-2  vctrs_0.3.8       generics_0.1.0    htmltools_0.5.1.1
##  [9] yaml_2.2.1        utf8_1.2.2        rlang_0.4.11      jquerylib_0.1.4  
## [13] pillar_1.6.3      glue_1.4.2        withr_2.4.2       DBI_1.1.1        
## [17] dbplyr_2.1.1      modelr_0.1.8      readxl_1.3.1      lifecycle_1.0.1  
## [21] munsell_0.5.0     gtable_0.3.0      cellranger_1.1.0  rvest_1.0.1      
## [25] evaluate_0.14     knitr_1.33        tzdb_0.1.2        fansi_0.5.0      
## [29] broom_0.7.9       Rcpp_1.0.7        backports_1.2.1   scales_1.1.1     
## [33] jsonlite_1.7.2    fs_1.5.0          hms_1.1.0         digest_0.6.28    
## [37] stringi_1.7.3     grid_4.1.1        cli_3.0.1         tools_4.1.1      
## [41] magrittr_2.0.1    sass_0.4.0        crayon_1.4.1      pkgconfig_2.0.3  
## [45] ellipsis_0.3.2    xml2_1.3.2        reprex_2.0.1      lubridate_1.7.10 
## [49] rstudioapi_0.13   assertthat_0.2.1  rmarkdown_2.10    httr_1.4.2       
## [53] R6_2.5.1          compiler_4.1.1

Announcement

Mid-term evaluation (voluntary, anonymous, ~ 10 min)
Lab 3 solutions posted
HW1 deadline extended to this Wednesday 10/06
HW2 will be posted on Friday

Acknowledgement

Dr. Hua Zhou’s slides

A typical data science project:

Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.”

John Tukey

`mpg` data

mpg data is available from the ggplot2 package:

mpg %>% print(width = Inf)

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans      drv     cty   hwy fl   
##    <chr>        <chr>      <dbl> <int> <int> <chr>      <chr> <int> <int> <chr>
##  1 audi         a4           1.8  1999     4 auto(l5)   f        18    29 p    
##  2 audi         a4           1.8  1999     4 manual(m5) f        21    29 p    
##  3 audi         a4           2    2008     4 manual(m6) f        20    31 p    
##  4 audi         a4           2    2008     4 auto(av)   f        21    30 p    
##  5 audi         a4           2.8  1999     6 auto(l5)   f        16    26 p    
##  6 audi         a4           2.8  1999     6 manual(m5) f        18    26 p    
##  7 audi         a4           3.1  2008     6 auto(av)   f        18    27 p    
##  8 audi         a4 quattro   1.8  1999     4 manual(m5) 4        18    26 p    
##  9 audi         a4 quattro   1.8  1999     4 auto(l5)   4        16    25 p    
## 10 audi         a4 quattro   2    2008     4 manual(m6) 4        20    28 p    
##    class  
##    <chr>  
##  1 compact
##  2 compact
##  3 compact
##  4 compact
##  5 compact
##  6 compact
##  7 compact
##  8 compact
##  9 compact
## 10 compact
## # … with 224 more rows

Tibbles are a generalized form of data frames and are used extensively in the tidyverse.
displ: engine displacement, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
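As a quick check of the tibble remark above, the sketch below inspects `mpg` with a few standard functions. It assumes only that ggplot2 and tibble are installed; nothing here is specific to this course.

```r
library(ggplot2)  # provides the mpg data
library(tibble)   # provides is_tibble() and glimpse()

# mpg is stored as a tibble ...
is_tibble(mpg)

# ... and a tibble is still a data frame, so base R functions work on it
is.data.frame(mpg)
dim(mpg)

# glimpse() prints one line per column, which is handy for wide data
glimpse(mpg)
```

`glimpse()` is an alternative to `print(width = Inf)` when a table has too many columns to fit on one line.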

Aesthetic mappings | r4ds chapter 3.3

A graphing template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Scatter plot

hwy vs displ

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

An aesthetic maps data to a specific visual feature of the plot.
Check available aesthetics for a geometric object by ?geom_point.
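Besides the help page, the aesthetics can also be listed programmatically. The sketch below relies on `GeomPoint`, the ggproto object behind `geom_point()`; this is part of ggplot2's extension machinery rather than its everyday interface, so treat it as a convenience that may behave differently across versions.

```r
library(ggplot2)

# GeomPoint is the ggproto object that geom_point() draws with;
# its aesthetics() method returns the names of the aesthetics it understands
point_aes <- GeomPoint$aesthetics()
print(point_aes)
```

Note that ggplot2 stores the British spelling `colour` internally; `color` is accepted as an alias in `aes()`.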

Color of points

Color points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

Size of points

Assign different sizes to points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, size = class))

## Warning: Using size for a discrete variable is not advised.

Transparency of points

Assign different transparency levels to points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))

## Warning: Using alpha for a discrete variable is not advised.

Shape of points

Assign different shapes to points according to class:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))

The default shape palette provides at most 6 shapes. mpg has 7 classes, so by default the points in the additional group go unplotted.
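One way around the six-shape limit is to supply the shapes by hand with `scale_shape_manual()`. The sketch below is only an illustration: the shape codes 1 to 7 are an arbitrary choice, and the object names are made up for this example.

```r
library(ggplot2)

# Number of distinct classes in mpg (one more than the default palette offers)
n_class <- length(unique(mpg$class))

# Give every class its own shape code so that no group is dropped
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_class))
p_shapes
```

With many groups, shapes become hard to tell apart, so color or facets are often a better choice.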

Manual setting of an aesthetic

Set the color of all points to be blue:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
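A common slip is to put the color inside `aes()`. The sketch below contrasts the two calls and uses `layer_data()` to look at the colors ggplot2 actually assigns; the object names `p_set` and `p_map` are made up for this example.

```r
library(ggplot2)

# Outside aes(): "blue" is a fixed setting applied to every point
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inside aes(): "blue" is treated as a constant data value and mapped
# through the default color scale, so the points are not drawn in blue
p_map <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Colors assigned to the points in each plot
unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

The second plot also gains a legend with the single entry "blue", which is usually the first hint that a setting ended up inside `aes()`.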

Facets | r4ds chapter 3.5

Facets

Facets divide a plot into subplots based on the values of one or more discrete variables.

A subplot for each car type:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

A subplot for each car type and drive:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)
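`facet_grid(drv ~ class)` lays out one panel per combination of drv (rows) and class (columns), whether or not that combination occurs in the data. A cross-tabulation, sketched below with base R, shows which panels will be empty.

```r
library(ggplot2)  # for the mpg data

# Rows are drv, columns are class: the same layout as facet_grid(drv ~ class)
combos <- table(drv = mpg$drv, class = mpg$class)
combos

# Total number of panels in the grid
length(combos)

# Panels with no observations
sum(combos == 0)
```

`facet_wrap(~ class)` avoids this issue because it only creates panels for values that are present.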

Geometric objects | r4ds chapter 3.6

`geom_smooth()`: smooth line

How are these two plots similar?

hwy vs displ line:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

Different line types

Different line types according to drv:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

Different line colors

Different line colors according to drv:

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))

Points and lines

Lines overlaid over scatter plot:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

Same as

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + geom_smooth()

Aesthetics for each geometric object

Different aesthetics in different layers:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

Bar plots | r4ds chapter 3.7

`diamonds` data

diamonds data:

diamonds

## # A tibble: 53,940 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
## # … with 53,930 more rows

Bar plot

geom_bar() creates a bar chart:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
Check available computed variables for a geometric object via help:
```
?geom_bar
```

Use stat_count() directly:

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))

stat_count() uses geom_bar() as its default geom.

Display relative frequencies (proportions) instead of counts:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))

Note that the aesthetic mapping group = 1 overrides the default grouping (by cut) and treats all observations as a single group. Without it, each cut is its own group and every bar has height 1:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = stat(prop)))
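To see what the computed variable `prop` contains, the proportions can be worked out by hand and compared with what ggplot2 computes. This is a sketch under a few assumptions: it uses `after_stat(prop)`, which newer ggplot2 releases recommend in place of `stat(prop)`, and the object names are made up for this example.

```r
library(ggplot2)
library(dplyr)

# Proportion of diamonds in each cut, relative to the whole data set
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
cut_props

# With group = 1 the proportions are computed over all observations
p_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
all.equal(layer_data(p_group)$prop, cut_props$prop)

# Without group = 1 each cut is its own group, so every proportion is 1
p_nogroup <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
layer_data(p_nogroup)$prop
```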

`geom_bar()` vs `geom_col()`

geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights).
```
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
```
The height of each bar is the number of diamonds in that cut category.

geom_col() makes the heights of the bars represent values in the data.

ggplot(data = diamonds) + 
  geom_col(mapping = aes(x = cut, y = carat))

The height of each bar is the total carat in that cut category. The same heights are obtained from geom_bar() with the weight aesthetic:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, weight = carat))
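The totals behind both plots can be computed explicitly with dplyr and checked against the weighted bar chart. The sketch below is one way to do it; `carat_by_cut` and `total_carat` are names made up for this example.

```r
library(ggplot2)
library(dplyr)

# Total carat per cut, computed by hand
carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(total_carat = sum(carat))
carat_by_cut

# Bar heights computed by geom_bar() with the weight aesthetic
p_weight <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, weight = carat))
all.equal(layer_data(p_weight)$count, carat_by_cut$total_carat)

# geom_col() plots the pre-computed totals directly
ggplot(data = carat_by_cut) +
  geom_col(mapping = aes(x = cut, y = total_carat))
```

Summarising first and then calling `geom_col()` is often clearer when the summary is also needed elsewhere, for example in a table.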

Positional adjustments | r4ds chapter 3.8

Color bar:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, colour = cut))

Fill color:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))

Fill color according to another variable:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

The stacking is performed automatically by the position adjustment specified by the position argument.
If you don’t want a stacked bar chart, you can use one of three other options in place of the default "stack":
- "identity"
- "dodge"
- "fill"

position = "identity" will place each object exactly where it falls in the context of the graph.

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

Setting alpha to a small value makes the bars transparent, so that overlapping bars remain visible.
The identity position adjustment is more useful for 2d geoms such as points, where it is the default.

position = "fill" works like stacking, but makes each set of stacked bars the same height.
- easier to compare proportions across groups.
```
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
```
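The segments drawn by `position = "fill"` correspond to the share of each clarity level within a cut. As a rough illustration, these shares can be computed by hand with dplyr; the object names are made up for this example.

```r
library(ggplot2)  # for the diamonds data
library(dplyr)

# Share of each clarity level within each cut
clarity_props <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
clarity_props

# Within every cut the shares add up to 1, which is why all bars
# reach the same height under position = "fill"
fill_totals <- clarity_props %>%
  group_by(cut) %>%
  summarise(total = sum(prop))
fill_totals
```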

position = "dodge" places overlapping objects directly beside one another.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

position_jitter() adds random noise to the x and y position of each element to avoid overplotting:
```
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
```

geom_jitter() is similar:

ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy))
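Because the noise is random, a jittered plot changes every time it is drawn. The sketch below passes `position_jitter()` explicitly to control the amount of noise and fix the random seed; the width, height, and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# Explicit jitter: at most 0.1 litres horizontally and 0.5 mpg vertically,
# with a fixed seed so that the plot is reproducible
p_jit <- ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.5, seed = 7360)
  )

# Building the plot twice gives the same jittered positions
jit_a <- layer_data(p_jit)
jit_b <- layer_data(p_jit)
identical(jit_a$x, jit_b$x)

p_jit
```

Keeping the jitter small relative to the spacing of the data preserves the overall pattern while still separating overlapping points.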

Data visualization with ggplot2

MATH-7360 Data Analysis

Dr. Xiang Ji @ Tulane University

Oct 4, 2021

