rm(list = ls()) # clean-up workspace
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4
## [5] readr_2.0.1 tidyr_1.1.4 tibble_3.1.5 ggplot2_3.3.5
## [9] tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.1 xfun_0.25 bslib_0.2.5.1 haven_2.4.3
## [5] colorspace_2.0-2 vctrs_0.3.8 generics_0.1.0 htmltools_0.5.1.1
## [9] yaml_2.2.1 utf8_1.2.2 rlang_0.4.11 jquerylib_0.1.4
## [13] pillar_1.6.3 glue_1.4.2 withr_2.4.2 DBI_1.1.1
## [17] dbplyr_2.1.1 modelr_0.1.8 readxl_1.3.1 lifecycle_1.0.1
## [21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_1.0.1
## [25] evaluate_0.14 knitr_1.33 tzdb_0.1.2 fansi_0.5.0
## [29] broom_0.7.9 Rcpp_1.0.7 backports_1.2.1 scales_1.1.1
## [33] jsonlite_1.7.2 fs_1.5.0 hms_1.1.0 digest_0.6.28
## [37] stringi_1.7.3 grid_4.1.1 cli_3.0.1 tools_4.1.1
## [41] magrittr_2.0.1 sass_0.4.0 crayon_1.4.1 pkgconfig_2.0.3
## [45] ellipsis_0.3.2 xml2_1.3.2 reprex_2.0.1 lubridate_1.7.10
## [49] rstudioapi_0.13 assertthat_0.2.1 rmarkdown_2.10 httr_1.4.2
## [53] R6_2.5.1 compiler_4.1.1
Mid-term evaluation (voluntary, anonymous, ~ 10 min)
Lab 3 solutions posted
HW1 deadline extended to this Wednesday 10/06
HW2 will be posted on Friday
“The simple graph has brought more information to the data analyst’s mind than any other device.”
John Tukey
mpg data
mpg data is available from the ggplot2 package:
mpg %>% print(width = Inf)
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p
## 4 audi a4 2 2008 4 auto(av) f 21 30 p
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p
## 7 audi a4 3.1 2008 6 auto(av) f 18 27 p
## 8 audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p
## 9 audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p
## 10 audi a4 quattro 2 2008 4 manual(m6) 4 20 28 p
## class
## <chr>
## 1 compact
## 2 compact
## 3 compact
## 4 compact
## 5 compact
## 6 compact
## 7 compact
## 8 compact
## 9 compact
## 10 compact
## # … with 224 more rows
Tibbles are a generalized form of data frames and are used extensively in the tidyverse.
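One practical difference between tibbles and plain data frames is single-column subsetting. The following is a minimal sketch, assuming the tibble and ggplot2 packages are installed; the object names df and tb are made up for illustration.

```r
library(tibble)

# Plain data frame and tibble versions of the same data
df <- as.data.frame(ggplot2::mpg)
tb <- as_tibble(df)

# A tibble is still a data frame, with extra classes on top
class(tb)

# Selecting one column with [ drops to a vector for a data frame,
# but stays a one-column tibble for a tibble
class(df[, "hwy"])
class(tb[, "hwy"])
```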
displ: engine displacement, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
hwy vs displ:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
An aesthetic maps data to a specific visual feature of the plot.
Check the available aesthetics for a geometric object via its help page, e.g. ?geom_point.
Color points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
Assign different sizes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
## Warning: Using size for a discrete variable is not advised.
Assign different transparency levels to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
By default, ggplot2 uses at most 6 shapes at a time; additional groups go unplotted.
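If every class should be drawn, one option is to supply the shapes by hand. This is a sketch rather than the only approach; the shape codes 1 to 7 are an arbitrary choice, and n_class and p_shapes are names made up for this example.

```r
library(ggplot2)

# class has more levels than the default shape palette provides
n_class <- length(unique(mpg$class))

# Supply one shape code per class explicitly (codes 1..n_class, chosen arbitrarily)
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = seq_len(n_class))
p_shapes
```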
Set the color of all points to be blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
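Note that color = "blue" sits outside aes() here. A common slip is to put it inside aes(), which maps the constant string "blue" to the color scale instead of setting the color. The sketch below compares the two using layer_data(); the names p_set and p_mapped are made up for illustration.

```r
library(ggplot2)

# Color set manually: outside aes(), applied to every point as-is
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Color mapped: inside aes(), "blue" is treated as a one-level variable
# and the color is picked by the default discrete color scale
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Inspect the colors actually used to draw the points
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```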
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class)
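facet_grid() can also facet along a single dimension by putting a dot on the other side of the formula. A small sketch, using cyl as the column variable purely as an example (p_cols is a made-up name):

```r
library(ggplot2)

# One column of panels per value of cyl, no faceting in the row dimension
p_cols <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)
p_cols
```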
geom_smooth(): smooth line
How are these two plots similar?
hwy vs displ line:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Different line types according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Lines overlaid over scatter plot:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Same as the following, where the shared mapping is set once in ggplot():
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
diamonds data
diamonds data:
diamonds
## # A tibble: 53,940 × 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # … with 53,930 more rows
geom_bar() creates a bar chart:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
Check available computed variables for a geometric object via help:
?geom_bar
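Besides the help page, the computed variables can be inspected directly on a built plot. The sketch below uses layer_data() from ggplot2; p_bar and bar_data are names made up for this example.

```r
library(ggplot2)

p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# layer_data() returns the data frame that the layer actually draws,
# including the variables computed by the stat (here count and prop)
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count", "prop")]
```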
Use stat_count() directly:
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
stat_count() has geom_bar() as its default geom.
Display relative frequencies (proportions) instead of counts:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
Note that the aesthetic mapping group = 1 overrides the default grouping (by cut) by treating all observations as a single group. Without it, each bar is its own group, so every proportion equals 1:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
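To see what the grouped proportions should be, they can be computed by hand and compared with what the stat produces. This is a sketch assuming dplyr is loaded and a ggplot2 version that provides after_stat() (3.3.0 or later); cut_props and p_prop are made-up names.

```r
library(ggplot2)
library(dplyr)

# Proportion of diamonds in each cut category, computed by hand
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
cut_props

# Proportions computed by stat_count() when all rows form one group
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
layer_data(p_prop)$prop
```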
geom_bar() vs geom_col()
geom_bar() makes the height of the bar proportional to the number of cases in each group (or, if the weight aesthetic is supplied, the sum of the weights).
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut))
The height of each bar is the number of diamonds in that cut category.
geom_col() makes the heights of the bars represent values in the data.
ggplot(data = diamonds) +
geom_col(mapping = aes(x = cut, y = carat))
The height of each bar is the total carat in that cut category. The same bar heights can be obtained with geom_bar() and the weight aesthetic:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, weight = carat))
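As a rough check that the weighted bar heights really are the total carat per cut, the sums can be computed with dplyr and compared with the counts that the stat produces. The names carat_by_cut, p_weight and weighted_heights are made up for this sketch.

```r
library(ggplot2)
library(dplyr)

# Total carat in each cut category, computed by hand
carat_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarise(total_carat = sum(carat))

# Bar heights computed by stat_count() with weight = carat
p_weight <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, weight = carat))
weighted_heights <- layer_data(p_weight)$count

# Compare the two, allowing for floating point rounding
all.equal(weighted_heights, carat_by_cut$total_carat)
```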
Color the bar outlines according to cut:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
Fill the bars with color according to cut:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
The stacking is performed automatically by the position adjustment specified by the position argument.
If you don’t want a stacked bar chart, you can use one of three other options for position:
"identity"
"dodge"
"fill"
The default is "stack".
position = "identity" will place each object exactly where it falls in the context of the graph.
ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) +
geom_bar(alpha = 1/5, position = "identity")
ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) +
geom_bar(fill = NA, position = "identity")
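With position = "identity" the bars overlap. Setting alpha to a small value makes the bars slightly transparent, and setting fill = NA makes them completely transparent.
The identity position adjustment is more useful for 2d geoms, like points, where it is the default.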
position = "fill" works like stacking, but makes each set of stacked bars the same height.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
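Because position = "fill" rescales every stack to a total height of 1, the y axis then shows proportions rather than counts. One possible way to make that explicit is to relabel the axis; the sketch below assumes the scales package (a ggplot2 dependency) is available, and p_fill is a made-up name.

```r
library(ggplot2)

p_fill <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") +
  scale_y_continuous(labels = scales::percent) +  # show 0.25 as 25%
  labs(y = "proportion")                          # default label would be "count"
p_fill
```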
position = "dodge" places overlapping objects directly beside one another.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
position_jitter() adds random noise to the x and y positions of each element to avoid overplotting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is a shorthand for geom_point(position = "jitter"):
ggplot(data = mpg) +
geom_jitter(mapping = aes(x = displ, y = hwy))
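Since the noise is random, the plot changes slightly on every run. If a reproducible figure is needed, one option is to call position_jitter() explicitly and control the amount of noise and the seed. The width, height and seed values below are arbitrary choices for illustration, and jit and p_jit are made-up names.

```r
library(ggplot2)

# Explicit jitter: noise of up to 0.1 in x and 0.5 in y, with a fixed seed
jit <- position_jitter(width = 0.1, height = 0.5, seed = 203)

p_jit <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), position = jit)
p_jit
```

With a fixed seed, rebuilding the plot should give the same jittered positions each time.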