Oliver’s question from last lecture: how to supply arguments to functions inside across
Take a look at the documentation of across
here
Notice the ...
and the .fns
input formats
A purrr-style lambda (?), e.g. ~ mean(.x, na.rm = TRUE)
purrr
package documentation
Anonymous functions are verbose in R, so we provide two convenient shorthands.
For unary functions,~ .x + 1
is equivalent to function(.x) .x + 1
.
For chains of transformations functions, . %>% f() %>% g()
is equivalent to function(.) . %>% f() %>% g()
(this shortcut is provided by magrittr).
library(nycflights13)
test.case.1 <- flights %>%
transmute(across(1:4, list(log = log, log2 = log2)))
test.case.2 <- flights %>%
transmute(across(1:4, list(log = log, log.2.base = ~ log(.x, base = 2))))
test.case.3 <- flights %>%
transmute(across(1:4, list(log = log, log.2.base = function(.x) log(.x, base = 2))))
all(test.case.1 == test.case.2, na.rm = TRUE)
## [1] TRUE
all(test.case.1 == test.case.3, na.rm = TRUE)
## [1] TRUE
HW2 questions
For problem 2 of Homework 2, are the two datasets we need to link “Louisiana-history.csv” and “CRDT Data -CRDT.csv”? Also, does linking mean combining the two datasets or just plotting them on the same graph for comparison?
No, the two datasets refer to the two data sources: (1) Louisiana Department of Health (2) The COVID tracking project
Both data sources have daily counts. Please compare them with whatever visualization methods you see fit.
…about Q3, the question says that “Recreate the 4 plots with 7-day average of Full range”, does this mean that I need to create a new table transfer the “by day” data to “by week”? Or there already have the dataset on website I can use.
Presumably, you should have the same amount of information as in the figures from the downloaded files. The question is asking you to present the information in the same fashion as on the website. I expect you to find answers about using “by day” or “by week” summaries through investigating the four figures, after all, you are asked to recreate them.
There are choices for “State-level Metrics” (Total vs Per 1M people, Last 90 days vs Historical). Feel free to pick one (enough for the homework) or several or all combinations for your plots.
stringr pacakge, by Hadley Wickham, provides utilities for handling strings.
Included in tidyverse.
library("tidyverse")
# load htmlwidgets
library(htmlwidgets)
Main functions:
str_detect(string, pattern)
: Detect the presence or absence of a pattern in a string.str_locate(string, pattern)
: Locate the first position of a pattern and return a matrix with start and end.str_extract(string, pattern)
: Extracts text corresponding to the first match.str_match(string, pattern)
: Extracts capture groups formed by () from the first match.str_split(string, pattern)
: Splits string into pieces and returns a list of character vectors.str_replace(string, pattern, replacement)
: Replaces the first matched pattern and returns a character vector.Variants with an _all
suffix will match more than 1 occurrence of the pattern in a given string.
Most functions are vectorized.
Strings are enclosed by double quotes or single quotes:
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'
Literal single or double quote:
double_quote <- "\"" # or '"'
single_quote <- '\'' # or "'"
Printed representation:
x <- c("\"", "\\")
x
## [1] "\"" "\\"
vs cat
:
cat(x)
## " \
vs writeLines()
:
writeLines(x)
## "
## \
Other special characters: "\n"
(new line), "\t"
(tab), … Check
?"'"
for a complete list.
cat("a\"b")
## a"b
cat("a\tb")
## a b
cat("a\nb")
## a
## b
Unicode
x <- "\u00b5"
x
## [1] "µ"
Character vector (vector of strings):
c("one", "two", "three")
## [1] "one" "two" "three"
Length of a single string:
str_length("R for data science")
## [1] 18
Lengths of a character vector:
str_length(c("a", "R for data science", NA))
## [1] 1 18 NA
Read str_c
’s documentation
Combine two or more strings
str_c("x", "y")
## [1] "xy"
str_c("x", "y", "z")
## [1] "xyz"
Separator:
str_c("x", "y", sep = ", ")
## [1] "x, y"
str_c()
is vectorised:
str_c("prefix-", c("a", "b", "c"), "-suffix")
## [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
Objects of length 0 are silently dropped:
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
str_c(
"Good ", time_of_day, " ", name,
if (birthday) " and HAPPY BIRTHDAY",
"."
)
## [1] "Good morning Hadley."
Combine a vector of strings:
str_c(c("x", "y", "z"))
## [1] "x" "y" "z"
str_c(c("x", "y", "z"), collapse = ", ")
## [1] "x, y, z"
str_sub
’s documentation
By position:
str_sub("Apple", 1, 3)
## [1] "App"
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
## [1] "App" "Ban" "Pea"
Negative numbers count backwards from end:
str_sub(x, -3, -1)
## [1] "ple" "ana" "ear"
Out of range:
str_sub("a", 1, 5)
## [1] "a"
str_sub("a", 2, 5)
## [1] ""
Assignment to a substring:
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
## [1] "apple" "banana" "pear"
str_view()
shows the first match;
str_view_all()
shows all matches.
Match exact strings:
x <- c("apple", "banana", "pear")
str_view(x, "an")
str_view_all(x, "an")
.
matches any character apart from a newline:
str_view(x, ".a.")
Regex escapes live on top of regular string escapes, so there needs to be two levels of escapes..
To match a literal .
:
# doesn't work because "a\.c" is treated as a regular expression
str_view(c("abc", "a.c", "bef"), "a\.c")
## Error: '\.' is an unrecognized escape in character string starting ""a\."
# regular expression needs double escape
str_view(c("abc", "a.c", "bef"), "a\\.c")
To match a literal \
:
str_view("a\\b", "\\\\")
List of typographical symbols and punctuation marks wikipedia
^
matches the start of the string:
x <- c("apple", "banana", "pear")
str_view(x, "^a")
$
matches the end of the string:
str_view(x, "a$")
To force a regular expression to only match a complete string:
x <- c("apple pie", "apple", "apple cake")
str_view(x, "^apple$")
[abc]
: matches a, b, or c. Same as (a|b|c)
.[^abc]
: matches anything except a, b, or c.str_view(c("grey", "gray"), "gr(e|a)y")
str_view(c("grey", "gray"), "gr[ea]y")
?
: 0 or 1
+
: 1 or more
*
: 0 or more
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
# match either C or CC, being greedy here
str_view(x, "CC?")
# greedy matches
str_view(x, "CC+")
# greedy matches
str_view(x, 'C[LX]+')
Specify number of matches:
{n}
: exactly n
{n,}
: n or more
{,m}
: at most m
{n,m}
: between n and m
str_view(x, "C{2}")
# greedy matches
str_view(x, "C{2,}")
# greedy matches
str_view(x, "C{2,3}")
Greedy (default) vs lazy (put ?
after repetition):
# lazy matches
str_view(x, 'C{2,3}?')
# lazy matches
str_view(x, 'C[LX]+?')
What went wrong here?
text = "<div class='main'> <div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div> </div>"
If we add ? after a quantifier, the matching will be lazy (find the shortest possible match, not the longest).
str_extract(text, "<div>.*?</div>")
## [1] "<div> <a href='here.pdf'>Here!</a> </div>"
fruit
is a character vector pre-defined in stringr
package:
fruit
## [1] "apple" "apricot" "avocado"
## [4] "banana" "bell pepper" "bilberry"
## [7] "blackberry" "blackcurrant" "blood orange"
## [10] "blueberry" "boysenberry" "breadfruit"
## [13] "canary melon" "cantaloupe" "cherimoya"
## [16] "cherry" "chili pepper" "clementine"
## [19] "cloudberry" "coconut" "cranberry"
## [22] "cucumber" "currant" "damson"
## [25] "date" "dragonfruit" "durian"
## [28] "eggplant" "elderberry" "feijoa"
## [31] "fig" "goji berry" "gooseberry"
## [34] "grape" "grapefruit" "guava"
## [37] "honeydew" "huckleberry" "jackfruit"
## [40] "jambul" "jujube" "kiwi fruit"
## [43] "kumquat" "lemon" "lime"
## [46] "loquat" "lychee" "mandarine"
## [49] "mango" "mulberry" "nectarine"
## [52] "nut" "olive" "orange"
## [55] "pamelo" "papaya" "passionfruit"
## [58] "peach" "pear" "persimmon"
## [61] "physalis" "pineapple" "plum"
## [64] "pomegranate" "pomelo" "purple mangosteen"
## [67] "quince" "raisin" "rambutan"
## [70] "raspberry" "redcurrant" "rock melon"
## [73] "salal berry" "satsuma" "star fruit"
## [76] "strawberry" "tamarillo" "tangerine"
## [79] "ugli fruit" "watermelon"
Parentheses define groups, which can be back-referenced as \1
, \2
, …
# only show matched strings
str_view(fruit, "(..)\\1", match = TRUE)
x <- c("apple", "banana", "pear")
str_detect(x, "e")
## [1] TRUE FALSE TRUE
Vector words
contains about 1000 commonly used words:
length(words)
## [1] 980
head(words)
## [1] "a" "able" "about" "absolute" "accept" "account"
# How many common words start with t?
sum(str_detect(words, "^t"))
## [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
## [1] 0.2765306
Find words that end with x
:
words[str_detect(words, "x$")]
## [1] "box" "sex" "six" "tax"
same as
str_subset(words, "x$")
## [1] "box" "sex" "six" "tax"
Filter a data frame:
df <- tibble(
word = words,
i = seq_along(word)
)
df %>%
filter(str_detect(words, "x$"))
## # A tibble: 4 × 2
## word i
## <chr> <int>
## 1 box 108
## 2 sex 747
## 3 six 772
## 4 tax 841
str_count()
tells how many matches are found:
x <- c("apple", "banana", "pear")
str_count(x, "a")
## [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
## [1] 1.991837
Matches never overlap:
str_count("abababa", "aba")
## [1] 2
str_view_all("abababa", "aba")
Mutate a data frame:
df %>%
mutate(
vowels = str_count(word, "[aeiou]"),
consonants = str_count(word, "[^aeiou]")
)
## # A tibble: 980 × 4
## word i vowels consonants
## <chr> <int> <int> <int>
## 1 a 1 1 0
## 2 able 2 2 2
## 3 about 3 3 2
## 4 absolute 4 4 4
## 5 accept 5 2 4
## 6 account 6 3 4
## 7 achieve 7 4 3
## 8 across 8 2 4
## 9 act 9 1 2
## 10 active 10 3 3
## # … with 970 more rows
sentences
is a collection of 720 phrases:
length(sentences)
## [1] 720
head(sentences)
## [1] "The birch canoe slid on the smooth planks."
## [2] "Glue the sheet to the dark blue background."
## [3] "It's easy to tell the depth of a well."
## [4] "These days a chicken leg is a rare dish."
## [5] "Rice is often served in round bowls."
## [6] "The juice of lemons makes fine punch."
Suppose we want to find all sentences that contain a colour.
Create a collection of colours:
(colours <- c("red", "orange", "yellow", "green", "blue", "purple"))
## [1] "red" "orange" "yellow" "green" "blue" "purple"
(colour_match <- str_c(colours, collapse = "|"))
## [1] "red|orange|yellow|green|blue|purple"
Select the sentences that contain a colour, and then extract the colour to figure out which one it is:
(has_colour <- str_subset(sentences, colour_match))
## [1] "Glue the sheet to the dark blue background."
## [2] "Two blue fish swam in the tank."
## [3] "The colt reared and threw the tall rider."
## [4] "The wide road shimmered in the hot sun."
## [5] "See the cat glaring at the scared mouse."
## [6] "A wisp of cloud hung in the blue air."
## [7] "Leaves turn brown and yellow in the fall."
## [8] "He ordered peach pie with ice cream."
## [9] "Pure bred poodles have curls."
## [10] "The spot on the blotter was made by green ink."
## [11] "Mud was spattered on the front of his white shirt."
## [12] "The sofa cushion is red and of light weight."
## [13] "The sky that morning was clear and bright blue."
## [14] "Torn scraps littered the stone floor."
## [15] "The doctor cured him with these pills."
## [16] "The new girl was fired today at noon."
## [17] "The third act was dull and tired the players."
## [18] "A blue crane is a tall wading bird."
## [19] "Lire wires should be kept covered."
## [20] "It is hard to erase blue or red ink."
## [21] "The wreck occurred by the bank on Main Street."
## [22] "The lamp shone with a steady green flame."
## [23] "The box is held by a bright red snapper."
## [24] "The prince ordered his head chopped off."
## [25] "The houses are built of red clay bricks."
## [26] "The red tape bound the smuggled food."
## [27] "Nine men were hired to dig the ruins."
## [28] "The flint sputtered and lit a pine torch."
## [29] "Hedge apples may stain your hands green."
## [30] "The old pan was covered with hard fudge."
## [31] "The plant grew large and green in the window."
## [32] "The store walls were lined with colored frocks."
## [33] "The purple tie was ten years old."
## [34] "Bathe and relax in the cool green grass."
## [35] "The clan gathered on each dull night."
## [36] "The lake sparkled in the red hot sun."
## [37] "Mark the spot with a sign painted red."
## [38] "Smoke poured out of every crack."
## [39] "Serve the hot rum to the tired heroes."
## [40] "The couch cover and hall drapes were blue."
## [41] "He offered proof in the form of a lsrge chart."
## [42] "A man in a blue sweater sat at the desk."
## [43] "The sip of tea revives his tired friend."
## [44] "The door was barred, locked, and bolted as well."
## [45] "A thick coat of black paint covered all."
## [46] "The small red neon lamp went out."
## [47] "Paint the sockets in the wall dull green."
## [48] "Wake and rise, and step into the green outdoors."
## [49] "The green light in the brown box flickered."
## [50] "He put his last cartridge into the gun and fired."
## [51] "The ram scared the school children off."
## [52] "Tear a thin sheet from the yellow pad."
## [53] "Dimes showered down from all sides."
## [54] "The sky in the west is tinged with orange red."
## [55] "The red paper brightened the dim stage."
## [56] "The hail pattered on the burnt brown grass."
## [57] "The big red apple fell to the ground."
(matches <- str_extract(has_colour, colour_match))
## [1] "blue" "blue" "red" "red" "red" "blue" "yellow" "red"
## [9] "red" "green" "red" "red" "blue" "red" "red" "red"
## [17] "red" "blue" "red" "blue" "red" "green" "red" "red"
## [25] "red" "red" "red" "red" "green" "red" "green" "red"
## [33] "purple" "green" "red" "red" "red" "red" "red" "blue"
## [41] "red" "blue" "red" "red" "red" "red" "green" "green"
## [49] "green" "red" "red" "yellow" "red" "orange" "red" "red"
## [57] "red"
str_extract()
only extracts the first match.
more <- sentences[str_count(sentences, colour_match) > 1]
str_view_all(more, colour_match)
str_extract_all()
extracts all matches:
str_extract_all(more, colour_match)
## [[1]]
## [1] "blue" "red"
##
## [[2]]
## [1] "green" "red"
##
## [[3]]
## [1] "orange" "red"
Setting simplify = TRUE
in str_extract_all()
will return a matrix with short matches expanded to the same length as the longest:
str_extract_all(more, colour_match, simplify = TRUE)
## [,1] [,2]
## [1,] "blue" "red"
## [2,] "green" "red"
## [3,] "orange" "red"
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
## [,1] [,2] [,3]
## [1,] "a" "" ""
## [2,] "a" "b" ""
## [3,] "a" "b" "c"
str_extract()
gives us the complete match:
# why "([^ ]+)" match a word?
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
str_subset(noun) %>%
head(10)
has_noun %>%
str_extract(noun)
## [1] "the smooth" "the sheet" "the depth" "a chicken" "the parked"
## [6] "the sun" "the huge" "the ball" "the woman" "a helps"
str_match()
gives each individual component:
has_noun %>%
str_match(noun)
## [,1] [,2] [,3]
## [1,] "the smooth" "the" "smooth"
## [2,] "the sheet" "the" "sheet"
## [3,] "the depth" "the" "depth"
## [4,] "a chicken" "a" "chicken"
## [5,] "the parked" "the" "parked"
## [6,] "the sun" "the" "sun"
## [7,] "the huge" "the" "huge"
## [8,] "the ball" "the" "ball"
## [9,] "the woman" "the" "woman"
## [10,] "a helps" "a" "helps"
tidyr::extract()
works with tibble:
tibble(sentence = sentences) %>%
tidyr::extract(
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE
)
## # A tibble: 720 × 3
## sentence article noun
## <chr> <chr> <chr>
## 1 The birch canoe slid on the smooth planks. the smooth
## 2 Glue the sheet to the dark blue background. the sheet
## 3 It's easy to tell the depth of a well. the depth
## 4 These days a chicken leg is a rare dish. a chicken
## 5 Rice is often served in round bowls. <NA> <NA>
## 6 The juice of lemons makes fine punch. <NA> <NA>
## 7 The box was thrown beside the parked truck. the parked
## 8 The hogs were fed chopped corn and garbage. <NA> <NA>
## 9 Four hours of steady work faced us. <NA> <NA>
## 10 Large size in stockings is hard to sell. <NA> <NA>
## # … with 710 more rows
Replace the first match:
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
## [1] "-pple" "p-ar" "b-nana"
Replace all matches:
str_replace_all(x, "[aeiou]", "-")
## [1] "-ppl-" "p--r" "b-n-n-"
Multiple replacement:
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
## [1] "one house" "two cars" "three people"
Back-reference:
# flip the order of the second and third words
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
## [1] "The canoe birch slid on the smooth planks."
## [2] "Glue sheet the to the dark blue background."
## [3] "It's to easy tell the depth of a well."
## [4] "These a days chicken leg is a rare dish."
## [5] "Rice often is served in round bowls."
Split a string up into pieces:
sentences %>%
head(5) %>%
str_split(" ")
## [[1]]
## [1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
## [8] "planks."
##
## [[2]]
## [1] "Glue" "the" "sheet" "to" "the"
## [6] "dark" "blue" "background."
##
## [[3]]
## [1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
##
## [[4]]
## [1] "These" "days" "a" "chicken" "leg" "is" "a"
## [8] "rare" "dish."
##
## [[5]]
## [1] "Rice" "is" "often" "served" "in" "round" "bowls."
Use simplify = TRUE
to return a matrix:
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
## [2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
## [3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
## [4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
## [5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
## [,9]
## [1,] ""
## [2,] ""
## [3,] "well."
## [4,] "dish."
## [5,] ""
A number of built-in convenience classes of characters:
.
: Any character except new line \n
.\s
: White space.\S
: Not white space.\d
: Digit (0-9).\D
: Not digit.\w
: Word (A-Z, a-z, 0-9, or _).\W
: Not word.Example: How to match a telephone number with the form (###) ###-####?
text = c("apple", "(219) 733-8965", "(329) 293-8753")
str_detect(text, "(\d\d\d) \d\d\d-\d\d\d\d")
## Error: '\d' is an unrecognized escape in character string starting ""(\d"
text = c("apple", "(219) 733-8965", "(329) 293-8753")
str_detect(text, "(\\d\\d\\d) \\d\\d\\d-\\d\\d\\d\\d")
## [1] FALSE FALSE FALSE
str_detect(text, "\\(\\d\\d\\d\\) \\d\\d\\d-\\d\\d\\d\\d")
## [1] FALSE TRUE TRUE
[abc]
: List (a or b or c)[^abc]
: Excluded list (not a or b or c)[a-q]
: Range lower case letter from a to q[A-Q]
: Range upper case letter from A to Q[0-7]
: Digit from 0 to 7text = c("apple", "(219) 733-8965", "(329) 293-8753")
str_replace_all(text, "[aeiou]", "") # strip all vowels
## [1] "ppl" "(219) 733-8965" "(329) 293-8753"
str_replace_all(text, "[13579]", "*")
## [1] "apple" "(2**) ***-8*6*" "(*2*) 2**-8***"
str_replace_all(text, "[1-5a-ep]", "*")
## [1] "***l*" "(**9) 7**-896*" "(**9) *9*-87**"