More HW2 questions (thank you for bringing them up)
weekly total case = weekly case = sum of new cases in a given week
.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.28 R6_2.5.1 jsonlite_1.7.2 magrittr_2.0.1
## [5] evaluate_0.14 rlang_0.4.11 stringi_1.7.3 jquerylib_0.1.4
## [9] bslib_0.2.5.1 rmarkdown_2.10 tools_4.1.1 stringr_1.4.0
## [13] xfun_0.25 yaml_2.2.1 compiler_4.1.1 htmltools_0.5.1.1
## [17] knitr_1.33 sass_0.4.0
Dr. Hua Zhou’s slides
Josh McCrain’s RSelenium tutorial
HTML Introduction from GeeksforGeeks
Getting started with HTML MDN Web Docs
Load tidyverse and other packages for this lecture:
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.1 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library("rvest")
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
We cover some survival amount of instroduction of HTML format first.
HTML stands for HyperText Markup Language.
used to design web pages using a markup language
combination of Hypertext and Markup language
Hypertext defines the link between the web pages
A markup language is used to define the text document within tag which defines the structure of web pages.
Elements can also have attributes. Attributes look like this:
Attributes contain extra information about the element that won’t appear in the content.
In this example, the class
attribute is an identifying name used to target the element with style information.
An attribute should have:
A space between it and the element name. (For an element with more than one attribute, the attributes should be separated by spaces too.)
The attribute name, followed by an equal sign.
An attribute value, wrapped with opening and closing quote marks.
Another example of an element is <a>
. This stands for anchor. An anchor can make the text it encloses into a hyperlink. Anchors can take a number of attributes, but several are as follows:
href
: This attribute’s value specifies the web address for the link. For example: href="https://www.mozilla.org/"
.
title
: The title
attribute specifies extra information about the link, such as a description of the page that is being linked to. For example, title="The Mozilla homepage"
. This appears as a tooltip when a cursor hovers over the element.
target
: The target
attribute specifies the browsing context used to display the link. For example, target="_blank"
will display the link in a new tab. If you want to display the linked content in the current tab, just omit this attribute.
The basic structure of an HTML page is laid out below.
It contains the essential building-block elements upon which all web pages are created.
doctype declaration
HTML
head
title
body elements
To write an HTML comment, wrap it in the special markers <!-- and -->
. For example:
<p>I'm not inside a comment</p>
generates:
<!-- <p>I am!</p> -->
I’m not inside a comment
There is a wealth of data on internet. How to scrape them and analyze them?
rvest is an R package written by Hadley Wickham which makes web scraping easy.
We follow instructions in a Blog by SAURAV KAUSHIK to find the most popular feature films of 2019.
Install the SelectorGadget extension for Chrome.
The 100 most popular feature films released in 2019 can be accessed at page https://www.imdb.com/search/title?count=100&release_date=2019,2019&title_type=feature.
#Loading the rvest and tidyverse package
#Specifying the url for desired website to be scraped
url <- "http://www.imdb.com/search/title?count=100&release_date=2019,2019&title_type=feature"
#Reading the HTML code from the website
(webpage <- read_html(url))
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
Suppose we want to scrape following 11 features from this page:
Use the CSS selector to get the rankings
# Use CSS selectors to scrap the rankings section
(rank_data_html <- html_nodes(webpage, '.text-primary'))
## {xml_nodeset (100)}
## [1] <span class="lister-item-index unbold text-primary">1.</span>
## [2] <span class="lister-item-index unbold text-primary">2.</span>
## [3] <span class="lister-item-index unbold text-primary">3.</span>
## [4] <span class="lister-item-index unbold text-primary">4.</span>
## [5] <span class="lister-item-index unbold text-primary">5.</span>
## [6] <span class="lister-item-index unbold text-primary">6.</span>
## [7] <span class="lister-item-index unbold text-primary">7.</span>
## [8] <span class="lister-item-index unbold text-primary">8.</span>
## [9] <span class="lister-item-index unbold text-primary">9.</span>
## [10] <span class="lister-item-index unbold text-primary">10.</span>
## [11] <span class="lister-item-index unbold text-primary">11.</span>
## [12] <span class="lister-item-index unbold text-primary">12.</span>
## [13] <span class="lister-item-index unbold text-primary">13.</span>
## [14] <span class="lister-item-index unbold text-primary">14.</span>
## [15] <span class="lister-item-index unbold text-primary">15.</span>
## [16] <span class="lister-item-index unbold text-primary">16.</span>
## [17] <span class="lister-item-index unbold text-primary">17.</span>
## [18] <span class="lister-item-index unbold text-primary">18.</span>
## [19] <span class="lister-item-index unbold text-primary">19.</span>
## [20] <span class="lister-item-index unbold text-primary">20.</span>
## ...
# (rank_data_html <- html_nodes(webpage, '.lister-item-content .text-primary'))
# Convert the ranking data to text
(rank_data <- html_text(rank_data_html))
## [1] "1." "2." "3." "4." "5." "6." "7." "8." "9." "10."
## [11] "11." "12." "13." "14." "15." "16." "17." "18." "19." "20."
## [21] "21." "22." "23." "24." "25." "26." "27." "28." "29." "30."
## [31] "31." "32." "33." "34." "35." "36." "37." "38." "39." "40."
## [41] "41." "42." "43." "44." "45." "46." "47." "48." "49." "50."
## [51] "51." "52." "53." "54." "55." "56." "57." "58." "59." "60."
## [61] "61." "62." "63." "64." "65." "66." "67." "68." "69." "70."
## [71] "71." "72." "73." "74." "75." "76." "77." "78." "79." "80."
## [81] "81." "82." "83." "84." "85." "86." "87." "88." "89." "90."
## [91] "91." "92." "93." "94." "95." "96." "97." "98." "99." "100."
# Turn into numerical values
(rank_data <- as.integer(rank_data))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
Use SelectorGadget to find the CSS selector .lister-item-header a
.
# Using CSS selectors to scrap the title section
(title_data_html <- html_nodes(webpage, '.lister-item-header a'))
## {xml_nodeset (100)}
## [1] <a href="/title/tt8946378/?ref_=adv_li_tt">Knives Out</a>
## [2] <a href="/title/tt6751668/?ref_=adv_li_tt">Parasite</a>
## [3] <a href="/title/tt8772262/?ref_=adv_li_tt">Midsommar</a>
## [4] <a href="/title/tt7286456/?ref_=adv_li_tt">Joker</a>
## [5] <a href="/title/tt7131622/?ref_=adv_li_tt">Once Upon a Time... In Hollyw ...
## [6] <a href="/title/tt1620981/?ref_=adv_li_tt">The Addams Family</a>
## [7] <a href="/title/tt4154796/?ref_=adv_li_tt">Avengers: Endgame</a>
## [8] <a href="/title/tt7984734/?ref_=adv_li_tt">The Lighthouse</a>
## [9] <a href="/title/tt5606664/?ref_=adv_li_tt">Doctor Sleep</a>
## [10] <a href="/title/tt8579674/?ref_=adv_li_tt">1917</a>
## [11] <a href="/title/tt3281548/?ref_=adv_li_tt">Little Women</a>
## [12] <a href="/title/tt7349950/?ref_=adv_li_tt">It Chapter Two</a>
## [13] <a href="/title/tt7984766/?ref_=adv_li_tt">The King</a>
## [14] <a href="/title/tt8367814/?ref_=adv_li_tt">The Gentlemen</a>
## [15] <a href="/title/tt4154664/?ref_=adv_li_tt">Captain Marvel</a>
## [16] <a href="/title/tt7798634/?ref_=adv_li_tt">Ready or Not</a>
## [17] <a href="/title/tt2584384/?ref_=adv_li_tt">Jojo Rabbit</a>
## [18] <a href="/title/tt2527338/?ref_=adv_li_tt">Star Wars: The Rise Of Skywal ...
## [19] <a href="/title/tt6320628/?ref_=adv_li_tt">Spider-Man: Far from Home</a>
## [20] <a href="/title/tt4126476/?ref_=adv_li_tt">After</a>
## ...
# Converting the title data to text
(title_data <- html_text(title_data_html))
## [1] "Knives Out"
## [2] "Parasite"
## [3] "Midsommar"
## [4] "Joker"
## [5] "Once Upon a Time... In Hollywood"
## [6] "The Addams Family"
## [7] "Avengers: Endgame"
## [8] "The Lighthouse"
## [9] "Doctor Sleep"
## [10] "1917"
## [11] "Little Women"
## [12] "It Chapter Two"
## [13] "The King"
## [14] "The Gentlemen"
## [15] "Captain Marvel"
## [16] "Ready or Not"
## [17] "Jojo Rabbit"
## [18] "Star Wars: The Rise Of Skywalker"
## [19] "Spider-Man: Far from Home"
## [20] "After"
## [21] "Jumanji: The Next Level"
## [22] "Rocketman"
## [23] "Shazam!"
## [24] "Escape Room"
## [25] "John Wick: Chapter 3 - Parabellum"
## [26] "Us"
## [27] "Downton Abbey"
## [28] "Scary Stories to Tell in the Dark"
## [29] "The Irishman"
## [30] "Alita: Battle Angel"
## [31] "The Platform"
## [32] "Fast & Furious Presents: Hobbs & Shaw"
## [33] "Cats"
## [34] "Charlie's Angels"
## [35] "Zombieland: Double Tap"
## [36] "Official Secrets"
## [37] "Ford v Ferrari"
## [38] "Yesterday"
## [39] "Child's Play"
## [40] "The Dead Don't Die"
## [41] "Fighting with My Family"
## [42] "Anna"
## [43] "Aladdin"
## [44] "Uncut Gems"
## [45] "Vivarium"
## [46] "Bombshell"
## [47] "Ad Astra"
## [48] "Terminator: Dark Fate"
## [49] "Gemini Man"
## [50] "6 Underground"
## [51] "Good Boys"
## [52] "Booksmart"
## [53] "21 Bridges"
## [54] "Ma"
## [55] "Marriage Story"
## [56] "Godzilla: King of the Monsters"
## [57] "The Lion King"
## [58] "Motherless Brooklyn"
## [59] "The Lodge"
## [60] "Men in Black: International"
## [61] "Saint Maud"
## [62] "Sound of Metal"
## [63] "Glass"
## [64] "Paradise Hills"
## [65] "X-Men: Dark Phoenix"
## [66] "A Rainy Day in New York"
## [67] "Just Mercy"
## [68] "Portrait of a Lady on Fire"
## [69] "The Outpost"
## [70] "The Room"
## [71] "El Camino: A Breaking Bad Movie"
## [72] "I See You"
## [73] "Midway"
## [74] "Frozen II"
## [75] "Hustlers"
## [76] "Angel Has Fallen"
## [77] "Color Out of Space"
## [78] "Dark Waters"
## [79] "Toy Story 4"
## [80] "Hellboy"
## [81] "Haunt"
## [82] "Polar"
## [83] "The Informer"
## [84] "Maleficent: Mistress of Evil"
## [85] "The Goldfinch"
## [86] "The Peanut Butter Falcon"
## [87] "Crawl"
## [88] "Benny Loves You"
## [89] "Extremely Wicked, Shockingly Evil and Vile"
## [90] "Annabelle Comes Home"
## [91] "Guns Akimbo"
## [92] "The Dirt"
## [93] "Murder Mystery"
## [94] "Fractured"
## [95] "Five Feet Apart"
## [96] "Swallow"
## [97] "Richard Jewell"
## [98] "A Beautiful Day in the Neighborhood"
## [99] "Judy"
## [100] "Velvet Buzzsaw"
# Using CSS selectors to scrap the description section
(description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))
## {xml_nodeset (100)}
## [1] <p class="text-muted">\nA detective investigates the death of a patriarc ...
## [2] <p class="text-muted">\nGreed and class discrimination threaten the newl ...
## [3] <p class="text-muted">\nA couple travels to Scandinavia to visit a rural ...
## [4] <p class="text-muted">\nIn Gotham City, mentally troubled comedian Arthu ...
## [5] <p class="text-muted">\nA faded television actor and his stunt double st ...
## [6] <p class="text-muted">\nThe eccentrically macabre family moves to a blan ...
## [7] <p class="text-muted">\nAfter the devastating events of <a href="/title/ ...
## [8] <p class="text-muted">\nTwo lighthouse keepers try to maintain their san ...
## [9] <p class="text-muted">\nYears following the events of <a href="/title/tt ...
## [10] <p class="text-muted">\nApril 6th, 1917. As a regiment assembles to wage ...
## [11] <p class="text-muted">\nJo March reflects back and forth on her life, te ...
## [12] <p class="text-muted">\nTwenty-seven years after their first encounter w ...
## [13] <p class="text-muted">\nHal, wayward prince and heir to the English thro ...
## [14] <p class="text-muted">\nAn American expat tries to sell off his highly p ...
## [15] <p class="text-muted">\nCarol Danvers becomes one of the universe's most ...
## [16] <p class="text-muted">\nA bride's wedding night takes a sinister turn wh ...
## [17] <p class="text-muted">\nA young German boy in the Hitler Youth whose her ...
## [18] <p class="text-muted">\nIn the riveting conclusion of the landmark Skywa ...
## [19] <p class="text-muted">\nFollowing the events of <a href="/title/tt415479 ...
## [20] <p class="text-muted">\nA young woman falls for a guy with a dark secret ...
## ...
# Converting the description data to text
description_data <- html_text(description_data_html)
# take a look at first few
head(description_data)
## [1] "\nA detective investigates the death of a patriarch of an eccentric, combative family."
## [2] "\nGreed and class discrimination threaten the newly formed symbiotic relationship between the wealthy Park family and the destitute Kim clan."
## [3] "\nA couple travels to Scandinavia to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
## [4] "\nIn Gotham City, mentally troubled comedian Arthur Fleck is disregarded and mistreated by society. He then embarks on a downward spiral of revolution and bloody crime. This path brings him face-to-face with his alter-ego: the Joker."
## [5] "\nA faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."
## [6] "\nThe eccentrically macabre family moves to a bland suburb where Wednesday Addams' friendship with the daughter of a hostile and conformist local reality show host exacerbates conflict between the families."
# strip the '\n'
description_data <- str_replace(description_data, "^\\n\\s+", "")
head(description_data)
## [1] "\nA detective investigates the death of a patriarch of an eccentric, combative family."
## [2] "\nGreed and class discrimination threaten the newly formed symbiotic relationship between the wealthy Park family and the destitute Kim clan."
## [3] "\nA couple travels to Scandinavia to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
## [4] "\nIn Gotham City, mentally troubled comedian Arthur Fleck is disregarded and mistreated by society. He then embarks on a downward spiral of revolution and bloody crime. This path brings him face-to-face with his alter-ego: the Joker."
## [5] "\nA faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."
## [6] "\nThe eccentrically macabre family moves to a bland suburb where Wednesday Addams' friendship with the daughter of a hostile and conformist local reality show host exacerbates conflict between the families."
# Using CSS selectors to scrap the Movie runtime section
(runtime_data <- webpage %>%
html_nodes('.runtime') %>%
html_text() %>%
str_replace(" min", "") %>%
as.integer())
## [1] 130 132 148 122 161 86 181 109 152 119 135 169 140 113 123 95 108 141
## [19] 129 105 123 121 132 99 130 116 122 108 209 122 94 137 110 118 99 112
## [37] 152 116 90 104 108 118 128 135 97 109 123 128 117 128 90 102 99 99
## [55] 137 132 118 144 108 114 84 120 129 95 113 92 137 122 123 100 122 98
## [73] 138 103 110 121 111 126 100 120 92 118 113 119 149 97 87 94 110 106
## [91] 98 107 97 99 116 94 131 109 118 113
# Using CSS selectors to scrap the Movie runtime section
runtime_data_html <- html_nodes(webpage, '.runtime')
# Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)
# Let's have a look at the runtime
head(runtime_data)
## [1] "130 min" "132 min" "148 min" "122 min" "161 min" "86 min"
# Data-Preprocessing: removing mins and converting it to numerical
runtime_data <- str_replace(runtime_data, " min", "")
runtime_data <- as.numeric(runtime_data)
#Let's have another look at the runtime data
head(runtime_data)
## [1] 130 132 148 122 161 86
Collect the (first) genre of each movie:
# Using CSS selectors to scrap the Movie genre section
genre_data_html <- html_nodes(webpage, '.genre')
# Converting the genre data to text
genre_data <- html_text(genre_data_html)
# Let's have a look at the genre data
head(genre_data)
## [1] "\nComedy, Crime, Drama "
## [2] "\nComedy, Drama, Thriller "
## [3] "\nDrama, Horror, Mystery "
## [4] "\nCrime, Drama, Thriller "
## [5] "\nComedy, Drama "
## [6] "\nAnimation, Adventure, Comedy "
# Data-Preprocessing: retrieve the first word
genre_data <- str_extract(genre_data, "[:alpha:]+")
# Convering each genre from text to factor
#genre_data <- as.factor(genre_data)
# Let's have another look at the genre data
head(genre_data)
## [1] "Comedy" "Comedy" "Drama" "Crime" "Comedy" "Animation"
# Using CSS selectors to scrap the IMDB rating section
rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
# Converting the ratings data to text
rating_data <- html_text(rating_data_html)
# Let's have a look at the ratings
head(rating_data)
## [1] "7.9" "8.6" "7.1" "8.4" "7.6" "5.8"
# Data-Preprocessing: converting ratings to numerical
rating_data <- as.numeric(rating_data)
# Let's have another look at the ratings data
rating_data
## [1] 7.9 8.6 7.1 8.4 7.6 5.8 8.4 7.5 7.3 8.3 7.8 6.5 7.2 7.8 6.8 6.9 7.9 6.5
## [19] 7.4 5.3 6.7 7.3 7.0 6.4 7.4 6.8 7.4 6.2 7.8 7.3 7.0 6.4 2.7 4.9 6.7 7.3
## [37] 8.1 6.8 5.7 5.5 7.1 6.6 6.9 7.4 5.8 6.8 6.5 6.2 5.7 6.1 6.7 7.2 6.6 5.6
## [55] 7.9 6.0 6.8 6.8 6.1 5.6 6.7 7.8 6.6 5.4 5.7 6.5 7.6 8.1 6.8 6.0 7.3 6.8
## [73] 6.7 6.8 6.3 6.4 6.2 7.6 7.7 5.2 6.3 6.3 6.6 6.6 6.4 7.6 6.1 5.6 6.7 5.9
## [91] 6.3 7.0 6.0 6.4 7.2 6.5 7.5 7.3 6.8 5.7
# Using CSS selectors to scrap the votes section
votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
# Converting the votes data to text
votes_data <- html_text(votes_data_html)
# Let's have a look at the votes data
head(votes_data)
## [1] "539,726" "672,715" "257,951" "1,081,344" "645,227" "34,097"
# Data-Preprocessing: removing commas
votes_data <- str_replace(votes_data, ",", "")
# Data-Preprocessing: converting votes to numerical
votes_data <- as.numeric(votes_data)
## Warning: NAs introduced by coercion
#Let's have another look at the votes data
votes_data
## [1] 539726 672715 257951 NA 645227 34097 950476 172075 167900 498772
## [11] 168213 232537 103174 284132 495734 129209 342098 405846 385311 46855
## [21] 216961 159413 297457 107447 308155 259098 46463 69059 353833 248405
## [31] 192166 195296 47014 65436 161629 41798 339026 138933 48600 70622
## [41] 74880 72569 246013 247750 48331 103768 215593 163833 103893 155416
## [51] 69180 107290 58584 49561 275254 167840 227563 51255 39500 123774
## [61] 26566 107229 225378 21233 171068 37844 57396 78349 28616 18245
## [71] 207300 43169 76013 153408 91463 89369 41762 74596 224392 84367
## [81] 25048 82522 30965 95116 20187 82993 78224 1656 86712 68016
## [91] 53835 44228 113914 65270 55514 20041 75827 70250 46100 57652
# Using CSS selectors to scrap the directors section
(directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)'))
## {xml_nodeset (100)}
## [1] <a href="/name/nm0426059/?ref_=adv_li_dr_0">Rian Johnson</a>
## [2] <a href="/name/nm0094435/?ref_=adv_li_dr_0">Bong Joon Ho</a>
## [3] <a href="/name/nm4170048/?ref_=adv_li_dr_0">Ari Aster</a>
## [4] <a href="/name/nm0680846/?ref_=adv_li_dr_0">Todd Phillips</a>
## [5] <a href="/name/nm0000233/?ref_=adv_li_dr_0">Quentin Tarantino</a>
## [6] <a href="/name/nm0862911/?ref_=adv_li_dr_0">Greg Tiernan</a>
## [7] <a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>
## [8] <a href="/name/nm3211470/?ref_=adv_li_dr_0">Robert Eggers</a>
## [9] <a href="/name/nm1093039/?ref_=adv_li_dr_0">Mike Flanagan</a>
## [10] <a href="/name/nm0005222/?ref_=adv_li_dr_0">Sam Mendes</a>
## [11] <a href="/name/nm1950086/?ref_=adv_li_dr_0">Greta Gerwig</a>
## [12] <a href="/name/nm0615592/?ref_=adv_li_dr_0">Andy Muschietti</a>
## [13] <a href="/name/nm2391575/?ref_=adv_li_dr_0">David Michôd</a>
## [14] <a href="/name/nm0005363/?ref_=adv_li_dr_0">Guy Ritchie</a>
## [15] <a href="/name/nm1349818/?ref_=adv_li_dr_0">Anna Boden</a>
## [16] <a href="/name/nm2366012/?ref_=adv_li_dr_0">Matt Bettinelli-Olpin</a>
## [17] <a href="/name/nm0169806/?ref_=adv_li_dr_0">Taika Waititi</a>
## [18] <a href="/name/nm0009190/?ref_=adv_li_dr_0">J.J. Abrams</a>
## [19] <a href="/name/nm1218281/?ref_=adv_li_dr_0">Jon Watts</a>
## [20] <a href="/name/nm1788310/?ref_=adv_li_dr_0">Jenny Gage</a>
## ...
# Converting the directors data to text
directors_data <- html_text(directors_data_html)
# Let's have a look at the directors data
directors_data
## [1] "Rian Johnson" "Bong Joon Ho" "Ari Aster"
## [4] "Todd Phillips" "Quentin Tarantino" "Greg Tiernan"
## [7] "Anthony Russo" "Robert Eggers" "Mike Flanagan"
## [10] "Sam Mendes" "Greta Gerwig" "Andy Muschietti"
## [13] "David Michôd" "Guy Ritchie" "Anna Boden"
## [16] "Matt Bettinelli-Olpin" "Taika Waititi" "J.J. Abrams"
## [19] "Jon Watts" "Jenny Gage" "Jake Kasdan"
## [22] "Dexter Fletcher" "David F. Sandberg" "Adam Robitel"
## [25] "Chad Stahelski" "Jordan Peele" "Michael Engler"
## [28] "André Øvredal" "Martin Scorsese" "Robert Rodriguez"
## [31] "Galder Gaztelu-Urrutia" "David Leitch" "Tom Hooper"
## [34] "Elizabeth Banks" "Ruben Fleischer" "Gavin Hood"
## [37] "James Mangold" "Danny Boyle" "Lars Klevberg"
## [40] "Jim Jarmusch" "Stephen Merchant" "Luc Besson"
## [43] "Guy Ritchie" "Benny Safdie" "Lorcan Finnegan"
## [46] "Jay Roach" "James Gray" "Tim Miller"
## [49] "Ang Lee" "Michael Bay" "Gene Stupnitsky"
## [52] "Olivia Wilde" "Brian Kirk" "Tate Taylor"
## [55] "Noah Baumbach" "Michael Dougherty" "Jon Favreau"
## [58] "Edward Norton" "Severin Fiala" "F. Gary Gray"
## [61] "Rose Glass" "Darius Marder" "M. Night Shyamalan"
## [64] "Alice Waddington" "Simon Kinberg" "Woody Allen"
## [67] "Destin Daniel Cretton" "Céline Sciamma" "Rod Lurie"
## [70] "Christian Volckman" "Vince Gilligan" "Adam Randall"
## [73] "Roland Emmerich" "Chris Buck" "Lorene Scafaria"
## [76] "Ric Roman Waugh" "Richard Stanley" "Todd Haynes"
## [79] "Josh Cooley" "Neil Marshall" "Scott Beck"
## [82] "Jonas Åkerlund" "Andrea Di Stefano" "Joachim Rønning"
## [85] "John Crowley" "Tyler Nilson" "Alexandre Aja"
## [88] "Karl Holt" "Joe Berlinger" "Gary Dauberman"
## [91] "Jason Howden" "Jeff Tremaine" "Kyle Newacheck"
## [94] "Brad Anderson" "Justin Baldoni" "Carlo Mirabella-Davis"
## [97] "Clint Eastwood" "Marielle Heller" "Rupert Goold"
## [100] "Dan Gilroy"
# Using CSS selectors to scrap the actors section
(actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a'))
## {xml_nodeset (100)}
## [1] <a href="/name/nm0185819/?ref_=adv_li_st_0">Daniel Craig</a>
## [2] <a href="/name/nm0814280/?ref_=adv_li_st_0">Kang-ho Song</a>
## [3] <a href="/name/nm6073955/?ref_=adv_li_st_0">Florence Pugh</a>
## [4] <a href="/name/nm0001618/?ref_=adv_li_st_0">Joaquin Phoenix</a>
## [5] <a href="/name/nm0000138/?ref_=adv_li_st_0">Leonardo DiCaprio</a>
## [6] <a href="/name/nm1209966/?ref_=adv_li_st_0">Oscar Isaac</a>
## [7] <a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>
## [8] <a href="/name/nm1500155/?ref_=adv_li_st_0">Robert Pattinson</a>
## [9] <a href="/name/nm0000191/?ref_=adv_li_st_0">Ewan McGregor</a>
## [10] <a href="/name/nm2835616/?ref_=adv_li_st_0">Dean-Charles Chapman</a>
## [11] <a href="/name/nm1519680/?ref_=adv_li_st_0">Saoirse Ronan</a>
## [12] <a href="/name/nm1567113/?ref_=adv_li_st_0">Jessica Chastain</a>
## [13] <a href="/name/nm6077951/?ref_=adv_li_st_0">Tom Glynn-Carney</a>
## [14] <a href="/name/nm0000190/?ref_=adv_li_st_0">Matthew McConaughey</a>
## [15] <a href="/name/nm0488953/?ref_=adv_li_st_0">Brie Larson</a>
## [16] <a href="/name/nm3034977/?ref_=adv_li_st_0">Samara Weaving</a>
## [17] <a href="/name/nm9877392/?ref_=adv_li_st_0">Roman Griffin Davis</a>
## [18] <a href="/name/nm5397459/?ref_=adv_li_st_0">Daisy Ridley</a>
## [19] <a href="/name/nm4043618/?ref_=adv_li_st_0">Tom Holland</a>
## [20] <a href="/name/nm6466214/?ref_=adv_li_st_0">Josephine Langford</a>
## ...
# Converting the gross actors data to text
actors_data <- html_text(actors_data_html)
# Let's have a look at the actors data
head(actors_data)
## [1] "Daniel Craig" "Kang-ho Song" "Florence Pugh"
## [4] "Joaquin Phoenix" "Leonardo DiCaprio" "Oscar Isaac"
Be careful with missing data.
# Using CSS selectors to scrap the metascore section
metascore_data_html <- html_nodes(webpage, '.metascore')
# Converting the runtime data to text
metascore_data <- html_text(metascore_data_html)
# Let's have a look at the metascore
head(metascore_data)
## [1] "82 " "96 " "72 " "59 " "83 "
## [6] "46 "
# Data-Preprocessing: removing extra space in metascore
metascore_data <- str_replace(metascore_data, "\\s*$", "")
metascore_data <- as.numeric(metascore_data)
metascore_data
## [1] 82 96 72 59 83 46 78 83 59 78 91 58 62 51 64 64 58 53 69 30 58 69 71 48 73
## [26] 81 64 61 94 53 73 60 32 52 55 63 81 55 48 53 68 40 53 91 64 64 80 54 38 41
## [51] 84 51 53 94 48 55 60 64 38 83 82 43 49 43 38 68 95 71 72 65 47 64 79 45 70
## [76] 73 84 31 69 19 61 43 40 70 60 52 53 42 39 38 36 53 65 68 80 66 61
# Lets check the length of metascore data
length(metascore_data)
## [1] 97
# Visual inspection finds 24, 85, 100 don't have metascore
ms <- rep(NA, 100)
ms[-c(24, 85, 100)] <- metascore_data
(metascore_data <- ms)
## [1] 82 96 72 59 83 46 78 83 59 78 91 58 62 51 64 64 58 53 69 30 58 69 71 NA 48
## [26] 73 81 64 61 94 53 73 60 32 52 55 63 81 55 48 53 68 40 53 91 64 64 80 54 38
## [51] 41 84 51 53 94 48 55 60 64 38 83 82 43 49 43 38 68 95 71 72 65 47 64 79 45
## [76] 70 73 84 31 69 19 61 43 40 NA 70 60 52 53 42 39 38 36 53 65 68 80 66 61 NA
Be careful with missing data.
# Using CSS selectors to scrap the gross revenue section
gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
# Converting the gross revenue data to text
gross_data <- html_text(gross_data_html)
# Let's have a look at the gross data
head(gross_data)
## [1] "$165.36M" "$53.37M" "$27.33M" "$335.45M" "$142.50M" "$100.04M"
# Data-Preprocessing: removing '$' and 'M' signs
gross_data <- str_replace(gross_data, "M", "")
gross_data <- str_sub(gross_data, 2, 10)
#(gross_data <- str_extract(gross_data, "[:digit:]+.[:digit:]+"))
gross_data <- as.numeric(gross_data)
# Let's check the length of gross data
length(gross_data)
## [1] 62
# Visual inspection finds below movies don't have gross
#gs_data <- rep(NA, 100)
#gs_data[-c(1, 2, 3, 5, 61, 69, 71, 74, 78, 82, 84:87, 90)] <- gross_data
#(gross_data <- gs_data)
60 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.
(rank_and_gross <- webpage %>%
html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
html_text() %>%
str_replace("\\s+", "") %>%
str_replace_all("[$M]", ""))
## [1] "1." "165.36" "2." "53.37" "3." "27.33" "4." "335.45"
## [9] "5." "142.50" "6." "100.04" "7." "858.37" "8." "0.43"
## [17] "9." "10." "159.23" "11." "108.10" "12." "211.59" "13."
## [25] "14." "15." "426.83" "16." "26.74" "17." "0.35" "18."
## [33] "515.20" "19." "390.53" "20." "12.14" "21." "316.83" "22."
## [41] "96.37" "23." "140.37" "24." "57.01" "25." "171.02" "26."
## [49] "175.08" "27." "96.85" "28." "62.74" "29." "7.00" "30."
## [57] "85.71" "31." "32." "173.96" "33." "34." "35." "26.80"
## [65] "36." "0.40" "37." "117.62" "38." "73.29" "39." "29.21"
## [73] "40." "6.56" "41." "22.96" "42." "7.74" "43." "355.56"
## [81] "44." "45." "46." "47." "35.40" "48." "62.25" "49."
## [89] "20.55" "50." "51." "69.06" "52." "22.68" "53." "54."
## [97] "45.37" "55." "2.00" "56." "110.50" "57." "543.64" "58."
## [105] "59." "60." "80.00" "61." "62." "63." "111.05" "64."
## [113] "65." "65.85" "66." "67." "68." "3.76" "69." "70."
## [121] "71." "72." "73." "74." "477.37" "75." "80.55" "76."
## [129] "67.16" "77." "78." "79." "434.04" "80." "21.90" "81."
## [137] "82." "83." "84." "113.93" "85." "5.33" "86." "13.12"
## [145] "87." "39.01" "88." "89." "90." "74.15" "91." "92."
## [153] "93." "94." "95." "45.73" "96." "97." "98." "61.70"
## [161] "99." "100."
isrank <- str_detect(rank_and_gross, "\\.$")
ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:(length(rank_and_gross))]
ismissing[length(ismissing)+1] <- isrank[length(isrank)]
missingpos <- as.integer(rank_and_gross[ismissing])
gs_data <- rep(NA, 100)
gs_data[-missingpos] <- gross_data
(gross_data <- gs_data)
## [1] 165.36 53.37 27.33 335.45 142.50 100.04 858.37 0.43 NA 159.23
## [11] 108.10 211.59 NA NA 426.83 26.74 0.35 515.20 390.53 12.14
## [21] 316.83 96.37 140.37 57.01 171.02 175.08 96.85 62.74 7.00 85.71
## [31] NA 173.96 NA NA 26.80 0.40 117.62 73.29 29.21 6.56
## [41] 22.96 7.74 355.56 NA NA NA 35.40 62.25 20.55 NA
## [51] 69.06 22.68 NA 45.37 2.00 110.50 543.64 NA NA 80.00
## [61] NA NA 111.05 NA 65.85 NA NA 3.76 NA NA
## [71] NA NA NA 477.37 80.55 67.16 NA NA 434.04 21.90
## [81] NA NA NA 113.93 5.33 13.12 39.01 NA NA 74.15
## [91] NA NA NA NA 45.73 NA NA 61.70 NA NA
Following code programatically figures out missing entries for metascore.
# Use CSS selectors to scrap the rankings section
(rank_metascore_data_html <- html_nodes(webpage, '.unfavorable , .favorable , .mixed , .text-primary'))
## {xml_nodeset (197)}
## [1] <span class="lister-item-index unbold text-primary">1.</span>
## [2] <span class="metascore favorable">82 </span>
## [3] <span class="lister-item-index unbold text-primary">2.</span>
## [4] <span class="metascore favorable">96 </span>
## [5] <span class="lister-item-index unbold text-primary">3.</span>
## [6] <span class="metascore favorable">72 </span>
## [7] <span class="lister-item-index unbold text-primary">4.</span>
## [8] <span class="metascore mixed">59 </span>
## [9] <span class="lister-item-index unbold text-primary">5.</span>
## [10] <span class="metascore favorable">83 </span>
## [11] <span class="lister-item-index unbold text-primary">6.</span>
## [12] <span class="metascore mixed">46 </span>
## [13] <span class="lister-item-index unbold text-primary">7.</span>
## [14] <span class="metascore favorable">78 </span>
## [15] <span class="lister-item-index unbold text-primary">8.</span>
## [16] <span class="metascore favorable">83 </span>
## [17] <span class="lister-item-index unbold text-primary">9.</span>
## [18] <span class="metascore mixed">59 </span>
## [19] <span class="lister-item-index unbold text-primary">10.</span>
## [20] <span class="metascore favorable">78 </span>
## ...
# Convert the ranking data to text
(rank_metascore_data <- html_text(rank_metascore_data_html))
## [1] "1." "82 " "2." "96 " "3."
## [6] "72 " "4." "59 " "5." "83 "
## [11] "6." "46 " "7." "78 " "8."
## [16] "83 " "9." "59 " "10." "78 "
## [21] "11." "91 " "12." "58 " "13."
## [26] "62 " "14." "51 " "15." "64 "
## [31] "16." "64 " "17." "58 " "18."
## [36] "53 " "19." "69 " "20." "30 "
## [41] "21." "58 " "22." "69 " "23."
## [46] "71 " "24." "48 " "25." "73 "
## [51] "26." "81 " "27." "64 " "28."
## [56] "61 " "29." "94 " "30." "53 "
## [61] "31." "73 " "32." "60 " "33."
## [66] "32 " "34." "52 " "35." "55 "
## [71] "36." "63 " "37." "81 " "38."
## [76] "55 " "39." "48 " "40." "53 "
## [81] "41." "68 " "42." "40 " "43."
## [86] "53 " "44." "91 " "45." "64 "
## [91] "46." "64 " "47." "80 " "48."
## [96] "54 " "49." "38 " "50." "41 "
## [101] "51." "52." "84 " "53." "51 "
## [106] "54." "53 " "55." "94 " "56."
## [111] "48 " "57." "55 " "58." "60 "
## [116] "59." "64 " "60." "38 " "61."
## [121] "83 " "62." "82 " "63." "43 "
## [126] "64." "49 " "65." "43 " "66."
## [131] "38 " "67." "68 " "68." "95 "
## [136] "69." "71 " "70." "71." "72 "
## [141] "72." "65 " "73." "47 " "74."
## [146] "64 " "75." "79 " "76." "45 "
## [151] "77." "70 " "78." "73 " "79."
## [156] "84 " "80." "31 " "81." "69 "
## [161] "82." "19 " "83." "61 " "84."
## [166] "43 " "85." "40 " "86." "70 "
## [171] "87." "60 " "88." "89." "52 "
## [176] "90." "53 " "91." "42 " "92."
## [181] "39 " "93." "38 " "94." "36 "
## [186] "95." "53 " "96." "65 " "97."
## [191] "68 " "98." "80 " "99." "66 "
## [196] "100." "61 "
# Strip spaces
(rank_metascore_data <- str_replace(rank_metascore_data, "\\s+", ""))
## [1] "1." "82" "2." "96" "3." "72" "4." "59" "5." "83"
## [11] "6." "46" "7." "78" "8." "83" "9." "59" "10." "78"
## [21] "11." "91" "12." "58" "13." "62" "14." "51" "15." "64"
## [31] "16." "64" "17." "58" "18." "53" "19." "69" "20." "30"
## [41] "21." "58" "22." "69" "23." "71" "24." "48" "25." "73"
## [51] "26." "81" "27." "64" "28." "61" "29." "94" "30." "53"
## [61] "31." "73" "32." "60" "33." "32" "34." "52" "35." "55"
## [71] "36." "63" "37." "81" "38." "55" "39." "48" "40." "53"
## [81] "41." "68" "42." "40" "43." "53" "44." "91" "45." "64"
## [91] "46." "64" "47." "80" "48." "54" "49." "38" "50." "41"
## [101] "51." "52." "84" "53." "51" "54." "53" "55." "94" "56."
## [111] "48" "57." "55" "58." "60" "59." "64" "60." "38" "61."
## [121] "83" "62." "82" "63." "43" "64." "49" "65." "43" "66."
## [131] "38" "67." "68" "68." "95" "69." "71" "70." "71." "72"
## [141] "72." "65" "73." "47" "74." "64" "75." "79" "76." "45"
## [151] "77." "70" "78." "73" "79." "84" "80." "31" "81." "69"
## [161] "82." "19" "83." "61" "84." "43" "85." "40" "86." "70"
## [171] "87." "60" "88." "89." "52" "90." "53" "91." "42" "92."
## [181] "39" "93." "38" "94." "36" "95." "53" "96." "65" "97."
## [191] "68" "98." "80" "99." "66" "100." "61"
# a rank followed by another rank means the metascore for the 1st rank is missing
(isrank <- str_detect(rank_metascore_data, "\\.$"))
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [13] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [25] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [37] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [49] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [61] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [73] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [85] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [97] TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [109] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [121] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [133] FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## [145] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [157] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
## [169] TRUE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [181] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## [193] FALSE TRUE FALSE TRUE FALSE
ismissing <- isrank[1:length(rank_metascore_data)-1] &
isrank[2:length(rank_metascore_data)]
ismissing[length(ismissing)+1] <- isrank[length(isrank)]
(missingpos <- as.integer(rank_metascore_data[ismissing]))
## [1] 51 70 88
#(rank_metascore_data <- as.integer(rank_metascore_data))
You (students) should work out the code for finding missing positions for gross.
Form a tibble:
# Combining all the lists to form a data frame
movies <- tibble(Rank = rank_data,
Title = title_data,
Description = description_data,
Runtime = runtime_data,
Genre = genre_data,
Rating = rating_data,
Metascore = metascore_data,
Votes = votes_data,
Gross_Earning_in_Mil = gross_data,
Director = directors_data,
Actor = actors_data)
movies %>% print(width=Inf)
## # A tibble: 100 × 11
## Rank Title
## <int> <chr>
## 1 1 Knives Out
## 2 2 Parasite
## 3 3 Midsommar
## 4 4 Joker
## 5 5 Once Upon a Time... In Hollywood
## 6 6 The Addams Family
## 7 7 Avengers: Endgame
## 8 8 The Lighthouse
## 9 9 Doctor Sleep
## 10 10 1917
## Description
## <chr>
## 1 "\nA detective investigates the death of a patriarch of an eccentric, combat…
## 2 "\nGreed and class discrimination threaten the newly formed symbiotic relati…
## 3 "\nA couple travels to Scandinavia to visit a rural hometown's fabled Swedis…
## 4 "\nIn Gotham City, mentally troubled comedian Arthur Fleck is disregarded an…
## 5 "\nA faded television actor and his stunt double strive to achieve fame and …
## 6 "\nThe eccentrically macabre family moves to a bland suburb where Wednesday …
## 7 "\nAfter the devastating events of Avengers: Infinity War (2018), the univer…
## 8 "\nTwo lighthouse keepers try to maintain their sanity while living on a rem…
## 9 "\nYears following the events of The Shining (1980), a now-adult Dan Torranc…
## 10 "\nApril 6th, 1917. As a regiment assembles to wage war deep in enemy territ…
## Runtime Genre Rating Metascore Votes Gross_Earning_in_Mil
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 130 Comedy 7.9 82 539726 165.
## 2 132 Comedy 8.6 96 672715 53.4
## 3 148 Drama 7.1 72 257951 27.3
## 4 122 Crime 8.4 59 NA 335.
## 5 161 Comedy 7.6 83 645227 142.
## 6 86 Animation 5.8 46 34097 100.
## 7 181 Action 8.4 78 950476 858.
## 8 109 Drama 7.5 83 172075 0.43
## 9 152 Drama 7.3 59 167900 NA
## 10 119 Action 8.3 78 498772 159.
## Director Actor
## <chr> <chr>
## 1 Rian Johnson Daniel Craig
## 2 Bong Joon Ho Kang-ho Song
## 3 Ari Aster Florence Pugh
## 4 Todd Phillips Joaquin Phoenix
## 5 Quentin Tarantino Leonardo DiCaprio
## 6 Greg Tiernan Oscar Isaac
## 7 Anthony Russo Robert Downey Jr.
## 8 Robert Eggers Robert Pattinson
## 9 Mike Flanagan Ewan McGregor
## 10 Sam Mendes Dean-Charles Chapman
## # … with 90 more rows
How many top 100 movies are in each genre? (Be careful with interpretation.)
movies %>%
ggplot() +
geom_bar(mapping = aes(x = Genre))
Which genre is most profitable in terms of average gross earnings?
movies %>%
group_by(Genre) %>%
summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm=TRUE)) %>%
ggplot() +
geom_col(mapping = aes(x = Genre, y = avg_earning)) +
labs(y = "avg earning in millions")
## Warning: Removed 2 rows containing missing values (position_stack).
ggplot(data = movies) +
geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) +
labs(y = "Gross earning in millions")
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
Is there a relationship between gross earning and rating? Find the best selling movie (by gross earning) in each genre
library("ggrepel")
(best_in_genre <- movies %>%
group_by(Genre) %>%
filter(row_number(desc(Gross_Earning_in_Mil)) == 1))
## # A tibble: 8 × 11
## # Groups: Genre [8]
## Rank Title Description Runtime Genre Rating Metascore Votes Gross_Earning_i…
## <int> <chr> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Kniv… "\nA detec… 130 Come… 7.9 82 539726 165.
## 2 4 Joker "\nIn Goth… 122 Crime 8.4 59 NA 335.
## 3 7 Aven… "\nAfter t… 181 Acti… 8.4 78 950476 858.
## 4 12 It C… "\nTwenty-… 169 Drama 6.5 58 232537 212.
## 5 22 Rock… "\nA music… 121 Biog… 7.3 69 159413 96.4
## 6 26 Us "\nA famil… 116 Horr… 6.8 73 259098 175.
## 7 43 Alad… "\nA kind-… 128 Adve… 6.9 40 246013 356.
## 8 57 The … "\nAfter t… 118 Anim… 6.8 55 227563 544.
## # … with 2 more variables: Director <chr>, Actor <chr>
ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
geom_point(mapping = aes(size = Votes, color = Genre)) +
ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
labs(y = "Gross earning in millions")
## Warning: Removed 39 rows containing missing values (geom_point).
Many websites dynamically pull data from databases using JavasScript and JQuery that make them difficult to scrape.
The FCC’s dtvmaps webpage has a simple form in which you enter a zip code and it gives you the available local TV stations in that zip code and their signal strength.
You’ll also notice the URL stays fixed with different zip codes.
RSelenium loads the page that we want to scrape and download the HTML from that page.
particularly useful when scraping something behind a login
simulate human behavior on a website (e.g., mouse clicking)
rvest provides typical scraping tools
rm(list = ls()) # clean-up workspace
library("RSelenium")
library("tidyverse")
library("rvest")
rD <- rsDriver(browser="firefox", port=sample(1:7360L, 1), verbose=F)
remDr <- rD[["client"]]
Open a webpage
remDr$navigate("https://www.fcc.gov/media/engineering/dtvmaps")
We want to send a string of text (zip code) into the form.
zip <- "70118"
# remDr$findElement(using = "id", value = "startpoint")$clearElement()
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))
# other possible ("xpath", "css selector", "id", "name", "tag name", "class name", "link text", "partial link text")
Click on the button Go!
remDr$findElements("id", "btnSub")[[1]]$clickElement()
save HTML to an object
use rvest for the rest
Sys.sleep(5) # give the page time to fully load, in seconds
html <- remDr$getPageSource()[[1]]
# important to close the client
remDr$close()
signals <- read_html(html) %>%
html_nodes("table.tbl_mapReception") %>% # extract table nodes with class = "tbl_mapReception"
.[3] %>% # keep the third of these tables
.[[1]] %>% # keep the first element of this list
html_table(fill=T) # have rvest turn it into a dataframe
signals
## # A tibble: 39 × 6
## Callsign Callsign Network `Ch#` Band IA
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "Click on callsign for detail" "Click on… "Click o… "Click… "Clic… <NA>
## 2 "" "WWL-TV" "CBS" "4" "UHF" "RThis st…
## 3 "" "" "" "" "" ""
## 4 "" "WUPL" "MYNE" "54" "UHF" "RThis st…
## 5 "" "" "" "" "" ""
## 6 "" "WVUE-DT" "FOX" "8" "UHF" ""
## 7 "" "" "" "" "" ""
## 8 "" "WPXL-TV" "ION" "49" "UHF" "RThis st…
## 9 "" "" "" "" "" ""
## 10 "" "WHNO" "IND" "20" "UHF" ""
## # … with 29 more rows
More formatting on signals
names(signals) <- c("rm", "callsign", "network", "ch_num", "band", "rm2") # rename columns
signals <- signals %>%
slice(2:n()) %>% # drop unnecessary first row
filter(callsign != "") %>% # drop blank rows
select(callsign:band) # drop unnecessary columns
signals
## # A tibble: 19 × 4
## callsign network ch_num band
## <chr> <chr> <chr> <chr>
## 1 WWL-TV "CBS" "4" UHF
## 2 WUPL "MYNE" "54" UHF
## 3 WVUE-DT "FOX" "8" UHF
## 4 WPXL-TV "ION" "49" UHF
## 5 WHNO "IND" "20" UHF
## 6 WGNO "ABC" "26" UHF
## 7 WDSU "NBC" "6" UHF
## 8 WNOL-TV "CW" "38" UHF
## 9 WYES-TV "PBS" "12" Hi-V
## 10 WTNO-LP "" "" UHF
## 11 WLAE-TV "PBS" "32" UHF
## 12 KNOV-CD "" "" UHF
## 13 WBXN-CD "" "" UHF
## 14 KGLA-DT "IND" "42" UHF
## 15 WBRZ-TV "ABC" "2" Hi-V
## 16 WVLA-TV "NBC" "33" UHF
## 17 WLPB-TV "PBS" "27" UHF
## 18 WGMB-TV "FOX" "44" UHF
## 19 WAFB "CBS" "9" Hi-V
Capture all text by clicking on each Callsign
read_html(html) %>%
html_nodes(".callsign") %>%
html_attr("onclick")
## [1] "getdetail(11158,74192,'WWL-TV Facility ID: 74192 <br>WWL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=74192 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/74192 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 27<br>RX Strength: 115 dbuV/m<br>Tower Distance: 5 mi; Direction: 116°<br>Repacked Channel: 27<br>Repacking Dates: 10/19/2019 to 1/17/2020','WWL-TV<br>Distance to Tower: 5 miles<br>Direction to Tower: 116 deg',29.90636111111111,-90.03947222222222,'WWL-TV')"
## [2] "getdetail(11137,13938,'WUPL Facility ID: 13938 <br>WUPL (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=13938 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/13938 target=_new>Public File</a>)<br>City of License: SLIDELL, LA<br>RF Channel: 17<br>RX Strength: 114 dbuV/m<br>Tower Distance: 5 mi; Direction: 116°<br>Repacked Channel: 17<br>Repacking Dates: 10/19/2019 to 1/17/2020','WUPL<br>Distance to Tower: 5 miles<br>Direction to Tower: 116 deg',29.90636111111111,-90.03947222222222,'WUPL')"
## [3] "getdetail(10815,4149,'WVUE-DT Facility ID: 4149 <br>WVUE-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=4149 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/4149 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 29<br>RX Strength: 112 dbuV/m<br>Tower Distance: 10 mi; Direction: 84°','WVUE-DT<br>Distance to Tower: 10 miles<br>Direction to Tower: 84 deg',29.954138888888888,-89.94952777777779,'WVUE-DT')"
## [4] "getdetail(11203,21729,'WPXL-TV Facility ID: 21729 <br>WPXL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=21729 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/21729 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 33<br>RX Strength: 111 dbuV/m<br>Tower Distance: 11 mi; Direction: 74°<br>Repacked Channel: 33<br>Repacking Dates: 10/19/2019 to 1/17/2020','WPXL-TV<br>Distance to Tower: 11 miles<br>Direction to Tower: 74 deg',29.982777777777777,-89.94944444444445,'WPXL-TV')"
## [5] "getdetail(12228,37106,'WHNO Facility ID: 37106 <br>WHNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=37106 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/37106 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 21<br>RX Strength: 111 dbuV/m<br>Tower Distance: 6 mi; Direction: 103°','WHNO<br>Distance to Tower: 6 miles<br>Direction to Tower: 103 deg',29.920305555555558,-90.02458333333334,'WHNO')"
## [6] "getdetail(11737,72119,'WGNO Facility ID: 72119 <br>WGNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=72119 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/72119 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 26<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 86°','WGNO<br>Distance to Tower: 10 miles<br>Direction to Tower: 86 deg',29.95,-89.95777777777778,'WGNO')"
## [7] "getdetail(11226,71357,'WDSU Facility ID: 71357 <br>WDSU (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=71357 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/71357 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 19<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 86°<br>Repacked Channel: 19<br>Repacking Dates: 10/19/2019 to 1/17/2020','WDSU<br>Distance to Tower: 10 miles<br>Direction to Tower: 86 deg',29.95,-89.95777777777778,'WDSU')"
## [8] "getdetail(11738,54280,'WNOL-TV Facility ID: 54280 <br>WNOL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=54280 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/54280 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 15<br>RX Strength: 110 dbuV/m<br>Tower Distance: 10 mi; Direction: 86°','WNOL-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 86 deg',29.95,-89.95777777777778,'WNOL-TV')"
## [9] "getdetail(11911,25090,'WYES-TV Facility ID: 25090 <br>WYES-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=25090 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/25090 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 11<br>RX Strength: 102 dbuV/m<br>Tower Distance: 10 mi; Direction: 85°','WYES-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 85 deg',29.953888888888887,-89.94944444444445,'WYES-TV')"
## [10] "getdetail(12360,24981,'WTNO-LP Facility ID: 24981 <br>WTNO-LP (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=24981 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/24981 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 22<br>RX Strength: 106 dbuV/m<br>Tower Distance: 3 mi; Direction: 330°','WTNO-LP<br>Distance to Tower: 3 miles<br>Direction to Tower: 330 deg',29.97461111111111,-90.14347222222223,'WTNO-LP')"
## [11] "getdetail(11281,18819,'WLAE-TV Facility ID: 18819 <br>WLAE-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=18819 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/18819 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 23<br>RX Strength: 104 dbuV/m<br>Tower Distance: 10 mi; Direction: 74°<br>Repacked Channel: 23<br>Repacking Dates: 10/19/2019 to 1/17/2020','WLAE-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 74 deg',29.982777777777777,-89.9525,'WLAE-TV')"
## [12] "getdetail(12467,64048,'KNOV-CD Facility ID: 64048 <br>KNOV-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=64048 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/64048 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 31<br>RX Strength: 101 dbuV/m<br>Tower Distance: 3 mi; Direction: 74°<br>Repacked Channel: 31<br>Repacking Dates: 10/19/2019 to 1/17/2020','KNOV-CD<br>Distance to Tower: 3 miles<br>Direction to Tower: 74 deg',29.95213888888889,-90.07027777777778,'KNOV-CD')"
## [13] "getdetail(12443,70419,'WBXN-CD Facility ID: 70419 <br>WBXN-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70419 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70419 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 36<br>RX Strength: 98 dbuV/m<br>Tower Distance: 5 mi; Direction: 116°<br>Repacked Channel: 36<br>Repacking Dates: 10/19/2019 to 1/17/2020','WBXN-CD<br>Distance to Tower: 5 miles<br>Direction to Tower: 116 deg',29.90636111111111,-90.03947222222222,'WBXN-CD')"
## [14] "getdetail(10726,83945,'KGLA-DT Facility ID: 83945 <br>KGLA-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=83945 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/83945 target=_new>Public File</a>)<br>City of License: HAMMOND, LA<br>RF Channel: 35<br>RX Strength: 93 dbuV/m<br>Tower Distance: 11 mi; Direction: 76°<br>Repacked Channel: 35<br>Repacking Dates: 3/14/2020 to 5/1/2020','KGLA-DT<br>Distance to Tower: 11 miles<br>Direction to Tower: 76 deg',29.97833333333333,-89.94055555555556,'KGLA-DT')"
## [15] "getdetail(11797,38616,'WBRZ-TV Facility ID: 38616 <br>WBRZ-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=38616 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/38616 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 13<br>RX Strength: 45 dbuV/m<br>Tower Distance: 69 mi; Direction: 291°','WBRZ-TV<br>Distance to Tower: 69 miles<br>Direction to Tower: 291 deg',30.296944444444446,-91.19361111111111,'WBRZ-TV')"
## [16] "getdetail(12198,70021,'WVLA-TV Facility ID: 70021 <br>WVLA-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70021 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70021 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 34<br>RX Strength: 46 dbuV/m<br>Tower Distance: 74 mi; Direction: 291°','WVLA-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 291 deg',30.32627777777778,-91.27669444444444,'WVLA-TV')"
## [17] "getdetail(10829,38586,'WLPB-TV Facility ID: 38586 <br>WLPB-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=38586 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/38586 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 25<br>RX Strength: 44 dbuV/m<br>Tower Distance: 71 mi; Direction: 295°','WLPB-TV<br>Distance to Tower: 71 miles<br>Direction to Tower: 295 deg',30.372972222222224,-91.20455555555556,'WLPB-TV')"
## [18] "getdetail(11101,12520,'WGMB-TV Facility ID: 12520 <br>WGMB-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=12520 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/12520 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 24<br>RX Strength: 43 dbuV/m<br>Tower Distance: 74 mi; Direction: 291°<br>Repacked Channel: 24<br>Repacking Dates: 1/18/2020 to 3/13/2020','WGMB-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 291 deg',30.32627777777778,-91.27669444444444,'WGMB-TV')"
## [19] "getdetail(11961,589,'WAFB Facility ID: 589 <br>WAFB (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=589 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/589 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 9<br>RX Strength: 37 dbuV/m<br>Tower Distance: 72 mi; Direction: 295°','WAFB<br>Distance to Tower: 72 miles<br>Direction to Tower: 295 deg',30.366388888888892,-91.21305555555556,'WAFB')"
Extract signal by string operations
strength <- read_html(html) %>%
html_nodes(".callsign") %>%
html_attr("onclick") %>%
str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")
# (?<=…) is a special regex expression for positive lookbehind
signals <- cbind(signals, strength)
signals
## callsign network ch_num band strength
## 1 WWL-TV CBS 4 UHF 115
## 2 WUPL MYNE 54 UHF 114
## 3 WVUE-DT FOX 8 UHF 112
## 4 WPXL-TV ION 49 UHF 111
## 5 WHNO IND 20 UHF 111
## 6 WGNO ABC 26 UHF 111
## 7 WDSU NBC 6 UHF 111
## 8 WNOL-TV CW 38 UHF 110
## 9 WYES-TV PBS 12 Hi-V 102
## 10 WTNO-LP UHF 106
## 11 WLAE-TV PBS 32 UHF 104
## 12 KNOV-CD UHF 101
## 13 WBXN-CD UHF 98
## 14 KGLA-DT IND 42 UHF 93
## 15 WBRZ-TV ABC 2 Hi-V 45
## 16 WVLA-TV NBC 33 UHF 46
## 17 WLPB-TV PBS 27 UHF 44
## 18 WGMB-TV FOX 44 UHF 43
## 19 WAFB CBS 9 Hi-V 37