Announcement

sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Big Sur 11.5.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.28     R6_2.5.1          jsonlite_1.7.2    magrittr_2.0.1   
##  [5] evaluate_0.14     rlang_0.4.11      stringi_1.7.3     jquerylib_0.1.4  
##  [9] bslib_0.2.5.1     rmarkdown_2.10    tools_4.1.1       stringr_1.4.0    
## [13] xfun_0.25         yaml_2.2.1        compiler_4.1.1    htmltools_0.5.1.1
## [17] knitr_1.33        sass_0.4.0

Acknowledgement

Dr. Hua Zhou’s slides

Josh McCrain’s RSelenium tutorial

HTML Introduction from GeeksforGeeks

Getting started with HTML from MDN Web Docs

Load tidyverse and other packages for this lecture:

library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library("rvest")
## 
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
## 
##     guess_encoding

HTML introduction

We first cover a survival-level introduction to the HTML format.

HTML stands for HyperText Markup Language.

Elements and Tags

  • HTML uses predefined tags and elements which tell the browser how to properly display the content.

  • Remember to include closing tags. If omitted, the browser applies the effect of the opening tag until the end of the page.

Attributes

Elements can also have attributes, written inside the opening tag as name="value" pairs.

  • Attributes contain extra information about the element that won’t appear in the content.

  • The class attribute, for example, is an identifying name used to target the element with style information.

An attribute should have:

  • A space between it and the element name. (For an element with more than one attribute, the attributes should be separated by spaces too.)

  • The attribute name, followed by an equal sign.

  • An attribute value, wrapped with opening and closing quote marks.
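The rules above can be checked from R. The following is a minimal sketch, assuming rvest 1.0 or later is installed; the paragraph markup, the class value editor-note, and the id value are hypothetical and serve only as an illustration.

```r
library(rvest)

# Hypothetical one-element page: the markup is illustrative, not from a real site.
page <- minimal_html('<p class="editor-note" id="cat">My cat is very grumpy</p>')
p_node <- html_element(page, "p")

note_class <- html_attr(p_node, "class")  # value of one attribute
all_attrs  <- html_attrs(p_node)          # named character vector of all attributes
note_text  <- html_text(p_node)           # element content; attributes are not part of it
```

Here note_class holds the attribute value, while note_text holds only the text between the opening and closing tags.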

Anchors

Another example of an element is <a>, which stands for anchor. An anchor turns the text it encloses into a hyperlink. Anchors can take a number of attributes; a few common ones are:

  • href: This attribute’s value specifies the web address for the link. For example: href="https://www.mozilla.org/".

  • title: The title attribute specifies extra information about the link, such as a description of the page that is being linked to. For example, title="The Mozilla homepage". This appears as a tooltip when a cursor hovers over the element.

  • target: The target attribute specifies the browsing context used to display the link. For example, target="_blank" will display the link in a new tab. If you want to display the linked content in the current tab, just omit this attribute.
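As a sketch of how these attributes are read when scraping, the snippet below builds one anchor from the attribute values quoted above and pulls each attribute out with rvest. The link text "Mozilla" and the attribute name "download" are assumptions added for illustration.

```r
library(rvest)

# Illustrative anchor built from the attribute values quoted above;
# the link text "Mozilla" is a placeholder.
page <- minimal_html(
  '<a href="https://www.mozilla.org/" title="The Mozilla homepage" target="_blank">Mozilla</a>'
)
link <- html_element(page, "a")

link_text    <- html_text(link)
link_href    <- html_attr(link, "href")
link_title   <- html_attr(link, "title")
link_target  <- html_attr(link, "target")
missing_attr <- html_attr(link, "download")  # attribute not present, so NA
```

html_attr() returns NA for an attribute the element does not have, which is convenient when only some links carry a given attribute.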

HTML page structure

  • The basic structure of an HTML page is laid out below.

  • It contains the essential building-block elements upon which all web pages are created.

    • doctype declaration

    • HTML

    • head

    • title

    • body elements
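A minimal sketch of such a page, parsed in R to confirm its structure. The title and paragraph text are placeholders, and the snippet assumes rvest 1.0 or later.

```r
library(rvest)

# Hypothetical minimal page containing the building blocks listed above.
html_source <- '<!DOCTYPE html>
<html>
  <head>
    <title>My test page</title>
  </head>
  <body>
    <p>This is my page</p>
  </body>
</html>'

page <- read_html(html_source)

page_title <- html_text(html_element(page, "head > title"))
paragraph  <- html_text(html_element(page, "body > p"))
# Names of the elements directly under <html>
top_level  <- html_name(html_children(html_element(page, "html")))
```

The doctype declaration is not an element, so only head and body appear as children of html.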

HTML comments

To write an HTML comment, wrap it in the special markers <!-- and -->. For example:

<p>I'm not inside a comment</p>

<!-- <p>I am!</p> -->

generates:

I’m not inside a comment
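Comments matter for scraping too: markup inside a comment is not parsed into elements, so CSS selectors do not see it. A small sketch, assuming rvest 1.0 or later:

```r
library(rvest)

# The two lines from the example above, parsed as a tiny page.
page <- minimal_html("<p>I'm not inside a comment</p>\n<!-- <p>I am!</p> -->")

paragraphs <- html_elements(page, "p")
n_paragraphs <- length(paragraphs)     # the commented-out <p> is not counted
visible_text <- html_text(paragraphs)
```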


Web scraping

There is a wealth of data on the internet. How do we scrape and analyze it?

rvest

rvest is an R package written by Hadley Wickham which makes web scraping easy.
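The basic rvest workflow has three steps: parse a page, select nodes with a CSS selector, and extract text or attributes. The sketch below runs offline on a hypothetical two-item page called demo_page, whose class names mimic the ones used in the example that follows. With a live site, the page object would instead come from read_html() applied to the page address. In rvest 1.0 and later, html_elements() is the current name for html_nodes(); both select all matching nodes.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page (film names and hrefs are made up).
demo_page <- minimal_html('
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">1.</span>
      <a href="/title/tt0000001/">First Film</a>
    </h3>
  </div>
  <div class="lister-item-content">
    <h3 class="lister-item-header">
      <span class="lister-item-index unbold text-primary">2.</span>
      <a href="/title/tt0000002/">Second Film</a>
    </h3>
  </div>
')

# Select nodes with a CSS selector, then extract and convert their text.
ranks  <- demo_page %>% html_elements(".text-primary") %>% html_text() %>% as.integer()
titles <- demo_page %>% html_elements(".lister-item-header a") %>% html_text()
```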

Example: Scraping from a webpage

Rank

  • Use SelectorGadget to highlight the element we want to scrape

  • Use the CSS selector to get the rankings

    # Use CSS selectors to scrape the rankings section
    (rank_data_html <- html_nodes(webpage, '.text-primary'))
    ## {xml_nodeset (100)}
    ##  [1] <span class="lister-item-index unbold text-primary">1.</span>
    ##  [2] <span class="lister-item-index unbold text-primary">2.</span>
    ##  [3] <span class="lister-item-index unbold text-primary">3.</span>
    ##  [4] <span class="lister-item-index unbold text-primary">4.</span>
    ##  [5] <span class="lister-item-index unbold text-primary">5.</span>
    ##  [6] <span class="lister-item-index unbold text-primary">6.</span>
    ##  [7] <span class="lister-item-index unbold text-primary">7.</span>
    ##  [8] <span class="lister-item-index unbold text-primary">8.</span>
    ##  [9] <span class="lister-item-index unbold text-primary">9.</span>
    ## [10] <span class="lister-item-index unbold text-primary">10.</span>
    ## [11] <span class="lister-item-index unbold text-primary">11.</span>
    ## [12] <span class="lister-item-index unbold text-primary">12.</span>
    ## [13] <span class="lister-item-index unbold text-primary">13.</span>
    ## [14] <span class="lister-item-index unbold text-primary">14.</span>
    ## [15] <span class="lister-item-index unbold text-primary">15.</span>
    ## [16] <span class="lister-item-index unbold text-primary">16.</span>
    ## [17] <span class="lister-item-index unbold text-primary">17.</span>
    ## [18] <span class="lister-item-index unbold text-primary">18.</span>
    ## [19] <span class="lister-item-index unbold text-primary">19.</span>
    ## [20] <span class="lister-item-index unbold text-primary">20.</span>
    ## ...
    # (rank_data_html <- html_nodes(webpage, '.lister-item-content .text-primary'))
    # Convert the ranking data to text
    (rank_data <- html_text(rank_data_html))
    ##   [1] "1."   "2."   "3."   "4."   "5."   "6."   "7."   "8."   "9."   "10." 
    ##  [11] "11."  "12."  "13."  "14."  "15."  "16."  "17."  "18."  "19."  "20." 
    ##  [21] "21."  "22."  "23."  "24."  "25."  "26."  "27."  "28."  "29."  "30." 
    ##  [31] "31."  "32."  "33."  "34."  "35."  "36."  "37."  "38."  "39."  "40." 
    ##  [41] "41."  "42."  "43."  "44."  "45."  "46."  "47."  "48."  "49."  "50." 
    ##  [51] "51."  "52."  "53."  "54."  "55."  "56."  "57."  "58."  "59."  "60." 
    ##  [61] "61."  "62."  "63."  "64."  "65."  "66."  "67."  "68."  "69."  "70." 
    ##  [71] "71."  "72."  "73."  "74."  "75."  "76."  "77."  "78."  "79."  "80." 
    ##  [81] "81."  "82."  "83."  "84."  "85."  "86."  "87."  "88."  "89."  "90." 
    ##  [91] "91."  "92."  "93."  "94."  "95."  "96."  "97."  "98."  "99."  "100."
    # Turn into numerical values
    (rank_data <- as.integer(rank_data))
    ##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
    ##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
    ##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
    ##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
    ##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
    ##  [91]  91  92  93  94  95  96  97  98  99 100

Title

  • Use SelectorGadget to find the CSS selector .lister-item-header a.

  • CSS selector reference

    # Using CSS selectors to scrape the title section
    (title_data_html <- html_nodes(webpage, '.lister-item-header a'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/title/tt8946378/?ref_=adv_li_tt">Knives Out</a>
    ##  [2] <a href="/title/tt6751668/?ref_=adv_li_tt">Parasite</a>
    ##  [3] <a href="/title/tt8772262/?ref_=adv_li_tt">Midsommar</a>
    ##  [4] <a href="/title/tt7286456/?ref_=adv_li_tt">Joker</a>
    ##  [5] <a href="/title/tt7131622/?ref_=adv_li_tt">Once Upon a Time... In Hollyw ...
    ##  [6] <a href="/title/tt1620981/?ref_=adv_li_tt">The Addams Family</a>
    ##  [7] <a href="/title/tt4154796/?ref_=adv_li_tt">Avengers: Endgame</a>
    ##  [8] <a href="/title/tt7984734/?ref_=adv_li_tt">The Lighthouse</a>
    ##  [9] <a href="/title/tt5606664/?ref_=adv_li_tt">Doctor Sleep</a>
    ## [10] <a href="/title/tt8579674/?ref_=adv_li_tt">1917</a>
    ## [11] <a href="/title/tt3281548/?ref_=adv_li_tt">Little Women</a>
    ## [12] <a href="/title/tt7349950/?ref_=adv_li_tt">It Chapter Two</a>
    ## [13] <a href="/title/tt7984766/?ref_=adv_li_tt">The King</a>
    ## [14] <a href="/title/tt8367814/?ref_=adv_li_tt">The Gentlemen</a>
    ## [15] <a href="/title/tt4154664/?ref_=adv_li_tt">Captain Marvel</a>
    ## [16] <a href="/title/tt7798634/?ref_=adv_li_tt">Ready or Not</a>
    ## [17] <a href="/title/tt2584384/?ref_=adv_li_tt">Jojo Rabbit</a>
    ## [18] <a href="/title/tt2527338/?ref_=adv_li_tt">Star Wars: The Rise Of Skywal ...
    ## [19] <a href="/title/tt6320628/?ref_=adv_li_tt">Spider-Man: Far from Home</a>
    ## [20] <a href="/title/tt4126476/?ref_=adv_li_tt">After</a>
    ## ...
    # Converting the title data to text
    (title_data <- html_text(title_data_html))
    ##   [1] "Knives Out"                                
    ##   [2] "Parasite"                                  
    ##   [3] "Midsommar"                                 
    ##   [4] "Joker"                                     
    ##   [5] "Once Upon a Time... In Hollywood"          
    ##   [6] "The Addams Family"                         
    ##   [7] "Avengers: Endgame"                         
    ##   [8] "The Lighthouse"                            
    ##   [9] "Doctor Sleep"                              
    ##  [10] "1917"                                      
    ##  [11] "Little Women"                              
    ##  [12] "It Chapter Two"                            
    ##  [13] "The King"                                  
    ##  [14] "The Gentlemen"                             
    ##  [15] "Captain Marvel"                            
    ##  [16] "Ready or Not"                              
    ##  [17] "Jojo Rabbit"                               
    ##  [18] "Star Wars: The Rise Of Skywalker"          
    ##  [19] "Spider-Man: Far from Home"                 
    ##  [20] "After"                                     
    ##  [21] "Jumanji: The Next Level"                   
    ##  [22] "Rocketman"                                 
    ##  [23] "Shazam!"                                   
    ##  [24] "Escape Room"                               
    ##  [25] "John Wick: Chapter 3 - Parabellum"         
    ##  [26] "Us"                                        
    ##  [27] "Downton Abbey"                             
    ##  [28] "Scary Stories to Tell in the Dark"         
    ##  [29] "The Irishman"                              
    ##  [30] "Alita: Battle Angel"                       
    ##  [31] "The Platform"                              
    ##  [32] "Fast & Furious Presents: Hobbs & Shaw"     
    ##  [33] "Cats"                                      
    ##  [34] "Charlie's Angels"                          
    ##  [35] "Zombieland: Double Tap"                    
    ##  [36] "Official Secrets"                          
    ##  [37] "Ford v Ferrari"                            
    ##  [38] "Yesterday"                                 
    ##  [39] "Child's Play"                              
    ##  [40] "The Dead Don't Die"                        
    ##  [41] "Fighting with My Family"                   
    ##  [42] "Anna"                                      
    ##  [43] "Aladdin"                                   
    ##  [44] "Uncut Gems"                                
    ##  [45] "Vivarium"                                  
    ##  [46] "Bombshell"                                 
    ##  [47] "Ad Astra"                                  
    ##  [48] "Terminator: Dark Fate"                     
    ##  [49] "Gemini Man"                                
    ##  [50] "6 Underground"                             
    ##  [51] "Good Boys"                                 
    ##  [52] "Booksmart"                                 
    ##  [53] "21 Bridges"                                
    ##  [54] "Ma"                                        
    ##  [55] "Marriage Story"                            
    ##  [56] "Godzilla: King of the Monsters"            
    ##  [57] "The Lion King"                             
    ##  [58] "Motherless Brooklyn"                       
    ##  [59] "The Lodge"                                 
    ##  [60] "Men in Black: International"               
    ##  [61] "Saint Maud"                                
    ##  [62] "Sound of Metal"                            
    ##  [63] "Glass"                                     
    ##  [64] "Paradise Hills"                            
    ##  [65] "X-Men: Dark Phoenix"                       
    ##  [66] "A Rainy Day in New York"                   
    ##  [67] "Just Mercy"                                
    ##  [68] "Portrait of a Lady on Fire"                
    ##  [69] "The Outpost"                               
    ##  [70] "The Room"                                  
    ##  [71] "El Camino: A Breaking Bad Movie"           
    ##  [72] "I See You"                                 
    ##  [73] "Midway"                                    
    ##  [74] "Frozen II"                                 
    ##  [75] "Hustlers"                                  
    ##  [76] "Angel Has Fallen"                          
    ##  [77] "Color Out of Space"                        
    ##  [78] "Dark Waters"                               
    ##  [79] "Toy Story 4"                               
    ##  [80] "Hellboy"                                   
    ##  [81] "Haunt"                                     
    ##  [82] "Polar"                                     
    ##  [83] "The Informer"                              
    ##  [84] "Maleficent: Mistress of Evil"              
    ##  [85] "The Goldfinch"                             
    ##  [86] "The Peanut Butter Falcon"                  
    ##  [87] "Crawl"                                     
    ##  [88] "Benny Loves You"                           
    ##  [89] "Extremely Wicked, Shockingly Evil and Vile"
    ##  [90] "Annabelle Comes Home"                      
    ##  [91] "Guns Akimbo"                               
    ##  [92] "The Dirt"                                  
    ##  [93] "Murder Mystery"                            
    ##  [94] "Fractured"                                 
    ##  [95] "Five Feet Apart"                           
    ##  [96] "Swallow"                                   
    ##  [97] "Richard Jewell"                            
    ##  [98] "A Beautiful Day in the Neighborhood"       
    ##  [99] "Judy"                                      
    ## [100] "Velvet Buzzsaw"
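The title nodes carry more than text: each href attribute holds the path of the movie page. The sketch below rebuilds two of the links from the node set printed above (the enclosing h3 element is an assumption) and extracts the text, the path, and the identifier embedded in the path.

```r
library(rvest)
library(stringr)

# Two title links copied from the printed node set; the <h3> wrapper is assumed.
title_nodes <- minimal_html('
  <h3 class="lister-item-header"><a href="/title/tt8946378/?ref_=adv_li_tt">Knives Out</a></h3>
  <h3 class="lister-item-header"><a href="/title/tt6751668/?ref_=adv_li_tt">Parasite</a></h3>
') %>% html_elements(".lister-item-header a")

titles <- data.frame(
  title = html_text(title_nodes),
  path  = html_attr(title_nodes, "href"),
  stringsAsFactors = FALSE
)

# The identifier is the "tt" followed by digits inside the path
titles$id <- str_extract(titles$path, "tt[0-9]+")
```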

Description

  • # Using CSS selectors to scrape the description section
    (description_data_html <- html_nodes(webpage, '.ratings-bar+ .text-muted'))
    ## {xml_nodeset (100)}
    ##  [1] <p class="text-muted">\nA detective investigates the death of a patriarc ...
    ##  [2] <p class="text-muted">\nGreed and class discrimination threaten the newl ...
    ##  [3] <p class="text-muted">\nA couple travels to Scandinavia to visit a rural ...
    ##  [4] <p class="text-muted">\nIn Gotham City, mentally troubled comedian Arthu ...
    ##  [5] <p class="text-muted">\nA faded television actor and his stunt double st ...
    ##  [6] <p class="text-muted">\nThe eccentrically macabre family moves to a blan ...
    ##  [7] <p class="text-muted">\nAfter the devastating events of <a href="/title/ ...
    ##  [8] <p class="text-muted">\nTwo lighthouse keepers try to maintain their san ...
    ##  [9] <p class="text-muted">\nYears following the events of <a href="/title/tt ...
    ## [10] <p class="text-muted">\nApril 6th, 1917. As a regiment assembles to wage ...
    ## [11] <p class="text-muted">\nJo March reflects back and forth on her life, te ...
    ## [12] <p class="text-muted">\nTwenty-seven years after their first encounter w ...
    ## [13] <p class="text-muted">\nHal, wayward prince and heir to the English thro ...
    ## [14] <p class="text-muted">\nAn American expat tries to sell off his highly p ...
    ## [15] <p class="text-muted">\nCarol Danvers becomes one of the universe's most ...
    ## [16] <p class="text-muted">\nA bride's wedding night takes a sinister turn wh ...
    ## [17] <p class="text-muted">\nA young German boy in the Hitler Youth whose her ...
    ## [18] <p class="text-muted">\nIn the riveting conclusion of the landmark Skywa ...
    ## [19] <p class="text-muted">\nFollowing the events of <a href="/title/tt415479 ...
    ## [20] <p class="text-muted">\nA young woman falls for a guy with a dark secret ...
    ## ...
    # Converting the description data to text
    description_data <- html_text(description_data_html)
    # take a look at first few
    head(description_data)
    ## [1] "\nA detective investigates the death of a patriarch of an eccentric, combative family."                                                                                                                                                   
    ## [2] "\nGreed and class discrimination threaten the newly formed symbiotic relationship between the wealthy Park family and the destitute Kim clan."                                                                                            
    ## [3] "\nA couple travels to Scandinavia to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."  
    ## [4] "\nIn Gotham City, mentally troubled comedian Arthur Fleck is disregarded and mistreated by society. He then embarks on a downward spiral of revolution and bloody crime. This path brings him face-to-face with his alter-ego: the Joker."
    ## [5] "\nA faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."                                                                                     
    ## [6] "\nThe eccentrically macabre family moves to a bland suburb where Wednesday Addams' friendship with the daughter of a hostile and conformist local reality show host exacerbates conflict between the families."
    # strip the leading '\n' and any whitespace that follows it
    description_data <- str_replace(description_data, "^\\n\\s*", "")
    head(description_data)
    ## [1] "A detective investigates the death of a patriarch of an eccentric, combative family."
    ## [2] "Greed and class discrimination threaten the newly formed symbiotic relationship between the wealthy Park family and the destitute Kim clan."
    ## [3] "A couple travels to Scandinavia to visit a rural hometown's fabled Swedish mid-summer festival. What begins as an idyllic retreat quickly devolves into an increasingly violent and bizarre competition at the hands of a pagan cult."
    ## [4] "In Gotham City, mentally troubled comedian Arthur Fleck is disregarded and mistreated by society. He then embarks on a downward spiral of revolution and bloody crime. This path brings him face-to-face with his alter-ego: the Joker."
    ## [5] "A faded television actor and his stunt double strive to achieve fame and success in the final years of Hollywood's Golden Age in 1969 Los Angeles."
    ## [6] "The eccentrically macabre family moves to a bland suburb where Wednesday Addams' friendship with the daughter of a hostile and conformist local reality show host exacerbates conflict between the families."
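When stripping a leading newline, the quantifier matters: a pattern such as "^\\n\\s+" matches only if at least one more whitespace character follows the newline, so strings that start with "\n" followed directly by a letter are left untouched. The sketch below compares three options on shortened sample strings shaped like the scraped text.

```r
library(stringr)

# Sample strings: a leading newline, optionally followed by trailing padding.
raw <- c(
  "\nA detective investigates the death of a patriarch.",
  "\nComedy, Crime, Drama            "
)

strict  <- str_replace(raw, "^\\n\\s+", "")  # requires whitespace after "\n": no match here
lenient <- str_replace(raw, "^\\n\\s*", "")  # "\n" plus zero or more whitespace characters
trimmed <- str_trim(raw)                     # strips whitespace from both ends
```

str_trim() is usually the simplest choice because it also removes trailing padding.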

Runtime

# Using CSS selectors to scrape the movie runtime section (piped version)
(runtime_data <- webpage %>%
  html_nodes('.runtime') %>%
  html_text() %>%
  str_replace(" min", "") %>%
  as.integer())
##   [1] 130 132 148 122 161  86 181 109 152 119 135 169 140 113 123  95 108 141
##  [19] 129 105 123 121 132  99 130 116 122 108 209 122  94 137 110 118  99 112
##  [37] 152 116  90 104 108 118 128 135  97 109 123 128 117 128  90 102  99  99
##  [55] 137 132 118 144 108 114  84 120 129  95 113  92 137 122 123 100 122  98
##  [73] 138 103 110 121 111 126 100 120  92 118 113 119 149  97  87  94 110 106
##  [91]  98 107  97  99 116  94 131 109 118 113
# The same steps one at a time: select the runtime nodes with a CSS selector
runtime_data_html <- html_nodes(webpage, '.runtime')
# Converting the runtime data to text
runtime_data <- html_text(runtime_data_html)
# Let's have a look at the runtime
head(runtime_data)
## [1] "130 min" "132 min" "148 min" "122 min" "161 min" "86 min"
# Data-Preprocessing: removing mins and converting it to numerical
runtime_data <- str_replace(runtime_data, " min", "")
runtime_data <- as.numeric(runtime_data)
# Let's have another look at the runtime data
head(runtime_data)
## [1] 130 132 148 122 161  86

Genre

  • Collect the (first) genre of each movie:

    # Using CSS selectors to scrape the movie genre section
    genre_data_html <- html_nodes(webpage, '.genre')
    # Converting the genre data to text
    genre_data <- html_text(genre_data_html)
    # Let's have a look at the genre data
    head(genre_data)    
    ## [1] "\nComedy, Crime, Drama            "        
    ## [2] "\nComedy, Drama, Thriller            "     
    ## [3] "\nDrama, Horror, Mystery            "      
    ## [4] "\nCrime, Drama, Thriller            "      
    ## [5] "\nComedy, Drama            "               
    ## [6] "\nAnimation, Adventure, Comedy            "
    # Data-Preprocessing: retrieve the first word
    genre_data <- str_extract(genre_data, "[:alpha:]+")
    # Converting each genre from text to factor
    #genre_data <- as.factor(genre_data)
    # Let's have another look at the genre data
    head(genre_data)
    ## [1] "Comedy"    "Comedy"    "Drama"     "Crime"     "Comedy"    "Animation"
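If all genres are needed rather than only the first, the raw strings can be trimmed and split on commas. A minimal sketch on two sample strings copied from the output above:

```r
library(stringr)

# Two raw genre strings as returned by html_text().
genre_raw <- c("\nComedy, Crime, Drama            ", "\nComedy, Drama            ")

genre_clean <- str_trim(genre_raw)              # drop the newline and padding
genre_list  <- str_split(genre_clean, ",\\s*")  # list: one character vector per movie
first_genre <- vapply(genre_list, function(g) g[1], character(1))
n_genres    <- lengths(genre_list)
```

Keeping the result as a list preserves the varying number of genres per movie.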

Rating

  • # Using CSS selectors to scrape the IMDB rating section
    rating_data_html <- html_nodes(webpage, '.ratings-imdb-rating strong')
    # Converting the ratings data to text
    rating_data <- html_text(rating_data_html)
    # Let's have a look at the ratings
    head(rating_data)
    ## [1] "7.9" "8.6" "7.1" "8.4" "7.6" "5.8"
    # Data-Preprocessing: converting ratings to numerical
    rating_data <- as.numeric(rating_data)
    # Let's have another look at the ratings data
    rating_data
    ##   [1] 7.9 8.6 7.1 8.4 7.6 5.8 8.4 7.5 7.3 8.3 7.8 6.5 7.2 7.8 6.8 6.9 7.9 6.5
    ##  [19] 7.4 5.3 6.7 7.3 7.0 6.4 7.4 6.8 7.4 6.2 7.8 7.3 7.0 6.4 2.7 4.9 6.7 7.3
    ##  [37] 8.1 6.8 5.7 5.5 7.1 6.6 6.9 7.4 5.8 6.8 6.5 6.2 5.7 6.1 6.7 7.2 6.6 5.6
    ##  [55] 7.9 6.0 6.8 6.8 6.1 5.6 6.7 7.8 6.6 5.4 5.7 6.5 7.6 8.1 6.8 6.0 7.3 6.8
    ##  [73] 6.7 6.8 6.3 6.4 6.2 7.6 7.7 5.2 6.3 6.3 6.6 6.6 6.4 7.6 6.1 5.6 6.7 5.9
    ##  [91] 6.3 7.0 6.0 6.4 7.2 6.5 7.5 7.3 6.8 5.7

Votes

  • # Using CSS selectors to scrape the votes section
    votes_data_html <- html_nodes(webpage, '.sort-num_votes-visible span:nth-child(2)')
    # Converting the votes data to text
    votes_data <- html_text(votes_data_html)
    # Let's have a look at the votes data
    head(votes_data)
    ## [1] "539,726"   "672,715"   "257,951"   "1,081,344" "645,227"   "34,097"
    # Data-Preprocessing: removing all commas (str_replace() would remove only the first one)
    votes_data <- str_replace_all(votes_data, ",", "")
    # Data-Preprocessing: converting votes to numerical
    votes_data <- as.numeric(votes_data)
    # Let's have another look at the votes data
    votes_data
    ##   [1]  539726  672715  257951 1081344  645227   34097  950476  172075  167900
    ##  [10]  498772  168213  232537  103174  284132  495734  129209  342098  405846
    ##  [19]  385311   46855  216961  159413  297457  107447  308155  259098   46463
    ##  [28]   69059  353833  248405  192166  195296   47014   65436  161629   41798
    ##  [37]  339026  138933   48600   70622   74880   72569  246013  247750   48331
    ##  [46]  103768  215593  163833  103893  155416   69180  107290   58584   49561
    ##  [55]  275254  167840  227563   51255   39500  123774   26566  107229  225378
    ##  [64]   21233  171068   37844   57396   78349   28616   18245  207300   43169
    ##  [73]   76013  153408   91463   89369   41762   74596  224392   84367   25048
    ##  [82]   82522   30965   95116   20187   82993   78224    1656   86712   68016
    ##  [91]   53835   44228  113914   65270   55514   20041   75827   70250   46100
    ## [100]   57652
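Counts of a million or more contain two commas, and str_replace() removes only the first match, so the remaining comma makes as.numeric() return NA. The sketch below contrasts it with str_replace_all() and with readr's parse_number(), which treats "," as a grouping mark under its default locale. The three sample values are taken from the output above.

```r
library(stringr)
library(readr)

votes_raw <- c("539,726", "1,081,344", "34,097")

first_only <- str_replace(votes_raw, ",", "")      # second element still has a comma
all_commas <- str_replace_all(votes_raw, ",", "")  # every comma removed

votes_a <- as.numeric(all_commas)
votes_b <- parse_number(votes_raw)                 # handles grouping marks directly
```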

Director

  • CSS selector reference

    # Using CSS selectors to scrape the directors section
    (directors_data_html <- html_nodes(webpage,'.text-muted+ p a:nth-child(1)'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/name/nm0426059/?ref_=adv_li_dr_0">Rian Johnson</a>
    ##  [2] <a href="/name/nm0094435/?ref_=adv_li_dr_0">Bong Joon Ho</a>
    ##  [3] <a href="/name/nm4170048/?ref_=adv_li_dr_0">Ari Aster</a>
    ##  [4] <a href="/name/nm0680846/?ref_=adv_li_dr_0">Todd Phillips</a>
    ##  [5] <a href="/name/nm0000233/?ref_=adv_li_dr_0">Quentin Tarantino</a>
    ##  [6] <a href="/name/nm0862911/?ref_=adv_li_dr_0">Greg Tiernan</a>
    ##  [7] <a href="/name/nm0751577/?ref_=adv_li_dr_0">Anthony Russo</a>
    ##  [8] <a href="/name/nm3211470/?ref_=adv_li_dr_0">Robert Eggers</a>
    ##  [9] <a href="/name/nm1093039/?ref_=adv_li_dr_0">Mike Flanagan</a>
    ## [10] <a href="/name/nm0005222/?ref_=adv_li_dr_0">Sam Mendes</a>
    ## [11] <a href="/name/nm1950086/?ref_=adv_li_dr_0">Greta Gerwig</a>
    ## [12] <a href="/name/nm0615592/?ref_=adv_li_dr_0">Andy Muschietti</a>
    ## [13] <a href="/name/nm2391575/?ref_=adv_li_dr_0">David Michôd</a>
    ## [14] <a href="/name/nm0005363/?ref_=adv_li_dr_0">Guy Ritchie</a>
    ## [15] <a href="/name/nm1349818/?ref_=adv_li_dr_0">Anna Boden</a>
    ## [16] <a href="/name/nm2366012/?ref_=adv_li_dr_0">Matt Bettinelli-Olpin</a>
    ## [17] <a href="/name/nm0169806/?ref_=adv_li_dr_0">Taika Waititi</a>
    ## [18] <a href="/name/nm0009190/?ref_=adv_li_dr_0">J.J. Abrams</a>
    ## [19] <a href="/name/nm1218281/?ref_=adv_li_dr_0">Jon Watts</a>
    ## [20] <a href="/name/nm1788310/?ref_=adv_li_dr_0">Jenny Gage</a>
    ## ...
    # Converting the directors data to text
    directors_data <- html_text(directors_data_html)
    # Let's have a look at the directors data
    directors_data
    ##   [1] "Rian Johnson"           "Bong Joon Ho"           "Ari Aster"             
    ##   [4] "Todd Phillips"          "Quentin Tarantino"      "Greg Tiernan"          
    ##   [7] "Anthony Russo"          "Robert Eggers"          "Mike Flanagan"         
    ##  [10] "Sam Mendes"             "Greta Gerwig"           "Andy Muschietti"       
    ##  [13] "David Michôd"           "Guy Ritchie"            "Anna Boden"            
    ##  [16] "Matt Bettinelli-Olpin"  "Taika Waititi"          "J.J. Abrams"           
    ##  [19] "Jon Watts"              "Jenny Gage"             "Jake Kasdan"           
    ##  [22] "Dexter Fletcher"        "David F. Sandberg"      "Adam Robitel"          
    ##  [25] "Chad Stahelski"         "Jordan Peele"           "Michael Engler"        
    ##  [28] "André Øvredal"          "Martin Scorsese"        "Robert Rodriguez"      
    ##  [31] "Galder Gaztelu-Urrutia" "David Leitch"           "Tom Hooper"            
    ##  [34] "Elizabeth Banks"        "Ruben Fleischer"        "Gavin Hood"            
    ##  [37] "James Mangold"          "Danny Boyle"            "Lars Klevberg"         
    ##  [40] "Jim Jarmusch"           "Stephen Merchant"       "Luc Besson"            
    ##  [43] "Guy Ritchie"            "Benny Safdie"           "Lorcan Finnegan"       
    ##  [46] "Jay Roach"              "James Gray"             "Tim Miller"            
    ##  [49] "Ang Lee"                "Michael Bay"            "Gene Stupnitsky"       
    ##  [52] "Olivia Wilde"           "Brian Kirk"             "Tate Taylor"           
    ##  [55] "Noah Baumbach"          "Michael Dougherty"      "Jon Favreau"           
    ##  [58] "Edward Norton"          "Severin Fiala"          "F. Gary Gray"          
    ##  [61] "Rose Glass"             "Darius Marder"          "M. Night Shyamalan"    
    ##  [64] "Alice Waddington"       "Simon Kinberg"          "Woody Allen"           
    ##  [67] "Destin Daniel Cretton"  "Céline Sciamma"         "Rod Lurie"             
    ##  [70] "Christian Volckman"     "Vince Gilligan"         "Adam Randall"          
    ##  [73] "Roland Emmerich"        "Chris Buck"             "Lorene Scafaria"       
    ##  [76] "Ric Roman Waugh"        "Richard Stanley"        "Todd Haynes"           
    ##  [79] "Josh Cooley"            "Neil Marshall"          "Scott Beck"            
    ##  [82] "Jonas Åkerlund"         "Andrea Di Stefano"      "Joachim Rønning"       
    ##  [85] "John Crowley"           "Tyler Nilson"           "Alexandre Aja"         
    ##  [88] "Karl Holt"              "Joe Berlinger"          "Gary Dauberman"        
    ##  [91] "Jason Howden"           "Jeff Tremaine"          "Kyle Newacheck"        
    ##  [94] "Brad Anderson"          "Justin Baldoni"         "Carlo Mirabella-Davis" 
    ##  [97] "Clint Eastwood"         "Marielle Heller"        "Rupert Goold"          
    ## [100] "Dan Gilroy"

Actor

  • # Using CSS selectors to scrape the actors section
    (actors_data_html <- html_nodes(webpage, '.lister-item-content .ghost+ a'))
    ## {xml_nodeset (100)}
    ##  [1] <a href="/name/nm0185819/?ref_=adv_li_st_0">Daniel Craig</a>
    ##  [2] <a href="/name/nm0814280/?ref_=adv_li_st_0">Kang-ho Song</a>
    ##  [3] <a href="/name/nm6073955/?ref_=adv_li_st_0">Florence Pugh</a>
    ##  [4] <a href="/name/nm0001618/?ref_=adv_li_st_0">Joaquin Phoenix</a>
    ##  [5] <a href="/name/nm0000138/?ref_=adv_li_st_0">Leonardo DiCaprio</a>
    ##  [6] <a href="/name/nm1209966/?ref_=adv_li_st_0">Oscar Isaac</a>
    ##  [7] <a href="/name/nm0000375/?ref_=adv_li_st_0">Robert Downey Jr.</a>
    ##  [8] <a href="/name/nm1500155/?ref_=adv_li_st_0">Robert Pattinson</a>
    ##  [9] <a href="/name/nm0000191/?ref_=adv_li_st_0">Ewan McGregor</a>
    ## [10] <a href="/name/nm2835616/?ref_=adv_li_st_0">Dean-Charles Chapman</a>
    ## [11] <a href="/name/nm1519680/?ref_=adv_li_st_0">Saoirse Ronan</a>
    ## [12] <a href="/name/nm1567113/?ref_=adv_li_st_0">Jessica Chastain</a>
    ## [13] <a href="/name/nm6077951/?ref_=adv_li_st_0">Tom Glynn-Carney</a>
    ## [14] <a href="/name/nm0000190/?ref_=adv_li_st_0">Matthew McConaughey</a>
    ## [15] <a href="/name/nm0488953/?ref_=adv_li_st_0">Brie Larson</a>
    ## [16] <a href="/name/nm3034977/?ref_=adv_li_st_0">Samara Weaving</a>
    ## [17] <a href="/name/nm9877392/?ref_=adv_li_st_0">Roman Griffin Davis</a>
    ## [18] <a href="/name/nm5397459/?ref_=adv_li_st_0">Daisy Ridley</a>
    ## [19] <a href="/name/nm4043618/?ref_=adv_li_st_0">Tom Holland</a>
    ## [20] <a href="/name/nm6466214/?ref_=adv_li_st_0">Josephine Langford</a>
    ## ...
    # Converting the actors data to text
    actors_data <- html_text(actors_data_html)
    # Let's have a look at the actors data
    head(actors_data)
    ## [1] "Daniel Craig"      "Kang-ho Song"      "Florence Pugh"    
    ## [4] "Joaquin Phoenix"   "Leonardo DiCaprio" "Oscar Isaac"

Metascore

  • Be careful with missing data.

    # Using CSS selectors to scrape the metascore section
    metascore_data_html <- html_nodes(webpage, '.metascore')
    # Converting the metascore data to text
    metascore_data <- html_text(metascore_data_html)
    # Let's have a look at the metascore 
    head(metascore_data)
    ## [1] "82        " "96        " "72        " "59        " "83        "
    ## [6] "46        "
    # Data-Preprocessing: removing extra space in metascore
    metascore_data <- str_replace(metascore_data, "\\s*$", "")
    metascore_data <- as.numeric(metascore_data)
    metascore_data
    ##  [1] 82 96 72 59 83 46 78 83 59 78 91 58 62 51 64 64 58 53 69 30 58 69 71 48 73
    ## [26] 81 64 61 94 53 73 60 32 52 55 63 81 55 48 53 68 40 53 91 64 64 80 54 38 41
    ## [51] 84 51 53 94 48 55 60 64 38 83 82 43 49 43 38 68 95 71 72 65 47 64 79 45 70
    ## [76] 73 84 31 69 19 61 43 40 70 60 52 53 42 39 38 36 53 65 68 80 66 61
    # Let's check the length of metascore data
    length(metascore_data)
    ## [1] 97
    # Visual inspection finds that movies 24, 85, and 100 don't have a metascore
    # (hard-coded positions like these break whenever the page changes)
    ms <- rep(NA, 100)
    ms[-c(24, 85, 100)] <- metascore_data
    (metascore_data <- ms)
    ##   [1] 82 96 72 59 83 46 78 83 59 78 91 58 62 51 64 64 58 53 69 30 58 69 71 NA 48
    ##  [26] 73 81 64 61 94 53 73 60 32 52 55 63 81 55 48 53 68 40 53 91 64 64 80 54 38
    ##  [51] 41 84 51 53 94 48 55 60 64 38 83 82 43 49 43 38 68 95 71 72 65 47 64 79 45
    ##  [76] 70 73 84 31 69 19 61 43 40 NA 70 60 52 53 42 39 38 36 53 65 68 80 66 61 NA
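The negative-index fill used above can be tried on a small example first. This is only a sketch on made-up numbers (the values and missing positions are hypothetical, not taken from the page), with a length check added so that a mismatch fails loudly instead of silently recycling values.

```r
# Hypothetical toy data: 7 list positions, of which 2 and 5 have no score.
n_items     <- 7
missing_pos <- c(2, 5)
scraped     <- c(82, 72, 59, 46, 78)   # what the scraper returns, gaps removed

# Guard: the scraped values must exactly fill the non-missing slots.
stopifnot(length(scraped) == n_items - length(missing_pos))

scores <- rep(NA_real_, n_items)
scores[-missing_pos] <- scraped         # negative index = every slot except these
scores
```

One caveat with this idiom: if `missing_pos` is empty, `scores[-missing_pos]` selects nothing at all, so the no-missing case needs to be handled separately.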

Gross

  • Be careful with missing data.

    # Using CSS selectors to scrape the gross revenue section
    gross_data_html <- html_nodes(webpage,'.ghost~ .text-muted+ span')
    # Converting the gross revenue data to text
    gross_data <- html_text(gross_data_html)
    # Let's have a look at the gross data
    head(gross_data)
    ## [1] "$165.36M" "$53.37M"  "$27.33M"  "$335.45M" "$142.50M" "$100.04M"
    # Data-Preprocessing: removing '$' and 'M' signs
    gross_data <- str_replace(gross_data, "M", "")
    gross_data <- str_sub(gross_data, 2, 10)
    #(gross_data <- str_extract(gross_data, "[:digit:]+.[:digit:]+"))
    gross_data <- as.numeric(gross_data)
    # Let's check the length of gross data
    length(gross_data)
    ## [1] 62
    # Visual inspection finds below movies don't have gross
    #gs_data <- rep(NA, 100)
    #gs_data[-c(1, 2, 3, 5, 61, 69, 71, 74, 78, 82, 84:87, 90)] <- gross_data
    #(gross_data <- gs_data)

    38 (out of 100) movies don’t have gross data yet! We need a better way to figure out missing entries.

    (rank_and_gross <- webpage %>%
      html_nodes('.ghost~ .text-muted+ span , .text-primary') %>%
      html_text() %>%
      str_replace("\\s+", "") %>%
      str_replace_all("[$M]", ""))
    ##   [1] "1."     "165.36" "2."     "53.37"  "3."     "27.33"  "4."     "335.45"
    ##   [9] "5."     "142.50" "6."     "100.04" "7."     "858.37" "8."     "0.43"  
    ##  [17] "9."     "10."    "159.23" "11."    "108.10" "12."    "211.59" "13."   
    ##  [25] "14."    "15."    "426.83" "16."    "26.74"  "17."    "0.35"   "18."   
    ##  [33] "515.20" "19."    "390.53" "20."    "12.14"  "21."    "316.83" "22."   
    ##  [41] "96.37"  "23."    "140.37" "24."    "57.01"  "25."    "171.02" "26."   
    ##  [49] "175.08" "27."    "96.85"  "28."    "62.74"  "29."    "7.00"   "30."   
    ##  [57] "85.71"  "31."    "32."    "173.96" "33."    "34."    "35."    "26.80" 
    ##  [65] "36."    "0.40"   "37."    "117.62" "38."    "73.29"  "39."    "29.21" 
    ##  [73] "40."    "6.56"   "41."    "22.96"  "42."    "7.74"   "43."    "355.56"
    ##  [81] "44."    "45."    "46."    "47."    "35.40"  "48."    "62.25"  "49."   
    ##  [89] "20.55"  "50."    "51."    "69.06"  "52."    "22.68"  "53."    "54."   
    ##  [97] "45.37"  "55."    "2.00"   "56."    "110.50" "57."    "543.64" "58."   
    ## [105] "59."    "60."    "80.00"  "61."    "62."    "63."    "111.05" "64."   
    ## [113] "65."    "65.85"  "66."    "67."    "68."    "3.76"   "69."    "70."   
    ## [121] "71."    "72."    "73."    "74."    "477.37" "75."    "80.55"  "76."   
    ## [129] "67.16"  "77."    "78."    "79."    "434.04" "80."    "21.90"  "81."   
    ## [137] "82."    "83."    "84."    "113.93" "85."    "5.33"   "86."    "13.12" 
    ## [145] "87."    "39.01"  "88."    "89."    "90."    "74.15"  "91."    "92."   
    ## [153] "93."    "94."    "95."    "45.73"  "96."    "97."    "98."    "61.70" 
    ## [161] "99."    "100."
    isrank <- str_detect(rank_and_gross, "\\.$")
    ismissing <- isrank[1:(length(rank_and_gross) - 1)] & isrank[2:(length(rank_and_gross))]
    ismissing[length(ismissing)+1] <- isrank[length(isrank)]
    missingpos <- as.integer(rank_and_gross[ismissing])
    gs_data <- rep(NA, 100)
    gs_data[-missingpos] <- gross_data
    (gross_data <- gs_data)
    ##   [1] 165.36  53.37  27.33 335.45 142.50 100.04 858.37   0.43     NA 159.23
    ##  [11] 108.10 211.59     NA     NA 426.83  26.74   0.35 515.20 390.53  12.14
    ##  [21] 316.83  96.37 140.37  57.01 171.02 175.08  96.85  62.74   7.00  85.71
    ##  [31]     NA 173.96     NA     NA  26.80   0.40 117.62  73.29  29.21   6.56
    ##  [41]  22.96   7.74 355.56     NA     NA     NA  35.40  62.25  20.55     NA
    ##  [51]  69.06  22.68     NA  45.37   2.00 110.50 543.64     NA     NA  80.00
    ##  [61]     NA     NA 111.05     NA  65.85     NA     NA   3.76     NA     NA
    ##  [71]     NA     NA     NA 477.37  80.55  67.16     NA     NA 434.04  21.90
    ##  [81]     NA     NA     NA 113.93   5.33  13.12  39.01     NA     NA  74.15
    ##  [91]     NA     NA     NA     NA  45.73     NA     NA  61.70     NA     NA
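The rank-followed-by-rank idea can also be wrapped in a small function. The sketch below is a variant of the code above, not the same code: `fill_by_rank()` is a hypothetical helper name, it uses base R only, and it assumes every value is immediately preceded by its own rank label. It places each value at the position given by the rank before it, which avoids negative indexing.

```r
# Hypothetical helper: `x` interleaves rank labels ("1.", "2.", ...) with
# values; a rank that is not followed by a value has a missing value.
fill_by_rank <- function(x, n = sum(grepl("\\.$", x))) {
  is_rank <- grepl("\\.$", x)
  val_idx <- which(!is_rank)
  out <- rep(NA_real_, n)
  # a value belongs to the rank label immediately before it
  ranks_of_values <- as.integer(sub("\\.$", "", x[val_idx - 1]))
  out[ranks_of_values] <- as.numeric(x[val_idx])
  out
}

# Toy input: ranks 2 and 4 have no value
toy <- c("1.", "165.36", "2.", "3.", "27.33", "4.")
fill_by_rank(toy)
```

Because values are assigned by positive index, the function also works when nothing is missing.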

Missing entries - more reproducible way

  • The following code programmatically figures out missing entries for metascore.

    # Use CSS selectors to scrape the rank and metascore nodes together
    (rank_metascore_data_html <- html_nodes(webpage, '.unfavorable , .favorable , .mixed , .text-primary'))
    ## {xml_nodeset (197)}
    ##  [1] <span class="lister-item-index unbold text-primary">1.</span>
    ##  [2] <span class="metascore  favorable">82        </span>
    ##  [3] <span class="lister-item-index unbold text-primary">2.</span>
    ##  [4] <span class="metascore  favorable">96        </span>
    ##  [5] <span class="lister-item-index unbold text-primary">3.</span>
    ##  [6] <span class="metascore  favorable">72        </span>
    ##  [7] <span class="lister-item-index unbold text-primary">4.</span>
    ##  [8] <span class="metascore  mixed">59        </span>
    ##  [9] <span class="lister-item-index unbold text-primary">5.</span>
    ## [10] <span class="metascore  favorable">83        </span>
    ## [11] <span class="lister-item-index unbold text-primary">6.</span>
    ## [12] <span class="metascore  mixed">46        </span>
    ## [13] <span class="lister-item-index unbold text-primary">7.</span>
    ## [14] <span class="metascore  favorable">78        </span>
    ## [15] <span class="lister-item-index unbold text-primary">8.</span>
    ## [16] <span class="metascore  favorable">83        </span>
    ## [17] <span class="lister-item-index unbold text-primary">9.</span>
    ## [18] <span class="metascore  mixed">59        </span>
    ## [19] <span class="lister-item-index unbold text-primary">10.</span>
    ## [20] <span class="metascore  favorable">78        </span>
    ## ...
    # Convert the ranking data to text
    (rank_metascore_data <- html_text(rank_metascore_data_html))
    ##   [1] "1."         "82        " "2."         "96        " "3."        
    ##   [6] "72        " "4."         "59        " "5."         "83        "
    ##  [11] "6."         "46        " "7."         "78        " "8."        
    ##  [16] "83        " "9."         "59        " "10."        "78        "
    ##  [21] "11."        "91        " "12."        "58        " "13."       
    ##  [26] "62        " "14."        "51        " "15."        "64        "
    ##  [31] "16."        "64        " "17."        "58        " "18."       
    ##  [36] "53        " "19."        "69        " "20."        "30        "
    ##  [41] "21."        "58        " "22."        "69        " "23."       
    ##  [46] "71        " "24."        "48        " "25."        "73        "
    ##  [51] "26."        "81        " "27."        "64        " "28."       
    ##  [56] "61        " "29."        "94        " "30."        "53        "
    ##  [61] "31."        "73        " "32."        "60        " "33."       
    ##  [66] "32        " "34."        "52        " "35."        "55        "
    ##  [71] "36."        "63        " "37."        "81        " "38."       
    ##  [76] "55        " "39."        "48        " "40."        "53        "
    ##  [81] "41."        "68        " "42."        "40        " "43."       
    ##  [86] "53        " "44."        "91        " "45."        "64        "
    ##  [91] "46."        "64        " "47."        "80        " "48."       
    ##  [96] "54        " "49."        "38        " "50."        "41        "
    ## [101] "51."        "52."        "84        " "53."        "51        "
    ## [106] "54."        "53        " "55."        "94        " "56."       
    ## [111] "48        " "57."        "55        " "58."        "60        "
    ## [116] "59."        "64        " "60."        "38        " "61."       
    ## [121] "83        " "62."        "82        " "63."        "43        "
    ## [126] "64."        "49        " "65."        "43        " "66."       
    ## [131] "38        " "67."        "68        " "68."        "95        "
    ## [136] "69."        "71        " "70."        "71."        "72        "
    ## [141] "72."        "65        " "73."        "47        " "74."       
    ## [146] "64        " "75."        "79        " "76."        "45        "
    ## [151] "77."        "70        " "78."        "73        " "79."       
    ## [156] "84        " "80."        "31        " "81."        "69        "
    ## [161] "82."        "19        " "83."        "61        " "84."       
    ## [166] "43        " "85."        "40        " "86."        "70        "
    ## [171] "87."        "60        " "88."        "89."        "52        "
    ## [176] "90."        "53        " "91."        "42        " "92."       
    ## [181] "39        " "93."        "38        " "94."        "36        "
    ## [186] "95."        "53        " "96."        "65        " "97."       
    ## [191] "68        " "98."        "80        " "99."        "66        "
    ## [196] "100."       "61        "
    # Strip spaces
    (rank_metascore_data <- str_replace(rank_metascore_data, "\\s+", ""))
    ##   [1] "1."   "82"   "2."   "96"   "3."   "72"   "4."   "59"   "5."   "83"  
    ##  [11] "6."   "46"   "7."   "78"   "8."   "83"   "9."   "59"   "10."  "78"  
    ##  [21] "11."  "91"   "12."  "58"   "13."  "62"   "14."  "51"   "15."  "64"  
    ##  [31] "16."  "64"   "17."  "58"   "18."  "53"   "19."  "69"   "20."  "30"  
    ##  [41] "21."  "58"   "22."  "69"   "23."  "71"   "24."  "48"   "25."  "73"  
    ##  [51] "26."  "81"   "27."  "64"   "28."  "61"   "29."  "94"   "30."  "53"  
    ##  [61] "31."  "73"   "32."  "60"   "33."  "32"   "34."  "52"   "35."  "55"  
    ##  [71] "36."  "63"   "37."  "81"   "38."  "55"   "39."  "48"   "40."  "53"  
    ##  [81] "41."  "68"   "42."  "40"   "43."  "53"   "44."  "91"   "45."  "64"  
    ##  [91] "46."  "64"   "47."  "80"   "48."  "54"   "49."  "38"   "50."  "41"  
    ## [101] "51."  "52."  "84"   "53."  "51"   "54."  "53"   "55."  "94"   "56." 
    ## [111] "48"   "57."  "55"   "58."  "60"   "59."  "64"   "60."  "38"   "61." 
    ## [121] "83"   "62."  "82"   "63."  "43"   "64."  "49"   "65."  "43"   "66." 
    ## [131] "38"   "67."  "68"   "68."  "95"   "69."  "71"   "70."  "71."  "72"  
    ## [141] "72."  "65"   "73."  "47"   "74."  "64"   "75."  "79"   "76."  "45"  
    ## [151] "77."  "70"   "78."  "73"   "79."  "84"   "80."  "31"   "81."  "69"  
    ## [161] "82."  "19"   "83."  "61"   "84."  "43"   "85."  "40"   "86."  "70"  
    ## [171] "87."  "60"   "88."  "89."  "52"   "90."  "53"   "91."  "42"   "92." 
    ## [181] "39"   "93."  "38"   "94."  "36"   "95."  "53"   "96."  "65"   "97." 
    ## [191] "68"   "98."  "80"   "99."  "66"   "100." "61"
    # a rank followed by another rank means the metascore for the 1st rank is missing
    (isrank <- str_detect(rank_metascore_data, "\\.$"))
    ##   [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [13]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [25]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [37]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [49]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [61]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [73]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [85]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ##  [97]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
    ## [109] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
    ## [121] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
    ## [133] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [145]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [157]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
    ## [169]  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
    ## [181] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
    ## [193] FALSE  TRUE FALSE  TRUE FALSE
    # parenthesize the upper bound: `1:n - 1` parses as `(1:n) - 1`
    ismissing <- isrank[1:(length(rank_metascore_data) - 1)] & 
      isrank[2:length(rank_metascore_data)]
    ismissing[length(ismissing)+1] <- isrank[length(isrank)]
    (missingpos <- as.integer(rank_metascore_data[ismissing]))
    ## [1] 51 70 88
    #(rank_metascore_data <- as.integer(rank_metascore_data))
  • You (students) should work out the code for finding missing positions for gross.
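A different way to keep ranks and scores aligned is to select one container per movie and then look for the score inside each container. With rvest, `html_node()` (singular) returns exactly one result per input node and a missing value where nothing matches. The sketch below runs on hypothetical, simplified markup; the container class `.lister-item-content` is an assumption and should be checked against the real page.

```r
library(rvest)

# Hypothetical, simplified markup: three list items, the second has no metascore.
page <- read_html('
  <div class="lister-item-content"><span class="text-primary">1.</span><span class="metascore">82   </span></div>
  <div class="lister-item-content"><span class="text-primary">2.</span></div>
  <div class="lister-item-content"><span class="text-primary">3.</span><span class="metascore">72   </span></div>
')

# One node per movie
items <- html_nodes(page, ".lister-item-content")

# html_node() keeps one entry per item, NA where the score is absent
metascore <- items %>%
  html_node(".metascore") %>%
  html_text() %>%
  trimws() %>%
  as.numeric()
metascore
```

The result has the same length as `items`, so no separate bookkeeping of missing positions is needed.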

Visualizing movie data

  • Form a tibble:

    # Combining all the vectors to form a tibble
    movies <- tibble(Rank = rank_data, 
                     Title = title_data,
                     Description = description_data, 
                     Runtime = runtime_data,
                     Genre = genre_data, 
                     Rating = rating_data,
                     Metascore = metascore_data, 
                     Votes = votes_data,
                     Gross_Earning_in_Mil = gross_data,
                     Director = directors_data, 
                     Actor = actors_data)
    movies %>% print(width=Inf)
    ## # A tibble: 100 × 11
    ##     Rank Title                           
    ##    <int> <chr>                           
    ##  1     1 Knives Out                      
    ##  2     2 Parasite                        
    ##  3     3 Midsommar                       
    ##  4     4 Joker                           
    ##  5     5 Once Upon a Time... In Hollywood
    ##  6     6 The Addams Family               
    ##  7     7 Avengers: Endgame               
    ##  8     8 The Lighthouse                  
    ##  9     9 Doctor Sleep                    
    ## 10    10 1917                            
    ##    Description                                                                  
    ##    <chr>                                                                        
    ##  1 "\nA detective investigates the death of a patriarch of an eccentric, combat…
    ##  2 "\nGreed and class discrimination threaten the newly formed symbiotic relati…
    ##  3 "\nA couple travels to Scandinavia to visit a rural hometown's fabled Swedis…
    ##  4 "\nIn Gotham City, mentally troubled comedian Arthur Fleck is disregarded an…
    ##  5 "\nA faded television actor and his stunt double strive to achieve fame and …
    ##  6 "\nThe eccentrically macabre family moves to a bland suburb where Wednesday …
    ##  7 "\nAfter the devastating events of Avengers: Infinity War (2018), the univer…
    ##  8 "\nTwo lighthouse keepers try to maintain their sanity while living on a rem…
    ##  9 "\nYears following the events of The Shining (1980), a now-adult Dan Torranc…
    ## 10 "\nApril 6th, 1917. As a regiment assembles to wage war deep in enemy territ…
    ##    Runtime Genre     Rating Metascore  Votes Gross_Earning_in_Mil
    ##      <dbl> <chr>      <dbl>     <dbl>  <dbl>                <dbl>
    ##  1     130 Comedy       7.9        82 539726               165.  
    ##  2     132 Comedy       8.6        96 672715                53.4 
    ##  3     148 Drama        7.1        72 257951                27.3 
    ##  4     122 Crime        8.4        59     NA               335.  
    ##  5     161 Comedy       7.6        83 645227               142.  
    ##  6      86 Animation    5.8        46  34097               100.  
    ##  7     181 Action       8.4        78 950476               858.  
    ##  8     109 Drama        7.5        83 172075                 0.43
    ##  9     152 Drama        7.3        59 167900                NA   
    ## 10     119 Action       8.3        78 498772               159.  
    ##    Director          Actor               
    ##    <chr>             <chr>               
    ##  1 Rian Johnson      Daniel Craig        
    ##  2 Bong Joon Ho      Kang-ho Song        
    ##  3 Ari Aster         Florence Pugh       
    ##  4 Todd Phillips     Joaquin Phoenix     
    ##  5 Quentin Tarantino Leonardo DiCaprio   
    ##  6 Greg Tiernan      Oscar Isaac         
    ##  7 Anthony Russo     Robert Downey Jr.   
    ##  8 Robert Eggers     Robert Pattinson    
    ##  9 Mike Flanagan     Ewan McGregor       
    ## 10 Sam Mendes        Dean-Charles Chapman
    ## # … with 90 more rows
  • How many top 100 movies are in each genre? (Be careful with interpretation.)

    movies %>%
      ggplot() +
      geom_bar(mapping = aes(x = Genre))

  • Which genre is most profitable in terms of average gross earnings?

    movies %>%
      group_by(Genre) %>%
      summarise(avg_earning = mean(Gross_Earning_in_Mil, na.rm=TRUE)) %>%
      ggplot() +
        geom_col(mapping = aes(x = Genre, y = avg_earning)) + 
        labs(y = "avg earning in millions")
    ## Warning: Removed 2 rows containing missing values (position_stack).

    ggplot(data = movies) +
      geom_boxplot(mapping = aes(x = Genre, y = Gross_Earning_in_Mil)) + 
      labs(y = "Gross earning in millions")
    ## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
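The first warning above ("Removed 2 rows") most likely comes from genres in which no movie has gross data: the mean of an empty set of numbers is `NaN`, and `geom_col()` drops those bars. A small sketch on made-up numbers (the genre names and values are hypothetical):

```r
# Hypothetical gross figures (in millions) for two genres
gross_by_genre <- list(
  Action = c(858.37, NA, 159.23),
  Family = c(NA_real_, NA_real_)   # no movie in this genre has gross data
)

# na.rm = TRUE removes the NAs first; for Family nothing is left to average
avg <- sapply(gross_by_genre, mean, na.rm = TRUE)
avg
is.nan(avg)
```

Filtering out the `NaN` rows before plotting would silence the warning, at the cost of hiding the fact that those genres have no data.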

  • Is there a relationship between gross earning and rating? Find the best-selling movie (by gross earning) in each genre.

    library("ggrepel")
    (best_in_genre <- movies %>%
        group_by(Genre) %>%
        filter(row_number(desc(Gross_Earning_in_Mil)) == 1))
    ## # A tibble: 8 × 11
    ## # Groups:   Genre [8]
    ##    Rank Title Description Runtime Genre Rating Metascore  Votes Gross_Earning_i…
    ##   <int> <chr> <chr>         <dbl> <chr>  <dbl>     <dbl>  <dbl>            <dbl>
    ## 1     1 Kniv… "\nA detec…     130 Come…    7.9        82 539726            165. 
    ## 2     4 Joker "\nIn Goth…     122 Crime    8.4        59     NA            335. 
    ## 3     7 Aven… "\nAfter t…     181 Acti…    8.4        78 950476            858. 
    ## 4    12 It C… "\nTwenty-…     169 Drama    6.5        58 232537            212. 
    ## 5    22 Rock… "\nA music…     121 Biog…    7.3        69 159413             96.4
    ## 6    26 Us    "\nA famil…     116 Horr…    6.8        73 259098            175. 
    ## 7    43 Alad… "\nA kind-…     128 Adve…    6.9        40 246013            356. 
    ## 8    57 The … "\nAfter t…     118 Anim…    6.8        55 227563            544. 
    ## # … with 2 more variables: Director <chr>, Actor <chr>
    ggplot(movies, mapping = aes(x = Rating, y = Gross_Earning_in_Mil)) +
      geom_point(mapping = aes(size = Votes, color = Genre)) + 
      ggrepel::geom_label_repel(aes(label = Title), data = best_in_genre) +
      labs(y = "Gross earning in millions")
    ## Warning: Removed 39 rows containing missing values (geom_point).
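To complement the plot with a number, a rank correlation can be computed while skipping movies without gross data. The sketch below uses only a handful of rows copied from the tibble printed above, so the result says nothing about the full top 100; it just shows the call.

```r
# A few (Rating, Gross) pairs copied from the printed tibble; one gross is NA
toy <- data.frame(
  Rating = c(7.9, 8.6, 7.1, 8.4, 5.8, 7.3),
  Gross_Earning_in_Mil = c(165.36, 53.37, 27.33, 335.45, 100.04, NA)
)

# use = "complete.obs" drops rows where either variable is NA;
# Spearman is less sensitive than Pearson to a few blockbuster outliers
rho <- cor(toy$Rating, toy$Gross_Earning_in_Mil,
           use = "complete.obs", method = "spearman")
rho
```

Applied to the full data, the same call would take `movies$Rating` and `movies$Gross_Earning_in_Mil`.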

RSelenium Example: FCC’s television broadcast signal strength

Many websites pull data dynamically from databases using JavaScript and jQuery, which makes them difficult to scrape with rvest alone.

The FCC’s dtvmaps webpage has a simple form in which you enter a zip code and it gives you the available local TV stations in that zip code and their signal strength.

You’ll also notice that the URL stays the same for different zip codes, so we cannot get the results simply by changing the URL.

Why RSelenium

  • RSelenium loads the page that we want to scrape and downloads the HTML from that page.

    • particularly useful when scraping something behind a login

    • simulate human behavior on a website (e.g., mouse clicking)

  • rvest provides typical scraping tools

rm(list = ls()) # clean-up workspace
library("RSelenium")
library("tidyverse")
library("rvest")

Open up a browser

# pick a random port above 1023; lower ports are usually reserved for the system
rD <- rsDriver(browser = "firefox", port = sample(1024:7360, 1), verbose = FALSE)
remDr <- rD[["client"]]

Open a webpage

remDr$navigate("https://www.fcc.gov/media/engineering/dtvmaps")

We want to send a string of text (zip code) into the form.

zip <- "70118"
# remDr$findElement(using = "id", value = "startpoint")$clearElement()
remDr$findElement(using = "id", value = "startpoint")$sendKeysToElement(list(zip))
# other possible values for `using`: "xpath", "css selector", "id", "name", "tag name", "class name", "link text", "partial link text"

Click on the button Go!

remDr$findElements("id", "btnSub")[[1]]$clickElement()

Extract data from HTML

  • save HTML to an object

  • use rvest for the rest

Sys.sleep(5) # give the page time to fully load, in seconds
html <- remDr$getPageSource()[[1]]
# important to close the client and stop the Selenium server
remDr$close()
rD[["server"]]$stop()
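The fixed `Sys.sleep(5)` above either wastes time or is too short on a slow connection. One alternative is to poll until a condition holds. The sketch below is self-contained and does not touch Selenium: `wait_until()` is a hypothetical helper, and the demo condition is a stand-in counter.

```r
# Hypothetical helper: call `ready()` every `interval` seconds until it
# returns TRUE or `timeout` seconds have passed. Errors count as "not ready".
wait_until <- function(ready, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(tryCatch(ready(), error = function(e) FALSE))) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Demo: a stand-in condition that becomes TRUE on the third check
calls <- 0
ok <- wait_until(function() { calls <<- calls + 1; calls >= 3 },
                 timeout = 5, interval = 0.1)
ok
```

With RSelenium, the condition could be something like `function() length(remDr$findElements("css selector", "table.tbl_mapReception")) > 0`, called before `getPageSource()` and before the client is closed. That usage is an untested assumption about how this page loads.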

signals <- read_html(html) %>% 
  html_nodes("table.tbl_mapReception") %>% # extract table nodes with class = "tbl_mapReception"
  .[3] %>% # keep the third of these tables
  .[[1]] %>% # keep the first element of this list
  html_table(fill = TRUE) # have rvest turn it into a data frame
signals
## # A tibble: 39 × 6
##    Callsign                       Callsign   Network   `Ch#`   Band   IA        
##    <chr>                          <chr>      <chr>     <chr>   <chr>  <chr>     
##  1 "Click on callsign for detail" "Click on… "Click o… "Click… "Clic…  <NA>     
##  2 ""                             "WWL-TV"   "CBS"     "4"     "UHF"  "RThis st…
##  3 ""                             ""         ""        ""      ""     ""        
##  4 ""                             "WUPL"     "MYNE"    "54"    "UHF"  "RThis st…
##  5 ""                             ""         ""        ""      ""     ""        
##  6 ""                             "WVUE-DT"  "FOX"     "8"     "UHF"  ""        
##  7 ""                             ""         ""        ""      ""     ""        
##  8 ""                             "WPXL-TV"  "ION"     "49"    "UHF"  "RThis st…
##  9 ""                             ""         ""        ""      ""     ""        
## 10 ""                             "WHNO"     "IND"     "20"    "UHF"  ""        
## # … with 29 more rows

More formatting on signals

names(signals) <- c("rm", "callsign", "network", "ch_num", "band", "rm2") # rename columns

signals <- signals %>%
  slice(2:n()) %>% # drop unnecessary first row
  filter(callsign != "") %>% # drop blank rows
  select(callsign:band) # drop unnecessary columns
signals
## # A tibble: 19 × 4
##    callsign network ch_num band 
##    <chr>    <chr>   <chr>  <chr>
##  1 WWL-TV   "CBS"   "4"    UHF  
##  2 WUPL     "MYNE"  "54"   UHF  
##  3 WVUE-DT  "FOX"   "8"    UHF  
##  4 WPXL-TV  "ION"   "49"   UHF  
##  5 WHNO     "IND"   "20"   UHF  
##  6 WGNO     "ABC"   "26"   UHF  
##  7 WDSU     "NBC"   "6"    UHF  
##  8 WNOL-TV  "CW"    "38"   UHF  
##  9 WYES-TV  "PBS"   "12"   Hi-V 
## 10 WTNO-LP  ""      ""     UHF  
## 11 WLAE-TV  "PBS"   "32"   UHF  
## 12 KNOV-CD  ""      ""     UHF  
## 13 WBXN-CD  ""      ""     UHF  
## 14 KGLA-DT  "IND"   "42"   UHF  
## 15 WBRZ-TV  "ABC"   "2"    Hi-V 
## 16 WVLA-TV  "NBC"   "33"   UHF  
## 17 WLPB-TV  "PBS"   "27"   UHF  
## 18 WGMB-TV  "FOX"   "44"   UHF  
## 19 WAFB     "CBS"   "9"    Hi-V
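Note that `ch_num` is still a character column, with empty strings where the page shows no channel. A minimal sketch of converting it to integers, using made-up rows in the same shape as `signals` (the data frame `toy` is hypothetical):

```r
# Hypothetical rows in the same shape as `signals`
toy <- data.frame(
  callsign = c("WWL-TV", "WTNO-LP", "WYES-TV"),
  ch_num   = c("4", "", "12"),
  stringsAsFactors = FALSE
)

# "" cannot be parsed as a number and becomes NA
toy$ch_num <- as.integer(toy$ch_num)
toy
```

After the conversion, channels can be sorted and compared numerically, and stations without a channel number show up as `NA` rather than as empty strings.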

Capture all text by clicking on each Callsign

read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick")
##  [1] "getdetail(11158,74192,'WWL-TV Facility ID: 74192 <br>WWL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=74192 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/74192 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 27<br>RX Strength: 115 dbuV/m<br>Tower Distance: 5 mi; Direction: 116°<br>Repacked Channel: 27<br>Repacking Dates: 10/19/2019 to 1/17/2020','WWL-TV<br>Distance to Tower: 5 miles<br>Direction to Tower: 116 deg',29.90636111111111,-90.03947222222222,'WWL-TV')"     
##  [2] "getdetail(11137,13938,'WUPL Facility ID: 13938 <br>WUPL (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=13938 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/13938 target=_new>Public File</a>)<br>City of License: SLIDELL, LA<br>RF Channel: 17<br>RX Strength: 114 dbuV/m<br>Tower Distance: 5 mi; Direction: 116°<br>Repacked Channel: 17<br>Repacking Dates: 10/19/2019 to 1/17/2020','WUPL<br>Distance to Tower: 5 miles<br>Direction to Tower: 116 deg',29.90636111111111,-90.03947222222222,'WUPL')"                 
##  [3] "getdetail(10815,4149,'WVUE-DT Facility ID: 4149 <br>WVUE-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=4149 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/4149 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 29<br>RX Strength: 112 dbuV/m<br>Tower Distance: 10 mi; Direction: 84°','WVUE-DT<br>Distance to Tower: 10 miles<br>Direction to Tower: 84 deg',29.954138888888888,-89.94952777777779,'WVUE-DT')"                                                                        
##  [4] "getdetail(11203,21729,'WPXL-TV Facility ID: 21729 <br>WPXL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=21729 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/21729 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 33<br>RX Strength: 111 dbuV/m<br>Tower Distance: 11 mi; Direction: 74°<br>Repacked Channel: 33<br>Repacking Dates: 10/19/2019 to 1/17/2020','WPXL-TV<br>Distance to Tower: 11 miles<br>Direction to Tower: 74 deg',29.982777777777777,-89.94944444444445,'WPXL-TV')"
##  [5] "getdetail(12228,37106,'WHNO Facility ID: 37106 <br>WHNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=37106 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/37106 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 21<br>RX Strength: 111 dbuV/m<br>Tower Distance: 6 mi; Direction: 103°','WHNO<br>Distance to Tower: 6 miles<br>Direction to Tower: 103 deg',29.920305555555558,-90.02458333333334,'WHNO')"                                                                                
##  [6] "getdetail(11737,72119,'WGNO Facility ID: 72119 <br>WGNO (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=72119 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/72119 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 26<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 86°','WGNO<br>Distance to Tower: 10 miles<br>Direction to Tower: 86 deg',29.95,-89.95777777777778,'WGNO')"                                                                                             
##  [7] "getdetail(11226,71357,'WDSU Facility ID: 71357 <br>WDSU (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=71357 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/71357 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 19<br>RX Strength: 111 dbuV/m<br>Tower Distance: 10 mi; Direction: 86°<br>Repacked Channel: 19<br>Repacking Dates: 10/19/2019 to 1/17/2020','WDSU<br>Distance to Tower: 10 miles<br>Direction to Tower: 86 deg',29.95,-89.95777777777778,'WDSU')"                         
##  [8] "getdetail(11738,54280,'WNOL-TV Facility ID: 54280 <br>WNOL-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=54280 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/54280 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 15<br>RX Strength: 110 dbuV/m<br>Tower Distance: 10 mi; Direction: 86°','WNOL-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 86 deg',29.95,-89.95777777777778,'WNOL-TV')"                                                                                 
##  [9] "getdetail(11911,25090,'WYES-TV Facility ID: 25090 <br>WYES-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=25090 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/25090 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 11<br>RX Strength: 102 dbuV/m<br>Tower Distance: 10 mi; Direction: 85°','WYES-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 85 deg',29.953888888888887,-89.94944444444445,'WYES-TV')"                                                                    
## [10] "getdetail(12360,24981,'WTNO-LP Facility ID: 24981 <br>WTNO-LP (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=24981 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/24981 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 22<br>RX Strength: 106 dbuV/m<br>Tower Distance: 3 mi; Direction: 330°','WTNO-LP<br>Distance to Tower: 3 miles<br>Direction to Tower: 330 deg',29.97461111111111,-90.14347222222223,'WTNO-LP')"                                                                     
## [11] "getdetail(11281,18819,'WLAE-TV Facility ID: 18819 <br>WLAE-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=18819 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/18819 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 23<br>RX Strength: 104 dbuV/m<br>Tower Distance: 10 mi; Direction: 74°<br>Repacked Channel: 23<br>Repacking Dates: 10/19/2019 to 1/17/2020','WLAE-TV<br>Distance to Tower: 10 miles<br>Direction to Tower: 74 deg',29.982777777777777,-89.9525,'WLAE-TV')"          
## [12] "getdetail(12467,64048,'KNOV-CD Facility ID: 64048 <br>KNOV-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=64048 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/64048 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 31<br>RX Strength: 101 dbuV/m<br>Tower Distance: 3 mi; Direction: 74°<br>Repacked Channel: 31<br>Repacking Dates: 10/19/2019 to 1/17/2020','KNOV-CD<br>Distance to Tower: 3 miles<br>Direction to Tower: 74 deg',29.95213888888889,-90.07027777777778,'KNOV-CD')"   
## [13] "getdetail(12443,70419,'WBXN-CD Facility ID: 70419 <br>WBXN-CD (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70419 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70419 target=_new>Public File</a>)<br>City of License: NEW ORLEANS, LA<br>RF Channel: 36<br>RX Strength: 98 dbuV/m<br>Tower Distance: 5 mi; Direction: 116°<br>Repacked Channel: 36<br>Repacking Dates: 10/19/2019 to 1/17/2020','WBXN-CD<br>Distance to Tower: 5 miles<br>Direction to Tower: 116 deg',29.90636111111111,-90.03947222222222,'WBXN-CD')"  
## [14] "getdetail(10726,83945,'KGLA-DT Facility ID: 83945 <br>KGLA-DT (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=83945 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/83945 target=_new>Public File</a>)<br>City of License: HAMMOND, LA<br>RF Channel: 35<br>RX Strength: 93 dbuV/m<br>Tower Distance: 11 mi; Direction: 76°<br>Repacked Channel: 35<br>Repacking Dates: 3/14/2020 to 5/1/2020','KGLA-DT<br>Distance to Tower: 11 miles<br>Direction to Tower: 76 deg',29.97833333333333,-89.94055555555556,'KGLA-DT')"        
## [15] "getdetail(11797,38616,'WBRZ-TV Facility ID: 38616 <br>WBRZ-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=38616 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/38616 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 13<br>RX Strength: 45 dbuV/m<br>Tower Distance: 69 mi; Direction: 291°','WBRZ-TV<br>Distance to Tower: 69 miles<br>Direction to Tower: 291 deg',30.296944444444446,-91.19361111111111,'WBRZ-TV')"                                                                   
## [16] "getdetail(12198,70021,'WVLA-TV Facility ID: 70021 <br>WVLA-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=70021 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/70021 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 34<br>RX Strength: 46 dbuV/m<br>Tower Distance: 74 mi; Direction: 291°','WVLA-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 291 deg',30.32627777777778,-91.27669444444444,'WVLA-TV')"                                                                    
## [17] "getdetail(10829,38586,'WLPB-TV Facility ID: 38586 <br>WLPB-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=38586 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/38586 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 25<br>RX Strength: 44 dbuV/m<br>Tower Distance: 71 mi; Direction: 295°','WLPB-TV<br>Distance to Tower: 71 miles<br>Direction to Tower: 295 deg',30.372972222222224,-91.20455555555556,'WLPB-TV')"                                                                   
## [18] "getdetail(11101,12520,'WGMB-TV Facility ID: 12520 <br>WGMB-TV (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=12520 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/12520 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 24<br>RX Strength: 43 dbuV/m<br>Tower Distance: 74 mi; Direction: 291°<br>Repacked Channel: 24<br>Repacking Dates: 1/18/2020 to 3/13/2020','WGMB-TV<br>Distance to Tower: 74 miles<br>Direction to Tower: 291 deg',30.32627777777778,-91.27669444444444,'WGMB-TV')" 
## [19] "getdetail(11961,589,'WAFB Facility ID: 589 <br>WAFB (<a href=https://enterpriseefiling.fcc.gov/dataentry/public/tv/publicFacilityDetails.html?facilityId=589 target=_new>Licensing</a>) (<a href=https://publicfiles.fcc.gov/tv-profile/589 target=_new>Public File</a>)<br>City of License: BATON ROUGE, LA<br>RF Channel: 9<br>RX Strength: 37 dbuV/m<br>Tower Distance: 72 mi; Direction: 295°','WAFB<br>Distance to Tower: 72 miles<br>Direction to Tower: 295 deg',30.366388888888892,-91.21305555555556,'WAFB')"

Extract signal strength by string operations

strength <- read_html(html) %>% 
  html_nodes(".callsign") %>% 
  html_attr("onclick") %>% 
  str_extract("(?<=RX Strength: )\\s*\\-*[0-9.]+")

# (?<=…) is a positive lookbehind: the match must be preceded by "RX Strength: ",
# but that prefix is not part of the extracted text

signals <- cbind(signals, strength)
signals
##    callsign network ch_num band strength
## 1    WWL-TV     CBS      4  UHF      115
## 2      WUPL    MYNE     54  UHF      114
## 3   WVUE-DT     FOX      8  UHF      112
## 4   WPXL-TV     ION     49  UHF      111
## 5      WHNO     IND     20  UHF      111
## 6      WGNO     ABC     26  UHF      111
## 7      WDSU     NBC      6  UHF      111
## 8   WNOL-TV      CW     38  UHF      110
## 9   WYES-TV     PBS     12 Hi-V      102
## 10  WTNO-LP                 UHF      106
## 11  WLAE-TV     PBS     32  UHF      104
## 12  KNOV-CD                 UHF      101
## 13  WBXN-CD                 UHF       98
## 14  KGLA-DT     IND     42  UHF       93
## 15  WBRZ-TV     ABC      2 Hi-V       45
## 16  WVLA-TV     NBC     33  UHF       46
## 17  WLPB-TV     PBS     27  UHF       44
## 18  WGMB-TV     FOX     44  UHF       43
## 19     WAFB     CBS      9 Hi-V       37
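
To see what the lookbehind does in isolation, here is a minimal sketch on a single string. The `onclick` value is a shortened, hand-made stand-in modeled on the `getdetail(...)` output above, and `strength_chr` and `strength_num` are illustrative names that do not appear in the lecture code. The `str_match()` line shows a capture-group alternative for comparison; it is not the approach used above.

```r
library(stringr)

# Shortened stand-in for one onclick attribute (assumed for illustration)
onclick <- "RF Channel: 22<br>RX Strength: 106 dbuV/m<br>Tower Distance: 3 mi"

# Same pattern as above: the lookbehind anchors the match right after
# "RX Strength: " but keeps that prefix out of the result
strength_chr <- str_extract(onclick, "(?<=RX Strength: )\\s*\\-*[0-9.]+")
strength_chr

# str_extract() returns a character vector; convert before doing arithmetic
strength_num <- as.numeric(strength_chr)
strength_num

# Alternative without lookbehind: capture the number in a group.
# str_match() returns a matrix; column 2 holds the first group.
str_match(onclick, "RX Strength: (-?[0-9.]+)")[, 2]

# Strings without the prefix yield NA rather than an error
str_extract("no signal info here", "(?<=RX Strength: )\\s*\\-*[0-9.]+")
```

Because `str_extract()` returns character values, the `strength` column added by `cbind()` above is stored as text. Wrapping it in `as.numeric()` is one way to make it sortable and usable in plots.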