
Jazzy Scraping with R and without Selectors


I've been using rvest to scrape pages, and I'm aware of the benefits of SelectorGadget. However, one page has data WITHOUT selectors. A snippet of the HTML is below. The page is https://www.jazzdisco.org/blue-note-records/catalog-4000-series/. I am trying to scrape the list of personnel for each jazz album listed. In the snippet, the personnel data begins with "Sonny Rollins, tenor sax...". As you can see, that text sits directly inside the catalog-data div as bare text: it is not wrapped in any element of its own, so no CSS selector targets it. Any advice on scraping it out?

<h1>Blue Note Records Catalog: 4000 series</h1>
<div id="start-here"><!-- id="start-here" --></div>
<div id="catalog-data">
<h2>Modern Jazz 4000 series (12 inch LP)</h2>
<h3><a href="./album-index/#blp-4001" name="blp-4001">BLP 4001 &nbsp; Sonny         
Rollins - Newk's Time &nbsp; <i>1959</i></a></h3>
Sonny Rollins, tenor sax; Wynton Kelly, piano #1,2,4-6; Doug Watkins, bass     
#1,2,4-6; Philly Joe Jones, drums.
<div class="date">Van Gelder Studio, Hackensack, NJ, September 22, 
1957</div>
<table width="100%">
<tr><td width="15%">1. tk.5<td>Tune Up

Etc...


Solution

  • Extract the text nodes with XPath, then use a regular expression to filter out the unwanted ones. The following script should work.

    library(rvest)
    library(stringr)
    
    texts <- read_html("https://www.jazzdisco.org/blue-note-records/catalog-4000-series/") %>% 
        html_nodes(xpath = '//*[@id="catalog-data"]/text()') %>%
        html_text()
    
    # Drop the newline-only nodes and the nodes that start with "**".
    # I just noticed this line doesn't clean up the strings entirely; a better regex may be needed.
    texts[!str_detect(texts, "(^\\n$)|(^\\n\\*\\*)")]
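
    One possible way to tidy the strings further is to normalise whitespace before filtering, instead of matching raw newlines. The sketch below is only an illustration: it parses a trimmed copy of the snippet from the question rather than the live page, and it assumes that any text node that is non-empty and does not start with "**" is a personnel list, which may not hold for the whole catalog.

    ```r
    library(rvest)
    library(stringr)

    # Trimmed copy of the snippet from the question, so no network access is needed.
    html <- '<div id="catalog-data">
    <h2>Modern Jazz 4000 series (12 inch LP)</h2>
    <h3><a href="./album-index/#blp-4001" name="blp-4001">BLP 4001 Sonny Rollins - Newk\'s Time <i>1959</i></a></h3>
    Sonny Rollins, tenor sax; Wynton Kelly, piano #1,2,4-6; Doug Watkins, bass
    #1,2,4-6; Philly Joe Jones, drums.
    <div class="date">Van Gelder Studio, Hackensack, NJ, September 22, 1957</div>
    </div>'

    # Same XPath as above: the bare text nodes directly under the div.
    texts <- html_text(
        html_nodes(read_html(html), xpath = '//*[@id="catalog-data"]/text()')
    )

    # Collapse runs of whitespace (including embedded newlines) and trim both ends.
    cleaned <- str_squish(texts)

    # Keep nodes that are not empty and do not start with "**".
    personnel <- cleaned[cleaned != "" & !str_detect(cleaned, "^\\*\\*")]
    personnel
    ```

    Because str_squish() also removes the line break in the middle of the personnel text, each list comes back as a single clean line.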
    

    To split the string, you could try the following code:

    sample_str <- "\nIke Quebec, tenor sax; Sonny Clark, piano; Grant Green, guitar; Sam Jones, bass; Louis Hayes, drums.\n" 
    str_trim(sample_str) %>%
        str_split(",")
    

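    This returns the following (note that splitting on "," leaves each instrument attached to the next musician's name):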

    [[1]]
    [1] "Ike Quebec"              " tenor sax; Sonny Clark" " piano; Grant Green"     " guitar; Sam Jones"      " bass; Louis Hayes"      " drums."
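
    Since the musicians are separated by ";" and each name is separated from its instrument by ",", a two-step split may give a more useful result. This is a minimal sketch, assuming every entry has the form "name, instrument" and that names themselves contain no comma; the variable names (entries, parts, personnel) are just for illustration.

    ```r
    library(stringr)

    sample_str <- "\nIke Quebec, tenor sax; Sonny Clark, piano; Grant Green, guitar; Sam Jones, bass; Louis Hayes, drums.\n"

    # Trim surrounding whitespace and drop the trailing period.
    trimmed <- str_remove(str_trim(sample_str), "\\.$")

    # One entry per musician: split on ";" and any whitespace after it.
    entries <- str_split(trimmed, ";\\s*")[[1]]

    # Split each entry at the first comma only (n = 2), so track lists such as
    # "piano #1,2,4-6" would stay with the instrument.
    parts <- str_split_fixed(entries, ",\\s*", n = 2)

    personnel <- data.frame(name = parts[, 1], instrument = parts[, 2])
    personnel
    ```

    The result is a data frame with one row per musician, which is easier to combine across albums than a flat character vector.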