r csv url

Retrieve whole lyrics from URL


I am trying to retrieve all the lyrics of a band from the web. I have noticed that the site builds its song URLs as ".../lyrics/bandname/songname.html".

Here is an example.

http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html

I was thinking about creating a function that would call read.csv on each URL. Getting the song titles is easy: I can copy and paste them and save them as a .csv file. I would then pass that vector of titles to the function, which would construct the URL for each one.
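A minimal sketch of the URL-building step, assuming the slug rule can be inferred from the single example URL above (lowercase, with everything except letters and digits removed). The helper names make_slug and make_lyrics_url are hypothetical, and the rule may not hold for every title on the site.

```r
# Hypothetical helper: turn a band or song name into the slug used in the URL.
# Assumption (inferred from one example URL): lowercase, letters and digits only.
make_slug <- function(x) {
  gsub("[^a-z0-9]", "", tolower(x))
}

# Hypothetical helper: build one URL per song title for a given band.
make_lyrics_url <- function(band, titles) {
  paste0("http://www.azlyrics.com/lyrics/", make_slug(band), "/",
         make_slug(titles), ".html")
}

# In practice the titles would come from the saved .csv file.
titles <- c("It's a Long Way to the Top (If You Wanna Rock 'n' Roll)")
urls <- make_lyrics_url("AC/DC", titles)
print(urls)
```

Because paste0() is vectorised, passing the whole vector of titles returns the whole vector of URLs in one call.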

But when I read the first page just to see what it looks like, I found that it would take far too much data cleaning to get from there to a csv file with one lyric per row.

x <- read.csv(url("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html"))

I think my approach is not the best one (or maybe I just need a better data cleaning strategy).


Solution

  • The HTML page has a tell on where the lyrics begin:

    Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. Sorry about that.

    Taking advantage of that, you can detect this string, and then read everything up to the end of the div:

    m <- readLines("http://www.azlyrics.com/lyrics/acdc/itsalongwaytothetopifyouwannarocknroll.html")

    giveaway <- "Sorry about that."
    # Use the full sentence instead if you think a lyric might contain this phrase.

    start <- grep(giveaway, m, fixed = TRUE)[1] + 1 # Line where the lyric starts
    end <- grep("</div>", m[start:length(m)], fixed = TRUE)[1] + start - 1
    # grep() returns a position relative to `start`, so add start - 1 to get
    # the line in m that holds the first </div> after the lyric begins.

    lyrics <- paste(gsub("<br>|</div>", "", m[start:end]), collapse = "\n")
    # This is just one way to strip the remaining tags and join the text.
    

    And then:

    > cat(lyrics) #using cat() prints the line breaks
    Ridin' down the highway
    Goin' to a show
    Stop in all the byways
    Playin' rock 'n' roll 
    .
    .
    .
    Well it's a long way
    It's a long way, you should've told me
    It's a long way, such a long way
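
To go from one song to a csv file with one lyric per row, the same steps can be wrapped in a function and applied to every URL. The following is only a sketch: extract_lyrics and scrape_lyrics are hypothetical names, the sample page is a made-up stand-in for the real HTML, and it assumes every page has the same marker sentence followed by a single closing </div>.

```r
# Hypothetical helper: pull the lyric text out of the lines of one page.
# Assumes the marker sentence sits right before the lyric and that the lyric
# ends at the first </div> after it.
extract_lyrics <- function(page, giveaway = "Sorry about that.") {
  start <- grep(giveaway, page, fixed = TRUE)[1] + 1
  if (is.na(start) || start > length(page)) return(NA_character_)  # no marker
  end <- grep("</div>", page[start:length(page)], fixed = TRUE)[1] + start - 1
  if (is.na(end)) return(NA_character_)  # no closing </div> after the marker
  text <- gsub("<br>|</div>", "", page[start:end])
  trimws(paste(text, collapse = "\n"))
}

# Hypothetical driver: download each URL and return one row per song.
scrape_lyrics <- function(urls) {
  lyrics <- vapply(urls,
                   function(u) extract_lyrics(readLines(u, warn = FALSE)),
                   character(1), USE.NAMES = FALSE)
  data.frame(url = urls, lyrics = lyrics, stringsAsFactors = FALSE)
}

# Made-up page used to check the extraction without touching the network.
sample_page <- c("<div>",
                 "<!-- Usage of ... is prohibited. Sorry about that. -->",
                 "Line one<br>",
                 "Line two<br>",
                 "</div>",
                 "<div>footer</div>")
sample_lyrics <- extract_lyrics(sample_page)
cat(sample_lyrics)

# With a character vector `urls` of song pages, the csv could then be written as:
# songs <- scrape_lyrics(urls)
# write.csv(songs, "lyrics.csv", row.names = FALSE)
```

Returning NA for a page where the marker or the closing tag is missing keeps one bad page from stopping the whole run, and the NA rows are easy to spot in the resulting file.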