Tags: html, web-scraping, xpath

Difficulties with web scraping


I just came across an article called The 500 Greatest Songs of All Time and thought, "Oh, that's cool, I bet they also made a Spotify/Apple Music list that I can follow." Well... they didn't.

So, in a nutshell, I wonder if it's possible to 1) scrape the website to extract the songs and 2) then do some kind of bulk upload to Spotify to create the list.

Song titles and artists are structured on the website as shown in the screenshot. I have already tried to scrape the page with the IMPORTXML() formula in Google Sheets, but without success.

I understand the scraping part is easier than the upload part and, as I am new to programming, I would be happy to achieve even part of this goal. I am sure this task can be done easily in Python.


Solution

  • I feel like explaining everything would go beyond the scope here, so I tried to comment the code well enough.

    1. Scrape the songs

    I used Python 3 and Selenium; the website doesn't block that. Be sure to adjust your chromedriver path and, if necessary, the output path of the .txt file at the bottom. Once the script is done and you have your .txt file, you can close the browser window.

    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    
    # adjust this path to wherever your chromedriver lives
    service = Service(r'/Users/main/Desktop/chromedriver')
    driver = webdriver.Chrome(service=service)
    
    # just setting some variables; I used XPath because that's what I know
    top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
    cookie_button_xpath = "// button [@id = 'onetrust-accept-btn-handler']"
    div_containing_links_xpath = "// div [@id = 'pmc-gallery-list-nav-bar-render'] // child :: a"
    song_names_xpath = "// article [@class = 'c-gallery-vertical-album'] / child :: h2"
    
    links = []
    songs = []
    
    
    driver.get(top_500)
    
    
    # accept cookies, give time to load
    time.sleep(3)
    cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
    cookie_btn.click()
    time.sleep(1)
    
    
    # extracting all the links since there are only 50 songs per page
    links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
    
    for element in links_to_next_pages:
        l = element.get_attribute('href')
        links.append(l)
    
    
    # extracting the songs, then going to the next page and so on until we hit 500
    counter = 1         # we're starting with 1 here since links[0] is the page we are already on
    
    while True:
        song_elements = driver.find_elements(By.XPATH, song_names_xpath)
    
        for element in song_elements:
            songs.append(element.text)
        
        # stop at 500 songs, or when we run out of pages so links[counter] can't raise an IndexError
        if len(songs) >= 500 or counter >= len(links):
            break
    
        driver.get(links[counter])
        counter += 1
    
        time.sleep(2)
    
    
    # verify that there are no duplicates, if there were, something would be off
    if len(songs) != len( set(songs) ):
        print('something went wrong')
    else:
        print('seems fine')
    
    
    with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
        file.writelines(line + '\n' for line in songs)
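
The duplicate check above only reports that something is off. Here is a minimal sketch of how the offending entries could be listed instead, using only the standard library. `find_duplicates` is a hypothetical helper name and the sample entries are made-up placeholders, not lines from the real list:

```python
from collections import Counter


def find_duplicates(entries):
    """Return the entries that occur more than once, in first-seen order."""
    counts = Counter(entries)
    return [entry for entry, n in counts.items() if n > 1]


# made-up placeholder entries, only to show the call
sample = ["Artist A, 'Song 1'", "Artist B, 'Song 2'", "Artist A, 'Song 1'"]
print(find_duplicates(sample))
```

In the script this could be called as `find_duplicates(songs)` right before writing the file, printing the result when it is not empty.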
    

    2. Prepare Spotify

    3. Prepare Your Environment

    4. Run the Spotify side of things

    import requests
    import re
    import json
    
    # this is NOT your display name, it's your user name!!
    user_id = 'YOUR_USERNAME'
    # paste your auth token from Spotify; it can time out, and then you have to get a new one, so don't panic if you get a bunch of responses in the 400s after some time
    auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}
    
    
    playlist = []
    err_log = []
    base_url = 'https://api.spotify.com/v1'
    search_method = '/search'
    
    with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
        songs = file.readlines()
    
    
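    # this queries Spotify for each song and appends the track's Spotify URI to the playlist array
    def query_song_uris():
        for n, entry in enumerate(songs):
            try:
                # the title is whatever sits between the single quotes, the artist is what comes before it
                x = re.findall(r"'([^']*)'", entry)
                title_len = len(entry) - len(x[0]) - 4
    
                title = x[0]
                artist = entry[:title_len]
    
                # note: Spotify's search docs put field filters such as track: and artist: inside q;
                # sent as separate parameters like here they are most likely ignored, so q does the work
                payload = {
                    'q': (entry),
                    'track:': (title),
                    'artist:': (artist),
                    'type': 'track',
                    'limit': 1
                }
    
                url = base_url + search_method
    
                r = requests.get(url, params=payload, headers=auth)
                print('\nquerying spotify;  ', r)
    
                dic = r.json()
    
                track_uri = dic["tracks"]["items"][0]["uri"]
    
                playlist.append(track_uri)
                print(track_uri)
    
            # a line without quotes, a failed request or an empty result all end up in the error log
            except Exception:
                err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
                err_log.append(err)
    
        # the scraped list appears to run from 500 down to 1 (hence the numbering above), so reverse it
        playlist.reverse()
    query_song_uris()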
    
    # creates a playlist and returns playlist id
    def create_playlist():
        payload = {
                    "name": "Rolling Stone: Top 500 (All Time)",
                    "description": "music for old men xD with occasional hip hop appearances. just kidding"
                }
    
        url = base_url + f'/users/{user_id}/playlists'
        r = requests.post(url, headers=auth, json=payload)
        
        c = r.content.decode('UTF-8')
        dic = json.loads(c)
    
        print(f'\n\ncreating playlist @{dic["id"]};  ', r)
        return dic["id"]
    
    
    def add_to_playlist():
    
        playlist_id = create_playlist()
    
        while True:
    
            if len(playlist) > 100:
                p = playlist[:100]
            else:
                p = playlist
    
            payload = {"uris": (p)}
    
            url = base_url + f'/playlists/{playlist_id}/tracks'
            r = requests.post(url, headers=auth, json=payload)
    
            print(f'\nadding {len(p)} songs to playlist;  ', r)
    
            del playlist[ : len(p) ]
    
            if len(playlist) == 0:
                break
    add_to_playlist()
    
    
    print('\n\ncheck your spotify :)')
    print("\n\n\nthese tracks didn't make it, check manually:\n")
    for line in err_log:
        print(line)
    print('\n\n')
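
The script above splits each line with a regex and some index arithmetic, and uploads the URIs in batches of 100. Here is a hedged sketch of the same ideas as small standalone helpers. It assumes each scraped line looks like `Artist, 'Title'`, which is inferred from the regex above and not confirmed against the site. The helper names `split_entry`, `build_search_params` and `chunked` are hypothetical. The `track:` and `artist:` field filters are placed inside `q`, which is how Spotify's search documentation describes them:

```python
import re

# Assumption: each scraped line looks like  Artist, 'Title'
ENTRY_PATTERN = re.compile(r"^(?P<artist>.*?),\s*'(?P<title>.*)'\s*$")


def split_entry(entry):
    """Split one line into (artist, title); return None if it does not match."""
    match = ENTRY_PATTERN.match(entry.strip())
    if match is None:
        return None
    return match.group("artist").strip(), match.group("title").strip()


def build_search_params(artist, title):
    """Build query parameters for /search with the field filters inside q."""
    return {
        "q": f"track:{title} artist:{artist}",
        "type": "track",
        "limit": 1,
    }


def chunked(items, size=100):
    """Yield consecutive slices of at most `size` items (100 is the batch size used above)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# placeholder line, only to show the calls
parsed = split_entry("Some Artist, 'Some Title'\n")
if parsed is not None:
    print(build_search_params(*parsed))
```

Field filters are stricter than a free-text query, so they may miss songs that the plain `q` finds. Falling back to the whole line as `q` when the filtered search returns nothing would be one way to combine both.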
    

    Done

    If you don't want to run the code yourself, here's the playlist: https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS

    If you have trouble, everything from step 2 on is also described in the Web API quick start, or more generally in the Web API docs.

    Regarding Apple Music

    So Apple seems very closed up (surprise haha). What I found, though, is that you can query the iTunes Store. The response also contains a direct link to the song(s) on Apple Music. You might be able to go from there.

    Get ISRC code from iTunes Search API (Apple music)
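
As a rough starting point for the iTunes route, the sketch below builds a search URL and picks the song link out of a response. It assumes the public iTunes Search API endpoint `https://itunes.apple.com/search` with the `term`, `entity` and `limit` parameters, and a `trackViewUrl` field on each result. The helper names and the sample response are hypothetical, and nothing here is taken from the original answer's code:

```python
from urllib.parse import urlencode

ITUNES_SEARCH_URL = "https://itunes.apple.com/search"


def build_itunes_search_url(artist, title, limit=1):
    """Build a search URL for one song; no request is sent here."""
    params = {"term": f"{artist} {title}", "entity": "song", "limit": limit}
    return f"{ITUNES_SEARCH_URL}?{urlencode(params)}"


def first_track_link(response_json):
    """Return the link of the first result, or None if there is none."""
    results = response_json.get("results", [])
    if not results:
        return None
    return results[0].get("trackViewUrl")


# placeholder values, only to show the calls
print(build_itunes_search_url("Some Artist", "Some Title"))

# hand-written stand-in for a parsed JSON response, not real API output
fake_response = {"resultCount": 1, "results": [{"trackViewUrl": "https://example.com/track"}]}
print(first_track_link(fake_response))
```

The URL could be fetched with `requests.get(url).json()` as in the Spotify script, and the result passed to `first_track_link`.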

    PS: undeniably regex is witchcraft, but y'all here got my back