web-scraping spotify playlist

How do I scrape all Spotify playlists ever?


I am trying to analyze all user-curated Spotify playlists and the tracks inside them, especially in the hip-hop genre. The result I want is a list of user-curated playlist IDs (preferably 50,000 of them).

I have tried the Search API and Spotify's Get Category's Playlists API. The problems are:

  1. There is a 1,000-result limit for the Search API.
  2. The Get Category's Playlists API only returns Spotify-curated playlists for each genre.

I also tried to get around the Search API limit by running many different queries (e.g. searching for 'a', 'b', 'c', 'd', ...). However, I still have no idea which queries would best represent Spotify playlists as a whole (searching 'a', 'b', ... seems too random). I would appreciate any help or ideas!
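
To make the multi-query idea concrete, here is a minimal sketch of a systematic two-letter query sweep that deduplicates playlist IDs and stops at a target count. It does not solve the representativeness problem; it only shows the bookkeeping. The names `two_letter_queries`, `collect_playlist_ids`, and `search_page` are hypothetical, and the filter assumes that Spotify's own editorial playlists are owned by the user ID `spotify`.

```python
from itertools import product
from string import ascii_lowercase


def two_letter_queries():
    """Yield 'aa', 'ab', ..., 'zz' as a systematic (not representative) sweep."""
    for a, b in product(ascii_lowercase, repeat=2):
        yield a + b


def collect_playlist_ids(search_page, queries, target=50_000,
                         page_size=50, max_results=1000):
    """Collect unique, non-Spotify-owned playlist IDs across many queries.

    search_page(query, limit, offset) is a hypothetical callable that returns
    the list of playlist dicts (possibly containing None) for one result page.
    """
    seen = set()
    for query in queries:
        # Stay within the 1,000-result window the Search API allows per query
        for offset in range(0, max_results, page_size):
            items = search_page(query, page_size, offset)
            if not items:
                break  # no more results for this query
            for item in items:
                if item is None:
                    continue
                # Assumption: editorial playlists are owned by user "spotify"
                if item.get("owner", {}).get("id") == "spotify":
                    continue
                seen.add(item["id"])
            if len(seen) >= target:
                return seen
    return seen


# Hypothetical Spotipy adapter (needs credentials and network access):
# def search_page(query, limit, offset):
#     page = sp.search(q=query, type="playlist", limit=limit, offset=offset)
#     return page["playlists"]["items"]
# ids = collect_playlist_ids(search_page, two_letter_queries())
```

Injecting `search_page` keeps the loop independent of the client library, so it can be tried against canned data before spending real API calls. Note that a sweep like this is biased towards whatever the search ranking returns for each query.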

This is what I have tried with the Get Category's Playlists API, using the Spotipy library in Google Colab:

import pandas as pd
import spotipy

# Replace Auth details with your Client ID, Secret
spotify_details = {
    'client_id' : 'Client ID',
    'client_secret':'Client Secret',
    'redirect_uri':'Redirect_uri'}

scope = "user-library-read user-follow-read user-top-read playlist-read-private playlist-read-collaborative playlist-modify-public playlist-modify-private" 

sp = spotipy.Spotify(
        auth_manager=spotipy.SpotifyOAuth(
          client_id=spotify_details['client_id'],
          client_secret=spotify_details['client_secret'],
          redirect_uri=spotify_details['redirect_uri'],    
          scope=scope,open_browser=False))


# First request: only used to read the total number of playlists in the category
results = sp.category_playlists(category_id="hiphop", limit=5, country="US", offset=0)
total = results["playlists"]["total"]

# Page through the category, 50 playlists per request
df = pd.DataFrame([], columns=['id', 'name', 'external_urls.spotify'])
for offset in range(0, total, 50):
    results = sp.category_playlists(category_id="hiphop", limit=50, country="US", offset=offset)
    playlists = pd.json_normalize(results['playlists']['items'])
    df = pd.concat([df, playlists], ignore_index=True)
df

I can only get around 104 playlists when I run:

print(len(df)) 
>>104

P.S. This number varies from roughly 80 to 100+ depending on the location of your account.


Solution

  • The main idea is the same as @Nima Akbarzadeh's: page through the results with offset.

    I call the Spotify API with axios on Node.js.

    The code gets the playlists first, then loops over them and gets the tracks of each playlist.

    This code collects the songs from the playlists in Spotify's hiphop category.

    const axios = require('axios')
    
    const API_KEY='<your client ID>'
    const API_KEY_SECRET='<your client Secret>'
    
    // Request an app-only access token (Client Credentials flow)
    const getToken = async () => {
        try {
            const resp = await axios.post(
                'https://accounts.spotify.com/api/token',
                '',
                {
                    params: {
                        'grant_type': 'client_credentials'
                    },
                    auth: {
                        username: API_KEY,
                        password: API_KEY_SECRET
                    }
                }
            );
            return resp.data.access_token;
        } catch (err) {
            console.error(err)
            throw err
        }
    };
    // Page through a browse category and collect its playlists
    const getCategories = async (category_id, token) => {
        try {
            const limit = 20
            let offset = 0
            let next = null
            const playlists = [];
            do {
                const resp = await axios.get(
                    `https://api.spotify.com/v1/browse/categories/${category_id}/playlists?country=US&offset=${offset}&limit=${limit}`,
                    {
                        headers: {
                            'Accept-Encoding': 'application/json',
                            'Authorization': `Bearer ${token}`,
                        }
                    }
                );

                for (const item of resp.data.playlists.items) {
                    // Skip null items and items without a name
                    if (item?.name != null) {
                        playlists.push({
                            name: item.name,
                            external_urls: item.external_urls.spotify,
                            type: item.type,
                            id: item.id
                        })
                    }
                }

                offset = offset + limit

                // `next` is null on the last page
                next = resp.data.playlists.next
            } while (next != null)
            return playlists
        } catch (err) {
            console.error(err)
            throw err
        }
    }
    
    // Fetch each playlist and collect the tracks included in the response
    const getTracks = async (playlists, token) => {
        try {
            const tracks = [];
            for (const playlist of playlists) {
                const resp = await axios.get(
                    `https://api.spotify.com/v1/playlists/${playlist.id}`,
                    {
                        headers: {
                            'Accept-Encoding': 'application/json',
                            'Authorization': `Bearer ${token}`,
                        }
                    }
                );
                for (const item of resp.data.tracks.items) {
                    // Skip entries whose track is missing
                    if (item.track?.name != null) {
                        tracks.push({
                            name: item.track.name,
                            external_urls: item.track.external_urls.spotify
                        })
                    }
                }
            }
            return tracks
        } catch (err) {
            console.error(err)
            throw err
        }
    };
    
    getToken()
        .then(token => getCategories('hiphop', token)
            .then(playlists => getTracks(playlists, token)))
        .then(tracks => {
            for (const track of tracks) {
                console.log(track)
            }
        })
        .catch(error => {
            console.log(error.message);
        });
    

    I got 6435 songs:

    $ node get-data.js
    [
      {
        name: 'RapCaviar',
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DX0XUsuxWHRQd',
        type: 'playlist',
        id: '37i9dQZF1DX0XUsuxWHRQd'
      },
      {
        name: "Feelin' Myself",
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DX6GwdWRQMQpq',
        type: 'playlist',
        id: '37i9dQZF1DX6GwdWRQMQpq'
      },
      {
        name: 'Most Necessary',
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DX2RxBh64BHjQ',
        type: 'playlist',
        id: '37i9dQZF1DX2RxBh64BHjQ'
      },
      {
        name: 'Gold School',
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWVA1Gq4XHa6U',
        type: 'playlist',
        id: '37i9dQZF1DWVA1Gq4XHa6U'
      },
      {
        name: 'Locked In',
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWTl4y3vgJOXW',
        type: 'playlist',
        id: '37i9dQZF1DWTl4y3vgJOXW'
      },
      {
        name: 'Taste',
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWSUur0QPPsOn',
        type: 'playlist',
        id: '37i9dQZF1DWSUur0QPPsOn'
      },
      {
        name: 'Get Turnt',
        external_urls: 'https://open.spotify.com/playlist/37i9dQZF1DWY4xHQp97fN6',
        type: 'playlist',
        id: '37i9dQZF1DWY4xHQp97fN6'
      },
    ...
     {
        name: 'BILLS PAID (feat. Latto & City Girls)',
        external_urls: 'https://open.spotify.com/track/0JiLQRLOeWQdPC9rVpOqqo'
      },
      {
        name: 'Persuasive (with SZA)',
        external_urls: 'https://open.spotify.com/track/67v2UHujFruxWrDmjPYxD6'
      },
      {
        name: 'Shirt',
        external_urls: 'https://open.spotify.com/track/34ZAzO78a5DAVNrYIGWcPm'
      },
      {
        name: 'Back 2 the Streets',
        external_urls: 'https://open.spotify.com/track/3Z9aukqdW2HuzFF1x9lKUm'
      },
      {
        name: 'FTCU (feat. GloRilla & Gangsta Boo)',
        external_urls: 'https://open.spotify.com/track/4lxTmHPgoRWwM9QisWobJL'
      },
      {
        name: 'My Way',
        external_urls: 'https://open.spotify.com/track/5BcIBbBdkjSYnf5jNlLG7j'
      },
      {
        name: 'Donk',
        external_urls: 'https://open.spotify.com/track/58lmOL5ql1YIXrpRpoYi3i'
      },
      ... 6335 more items
    ]
    
    $ node get-data.js > result.json
    


    Update: Python version

    import spotipy
    from spotipy.oauth2 import SpotifyOAuth
    import json
    import re
    
    SCOPE = ['user-library-read',
        'user-follow-read',
        'user-top-read',
        'playlist-read-private',
        'playlist-read-collaborative',
        'playlist-modify-public',
        'playlist-modify-private']
    USER_ID = '<your user id>'
    REDIRECT_URI = '<your redirect uri>'
    CLIENT_ID = '<your client id>'
    CLIENT_SECRET = '<your client secret>'
    auth_manager = SpotifyOAuth(
        scope=SCOPE,
        username=USER_ID,
        redirect_uri=REDIRECT_URI,
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET)
    
    def get_categories():
        """Return the playlists in the 'hiphop' browse category (US)."""
        try:
            sp = spotipy.Spotify(auth_manager=auth_manager)
            query_limit = 50
            categories = []
            new_offset = 0
            while True:
                results = sp.category_playlists(category_id='hiphop', limit=query_limit, country='US', offset=new_offset)
                for item in results['playlists']['items']:
                    if item is not None and item['name'] is not None:
                        categories.append({
                            'id': item['id'],
                            'name': item['name'],
                            'url': item['external_urls']['spotify'],
                            'tracks': item['tracks']['href'],
                            # Same ID that appears in the tracks href:
                            # https://api.spotify.com/v1/playlists/<id>/tracks
                            'playlist_id': item['id'],
                            'type': item['type']
                        })
                new_offset = new_offset + query_limit
                # 'next' is None on the last page
                if results['playlists']['next'] is None:
                    break
            return categories
        except Exception as e:
            print('Failed to call get_categories: ' + str(e))
            return []
    
    def get_songs(categories):
        """Return the tracks included in sp.playlist() for every playlist."""
        try:
            sp = spotipy.Spotify(auth_manager=auth_manager)
            songs = []
            for category in categories:
                if category is None:
                    continue
                results = sp.playlist(playlist_id=category['playlist_id'])
                for item in results['tracks']['items']:
                    track = item['track'] if item is not None else None
                    # Skip unavailable tracks instead of dropping the rest of the playlist
                    if track is None or track['id'] is None or track['name'] is None:
                        continue
                    url = track['external_urls'].get('spotify')
                    if url is None:
                        continue
                    songs.append({
                        'id': track['id'],
                        'name': track['name'],
                        'url': url
                    })
            return songs
        except Exception as e:
            print('Failed to call get_songs: ' + str(e))
            return []
    
    categories = get_categories()
    songs = get_songs(categories)
    print(json.dumps(songs))
    # print(len(songs)) -> 6021
    

    Save the result with:

    $ python get-songs.py > all-songs.json
    
