Fellow developers,
I am trying to research how many English songs have been in Spotify's "Global top50" for the last 5 years. For that, I was planning to extract all the songs that belonged to that ranking and check their language.
However, Spotify's public API does not provide language metadata for songs.
I tried two workarounds, but neither succeeded:
First workaround:
I tried the MusixMatch API for developers because, according to their documentation, the methods matcher.lyrics.get and track.lyrics.get should return the song language in the attribute "lyrics_language".
However, the responses I get do not include this value. I assume it is only available in their paid PLUS service.
Second workaround:
In this post, I found nice information about how to retrieve the language of performance using the Spotify client (different than the open Spotify API).
The call to the client service requires a GID, which is different from the usual Spotify trackID: the GID is in base16, while the trackID is in base62.
While using the example provided in the post - "Oh Yeah" by Aime Simone - I can get the same song when using:
Up to this point everything works, since 0VtMV3IYHAu7fyZmqnGE99 in base62 is the same as 1e763423ad0e4862b20f9dbfdadc2cb7 in base16.
But when I take any other trackID in base62 and convert it to base16 for the spclient call, it does not work and I get a 404 Not Found.
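For reference, here is a minimal Python sketch of the base62 to base16 conversion described above. Two things in it are assumptions rather than anything confirmed by Spotify: the alphabet order (digits, then lowercase, then uppercase, as used by open-source Spotify clients) and the zero-padding of the result to 32 hex digits. The function name track_id_to_gid is made up for illustration. A converter that drops leading zeros would produce a shorter GID for some tracks, which is one possible (unverified) reason for a 404.

```python
# Assumption: Spotify IDs use the alphabet 0-9, a-z, A-Z in that order.
BASE62_ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def track_id_to_gid(track_id: str) -> str:
    """Convert a base62 track ID to a hex GID, zero-padded to 32 digits."""
    value = 0
    for char in track_id:
        # str.index raises ValueError for characters outside the alphabet.
        value = value * 62 + BASE62_ALPHABET.index(char)
    # Pad to 32 hex digits; hex(value) alone would drop leading zeros.
    return format(value, "032x")


if __name__ == "__main__":
    # The trackID quoted in the question.
    print(track_id_to_gid("0VtMV3IYHAu7fyZmqnGE99"))
```

Running it on the trackID from the example should reproduce the GID quoted above, which is a quick way to check the alphabet assumption before trying other tracks.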
Conclusion
It would be very helpful if any of you could let me know how to:
Many thanks in advance. I hope the explanation was clear enough :)
From a suggested edit to the other answer:
Here is a Python library which scrapes the full lyrics of songs from Genius. https://lyricsgenius.readthedocs.io/en/master/
You can get the lyrics of a song and save them to a file, demonstrated here: https://melaniewalsh.github.io/Intro-Cultural-Analytics/04-Data-Collection/08-Collect-Genius-Lyrics.html
However, langdetect's detect function returns only one language. So even if a song is in multiple languages, you will mostly get the dominant one. You could use the detect_langs method instead and add some logic to include any number of languages whose probability is within a certain tolerance.
>>> from langdetect import detect_langs
>>> detect_langs("Otec matka syn.")
[sk:0.572770823327, pl:0.292872522702, cs:0.134356653968]
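The tolerance logic mentioned above could look something like the sketch below. The helper names languages_within_tolerance and detect_song_languages and the default tolerance of 0.3 are hypothetical choices for illustration, not part of langdetect. The only langdetect pieces used are detect_langs, the .lang and .prob attributes of its results, and DetectorFactory.seed for repeatable output.

```python
def languages_within_tolerance(candidates, tolerance=0.3):
    """Return languages whose probability is within `tolerance` of the best one.

    `candidates` is a list of (language_code, probability) pairs.
    """
    if not candidates:
        return []
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    best_prob = ranked[0][1]
    return [lang for lang, prob in ranked if best_prob - prob <= tolerance]


def detect_song_languages(lyrics, tolerance=0.3):
    """Detect candidate languages for a lyrics string using langdetect."""
    # Imported here so the helper above works without langdetect installed.
    from langdetect import DetectorFactory, detect_langs

    DetectorFactory.seed = 0  # langdetect results vary between runs unless seeded
    guesses = detect_langs(lyrics)
    pairs = [(guess.lang, guess.prob) for guess in guesses]
    return languages_within_tolerance(pairs, tolerance)


if __name__ == "__main__":
    # Probabilities taken from the detect_langs example above.
    sample = [("sk", 0.572770823327), ("pl", 0.292872522702), ("cs", 0.134356653968)]
    print(languages_within_tolerance(sample))  # ['sk', 'pl']
```

With a tolerance of 0.3, "pl" is kept because it is about 0.28 below the top score, while "cs" is dropped. A smaller tolerance makes the result closer to what detect returns.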