I wrote a short program that uses the Discogs API from Python, but it is so slow that it is unusable for real web applications. Here is the Python code and the Python profile results (only the time-consuming spots are included):
# -*- coding: utf-8 -*-
import profile
import discogs_client as discogs

def main():
    discogs.user_agent = 'Mozilla/5.0'
    # Dump the released albums into a file. You could also print them to the console.
    f = open('DiscogsTestResult.txt', 'w+')
    # Use another band if you like, but if you decide to take "beatles"
    # you will wait an hour (because of the number of releases)!
    artist = discogs.Artist('Faust')
    print >> f, artist
    print >> f, " "
    artistReleases = artist.releases
    for r in artistReleases:
        print >> f, r.data
        print >> f, "---------------------------------------------"

print 'Performance Analysis of Discogs API'
print '=' * 80
profile.run('print main(); print')
And here is the result of Python's profiler:
Performance Analysis of Discogs API
================================================================================
82807 function calls (282219 primitive calls) in 177.544 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
188 121.013 0.644 121.013 0.644 :0(connect)
206 52.080 0.253 52.080 0.253 :0(recv)
1 0.036 0.036 177.494 177.494 <string>:1(<module>)
188 0.013 0.000 175.234 0.932 adapters.py:261(send)
376 0.005 0.000 0.083 0.000 adapters.py:94(init_poolmanager)
188 0.008 0.000 176.569 0.939 api.py:17(request)
188 0.007 0.000 176.577 0.939 api.py:47(get)
188 0.015 0.000 173.922 0.925 connectionpool.py:268(_make_request)
188 0.015 0.000 174.034 0.926 connectionpool.py:332(urlopen)
1 0.496 0.496 177.457 177.457 discogsTestFullDump.py:6(main)
564 0.009 0.000 176.613 0.313 discogs_client.py:66(_response)
188 0.012 0.000 176.955 0.941 discogs_client.py:83(data)
188 0.011 0.000 51.759 0.275 httplib.py:363(_read_status)
188 0.017 0.000 52.520 0.279 httplib.py:400(begin)
188 0.003 0.000 121.198 0.645 httplib.py:754(connect)
188 0.007 0.000 121.270 0.645 httplib.py:772(send)
188 0.005 0.000 121.276 0.645 httplib.py:799(_send_output)
188 0.003 0.000 121.279 0.645 httplib.py:941(endheaders)
188 0.003 0.000 121.348 0.645 httplib.py:956(request)
188 0.016 0.000 121.345 0.645 httplib.py:977(_send_request)
188 0.009 0.000 52.541 0.279 httplib.py:994(getresponse)
1 0.000 0.000 177.544 177.544 profile:0(print main(); print)
188 0.032 0.000 176.322 0.938 sessions.py:225(request)
188 0.030 0.000 175.513 0.934 sessions.py:408(send)
752 0.015 0.000 121.088 0.161 socket.py:223(meth)
2256 0.224 0.000 52.127 0.023 socket.py:406(readline)
188 0.009 0.000 121.195 0.645 socket.py:537(create_connection)
Does anybody have an idea how to speed this up? I hope that with some changes in discogs_client.py it could be made faster. Maybe switching from httplib to something else, or whatever. Or maybe it would be faster to use another protocol instead of HTTP?
(The source of discogs_client.py can be found here: https://github.com/discogs/discogs_client/blob/master/discogs_client.py)
If anybody has an idea, please respond; a lot of people would benefit from this.
Regards, Daniel
UPDATE: From the Discogs documentation: "Requests are throttled by the server to one per second per IP address. Your application should (but doesn't have to) take this into account and throttle requests locally, too."
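For reference, a minimal local throttle could look something like the sketch below; the one-second interval comes from the documentation quoted above, everything else (class name, structure) is just an illustration:

import time

class Throttle(object):
    """Keep consecutive calls at least `interval` seconds apart."""
    def __init__(self, interval=1.0):
        self.interval = interval
        self._last = 0.0

    def wait(self):
        elapsed = time.time() - self._last
        if elapsed < self.interval:
            time.sleep(self.interval - elapsed)
        self._last = time.time()

throttle = Throttle(1.0)  # one request per second, as the Discogs docs ask

# before every request to the Discogs API:
#     throttle.wait()
#     response = requests.get(...)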
The bottleneck seems to be at the Discogs server end, retrieving the individual releases. There is nothing you can really do about that, except give them money for faster servers.
My suggestion would be to cache the results; it's probably the only thing that will help. Rewrite discogs.APIBase._response as follows:
def _response(self):
    if not self._cached_response:
        self._cached_response = self._load_response_from_disk()
        if not self._cached_response:
            if not self._check_user_agent():
                raise UserAgentError("Invalid or no User-Agent set.")
            self._cached_response = requests.get(self._uri, params=self._params, headers=self._headers)
            self._save_response_to_disk()
    return self._cached_response
An alternative approach is to write requests to a log and reply "we don't know, try again later", then, in another process, read the log, download the data, and store it in a database. When the user comes back later, the requested data will already be there.
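A very rough sketch of that deferred approach, with the file name, table layout, and function names all invented for illustration (in reality the web side and the worker would be separate processes, each with its own connection):

import sqlite3
import time
import requests

QUEUE_DB = 'discogs_queue.db'   # invented name

def _open():
    db = sqlite3.connect(QUEUE_DB)
    db.execute('CREATE TABLE IF NOT EXISTS queue (uri TEXT PRIMARY KEY, body TEXT, fetched REAL)')
    return db

def request_or_defer(uri):
    """Web side: return stored data if we already have it; otherwise queue the
    URI and return None so the caller can say "try again later"."""
    db = _open()
    row = db.execute('SELECT body FROM queue WHERE uri = ? AND body IS NOT NULL', (uri,)).fetchone()
    if row is None:
        db.execute('INSERT OR IGNORE INTO queue (uri) VALUES (?)', (uri,))
        db.commit()
    db.close()
    return row[0] if row else None

def worker():
    """Separate process: work through the queue, respecting the 1 request/second limit."""
    db = _open()
    while True:
        row = db.execute('SELECT uri FROM queue WHERE body IS NULL LIMIT 1').fetchone()
        if row:
            resp = requests.get(row[0], headers={'user-agent': 'Mozilla/5.0'})
            db.execute('UPDATE queue SET body = ?, fetched = ? WHERE uri = ?',
                       (resp.content, time.time(), row[0]))
            db.commit()
        time.sleep(1)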
For the caching approach, you would need to write _load_response_from_disk() and _save_response_to_disk() yourself. The stored data should be keyed on _uri, _params, and _headers, and should include a timestamp alongside the data. If the data is too old, or not found, return None. (Under the circumstances I would suggest an expiry on the order of months, but since I have no idea whether the numbering is persistent, I would start with days to weeks.) The storage would have to handle concurrent access and have fast indexes - probably a database.
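Here is a minimal sketch of what those two helpers could look like, using sqlite3 and keying on the URI, params, and headers as described above. The methods would go on discogs.APIBase, the rest at module level; the file name, expiry value, and the _CachedResponse wrapper are all assumptions - in particular, check which attributes of the response discogs_client actually uses (I assume .content and .status_code here):

import json
import sqlite3
import time

CACHE_DB = 'discogs_cache.db'     # invented file name
CACHE_MAX_AGE = 7 * 24 * 3600     # start with a week, tune once you know how stable the data is

class _CachedResponse(object):
    # Just enough of a response object for the rest of discogs_client;
    # assumes only .content and .status_code are used.
    def __init__(self, content, status_code=200):
        self.content = content
        self.status_code = status_code

def _cache_db(self):
    db = sqlite3.connect(CACHE_DB)
    db.execute('CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, content TEXT, stamp REAL)')
    return db

def _cache_key(self):
    # key on URI, params and headers, as described above
    return json.dumps([self._uri, self._params, self._headers], sort_keys=True)

def _load_response_from_disk(self):
    db = self._cache_db()
    row = db.execute('SELECT content, stamp FROM cache WHERE key = ?', (self._cache_key(),)).fetchone()
    db.close()
    if row is None or time.time() - row[1] > CACHE_MAX_AGE:
        return None                              # not found, or too old
    return _CachedResponse(row[0])

def _save_response_to_disk(self):
    if self._cached_response.status_code != 200:
        return                                   # only cache successful responses
    db = self._cache_db()
    db.execute('INSERT OR REPLACE INTO cache VALUES (?, ?, ?)',
               (self._cache_key(), self._cached_response.content, time.time()))
    db.commit()
    db.close()

On a single machine, sqlite gives you the concurrent access and the index on the key for free; if several front ends share the cache, point this at a proper database server instead.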