python gevent greenlets

Correct greenlet termination


I am using gevent to download some HTML pages. Some websites are far too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests I make. For that I use gevent's Timeout.

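import gevent
from gevent import Timeout

# one 10-second budget for the whole group of requests
timeout = Timeout(10)
timeout.start()

def downloadSite():
    # download the site's URLs one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print('Lost state here')
finally:
    # stop the timer so it cannot fire later
    timeout.cancel()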

The problem is that I lose all the state when the exception fires.

Imagine I crawl the site 'www.test.com'. I manage to download 10 URLs right before the site admins decide to take the web server down for maintenance. In that case I lose all information about the pages already crawled when the exception fires.

The question is: how do I save the state and process the data even if a Timeout occurs?


Solution

  • Why not try something like:

    import gevent
    from gevent import Timeout
    
    def downloadSite(url):
        # each URL gets its own 10-second budget; a slow URL
        # raises Timeout only inside its own greenlet
        with Timeout(10):
            downloadUrl(url)
    
    urls = ["url1", "url2", "url3"]
    
    workers = []
    limit = 5
    for url in urls:
        workers.append(gevent.spawn(downloadSite, url))
        # limit to 5 URL requests at a time
        if len(workers) == limit:
            gevent.joinall(workers)
            workers = []
    # wait for the last, possibly partial, batch
    gevent.joinall(workers)
    

    You could also record a status for every URL in a dict, or append the URLs that fail to a separate list so they can be retried later.
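
The status dict and retry list suggested above might look roughly like the sketch below. It is only an illustration under some assumptions: `download_url` is a hypothetical stand-in for the asker's `downloadUrl` (it just sleeps and treats any URL containing "slow" as a hanging server), and `crawl`, `fetch`, `per_url_timeout` and `concurrency` are names made up for this example. It also swaps the manual batching for `gevent.pool.Pool`, which caps the number of greenlets running at once. Because `results` and `failed` live outside the greenlets, whatever was downloaded before a timeout is still there afterwards.

```python
import gevent
from gevent import Timeout
from gevent.pool import Pool


def download_url(url):
    # Hypothetical stand-in for the asker's downloadUrl():
    # URLs containing "slow" simulate a server that hangs.
    gevent.sleep(5 if "slow" in url else 0.01)
    return "<html>%s</html>" % url


def crawl(urls, per_url_timeout=10, concurrency=5):
    results = {}  # url -> page body, for every URL that finished in time
    failed = []   # URLs that timed out, kept so they can be retried later

    def fetch(url):
        try:
            with Timeout(per_url_timeout):
                results[url] = download_url(url)
        except Timeout:
            # Only this URL is lost; earlier results stay in `results`.
            failed.append(url)

    pool = Pool(concurrency)    # at most `concurrency` greenlets at a time
    for url in urls:
        pool.spawn(fetch, url)  # waits for a free slot when the pool is full
    pool.join()                 # wait for the remaining greenlets
    return results, failed


# A short timeout here just keeps the demonstration quick.
results, failed = crawl(["url1", "slow-url2", "url3"], per_url_timeout=0.5)
```

After the call, `results` holds the pages that completed and `failed` holds the URLs worth retrying, so a timeout on one URL no longer discards the rest of the crawl.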