I am trying to download a series of WARC files from the CommonCrawl database, each of them about 25 MB. This is my script:
import json
import urllib.request
from urllib.error import HTTPError
from src.Util import rooted

with open(rooted('data/alexa.txt'), 'r') as alexa:
    for i, url in enumerate(alexa):
        if i % 1000 == 0:
            try:
                request = 'http://index.commoncrawl.org/CC-MAIN-2018-13-index?url={search}*&output=json' \
                    .format(search=url.rstrip())
                page = urllib.request.urlopen(request)
                for line in page:
                    result = json.loads(line)
                    urllib.request.urlretrieve('https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
                                               rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
            except HTTPError:
                pass
What this currently does is query the CommonCrawl index REST API for the link to each WARC file and then download it into the 'data/warc' folder.

The problem is that each urllib.request.urlretrieve() call blocks until the file has been completely downloaded before the next download request is issued. Is there a way to have the urllib.request.urlretrieve() call return as soon as the download has been started and finish the download afterwards, or to spin up a new thread for each request so that all the files download simultaneously?
Thanks
Use threads; futures, even :)
from concurrent.futures import ThreadPoolExecutor

jobs = []
with ThreadPoolExecutor(max_workers=100) as executor:
    for line in page:
        result = json.loads(line)  # parse the index record, as in your original loop
        future = executor.submit(
            urllib.request.urlretrieve,
            'https://commoncrawl.s3.amazonaws.com/%s' % result['filename'],
            rooted('data/warc/%s' % ''.join(c for c in result['url'] if c.isalnum())))
        jobs.append(future)

...

for f in jobs:
    print(f.result())
read more here: https://docs.python.org/3/library/concurrent.futures.html
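Putting it together with the loop from your question, a minimal sketch could look like the following. It keeps your rooted() helper and index query as-is; the fetch() wrapper, the max_workers value, and the use of as_completed() (which yields each future as its download finishes) are just illustrative choices, not the only way to wire it up:

from concurrent.futures import ThreadPoolExecutor, as_completed
import json
import urllib.request
from urllib.error import HTTPError

from src.Util import rooted  # helper from your script, assumed to resolve project-relative paths


def fetch(filename, url):
    # Runs in a worker thread: download one WARC file and return its local path.
    dest = rooted('data/warc/%s' % ''.join(c for c in url if c.isalnum()))
    urllib.request.urlretrieve('https://commoncrawl.s3.amazonaws.com/%s' % filename, dest)
    return dest


jobs = []
with ThreadPoolExecutor(max_workers=100) as executor:
    with open(rooted('data/alexa.txt'), 'r') as alexa:
        for i, url in enumerate(alexa):
            if i % 1000 != 0:
                continue
            try:
                request = 'http://index.commoncrawl.org/CC-MAIN-2018-13-index?url={search}*&output=json' \
                    .format(search=url.rstrip())
                page = urllib.request.urlopen(request)
                for line in page:
                    result = json.loads(line)
                    jobs.append(executor.submit(fetch, result['filename'], result['url']))
            except HTTPError:
                pass

    # Futures come back here as soon as their downloads complete, in whatever order they finish.
    for future in as_completed(jobs):
        print('saved', future.result())

Leaving the with block also waits for any remaining downloads, so iterating jobs in submission order after the block works just as well; as_completed() simply lets you report progress as each file lands.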