python, amazon-web-services, python-requests, common-crawl

Can't stream files from Amazon S3 using requests


I'm trying to stream crawl data from Common Crawl, but Amazon S3 returns an error when I use the stream=True parameter with requests.get. Here is an example:

import requests

resp = requests.get(url, stream=True)
print(resp.raw.read())

When I run this on a Common Crawl S3 HTTPS URL, I get this response:

b'<?xml version="1.0" encoding="UTF-8"?>\n<Error><Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>
<RequestId>3652F4DCFAE0F641</RequestId>
<HostId>Do0NlzMr6/wWKclt2G6qrGCmD5gZzdj5/GNTSGpHrAAu5+SIQeY15WC3VC6p/7/1g2q+t+7vllw=</HostId></Error>'

I am using warcio and need a streaming file object as input to the archive iterator; I can't download the file all at once because of limited memory. What should I do?

PS. The URL I request in the example is https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz


Solution

  • There is an error in your URL. Compare the key in the response you are getting:

    <Key>crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz\n</Key>
    

    to the one in the intended url:

    https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2018-05/segments/1516084886237.6/warc/CC-MAIN-20180116070444-20180116090444-00000.warc.gz
    

    The key you request ends with a newline character, which was probably picked up while reading the URL from a file: readline() (and iterating over a file) leaves the trailing '\n' on every line. Call .strip() on each line to remove the trailing newline before building the URL; see the sketch below.
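
    A minimal sketch of the corrected flow (the cc_paths.txt file name and the prefix variable are assumptions for illustration, not from the original post): strip each line before building the URL, then hand the streaming response straight to warcio's ArchiveIterator so records are read one at a time:

    import requests
    from warcio.archiveiterator import ArchiveIterator

    prefix = 'https://commoncrawl.s3.amazonaws.com/'

    with open('cc_paths.txt') as f:  # hypothetical file listing WARC paths
        for line in f:
            url = prefix + line.strip()  # .strip() removes the trailing '\n'
            resp = requests.get(url, stream=True)
            for record in ArchiveIterator(resp.raw):
                if record.rec_type == 'response':
                    # each record is handled without loading the whole .warc.gz
                    print(record.rec_headers.get_header('WARC-Target-URI'))

    With stream=True the body is not downloaded up front; ArchiveIterator pulls from resp.raw incrementally, so memory use stays bounded regardless of the archive size.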