Tags: python, http, curl, wget

Content-length available in Curl, Wget, but not in Python Requests


I have a URL pointing to a binary file that I need to download after checking its size, because the download should only be (re-)executed if the local file size differs from the remote file size.

This is how it works with wget (anonymized host names and IPs):

$ wget <URL>
--2020-02-17 11:09:18--  <URL>
Resolving <URL> (<host>)... <IP>
Connecting to <host> (<host>)|<ip>|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31581872 (30M) [application/x-gzip]
Saving to: ‘[...]’

This also works fine with the --continue flag in order to resume a download, including skipping if the file was completely downloaded earlier.

I can do the same with curl; the Content-Length is also present:

$ curl -I <url>
HTTP/2 200 
date: Mon, 17 Feb 2020 13:11:55 GMT
server: Apache/2.4.25 (Debian)
strict-transport-security: max-age=15768000
last-modified: Fri, 14 Feb 2020 15:42:29 GMT
etag: "[...]"
accept-ranges: bytes
content-length: 31581872
vary: Accept-Encoding
content-type: application/x-gzip

In Python, I try to implement the same logic by checking the Content-Length header using the requests library:

        with requests.get(url, stream=True) as response:
            total_size = int(response.headers.get("Content-length"))

            if not response.ok:
                logger.error(
                    f"Error {response.status_code} when downloading file from {url}"
                )
            elif os.path.exists(file) and os.stat(file).st_size == total_size:
                logger.info(f"File '{file}' already exists, skipping download.")
            else:
                [...] # download file

It turns out that the Content-Length header is never present here, i.e. headers.get() returns None. I know this could be worked around by passing a default value to the get() call, but for the purpose of debugging, this example deliberately triggers an exception:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType' 
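As an aside, a defensive sketch (a hypothetical helper, not part of the original script) that avoids this TypeError when the server omits the header:

```python
def remote_size(headers):
    """Parse Content-Length if present; return None otherwise.

    Avoids the TypeError from int(None) when the server omits the
    header (e.g. for chunked responses). Note that requests' real
    response.headers is a case-insensitive dict, so the exact
    capitalization of "Content-Length" does not matter there.
    """
    value = headers.get("Content-Length")
    return int(value) if value is not None else None

# Header sets mimicking the two kinds of responses seen here:
print(remote_size({"Content-Length": "31581872"}))    # 31581872
print(remote_size({"Transfer-Encoding": "chunked"}))  # None
```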

I can confirm manually that the Content-length header is not there:

requests.get(url, stream=True).headers
{'Date': '[...]', 'Server': '[...]', 'Strict-Transport-Security': '[...]', 'Upgrade': '[...]', 'Connection': 'Upgrade, Keep-Alive', 'Last-Modified': '[...]', 'ETag': '[...]', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding', 'Content-Encoding': 'gzip', 'Keep-Alive': 'timeout=15, max=100', 'Transfer-Encoding': 'chunked', 'Content-Type': 'application/x-gzip'}

For other URLs, though, this logic works fine, i.e. I do get the Content-Length header.

When using requests.head(url) (without stream=True), I get the same headers except for Transfer-Encoding.

I understand that a server does not have to send a Content-length header. However, wget and curl clearly do get that header. What do they do differently from my Python implementation?


Solution

  • We just encountered this with AWS CloudFront; for us, adding Accept-Encoding: "identity" made it work with Node.js.

    - Without Accept-Encoding => no Content-Length; the response is sent with br compression
    - With Accept-Encoding: identity => Content-Length is available

    Because compression happens on the fly, CloudFront cannot know the final length of the compressed content in advance. Adding a Content-Length header would require buffering the full response and calculating its length before returning it to the client, which would add latency if the original response were large. That is why, when compression is used, the Content-Length header is replaced by a "Transfer-Encoding: chunked" header, which indicates that the data is sent as a series of chunks. With HTTP/1.1, the Content-Length header is omitted and "Transfer-Encoding: chunked" is added to the response headers instead. With HTTP/2, neither "Content-Length" nor "Transfer-Encoding" is used for compressed content; HTTP/2 uses its own framing mechanism to convey to the client that the response is chunked.
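    Applied to the question's requests code, the fix is to send Accept-Encoding: identity explicitly, so the server returns the raw bytes and can include Content-Length. A minimal sketch (the URL is a placeholder) showing requests' default Accept-Encoding and the per-request override:

    ```python
    import requests

    url = "https://example.com/file.tar.gz"  # hypothetical placeholder URL

    session = requests.Session()
    # By default, requests advertises compressed encodings, which invites
    # the server to compress on the fly and drop Content-Length:
    print(session.headers["Accept-Encoding"])  # e.g. "gzip, deflate"

    # Override per request so the server sends the identity encoding:
    prepared = session.prepare_request(
        requests.Request("GET", url, headers={"Accept-Encoding": "identity"})
    )
    print(prepared.headers["Accept-Encoding"])  # identity

    # In real use, the download check would then become:
    # with session.send(prepared, stream=True) as response:
    #     total_size = int(response.headers.get("Content-Length", 0))
    ```
    
    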