Tags: python, http, gzip, transfer-encoding

Uncompressed size of a webpage using chunked transfer encoding and gzip compression


I'm writing an application that calculates the savings we got after using gzip on a web page. When the user inputs the URL of the web page that used gzip, the application should spit out the savings in size due to gzip.

How should I approach this problem?

This is what I am getting as header for a GET request on the page:

{
    'X-Powered-By': 'PHP/5.5.9-1ubuntu4.19',
    'Transfer-Encoding': 'chunked',
    'Content-Encoding': 'gzip',
    'Vary': 'Accept-Encoding', 
    'Server': 'nginx/1.4.6 (Ubuntu)',
    'Connection': 'keep-alive',
    'Date': 'Thu, 10 Nov 2016 09:49:58 GMT',
    'Content-Type': 'text/html'
}

I am retrieving the page with requests:

r = requests.get(url, headers=headers)
data = r.text
print "Webpage size:", len(data) / 1024

Solution

  • If you have already downloaded the URL (using a requests GET request without the stream option), you already have both sizes available, as the whole response is downloaded and decompressed; the compressed length is reported in the Content-Length header:

    from __future__ import division

    import requests

    r = requests.get(url, headers=headers)
    # Content-Length reports the compressed, on-the-wire size
    compressed_length = int(r.headers['content-length'])
    # r.content is the body after requests has decompressed it
    decompressed_length = len(r.content)

    ratio = compressed_length / decompressed_length
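
    To report the savings the question asks about, take the complement of that ratio; a minimal sketch:

    savings = 1 - ratio
    print('Saved {:.1%} of the original size'.format(savings))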
    

    Alternatively, you could compare the Content-Length header of a HEAD request sent with Accept-Encoding: identity against that of one sent with Accept-Encoding: gzip:

    # ask the server not to compress the response
    no_gzip = {'Accept-Encoding': 'identity'}
    no_gzip.update(headers)
    uncompressed_length = int(requests.head(url, headers=no_gzip).headers['content-length'])
    # ask the server to gzip the response
    force_gzip = {'Accept-Encoding': 'gzip'}
    force_gzip.update(headers)
    compressed_length = int(requests.head(url, headers=force_gzip).headers['content-length'])
    

    However, this may not work for all servers; servers that generate content dynamically routinely omit or fake the Content-Length header in such cases to avoid having to render the content first.

    If the resource is served with chunked transfer encoding, there won't be a Content-Length header at all, in which case a HEAD request may or may not provide you with the correct information either.
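
    You can check for that case up front before deciding how to measure; a minimal sketch (assuming the server responds to HEAD requests at all):

    import requests

    r = requests.head(url, headers={'Accept-Encoding': 'gzip'})
    if 'content-length' not in r.headers:
        # no usable Content-Length (e.g. Transfer-Encoding: chunked);
        # fall back to streaming the full body, as shown below
        pass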

    In that case you'd have to stream the whole response and extract the decompressed size from the end of the stream: the GZIP format stores it in the trailing ISIZE field, a little-endian 4-byte unsigned integer giving the uncompressed size modulo 2**32. Use the stream() method on the raw urllib3 response object:

    import requests
    from collections import deque
    
    if hasattr(int, 'from_bytes'):
        # Python 3.2 and up
        _extract_size = lambda q: int.from_bytes(bytes(q), 'little')
    else:
        import struct
        _le_int = struct.Struct('<I').unpack
        _extract_size = lambda q: _le_int(b''.join(q))[0]
    
    def get_content_lengths(url, headers=None, chunk_size=2048):
        """Return the compressed and uncompressed lengths for a given URL
    
        Works for all resources accessible by GET, regardless of transfer-encoding
        and discrepancies between HEAD and GET responses. This does have
        to download the full request (streamed) to determine sizes.
    
        """
        only_gzip = {'Accept-Encoding': 'gzip'}
        only_gzip.update(headers or {})
        # Set `stream=True` to ensure we can access the original stream:
        r = requests.get(url, headers=only_gzip, stream=True)
        r.raise_for_status()
        if r.headers.get('Content-Encoding') != 'gzip':
            raise ValueError('Response not gzip-compressed')
        # we only need the very last 4 bytes of the data stream
        last_data = deque(maxlen=4)
        compressed_length = 0
        # stream directly from the urllib3 response so we can ensure the
        # data is not decompressed as we iterate
        for chunk in r.raw.stream(chunk_size, decode_content=False):
            compressed_length += len(chunk)
            last_data.extend(chunk)
        if compressed_length < 4:
            raise ValueError('Not enough data loaded to determine uncompressed size')
        return compressed_length, _extract_size(last_data)
    

    Demo:

    >>> compressed_length, decompressed_length = get_content_lengths('http://httpbin.org/gzip')
    >>> compressed_length
    179
    >>> decompressed_length
    226
    >>> compressed_length / decompressed_length
    0.7920353982300885
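
    The ISIZE trick is easy to verify locally with the standard library (a minimal sketch; note ISIZE holds the uncompressed size modulo 2**32, so it wraps for payloads of 4 GiB or more):

    import gzip
    import io
    import struct

    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='wb') as f:
        f.write(b'x' * 1000)
    blob = buf.getvalue()
    # the last 4 bytes of a gzip stream are ISIZE, little-endian
    print(struct.unpack('<I', blob[-4:])[0])  # -> 1000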