I'm writing an application that calculates the savings we got after using gzip on a web page. When the user inputs the URL of the web page that used gzip, the application should spit out the savings in size due to gzip.
How should I approach this problem?
These are the headers I am getting for a GET request on the page:
{
'X-Powered-By': 'PHP/5.5.9-1ubuntu4.19',
'Transfer-Encoding': 'chunked',
'Content-Encoding': 'gzip',
'Vary': 'Accept-Encoding',
'Server': 'nginx/1.4.6 (Ubuntu)',
'Connection': 'keep-alive',
'Date': 'Thu, 10 Nov 2016 09:49:58 GMT',
'Content-Type': 'text/html'
}
I am retrieving the page with requests:
r = requests.get(url, headers=headers)
data = r.text
print "Webpage size : " , len(data)/1024
If you have already downloaded the URL (using a requests GET request without the stream option), you already have both sizes available, as the whole response is downloaded and decompressed, and the original (compressed) length is available in the headers:
from __future__ import division

import requests

r = requests.get(url, headers=headers)
# Content-Length is the size sent over the wire (compressed); this raises
# KeyError for chunked responses that carry no Content-Length header.
compressed_length = int(r.headers['content-length'])
# r.content is the already-decompressed body
decompressed_length = len(r.content)
ratio = compressed_length / decompressed_length
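To turn that into the savings figure the question asks about, just report the difference between the two lengths. A minimal, self-contained sketch (the httpbin URL is only a stand-in for your own page, and it assumes the server sends a Content-Length header, which the chunked response in the question would not):

from __future__ import division

import requests

url = 'http://httpbin.org/gzip'  # placeholder; any gzip-served page works
r = requests.get(url)

compressed_length = int(r.headers['content-length'])  # bytes on the wire
decompressed_length = len(r.content)                  # bytes after decompression

print('Compressed:   {:.1f} KiB'.format(compressed_length / 1024))
print('Decompressed: {:.1f} KiB'.format(decompressed_length / 1024))
print('Savings:      {:.1%}'.format(1 - compressed_length / decompressed_length))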
Alternatively, you could compare the content-length header of a HEAD request sent with Accept-Encoding: identity against that of one sent with Accept-Encoding: gzip:
no_gzip = {'Accept-Encoding': 'identity'}
no_gzip.update(headers)
uncompressed_length = int(
    requests.head(url, headers=no_gzip).headers['content-length'])

force_gzip = {'Accept-Encoding': 'gzip'}
force_gzip.update(headers)
compressed_length = int(
    requests.head(url, headers=force_gzip).headers['content-length'])
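Wrapped up as a helper, that comparison could look like this sketch (gzip_savings_via_head is a made-up name; it raises a KeyError if the server omits Content-Length for either variant):

import requests

def gzip_savings_via_head(url, headers=None):
    """Estimate gzip savings by comparing two HEAD requests."""
    identity = dict(headers or {}, **{'Accept-Encoding': 'identity'})
    gzipped = dict(headers or {}, **{'Accept-Encoding': 'gzip'})
    # raises KeyError if either response lacks a Content-Length header
    uncompressed = int(requests.head(url, headers=identity).headers['content-length'])
    compressed = int(requests.head(url, headers=gzipped).headers['content-length'])
    return uncompressed, compressed, 1 - compressed / float(uncompressed)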
However, this may not work for all servers, as dynamically-generated content servers routinely futz the Content-Length header in such cases to avoid having to render the content first.
If you are requesting a chunked transfer encoding resource, there won't be a content-length header, in which case a HEAD request may or may not provide you with the correct information either.
In that case you'd have to stream the whole response and extract the decompressed size from the end of the stream (the GZIP format stores this, modulo 2^32, as a little-endian 4-byte unsigned int at the very end). Use the stream() method on the raw urllib3 response object:
import requests
from collections import deque

if hasattr(int, 'from_bytes'):
    # Python 3.2 and up
    _extract_size = lambda q: int.from_bytes(bytes(q), 'little')
else:
    import struct
    _le_int = struct.Struct('<I').unpack
    _extract_size = lambda q: _le_int(b''.join(q))[0]

def get_content_lengths(url, headers=None, chunk_size=2048):
    """Return the compressed and uncompressed lengths for a given URL

    Works for all resources accessible by GET, regardless of transfer-encoding
    and discrepancies between HEAD and GET responses. This does have
    to download the full response (streamed) to determine sizes.

    """
    only_gzip = {'Accept-Encoding': 'gzip'}
    only_gzip.update(headers or {})
    # Set `stream=True` to ensure we can access the original stream:
    r = requests.get(url, headers=only_gzip, stream=True)
    r.raise_for_status()
    if r.headers.get('Content-Encoding') != 'gzip':
        raise ValueError('Response not gzip-compressed')
    # we only need the very last 4 bytes of the data stream
    last_data = deque(maxlen=4)
    compressed_length = 0
    # stream directly from the urllib3 response so we can ensure the
    # data is not decompressed as we iterate
    for chunk in r.raw.stream(chunk_size, decode_content=False):
        compressed_length += len(chunk)
        last_data.extend(chunk)
    if compressed_length < 4:
        raise ValueError('Not enough data loaded to determine uncompressed size')
    return compressed_length, _extract_size(last_data)
Demo:
>>> compressed_length, decompressed_length = get_content_lengths('http://httpbin.org/gzip')
>>> compressed_length
179
>>> decompressed_length
226
>>> compressed_length / decompressed_length
0.7920353982300885
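As a sanity check on the ISIZE trick the streaming version relies on, you can verify it against data you compress yourself; a small sketch using zlib's gzip wrapper (wbits=31 asks zlib for a gzip-framed stream):

import struct
import zlib

payload = b'x' * 1000
compressor = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 -> gzip header and trailer
gzipped = compressor.compress(payload) + compressor.flush()

# ISIZE: the last 4 bytes, little-endian, hold the uncompressed size modulo 2**32
print(struct.unpack('<I', gzipped[-4:])[0] == len(payload))  # True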