
How to prevent GCS from automatically decompressing objects when using Python SDK?


I'm trying to download a gzip-compressed object from GCS, but GCS automatically decompresses it on download. I want to download the gzip file as stored and then decompress it locally.

If I open the object in the GCS console, I can view the object metadata and see the following:

Content-Type: application/json
Content-Encoding: gzip
Cache-Control: no-transform

Also, if I right click the Authenticated URL in the console and click Save Link As, I get a gzip archive, so I know that this file is actually an archive.

I read in the GCS documentation that if you set Cache-Control: no-transform, then "the object is served as a compressed object in all subsequent requests".

However, when I use the code below to download the GCS object, it is downloaded as plain JSON, not as a gzip archive:

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucketname")
blob = bucket.blob("objectname")
stringobj = blob.download_as_text()
bytesobj = blob.download_as_bytes()
blob.download_to_filename("test.json.gz")

I've tried three different methods for downloading the object, and all three return decompressed JSON rather than the gzip archive.

Just to validate that the object does in fact have the correct headers, I ran the following:

blob.reload()
print(f"Content encoding: {blob.content_encoding}")
print(f"Content type: {blob.content_type}")
print(f"Cache control: {blob.cache_control}")

>> Content encoding: gzip
>> Content type: application/json
>> Cache control: no-transform

I'm not sure what else I could try.


Solution

  • I reproduced your problem. Following your steps, I downloaded the object to a file with a .gz extension, but the contents had already been decompressed. Running gunzip on the file returns an error:

    Example.json.gz: not in gzip format
    

    The solution is to pass raw_download=True, which downloads the raw gzip archive and prevents decompressive transcoding.

    Example:

    blob.download_to_filename("test.json.gz", raw_download=True)
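
If the goal is to decompress in memory rather than write a file, the same flag can be passed to download_as_bytes. The following is a minimal sketch, not a definitive implementation: load_gzipped_json is a hypothetical helper name, and it assumes the object holds gzip-compressed JSON as in the question. It checks the gzip magic bytes first so that a silently decompressed payload is reported clearly.

```python
import gzip
import json


def load_gzipped_json(blob):
    """Fetch a gzip-encoded blob as stored and decode it locally.

    `blob` is expected to behave like google.cloud.storage.Blob
    (hypothetical helper; only download_as_bytes is used).
    """
    # raw_download=True requests the stored bytes without decompressive transcoding.
    raw = blob.download_as_bytes(raw_download=True)

    # Every gzip stream starts with the magic bytes 0x1f 0x8b.
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("payload is not gzip; it may have been decompressed in transit")

    # Decompress locally, then parse the JSON document.
    return json.loads(gzip.decompress(raw))
```

With the blob from the question, this would be called as load_gzipped_json(bucket.blob("objectname")).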