I'm trying to download an object in GCS that is compressed, but I'm unable to download it without GCS automatically decompressing the file for me. I want to be able to download the gzip myself, and then decompress locally.
If I go to my object in the GCS GUI, I can view the object metadata and see the following:
Content-Type: application/json
Content-Encoding: gzip
Cache-Control: no-transform
Also, if I right-click the Authenticated URL in the console and click Save Link As, I get a gzip archive, so I know that this file is actually an archive.
I read in the GCS documentation that if you set Cache-Control: no-transform, then "the object is served as a compressed object in all subsequent requests".
Except when I use the code below to download the GCS object it's downloaded as a JSON object, not as a gzip archive:
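from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket("bucketname")
blob = bucket.blob("objectname")
# Three attempts; each one returns decompressed JSON rather than the gzip archive
stringobj = blob.download_as_text()
bytesobj = blob.download_as_bytes()
blob.download_to_filename("test.json.gz")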
I've tried three different methods for downloading the object and they're all downloading the files as JSON objects.
Just to validate that the object does in fact have the correct headers, I ran the following:
blob.reload()
print(f"Content encoding: {blob.content_encoding}")
print(f"Content type: {blob.content_type}")
print(f"Cache control: {blob.cache_control}")
>> Content encoding: gzip
>> Content type: application/json
>> Cache control: no-transform
I'm not sure what else I could try.
I reproduced your problem. Following your steps, I got the same behavior: the download produced a file with a .gz extension, but running gunzip on it returns an error:
Example.json.gz: not in gzip format
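That error means the file on disk is plain JSON that only carries a .gz name. If gunzip isn't handy, the same check can be sketched in Python by looking at the first two bytes, since gzip data starts with the magic number 1f 8b. The helper name looks_like_gzip and the temporary file names below are my own placeholders, not part of any library, and the demo uses local stand-in files rather than a real GCS download:

```python
import gzip
import os
import tempfile


def looks_like_gzip(path):
    """Return True if the file starts with the gzip magic number 1f 8b."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


# Demo with two local stand-in files (no GCS access needed).
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "plain.json.gz")  # JSON text with a misleading .gz name
real_path = os.path.join(tmpdir, "real.json.gz")    # actual gzip data

with open(plain_path, "wb") as f:
    f.write(b'{"key": "value"}')
with gzip.open(real_path, "wb") as f:
    f.write(b'{"key": "value"}')

print(looks_like_gzip(plain_path))  # False
print(looks_like_gzip(real_path))   # True
```

Running a check like this on the downloaded file should tell you whether decompressive transcoding already happened before the bytes reached you.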
The solution is to pass raw_download=True, which downloads the raw gzip archive and prevents decompressive transcoding from happening.
Example:
blob.download_to_filename("test.json.gz", raw_download=True)
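Since the goal was to decompress locally, here is a minimal sketch of that last step, assuming the raw bytes came from something like blob.download_as_bytes(raw_download=True). To keep it runnable without a bucket, the raw bytes are simulated with gzip.compress; parse_raw_gzip_json is a hypothetical helper name of mine, and the sample payload is made up:

```python
import gzip
import json


def parse_raw_gzip_json(raw):
    """Decompress gzip bytes and parse the result as JSON."""
    return json.loads(gzip.decompress(raw).decode("utf-8"))


# Stand-in for: raw = blob.download_as_bytes(raw_download=True)
raw = gzip.compress(json.dumps({"key": "value"}).encode("utf-8"))

data = parse_raw_gzip_json(raw)
print(data)
```

In real use you would replace the stand-in line with the actual raw download; everything after that stays the same.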