
Difficulty comparing locally generated and Google Cloud Storage-provided CRC32c checksums


I am attempting to compute a CRC32c checksum of my local file so I can compare it to the blob.crc32c value provided by the gcloud library. Google says I should use the crcmod module to calculate CRC32c hashes of my data.

modifiedFile.txt has already been downloaded from a Google Cloud Storage bucket onto my local filesystem.

The goal is to set should_download to True only if modifiedFile.txt has a different CRC32c on my local client than on the remote server. How do I get matching CRC32c values when my local file and my gcloud Blob have the same content?

from crcmod import PredefinedCrc
from gcloud import storage

# blob is a gcloud Blob object

should_download = True

with open('modifiedFile.txt', 'rb') as f:  # binary mode, so the raw bytes are hashed
  hasher = PredefinedCrc('crc-32c')
  hasher.update(f.read())
  crc32c = hasher.digest()
  print crc32c # \207\245.\240
  print blob.crc32c # CJKo0A==
  should_download = crc32c != blob.crc32c

Unfortunately, the comparison currently always fails, because I don't know how to compare the checksum I build with crcmod to the crc32c attribute I see on the matching Blob object.


Solution

  • Here are example md5 and crc32c hashes for the public gsutil tarball:

    $ gsutil ls -L gs://pub/gsutil.tar.gz | grep Hash
        Hash (crc32c):      vHI6Bw==
        Hash (md5):     ph7W3cCoEgMQWvA45Z9y9Q==
    

    I'll copy it locally to work with:

    $ gsutil cp gs://pub/gsutil.tar.gz /tmp/
    Copying gs://pub/gsutil.tar.gz...
    Downloading file:///tmp/gsutil.tar.gz:                           2.59 MiB/2.59 MiB    
    

    CRC values are usually displayed as unsigned 32-bit integers. The hash shown above is the base64 encoding of that integer's four bytes in big-endian order, so to convert it:

    >>> import base64
    >>> import struct
    >>> struct.unpack('>I', base64.b64decode('vHI6Bw=='))
    (3161602567,)
    

    To obtain the same value from the crcmod library:

    >>> file_bytes = open('/tmp/gsutil.tar.gz', 'rb').read()
    >>> import crcmod
    >>> crc32c = crcmod.predefined.Crc('crc-32c')
    >>> crc32c.update(file_bytes)
    >>> crc32c.crcValue
    3161602567L
    

    If you want to convert the value from crcmod to the same base64 format used by gcloud/gsutil:

    >>> base64.b64encode(crc32c.digest()).decode('utf-8')
    'vHI6Bw=='