amazon-web-services amazon-s3 checksum etag

Calculate MD5 from AWS S3 ETag


I know it is possible to calculate the ETag of a locally stored file, but that's not useful in my case. I have a pipeline where I zip files and upload them directly to S3 storage from memory:

zip -r - $input_path | tee >(md5sum - >> $MD5_FILE) >(aws s3 cp - s3://$bucket_name/$final_path_zip) >/dev/null

After this I want to check whether the ETag matches the MD5 I calculated in this command. Is it possible (ideally using bash) to calculate the MD5 checksum of the whole file knowing only the ETag?

Another approach would be to calculate the ETag from the piped zip, but I have no idea how to do that (I didn't get any result with wc -c).


Solution

  • You can't get the MD5 digest from an arbitrary ETag in S3. For non-encrypted objects uploaded with a single PutObject request, the ETag is just the MD5 digest of the contents. For objects uploaded with multipart uploads, it is documented as a composite checksum: the digest of the digests of each part concatenated together, with a suffix appended that counts the number of parts. Since the MD5 hash algorithm is not reversible, you can't get the hashes of the individual parts out of it, let alone the MD5 of the whole object.

    For encrypted objects uploaded with any method, it is just documented as "not an MD5 digest of their object data".
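
    To make the multipart case concrete, here is a minimal sketch of how a composite ETag is built from part digests. The function name composite_etag and the tiny in-memory parts are hypothetical, chosen only for illustration; real parts are megabytes in size.

```python
from hashlib import md5


def composite_etag(parts):
    """Build a multipart-style ETag from a list of bytes objects.

    `parts` is a hypothetical in-memory stand-in for the uploaded parts.
    """
    # Concatenate the *binary* MD5 digests of the parts (16 bytes each)...
    digests = b"".join(md5(part).digest() for part in parts)
    # ...then hash that concatenation and append the number of parts.
    return md5(digests).hexdigest() + "-" + str(len(parts))


parts = [b"a" * 10, b"b" * 10, b"c" * 3]
print(composite_etag(parts))             # composite value ending in "-3"
print(md5(b"".join(parts)).hexdigest())  # plain MD5 of the whole object
```

    The two printed values differ, and nothing in the first lets you derive the second, which is why the whole-object MD5 has to be computed from the data itself.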

    So, if you want to compare the ETag of an object in S3 with one you create, you'll need to calculate the ETag using the same technique as S3 does. md5 on its own is not enough to do this with multipart uploads; you'll need something more complex. The following Python script will do just that, outputting either a plain MD5 digest for smaller files, or a digest of the part digests for larger uploads:

    #!/usr/bin/env python3
    
    import sys
    from hashlib import md5
    
    MULTIPART_THRESHOLD = 8388608
    MULTIPART_CHUNKSIZE = 8388608
    BUFFER_SIZE = 1048576
    
    # Verify some assumptions are correct
    assert(MULTIPART_CHUNKSIZE >= MULTIPART_THRESHOLD)
    assert((MULTIPART_THRESHOLD % BUFFER_SIZE) == 0)
    assert((MULTIPART_CHUNKSIZE % BUFFER_SIZE) == 0)
    
    hash = md5()
    read = 0
    chunks = None
    
    while True:
        # Read some from stdin, if we're at the end, stop reading
        bits = sys.stdin.buffer.read(BUFFER_SIZE)
        if len(bits) == 0: break
        read += len(bits)
        hash.update(bits)
        if chunks is None:
            # We're handling a multi-part upload, so switch to calculating 
            # hashes of each chunk
            if read >= MULTIPART_THRESHOLD:
                chunks = b''
        if chunks is not None:
            if (read % MULTIPART_CHUNKSIZE) == 0:
                # Done with a chunk, add it to the list of hashes to hash later
                chunks += hash.digest()
                hash = md5()
    
    if chunks is None:
        # Normal upload, just output the MD5 hash
        etag = hash.hexdigest()
    else:
        # Multipart upload, need to output the hash of the hashes
        if (read % MULTIPART_CHUNKSIZE) != 0:
            # Add the last part if we have a partial chunk
            chunks += hash.digest()
        etag = md5(chunks).hexdigest() + "-" + str(len(chunks) // 16)
    
    # Just show the etag, adding quotes to mimic how S3 operates
    print('"' + etag + '"')
    

    It is a drop-in replacement for your md5sum call:

    $ zip -r - "$input_path" | tee >(python3 calculate_etag_from_pipe - >> "$MD5_FILE") >(aws s3 cp - s3://$bucket_name/$final_path_zip) >/dev/null
    [ ... zip file is created and uploaded to S3 ... ]
    
    $ cat "$MD5_FILE"
    "ef5c64605cb198b65b2451a76719b8d8-96"
    
    $ aws s3api head-object --bucket $bucket_name --key $final_path_zip --query ETag --output text
    "ef5c64605cb198b65b2451a76719b8d8-96"
    

    Note that the script as shown makes some assumptions about how the upload will be split into a multipart upload. These assumptions roughly match how the AWS CLI operates by default, but that is not guaranteed. If you're using a different SDK, or different settings for the CLI, you will need to adjust MULTIPART_THRESHOLD and MULTIPART_CHUNKSIZE to match.
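
    If you don't know which chunk size was used, the part count in the ETag together with the object size (for example the ContentLength reported by head-object) can narrow it down. The following is only a sketch: parse_etag and candidate_chunk_sizes are hypothetical helpers, the list of sizes to try is a guess, and it assumes every part except the last has the same size.

```python
MIB = 1024 * 1024


def parse_etag(etag):
    """Split an ETag into (hex_digest, part_count).

    part_count is None when the ETag has no "-N" suffix.
    """
    value = etag.strip('"')
    if "-" in value:
        digest, count = value.split("-", 1)
        return digest, int(count)
    return value, None


def candidate_chunk_sizes(object_size, part_count,
                          sizes_mib=(5, 8, 16, 32, 64, 128)):
    """Return the chunk sizes (in bytes) from `sizes_mib` that would split
    `object_size` bytes into exactly `part_count` parts.

    The sizes to try are an assumption for illustration, not something
    S3 reports.
    """
    matches = []
    for size in sizes_mib:
        chunk = size * MIB
        # Integer ceiling division: number of parts at this chunk size.
        if (object_size + chunk - 1) // chunk == part_count:
            matches.append(chunk)
    return matches


digest, part_count = parse_etag('"ef5c64605cb198b65b2451a76719b8d8-96"')
# Hypothetical object size: just under 96 chunks of 8 MiB.
print(candidate_chunk_sizes(96 * 8 * MIB - 5, part_count))
```

    A size that comes back from this check is only a candidate; the real confirmation is recomputing the ETag with that MULTIPART_CHUNKSIZE and comparing it with the one S3 returns.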