I know it is possible to calculate the ETag of a locally stored file. That's not useful in my case. I have a pipeline where I zip files and upload them directly to S3 storage through memory:
zip -r - $input_path | tee >(md5sum - >> $MD5_FILE) >(aws s3 cp - s3://$bucket_name/$final_path_zip) >/dev/null
After this I want to check whether the ETag matches the MD5 I calculated in this command. So I would like to know if it's possible (preferably using bash) to calculate the MD5 checksum of the whole file from the ETag.
The other way around would be to calculate the ETag from the piped zip, but I have no idea how to do that (I didn't get any result with wc -c).
You can't recover the MD5 digest from an arbitrary S3 ETag. For non-encrypted objects uploaded with a single PutObject request, the ETag is just the MD5 digest of the contents. For objects uploaded with multipart uploads, it is documented as a composite checksum: the digest of the concatenated digests of each part, with a suffix added to the end counting the number of parts. Since the MD5 hash algorithm is not reversible, you can't get the digests of the individual parts back out of it.
For encrypted objects uploaded with any method, the ETag is only documented as "not an MD5 digest of their object data".
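To make the two ETag shapes concrete, here is a minimal sketch that only inspects the string and reports whether it looks like a plain MD5 digest or a composite multipart value. The helper name classify_etag and its return shape are my own, not part of any AWS tool, and the shape alone cannot tell you whether the object was encrypted.

```python
import re

# Hypothetical helper, for illustration only. It looks at the *shape* of the
# ETag string; it cannot detect encrypted objects, whose ETag is documented
# as not being an MD5 digest of the data.
_ETAG_RE = re.compile(r'^"?([0-9a-f]{32})(?:-(\d+))?"?$')


def classify_etag(etag):
    """Return (kind, hex_digest, part_count) for an S3 ETag string."""
    match = _ETAG_RE.match(etag.strip().lower())
    if match is None:
        # Not 32 hex digits with an optional "-N" suffix
        return ("unknown", None, None)
    digest, parts = match.group(1), match.group(2)
    if parts is None:
        # Shaped like a plain MD5 digest (single PutObject upload)
        return ("plain", digest, 1)
    # Shaped like a composite checksum: digest of part digests, "-N" parts
    return ("composite", digest, int(parts))
```

Only a "plain" result can be compared directly against md5sum output; a "composite" result has to be reproduced part by part.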
So, if you want to compare the ETag of an object in S3 with what you create, you'll need to calculate the ETag using the same technique as S3 does. md5sum on its own is not enough to do this with multipart uploads; you'll need something more complex. The following Python script will do just that, outputting either an MD5 digest for smaller files, or a digest of the part digests for larger uploads:
#!/usr/bin/env python3
import sys
from hashlib import md5

MULTIPART_THRESHOLD = 8388608
MULTIPART_CHUNKSIZE = 8388608
BUFFER_SIZE = 1048576

# Verify some assumptions are correct
assert(MULTIPART_CHUNKSIZE >= MULTIPART_THRESHOLD)
assert((MULTIPART_THRESHOLD % BUFFER_SIZE) == 0)
assert((MULTIPART_CHUNKSIZE % BUFFER_SIZE) == 0)

hash = md5()
read = 0
chunks = None

while True:
    # Read some from stdin, if we're at the end, stop reading
    bits = sys.stdin.buffer.read(BUFFER_SIZE)
    if len(bits) == 0: break
    read += len(bits)
    hash.update(bits)
    if chunks is None:
        # We're handling a multi-part upload, so switch to calculating
        # hashes of each chunk
        if read >= MULTIPART_THRESHOLD:
            chunks = b''
    if chunks is not None:
        if (read % MULTIPART_CHUNKSIZE) == 0:
            # Done with a chunk, add it to the list of hashes to hash later
            chunks += hash.digest()
            hash = md5()

if chunks is None:
    # Normal upload, just output the MD5 hash
    etag = hash.hexdigest()
else:
    # Multipart upload, need to output the hash of the hashes
    if (read % MULTIPART_CHUNKSIZE) != 0:
        # Add the last part if we have a partial chunk
        chunks += hash.digest()
    # Each MD5 digest is 16 bytes, so this counts the parts
    etag = md5(chunks).hexdigest() + "-" + str(len(chunks) // 16)

# Just show the etag, adding quotes to mimic how S3 operates
print('"' + etag + '"')
It is a drop-in replacement for your md5sum call:
$ zip -r - "$input_path" | tee >(python3 calculate_etag_from_pipe - >> "$MD5_FILE") >(aws s3 cp - s3://$bucket_name/$final_path_zip) >/dev/null
[ ... zip file is created and uploaded to S3 ... ]
$ cat "$MD5_FILE"
"ef5c64605cb198b65b2451a76719b8d8-96"
$ aws s3api head-object --bucket $bucket_name --key $final_path_zip --query ETag --output text
"ef5c64605cb198b65b2451a76719b8d8-96"
Note that the script as shown makes some assumptions about how the upload will be split into a multipart upload. These assumptions roughly match how the AWS CLI operates by default, but this is not guaranteed. If you're using a different SDK, or different settings for the CLI, you will need to adjust MULTIPART_THRESHOLD and MULTIPART_CHUNKSIZE.
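If you expect to change those two values often, the same calculation can be sketched as a function that takes them as parameters. This is a rough variant and not the script above verbatim: the name etag_for_stream is made up, it reads one whole part at a time (so it holds up to chunksize bytes in memory), and it assumes the stream's read(n) returns a full n bytes until EOF, which holds for sys.stdin.buffer and io.BytesIO.

```python
import io
from hashlib import md5


def etag_for_stream(stream, threshold=8 * 1024 * 1024, chunksize=8 * 1024 * 1024):
    """Sketch: compute an S3-style ETag (without quotes) for a binary stream.

    threshold and chunksize are assumptions meant to mirror the uploader's
    multipart settings; both default to 8 MiB like the constants above.
    """
    whole = md5()        # digest of the entire stream, used for small uploads
    part_digests = []    # one raw 16-byte MD5 digest per part
    total = 0
    while True:
        # Assumes read() only returns a short block at the end of the stream
        part = stream.read(chunksize)
        if not part:
            break
        total += len(part)
        whole.update(part)
        part_digests.append(md5(part).digest())
    if total < threshold:
        # Below the threshold: treated as a single upload, plain MD5
        return whole.hexdigest()
    # Multipart: digest of the concatenated part digests, plus the part count
    return md5(b"".join(part_digests)).hexdigest() + "-" + str(len(part_digests))
```

In the pipeline you would call it as print('"' + etag_for_stream(sys.stdin.buffer) + '"') after importing sys, passing whatever threshold and chunk size your uploader is actually configured with.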