I'm using Amazon Elastic Transcoder in conjunction with Lambda and Step Functions to transcode MP3s from WAV files.
I need to store the MD5 / S3 ETag
header value of the transcoded MP3s in my database.
At the moment I'm having to fetch these in a separate process, which is really slow:
import boto3

# One HEAD request per transcoded object just to read its ETag
s3_cli = boto3.client("s3", aws_access_key_id=ACCESS_KEY, aws_secret_access_key=SECRET_KEY)
s3_resp = s3_cli.head_object(Bucket=bucket, Key=mp3_key)
s3obj_etag = s3_resp['ETag'].replace('"', '')
Before putting this in place, I was hoping that Elastic Transcoder would provide the transcoded files' MD5 hashes in the job response, but I cannot see this anywhere.
Does anyone have any tips on how to approach this better or am I missing something in the response/docs?
I was hoping that Elastic Transcoder would provide the transcoded files' ETag in the job response
Unfortunately, it doesn't.
As of the current AWS Python SDK (Boto3 1.18.60), and the same goes for every other SDK as well as the REST API, the entity tag(s) for the output object(s) aren't returned anywhere in the job response object.
This is most likely because an entity tag represents a specific version of an object and is mostly used for efficient cache invalidation.
Elastic Transcoder jobs don't produce multiple versions of the same output, so there's little reason for the job response to include the ETag value; anyone who needs it can get it from the S3 object itself.
There's also the question of what would happen when a large output is written to S3 as a multipart upload. What would the SDK return? A list of ETag header values? You'd have multiple parts, but you wouldn't have multiple versions.
This implementation would go against the RFC 7232 specification for the ETag header:
An entity-tag is an opaque validator for differentiating between multiple representations of the same resource, regardless of whether those multiple representations are due to resource state changes over time, content negotiation resulting in multiple representations being valid at the same time, or both.
Your actual problem, in this case, is that you want the MD5 hash of the file(s) even when they were uploaded in multiple parts.
Now, your code will work for getting the MD5 hash of a single-part object, but Amazon doesn't hash multipart uploads the way you would expect. Instead of calculating the hash of the entire file, Amazon calculates the MD5 of each part, then takes the MD5 of the concatenated part digests, appends the number of parts, and stores that as the ETag header value.
This makes a lot of sense: they calculate the hash of each part as they receive it. Once all the parts have been transferred, they combine the hashes instead of trying to calculate a final MD5 hash by re-reading a file that could be as large as the S3 object size limit of 5 terabytes. Try generating MD5 hashes for everyone's files at Amazon's scale and you'll find that their way is quite clever :)
This is probably why the S3 API Reference says the following:
The ETag may or may not be an MD5 digest of the object data.
Objects created by either the Multipart Upload or Part Copy operation have ETags that are not MD5 digests, regardless of the method of encryption.
For a single-part upload the ETag is an MD5 hash, but for a multipart upload it isn't, so the statement above is technically correct.
To calculate the MD5 hash correctly for multi-part uploads, try checking out this great answer in response to "What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB?".
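If it helps, here's a minimal sketch of that algorithm in Python. The 8 MB part size is an assumption (it has to match whatever part size was actually used for the upload), and the function name and file handling are just illustrative:

import hashlib

def multipart_etag(path, part_size=8 * 1024 * 1024):
    # part_size is an assumption: it must match the part size used for the upload
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) == 1:
        # Objects uploaded in a single PUT get a plain MD5 ETag
        return part_digests[0].hex()
    # Multipart uploads: MD5 of the concatenated part digests, plus "-<part count>"
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "{}-{}".format(combined, len(part_digests))

You can tell which case you're dealing with by checking whether the ETag contains a hyphen.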
To summarise: unfortunately, you don't get the MD5 digest of the object(s) in the response back from Amazon Elastic Transcoder, so you'll have to do this heavy lifting yourself, and it can take a long time if you have huge files.
There's no workaround or quicker solution for that part: you already have the quickest approach, since fetching the ETag value with a HEAD request is the most efficient way to get it.
I would, however, recommend parallelising the HTTP HEAD requests that fetch the objects' metadata (s3_cli.head_object(...)) before working out the final MD5 digest of each file. That will speed things up considerably if you have, say, 500 files: don't make the API calls one after another. They can safely be sent in parallel, so the waiting happens on Amazon's side concurrently rather than sequentially in your process. Collect the responses and then process them together, as in the sketch below.
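A rough sketch of that idea, assuming the bucket name and the list of keys are already known (bucket and mp3_keys below are placeholders for your own values) and that credentials are resolved from the environment:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3_cli = boto3.client("s3")  # credentials picked up from the environment/role

def fetch_etag(key):
    # One HEAD request per object; fast on its own, slow when done serially
    resp = s3_cli.head_object(Bucket=bucket, Key=key)
    return key, resp['ETag'].replace('"', '')

# Send the HEAD requests concurrently and gather the results together
with ThreadPoolExecutor(max_workers=20) as pool:
    etags = dict(pool.map(fetch_etag, mp3_keys))

Boto3 clients are thread-safe, so sharing a single client across the worker threads is fine; the max_workers value is just a starting point to tune.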