amazon-web-services, amazon-s3, aws-lambda, amazon-kinesis, amazon-kinesis-firehose

AWS process large batch of small files daily


I have intermediate AWS knowledge and an issue where I can see multiple ways of solving it, and I'm looking for opinions from more skilled AWS architects.

I have an on-premises system that produces ~30k XML files (each <100KB) throughout the day. These XML files have to be sent to AWS to be parsed.

Possible solutions:

  1. Feed the XML files to a Kinesis Firehose delivery stream (presumably via an API Gateway) that parses each file in a Lambda and also stores the raw files in S3. This is, in my opinion, the ideal solution (?). A new XML file is created approx. every 3 seconds.
  2. Upload each XML file via a presigned S3 URL and trigger a Lambda that parses it (see the sketch after this list). This involves fetching a presigned URL for every file from an API Gateway. I am unsure whether this is a good approach for files produced at the above-mentioned frequency.
  3. Same as above, but use SFTP for the upload.
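To make option 2 concrete, here is a minimal sketch of the mechanics, assuming boto3 and the requests library; the bucket name, key, and local path are placeholders, and in the real setup the presigned URL would be returned by the API Gateway endpoint rather than generated locally:

```python
# Minimal sketch of option 2: presigned-URL upload of one XML file.
# Bucket, key, and local path are placeholders.
import boto3
import requests

s3 = boto3.client("s3")

def get_presigned_put_url(bucket: str, key: str, expires: int = 300) -> str:
    """What the Lambda behind API Gateway would do for each file."""
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )

def upload_file(path: str, url: str) -> None:
    """Upload one XML file from the on-premises side over plain HTTPS."""
    with open(path, "rb") as f:
        resp = requests.put(url, data=f, headers={"Content-Type": "application/xml"})
    resp.raise_for_status()

url = get_presigned_put_url("my-ingest-bucket", "incoming/file-0001.xml")
upload_file("/data/outbox/file-0001.xml", url)
```

At one file every ~3 seconds, this is a trivial request rate for both API Gateway and S3.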

Out of these three solutions, I think option 1 is the most suitable, but I'm eager to hear opinions on this.

There is also a scenario where the XML files are collected over the day and sent to AWS as a single batch of ~30k files. For that case, I have one concrete question: does triggering 30k Lambdas "at once" pose an issue? The tasks are not time-sensitive, so it's not a problem if "only" 1k Lambdas run in parallel, as long as all of them eventually run.

I understand that this question is not super specific, but I would appreciate any help.
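For reference: Lambda throttles invocations beyond the account's concurrency limit (1,000 by default), and throttled asynchronous invocations, such as S3 triggers, are retried automatically for a while, so a 30k burst tends to be worked off gradually rather than failing outright. If the parser's parallelism needs an explicit cap, a minimal boto3 sketch (the function name is hypothetical):

```python
# Hedged sketch: cap the parser Lambda's parallelism with reserved
# concurrency so a 30k-file burst is worked off gradually instead of
# consuming the whole account's concurrency. Function name is a placeholder.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.put_function_concurrency(
    FunctionName="xml-parser",          # hypothetical parser function
    ReservedConcurrentExecutions=500,   # at most 500 parallel executions
)
```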


Solution

  • The cheapest solution so far would be to use the Distributed Map feature of AWS Step Functions.
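
    What follows is a minimal sketch of such a Distributed Map state, written as an Amazon States Language definition held in a Python dict; the bucket, prefix, and Lambda function name are placeholders:

    ```python
    # Sketch of a Step Functions Distributed Map state (Amazon States
    # Language as a Python dict). Bucket, prefix, and function name are
    # placeholders.
    distributed_map_state = {
        "Type": "Map",
        "ItemReader": {  # read the object listing directly from S3
            "Resource": "arn:aws:states:::s3:listObjectsV2",
            "Parameters": {"Bucket": "my-ingest-bucket", "Prefix": "incoming/"},
        },
        "ItemBatcher": {"MaxItemsPerBatch": 100},  # hand each worker 100 keys
        "MaxConcurrency": 500,                     # cap parallel child runs
        "ItemProcessor": {
            "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
            "StartAt": "ParseXml",
            "States": {
                "ParseXml": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::lambda:invoke",
                    "Parameters": {"FunctionName": "xml-parser", "Payload.$": "$"},
                    "End": True,
                }
            },
        },
        "End": True,
    }
    ```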

    Regarding the file upload, you need to decide how quickly the data must be accessible after processing; whether you upload everything in one daily batch or as the files are produced follows from that.

    Independently of the upload interval, I would use the S3 event that fires when a new file arrives, batch the files, and process them with Step Functions. To reduce costs, use an Express Workflow.
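
    As a sketch of the parsing side, an S3-triggered Lambda handler could look like this, assuming the standard s3:ObjectCreated notification payload and Python's built-in XML parser; the actual field extraction is elided:

    ```python
    # Sketch of an S3-triggered parser Lambda. Assumes the standard
    # S3 event notification payload; the real parsing logic is elided.
    import urllib.parse
    import xml.etree.ElementTree as ET

    import boto3

    s3 = boto3.client("s3")

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # object keys arrive URL-encoded in the notification payload
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            root = ET.fromstring(body)  # parse the XML document
            # ... extract the fields you need and pass them downstream ...
    ```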

    There are a couple of ways to integrate files from on-premises into the cloud. You can write a script that puts files into S3 directly (see the sketch below), upload them via API Gateway, or use more sophisticated services like AWS DataSync or AWS Storage Gateway. Depending on how stable your connection is, you could also mount S3 directly into your filesystem as described here.
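
    The "script that puts files into S3 directly" can be as small as the following sketch; the directory, bucket name, and key prefix are placeholders, and a production version would need retry handling and a record of what has already been uploaded:

    ```python
    # Sketch of a direct-to-S3 upload script for the on-premises side.
    # Directory, bucket, and key prefix are placeholders.
    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")
    OUTBOX = Path("/data/outbox")
    BUCKET = "my-ingest-bucket"

    for xml_file in sorted(OUTBOX.glob("*.xml")):
        s3.upload_file(str(xml_file), BUCKET, f"incoming/{xml_file.name}")
        xml_file.unlink()  # remove only after a successful upload
    ```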