I have intermediate AWS knowledge and am facing a problem that I can see multiple ways of solving, so I'm looking for opinions from more experienced AWS architects.
I have an on-premises system that produces ~30k XML files (each <100 KB) throughout the day. These XML files have to be sent to AWS to be parsed.
Possible solutions:
Out of these 3 solutions, I think option 1 is most suitable, but I'm eager to hear opinions on this.
There is also a scenario where the XML files are collected every day into a batch of ~30k files. For that case, I have the following questions:
I understand that this question is not super specific, but I would appreciate help.
One concrete question that I have is: does triggering 30k lambdas "at once" pose an issue? The tasks are not time-sensitive, so it's not a problem if "only" 1k lambdas run in parallel, as long as all of them eventually run.
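On that concrete question: S3-triggered Lambda invocations are asynchronous, and invocations that exceed the available concurrency are throttled and retried rather than dropped, so a burst of ~30k events should eventually be worked off. A minimal sketch of keeping the parser within a concurrency budget, assuming a hypothetical function name "xml-parser" and an arbitrary reserved-concurrency value:

```python
# Hypothetical sketch: cap the parser Lambda's concurrency so a burst of
# ~30k S3 events is worked off gradually instead of monopolizing the
# account-wide concurrency pool. "xml-parser" is a placeholder name.
import boto3

lambda_client = boto3.client("lambda")

# Inspect the account-wide concurrency limit (1,000 by default; can be raised).
account = lambda_client.get_account_settings()
print("Account concurrent executions:",
      account["AccountLimit"]["ConcurrentExecutions"])

# Reserve a slice of that limit for the parser; invocations beyond it are
# throttled and, for async (S3-triggered) events, retried automatically later.
lambda_client.put_function_concurrency(
    FunctionName="xml-parser",
    ReservedConcurrentExecutions=200,
)
```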
The cheapest solution so far would be using the distributed map feature of AWS Step Functions.
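For reference, a minimal sketch of what such a Distributed Map state could look like in Amazon States Language, expressed as a Python dict. The bucket name, prefix, function name, and batch/concurrency numbers are placeholder assumptions, not a definitive setup:

```python
import json

# Distributed Map state: list the objects under an S3 prefix and fan them out
# to child Express executions in batches, instead of passing 30k items as input.
distributed_map_state = {
    "Type": "Map",
    "ItemReader": {
        "Resource": "arn:aws:states:::s3:listObjectsV2",
        "Parameters": {"Bucket": "my-xml-bucket", "Prefix": "incoming-xml/"},
    },
    "ItemBatcher": {"MaxItemsPerBatch": 100},  # each child execution gets a batch of keys
    "MaxConcurrency": 500,                     # cap the number of parallel child workflows
    "ItemProcessor": {
        "ProcessorConfig": {"Mode": "DISTRIBUTED", "ExecutionType": "EXPRESS"},
        "StartAt": "ParseXmlBatch",
        "States": {
            "ParseXmlBatch": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": "parse-xml",  # placeholder parser Lambda
                    "Payload.$": "$",             # the batch of S3 keys
                },
                "End": True,
            }
        },
    },
    "End": True,
}

print(json.dumps({"StartAt": "ProcessFiles",
                  "States": {"ProcessFiles": distributed_map_state}}, indent=2))
```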
Regarding the file upload, you need to decide how quickly the data has to be accessible after processing; whether you upload everything in one batch or as the files occur depends on that.
Regardless of the upload interval, I would use the S3 event notification for each new file, batch the files, and process them with Step Functions. To reduce costs, use an Express Workflow.
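As a rough illustration of the glue between the S3 notification and the workflow (the state machine ARN and the passed-through input shape are assumptions), a Lambda subscribed to the bucket's "ObjectCreated" events could start the Express execution like this:

```python
# Hypothetical glue Lambda: triggered by S3 ObjectCreated notifications,
# starts an Express Step Functions execution for the newly arrived keys.
import json
import os
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = os.environ["STATE_MACHINE_ARN"]  # the Express workflow that parses the XML

def handler(event, context):
    # One notification can carry several records; collect all new object keys.
    keys = [record["s3"]["object"]["key"] for record in event.get("Records", [])]
    if not keys:
        return {"started": 0}

    sfn.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps({
            "bucket": event["Records"][0]["s3"]["bucket"]["name"],
            "keys": keys,
        }),
    )
    return {"started": len(keys)}
```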
There are a couple of ways to integrate file transfer from on premises to the cloud. You can write a script that puts the files into S3 directly (see the sketch below), upload them via API Gateway, or use more sophisticated services like AWS DataSync or AWS Storage Gateway. Depending on how stable your connection is, you could also mount S3 directly into your filesystem, as described here.
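A rough sketch of the "script that puts files into S3 directly" option, run on the on-premises host. The bucket name, prefix, source directory, and rename-after-upload convention are all placeholder assumptions:

```python
# Upload locally produced XML files to S3, then mark them as sent so the next
# run does not upload them again. Paths and names are placeholders.
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "my-xml-bucket"
PREFIX = "incoming-xml/"
SOURCE_DIR = Path("/var/spool/xml-out")

def upload_pending_files():
    for xml_file in sorted(SOURCE_DIR.glob("*.xml")):
        key = f"{PREFIX}{xml_file.name}"
        s3.upload_file(str(xml_file), BUCKET, key)
        # Move the file aside so it is not uploaded twice on the next run.
        xml_file.rename(xml_file.with_suffix(".xml.sent"))

if __name__ == "__main__":
    upload_pending_files()
```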