
AWS heartbeat service using API Gateway / Lambda / DynamoDB


I have a local Windows application, and I want to know when it goes offline, with a delay of no more than one minute.

I was thinking of creating an API Gateway API called HeartbeatControllerAPI. This API would call a Lambda, which in turn would write the app's last heartbeat to a DynamoDB table called "Heartbeats" that looks like this:

Machine Name | Last Heartbeat
A            | 2:00 AM
B            | 2:10 AM
C            | 2:05 AM
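The write path described above (API Gateway, then Lambda, then DynamoDB) could look roughly like the following Python sketch. The table name "Heartbeats" is from the question; the attribute names `MachineName` and `LastHeartbeat`, the `machine_name` field in the request body, and the helper `build_update` are assumptions made for illustration. It also assumes API Gateway's Lambda proxy integration, where the request body arrives as a JSON string in `event["body"]`.

```python
import json
import time


def build_update(machine_name, now=None):
    """Build the UpdateItem arguments for one heartbeat (no AWS call here)."""
    now = int(time.time()) if now is None else int(now)
    return {
        "Key": {"MachineName": machine_name},          # assumed partition key
        "UpdateExpression": "SET LastHeartbeat = :ts",  # assumed attribute name
        "ExpressionAttributeValues": {":ts": now},      # epoch seconds
    }


def lambda_handler(event, context, table=None):
    """Record the caller's heartbeat in the Heartbeats table."""
    body = json.loads(event.get("body") or "{}")
    machine_name = body.get("machine_name")
    if not machine_name:
        return {"statusCode": 400,
                "body": json.dumps({"error": "machine_name is required"})}

    if table is None:
        # Imported lazily so the rest of the module can be exercised offline.
        import boto3
        table = boto3.resource("dynamodb").Table("Heartbeats")

    # One write per heartbeat; the item is created if it does not exist yet.
    table.update_item(**build_update(machine_name))
    return {"statusCode": 200, "body": json.dumps({"ok": True})}
```

Storing the timestamp as epoch seconds rather than a formatted time such as "2:00 AM" keeps the later "is this heartbeat older than one minute?" comparison a plain number comparison.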

So suppose I have 3 machines, as follows:

Machine 1 - with the app installed.

Machine 2 - with the app installed.

Machine 3 - with the app installed.

Then I thought that every minute each machine would send its heartbeat as described above, and I would then be able to tell which machines are online and which are offline.
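To make the online/offline decision concrete, here is a small Python sketch of the check against the example table. The function name `offline_machines`, the one-minute threshold parameter and the calendar date are assumptions for illustration; only the times of day come from the table above.

```python
from datetime import datetime, timedelta


def offline_machines(last_heartbeats, now, max_age=timedelta(minutes=1)):
    """Return the names of machines whose last heartbeat is older than max_age."""
    return sorted(
        name for name, seen in last_heartbeats.items() if now - seen > max_age
    )


# Times of day from the example table; the date itself is arbitrary.
heartbeats = {
    "A": datetime(2024, 1, 1, 2, 0),
    "B": datetime(2024, 1, 1, 2, 10),
    "C": datetime(2024, 1, 1, 2, 5),
}

# Checked at 2:10:30 AM, only B has reported within the last minute.
print(offline_machines(heartbeats, now=datetime(2024, 1, 1, 2, 10, 30)))
# ['A', 'C']
```

Note that this check has to run over every machine's row, which is where a table scan comes in once the number of machines grows.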

Is this the right approach? I'm worried, from a billing and load perspective, about what would happen if I had 1 million such machines, all of them calling API Gateway and updating the DynamoDB table every minute.
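The scale in question can be estimated with simple arithmetic. The Python sketch below only counts requests; it assumes a 30-day month and that each heartbeat costs one API Gateway request, one Lambda invocation and one DynamoDB write. Prices are deliberately left out, since they vary by region and over time.

```python
def heartbeat_volume(machines, interval_seconds=60, days=30):
    """Return (heartbeats per second, heartbeats per month) for a fleet."""
    per_second = machines / interval_seconds
    per_month = machines * (86_400 // interval_seconds) * days
    return per_second, per_month


per_second, per_month = heartbeat_volume(1_000_000)
print(round(per_second))  # 16667
print(per_month)          # 43200000000
```

So one million machines reporting every minute means a sustained rate of roughly 16,700 requests and table writes per second, or about 43.2 billion of each per month, compared with 129,600 per month for the three machines in the example.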


Solution

  • Typically a heartbeat works the other way around. In your case, you would have a Lambda triggered by a CloudWatch event (a cron event, fired every X minutes), and that Lambda would call your machines and confirm that they are up and running. This way, your machines are left to do their own work, and your heartbeat function (the Lambda) confirms they are working every X minutes.

    This would also mean you only need a single Lambda for a lot of machines (depending on performance, you can increase the number of Lambdas once you reach something like 50 machines, maybe).

    So if you follow this approach, your Lambda is triggered by the CloudWatch cron event and checks whether each machine is running. For each machine that is up, it updates the table with the status. For each machine that isn't, it could send a message to an SNS topic (which you subscribe to) so that you are notified when a machine goes down. Notification is a lot more complicated when the machine is the one calling: a machine that is down sends nothing, so you have to scan the table for stale last-updated times, which can get costly and inefficient.
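The polling approach described in this answer could be sketched in Python as follows. Everything specific here is an assumption for illustration: the environment variables `MACHINES` (a JSON map of machine name to health-check URL) and `ALERT_TOPIC_ARN`, the attribute names `MachineName` and `LastSeen`, and the idea that each machine exposes an HTTP endpoint the Lambda can reach. The core loop takes its dependencies as plain callables so it can be run without AWS.

```python
import json
import os
import time
import urllib.request


def http_is_alive(url, timeout=3):
    """Hypothetical health check: the machine is up if its URL answers 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, HTTP error
        return False


def poll_machines(machines, is_alive, record_status, notify_down, now=None):
    """Check every machine once; record the ones that answer, report the rest."""
    now = int(time.time()) if now is None else int(now)
    down = []
    for name in machines:
        try:
            alive = is_alive(name)
        except Exception:
            alive = False  # a failing check counts as "down"
        if alive:
            record_status(name, now)
        else:
            down.append(name)
            notify_down(name)
    return down


def lambda_handler(event, context):
    """Entry point for the scheduled (cron) trigger."""
    import boto3  # imported lazily so poll_machines can be tested offline

    urls = json.loads(os.environ["MACHINES"])   # assumed: {"A": "http://..."}
    topic_arn = os.environ["ALERT_TOPIC_ARN"]   # assumed SNS topic
    table = boto3.resource("dynamodb").Table("Heartbeats")
    sns = boto3.client("sns")

    def record_status(name, ts):
        table.update_item(
            Key={"MachineName": name},
            UpdateExpression="SET LastSeen = :ts",
            ExpressionAttributeValues={":ts": ts},
        )

    def notify_down(name):
        sns.publish(TopicArn=topic_arn, Subject="Machine down",
                    Message=f"{name} did not respond to the heartbeat check")

    down = poll_machines(urls, lambda n: http_is_alive(urls[n]),
                         record_status, notify_down)
    return {"checked": len(urls), "down": down}
```

Because the Lambda learns about a failure at the moment the check fails, it can publish the alert directly, and no scan of the table is needed to find stale rows.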