amazon-web-services, amazon-s3, reporting, amazon-kinesis, amazon-kinesis-analytics

Run Athena every 15 minutes vs Kinesis Data Analytics


I am going to be using Athena for report generation on data available in S3. A lot of it is time series data coming from IoT devices.

Users can request reports spanning many years of data, but most reports will be weekly, monthly or annual.

I am thinking of saving aggregates every 15 minutes, e.g. at 12:00, 12:15, 12:30, 12:45, 1:00, etc. The aggregates must always be calculated on exact 15-minute boundaries, not at offsets such as 12:03 and 12:18. Is this possible with Kinesis Data Analytics? If yes, how?

If not, does scheduling a Lambda function to be triggered every 5-10 minutes and having Athena calculate those aggregates sound like a reasonable approach? Are there any alternatives I should consider?


Solution

  • Kinesis Data Analytics runs Apache Flink, which supports tumbling windows. Flink aligns tumbling windows with the epoch by default, so setting the window size to 15 minutes should give windows that start at 00:00, 00:15, etc.
    https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/datastream/operators/windows/#tumbling-windows

    Since a 15-minute cadence does not really need real-time stream processing, you could also consider writing an AWS Glue job (Apache Spark) and triggering it periodically with Glue's built-in triggers.

    Or you can go with your current solution (Lambda/Athena).

    One of the main decisions here is how much you need to invest in learning Spark or Flink versus writing Athena queries, which (I assume) you already know. I would set aside a limited amount of time to test each approach before picking one. This way you can quickly see where things get complicated.
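To make the "full 15 minutes" requirement concrete, here is a minimal Python sketch of the alignment arithmetic. It is not Flink or Athena code. It only shows how a timestamp can be floored to an epoch-aligned 15-minute boundary, which is the behaviour tumbling windows give you by default, and how a scheduled Lambda could pick the most recent complete window to aggregate no matter when it actually fires. The names `window_start` and `last_complete_window` are made up for illustration, and the sketch assumes timezone-aware UTC timestamps.

```python
from datetime import datetime, timedelta, timezone

# Assumption: windows are 15 minutes long and aligned to the Unix epoch (UTC).
WINDOW = timedelta(minutes=15)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor a timezone-aware timestamp to the start of its epoch-aligned window."""
    # Subtract the time elapsed since the window boundary (timedelta % timedelta).
    return ts - (ts - EPOCH) % size


def last_complete_window(now: datetime, size: timedelta = WINDOW):
    """Return (start, end) of the most recent window that has fully elapsed."""
    end = window_start(now, size)
    return end - size, end


if __name__ == "__main__":
    # A job firing at 12:18 would aggregate the 12:00-12:15 window.
    fired_at = datetime(2023, 1, 1, 12, 18, tzinfo=timezone.utc)
    start, end = last_complete_window(fired_at)
    print(start.isoformat(), end.isoformat())
```

With the Lambda/Athena approach, the two returned boundaries could be used as the lower (inclusive) and upper (exclusive) limits of the timestamp filter in the aggregation query, so the stored aggregates always land on :00, :15, :30 and :45 even if the schedule drifts by a few minutes.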