cassandra hbase iot apache-phoenix opentsdb

Real-time analytics time series database

I'm looking for a distributed time series database that is free to use in a cluster setup, is production ready, and fits well in the Hadoop ecosystem.

I have an IoT project with around 150k sensors that send data every 10 minutes or every hour, so I'm looking for a time series database with useful functions such as metric aggregation, downsampling, and pre-aggregation (rollups). I found this comparison in a Google spreadsheet: time series database comparative.

I have tested OpenTSDB. The data model of the HBase row key really suits my use case, but these functions would still need to be developed:

aggregate multiple metrics
do rollups

I have also tested KairosDB, which is a fork of OpenTSDB with a richer API that uses Cassandra as its backend storage. Its API does everything I'm looking for: downsampling, rollups, querying multiple metrics, and a lot more.

I have also tested Warp10.io and Apache Phoenix. I read here (Hortonworks link) that Phoenix will be used by Ambari Metrics, so I assume it is well suited to time series data too.

My question is: as of now, what is the best time series database for real-time analytics with query response times under 1 s for all types of requests? For example: we want the average of the aggregated data sent by 50 sensors over a period of 5 years, resampled by month.

I assume such requests can't be answered in under 1 s on raw data, so I believe we need some rollup/pre-aggregation mechanism. But I'm not sure, because there are a lot of tools out there and I can't decide which one best suits my needs.

Solution

I'm the lead for Warp 10 so my answer can be considered opinionated.

Given your projected data volume, 150k sensors sending data every 10 minutes, that is an average of 250 datapoints per second and fewer than 40 billion datapoints over a period of 5 years. Such a volume can easily fit on a single Warp 10 standalone instance, and if you later need a larger infrastructure you can migrate to a distributed Warp 10 based on Hadoop.
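The figures above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes every sensor reports every 10 minutes (the higher of the two rates in the question) and ignores leap days; the variable names are mine.

```python
# Back-of-the-envelope check of the ingest rate and 5-year volume.
# Assumptions: all 150k sensors report every 10 minutes, 365-day years.
SENSORS = 150_000
INTERVAL_SECONDS = 10 * 60
YEARS = 5

# Average ingest rate across the whole fleet.
points_per_second = SENSORS / INTERVAL_SECONDS

# Readings one sensor produces in 5 years, then the fleet-wide total.
readings_per_sensor = YEARS * 365 * 24 * 60 * 60 // INTERVAL_SECONDS
total_points = SENSORS * readings_per_sensor

print(points_per_second)      # 250.0
print(f"{total_points:,}")    # 39,420,000,000
```

Sensors that report hourly instead of every 10 minutes would only lower both numbers.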

In terms of requests, if your data is already resampled, fetching 5 years of monthly data for 50 sensors is only 3000 datapoints (50 sensors x 12 months x 5 years). Warp 10 can do that in far less than 1 s, and doing the automatic rollups is just a matter of scheduling WarpScript code to run monthly, nothing fancy.
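To make the rollup idea concrete, here is a minimal sketch in plain Python. It is not WarpScript and not how Warp 10 implements it; the function names and the sample readings are made up for illustration. It assumes each rollup stores a sum and a count per UTC month, so that averages across sensors can be recombined correctly, and that "average" means the mean over all readings in the month.

```python
from collections import defaultdict
from datetime import datetime, timezone


def monthly_rollup(points):
    """Roll (unix_seconds, value) pairs up to {(year, month): (sum, count)} in UTC."""
    buckets = defaultdict(lambda: [0.0, 0])
    for ts, value in points:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        bucket = buckets[(t.year, t.month)]
        bucket[0] += value
        bucket[1] += 1
    return {month: (total, count) for month, (total, count) in buckets.items()}


def average_across_sensors(rollups):
    """Merge per-sensor rollups into one mean per month, weighted by reading count."""
    combined = defaultdict(lambda: [0.0, 0])
    for rollup in rollups:
        for month, (total, count) in rollup.items():
            combined[month][0] += total
            combined[month][1] += count
    return {month: total / count for month, (total, count) in sorted(combined.items())}


# Size of the rolled-up result set: 50 sensors, 12 months, 5 years.
rolled_up_points = 50 * 12 * 5

# Tiny made-up example: two sensors with readings in January and February 2020.
jan = datetime(2020, 1, 15, tzinfo=timezone.utc).timestamp()
feb = datetime(2020, 2, 15, tzinfo=timezone.utc).timestamp()
sensor_a = monthly_rollup([(jan, 10.0), (jan + 600, 20.0), (feb, 30.0)])
sensor_b = monthly_rollup([(jan, 30.0), (feb, 50.0)])
monthly_means = average_across_sensors([sensor_a, sensor_b])
print(rolled_up_points, monthly_means)
```

Storing the sum and count rather than a precomputed mean keeps the rollups mergeable: a query over any subset of sensors only has to add up a handful of pairs per month instead of rescanning the raw readings.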

Lastly, in terms of integration with the Hadoop ecosystem, Warp 10 is on top of things: the WarpScript language is integrated in Pig, Spark, Flink and Storm. With the Warp10InputFormat you can fetch data from a Warp 10 platform, or you can load data using any other InputFormat and then manipulate it using WarpScript.