hadoop, hdfs, hbase, hana

Sensor data with SAP HANA and Hadoop/HDFS


I would like to save sensor data in a suitable database. I have 100,000 writes per minute, each about 100 bytes in size. I also want to run analytics on the data.

I am thinking about Hadoop, because it offers many different frameworks for analyzing the data (e.g. Apache Spark).

Now my problem: HBase, a NoSQL database, looks like a suitable solution because its column-family data model allows efficient access to large numbers of columns. But HBase runs on top of HDFS, and HDFS uses a default block size of 64 MB.

What does that mean for me if I have 100-byte records?

Also I would like to run machine learning on top of Hadoop.

Would it be possible to use HBase and SAP HANA together? (SAP HANA can be integrated with Hadoop.)


Solution

  • Let me try to address your points step by step:

    I would like to save sensor data in a suitable database.

    I would suggest something like OpenTSDB running on HBase here, since you want to run a Hadoop cluster anyhow (a minimal OpenTSDB write sketch is included at the end of this answer).

    I have 100,000 writes per minute, each about 100 bytes in size.

    As you correctly point out, small messages/files are an issue for HDFS. They are not an issue for HBase, though: HBase collects small writes in memory and flushes them to HDFS as large store files, so the HDFS block size is abstracted away and you do not need to adjust it.

    A solution like OpenTSDB on HBase, or plain HBase, will handle this load just fine (see the HBase write sketch at the end of this answer).

    That said, since you apparently want to access your data both via HBase and via SAP HANA (which will probably require aggregating the many 100-byte measurements into larger files, because there the HDFS block size does come into play), I would suggest handling incoming data via Kafka first, and then writing from Kafka both into raw HDFS (in a format HANA can read) and into HBase, using separate Kafka consumers. A sketch of this ingestion path is given at the end of this answer.

    Would it be possible to use HBase and SAP HANA together?

    See the explanation above: in my opinion, Kafka (or a similar distributed log) is what you want for ingesting a stream of small messages into multiple stores.

    HDFS uses a default block size of 64 MB. What does that mean for me if I have 100-byte records?

    Also I would like to run machine learning on top of Hadoop.

    Not an issue. HDFS is a distributed system, so you can scale out for more performance and add a machine-learning solution based on Spark (or whatever else you want to run on top of Hadoop) at any time. In the worst case you will have to add another machine to your cluster, but there is no hard limit on how many workloads you can run simultaneously on your data once it is stored in HDFS and your cluster is powerful enough. A minimal Spark example is sketched below.
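
    A minimal sketch of pushing one reading into OpenTSDB through its HTTP API (/api/put). The host name, metric name, value and tags are assumptions for illustration; 4242 is OpenTSDB's default port:

        import java.io.OutputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;
        import java.nio.charset.StandardCharsets;

        public class OpenTsdbPutExample {
            public static void main(String[] args) throws Exception {
                // Hypothetical OpenTSDB endpoint (4242 is the default HTTP port).
                URL url = new URL("http://opentsdb-host:4242/api/put");

                // One data point; metric name, value and tags are made up.
                String json = "{\"metric\":\"sensor.temperature\","
                        + "\"timestamp\":" + (System.currentTimeMillis() / 1000) + ","
                        + "\"value\":23.5,"
                        + "\"tags\":{\"sensor_id\":\"s-001\"}}";

                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("POST");
                conn.setRequestProperty("Content-Type", "application/json");
                conn.setDoOutput(true);
                try (OutputStream out = conn.getOutputStream()) {
                    out.write(json.getBytes(StandardCharsets.UTF_8));
                }
                // OpenTSDB replies with 204 No Content when the point was stored.
                System.out.println("HTTP status: " + conn.getResponseCode());
                conn.disconnect();
            }
        }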
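
    Here is the HBase write sketch mentioned above: a single ~100-byte reading written with the HBase Java client. The table name, column family and row-key layout are assumptions, not something prescribed by HBase:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class SensorHBaseWriter {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                conf.set("hbase.zookeeper.quorum", "zk-host"); // assumption: your ZooKeeper quorum

                try (Connection connection = ConnectionFactory.createConnection(conf);
                     Table table = connection.getTable(TableName.valueOf("sensor_readings"))) {

                    // Row key = sensor id + timestamp, a common time-series layout.
                    String rowKey = "s-001#" + System.currentTimeMillis();
                    Put put = new Put(Bytes.toBytes(rowKey));

                    // A single small cell in column family "d"; HBase buffers such cells
                    // in the MemStore and flushes them as large files, so the HDFS
                    // block size is not a concern for the writer.
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("23.5"));
                    table.put(put);
                }
            }
        }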
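
    The ingestion path mentioned above would look roughly like this: every reading is produced to a Kafka topic, one consumer group writes into HBase (or OpenTSDB), and a second, independent consumer group (for example something like a Kafka Connect HDFS sink, or a small Spark job) rolls the messages into files of at least one HDFS block for HANA. A minimal producer sketch; the broker address, topic name and payload format are assumptions:

        import java.util.Properties;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;

        public class SensorKafkaProducer {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "kafka-host:9092"); // assumption: your broker list
                props.put("key.serializer",
                          "org.apache.kafka.common.serialization.StringSerializer");
                props.put("value.serializer",
                          "org.apache.kafka.common.serialization.StringSerializer");

                try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                    // One ~100-byte reading; keying by sensor id keeps the readings of
                    // one sensor ordered within a partition.
                    String payload = "s-001,1500000000,23.5";
                    producer.send(new ProducerRecord<>("sensor-readings", "s-001", payload));
                }
            }
        }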
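
    Finally, once the aggregated files are in HDFS, Spark is just another job on the same cluster. A toy Spark MLlib sketch; the HDFS path, the CSV schema and the choice of a linear regression are all assumptions for illustration:

        import org.apache.spark.ml.feature.VectorAssembler;
        import org.apache.spark.ml.regression.LinearRegression;
        import org.apache.spark.ml.regression.LinearRegressionModel;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.Row;
        import org.apache.spark.sql.SparkSession;

        public class SensorSparkMl {
            public static void main(String[] args) {
                SparkSession spark = SparkSession.builder().appName("sensor-ml").getOrCreate();

                // Assumption: the Kafka->HDFS consumer wrote CSV files with a header
                // like sensor_id,timestamp,temperature,humidity under this path.
                Dataset<Row> readings = spark.read()
                        .option("header", "true")
                        .option("inferSchema", "true")
                        .csv("hdfs:///data/sensor-readings/*.csv");

                // Toy model: predict temperature from timestamp and humidity.
                Dataset<Row> features = new VectorAssembler()
                        .setInputCols(new String[]{"timestamp", "humidity"})
                        .setOutputCol("features")
                        .transform(readings);

                LinearRegressionModel model = new LinearRegression()
                        .setLabelCol("temperature")
                        .setFeaturesCol("features")
                        .fit(features);

                System.out.println("Coefficients: " + model.coefficients());
                spark.stop();
            }
        }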