hadoop, bigdata

Confusion between Operational and Analytical Big Data, and which category does Hadoop fall into?


I can't wrap my head around the basic theoretical concept of 'Operational and Analytical Big Data'.

As I understand it:

  1. Operational Big Data: the branch where we perform read/write operations on big data using specially designed databases (NoSQL). Somewhat similar to ETL in an RDBMS.

  2. Analytical Big Data: the branch where we analyse data in retrospect and draw predictions using techniques like MPP and MapReduce. Somewhat similar to reporting in an RDBMS.

(Please feel free to correct wherever I'm wrong, it's just my understanding.)

So, as I understand it, Hadoop is used for Analytical Big Data, where we just process data for analysis but don't tamper with the original data, and hence it is not an ideal choice for ETL. But recently I came across this article, which advocates using Hadoop for ETL: https://www.datanami.com/2014/09/01/five-steps-to-running-etl-on-hadoop-for-web-companies/


Solution

  • Hadoop (MapReduce) is not an efficient processing layer, IMO, without adequate tweaking, so out of the box the answer is neither. Sure, MapReduce could be used, and under the hood that API is what most higher-level tools depend on, but since those other tools exist, you wouldn't want to write ETL jobs in plain MapReduce.

    You can combine Hadoop with Spark, Presto, HBase, Hive, etc. to unlock those other Operational or Analytical layers; some are useful for reporting use cases, and others are useful for ETL. Again, there are plenty of knobs to tune to get useful results in a reasonable time compared to an RDBMS (or other NoSQL tools). Plus, it usually takes several attempts to learn how best to store data in Hadoop in the first place (hint: not plaintext, and not lots of small files) -- see the sketch after this answer.

    That link is over five years old now and references Flume and Sqoop. Other "web scale" technologies have proven their worth in that time, while Flume and Sqoop have shown their age and can be difficult to configure and manage compared to tools like Apache NiFi.
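
    To make the ETL point concrete, here is a minimal sketch of the kind of job those higher-level tools enable: a PySpark job that reads raw JSON from HDFS, applies a light transform, and writes partitioned Parquet. The paths and column names (RAW_PATH, OUT_PATH, user_id, event_ts, event_type) are hypothetical placeholders, not anything from the question or the linked article; the point is the pattern -- columnar output, date partitioning, and fewer, larger files instead of piles of small plaintext ones.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Hypothetical paths and column names -- adjust for your own cluster/data.
    RAW_PATH = "hdfs:///data/raw/clickstream/*.json"
    OUT_PATH = "hdfs:///data/curated/clickstream_parquet"

    spark = (SparkSession.builder
             .appName("clickstream-etl-sketch")
             .getOrCreate())

    # Extract: read raw JSON lines (plaintext) from HDFS.
    raw = spark.read.json(RAW_PATH)

    # Transform: drop malformed rows, derive a date column for partitioning,
    # and keep only the fields downstream queries need.
    cleaned = (raw
               .dropna(subset=["user_id", "event_ts"])
               .withColumn("event_date", F.to_date(F.col("event_ts")))
               .select("user_id", "event_type", "event_ts", "event_date"))

    # Load: write a columnar, splittable format (Parquet), partitioned by date.
    # coalesce() keeps each write down to a handful of larger files instead of
    # one tiny file per upstream input -- the "small files" problem.
    (cleaned
     .coalesce(8)
     .write
     .mode("overwrite")
     .partitionBy("event_date")
     .parquet(OUT_PATH))

    spark.stop()
    ```

    The same shape works in Hive or Presto SQL (CREATE TABLE ... STORED AS PARQUET followed by INSERT ... SELECT); the tool matters less than landing the data in a query-friendly layout.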