hadoop, bigdata, data-warehouse, datamart

Handling Big Data in a Data Warehouse


I am a learner in Big Data concepts. Based on my understanding, Big Data is critical for handling unstructured data and high volumes. When we look at the big data architecture for a data warehouse (DW), data from the source is extracted through Hadoop (HDFS and MapReduce), the relevant unstructured information is converted into valid business information, and finally the data is injected into the DW or data mart through ETL processing (along with the existing structured data processing).

However, I would like to know what new techniques, dimensional models, or storage requirements the DW needs as a result of Big Data, since most of the tutorials and resources I have found only talk about Hadoop at the source, not at the target. How does the introduction of Big Data impact an organization's predefined reports and ad hoc analysis, given this high volume of data?

Appreciate your response.


Solution

  • That is a very broad question, but I'll try to give some answers.

    Hadoop can be a data source, a data warehouse, or a "data lake": a repository of data from which warehouses and marts may be drawn.

    The line between Hadoop and RDBMS-based data warehouses is increasingly blurred. As SQL-on-Hadoop becomes a reality, interacting with Hadoop-based data becomes increasingly easy. To be effective, though, there must be structure in the data.
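    As an illustration of that need for structure, here is a minimal sketch of exposing raw delimited files in HDFS to SQL-on-Hadoop through a Hive external table; the table, column names, delimiter, and path are all hypothetical:

    ```sql
    -- Sketch (hypothetical names/paths): project a schema onto raw tab-delimited
    -- files in HDFS so that Hive, Impala, or Spark SQL can query them in place.
    CREATE EXTERNAL TABLE web_clicks_raw (
        click_ts   TIMESTAMP,
        user_id    BIGINT,
        url        STRING,
        referrer   STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/web_clicks/';  -- assumed HDFS directory of raw files
    ```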

    Common examples of Hadoop/DW interaction include using Hadoop as a staging/ELT area that feeds the relational warehouse, offloading cold or raw history from the warehouse to cheaper HDFS storage, querying HDFS-resident data from the warehouse through external-table features such as SQL Server PolyBase, and serving queries directly with SQL-on-Hadoop engines such as Hive, Impala, or Spark SQL.

    That said, Hadoop-based DW is still immature. It is not quite as performant as an RDBMS-based DW, lacks many security and operational features, and is still limited in SQL capability. Think carefully about your needs before taking this path.

    Another question you should ask is whether you actually need a platform of this type. Any RDBMS can handle 3-5 TB of data. SQL Server and PostgreSQL are two examples of platforms that would handle a DW of that size on commodity hardware with negligible administration.

    Those same RDBMS platforms can handle 100 TB workloads, but they require much more care and feeding at that scale.

    MPP RDBMS appliances handle data workloads into the petabyte range, with lower administrative and operational overhead as they scale. I doubt you will get to that scale; very few companies do :) You might still choose an MPP appliance for a much smaller data volume if speed of complex queries is your most important factor. I've seen MPP appliances deployed on data volumes as small as 5 TB for this reason.

    Depending on the load technique, you will probably find that an RDBMS-based DW is faster to load than Hadoop. For example, I load hundreds of thousands of rows per second into PostgreSQL, and slightly fewer into SQL Server. It takes substantially longer to achieve the same result in Hadoop, because I have to ingest the file, establish it in Hive, and move it to Parquet to get a similar level of query performance. Over time I expect this to change in Hadoop's favour, but it isn't quite there yet.
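    To make that contrast concrete, here is a minimal sketch of the two load paths, with hypothetical table names and file paths. PostgreSQL's COPY takes the flat file straight into the target table, while the Hive route lands the file in a delimited staging table (defined along the lines of the earlier external-table sketch) and then rewrites it into Parquet:

    ```sql
    -- PostgreSQL: one bulk-load step from a delimited file into the fact table.
    COPY sales_fact (sale_ts, store_id, product_id, quantity, amount)
    FROM '/staging/sales_20240101.csv'
    WITH (FORMAT csv, HEADER true);

    -- HiveQL: typically two steps -- land the file in a delimited staging table,
    -- then rewrite it into a Parquet-backed table for comparable query speed.
    LOAD DATA INPATH '/staging/sales_20240101.csv' INTO TABLE sales_stage;

    CREATE TABLE sales_fact_parquet STORED AS PARQUET AS
    SELECT * FROM sales_stage;
    ```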

    You mentioned dimensional modelling. If your star schema consists of transactional fact tables and SCD0/SCD1 dimensions, and therefore needs insert-only processing, you might have success with SQL-on-Hadoop. If you need to update facts (accumulating snapshots) or dimensions (SCD2, SCD3), you might struggle with both capability and performance: many implementations don't yet support UPDATE statements, and those that do are slow.
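    To show why that matters, here is a minimal sketch of a type 2 (SCD2) change for a customer dimension, with hypothetical table and column names; the UPDATE that closes the old row is exactly the operation many SQL-on-Hadoop engines either lack or execute slowly:

    ```sql
    -- Close out the current version of the changed customer row...
    UPDATE dim_customer
    SET    row_end_date = CURRENT_DATE,
           is_current   = FALSE
    WHERE  customer_id  = 1042
      AND  is_current   = TRUE;

    -- ...then insert the new version as the current row.
    INSERT INTO dim_customer
        (customer_id, customer_name, city, row_start_date, row_end_date, is_current)
    VALUES
        (1042, 'Jane Doe', 'Leeds', CURRENT_DATE, DATE '9999-12-31', TRUE);
    ```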

    Sorry that there isn't a simple "Do this!" answer, but this is a complex topic in an immature field. I hope these comments help your thinking.