I am learning Hadoop and came across Hive. I learnt that Hive acts as a data warehouse for analytical queries.
If a data warehouse is an aggregation of multiple data sources, why do companies use multiple data sources and then aggregate them? Why can't they write directly to the data warehouse, since that would reduce the overhead of aggregation and processing?
Is it because of the overhead that would occur if all users read from the same data warehouse?
I want to understand why we need databases at all if we have data warehouses, which are highly scalable.
(Only in the context of big companies that generate huge volumes of data, because I understand that most companies can manage with even two databases.)
A data warehouse (DWH) is a bigger layer on top of different databases, while a database is a namespace plus storage for the tables/objects/procedures that correspond to a particular DWH stage.
The final purpose of an analytical DWH is to provide analytical data marts that can be consumed by different analytics/reporting tools: Tableau, QlikView, etc.
The DWH contains data in different stages (databases), aggregated to different degrees, for example:

- LZ - a landing zone, where the data loaded from the different sources is stored as-is;
- ODS - an operational data store, where data from the different sources is combined, cleaned and transformed into fact tables and dimensions, usually in 3NF, with conformed dimensions enriched from several sources;
- DM - a data mart, where aggregated data is stored in a dimensional model: fact tables (possibly aggregated) and dimensions arranged in star/snowflake schemas.

Other databases can be used as well, such as STG for intermediate (staging) data processing.
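As a rough illustration, here is a minimal Python sketch of how those stages might map to schemas and tables. All of the schema and table names are hypothetical and only show a common naming convention, not anything prescribed by Hive or by a particular DWH.

```python
# Hypothetical layer/table layout of a DWH (all names are illustrative only).
DWH_LAYERS = {
    "LZ":  ["lz.salesforce_user", "lz.ga_sessions"],      # raw data as loaded from sources
    "ODS": ["ods.user", "ods.transaction"],               # cleaned, 3NF, conformed dimensions
    "DM":  ["dm.fact_transaction_daily", "dm.dim_user"],  # star-schema data marts
    "STG": ["stg.user_dedup_tmp"],                        # intermediate/staging processing
}

for layer, tables in DWH_LAYERS.items():
    print(f"{layer}: {', '.join(tables)}")
```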
A DWH is not only storage consisting of multiple databases; it also includes the ETL processes for extracting, transforming and loading data at each stage, plus a common architecture and policies.
You decide which layers the DWH will contain and how it should be designed: using a top-down approach (starting from the source systems), bottom-up (starting from dimensional modelling of the data marts), or both.
The DWH does not aggregate (process) any data by itself. For each step and entity you create an ETL process which extracts, transforms and loads the data between the different stages (databases), as in the sketch below.
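For instance, a single LZ -> ODS step for a `User` entity could look roughly like this pandas-based sketch. The column names and the use of pandas are assumptions made for illustration; a real pipeline would more likely run as Hive/Spark SQL jobs driven by an orchestration tool.

```python
# A minimal sketch of one ETL step (LZ -> ODS), with hypothetical column names.
import pandas as pd

def etl_lz_to_ods_user(lz_salesforce: pd.DataFrame, lz_ga: pd.DataFrame) -> pd.DataFrame:
    """Combine and de-duplicate user records from two landing-zone tables."""
    # Extract: take only the columns the ODS model needs.
    sf = lz_salesforce[["user_id", "email", "country"]]
    ga = lz_ga[["user_id", "email", "country"]]

    # Transform: union the sources, normalise values, and drop duplicates,
    # keeping the most recently loaded record per user_id.
    combined = pd.concat([sf, ga], ignore_index=True)
    combined["email"] = combined["email"].str.lower().str.strip()
    ods_user = combined.drop_duplicates(subset="user_id", keep="last")

    # Load: in a real DWH this result would be written to the ODS database.
    return ods_user
```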
Conformed dimensions (the same dimensions reused across different data marts/fact tables) serve as a single source of truth in the DWH.
For example, if you have a `User` table incrementally loaded from different source systems such as Salesforce, Google Analytics, etc., then separate ETL processes load that data into the LZ, another ETL process combines and de-duplicates it into an ODS table, and yet another process loads it into the DM together with `Transactions` data, producing a daily/monthly aggregated Transaction data mart - a star schema that uses the `User` table as a dimension. A sketch of the final aggregation step follows.
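Continuing the example, the final ODS -> DM step could be sketched like this. Again, the column names (`tx_id`, `amount`, `tx_timestamp`, `country`) are hypothetical; the point is only to show the daily aggregation of the fact table and how it joins to the `User` dimension in the star schema.

```python
# A minimal sketch of the DM aggregation step (ODS -> DM), with hypothetical columns.
import pandas as pd

def build_daily_transaction_mart(ods_transactions: pd.DataFrame,
                                 dim_user: pd.DataFrame) -> pd.DataFrame:
    """Aggregate transactions per user per day (the fact table of the star schema)."""
    tx = ods_transactions.copy()
    tx["tx_date"] = pd.to_datetime(tx["tx_timestamp"]).dt.date

    # Aggregate to the fact grain: one row per (user_id, day).
    fact_daily = (tx.groupby(["user_id", "tx_date"], as_index=False)
                    .agg(tx_count=("tx_id", "count"),
                         tx_amount=("amount", "sum")))

    # The fact table keeps only the dimension keys; the join below just shows
    # how a report would resolve attributes from the User dimension.
    report = fact_daily.merge(dim_user[["user_id", "country"]], on="user_id")
    return report
```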
There is also a more recent concept - a data lake - which, in addition to the structured data in 3NF and the data marts, contains semi-structured and unstructured data. This allows data engineers/data miners to explore the unstructured data, find correlations, and eventually build new data marts.