I am currently setting up a data lake trying to follow the principles of Delta Lake (landing in bronze, cleaning and merging into silver, and then, if needed, presenting the final view in gold) and have a question about what should be stored in Silver.
For example, if the data in bronze comes in from a REST API and is stored in the JSON form it comes in in this format:
An example looks like:
{
'id':12345,
'name':'Test',
'fields':['Hello','this','is','a','test']
}
In the end I want to present this as two tables. One would be the base table and look like:
TABLE 1
| id | name |
| -------- | -------------- |
| 12345 | Test |
And another would look like:
TABLE 2
| id | field_value |
| -------- | -------------- |
| 12345 | Hello |
| 12345 | this |
| 12345 | is |
| 12345 | a |
| 12345 | test |
My question is, should I pre-process the data in Spark and store the data in silver in separate folders like this:
-- root
---table 1
----file1.parquet
----etc.parquet
---table 2
----file1.parquet
----etc.parquet
Or store it all in silver under one folder and then derive those two tables using TSQL and functions like OPENJSON later?
Thank you for your help or insight!
I do not think there is a real answer to your questions, but here is a stab - based on your explicit example and this reference https://k21academy.com/microsoft-azure/data-engineer/delta-lake/
My question is, should I pre-process the data in Spark and store the data in silver in separate folders like this: ...
Yes, I would as JSON takes more time to process. I use JSON for RAW on a current project if it comes in that format and in the REFined Area we store arrays if needed, as opposed to JSON structs. But this is because we use a data Hub approach based on Martin Fowler's Distributed Data Mesh. We have a BUSiness Area where we model the data according to a semantic model.
But for every expert there is an equal and opposite expert. Some would say do it on the fly, like SAP Hana ETL on-the-fly.
For analysis of datasets given to Data Scientist for analysis, or ad hoc analysis, the 2nd approach is fine. The data would be in the Bronze zone. That said gdpr aspects would, could mean refine them to the Silver zone with gdpr aspects removed.
In short, depends on your use case.