delta-lakedata-lakehouse

Do you store data in the Delta Lake Silver layer in a normalized format or do you derive it?


I am currently setting up a data lake trying to follow the principles of Delta Lake (landing in bronze, cleaning and merging into silver, and then, if needed, presenting the final view in gold) and have a question about what should be stored in Silver.

For example, if the data in bronze comes in from a REST API and is stored in the JSON form it comes in in this format:

An example looks like:

{ 
'id':12345,
'name':'Test',
'fields':['Hello','this','is','a','test']
}

In the end I want to present this as two tables. One would be the base table and look like:

TABLE 1

| id       | name            |
| -------- | --------------  |
| 12345    | Test            |

And another would look like:

TABLE 2

| id       | field_value    |
| -------- | -------------- |
| 12345    | Hello          |
| 12345    | this           |
| 12345    | is             |
| 12345    | a              |
| 12345    | test           |

My question is, should I pre-process the data in Spark and store the data in silver in separate folders like this:

-- root
---table 1
----file1.parquet
----etc.parquet
---table 2
----file1.parquet
----etc.parquet

Or store it all in silver under one folder and then derive those two tables using TSQL and functions like OPENJSON later?

Thank you for your help or insight!


Solution

  • I do not think there is a real answer to your questions, but here is a stab - based on your explicit example and this reference https://k21academy.com/microsoft-azure/data-engineer/delta-lake/

    My question is, should I pre-process the data in Spark and store the data in silver in separate folders like this: ...

    In short, depends on your use case.