I have a raw event table that I'm working with. It has two columns, `timestamp` and `metadata`. `metadata` holds a raw JSON dump of all the event attributes, but note that each event sends a different set of attributes. I then need to ingest this data into Looker.
Raw table:
timestamp | metadata |
---|---|
2024-04-1 | {"type":"created","title":"test1","due":"2024-04-02","id":12345} |
2024-04-1 | {"type":"confirmed","id":12345} |
2024-04-1 | {"type":"completed","id":12345, "completedby":"johndoe"} |
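The flattening step can be sketched in Python for illustration (in a warehouse you would do this with JSON functions or dbt instead); the column names `type`, `title`, `due`, `id`, `completedby` are taken from the sample payloads above:

```python
import json

# Raw events as (timestamp, metadata) pairs, mirroring the table above.
raw = [
    ("2024-04-1", '{"type":"created","title":"test1","due":"2024-04-02","id":12345}'),
    ("2024-04-1", '{"type":"confirmed","id":12345}'),
    ("2024-04-1", '{"type":"completed","id":12345, "completedby":"johndoe"}'),
]

# The union of attributes seen across all event types.
columns = ["type", "title", "due", "id", "completedby"]

def normalize(rows):
    """Flatten each JSON metadata blob into a fixed set of columns;
    attributes an event did not send come out as None (Option A shape)."""
    out = []
    for ts, meta in rows:
        attrs = json.loads(meta)
        out.append({"timestamp": ts, **{c: attrs.get(c) for c in columns}})
    return out

normalized = normalize(raw)
```

Rows that did not send an attribute simply get a null in that column, which is exactly the Option A layout below.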
Now I need to normalize it into one column per attribute. Two options:
OPTION A:
timestamp | type | title | due_date | id | completed_by |
---|---|---|---|---|---|
2024-04-1 | created | test1 | 2024-04-02 | 12345 | |
2024-04-1 | confirmed | | | 12345 | |
2024-04-1 | completed | | | 12345 | johndoe |
OPTION B:
timestamp | type | title | due_date | id | completed_by |
---|---|---|---|---|---|
2024-04-1 | created | test1 | 2024-04-02 | 12345 | johndoe |
2024-04-1 | confirmed | test1 | 2024-04-02 | 12345 | johndoe |
2024-04-1 | completed | test1 | 2024-04-02 | 12345 | johndoe |
How should I design the table: should I fill all rows with their respective information (Option B), or should I leave the missing attributes as nulls (Option A)?
Option B is what I would aim for, especially because every row then holds the full set of relevant information, which makes reporting easier.
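A minimal sketch of how the Option A rows could be filled out into the Option B shape, in pure Python for illustration (in practice this would be a window function or a self-join in SQL; the field names mirror the tables above):

```python
option_a = [
    {"timestamp": "2024-04-1", "type": "created", "title": "test1",
     "due_date": "2024-04-02", "id": 12345, "completed_by": None},
    {"timestamp": "2024-04-1", "type": "confirmed", "title": None,
     "due_date": None, "id": 12345, "completed_by": None},
    {"timestamp": "2024-04-1", "type": "completed", "title": None,
     "due_date": None, "id": 12345, "completed_by": "johndoe"},
]

def fill_to_option_b(rows):
    """Spread every non-null attribute across all rows that share an id,
    while keeping each row's own type and timestamp."""
    merged = {}
    for r in rows:
        acc = merged.setdefault(r["id"], {})
        for k, v in r.items():
            if v is not None:
                acc[k] = v
    return [{**merged[r["id"]], "type": r["type"], "timestamp": r["timestamp"]}
            for r in rows]

option_b = fill_to_option_b(option_a)
```

After the fill, the `confirmed` row also carries the title and due date that were only present on the `created` event, and every row carries `completed_by`.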
Also, given that you want to treat the different actions (`created`, `confirmed`, `completed`) as "sub-events", one thing you could do on top of Option B is to add a surrogate key based on the `id`, the `type` and the `timestamp`. See an example below using the dbt_utils package:
```sql
{{ dbt_utils.generate_surrogate_key(['id', 'type', 'timestamp']) }} as sub_event
```
This way, you can have a primary key in your model to define the granularity, as well as e.g. easily count "sub-events" for each user.
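For intuition, the generated key behaves roughly like an md5 hash over the stringified, `-`-joined field values. The sketch below is a Python approximation of that behaviour (the actual SQL dbt_utils compiles to varies by adapter and package version, and its null sentinel differs):

```python
import hashlib

def surrogate_key(*values):
    """Approximate the semantics of dbt_utils.generate_surrogate_key:
    stringify each field (nulls get a sentinel so they still contribute
    to the hash), join with '-', and md5-hash the result."""
    parts = ["<null>" if v is None else str(v) for v in values]
    return hashlib.md5("-".join(parts).encode()).hexdigest()

# One key per sub-event: the same inputs always give the same key,
# while a different type yields a different key for the same id.
k_created = surrogate_key(12345, "created", "2024-04-1")
k_confirmed = surrogate_key(12345, "confirmed", "2024-04-1")
```

Because the key is deterministic, re-running the model produces the same `sub_event` values, which is what makes it usable as a primary key for the model's granularity.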