After a lot of research and head picking, I'm still unable to find a good/clean solution to convert an entity-attribute-value-timestamp table to an scd type 2 dimension.
Here's the issue: I have a CRM source that stores all history in a EAVT model (Entity/Attribute/Value of the attribute/valid_from/valid_to). So for every object (Company, product...etc) I have a table with the current state that is in a relational model, and another history table that contains all value changes to all attributes with a valid_from/valid_to column for validity of the values themselves.
I want to be able to merge these two tables into an SCD table with a Valid_To/Valid_From and a column per attribute.
To give an example: Company has two tables:
company_id | name | number_of_employees | city |
---|---|---|---|
1 | Company 1 | 500 | Paris |
2 | Company 2 | 500 | Paris |
company_id | attribute | value | valid_from | valid_to |
---|---|---|---|---|
1 | city | New York | 01/01/2020 | 01/05/2022 |
1 | city | Paris | 01/05/2022 | 12/31/9999 |
1 | number_of_employees | 50 | 01/01/2021 | 01/01/2022 |
1 | number_of_employees | 100 | 01/01/2022 | 12/31/9999 |
What I want to have as a result is the following:
company_id | name | city | number_of_employees | valid_from | valid_to | is_active |
---|---|---|---|---|---|---|
1 | Company 1 | New York | null | 01/01/2020 | 01/01/2021 | false |
1 | Company 1 | New York | 50 | 01/01/2021 | 01/01/2022 | false |
1 | Company 1 | New York | 100 | 01/01/2022 | 01/01/2022 | false |
1 | Company 1 | Paris | 100 | 01/05/2022 | 12/31/9999 | true |
So based on this example, we have a company that started on 01/01/2020 with New York as city and number of employees wasn't populated at that time. We then modified our company to add 50 as the number of employees, this happened on 01/01/2021. We modified our company again on 01/01/2022 to change the number of employees to 100, only to change the city of the company from New York to Paris on 01/05/2021.
This gives us 4 states for the company, so our SCD should contain a row per state or 4 rows. The dates should be calculated to overlap and valid_from should be set to the valid_to of the attribute that changed from the "history" table, and valid_to should be set to the valid_from of the attribute that changed from the "history" table.
To add more complexity to the task, imagine we have about 120 attributes but also if a company was never changed (just created and still has the same state from creation) then it won't exist in the "Current State" table. So in our example, Company 2 will not exist in the history table at all and will have to be read from the first table into the SCD (union between current table and history result table). Fun right! :)
To give you a sense of the technical environment, the CRM is hubspot, data is replicated from hubspot to BigQuery and the reporting tool is Power BI.
I have tried to use pivoting in both Power BI and BigQuery, which is the standard solution when it comes to EAV model tables, but I'm stuck at the calculation of the valid/from valid/to in the result SCD. ( example of using the pivoting here: https://dba.stackexchange.com/questions/20275/solutions-for-reporting-off-of-an-eav-structured-database )
I need one process that can be applied to multiple tables (because this example is only for company, but I have also other objects that I need to convert into SCD). So what is the best way to convert this EAVT data into an SCD without falling into a labyrinth of hard code and performance issues? And how to calculate the valid_from/valid_to dynamically<
Whether it's BigQuery or Power Query or just theoretical, any solutions, tips, ideas or just plain opinion is highly appreciated as this is the last step into the adoption of a whole data culture in the company I work for, and if I cannot make this, well... my credibility will be hit! so please help a fellow lost IT professional! :D
Too broad question - but anyway, below is just to give you an idea. Obviously it does not cover all cases - but hope you can work it further out
select company_id, city, number_of_employees, min(day) valid_from, max(day) valid_to
from (
select * from (
select company_id, attribute, value, day
from history,
unnest(generate_date_array(date(valid_from), if(valid_to = '9999-12-31', date('2222-12-31'), date(valid_to)))) day
)
pivot (any_value(value) for attribute in ('city', 'number_of_employees'))
)
group by company_id, city, number_of_employees
if applied to sample data as in your question
with history as (
select 1 company_id, 'city' attribute, 'New York' value, '2020-01-01' valid_from, '2022-01-05' valid_to union all
select 1, 'city', 'Paris', '2022-01-05', '2222-12-31' union all
select 1, 'number_of_employees', '50', '2021-01-01', '2022-01-01' union all
select 1, 'number_of_employees', '100', '2022-01-01', '2222-12-31'
)
output is