I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging tables has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding and extra "sub row" key to the dataset like this:
Row_number() over(partition by “all data columns” order by SystemTime) as data_row_nr
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think this would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You will solve the technical problem, but I doubt there will be some business value. First of all you should distinct – the tables you get with primary key are dimensions and you can recognise changes and build history.
But the table without PK are most probably fact tables (i.e. transaction records) that are typically not full loaded but loaded based on some DELTA criterion.
Anyway you will never be able to recognise an update in those records, only possible change is insert (deletes are typically not relevant as data warehouse keeps longer history that the source system).
So my todo list
Check if the dup are intended or illegal
Try to find a delta criterion to load the fact tables
If everything fails, make the primary key of all columns with a single attribute of the number of duplicates and build the history.