sqlgoogle-bigquerydata-warehousescd

Is it viable to build type 2 history in BiqQuery without knowing primary keys?


I am importing a lot of mainframe extracts into BigQuery daily. Each extract is a full export of all the data available. I've been loading the data into BigQuery and then generating a type 2 history using a SQL MERGE statement, where I join on the primary keys of each table and compare all columns to find differences, closing outdated rows and inserting new/updated rows into a history table. This works fine.

A colleague made the case that we don't need to know the primary keys to do this, we can just treat all data columns together as the unique constraint. With that in mind I created a new MERGE statement, which seems to work just as well as the old one, but without having to know the primary keys, which makes a lot of things easier for us. Here is an example of the new query:

MERGE ods.kfir_history AS main USING (
SELECT FF_NR, FIRMA_PRODENH_TYPE, FIRMA_NAVN_1, FIRMA_NAVN_2, SE_NR, KONCERN_NR, FIRMA_STATUS_DATO, FIRMA_STATUS_DATO_NUL, FIRMA_STATUS, FIRMA_TYPE, ANTAL_ANSAT_DATO, ANTAL_ANSAT_DATO_NUL, ANTAL_ANSAT, ANTAL_ANSAT_NUL, ANTAL_ANSAT_KILDE, FIRMA_STIFTET_DATO, FIRMA_STIFTET_DATO_NUL, AS_REGISTRERET, MOMS_REGISTRERET, FAGLIG_FORENING, HEKTAR, RET_SBH, RET_TIMESTAMP, SUPL_FIRMA_NAVN, SUPL_FIRMA_NAVN_NUL, CVR_NR, P_NR from staging.kfir_new
) AS delta ON   
 IFNULL(main.FF_NR,'null') = IFNULL(delta.FF_NR,'null') AND    
 IFNULL(main.FIRMA_PRODENH_TYPE,'null') = FNULL(delta.FIRMA_PRODENH_TYPE,'null') AND 
 IFNULL(main.FIRMA_NAVN_1,'null') = IFNULL(delta.FIRMA_NAVN_1,'null') AND  
 IFNULL(main.FIRMA_NAVN_2,'null') = IFNULL(delta.FIRMA_NAVN_2,'null') AND 
  IFNULL(main.SE_NR,0) = IFNULL(delta.SE_NR,0) AND IFNULL(main.KONCERN_NR,'null')  = IFNULL(delta.KONCERN_NR,'null') AND
 IFNULL(main.FIRMA_STATUS_DATO,'null') = IFNULL(delta.FIRMA_STATUS_DATO,'null') AND    
IFNULL(main.FIRMA_STATUS_DATO_NUL,'null')= IFNULL(delta.FIRMA_STATUS_DATO_NUL,'null') AND  
 IFNULL(main.FIRMA_STATUS,'null') = IFNULL(delta.FIRMA_STATUS,'null') AND IFNULL(main.FIRMA_TYPE,'null') = IFNULL(delta.FIRMA_TYPE,'null') AND IFNULL(main.ANTAL_ANSAT_DATO,'null') = IFNULL(delta.ANTAL_ANSAT_DATO,'null') AND IFNULL(main.ANTAL_ANSAT_DATO_NUL,'null') = IFNULL(delta.ANTAL_ANSAT_DATO_NUL,'null') AND IFNULL(main.ANTAL_ANSAT,0) = IFNULL(delta.ANTAL_ANSAT,0) AND IFNULL(main.ANTAL_ANSAT_NUL,'null') = IFNULL(delta.ANTAL_ANSAT_NUL,'null') AND IFNULL(main.ANTAL_ANSAT_KILDE,'null') = IFNULL(delta.ANTAL_ANSAT_KILDE,'null') AND IFNULL(main.FIRMA_STIFTET_DATO,'null') = IFNULL(delta.FIRMA_STIFTET_DATO,'null') AND IFNULL(main.FIRMA_STIFTET_DATO_NUL,'null') = IFNULL(delta.FIRMA_STIFTET_DATO_NUL,'null') AND IFNULL(main.AS_REGISTRERET,0) = IFNULL(delta.AS_REGISTRERET,0) AND IFNULL(main.MOMS_REGISTRERET,'null') = IFNULL(delta.MOMS_REGISTRERET,'null') AND IFNULL(main.FAGLIG_FORENING,'null') = IFNULL(delta.FAGLIG_FORENING,'null') AND IFNULL(main.HEKTAR,0) = IFNULL(delta.HEKTAR,0) AND IFNULL(main.RET_SBH,'null') = IFNULL(delta.RET_SBH,'null') AND IFNULL(main.RET_TIMESTAMP,'null') = IFNULL(delta.RET_TIMESTAMP,'null') AND IFNULL(main.SUPL_FIRMA_NAVN,'null') = IFNULL(delta.SUPL_FIRMA_NAVN,'null') AND IFNULL(main.SUPL_FIRMA_NAVN_NUL,'null') = IFNULL(delta.SUPL_FIRMA_NAVN_NUL,'null') AND IFNULL(main.CVR_NR,0) = IFNULL(delta.CVR_NR,0) AND IFNULL(main.P_NR,0) = IFNULL(delta.P_NR,0)
WHEN NOT MATCHED BY SOURCE AND main.SystemTimeEnd = "5999-12-31 23:59:59" --Close all updated records
THEN UPDATE SET SystemTimeEnd=delta.SystemTime, current_timestamp, LastAction = 'U'
WHEN NOT MATCHED BY TARGET --insert all new and updated records
THEN INSERT (FF_NR, FIRMA_PRODENH_TYPE, FIRMA_NAVN_1, FIRMA_NAVN_2, SE_NR, KONCERN_NR, FIRMA_STATUS_DATO, FIRMA_STATUS_DATO_NUL, FIRMA_STATUS, FIRMA_TYPE, ANTAL_ANSAT_DATO, ANTAL_ANSAT_DATO_NUL, ANTAL_ANSAT, ANTAL_ANSAT_NUL, ANTAL_ANSAT_KILDE, FIRMA_STIFTET_DATO, FIRMA_STIFTET_DATO_NUL, AS_REGISTRERET, MOMS_REGISTRERET, FAGLIG_FORENING, HEKTAR, RET_SBH, RET_TIMESTAMP, SUPL_FIRMA_NAVN, SUPL_FIRMA_NAVN_NUL, CVR_NR, P_NR, SystemTime, SystemTimeEnd, Lastupdated, LastAction)
VALUES (delta.FF_NR, delta.FIRMA_PRODENH_TYPE, delta.FIRMA_NAVN_1, delta.FIRMA_NAVN_2, delta.SE_NR, delta.KONCERN_NR, delta.FIRMA_STATUS_DATO, delta.FIRMA_STATUS_DATO_NUL, delta.FIRMA_STATUS, delta.FIRMA_TYPE, delta.ANTAL_ANSAT_DATO, delta.ANTAL_ANSAT_DATO_NUL, delta.ANTAL_ANSAT, delta.ANTAL_ANSAT_NUL, delta.ANTAL_ANSAT_KILDE, delta.FIRMA_STIFTET_DATO, delta.FIRMA_STIFTET_DATO_NUL, delta.AS_REGISTRERET, delta.MOMS_REGISTRERET, delta.FAGLIG_FORENING, delta.HEKTAR, delta.RET_SBH, delta.RET_TIMESTAMP, delta.SUPL_FIRMA_NAVN, delta.SUPL_FIRMA_NAVN_NUL, delta.CVR_NR, delta.P_NR, delta.SystemTime, '5999-12-31 23:59:59', current_timestamp, 'I')

Can anybody tell me if there are any downsides to this approach? Considering BigQuery doesn't deal with primary keys at all, it would be nice to lose the concept completely for us in the solution.


Solution

  • As per my understanding this should not be a problem. Unless your conditions acts on columns which values are not unique, while you have the need to update only that particular datapoint.