I am importing a lot of mainframe
extracts into BigQuery
daily. Each extract is a full export of all the data available. I've been loading the data into BigQuery
and then generating a type 2 history using a SQL MERGE
statement, where I join on the primary keys of each table and compare all columns to find differences, closing outdated rows and inserting new/updated rows into a history table. This works fine.
A colleague made the case that we don't need to know the primary keys to do this, we can just treat all data columns together as the unique constraint. With that in mind I created a new MERGE
statement, which seems to work just as well as the old one, but without having to know the primary keys, which makes a lot of things easier for us. Here is an example of the new query:
MERGE ods.kfir_history AS main USING (
SELECT FF_NR, FIRMA_PRODENH_TYPE, FIRMA_NAVN_1, FIRMA_NAVN_2, SE_NR, KONCERN_NR, FIRMA_STATUS_DATO, FIRMA_STATUS_DATO_NUL, FIRMA_STATUS, FIRMA_TYPE, ANTAL_ANSAT_DATO, ANTAL_ANSAT_DATO_NUL, ANTAL_ANSAT, ANTAL_ANSAT_NUL, ANTAL_ANSAT_KILDE, FIRMA_STIFTET_DATO, FIRMA_STIFTET_DATO_NUL, AS_REGISTRERET, MOMS_REGISTRERET, FAGLIG_FORENING, HEKTAR, RET_SBH, RET_TIMESTAMP, SUPL_FIRMA_NAVN, SUPL_FIRMA_NAVN_NUL, CVR_NR, P_NR from staging.kfir_new
) AS delta ON
IFNULL(main.FF_NR,'null') = IFNULL(delta.FF_NR,'null') AND
IFNULL(main.FIRMA_PRODENH_TYPE,'null') = FNULL(delta.FIRMA_PRODENH_TYPE,'null') AND
IFNULL(main.FIRMA_NAVN_1,'null') = IFNULL(delta.FIRMA_NAVN_1,'null') AND
IFNULL(main.FIRMA_NAVN_2,'null') = IFNULL(delta.FIRMA_NAVN_2,'null') AND
IFNULL(main.SE_NR,0) = IFNULL(delta.SE_NR,0) AND IFNULL(main.KONCERN_NR,'null') = IFNULL(delta.KONCERN_NR,'null') AND
IFNULL(main.FIRMA_STATUS_DATO,'null') = IFNULL(delta.FIRMA_STATUS_DATO,'null') AND
IFNULL(main.FIRMA_STATUS_DATO_NUL,'null')= IFNULL(delta.FIRMA_STATUS_DATO_NUL,'null') AND
IFNULL(main.FIRMA_STATUS,'null') = IFNULL(delta.FIRMA_STATUS,'null') AND IFNULL(main.FIRMA_TYPE,'null') = IFNULL(delta.FIRMA_TYPE,'null') AND IFNULL(main.ANTAL_ANSAT_DATO,'null') = IFNULL(delta.ANTAL_ANSAT_DATO,'null') AND IFNULL(main.ANTAL_ANSAT_DATO_NUL,'null') = IFNULL(delta.ANTAL_ANSAT_DATO_NUL,'null') AND IFNULL(main.ANTAL_ANSAT,0) = IFNULL(delta.ANTAL_ANSAT,0) AND IFNULL(main.ANTAL_ANSAT_NUL,'null') = IFNULL(delta.ANTAL_ANSAT_NUL,'null') AND IFNULL(main.ANTAL_ANSAT_KILDE,'null') = IFNULL(delta.ANTAL_ANSAT_KILDE,'null') AND IFNULL(main.FIRMA_STIFTET_DATO,'null') = IFNULL(delta.FIRMA_STIFTET_DATO,'null') AND IFNULL(main.FIRMA_STIFTET_DATO_NUL,'null') = IFNULL(delta.FIRMA_STIFTET_DATO_NUL,'null') AND IFNULL(main.AS_REGISTRERET,0) = IFNULL(delta.AS_REGISTRERET,0) AND IFNULL(main.MOMS_REGISTRERET,'null') = IFNULL(delta.MOMS_REGISTRERET,'null') AND IFNULL(main.FAGLIG_FORENING,'null') = IFNULL(delta.FAGLIG_FORENING,'null') AND IFNULL(main.HEKTAR,0) = IFNULL(delta.HEKTAR,0) AND IFNULL(main.RET_SBH,'null') = IFNULL(delta.RET_SBH,'null') AND IFNULL(main.RET_TIMESTAMP,'null') = IFNULL(delta.RET_TIMESTAMP,'null') AND IFNULL(main.SUPL_FIRMA_NAVN,'null') = IFNULL(delta.SUPL_FIRMA_NAVN,'null') AND IFNULL(main.SUPL_FIRMA_NAVN_NUL,'null') = IFNULL(delta.SUPL_FIRMA_NAVN_NUL,'null') AND IFNULL(main.CVR_NR,0) = IFNULL(delta.CVR_NR,0) AND IFNULL(main.P_NR,0) = IFNULL(delta.P_NR,0)
WHEN NOT MATCHED BY SOURCE AND main.SystemTimeEnd = "5999-12-31 23:59:59" --Close all updated records
THEN UPDATE SET SystemTimeEnd=delta.SystemTime, current_timestamp, LastAction = 'U'
WHEN NOT MATCHED BY TARGET --insert all new and updated records
THEN INSERT (FF_NR, FIRMA_PRODENH_TYPE, FIRMA_NAVN_1, FIRMA_NAVN_2, SE_NR, KONCERN_NR, FIRMA_STATUS_DATO, FIRMA_STATUS_DATO_NUL, FIRMA_STATUS, FIRMA_TYPE, ANTAL_ANSAT_DATO, ANTAL_ANSAT_DATO_NUL, ANTAL_ANSAT, ANTAL_ANSAT_NUL, ANTAL_ANSAT_KILDE, FIRMA_STIFTET_DATO, FIRMA_STIFTET_DATO_NUL, AS_REGISTRERET, MOMS_REGISTRERET, FAGLIG_FORENING, HEKTAR, RET_SBH, RET_TIMESTAMP, SUPL_FIRMA_NAVN, SUPL_FIRMA_NAVN_NUL, CVR_NR, P_NR, SystemTime, SystemTimeEnd, Lastupdated, LastAction)
VALUES (delta.FF_NR, delta.FIRMA_PRODENH_TYPE, delta.FIRMA_NAVN_1, delta.FIRMA_NAVN_2, delta.SE_NR, delta.KONCERN_NR, delta.FIRMA_STATUS_DATO, delta.FIRMA_STATUS_DATO_NUL, delta.FIRMA_STATUS, delta.FIRMA_TYPE, delta.ANTAL_ANSAT_DATO, delta.ANTAL_ANSAT_DATO_NUL, delta.ANTAL_ANSAT, delta.ANTAL_ANSAT_NUL, delta.ANTAL_ANSAT_KILDE, delta.FIRMA_STIFTET_DATO, delta.FIRMA_STIFTET_DATO_NUL, delta.AS_REGISTRERET, delta.MOMS_REGISTRERET, delta.FAGLIG_FORENING, delta.HEKTAR, delta.RET_SBH, delta.RET_TIMESTAMP, delta.SUPL_FIRMA_NAVN, delta.SUPL_FIRMA_NAVN_NUL, delta.CVR_NR, delta.P_NR, delta.SystemTime, '5999-12-31 23:59:59', current_timestamp, 'I')
Can anybody tell me if there are any downsides to this approach? Considering BigQuery
doesn't deal with primary keys at all, it would be nice to lose the concept completely for us in the solution.
As per my understanding this should not be a problem. Unless your conditions acts on columns which values are not unique, while you have the need to update only that particular datapoint.