azure-data-factorydataflowtask

ADF data flow etl schema mapping issue


I'm working with data flows in data factory to create a dynamic data flow which has parameterized source/sink/surrogate key names to reuse across many files and accomplishes the following tasks:

  1. Join source parquet to existing delta destination to find new/updated records.
  2. Insert new rows with a numeric surrogate key
  3. compare hashes between source and existing destination and determine if record is different
  4. Update existing rows that that have a difference in the record hash

I am able to get data throughout the flow, but I am running into an issue during the sinks. It seems that I am unable to insert records from my incoming stream to the sink because of a schema mismatch. sample schema below:

Source stream final schema where I've highlighted the columns that I don't want in my delta table: enter image description here

I am aware of the mismatch between source/sink. It's caused by additional work that goes into creating surrogate keys and determining if records have changed.

When I turn off auto-mapping I'm able to map columns and drop columns without any issues but then I lose the ability to make this flow dynamic. Are there settings in the sink that will allow auto-mapping to drop columns from the source that don't match destination?

Things I've tried:

Unchecking "Allow schema drift". To me this makes sense, I don't want my destination schema to drift, but when I uncheck this, the auto-mapping doesn't seem to work and gives me the error: "Error at Sink 'sink1': The result has 0 output columns. Please ensure at least one column is mapped"

Checking "Allow schema drift". Checking this makes the automapper work but causes the error "Job failed due to reason: at Sink 'sink1': A schema mismatch detected when writing to the Delta table"

Now, I can also enable mergeSchema but then I will be writing the 5 columns I don't want into my delta table. Is there a way to get my sink to work such that automapping will map to columns it finds and drop columns it can't find so that I don't have to write a bunch of transient columns to my table?


Solution

  • enter image description here

    enter image description here

    enter image description here