azure, azure-data-factory

Azure DataFlow: Sink data to different folders based on error condition


In Azure Data Flow, I'm doing data validation and, based on the Assert transformation, I am adding a new column called Bad Data (using a Derived Column transformation) to distinguish error rows from non-error rows.

Now I want to sink the data to two folders (as Excel/CSV files):

  1. If the data has at least one error row in any column, the whole file will be treated as bad data and written to the 'reject' folder in Azure Blob Storage.
  2. If no error is found, the entire file should be sunk to the 'accept' folder.

I am trying an Aggregate transformation after the Derived Column to get the count of error rows: if the count is greater than zero, the entire data set will be sent to the reject folder, otherwise to the accept folder. However, I am facing an issue with the expression to count error rows. I am using `count(iif(IsValid == true(), 1, 0))`.

Is there a better approach to this?


## Update

Added an Aggregate transformation to get the count of error rows: `sumIf(IsValid == true(), 1)`. (`count()` counts every non-null value, and `iif(..., 1, 0)` always returns a value, so the earlier expression simply returned the total row count; `sum`/`sumIf` adds up only the matching rows.)

Now the challenge is getting the entire data set sunk to either the accept or the reject folder.



## Solution

According to your requirement:

> If the data has at least one error row in any column, it will be treated as bad data and the entire file written to the 'reject' folder in Azure Blob Storage. If no error is found, the entire file should be sunk to the 'accept' folder.

Follow the steps below to achieve this scenario. Assume we have two files in Blob Storage: file1 (a mix of good and bad data) and file2 (good data only).

First, create a dataflow whose source dataset points to the container path where the files are stored.


In the source's Source options, set the 'Column to store file name' property so that every row carries the name of the file it came from; here the column is called filename.

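A minimal data flow script sketch of this source, assuming two string columns Column_1 and Column_2 and a stream named source1 (rowUrlColumn is the script-level name of the 'Column to store file name' setting):

```
source(output(
        Column_1 as string,
        Column_2 as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn: 'filename') ~> source1
```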

Then add a Derived Column transformation named BadDataCol, where we compute a boolean BadData column from our validity condition (e.g. length of a column > 6).

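In data flow script the derived column might look like this (the checked column Column_1 is an assumption; adapt the condition to your own validity rule):

```
source1 derive(BadData = length(Column_1) > 6) ~> BadDataCol
```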

Then we compute the per-file error count with an Aggregate transformation (BadDataCount): group by filename, with the aggregate `ErrorCount = sum(iif(BadData == true(), 1, 0))`.

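A sketch of the aggregate in data flow script; `sum(iif(...))` adds 1 for each bad row of a file, so ErrorCount is 0 only for files with no bad rows:

```
BadDataCol aggregate(groupBy(filename),
    ErrorCount = sum(iif(BadData == true(), 1, 0))) ~> BadDataCount
```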

Now we need to join BadDataCount back onto BadDataCol, with filename as the join condition column and Inner as the join type, so that every row carries its file's ErrorCount.

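A sketch of the join in data flow script (the stream name Join1 and the broadcast setting are assumptions; filename exists on both sides, so the references are stream-qualified):

```
BadDataCol, BadDataCount join(BadDataCol@filename == BadDataCount@filename,
    joinType:'inner',
    broadcast: 'auto') ~> Join1
```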

Then add a Conditional Split with two output streams: Reject with the condition `ErrorCount > 0`, and Accept for everything else.

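In data flow script the split might look like this; the last stream in the list acts as the catch-all for rows matching no condition:

```
Join1 split(ErrorCount > 0,
    disjoint: false) ~> ConditionalSplit1@(Reject, Accept)
```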

Next, add a Select transformation on both the Accept and Reject streams to keep only the original source columns, removing the helper columns (filename, BadData, ErrorCount) created in this dataflow.

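A sketch of one of the two Select transformations (the kept columns Column_1/Column_2 are assumptions; the Accept branch gets an identical select):

```
Reject select(mapColumn(
        Column_1,
        Column_2
    ),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SelectReject
```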

Now add a sink to each of the Reject and Accept streams, creating a separate sink dataset for each, pointing to the respective folder.

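A sketch of one sink in data flow script; the target folder itself is configured in the sink dataset, not here:

```
SelectReject sink(allowSchemaDrift: true,
    validateSchema: false) ~> RejectSink
```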

The complete dataflow chains these steps: source → BadDataCol (derived column) → BadDataCount (aggregate) → Join1 → Conditional Split → Select → sinks.
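For reference, the whole flow assembled as one data flow script sketch (all column and stream names other than BadDataCol, BadDataCount, BadData, filename, and ErrorCount are assumptions):

```
source(output(
        Column_1 as string,
        Column_2 as string
    ),
    allowSchemaDrift: true,
    validateSchema: false,
    rowUrlColumn: 'filename') ~> source1
source1 derive(BadData = length(Column_1) > 6) ~> BadDataCol
BadDataCol aggregate(groupBy(filename),
    ErrorCount = sum(iif(BadData == true(), 1, 0))) ~> BadDataCount
BadDataCol, BadDataCount join(BadDataCol@filename == BadDataCount@filename,
    joinType:'inner',
    broadcast: 'auto') ~> Join1
Join1 split(ErrorCount > 0,
    disjoint: false) ~> ConditionalSplit1@(Reject, Accept)
Reject select(mapColumn(Column_1, Column_2),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SelectReject
Accept select(mapColumn(Column_1, Column_2),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> SelectAccept
SelectReject sink(allowSchemaDrift: true,
    validateSchema: false) ~> RejectSink
SelectAccept sink(allowSchemaDrift: true,
    validateSchema: false) ~> AcceptSink
```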

Finally, create a pipeline with a Data Flow activity that runs this dataflow.


Check the output in the respective folders: file2 (good data only) ends up in the 'accept' folder, while file1, which contains bad rows, ends up in the 'reject' folder.