In azure data flow, I'm doing a data validation and based on the the Assert Transformation, I am adding a new column called Bad Data (using derived column transformation) to identify error data and non-error data.
now I want to sink the data to two folders (as excel/csv file)
I am trying an aggregate transformation after derived col to get the count of error data, if it is greater than 1, entire data will be sent to reject else to accept folder but I am facing issue with the expression to count error rows. I am using expression count(iif(IsValid== true(), 1, 0))
Is there a better approach to this?
##Update
Added an aggregate transformation to get the count of error data
sumIf(IsValid == true(), 1)
Now the challenge is getting the entire data sink to either Accept or reject folder.
According to your requirement: If the data has at least one row error in any column, it will be treated as bad data and entire data to be written into 'reject' folder in Azure blob. If no error is found, then the entire file should be sinked to 'accept' folder.
Follow the below steps to achieve the scenario:
Assuming we have two file, file1 (mixed of good and bad data) and file2 (good data) in blob storage.
We will create a Dataflow where Source will be container path where the files are stored while creating Source Dataset.
In Source Options, do the below to get a column with FileName:
Then, Add DerivedColumn named BadDataCol, where we will check our Validity Condition (e.g. Length of column > 6)
Then we will check the BadDataCount adding Aggregate, GroupBy filename
with condition
ErrorCount = sum(iif(BadData == true(),1,0))
Now, we need to Join BadDataCount and BadDataCol with join Condition Column as filename and joining Type as Inner.
Then, Add Conditional Split where add two streams Reject with condition ErrorCount > 0 and Accept otherwise.
Now we will Select only source columns and removing all other columns created through this dataflow for both Accept and Reject Conditions.
Now Add Sink to each Reject and Accept Conditions creating Different Sink Dataset pointing to respected Folders.
DataFlow:
Finally Create a Pipeline attaching the Dataflow to Run the process successfully.
Check the Output in Respected Folders:
Accept:
Reject: