Tags: azure, mapping, azure-data-factory, sftp, sink

Using Data Flow activity inside For each activity in Azure Data Factory


I'm working on a pipeline in which I get the names of the files I need to process from an SFTP server using the Get Metadata activity, then filter the file names to keep just the files I need. After that, I need to load each of those files into an Azure SQL table, so I'm trying to use the ForEach activity. When I tried using the Copy activity to process the files, it worked fine, but I also need two additional columns inserted into the Azure SQL table: the date and time of processing and the file name.

First of all, when I use the Data Flow, it's not detecting any of the columns in the file (https://i.sstatic.net/6bR0T.png) and when I try to map the columns in the sink activity, I only get the derived columns I created. I also tried using the Copy Activity but I don't really know how to add derived columns there.

Does anyone know if this sink/transformation can be accomplished using the Copy activity or the Data Flow Activity?


Solution

  • First of all, when I use the Data Flow, it's not detecting any of the columns in the file

    Based on the structure of your pipeline, inside the ForEach activity you are passing every file name to the data flow using dataset parameters.

    Because you are transforming all the files dynamically, one file per iteration, the source schema was left set to None when the dataset was created.
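
    For reference, a parameterized delimited-text dataset of this kind looks roughly like the sketch below (the dataset, linked service, and folder names are placeholders, not taken from your pipeline). Note the empty schema and the file name bound to the dataset parameter:

    ```json
    {
        "name": "SftpCsvFile",
        "properties": {
            "type": "DelimitedText",
            "linkedServiceName": {
                "referenceName": "SftpLinkedService",
                "type": "LinkedServiceReference"
            },
            "parameters": {
                "fileName": { "type": "string" }
            },
            "typeProperties": {
                "location": {
                    "type": "SftpLocation",
                    "folderPath": "incoming",
                    "fileName": {
                        "value": "@dataset().fileName",
                        "type": "Expression"
                    }
                },
                "columnDelimiter": ",",
                "firstRowAsHeader": true
            },
            "schema": []
        }
    }
    ```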

    Normally, the schema for a particular file is imported from the connection/store when the dataset is created, and the data flow source picks up that schema as well.

    when I try to map the columns in the sink activity, I only get the derived columns I created

    You will only see this in data flow debug. When you execute the data flow from the pipeline, it will produce the desired result dynamically, using the schema of the source file in each iteration.

    In the demonstration below, I took one file with the dataset's import schema set to None, so the source shows 0 rows in data flow debug.

    You can see that I am adding one new column, and the sink shows only that one column.

    But when I execute the data flow from the pipeline, it takes the schema dynamically from the file, and the result includes all the existing columns from the source file.
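
    To tie this back to the two extra columns you need, a data flow along the lines below (the names and the SQL dataset are placeholders for illustration) keeps schema drift enabled on the source and sink and adds the file name and processing time in a derived column transformation; the file name arrives as a data flow parameter and currentTimestamp() supplies the processing time:

    ```json
    {
        "name": "ProcessSftpFile",
        "properties": {
            "type": "MappingDataFlow",
            "typeProperties": {
                "sources": [
                    {
                        "dataset": { "referenceName": "SftpCsvFile", "type": "DatasetReference" },
                        "name": "srcFile"
                    }
                ],
                "sinks": [
                    {
                        "dataset": { "referenceName": "AzureSqlTargetTable", "type": "DatasetReference" },
                        "name": "sinkSql"
                    }
                ],
                "transformations": [
                    { "name": "AddAuditColumns" }
                ],
                "scriptLines": [
                    "parameters{",
                    "     sourceFileName as string",
                    "}",
                    "source(allowSchemaDrift: true,",
                    "     validateSchema: false) ~> srcFile",
                    "srcFile derive(FileName = $sourceFileName,",
                    "     ProcessedDateTime = currentTimestamp()) ~> AddAuditColumns",
                    "AddAuditColumns sink(allowSchemaDrift: true,",
                    "     validateSchema: false,",
                    "     format: 'table',",
                    "     deletable: false,",
                    "     insertable: true,",
                    "     updateable: false,",
                    "     upsertable: false) ~> sinkSql"
                ]
            }
        }
    }
    ```

    With auto mapping left on in the sink, the drifted source columns plus the two derived columns are written to the SQL table at run time, which matches the behavior described above.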

    So, the behavior above won't affect your output. If you want to cross-check the original data while building the transformation, point the dataset at one of your source files and import its schema. After importing the schema, switch the file name back to the dataset parameter so each iteration still processes its own file.
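
    After importing, the properties section of the dataset changes from an empty schema to the detected columns, while the file name stays bound to the parameter (the column names below are just an example):

    ```json
    {
        "typeProperties": {
            "location": {
                "type": "SftpLocation",
                "folderPath": "incoming",
                "fileName": {
                    "value": "@dataset().fileName",
                    "type": "Expression"
                }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        },
        "schema": [
            { "name": "CustomerId", "type": "String" },
            { "name": "OrderDate", "type": "String" },
            { "name": "Amount", "type": "String" }
        ]
    }
    ```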

    The data flow will then pick up this imported schema, and you will be able to see the data preview while building your transformations.
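
    Putting it together in the pipeline, the ForEach activity iterates over the filtered file list and passes each file name to both the source dataset parameter and the data flow parameter used for the FileName column. A rough sketch, reusing the placeholder names from the earlier snippets (activity names are assumptions, and the exact JSON your factory generates may differ slightly):

    ```json
    {
        "name": "ForEachFile",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@activity('FilterFiles').output.value",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "LoadFileToSql",
                    "type": "ExecuteDataFlow",
                    "typeProperties": {
                        "dataflow": {
                            "referenceName": "ProcessSftpFile",
                            "type": "DataFlowReference",
                            "parameters": {
                                "sourceFileName": {
                                    "value": "'@{item().name}'",
                                    "type": "Expression"
                                }
                            },
                            "datasetParameters": {
                                "srcFile": {
                                    "fileName": "@item().name"
                                }
                            }
                        },
                        "compute": {
                            "coreCount": 8,
                            "computeType": "General"
                        }
                    }
                }
            ]
        }
    }
    ```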