azure, azure-pipelines, azure-data-factory

How to select only the necessary files from the retrieved file names and sort them


I want to copy some files from an FTP server into Azure Blob Storage using an ADF pipeline.

I want to create a pipeline configured as in the following outline (only the relevant part, not the whole configuration); a rough JSON sketch of this chain follows the list.

Get Metadata: Gets all file names under the specified directory.

Set Variable: Stores the retrieved file names in a variable of array type.

For Each: Retrieves the items one by one from the stored variable and copies the data.
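
For reference, here is a minimal, trimmed sketch of how I imagine these three activities chained together in pipeline JSON (the activity and variable names are placeholders, and dataset references and other required properties are omitted):

{
    "activities": [
        {
            "name": "Get Metadata1",
            "type": "GetMetadata",
            "typeProperties": {
                "fieldList": [ "childItems" ]
            }
        },
        {
            "name": "Set Variable1",
            "type": "SetVariable",
            "typeProperties": {
                "variableName": "fileNames",
                "value": {
                    "value": "@activity('Get Metadata1').output.childItems",
                    "type": "Expression"
                }
            }
        },
        {
            "name": "ForEach1",
            "type": "ForEach",
            "typeProperties": {
                "items": {
                    "value": "@variables('fileNames')",
                    "type": "Expression"
                },
                "activities": [
                    { "name": "Copy data1", "type": "Copy" }
                ]
            }
        }
    ]
}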

I have some questions here.

  1. Is it possible to filter the filenames retrieved by Get Metadata?
  2. Can the retrieved file names be reordered freely and stored in a variable?

For example, suppose the return value of the Get Metadata activity is as follows.

{
    "childItems": [
        {
            "name": "archive",
            "type": "Folder"
        },
        {
            "name": "Test_JP_20241110124416.zip",
            "type": "File"
        },
        {
            "name": "Test_JP_20241111123911.zip",
            "type": "File"
        },
        {
            "name": "Test_JP_20241111135003.zip",
            "type": "File"
        },
        {
            "name": "Test_JP_20241111023251.zip",
            "type": "File"
        }
    ],
    "effectiveIntegrationRuntime": "SelfHostedIntegrationRuntime",
    "executionDuration": 1,
    "durationInQueue": {
        "integrationRuntimeQueue": 3
    },
    "billingReference": {
        "activityType": "PipelineActivity",
        "billableDuration": [
            {
                "meterType": "SelfhostedIR",
                "duration": 0.016666666666666666,
                "unit": "Hours"
            }
        ]
    }
}

Now, the directory contains 1 folder and 4 files. There are 4 conditions for the files to be imported:

  1. The type of the child item is "File".
  2. The file name includes "JP".
  3. The date part of the timestamp in the file name is today.
  4. Files are imported in order from the oldest to the newest timestamp in the file name.

In this case, I would like the following files to be imported, in this order:

  1. Test_JP_20241111023251.zip
  2. Test_JP_20241111123911.zip
  3. Test_JP_20241111135003.zip
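
For reference, the timestamp in the file names is the date followed by the time:

Test_JP_<yyyyMMdd><HHmmss>.zip
e.g. Test_JP_20241111023251.zip  ->  date = 20241111, time = 023251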

Note that the order in the Get Metadata return value is different from the order in which I want the files to be imported.

Does anyone know any tips? Any answers would be helpful.

Thank you. Best regards.


Solution

  • You can follow the approach below to achieve your requirement:

    List the files and folders from the source using a Get Metadata activity, then add a Filter activity after it, with Items set to @activity('Get Metadata1').output.childItems and the following Condition:

     @and(
        equals(item().type, 'File'),
        and(
            contains(item().name, 'JP'),
            equals(substring(item().name, 8, 8), utcNow('yyyyMMdd'))
        )
    )
    

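    Here, substring(item().name, 8, 8) extracts the eight-character date portion that starts at zero-based position 8 of the file name, and the condition compares it with today's UTC date. For example:

        substring('Test_JP_20241111023251.zip', 8, 8)  -->  '20241111'
        utcNow('yyyyMMdd')                             -->  '20241111' (when run on 2024-11-11 UTC)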

    Add a ForEach activity with Sequential enabled and Items set to @activity('Filter1').output.value

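    In pipeline JSON this ForEach would look roughly like the sketch below (the activity names are the defaults assumed above, and the inner Copy activity is trimmed):

        {
            "name": "ForEach1",
            "type": "ForEach",
            "dependsOn": [
                { "activity": "Filter1", "dependencyConditions": [ "Succeeded" ] }
            ],
            "typeProperties": {
                "isSequential": true,
                "items": {
                    "value": "@activity('Filter1').output.value",
                    "type": "Expression"
                },
                "activities": [
                    { "name": "Copy data1", "type": "Copy" }
                ]
            }
        }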

    Inside the ForEach, add a Copy activity. For its source, create a dataset with a dataset parameter filename whose value is @item().name; for the sink, create a Blob Storage dataset. Debug the pipeline; after a successful debug run, the required files are copied to the Blob Storage account. After copying, the files in Blob Storage are listed in ascending order by name by default.

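    As a rough sketch (assuming a Binary dataset over the FTP linked service; the dataset, linked service, and folder names below are placeholders), the parameterized source dataset could look like this:

        {
            "name": "FtpSourceFile",
            "properties": {
                "type": "Binary",
                "linkedServiceName": {
                    "referenceName": "<your FTP linked service>",
                    "type": "LinkedServiceReference"
                },
                "parameters": {
                    "filename": { "type": "string" }
                },
                "typeProperties": {
                    "location": {
                        "type": "FtpServerLocation",
                        "folderPath": "<your source folder>",
                        "fileName": {
                            "value": "@dataset().filename",
                            "type": "Expression"
                        }
                    }
                }
            }
        }

    Inside the ForEach, the Copy activity's source input then references this dataset and passes the current item's name:

        "inputs": [
            {
                "referenceName": "FtpSourceFile",
                "type": "DatasetReference",
                "parameters": {
                    "filename": {
                        "value": "@item().name",
                        "type": "Expression"
                    }
                }
            }
        ]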