Tags: regex, azure, azure-data-factory, wildcard

Copy files based on wildcard file name from SFTP to Blob storage - Azure Data Factory - Copy activity


I have a list of files on an SFTP server, and I would like to copy and move some of them to a Blob storage container. The filenames should match the wildcard: (SCUST_TXT)[0-9]*(-)[0-9]*(-)[0-9]*(.)(CSV).

I have followed the steps below; the pipeline runs fine, but the files are not moved:

  1. Created a new pipeline in Azure Data Factory.
  2. Added the Copy activity.
  3. Created the source dataset and a linked service to connect to the SFTP server.
  4. Created the sink dataset and a linked service to connect to Blob storage.
  5. In the source settings, chose the wildcard file path option and used the above regex as the wildcard file name.

The pipeline runs successfully, but it doesn't move the matching files. On the SFTP server I have 3 CSV files, but only two of them match the wildcard. I was expecting to see those two in the storage container, but that wasn't the case. I'm afraid the wildcard pattern may be wrong. Here is a sample file name: SCUST_TXT20240417-032938-891.CSV.

If I change the wildcard file name to *.CSV, the pipeline moves all three files to the destination (sink), so I assume it's a problem with the wildcard pattern.


Solution

  • Currently, regular expressions are not supported in the ADF wildcard file path; the wildcard file name only accepts glob-style patterns. Regular expressions are only available in ADF data flow expressions.
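
    As an aside: the wildcard file name does support glob-style wildcards such as * (zero or more characters), so if an approximate match is acceptable, a pattern like the one below would match the sample name, though unlike the regex it cannot enforce digits-only segments:

    SCUST_TXT*-*-*.CSV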

    As your source data comes through a self-hosted integration runtime, you cannot point a data flow directly at the source file paths (mapping data flows run only on an Azure integration runtime).

    Instead, pass the source file names from a Get Metadata activity into a data flow and get the list of required file names back from the data flow.

    For this, you first need a dummy CSV file with one header row and one data row in a temporary location in Blob storage; it serves only as a placeholder source for the data flow.
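
    The contents are irrelevant; something as minimal as this (hypothetical values) is enough:

    dummy_col
    dummy_value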


    Now, point a Get Metadata activity at your SFTP dataset, with its path set to the folder, and select the Child Items field.
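
    In pipeline JSON terms, the activity looks roughly like this; the dataset name SftpFolderDataset is a hypothetical placeholder (the activity name Get Metadata1 matches the expression used later):

    {
        "name": "Get Metadata1",
        "type": "GetMetadata",
        "typeProperties": {
            "dataset": {
                "referenceName": "SftpFolderDataset",
                "type": "DatasetReference"
            },
            "fieldList": [ "childItems" ]
        }
    }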

    When you debug it, it returns an output array like the one below.

    "childItems": [
            {
                "name": "LowFixone.csv",
                "type": "File"
            },
            {
                "name": "SCUST_20240417-032938-891.CSV",
                "type": "File"
            },
            {
                "name": "SCUST_TXT20240417-032924-826.CSV",
                "type": "File"
            },
            {
                "name": "SCUST_TXT20240417-032938-891.CSV",
                "type": "File"
            },
            {
                "name": "SCUST_TXT20240420-032938-ok.CSV",
                "type": "File"
            }
        ]
    
    

    Convert the above array to a string like name1,name2,name3, (note the trailing comma) using the expression below.

    @replace(replace(replace(replace(string(split(replace(replace(replace(string(activity('Get Metadata1').output.childItems),'"',''),'[',','),']',''),',type:File}')),'"',''),'[',''),']',''),',{name:','')
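
    Reading the expression inside out, a trace on the sample data (values derived from the childItems above, abbreviated with ...) looks roughly like this:

    string(childItems)                     -> [{"name":"LowFixone.csv","type":"File"},...]
    strip ", turn [ into a comma, strip ]  -> ,{name:LowFixone.csv,type:File},...
    split on ',type:File}', then string()  -> [",{name:LowFixone.csv",...,""]
    strip ", [, ], and ',{name:'           -> LowFixone.csv,...,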
    

    Assign this expression in a Set variable activity to a variable of type String.


    It gives a single comma-separated string of all the file names.
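
    For the sample childItems above, the variable would hold a string like this (derived by tracing the expression; note the trailing comma):

    LowFixone.csv,SCUST_20240417-032938-891.CSV,SCUST_TXT20240417-032924-826.CSV,SCUST_TXT20240417-032938-891.CSV,SCUST_TXT20240420-032938-ok.CSV,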

    In the data flow, use the dummy dataset as the source and create a String parameter with an empty default value.


    Pass the variable to this parameter in the Data flow activity's settings.


    In the data flow, add a derived column transformation and create a new column, named filenames to match the expression used later, with the expression below. Supply your regular expression here; it filters the list down to the required file names.

    mapIf(slice(split($str_param,','),1,size(split($str_param,','))-1),regexMatch(#item, '(SCUST_TXT)[0-9]*(-)[0-9]*(-)[0-9]*(.)(CSV)'),#item)
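
    The slice(...) drops the empty element left by the trailing comma before the filter runs. For the sample string above, only the two names that satisfy the regex survive (SCUST_TXT20240420-032938-ok.CSV fails because ok is not a run of digits), so the column would hold something like:

    ['SCUST_TXT20240417-032924-826.CSV','SCUST_TXT20240417-032938-891.CSV']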
    


    To get the result array back into the pipeline, use a cached sink: set the data flow sink type to Cache so the rows are surfaced in the Data flow activity's output.


    Now debug the pipeline, and the data flow output will contain an array with the required file names.


    You can read this array after the Data flow activity using the expression below; value[0] is the single row coming from the dummy source, and filenames is the derived column created above.

    @activity('Data flow1').output.runStatus.output.sink1.value[0].filenames
    

    You need to use a ForEach activity and pass this array to it. Inside the ForEach, use a Copy activity with SFTP as the source and Blob storage as the sink. Parameterize both the source and sink datasets on the file name so that each iteration can pass its file name (@item()) to those datasets.
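
    A trimmed sketch of that ForEach in pipeline JSON, assuming hypothetical parameterized datasets SftpFileDataset and BlobFileDataset that each take a fileName parameter (copy source and sink store settings omitted for brevity):

    {
        "name": "ForEach1",
        "type": "ForEach",
        "typeProperties": {
            "items": {
                "value": "@activity('Data flow1').output.runStatus.output.sink1.value[0].filenames",
                "type": "Expression"
            },
            "activities": [
                {
                    "name": "Copy data1",
                    "type": "Copy",
                    "inputs": [
                        {
                            "referenceName": "SftpFileDataset",
                            "type": "DatasetReference",
                            "parameters": { "fileName": "@item()" }
                        }
                    ],
                    "outputs": [
                        {
                            "referenceName": "BlobFileDataset",
                            "type": "DatasetReference",
                            "parameters": { "fileName": "@item()" }
                        }
                    ]
                }
            ]
        }
    }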