jsonsoapazure-data-factory

How do I extract JSON array from SOAP reply in Azure Datafactory


I'm trying to get the JSON-array that in the reply I get from a SOAP service into a JSON file. But i'm having very little success.

The original reply I get is this:

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
    <soap:Body>
        <WebServiceXMLResponse xmlns="http://tempuri.org/">
            <WebServiceXMLResult xsi:type="xsd:string">
                {&amp;quot;messages&amp;quot;:[
                {&amp;quot;Sequence&amp;quot;:&amp;quot;11&amp;quot;,&amp;quot;ScientificName&amp;quot;:&amp;quot;Bos Taurus&amp;quot;},
                {&amp;quot;Sequence&amp;quot;:&amp;quot;12&amp;quot;,&amp;quot;ScientificName&amp;quot;:&amp;quot;Accipitridae&amp;quot;},
                {&amp;quot;Sequence&amp;quot;:&amp;quot;13&amp;quot;,&amp;quot;ScientificName&amp;quot;:&amp;quot;Corvus splendens&amp;quot;}
                ]}
            </WebServiceXMLResult>
        </WebServiceXMLResponse>
    </soap:Body>
</soap:Envelope>

I've created a copy-activity in ADF to transform it to JSON, which gives me this (formatted, it's all on one line):

{"soap:Envelope":
    {"soap:Body":
        {"WebServiceXMLResponse":
            {"WebServiceXMLResult":
                {"xsi:@type":"xsd:string",
                 "_value_":"{&quot;messages&quot;:[{&quot;Sequence&quot;:&quot;11&quot;,&quot;ScientificName&quot;:&quot;Bos Taurus&quot;},{&quot;Sequence&quot;:&quot;12&quot;,&quot;ScientificName&quot;:&quot;Accipitridae&quot;},{&quot;Sequence&quot;:&quot;13&quot;,&quot;ScientificName&quot;:&quot;Corvus splendens&quot;}]}"
                }
            }
        }
    }
}

The string i'm trying to extract is the value of the key "_value_" and then I need to convert this to an array.

After my copy-step there is a dataflow activity in which I have been experimenting with derived columns, select, etc. My idea was to extract the "messages" array, replace the &quot; with actual quotes and then flatten the array. But I keep failing to isolate the string which contains the array.

My goal is a JSON which looks like this:

[
    {
        "Sequence": "11",
        "ScientificName": "Bos Taurus"
    },
    {
        "Sequence": "12",
        "ScientificName": "Accipitridae"
    },
    {
        "Sequence": "13",
        "ScientificName": "Corvus splendens"
    }
]

Data preview looks like this:

Data preview dataflow source

Source options like this:

Source options dataflow source

Can someone point me in the right direction to tackle this?


Solution

  • As your input array is in string type, you are not getting the proper results. You can use the below approach using the combination of Derived column and Aggregate transformations to achieve your requirement. This will work for only the above kind of structure. If input xml string structure changes, then you need to change the string patterns below according to that.

    After your XML source, first take a derived column and create a new column as shown below.

    one - toString(unfold(slice(split({soap:Envelope}.{soap:Body}.WebServiceXMLResponse.WebServiceXMLResult.{_value_},'{'),3)))
    

    enter image description here

    First, it splits the input string on "{" and gives an array with required strings. Then, unfolds the array and creates a string column of all array elements.

    enter image description here

    Now, take another Derived column transformation and create two columns Sequence and ScientificName with following expressions.

    Sequence - split(split(one,'&quot;Sequence&quot;:&quot;')[2],'&')[1]
    
    ScientificName - split(split(one,'&quot;ScientificName&quot;:&quot;')[2],'&')[1]
    

    enter image description here

    This will extract the required Sequences and Scientific Name values and stores in the above string columns.

    To create a JSON array of above two columns, take an Aggregate transformation and use the below expression without grouping any column.

    arr - collect(@(Sequence=Sequence,ScientificName=ScientificName))
    

    enter image description here

    This will create the required array column. Add the JSON dataset as sink for the dataflow and upon running the dataflow, you can see it generated the required JSON structure in the result file.

    enter image description here