azure-data-factorypipeline

Can I use Azure Data Factory Copy Item to map nested XML to nested JSON


I'm trying to use a Blob source containing nested (1 level) XML documents and convert them to nested JSON (1 level)... similar to the following:

Input XML Document

<RootElement>
  <RootChild>rootchild</RootChild>
  <RootRepeatingElement>
    <RootRepeatingElementChild>rootrepeatingelementchild1</RootRepeatingElementChild>
  </RootRepeatingElement>
  <RootRepeatingElement>
    <RootRepeatingElementChild>rootrepeatingelementchild2</RootRepeatingElementChild>
  </RootRepeatingElement>
</RootElement>

Output JSON Document

{
  "RootChild": "rootchild",
  "RootRepeatingElement": [
    {
      "RootRepeatingElementChild": "rootrepeatingelementchild1"
    },
    {
      "RootRepeatingElementChild": "rootrepeatingelementchild2"
    }
  ]
}

I've tried using the Copy Item mapping, but can't seem to figure out how to do the nested elements.

Is this possible, if so is there a walk through link you can provide or explain how to map the repeating XML elements to a repeating JSON array. If not, is there an alternative?

Thank you.


Solution

  • You can follow below procedure to get your desired Json output:

    Use Mapping dataflow to get the requirement. Add your XML file as source in data flow, add flatten transformation to the source, Select the RootRepeatingElement as the Unroll by field, create input columns as below:

    enter image description here

    Add aggregate transformation to flatten transformation to convert the repeated elements back into a JSON array.

    enter image description here

    Add sink to the aggregate transformation, debug the dataflow, you will get Json as shown below:

    {"RootChild":"rootchild","RootRepeatingElement":[{"RootRepeatingElementChild":"rootrepeatingelementchild1"},{"RootRepeatingElementChild":"rootrepeatingelementchild2"}]} 
    

    enter image description here

    Here is the dataflow Json for your reference:

     {
        "name": "dataflow1",
        "properties": {
            "type": "MappingDataFlow",
            "typeProperties": {
                "sources": [
                    {
                        "dataset": {
                            "referenceName": "srcxml",
                            "type": "DatasetReference"
                        },
                        "name": "source1"
                    }
                ],
                "sinks": [
                    {
                        "dataset": {
                            "referenceName": "Json1",
                            "type": "DatasetReference"
                        },
                        "name": "sink1"
                    }
                ],
                "transformations": [
                    {
                        "name": "flatten1"
                    },
                    {
                        "name": "aggregate1"
                    }
                ],
                "scriptLines": [
                    "source(output(",
                    "          RootElement as (RootChild as string, RootRepeatingElement as (RootRepeatingElementChild as string)[])",
                    "     ),",
                    "     allowSchemaDrift: true,",
                    "     validateSchema: false,",
                    "     ignoreNoFilesFound: false,",
                    "     validationMode: 'none',",
                    "     namespaces: true) ~> source1",
                    "source1 foldDown(unroll(RootElement.RootRepeatingElement),",
                    "     mapColumn(",
                    "          RootChild = RootElement.RootChild,",
                    "          RootRepeatingElement = RootElement.RootRepeatingElement",
                    "     ),",
                    "     skipDuplicateMapInputs: false,",
                    "     skipDuplicateMapOutputs: false) ~> flatten1",
                    "flatten1 aggregate(groupBy(RootChild),",
                    "     RootRepeatingElement = collect(RootRepeatingElement)) ~> aggregate1",
                    "aggregate1 sink(allowSchemaDrift: true,",
                    "     validateSchema: false,",
                    "     umask: 0022,",
                    "     preCommands: [],",
                    "     postCommands: [],",
                    "     skipDuplicateMapInputs: true,",
                    "     skipDuplicateMapOutputs: true) ~> sink1"
                ]
            }
        }
    }