scala, apache-spark, azure-synapse, apache-spark-xml

How To Read XML File from Azure Data Lake In Synapse Notebook without Using Spark


I have an XML file stored in Azure Data Lake that I need to read from a Synapse notebook. But when I read it using the spark-xml library, I get this error:

org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `d:col`

The sample XML looks like this:

<m:properties>
            <d:FileSystemObjectType m:type="Edm.Int32">0</d:FileSystemObjectType>
            <d:Id m:type="Edm.Int32">10</d:Id>
            <d:Modified m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Modified>
            <d:Created m:type="Edm.DateTime">2021-03-25T15:35:17Z</d:Created>
            <d:ID m:type="Edm.Int32">10</d:ID>
            <d:Title m:null="true" />
            <d:Description m:type="Edm.String">Test</d:Description>
            <d:PurposeCode m:type="Edm.Int32">1</d:PurposeCode>
</m:properties>

Notice that there are tags for both d:Id and d:ID, which cause the duplicate-column error. I found this documentation, which states that column names differing only in case are treated as duplicates: https://learn.microsoft.com/en-us/azure/databricks/kb/sql/dupe-column-in-metadata

However, I cannot modify the XML and have to read it as is. Is there a workaround so I can still read the XML?

Or is there a way to read the XML without using Spark? I am thinking of using the scala.xml.XML library to load and parse the file. But when I attempt this, I get an error:

abfss:/<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml (No such file or directory)

Code snippet below:

import scala.xml.XML
val xml = XML.loadFile("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml")

Note: the error message really does show abfss:/ (a single slash), even though the path passed as the parameter has //.

Thanks.


Solution

  • I found that setting Spark to be case sensitive (spark.sql.caseSensitive is false by default) lets it read the XML successfully:

    spark.conf.set("spark.sql.caseSensitive", "true")
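On the second part of the question (reading without spark-xml): XML.loadFile expects a path on the local filesystem, so it treats the abfss:// URI as an ordinary file name, which would explain both the "No such file or directory" message and the collapsed abfss:/ prefix. Below is a minimal sketch, not a tested Synapse recipe, that opens the file through the Hadoop FileSystem API and passes the resulting stream to scala.xml. The helper name loadXmlFromHadoop is hypothetical, and the sketch assumes the Hadoop configuration passed in already has access to the storage account.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.xml.{Elem, XML}

// Hypothetical helper (not from the original post): opens the file through the
// Hadoop FileSystem API instead of java.io, then parses the stream with scala.xml.
def loadXmlFromHadoop(uri: String, conf: Configuration): Elem = {
  val path = new Path(uri)
  // Resolve the filesystem implementation from the URI scheme (e.g. abfss, file).
  val fs = path.getFileSystem(conf)
  val in = fs.open(path)
  // Parse the whole document in memory; always close the stream afterwards.
  try XML.load(in) finally in.close()
}
```

In a Synapse notebook this could be called as loadXmlFromHadoop("abfss://<container>@<adls>.dfs.core.windows.net/<directory>/<xml_file>.xml", spark.sparkContext.hadoopConfiguration), where spark is the notebook's predefined session. scala.xml matches element labels case-sensitively, so (xml \\ "Id") and (xml \\ "ID") select the two tags separately. The whole file is parsed on the driver, so this only suits files that fit in driver memory.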