apache-spark, pyspark, apache-spark-sql, apache-spark-xml

Reading XML File Through Dataframe


I have an XML file in the format below.

<nt:vars>
<nt:var id="1.3.0" type="TimeStamp"> 89:19:00.01</nt:var>
<nt:var id="1.3.1" type="OBJECT ">1.9.5.67.2</nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:vars>

I built a DataFrame on it using the code below. Although the code displays 3 rows and retrieves the id and type fields, it is not displaying the actual values, which are 89:19:00.01, 1.9.5.67.2, and AB-CD-EF.

spark.read.format("xml").option("rootTag","nt:vars").option("rowTag","nt:var").load("/FileStore/tables/POC_DB.xml").show()

Could you please tell me whether I need to add any other options to the line above to bring in the values as well?


Solution

  • You can instead specify rowTag as nt:vars:

    df = spark.read.format("xml").option("rowTag","nt:vars").load("file.xml")
    
    df.printSchema()
    root
     |-- nt:var: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- _VALUE: string (nullable = true)
     |    |    |-- _id: string (nullable = true)
     |    |    |-- _type: string (nullable = true)
    
    df.show(truncate=False)
    +-------------------------------------------------------------------------------------------+
    |nt:var                                                                                     |
    +-------------------------------------------------------------------------------------------+
    |[[ 89:19:00.01, 1.3.0, TimeStamp], [1.9.5.67.2, 1.3.1, OBJECT ], [AB-CD-EF, 1.3.9, STRING]]|
    +-------------------------------------------------------------------------------------------+
    

    And to get the values as separate rows, you can explode the array of structs:

from pyspark.sql import functions as F

    df.select(F.explode('nt:var')).show(truncate=False)
    +--------------------------------+
    |col                             |
    +--------------------------------+
    |[ 89:19:00.01, 1.3.0, TimeStamp]|
    |[1.9.5.67.2, 1.3.1, OBJECT ]    |
    |[AB-CD-EF, 1.3.9, STRING]       |
    +--------------------------------+
    

    Or if you just want the values:

    df.select(F.explode('nt:var._VALUE')).show()
    +------------+
    |         col|
    +------------+
    | 89:19:00.01|
    |  1.9.5.67.2|
    |    AB-CD-EF|
    +------------+
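
    If you also want to keep the id and type next to each value, one option (a short sketch based on the schema above; the column aliases id, type, and value are just illustrative names) is to explode the array and then select the struct fields individually:

    from pyspark.sql import functions as F

    # One row per nt:var element, with each struct field as its own column
    (df.select(F.explode('nt:var').alias('var'))
       .select(F.col('var._id').alias('id'),
               F.col('var._type').alias('type'),
               F.col('var._VALUE').alias('value'))
       .show(truncate=False))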