Tags: xml, apache-spark, dataframe, pyspark, apache-spark-xml

Read XML in Spark


I am trying to read XML/nested XML in PySpark using the spark-xml jar.

df = sqlContext.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "hierachy") \
    .load("test.xml")

When I execute this, the DataFrame is not created properly:

    +--------------------+
    |                 att|
    +--------------------+
    |[[1,Data,[Wrapped...|
    +--------------------+

The XML format I have is shown below:

(XML sample was shown as an image in the original post)
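The image itself is lost; based on the output shown in the answer below, the file presumably looks something like the reconstruction in this sketch (an inference, not the asker's actual file):

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of test.xml, inferred from the answer's
# output table; the real file was only shown as an image in the question.
sample = """
<hierarchy>
  <att>
    <Order>1</Order>
    <attval>Data</attval>
    <children>
      <att><Order>1</Order><attval>Studyval</attval></att>
      <att><Order>2</Order><attval>Site</attval></att>
    </children>
  </att>
  <att>
    <Order>2</Order>
    <attval>Info</attval>
    <children>
      <att><Order>1</Order><attval>age</attval></att>
      <att><Order>2</Order><attval>gender</attval></att>
    </children>
  </att>
</hierarchy>
"""

root = ET.fromstring(sample)
# findall matches direct children only, so this finds the two
# top-level <att> elements, i.e. the two rows in the answer's DataFrame.
rows = root.findall("att")
print(len(rows))  # 2
```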


Solution

  • hierarchy should be the rootTag and att should be the rowTag:

    df = spark.read \
        .format("com.databricks.spark.xml") \
        .option("rootTag", "hierarchy") \
        .option("rowTag", "att") \
        .load("test.xml")
    

    and you should get

    +-----+------+----------------------------+
    |Order|attval|children                    |
    +-----+------+----------------------------+
    |1    |Data  |[[[1, Studyval], [2, Site]]]|
    |2    |Info  |[[[1, age], [2, gender]]]   |
    +-----+------+----------------------------+
    

    and the schema:

    root
     |-- Order: long (nullable = true)
     |-- attval: string (nullable = true)
     |-- children: struct (nullable = true)
     |    |-- att: array (nullable = true)
     |    |    |-- element: struct (containsNull = true)
     |    |    |    |-- Order: long (nullable = true)
     |    |    |    |-- attval: string (nullable = true)
    
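    If you then need a flat table rather than the nested children struct, you can explode the children.att array. A minimal sketch, recreating the nested structure in memory (hypothetical data standing in for test.xml, which is not available here):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, explode

    spark = SparkSession.builder.master("local[1]").appName("flatten").getOrCreate()

    # Hypothetical data matching the schema the answer shows;
    # in practice this would come from spark.read.format("com.databricks.spark.xml").
    data = [
        (1, "Data", {"att": [{"Order": 1, "attval": "Studyval"},
                             {"Order": 2, "attval": "Site"}]}),
        (2, "Info", {"att": [{"Order": 1, "attval": "age"},
                             {"Order": 2, "attval": "gender"}]}),
    ]
    schema = "Order long, attval string, children struct<att: array<struct<Order: long, attval: string>>>"
    df = spark.createDataFrame(data, schema)

    # explode turns each element of the children.att array into its own row.
    flat = (df
        .select("Order", "attval", explode(col("children.att")).alias("child"))
        .select("Order", "attval",
                col("child.Order").alias("childOrder"),
                col("child.attval").alias("childAttval")))
    flat.show(truncate=False)
    ```

    This yields one row per child attribute (four rows for the two parent rows above), which is usually easier to work with than the nested array.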

    Find more information in the Databricks spark-xml documentation.