apache-spark apache-spark-sql apache-spark-dataset apache-spark-xml

Custom schema with nested parent node in spark-xml


I am fairly new to spark-xml and am finding it difficult to prepare a custom schema for my object. Any help would be appreciated. Below is what I have tried.

I am using Spark 1.4.7 and spark-xml version 0.3.5

Test.java

StructType customSchema = new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),

    DataTypes.createStructField("names", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("test", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
});

final JavaRDD<Row> map = spoofRDD()
    .map(book -> RowFactory.create(
        book.getId(),
        book.getName(),
        book.getNames()));

final DataFrame df = sqlContext.createDataFrame(map, customSchema);
df.show();
df.printSchema();



private JavaRDD<Book> spoofRDD() {
    Book book1 = Book.builder().id("1").name("Name1")
        .names(new String[]{"1", "2"}).build();
    List<Book> books = new ArrayList<>();
    books.add(book1);

    return javaSparkContext.parallelize(books);
}

My POJO class Book.java

private final String id;
private final String name;
private final String[] names;

My expected XML

<books>
<book>
    <id>1</id>
    <name>Name1</name>
    <parent>
        <names>1</names>
        <names>2</names>
    </parent>
</book>
<book>
    <id>2</id>
    <name>Name2</name>
    <parent>
        <names>1</names>
        <names>2</names>
    </parent>
</book>
</books>

As you can see, I want the names tags nested inside a parent tag. How can I modify my customSchema to achieve this?


Solution

  • A correct schema for the desired XML output is:

    root
     |-- id: long (nullable = true)
     |-- name: string (nullable = true)
     |-- parent: struct (nullable = true)
     |    |-- names: array (nullable = true)
     |    |    |-- element: long (containsNull = true)
    

    while your current schema is:

    root
     |-- id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- names: struct (nullable = true)
     |    |-- test: array (nullable = true)
     |    |    |-- element: string (containsNull = true)
    

    So the only things you have to change are the field names, from test to names and from names to parent, and, if the values should be numeric rather than strings, the element type of the array. With the String[] from your Book class, the schema becomes:

    new StructType(new StructField[]{
      new StructField("id", DataTypes.StringType, true, Metadata.empty()),
      new StructField("name", DataTypes.StringType, true, Metadata.empty()),
    
      DataTypes.createStructField("parent", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("names", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
    })
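
To see why the field names and nesting matter, here is a plain-Java sketch (no Spark dependency) of how a struct containing an array maps onto XML: a struct field becomes a nested element, and an array field becomes one repeated element per value. The `XmlShapeSketch` class and its `render` and `bookXml` helpers are hypothetical and only model this mapping; they are not part of spark-xml.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class XmlShapeSketch {

    // Map = struct (nested element), List = array (one element per item),
    // anything else = scalar text content.
    public static String render(String tag, Object value) {
        StringBuilder sb = new StringBuilder();
        if (value instanceof List) {
            for (Object item : (List<?>) value) {
                sb.append(render(tag, item));
            }
        } else if (value instanceof Map) {
            sb.append('<').append(tag).append('>');
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                sb.append(render(String.valueOf(e.getKey()), e.getValue()));
            }
            sb.append("</").append(tag).append('>');
        } else {
            sb.append('<').append(tag).append('>')
              .append(value)
              .append("</").append(tag).append('>');
        }
        return sb.toString();
    }

    // Builds one book shaped like the target schema: parent is a struct
    // whose single field, names, is an array.
    public static String bookXml() {
        Map<String, Object> parent = new LinkedHashMap<>();
        parent.put("names", Arrays.asList("1", "2"));

        Map<String, Object> book = new LinkedHashMap<>();
        book.put("id", "1");
        book.put("name", "Name1");
        book.put("parent", parent);
        return render("book", book);
    }

    public static void main(String[] args) {
        System.out.println(bookXml());
    }
}
```

The outer field name (`parent`) becomes the wrapping tag and the inner array field name (`names`) becomes the repeated tag, which is why a schema using `names`/`test` in those positions cannot produce the expected document.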
    

    The real problem is the data. Because parent has to be a struct, the output of getNames should be wrapped in a Row:

    .map(book -> RowFactory.create(
        book.getId(),
        book.getName(),
        // cast to Object so the array is one field, not spread as varargs
        RowFactory.create((Object) book.getNames())));
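
One Java detail worth checking when wrapping an array this way: `RowFactory.create` takes `Object...`, and a `String[]` passed as the sole argument is used as the varargs array itself, so each string ends up as a separate field. The plain-Java sketch below uses a hypothetical `create` helper with the same `Object...` parameter shape (it is a stand-in, not the Spark method) to show what a cast to `Object` changes.

```java
public class VarargsSketch {

    // Hypothetical stand-in with the same Object... shape as RowFactory.create;
    // it simply returns the values it received.
    public static Object[] create(Object... values) {
        return values;
    }

    public static void main(String[] args) {
        String[] names = {"1", "2"};

        // The String[] is taken as the varargs array itself: two fields.
        // (javac may warn about an inexact argument type here.)
        Object[] spread = create(names);
        System.out.println(spread.length);   // prints 2

        // Casting to Object passes the array as a single value: one field.
        Object[] wrapped = create((Object) names);
        System.out.println(wrapped.length);  // prints 1
    }
}
```

A struct with one array field needs the second form, a single value that holds the whole array.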