I am fairly new to spark-xml and am having trouble preparing a custom schema for my object. Any help would be appreciated. Below is what I have tried.
I am using Spark 1.4.7 and spark-xml version 0.3.5.
Test.java
StructType customSchema = new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
    DataTypes.createStructField("names", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("test", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
});

final JavaRDD<Row> map = spoofRDD()
    .map(book -> RowFactory.create(
        book.getId(),
        book.getName(),
        book.getNames()));
final DataFrame df = sqlContext.createDataFrame(map, customSchema);
df.show();
df.printSchema();

private JavaRDD<Book> spoofRDD() {
    Book book1 = Book.builder().id("1").name("Name1")
        .names(new String[]{"1", "2"}).build();
    List<Book> books = new ArrayList<>();
    books.add(book1);
    return javaSparkContext.parallelize(books);
}
My POJO class, Book.java:
private final String id;
private final String name;
private final String[] names;
My expected XML:
<books>
  <book>
    <id>1</id>
    <name>Name1</name>
    **<parent>**
      <names>1</names>
      <names>2</names>
    **</parent>**
  </book>
  <book>
    <id>2</id>
    <name>Name2</name>
    **<parent>**
      <names>1</names>
      <names>2</names>
    **</parent>**
  </book>
</books>
So, as you can see, I want the names tags nested inside a parent tag. How can I modify my customSchema to achieve this?
A correct schema for the desired XML output is:
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
|-- parent: struct (nullable = true)
| |-- names: array (nullable = true)
| | |-- element: long (containsNull = true)
while your current schema is:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- names: struct (nullable = true)
| |-- test: array (nullable = true)
| | |-- element: string (containsNull = true)
So the only things you have to change here are the tag names, from test to names and from names to parent, and the value type for the array contents.
new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, true, Metadata.empty()),
    new StructField("name", DataTypes.StringType, true, Metadata.empty()),
    // "parent" is the struct; "names" is the array nested inside it.
    // StringType is kept here because Book stores these values as strings.
    DataTypes.createStructField("parent", DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("names", DataTypes.createArrayType(DataTypes.StringType),
            true)}), true)
})
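To see how the target shape (parent as a struct holding a names array) lines up with the expected XML, here is a rough plain-Java sketch. It is not spark-xml code: NestedXmlSketch, render and bookXml are hypothetical names, a Map stands in for a struct and a List for an array, and it assumes a struct becomes a nested element while an array repeats its tag once per element. It does no XML escaping.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for illustration only; this is not spark-xml code.
public class NestedXmlSketch {

    // Renders one field: a Map plays the role of a struct (nested element),
    // a List plays the role of an array (the tag is repeated per element),
    // anything else is written as a simple text element.
    static void render(String tag, Object value, String indent, StringBuilder out) {
        if (value instanceof Map) {
            out.append(indent).append('<').append(tag).append(">\n");
            for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
                render(String.valueOf(e.getKey()), e.getValue(), indent + "  ", out);
            }
            out.append(indent).append("</").append(tag).append(">\n");
        } else if (value instanceof List) {
            for (Object item : (List<?>) value) {
                render(tag, item, indent, out);
            }
        } else {
            out.append(indent).append('<').append(tag).append('>')
               .append(value).append("</").append(tag).append(">\n");
        }
    }

    // Builds one book with the same nesting as the target schema.
    static String bookXml() {
        Map<String, Object> parent = new LinkedHashMap<>();
        parent.put("names", Arrays.asList("1", "2"));

        Map<String, Object> book = new LinkedHashMap<>();
        book.put("id", "1");
        book.put("name", "Name1");
        book.put("parent", parent);

        StringBuilder out = new StringBuilder();
        render("book", book, "", out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(bookXml());
    }
}
```

The point of the sketch is only the nesting: the array has to sit one level down, inside the struct, for its repeated tags to end up wrapped in parent.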
The real problem is the data. Because parent has to be a struct, the getNames output should be wrapped with Row:
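.map(book -> RowFactory.create(
    book.getId(),
    book.getName(),
    // Cast to Object so the String[] is passed as one field instead of being expanded as varargs.
    RowFactory.create((Object) book.getNames())));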
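One Java detail is worth checking when an array is wrapped this way. RowFactory.create is declared as create(Object... values), and a String[] passed on its own is taken as the varargs array itself, so the inner Row would get one field per name rather than a single array field. The sketch below shows the difference using a hypothetical stand-in method, fieldCount, that has the same varargs shape; it does not touch Spark at all.

```java
// Hypothetical stand-in with the same varargs shape as RowFactory.create(Object... values);
// it only reports how many fields such a call would receive.
public class VarargsSketch {

    static int fieldCount(Object... values) {
        return values.length;
    }

    public static void main(String[] args) {
        String[] names = {"1", "2"};

        // The String[] is used directly as the varargs array: two fields.
        // (javac reports an "inexact argument type" warning for this call.)
        System.out.println(fieldCount(names));

        // The cast makes the whole array a single argument: one field.
        System.out.println(fieldCount((Object) names));
    }
}
```

With a struct that has a single array field, the second form is the one that matches the schema, which is why casting book.getNames() to Object before wrapping it is the safer choice.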