java, scala, apache-spark, lombok, apache-spark-dataset

In Scala, how to map a Spark Dataset into a list of POJOs?


I have a POJO defined in Java:

@Data
@Builder
public class JavaItem {
    private String name;
}

And I have this code in Scala:

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

case class Record(name: String)

def asJavaItem(record: Record): JavaItem = {
    JavaItem.builder().name(record.name).build()
}

def recordDatasetToListJavaItem(record: Dataset[Record]): java.util.List[JavaItem] = {
    implicit val encoder: Encoder[JavaItem] = Encoders.bean(classOf[JavaItem])
    record.map(asJavaItem).collectAsList() // this fails
}

import spark.implicits._ // assumes an active SparkSession named spark

val recordDataset = Seq(Record("name")).toDS()

recordDatasetToListJavaItem(recordDataset)

I'm getting this error message:

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 24, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 24, Column 11: No applicable constructor/method found for zero actual parameters; candidates are: "JavaItem(java.lang.String)"

Why am I getting this error? I suspect this is a problem with encoders. How do I correctly map the Dataset[Record] to a list of JavaItem?


Solution

  • You need a no-args constructor on the Java POJO. Spark's bean encoder constructs the object first and then calls the setters for the individual fields, which is why a constructor public JavaItem() {} must exist. The error message confirms this: the only candidate constructor the generated code can find is JavaItem(java.lang.String).

    The following Lombok annotations work:

    @Data
    @Builder
    @NoArgsConstructor
    @AllArgsConstructor
    public class JavaItem {
        private String name;
    }
    

    The all-args constructor is also required because, according to this answer, @Builder only adds an all-args constructor if no other constructor is present. Once @NoArgsConstructor is added, @AllArgsConstructor must be declared explicitly so the builder still has a constructor to call.
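
    For reference, here is a minimal end-to-end sketch of the corrected pipeline. It assumes the annotated JavaItem above is compiled and on the classpath; the object name, main method, and local master are illustrative, not part of the original question:

    import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

    object JavaItemMappingDemo {
      case class Record(name: String)

      // Copy the field(s) from the Scala case class into the Java POJO
      def asJavaItem(record: Record): JavaItem =
        JavaItem.builder().name(record.name).build()

      def recordDatasetToListJavaItem(records: Dataset[Record]): java.util.List[JavaItem] = {
        // Works now: Encoders.bean can instantiate JavaItem via its no-args
        // constructor and populate it through the @Data-generated setters
        implicit val encoder: Encoder[JavaItem] = Encoders.bean(classOf[JavaItem])
        records.map(asJavaItem).collectAsList()
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("java-item-demo").getOrCreate()
        import spark.implicits._

        val recordDataset = Seq(Record("name")).toDS()
        // Prints [JavaItem(name=name)] via Lombok's generated toString
        println(recordDatasetToListJavaItem(recordDataset))

        spark.stop()
      }
    }

    If you control the POJO, this combination of annotations keeps the builder API while satisfying the bean encoder's reflection requirements.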