Tags: java, apache-spark, apache-spark-sql, apache-spark-dataset

Deconstructing Spark SQL Dataset<Row> back into its individual StructFields/columns


Java 11 and Spark SQL 3.3.2 (the Scala 2.13 build, i.e. spark-sql_2.13:3.3.2) here. Please note: I'm using and interested in the Java API and would appreciate Java answers, but I can probably decipher Scala/Python-based answers and do the Scala/Python-to-Java conversion myself if necessary. But Java would be appreciated!


I understand how to create a new Dataset<Row> with a specified schema:

Dataset<Row> dataFrame = sparkSession.emptyDataFrame();

List<StructField> structFields = getSomehow();

StructType schema = DataTypes.createStructType(structFields.toArray(StructField[]::new));
Dataset<Row> ds = sparkSession.createDataFrame(dataFrame.rdd(), schema);

What I'm trying to understand is: how do I do the reverse? How do I turn a Dataset<Row> back into a List<StructField> (its schema; its columns)? I see the ds.schema() method, which returns a StructType, but I'm not sure how to deconstruct that back into a list of individual columns/StructFields. Any ideas?


Solution

  • You were close — a StructType already exposes its fields as a sequence, so you just need to call toList() on it. Note that this returns a Scala immutable list, not a java.util.List:

    ds.schema().toList()
    

    Full code:

    scala.collection.immutable.List<StructField> schemaList = ds.schema().toList();
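
    Since you asked for Java specifically: if you want a plain java.util.List<StructField> without Scala interop, the simpler route is StructType.fields(), which returns a StructField[] that Arrays.asList can wrap. A minimal sketch of the round trip (the field names are made up for illustration; this only needs spark-sql on the classpath, no running SparkSession):

    ```java
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class SchemaRoundTrip {
        public static void main(String[] args) {
            // Build a schema the same way the question does (illustrative fields).
            List<StructField> original = Arrays.asList(
                DataTypes.createStructField("id", DataTypes.LongType, false),
                DataTypes.createStructField("name", DataTypes.StringType, true));
            StructType schema =
                DataTypes.createStructType(original.toArray(new StructField[0]));

            // Deconstruct: fields() returns StructField[] — no Scala types involved.
            List<StructField> roundTripped = Arrays.asList(schema.fields());

            // StructField is a Scala case class, so structural equality holds.
            System.out.println(roundTripped.equals(original)); // should print true
        }
    }
    ```

    If you do end up with the Scala list from toList(), scala.jdk.javaapi.CollectionConverters.asJava(...) (available with the Scala 2.13 build you're on) converts it to a java.util.List.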