Tags: java, apache-spark, apache-spark-sql, spark-excel

How to write a Dataset to an Excel file using the HadoopOffice library in Apache Spark (Java)


Currently I am using com.crealytics.spark.excel to read an Excel file, but with this library I cannot write the Dataset back to an Excel file.
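
For context, the read currently looks roughly like the minimal sketch below. The file path is illustrative, and option names such as useHeader/inferSchema vary between versions of the com.crealytics library, so treat them as assumptions rather than exact settings.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CrealyticsReadExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("CrealyticsReadExample")
                    .master("local[*]")
                    .getOrCreate();

            // Read an Excel sheet into a Dataset<Row>; this is the part that works today.
            Dataset<Row> df = spark.read()
                    .format("com.crealytics.spark.excel")
                    .option("useHeader", "true")   // treat the first row as column names
                    .option("inferSchema", "true") // infer column types from the data
                    .load("c:\\temp\\test1.xlsx");

            df.show();
        }
    }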

This link says that using the HadoopOffice library (org.zuinnote.spark.office.excel) we can both read and write Excel files.

Please help me write a Dataset object to an Excel file in Spark with Java.


Solution

  • You can use org.zuinnote.spark.office.excel for both reading and writing Excel files with Datasets. Examples are given at https://github.com/ZuInnoTe/spark-hadoopoffice-ds/. However, there is one issue if you read the Excel file into a Dataset and then try to write it to another Excel file. Please see the issue and a Scala workaround at https://github.com/ZuInnoTe/hadoopoffice/issues/12.

    I have written a sample program in Java using org.zuinnote.spark.office.excel and the workaround given at that link. Please see if this helps you.

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    
    import org.apache.spark.api.java.function.FlatMapFunction;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;
    
    public class SparkExcel {
        public static void main(String[] args) {
            //spark session
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SparkExcel")
                    .master("local[*]")
                    .getOrCreate();
    
            //Read
            Dataset<Row> df = spark
                    .read()
                    .format("org.zuinnote.spark.office.excel")
                    .option("read.locale.bcp47", "de")
                    .load("c:\\temp\\test1.xlsx");
    
            //Print
            df.show();
            df.printSchema();
    
            //FlatMap function: each input Row contains an array of spreadsheet cell structs;
            //copy the five string fields of every cell into a String[] so it can be written back
            FlatMapFunction<Row, String[]> flatMapFunc = new FlatMapFunction<Row, String[]>() {
                @Override
                public Iterator<String[]> call(Row row) throws Exception {
                    ArrayList<String[]> rowList = new ArrayList<String[]>();
                    List<Row> spreadSheetRows = row.getList(0);
                    for (Row srow : spreadSheetRows) {
                        ArrayList<String> arr = new ArrayList<String>();
                        arr.add(srow.getString(0));
                        arr.add(srow.getString(1));
                        arr.add(srow.getString(2));
                        arr.add(srow.getString(3));
                        arr.add(srow.getString(4));
                        rowList.add(arr.toArray(new String[] {}));
                    }
                    return rowList.iterator();
                }
            };
    
            //Apply flatMap function
            Dataset<String[]> df2 = df.flatMap(flatMapFunc, spark.implicits().newStringArrayEncoder());
    
            //Write
            df2.write()
               .mode(SaveMode.Overwrite)
               .format("org.zuinnote.spark.office.excel")
               .option("write.locale.bcp47", "de")
               .save("c:\\temp\\test2.xlsx");
    
        }
    }
    

    I have tested this code with Java 8 and Spark 2.1.0. I am using Maven and added the dependency for org.zuinnote.spark.office.excel from https://mvnrepository.com/artifact/com.github.zuinnote/spark-hadoopoffice-ds_2.11/1.0.3
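
    For reference, the corresponding pom.xml entry would look like the snippet below; the coordinates are taken directly from the Maven Repository link above.

        <dependency>
            <groupId>com.github.zuinnote</groupId>
            <artifactId>spark-hadoopoffice-ds_2.11</artifactId>
            <version>1.0.3</version>
        </dependency>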