java excel apache-spark hdfs spark-excel

Loading data from an Excel file using Spark (Java) and spark-excel


I want to load data from an Excel file stored in HDFS using a SparkSession (Spark 2.2). My Java code and the exception I get are below.

Dataset<Row> df = session.read()
        .format("com.crealytics.spark.excel")
        .option("location", pathFile)
        .option("sheetName", "Feuil1")
        .option("useHeader", "true")
        .option("treatEmptyValuesAsNulls", "true")
        .option("inferSchema", "true")
        .option("addColorColumns", "false")
        .load(pathFile);

I got this exception:

java.lang.NoSuchMethodError: org.apache.poi.ss.usermodel.Workbook.close()V
    at com.crealytics.spark.excel.ExcelRelation.com$crealytics$spark$excel$ExcelRelation$$getExcerpt(ExcelRelation.scala:81)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:270)
    at com.crealytics.spark.excel.ExcelRelation$$anonfun$inferSchema$1.apply(ExcelRelation.scala:269)
    at scala.Option.getOrElse(Option.scala:121)
    at com.crealytics.spark.excel.ExcelRelation.inferSchema(ExcelRelation.scala:269)
    at com.crealytics.spark.excel.ExcelRelation.<init>(ExcelRelation.scala:97)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:35)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:14)
    at com.crealytics.spark.excel.DefaultSource.createRelation(DefaultSource.scala:8)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)


Solution

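  • It looks like a dependency conflict. A NoSuchMethodError means the Workbook class loaded at runtime does not have the close() method that spark-excel was compiled against, so an older Apache POI is most likely on the classpath. Check whether some of the libraries in your pom/sbt build pull in a different version of Apache POI. You can do this, for instance, with mvn dependency:tree (https://maven.apache.org/plugins/maven-dependency-plugin/examples/resolving-conflicts-using-the-dependency-tree.html) or the equivalent SBT/Gradle command.

    Once you find the conflicting dependency (the one that brings in a POI version where the Workbook.close() method is missing), you can exclude it from your build.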

    Apparently the close() method was added in this commit: https://github.com/apache/poi/commit/47a8f6cf486b974f31ffd694716f424114e647d5
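
If the dependency tree alone does not settle it, a small reflection check can show which jar the Workbook class is actually loaded from at runtime and whether that version has close(). This is only a minimal sketch: PoiClasspathCheck and its two helper methods are hypothetical names made up for illustration, not part of Spark, spark-excel or POI, and it assumes you run it with the same classpath as your Spark job.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper (illustration only, not a library API).
class PoiClasspathCheck {

    // Returns the jar or directory a class was loaded from.
    // Falls back to a placeholder when the class is missing or has no code source
    // (which is the case for classes loaded by the JDK bootstrap loader).
    static String locationOf(String className) {
        try {
            Class<?> cls = Class.forName(className);
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            return source == null ? "(no code source, JDK class?)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "(class not found on classpath)";
        }
    }

    // True if the class exists and exposes a public no-argument method with this name.
    static boolean hasNoArgMethod(String className, String methodName) {
        try {
            Method method = Class.forName(className).getMethod(methodName);
            return method != null;
        } catch (ClassNotFoundException | NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String workbook = "org.apache.poi.ss.usermodel.Workbook";
        System.out.println("Workbook loaded from: " + locationOf(workbook));
        System.out.println("Has close(): " + hasNoArgMethod(workbook, "close"));
    }
}
```

If the printed location points to a POI jar you did not expect, that jar is probably the one to exclude or override.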