apache-poi, spark3, spark-excel

java.lang.NullPointerException while reading specific sheet from xlsx using org.zuinnote.spark.office.excel


We are trying to read one specific sheet from an Excel file (.xlsx with 3 sheets) into a Spark DataFrame using org.zuinnote.spark.office.excel.

We are using the MSExcelLowFootprintParser.

The code used is:

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.sql.{Dataset, Row, SparkSession}
    import org.apache.spark.sql.types._

    val hadoopConf = new Configuration()
    val spark = SparkSession.builder()
      .appName("ExcelReadingExample")
      .master("local[*]")
      .getOrCreate()

    spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.lowFootprint", "true")
    spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.header.read", "true")
    spark.sparkContext.hadoopConfiguration.set("hadoopoffice.read.sheets", "Most Runs Over - 2010")

    spark.sparkContext.setLogLevel("INFO")

    val schema = StructType(Seq(
      StructField("col", IntegerType, nullable = true),
      StructField("col1", StringType, nullable = true),
      StructField("Runs", IntegerType, nullable = true),
      StructField("BF", IntegerType, nullable = true),
      StructField("SR", DoubleType, nullable = true),
      StructField("s4", IntegerType, nullable = true),
      StructField("s6", IntegerType, nullable = true),
      StructField("Against", StringType, nullable = true),
      StructField("Venue", StringType, nullable = true),
      StructField("Match Date", StringType, nullable = true),
      StructField("Match Time", StringType, nullable = true),
      StructField("Match Partition Time", StringType, nullable = true)
    ))

    val df: Dataset[Row] = spark.read
      .format("org.zuinnote.spark.office.excel")
      .option("hadoopoffice.read.sheets", "Most Runs Over - 2010")
      .option("read.spark.simpleMode", "true")
      .option("read.lowFootprint", "true")
      .option("multiLine", "true")
      .option("read.spark.simpleMode.maxInferRows", "1000")
      .option("read.header.read", "true")
      .schema(schema)
      .load("D:\\excel\\spark-hadoopoffice-ds-s2-ho-1.3.9\\spark-hadoopoffice-ds-s2-ho-1.3.9\\src\\resources\\MostRuns_Over2008.xlsx")

    df.show()

We are facing this exception:

java.lang.NullPointerException
at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.parse(MSExcelLowFootprintParser.java:267)
at org.zuinnote.hadoop.office.format.common.OfficeReader.parse(OfficeReader.java:92)
at org.zuinnote.hadoop.office.format.mapreduce.AbstractSpreadSheetDocumentRecordReader.initialize(AbstractSpreadSheetDocumentRecordReader.java:138)
at org.zuinnote.spark.office.excel.HadoopFileExcelReader.<init>(HadoopFileExcelReader.scala:55)
at org.zuinnote.spark.office.excel.DefaultSource.$anonfun$buildReader$4(DefaultSource.scala:322)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:185)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:240)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$SliceIterator.hasNext(Iterator.scala:266)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at scala.collection.AbstractIterator.to(Iterator.scala:1429)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
at org.apache.spark.rdd.RDD.$anonfun$take$2(RDD.scala:1449)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2281)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)

It works if I read all sheets instead of one sheet. It also works if I don't use the low-footprint parser.


Solution

  • It works if I read all sheets instead of one sheet. It also works if I don't use the low-footprint parser.

    Well observed. Consider filing a bug about this at https://github.com/ZuInnoTe/hadoopoffice/issues.

    Reason for the NullPointerException

    java.lang.NullPointerException
    at org.apache.poi.xssf.eventusermodel.XSSFReader$SheetIterator.getSheetName(XSSFReader.java:325)
    at org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage(MSExcelLowFootprintParser.java:374)
    

    MSExcelLowFootprintParser calls iter.getSheetName() at line 374, where iter is an XSSFReader.SheetIterator. XSSFReader.SheetIterator.getSheetName() in turn calls xssfSheetRef.getName(), but the XSSFSheetRef field xssfSheetRef is null and is not set to anything until XSSFReader.SheetIterator.next() has been called.

    Reading all sheets instead of one works because line 374 of MSExcelLowFootprintParser.java is never reached when no sheet names are given. In that case the first call after while (iter.hasNext()) { is InputStream rawSheetInputStream = iter.next();, which invokes XSSFReader.SheetIterator.next() and thereby sets xssfSheetRef.

    It also works without the low-footprint parser because that code path does not use org.apache.poi.xssf.eventusermodel.XSSFReader at all.

    Solution

    InputStream rawSheetInputStream = iter.next(); needs to be called immediately after while (iter.hasNext()) { in org.zuinnote.hadoop.office.format.common.parser.msexcel.MSExcelLowFootprintParser.processOPCPackage. That sets XSSFReader.SheetIterator.xssfSheetRef and so avoids the NullPointerException.
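The failure mode above is the classic cursor-iterator contract: accessors like getSheetName() are only valid after next() has advanced the cursor. The following self-contained sketch reproduces the pattern with a hypothetical stand-in class (it is not the real POI SheetIterator, which additionally returns the sheet's XML InputStream from next()), showing both the buggy and the fixed call order:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for org.apache.poi.xssf.eventusermodel.XSSFReader.SheetIterator,
// mimicking its contract: getSheetName() dereferences a field that is only
// populated by next().
class SheetIteratorStandIn implements Iterator<String> {
    private final Iterator<String> sheets;
    private String current; // plays the role of xssfSheetRef: null until next()

    SheetIteratorStandIn(List<String> names) {
        this.sheets = names.iterator();
    }

    @Override
    public boolean hasNext() {
        return sheets.hasNext();
    }

    @Override
    public String next() { // advances the cursor, like iter.next()
        current = sheets.next();
        return "stream-for-" + current;
    }

    // Like XSSFReader.SheetIterator.getSheetName(): dereferences the cursor,
    // so it throws NullPointerException if next() was never called.
    String getSheetName() {
        return current.toUpperCase();
    }
}

public class SheetNameContract {
    public static void main(String[] args) {
        SheetIteratorStandIn iter =
            new SheetIteratorStandIn(Arrays.asList("Most Runs Over - 2010"));

        // Buggy order, as in processOPCPackage line 374: the name accessor is
        // called before next() has set the cursor.
        try {
            iter.getSheetName();
        } catch (NullPointerException e) {
            System.out.println("NullPointerException before next()");
        }

        // Fixed order: call next() immediately after hasNext(); afterwards the
        // name accessor is safe to call.
        while (iter.hasNext()) {
            String rawSheetStream = iter.next();
            System.out.println("sheet=" + iter.getSheetName()
                + " via " + rawSheetStream);
        }
    }
}
```

The same reordering applied inside processOPCPackage (consume iter.next() first, then filter on iter.getSheetName()) is what avoids the crash when hadoopoffice.read.sheets restricts the read to specific sheets.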