python apache-spark pyspark spark-excel

Unable to use the skipFirstRows option when reading Excel in PySpark


Note: in my case we cannot use pandas.read_excel() to read the Excel file. We can only use the spark-excel jar installed on our cluster.

My main point is that we have to skip a few lines of the Excel sheet while reading the file, using any logic or a parameter such as ("skipFirstRows", "[int value]").

df = spark.read.format("com.crealytics.spark.excel")\
               .option("header", "true")\
               .option("inferSchema", "true")\
               .option("skipFirstRows", "1")\
               .option("treatEmptyValuesAsNulls", "true")\
               .load("dbfs:/FileStore/filename.xlsx")
df

Even with .option("skipFirstRows","1") set, the first line is not skipped while reading. The read raises an error on the first line itself.

ERROR: java.lang.IllegalStateException: Cannot get a STRING value from a NUMERIC formula cell

My Excel file has one numeric value in the first row, in the 6th or 7th cell, and the actual header starts on the second line.

So I have to skip that first line.

Excel sample: (screenshot not reproduced)

Please help me achieve this.

Thank you


Solution

  • skipFirstRows was deprecated in favor of the more generic dataAddress option. To skip the first row in your example, start reading at cell A2, so that row 2 is treated as the header:

    df = spark.read.format("com.crealytics.spark.excel")\
           .option("header", "true")\
           .option("dataAddress", "A2")\
           .load("dbfs:/FileStore/filename.xlsx")
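
    If the number of rows to skip varies between files, the start cell can be computed instead of hard-coded. The following is only a rough sketch: data_address and read_excel_skipping_rows are hypothetical helper names, not part of spark-excel, and the sketch assumes that dataAddress accepts a bare start cell such as A2 or a sheet-qualified one such as 'Sheet1'!A2, and that spark is an existing SparkSession with the spark-excel jar on the cluster.

```python
def data_address(skip_rows, sheet_name=None):
    """Build a dataAddress value that starts reading after `skip_rows` rows.

    Hypothetical helper, not part of spark-excel.
    """
    if skip_rows < 0:
        raise ValueError("skip_rows must be non-negative")
    # Excel rows are 1-based, so skipping N rows means starting at row N + 1.
    cell = f"A{skip_rows + 1}"
    if sheet_name is None:
        return cell
    # Assumed sheet-qualified form: 'Sheet Name'!A2
    return f"'{sheet_name}'!{cell}"


def read_excel_skipping_rows(spark, path, skip_rows, sheet_name=None):
    """Read an Excel file with spark-excel, skipping the leading rows.

    `spark` is assumed to be an existing SparkSession.
    """
    return (
        spark.read.format("com.crealytics.spark.excel")
        .option("header", "true")       # first row read is the header
        .option("inferSchema", "true")
        .option("dataAddress", data_address(skip_rows, sheet_name))
        .load(path)
    )
```

    For the file in the question this would be called as read_excel_skipping_rows(spark, "dbfs:/FileStore/filename.xlsx", 1), which passes A2 as the dataAddress.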