excelscalaapache-sparkapache-spark-sqlspark-excel

Construct a dataframe from excel using scala


I am looking for way to construct the dataframe from an excel file in spark using scala? I referred below SO post and tried doing an operation for an excel sheet attached.

Excel sheet sample

How to construct Dataframe from a Excel (xls,xlsx) file in Scala Spark?

Unfortunately, below modified code didn't read all the columns in an excel.

val df = spark.read.format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1") // Required
      .option("useHeader", "false") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("startColumn", 0) // Optional, default: 0
      .option("endColumn", 99) // Optional, default: Int.MaxValue
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
      .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
      .option("path", excelFile)
      //.schema(customSchema)
      .load()

+---+---+--------------+---+---+
|_c0|_c1|           _c2|_c3|_c4|
+---+---+--------------+---+---+
|   |   |Test Profile 1|  A|123|
|   |   |Test Profile 2|  B|   |
|   |   |Test Profile 3|  C|   |
|   |   |Test Profile 4|  D|   |
|   |   |Test Profile 5|  E|   |
|   |   |Test Profile 6|  F|   |
+---+---+--------------+---+---+

Am I missing anything here?

My objective is to get all the data from a sheet which is randomly distributed and then get specific values out of it. Some of the cells can be blank.

I can do it in scala using apache poi, get the required values, convert into csv and then load in dataframe.

However, I am looking for a way to parse the excel sheet directly into dataframe using scala, iterate through dataframe rows and apply conditions to get the required rows/columns.

p.s. Sorry, I didnt know how to attach an excel file from my local machine.

Thanks!


Solution

  • If you study the source code of crealytics spark excel, you will find that column numbers are defined with first row with value. And the first row with value in your excel file has file columns, so the last column which has value in other columns and not in the first row with value is neglected.

    The solution to this would be to define a custom schema and pass it to to framework as

    val customSchema = StructType(Seq(
      StructField("col0", StringType, true),
      StructField("col1", StringType, true),
      StructField("col2", StringType, true),
      StructField("col3", StringType, true),
      StructField("col4", IntegerType, true),
      StructField("col5", IntegerType, true)
      ))
    val df = spark.read.format("com.crealytics.spark.excel")
      .option("sheetName", "Sheet1") // Required
      .option("useHeader", "false") // Required
      .option("treatEmptyValuesAsNulls", "false") // Optional, default: true
      .option("inferSchema", "true") // Optional, default: false
      .option("addColorColumns", "false") // Optional, default: false
      .option("startColumn", 0) // Optional, default: 0
      .option("endColumn", 99) // Optional, default: Int.MaxValue
      .option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
      .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
      .option("excerptSize", 10) // Optional, default: 10. If set and if schema inferred, number of rows to infer schema from
      .option("path", excelFile)
      .schema(customSchema)
      .load()
    

    and you should have following dataframe

    +----+----+--------------+----+----+----+
    |col0|col1|col2          |col3|col4|col5|
    +----+----+--------------+----+----+----+
    |null|null|Test Profile 1|A   |123 |null|
    |null|null|Test Profile 2|B   |null|null|
    |null|null|Test Profile 3|C   |null|345 |
    |null|null|Test Profile 4|D   |null|null|
    |null|null|Test Profile 5|E   |null|null|
    |null|null|Test Profile 6|F   |null|null|
    +----+----+--------------+----+----+----+
    

    I hope the answer is helpful