excel, scala, apache-spark, pyspark, spark-excel

How to construct a DataFrame from an Excel (xls, xlsx) file in Scala Spark?


I have a large Excel (xlsx/xls) file with multiple sheets, and I need to convert it to an RDD or DataFrame so that it can be joined to other DataFrames later. I was thinking of using Apache POI to save it as a CSV and then reading the CSV into a DataFrame. But if there is a library or API that can help with this process, it would make things easier. Any help is highly appreciated.
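
For reference, the last step of the fallback I had in mind would look roughly like this (a sketch assuming Spark 2.x, where CSV reading is built in; on 1.x you would need the spark-csv package, and the path is a placeholder):

    // Sketch of the POI fallback's final step: once a sheet has been
    // exported to CSV, read it back as a DataFrame. Assumes Spark 2.x
    // (built-in CSV source); the path is a placeholder.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("excel-to-df").getOrCreate()

    val fromCsv = spark.read
        .option("header", "true")      // first CSV row holds column names
        .option("inferSchema", "true") // let Spark guess column types
        .csv("path/to/exported-sheet.csv")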


Solution

  • The solution to your problem is to use the Spark Excel dependency (com.crealytics:spark-excel) in your project.

    Spark Excel has flexible options to play with.
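
    For example, with sbt (the version below is only illustrative; pick the release that matches your Spark and Scala versions):

    // build.sbt -- version is illustrative, not a recommendation
    libraryDependencies += "com.crealytics" %% "spark-excel" % "0.8.2"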

    I have tested the following code to read from Excel and convert it to a DataFrame, and it works perfectly:

    import org.apache.spark.sql.DataFrame

    // `sqlContext` is the SQLContext (or spark.sqlContext on Spark 2.x)
    def readExcel(file: String): DataFrame = sqlContext.read
        .format("com.crealytics.spark.excel")
        .option("location", file)                  // path to the Excel file
        .option("useHeader", "true")               // first row holds column names
        .option("treatEmptyValuesAsNulls", "true") // empty cells become null
        .option("inferSchema", "true")             // let Spark guess column types
        .option("addColorColumns", "false")        // skip cell-color metadata columns
        .load()
    
    val data = readExcel("path to your excel file")
    
    data.show(false)
    
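
    Since you mentioned joining it to another DataFrame later, that part is plain Spark once the Excel data is loaded (a sketch; `otherDf`, its source path, and the "id" key column are placeholders for your own data):

    // Hypothetical follow-up join; `otherDf`, its source path, and the
    // "id" join key are placeholders for your own data.
    val otherDf = sqlContext.read.parquet("path/to/other/data")
    val joined = data.join(otherDf, Seq("id"), "inner")
    joined.show(false)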

    You can pass the sheet name as an option if your Excel file has multiple sheets:

    .option("sheetName", "Sheet2")
    
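
    If you need all the sheets rather than just one, here is a sketch under the assumption that you know the sheet names up front and that they share a schema:

    // Sketch: read each known sheet by name and union the results.
    // Sheet names are placeholders; all sheets must share the same schema.
    def readSheet(file: String, sheet: String): DataFrame = sqlContext.read
        .format("com.crealytics.spark.excel")
        .option("location", file)
        .option("sheetName", sheet)
        .option("useHeader", "true")
        .option("inferSchema", "true")
        .load()
    
    val allSheets = Seq("Sheet1", "Sheet2")
        .map(sheet => readSheet("path to your excel file", sheet))
        .reduce(_ unionAll _) // use `union` on Spark 2.x+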

    I hope this is helpful.