pyspark apache-spark-sql databricks azure-databricks

Reading Excel (xlsx) with PySpark does not work above a certain medium size


I have the following cluster configuration in Databricks: 64 GB of memory, 8 cores.

The tests were carried out with this as the only notebook in the cluster; no other notebooks were running at the time. I find that reading a simple 30 MB Excel file in Spark keeps loading and never completes. I am using the following code for this purpose:

sdf = spark.read.format("com.crealytics.spark.excel")\
                .option("header", True)\
                .option("inferSchema", "true")\
                .load(my_path)
display(sdf)

I have tried reducing the Excel file, and it works fine up to about 15 MB. As a workaround I am going to export the Excel file to CSV and read it from there (see the sketch below), but I find it shocking that Spark can't even read 30 MB of Excel.
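
A minimal sketch of that workaround, assuming the file has already been exported to a hypothetical CSV path my_csv_path:

# Standard Spark CSV reader; header and schema inference as in the Excel attempt
sdf = spark.read.format("csv")\
                .option("header", "true")\
                .option("inferSchema", "true")\
                .load(my_csv_path)
display(sdf)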

Or am I doing something wrong in the configuration?


Solution

  • Can you please try the option below, as shown in the spark-excel GitHub documentation?

    You can adjust the number of rows based on your input; the value 20 is just a sample value. A sketch of the full reader with this option applied is included at the end of this answer.

    .option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

    As mentioned above, the option does not work for .xls files.

    If the files are really big, consider the options shown in issue #590.

    Please validate before using any of the options specified.

    Cheers...
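
    For reference, here is a minimal sketch of the reader from the question with the streaming option applied. It assumes the spark-excel library is installed on the cluster, my_path is the same path used in the question, and 20 is only the sample value mentioned above:

    # maxRowsInMemory enables the streaming reader (works for xlsx, not xls)
    sdf = spark.read.format("com.crealytics.spark.excel")\
                    .option("header", "true")\
                    .option("inferSchema", "true")\
                    .option("maxRowsInMemory", 20)\
                    .load(my_path)
    display(sdf)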