apache-spark | pyspark | spark-excel

Reading Excel (.xlsx) file in pyspark


I am trying to read a .xlsx file from a local path in PySpark.

I've written the code below:

from pyspark.shell import sqlContext
from pyspark.sql import SparkSession

spark = SparkSession.builder \
      .master('local') \
      .appName('Planning') \
      .enableHiveSupport() \
      .config('spark.executor.memory', '2g') \
      .getOrCreate()

df = sqlContext.read("C:\P_DATA\tyco_93_A.xlsx").show()

Error:

TypeError: 'DataFrameReader' object is not callable


Solution
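The `TypeError` itself has nothing to do with Excel: `sqlContext.read` (like `spark.read`) is a property that returns a `DataFrameReader`, so it cannot be called like a function; you chain a method such as `.load(...)` off it instead. A minimal sketch of the same pattern in plain Python (the `Session`/`Reader` classes here are hypothetical stand-ins, not PySpark code):

```python
class Reader:
    """Stand-in for PySpark's DataFrameReader."""
    def load(self, path):
        return f"DataFrame from {path}"

class Session:
    """Stand-in for SparkSession: `read` is a property, not a method."""
    @property
    def read(self):
        return Reader()

session = Session()

# session.read("C:/data/file.xlsx")  # TypeError: 'Reader' object is not callable
df = session.read.load("C:/data/file.xlsx")  # chain a method off the reader instead
print(df)
```

Note that even with the call fixed, stock Spark has no built-in `.xlsx` reader (only formats like csv, json, parquet), which is why the answer below falls back to pandas.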

  • One option is to use pandas to read the .xlsx file and then convert the result to a Spark DataFrame.

    from pyspark.sql import SparkSession
    import pandas
    
    spark = SparkSession.builder.appName("Test").getOrCreate()
    
    # Note: read_excel has no inferSchema option (that is a Spark CSV option);
    # Spark infers the schema later, in createDataFrame.
    pdf = pandas.read_excel('excelfile.xlsx', sheet_name='sheetname')
    df = spark.createDataFrame(pdf)
    
    df.show()
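One caveat with this route: pandas reads empty Excel cells as NaN and mixed columns as `object`, which can make `spark.createDataFrame` fail or infer unwanted types. Making the pandas dtypes explicit first avoids that. A sketch with pandas only, using hypothetical column names in place of the real spreadsheet (the cleaned frame would then be passed to `spark.createDataFrame(pdf)`):

```python
import pandas as pd

# Hypothetical data standing in for pandas.read_excel('excelfile.xlsx', ...)
pdf = pd.DataFrame({
    "part_no": ["A-93", "B-17", None],  # object column with a missing value
    "qty": [10, None, 3],               # becomes float64 because of the NaN
})

# NaN/None in object columns can trip up Spark's type inference, so make
# the pandas dtypes explicit before handing the frame to createDataFrame.
pdf["part_no"] = pdf["part_no"].fillna("").astype(str)
pdf["qty"] = pdf["qty"].fillna(0).astype(int)

print(pdf.dtypes)
# df = spark.createDataFrame(pdf)  # now infers string / integer columns cleanly
```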