apache-spark · pyspark · spark-ui

Why do I see two jobs in Spark UI for a single read?


I am trying to run the below script to load a file with 24k records. Is there any reason why I am seeing two jobs for a single load in the Spark UI?

Code:


from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("DM")\
    .getOrCreate()


trades_df = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("s3://bucket/source.csv") 

trades_df.rdd.getNumPartitions()  # returns 1

[Spark UI screenshot showing two jobs for the read]


Solution

  • That's because Spark reads the CSV file twice when inferSchema is enabled: the first job scans the data to infer the column types, and the second performs the actual load.

    Read the comments for the function def csv(csvDataset: Dataset[String]): DataFrame in Spark's GitHub repo here.