I am trying to run the below script to load a file with 24k records. Is there any reason why I am seeing two jobs for a single load in the Spark UI?
```python
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("DM") \
    .getOrCreate()

trades_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("s3://bucket/source.csv")

trades_df.rdd.getNumPartitions()  # returns 1
```
That's because Spark reads the CSV file twice when `inferSchema` is enabled: the first job scans the data to infer the schema, and the second performs the actual load.

See the comments on the function `def csv(csvDataset: Dataset[String]): DataFrame` in Spark's GitHub repo here.