Tags: apache-spark, dataframe, rdd

Reading file using Spark RDD vs DF


I have a 2MB file. When I read it using

df = spark.read.option("inferSchema", "true").csv("hdfs:///data/ml-100k/u.data", sep="\t")
df.rdd.getNumPartitions()  # returns 1

it reads the data into 1 partition, and when I read it using

lines = sc.textFile("hdfs:///data/ml-100k/u.data")
lines.getNumPartitions()  # returns 2

it reads the data into 2 partitions. What is the reason for the difference?


Solution

  • The default partition calculation is different for these two methods.

    sc.textFile():

    If you check the implementation of sc.textFile() in the Spark source, the partition count defaults to defaultMinPartitions, which is defined as min(defaultParallelism, 2) and therefore usually returns 2.
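
    A minimal way to verify this from the PySpark shell (assuming sc is the active SparkContext, as in the question; both properties below are part of the public SparkContext API):

        sc.defaultParallelism    # e.g. 2 on a small cluster
        sc.defaultMinPartitions  # min(defaultParallelism, 2), so usually 2

        # textFile() also accepts an explicit minPartitions argument:
        lines = sc.textFile("hdfs:///data/ml-100k/u.data", minPartitions=4)
        lines.getNumPartitions()  # at least 4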

    spark.read.csv():

    By default this method targets one partition per 128MB of input (the spark.sql.files.maxPartitionBytes setting, which matches the default HDFS block size). Since your file is only 2MB, a single partition is created.
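
    If you want the DataFrame read to produce more partitions for such a small file, here is a sketch of two options (spark.sql.files.maxPartitionBytes is a standard Spark SQL setting; the exact partition count also depends on other factors such as spark.sql.files.openCostInBytes):

        # Lower the per-partition size target so even a 2MB file is split:
        spark.conf.set("spark.sql.files.maxPartitionBytes", str(1024 * 1024))  # 1MB
        df = spark.read.option("inferSchema", "true").csv(
            "hdfs:///data/ml-100k/u.data", sep="\t")
        df.rdd.getNumPartitions()  # roughly 2 for a 2MB file

        # Or leave the read as-is and repartition afterwards:
        df = df.repartition(2)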