I have a 2 MB file. When I read it using
df = spark.read.option("inferSchema", "true").csv("hdfs:///data/ml-100k/u.data", sep="\t")
df.rdd.getNumPartitions() # returns 1
it reads the data into 1 partition, but when I read it using
lines = sc.textFile("hdfs:///data/ml-100k/u.data")
lines.getNumPartitions() # returns 2
it reads the data into 2 partitions. What is the reason for the difference?
The default partition calculation is different for these two methods.
sc.textFile():
If you check the implementation of sc.textFile() in the Spark source, the minimum partition count defaults to defaultMinPartitions, which is defined as min(defaultParallelism, 2) and therefore usually returns 2.
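As a hedged sketch, the defaultMinPartitions logic can be mimicked in plain Python (the function name and the standalone form are illustrative; in Spark it is a property on SparkContext):

```python
# Sketch of SparkContext.defaultMinPartitions, which the Spark source
# defines as min(defaultParallelism, 2). defaultParallelism is typically
# the number of cores available to the application.
def default_min_partitions(default_parallelism):
    return min(default_parallelism, 2)

print(default_min_partitions(8))  # typical multi-core machine -> 2
print(default_min_partitions(1))  # single-core fallback -> 1
```

This is why sc.textFile() splits even a tiny 2 MB file into 2 partitions on a machine with more than one core.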
spark.read.csv():
By default this method creates roughly one partition per 128 MB chunk of input (controlled by spark.sql.files.maxPartitionBytes, whose 128 MB default matches the common HDFS block size). Since your file is only 2 MB, a single partition is created.
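A simplified, hedged sketch of that sizing rule in plain Python (the real Spark logic also factors in an open-cost-per-file and bytes-per-core, so this is an approximation, and the function name is illustrative):

```python
import math

# Approximate partition count for a DataFrame file read, assuming it is
# driven solely by spark.sql.files.maxPartitionBytes (default 128 MB).
def approx_file_partitions(file_size_bytes, max_partition_bytes=128 * 1024 * 1024):
    # At least one partition, then one more per full maxPartitionBytes chunk.
    return max(1, math.ceil(file_size_bytes / max_partition_bytes))

print(approx_file_partitions(2 * 1024 * 1024))    # 2 MB file -> 1 partition
print(approx_file_partitions(300 * 1024 * 1024))  # 300 MB file -> 3 partitions
```

So with spark.read.csv() the small file never crosses the 128 MB threshold and stays in a single partition, whereas sc.textFile() uses the separate defaultMinPartitions floor of 2.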