pyspark · amazon-emr · amazon-s3-select

Does EMR cluster size matter when reading data from S3 using Spark?


Setup: latest AWS EMR (5.29), Spark, 1 master + 1 core node.

Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3. Step 2: In PySpark, iterate over the keys in a loop, and for each key do the following:

spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)

This took x minutes.

When I increased the cluster to 1 master and 6 core nodes, I saw no difference in run time. It appears that the added core nodes are not being used.
Everything else, config-wise, is left at the out-of-the-box defaults; I am not setting anything.

So, my question is: does cluster size matter when reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?


Solution

  • Yes, size does matter. For my use case, sc.parallelize(s3fileKeysList) turned out to be the key: a driver-side loop reads one object at a time regardless of cluster size, while parallelize distributes the keys across executors so the reads run concurrently.