pyspark · amazon-emr · amazon-s3-select

Does EMR cluster size matter when reading data from S3 using Spark?


Setup: latest AWS EMR (5.29), Spark, 1 master + 1 core node.

Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3. Step 2: In PySpark, iterate over the keys in a loop, and for each key do the following:

spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)

This took x minutes.

When I increased the cluster to 1 master and 6 core nodes, I saw no difference in run time. It appears that the added core nodes are not being used.
Everything else, config-wise, is left at the out-of-the-box defaults; I am not setting anything.

So, my question is: does cluster size matter when reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?


Solution

  • Yes, size does matter. For my use case, sc.parallelize(s3fileKeysList) turned out to be the key: a driver-side loop reads one object at a time regardless of cluster size, while parallelize distributes the keys across executors so the reads run concurrently.