Setup: latest (5.29) AWS EMR, Spark, 1 master, 1 core node.
Step 1: I used S3 Select to parse a file and collect all the file keys to pull from S3.
Step 2: In PySpark, iterate over the keys in a loop and, for each key, do the following:

    spark.read.format("s3selectCSV").load(key).limit(superhighvalue).show(superhighvalue)
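Spelled out, step 2 was roughly the following (a sketch, not my exact code; "s3fileKeysList" and "superhighvalue" are stand-ins for the collected keys and the row cap):

    # Sequential, driver-side loop: one Spark job per key, run one
    # after another.
    for key in s3fileKeysList:
        df = spark.read.format("s3selectCSV").load(key)
        df.limit(superhighvalue).show(superhighvalue)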
That took x minutes. When I increased the cluster to 1 master and 6 core nodes, I saw no difference in runtime. It appears I am not using the additional core nodes.
Everything else, config-wise, is the out-of-the-box default; I am not setting anything.
So, my question is: does cluster size matter for reading and inspecting (say, logging or printing) data from S3 using EMR and Spark?
Yes, size does matter. For my use case, sc.parallelize(s3fileKeysList) turned out to be the key: a plain Python loop issues the per-key jobs one at a time from the driver, so the extra core nodes sit mostly idle, whereas parallelizing the key list spreads the per-key work across the executors.
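A minimal sketch of the pattern, assuming the per-key work is done with boto3's S3 Select call on the executors (the bucket name, query, and serialization settings below are placeholders, not my actual ones):

    import boto3

    def fetch_rows(keys):
        # Runs on an executor: one boto3 client per partition.
        s3 = boto3.client("s3")
        for key in keys:
            resp = s3.select_object_content(
                Bucket="my-bucket",                   # placeholder
                Key=key,
                ExpressionType="SQL",
                Expression="SELECT * FROM s3object",  # placeholder query
                InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
                OutputSerialization={"CSV": {}},
            )
            for event in resp["Payload"]:
                if "Records" in event:
                    yield event["Records"]["Payload"].decode("utf-8")

    # Distribute the key list so executors pull their shares in parallel,
    # instead of the driver looping over the keys one at a time.
    rows = sc.parallelize(s3fileKeysList).mapPartitions(fetch_rows)
    print(rows.count())  # any action triggers the parallel fetch

With 6 core nodes, the keys get split across the executors, which is where the speedup came from for me.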