python, apache-spark, pyspark, aws-glue

Parallelism in AWS Glue


I am reading a large file from S3 in a Glue job. It's a .txt file which I convert to .csv, and then I read all the values in a particular column.

I want to leverage Glue's parallelism here, so that the reading can be picked up as tasks by the Glue workers.

Do I need to programmatically split the file and submit the smaller chunks to the workers, or is Spark smart enough to split the file and distribute it across the workers by itself?


Solution

  • AWS Glue splits the file and distributes the pieces to the worker nodes by itself, since it runs on Spark, so no manual splitting is required. You can still tune Spark's parallelism properties if the default partitioning doesn't suit your workload; see the sketch after the links below.

    For more detail, see the Salesforce Engineering Blog and the Apache Spark documentation.
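As a minimal sketch of what this looks like in a Glue job: the S3 path and column name below are hypothetical placeholders, and the repartition count is only illustrative. Spark creates the input partitions on its own when you read the file; each partition becomes a task on a worker.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Spark splits the S3 object into input partitions automatically;
# no manual chunking is needed. Path is a hypothetical example.
df = spark.read.csv("s3://my-bucket/path/to/file.txt", header=True)

# How many partitions (and therefore parallel tasks) Spark created:
print(df.rdd.getNumPartitions())

# If the default split is too coarse (e.g. a single gzip file is not
# splittable and yields one partition), repartition to spread the work
# across more tasks. 64 is an illustrative value; tune it to your cluster.
df = df.repartition(64)

# Read all values of one column; the scan runs in parallel on the workers,
# and collect() then pulls the results back to the driver.
values = df.select("my_column").rdd.flatMap(lambda row: row).collect()
```

Note that plain .txt/.csv files on S3 are splittable, so Spark will parallelize the read by itself; compressed formats like gzip are not splittable, which is the main case where you would want to repartition after reading.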