python, apache-spark, pyspark, aws-glue

Parallelism in AWS Glue


I am reading a large file from S3 in a Glue job. It's a .txt file which I convert to .csv, and then I read all the values in a particular column.

I want to leverage Glue's parallelism here, so that the reading can be picked up as tasks by the Glue workers.

Do I need to programmatically split the file and submit the smaller chunks to the workers, or is Spark smart enough to split the file and distribute it across the workers by itself?


Solution

  • AWS Glue splits the file and distributes the pieces to the worker nodes by itself, since it runs on Spark, so no manual splitting is required. You can still tune Spark's parallelism properties if the default partitioning doesn't suit your workload; see the sketch after the links below.

    For more detail, see the Salesforce Engineering Blog and the Apache Spark documentation.
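As a minimal sketch of what this looks like in a Glue job: the S3 path and column name below are hypothetical placeholders, and the repartition count is only illustrative. Spark creates the input partitions on its own when you read the file; each partition becomes a task on a worker.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Spark splits the S3 object into input partitions automatically;
# no manual chunking is needed. Path is a hypothetical example.
df = spark.read.csv("s3://my-bucket/path/to/file.txt", header=True)

# How many partitions (and therefore parallel tasks) Spark created:
print(df.rdd.getNumPartitions())

# If the default split is too coarse (e.g. a single gzip file is not
# splittable and yields one partition), repartition to spread the work
# across more tasks. 64 is an illustrative value; tune it to your cluster.
df = df.repartition(64)

# Read all values of one column; the scan runs in parallel on the workers,
# and collect() then pulls the results back to the driver.
values = df.select("my_column").rdd.flatMap(lambda row: row).collect()
```

Note that plain .txt/.csv files on S3 are splittable, so Spark will parallelize the read by itself; compressed formats like gzip are not splittable, which is the main case where you would want to repartition after reading.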