Tags: python, amazon-web-services, csv, pyspark, aws-glue

How to process multiple CSV files with PySpark in AWS Glue?


I'm new to PySpark and AWS Glue. Based on the examples I saw, I wrote a small script to read a CSV file as a PySpark dynamic frame. I would like to know how I can read multiple CSV files (or all CSV files under a particular S3 path), combine them to do some processing, and then write them back, possibly to different CSV files.

I understand PySpark is meant to handle large amounts of data, but is there a limit on how many rows of CSV data, or what amount of data, a PySpark dynamic frame can handle?

I'm trying to read multiple input files, combine them, do some processing on the data, and then write the results back to different output files.

from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sparkC = SparkContext.getOrCreate()
glueC = GlueContext(sparkC)
spark_session = glueC.spark_session
glue_job = Job(glueC)

t = glueC.create_dynamic_frame_from_options(connection_type="s3", connection_options={"paths":["s3://somebucket/inputfolder/"]}, format="csv")

...

Solution

  • Did you try adding the recurse parameter to your connection options, like this:

    t = glueC.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={
            "paths": ["s3://somebucket/inputfolder/"],
            "recurse": True
        },
        format="csv")