Tags: python, amazon-web-services, csv, pyspark, aws-glue

How to process multiple CSV files with PySpark in AWS Glue?


I'm new to PySpark and AWS Glue. Based on the examples I saw, I wrote a small script to read a CSV file as a PySpark dynamic frame. I would like to know how I can read multiple CSV files (or all CSV files under a particular S3 path), combine them to do some processing, and then write them back, possibly to different CSV files.

I understand PySpark is meant to handle large amounts of data, but is there a limit on how many rows of CSV data, or what amount of data, a PySpark dynamic frame can handle?

I'm trying to read multiple input files, combine them, do some processing on the data, and then write the results back to different output files.

from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sparkC = SparkContext.getOrCreate()
glueC = GlueContext(sparkC)
spark_session = glueC.spark_session
glue_job = Job(glueC)

t = glueC.create_dynamic_frame_from_options(connection_type="s3", connection_options={"paths":["s3://somebucket/inputfolder/"]}, format="csv")

...

Solution

  • Did you try adding the recurse parameter to your connection options, like this:

    t = glueC.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={
            "paths": ["s3://somebucket/inputfolder/"],
            "recurse": True
        },
        format="csv")