I'm new to PySpark and AWS Glue. I wrote a small script, based on the examples I saw, to read a CSV file as a PySpark dynamic frame. I would like to know how I can read multiple CSV files (or all CSV files under a particular S3 path), combine them to do some processing, and then write the results back, possibly to different CSV files.
I understand PySpark is meant to handle large amounts of data, but is there a limit on how many rows of CSV data, or what volume of data, a PySpark dynamic frame can handle?
I'm trying to read multiple input files, combine them to do some processing on the data, and then write them back to different output files.
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sparkC = SparkContext.getOrCreate()
glueC = GlueContext(sparkC)
spark_session = glueC.spark_session
glue_job = Job(glueC)

# Read the CSV files under the S3 prefix into a single DynamicFrame
t = glueC.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://somebucket/inputfolder/"]},
    format="csv")
...
Did you try adding the recurse parameter to your connection options, like this:
t = glueC.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://somebucket/inputfolder/"],
        "recurse": True
    },
    format="csv")