I've been searching for an answer to this for quite a while now but can't figure it out. I've read Why is input_file_name() empty for S3 catalog sources in pyspark? and tried everything in that question, but none of it worked. I'm trying to get the source S3 filename for each record, but a blank string keeps being returned. I suspect it's because the files are gzipped, since it worked perfectly before they were compressed, but I can't find anything saying compression should be a problem. Does anyone know whether gzip is the issue, or whether it's something else in my code?
Thank you!
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

def main():
    glue_context = GlueContext(SparkContext.getOrCreate())

    # create a source frame for the bronze table
    dyf_bronze_table = glue_context.create_dynamic_frame.from_catalog(
        database=DATABASE,
        table_name=TABLE,
        groupFiles='none'
    )

    # Add the source file location so it can be joined against the postgres database
    bronze_df = dyf_bronze_table.toDF()
    bronze_df = bronze_df.withColumn("s3_location", input_file_name())
    bronze_df.show()
The problem turned out to be in my Terraform file. I had set
compressionType = "gzip"
and also
format = gzip
Once I removed these, the filename was populated.
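If you want to confirm what actually ended up in the Glue catalog, you can inspect the table's parameters directly. Here's a minimal sketch using boto3 (assuming credentials and region are already configured, and reusing the DATABASE and TABLE names from the question; where exactly the compression setting shows up depends on how the table was defined):

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Compression-related settings can appear at the table level or on the storage descriptor
print(table.get("Parameters", {}))
print(table.get("StorageDescriptor", {}).get("Parameters", {}))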
That said, after reading through some of the documentation I wouldn't recommend gzipping the files (consider Parquet instead): gzip isn't a splittable format, so Spark can't shard the data across multiple DPUs and instead has to work through each file individually.
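If you do move to Parquet, a rough sketch of writing the bronze frame back out from the same Glue job might look like this (TARGET_PATH is a placeholder for your S3 output location; the rest uses standard Glue utilities):

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame (with the s3_location column) back to a DynamicFrame
dyf_out = DynamicFrame.fromDF(bronze_df, glue_context, "bronze_with_filename")

# Write it out as Parquet, which is splittable so Glue can parallelise reads across DPUs
glue_context.write_dynamic_frame.from_options(
    frame=dyf_out,
    connection_type="s3",
    connection_options={"path": TARGET_PATH},
    format="parquet"
)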