In a Code Workbook, I have a dataset filtered_ds with two columns, fileName and Xml. I am trying to iterate through the rows of the dataframe and write each Xml value to a separate file using Python. I tried the below, but instead of iterating over rows it iterates over columns. What is the correct way to do this?
    def write_xmls(filtered_ds):
        output = Transforms.get_output()
        output_fs = output.filesystem()
        for row in filtered_ds:
            with output_fs.open(str(row[1]), 'w') as f:
                f.write(str(row[2]))
                f.close()
I think the reason you're getting columns instead of rows is that's simply what a Spark DataFrame yields when you treat it as an iterable -- its columns, not rows. With that in mind, I see two options:

1. Collect the DataFrame to the driver with filtered_ds.collect(), which gives you a plain Python list of Row objects you can loop over.

2. Convert it to a pandas DataFrame with filtered_ds.toPandas() and iterate with iterrows().
Once you've done either 1 or 2, then you should be able to iterate over the rows and write files as you'd like.
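For instance, here's a minimal sketch of option 1. I'm keeping the Transforms.get_output() / filesystem() calls from your snippet and assuming the column names are exactly fileName and Xml:

    def write_xmls(filtered_ds):
        output = Transforms.get_output()
        output_fs = output.filesystem()
        # collect() pulls every row to the driver as a list of Row objects
        for row in filtered_ds.collect():
            # index into the Row by column name rather than by position
            with output_fs.open(str(row['fileName']), 'w') as f:
                f.write(str(row['Xml']))

Option 2 looks almost identical, just swapping the loop for filtered_ds.toPandas().iterrows().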
If the dataset is large and you're hitting driver memory issues, there's a third approach (sketched below): (1) create a row_number column, (2) loop through all row-number values, filtering the dataframe to just that row on each iteration, (3) collect the filtered one-row dataframe, (4) write that one row, (5) move on to the next iteration. I expect this would be very slow, but if this is a one-time exercise then maybe that's OK.
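Here's a rough sketch of that third approach, under the same column-name assumptions as above. row_number() needs a window, and ordering by fileName is an arbitrary choice (note that an unpartitioned window pulls everything into a single partition, which is part of why this is slow):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    def write_xmls(filtered_ds):
        output = Transforms.get_output()
        output_fs = output.filesystem()
        # (1) tag each row with a sequential number
        w = Window.orderBy('fileName')
        numbered = filtered_ds.withColumn('row_num', F.row_number().over(w))
        total = numbered.count()
        # (2)-(5) pull one row at a time onto the driver and write it out
        for i in range(1, total + 1):
            row = numbered.filter(F.col('row_num') == i).collect()[0]
            with output_fs.open(str(row['fileName']), 'w') as f:
                f.write(str(row['Xml']))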