palantir-foundry, foundry-code-workbooks

Palantir Code Workbook / Python: write XMLs to files


In a Code Workbook, I have a dataset filtered_ds with two columns, fileName and Xml. I am trying to iterate through the rows of the dataframe and write each XML to a separate file using Python. I tried the code below, but instead of iterating over rows, it iterates over columns. What is the correct way to do this?

def write_xmls(filtered_ds): 
    output = Transforms.get_output()
    output_fs = output.filesystem()
    
    for row in filtered_ds:
        with output_fs.open(str(row[1]), 'w') as f: 
            f.write(str(row[2]))
            f.close()

Solution

  • I think the reason you're getting columns instead of rows is that this is simply what iterating over a Spark DataFrame yields -- its columns, not its rows. With that in mind, I see two options:

    1. Collect the dataframe, which will then let you iterate over rows. Perhaps a bit more intuitive for this use case, but be aware of driver memory issues, since collect() pulls the whole dataset onto the driver (you can change Code Workbook Spark profiles in Control Panel if needed, but I would suggest avoiding this unless absolutely required).
    2. Convert the dataframe to a pandas dataframe, which may be more ergonomic depending on your familiarity with pandas. For this situation it isn't very different from collecting the rows with Spark, but I wanted to share it in case you're already comfortable with pandas.

    Once you've done either 1 or 2, then you should be able to iterate over the rows and write files as you'd like.
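As a sketch of option 1 (the same loop works for option 2 after toPandas() plus itertuples()): collect() returns a list of Row objects whose fields can be read by column name. The Foundry-specific pieces (Transforms.get_output(), output.filesystem()) come from the question; everything else here -- the namedtuple stand-in for Spark's Row and a local temp directory in place of output_fs -- is an assumption so the loop can be demonstrated outside Foundry.

```python
import os
import tempfile
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the loop can run without a Spark session;
# in the workbook you would get these rows from filtered_ds.collect().
Row = namedtuple("Row", ["fileName", "Xml"])

def write_xmls(rows, open_fn):
    # rows: the result of filtered_ds.collect() (or itertuples() on a
    # pandas dataframe); open_fn: stand-in for output_fs.open.
    for row in rows:
        # Access fields by name rather than by position -- note that with
        # only two columns, row[1]/row[2] in the original code was also
        # off by one (Row indexing starts at 0).
        with open_fn(str(row.fileName), "w") as f:
            f.write(str(row.Xml))

# Local demonstration with a temporary directory instead of output_fs:
rows = [Row("a.xml", "<a/>"), Row("b.xml", "<b/>")]
out_dir = tempfile.mkdtemp()
write_xmls(rows, lambda name, mode: open(os.path.join(out_dir, name), mode))
```

There is also no need for an explicit f.close() inside the with block -- the context manager closes the file on exit.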

    If the dataset is large and you're hitting driver memory issues, maybe there's a third approach: (1) create a row_number column, (2) loop through all row-number values, filtering the dataframe to just that row on each iteration, (3) collect the filtered single-row dataframe, (4) write that one row, and (5) move on to the next iteration. I expect this would be very slow, but if this is a one-time exercise then maybe that's acceptable.
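A local simulation of that third approach, with the Spark steps I'd expect sketched in comments (the row_number/filter/collect calls are the usual pyspark idiom, not tested here); a list of tuples stands in for the dataframe so the control flow can be checked without a Spark session:

```python
# One-row-at-a-time simulation. In Spark this would look roughly like:
#   numbered = filtered_ds.withColumn(
#       "rn", F.row_number().over(Window.orderBy("fileName")))
# and each iteration would do:
#   [row] = numbered.filter(F.col("rn") == n).collect()
# so only one row is ever held in driver memory at a time.

def write_one_at_a_time(numbered_rows, write_fn):
    # numbered_rows: list of (rn, fileName, xml) tuples standing in for
    # the numbered dataframe; write_fn: stand-in for writing via output_fs.
    total = len(numbered_rows)              # in Spark: numbered.count()
    for n in range(1, total + 1):
        # "Filter to just that row, then collect" from the text above.
        matches = [r for r in numbered_rows if r[0] == n]
        for _, file_name, xml in matches:
            write_fn(file_name, xml)

# Demonstration with an in-memory "filesystem":
written = {}
write_one_at_a_time(
    [(1, "a.xml", "<a/>"), (2, "b.xml", "<b/>")],
    lambda name, content: written.__setitem__(name, content),
)
```

Each iteration triggers a full filter-and-collect job, which is why this scales so poorly -- it trades driver memory for many small Spark jobs.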