palantir-foundry foundry-code-repositories foundry-code-workbooks

Is it possible to generate PDFs from datasets and save them to Foundry incrementally?


FPDF is a library that can turn a pandas DataFrame into a nicely formatted PDF report. Is there a feature in Foundry Code Repositories or Code Workbook for writing PDF files into Foundry from a Spark or pandas DataFrame?

I have a requirement to create a nicely formatted PDF report from a Foundry dataset filtered down to a few rows.

With the help of user https://stackoverflow.com/users/4922673/jackfischer I was able to get this working. However, the code overwrites the existing file. How can I incrementally update the dataset with a new file every time the code is run? I am using the Code Workbook templating feature to pass a parameter to the logic, and every time a new parameter is passed, the logic should create a new file.

Example:

  1. samplefile.txt
  2. samplefile2.txt

Solution

  • While I'm not familiar with the FPDF library specifically, Foundry supports generating files from datasets in transforms or Code Workbooks.

    To create a single Pandas-based PDF from your dataset, convert your dataset to Pandas and get an output file handle from Foundry. In Code Workbooks, this would resemble:

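    def pdf_dataset(input_df):
        output = Transforms.get_output()
        # Avoid naming this "pd", which would shadow the usual pandas alias
        pandas_df = input_df.toPandas()
        output_fs = output.filesystem()
        # Placeholder file name (an assumption); choose your own
        output_file_path = "report.pdf"
        with output_fs.open(output_file_path, "wb") as f:
            # use FPDF as needed to build the report from pandas_df,
            # then write the resulting PDF bytes to f
            pass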
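
The question asks for a new file on every run instead of overwriting one fixed path. One part of that is building a distinct file name from the templated parameter. Below is a minimal sketch using only the standard library. The helper name `build_output_file_path` and the naming scheme are hypothetical, not part of any Foundry API. Distinct names alone may not be enough if each run replaces the whole output dataset, which this sketch does not address.

```python
import re
from datetime import datetime, timezone


def build_output_file_path(parameter, run_time=None):
    """Return a PDF file name that is distinct per parameter and per run.

    Hypothetical helper: neither the name nor the naming scheme comes from Foundry.
    """
    if run_time is None:
        run_time = datetime.now(timezone.utc)
    # Collapse every run of characters that is unsafe in a file name into "_".
    safe_parameter = re.sub(r"[^A-Za-z0-9_-]+", "_", str(parameter)).strip("_")
    stamp = run_time.strftime("%Y%m%dT%H%M%SZ")
    return f"report_{safe_parameter}_{stamp}.pdf"


# A fixed timestamp keeps the example reproducible.
fixed_time = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
example_path = build_output_file_path("region: EMEA", fixed_time)
print(example_path)
```

The returned string could then be used as `output_file_path` when opening the output file handle.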
    

    Alternatively, you can create one PDF per row in parallel via Spark. This is easiest if you structure your data so that all the parameters needed to generate each PDF are colocated in a single row. You can then run a Python function on each row to generate the PDF and write it from Python memory to the destination dataset.

    In a Code Workbook, this would resemble:

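    def pdf_dataset(input_df):
        output = Transforms.get_output()

        def generate_pdf(row):
            output_fs = output.filesystem()
            # Placeholder name (an assumption): the first column is taken to
            # identify the row uniquely, so each row gets its own file
            output_file_path = f"{row[0]}.pdf"
            with output_fs.open(output_file_path, "wb") as f:
                # use FPDF as needed, then write the resulting PDF bytes to f
                pass

        input_df.rdd.foreach(generate_pdf)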
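
To make the per-row pattern concrete outside Foundry, here is a small self-contained sketch that runs locally. Everything in it is a stand-in: `LocalOutputFileSystem` only imitates the `open(path, mode)` call used above, `render_report_bytes` is a placeholder for the FPDF rendering step and does not produce a real PDF, the `name` and `value` columns are invented, and a plain loop replaces `input_df.rdd.foreach`.

```python
import os
import tempfile


class LocalOutputFileSystem:
    """Hypothetical stand-in for the object returned by output.filesystem()."""

    def __init__(self, root):
        self.root = root

    def open(self, path, mode):
        # Resolve the relative path inside the stand-in dataset directory.
        return open(os.path.join(self.root, path), mode)


def render_report_bytes(row):
    # Placeholder for the FPDF rendering step; returns plain bytes, not a real PDF.
    return f"Report for {row['name']}: {row['value']}".encode("utf-8")


def generate_pdf(row, output_fs):
    # One file per row, named after a column assumed to identify the row.
    output_file_path = f"{row['name']}.pdf"
    with output_fs.open(output_file_path, "wb") as f:
        f.write(render_report_bytes(row))
    return output_file_path


rows = [{"name": "samplefile", "value": 1}, {"name": "samplefile2", "value": 2}]
dataset_dir = tempfile.mkdtemp()
output_fs = LocalOutputFileSystem(dataset_dir)

# In Spark, foreach would call generate_pdf on each row; a loop stands in here.
written = [generate_pdf(row, output_fs) for row in rows]
print(sorted(os.listdir(dataset_dir)))
```

Because each file name is derived from the row itself, rows with different identifying values never write to the same path.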