Tags: pyspark, palantir-foundry, foundry-code-repositories, foundry-python-transform, foundry-data-connection

How can I have nice file names & efficient storage usage in my Foundry Magritte dataset export?


I'm exporting data from Foundry datasets in parquet format using various Magritte export tasks to an ABFS system (the same issue occurs with SFTP, S3, HDFS, and other file-based exports).

The datasets I'm exporting are relatively small (under 512 MB), so they don't need to be split across multiple parquet files; putting all the data in one file is enough. I've done this by ending the upstream transform with a .coalesce(1) so all of the data lands in a single file.
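For reference, a minimal sketch of such an upstream transform (the dataset paths and function name here are hypothetical, and transforms.api is Foundry's Python transforms library, so this only runs inside a Foundry code repository):

```python
from transforms.api import transform_df, Input, Output

@transform_df(
    Output("/Project/datasets/export_ready"),  # hypothetical output path
    source=Input("/Project/datasets/source"),  # hypothetical input path
)
def export_ready(source):
    # coalesce(1) merges all partitions into one, so Spark writes a single
    # spark/part-*.snappy.parquet file instead of one file per partition.
    return source.coalesce(1)
```

Note that .coalesce(1) forces all data through a single task, which is fine for datasets of a few hundred MB but would bottleneck much larger ones.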

The issues are:

  • All of this adds unnecessary complexity to my downstream system; I just want to be able to pull the latest version of the data in a single step.


Solution

  • You can use the rewritePaths functionality of the export plugin to rename the spark/*.snappy.parquet file to export.parquet during the export. This only works if there is a single file, so a .coalesce(1) in the transform is a must:

    excludePaths:
      - ^_.*
      - ^spark/_.*
    rewritePaths:
      '^spark/(.*[\/])(.*)': $1export.parquet
    uploadConfirmation: exportedFiles
    incrementalType: snapshot
    retriesPerFile: 0
    bucketPolicy: BucketOwnerFullControl
    directoryPath: features
    setBucketPolicy: true
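Since rewritePaths keys are regular expressions, it can help to sanity-check a rule locally before deploying it. The sketch below does this with Python's re module (whose syntax is compatible with this pattern) against a made-up file path. Note that capture group 1 already ends with the slash matched by [\/], so appending the new name directly after $1 avoids a doubled slash, and because the whole match is replaced, the spark/ prefix is dropped from the rewritten path:

```python
import re

# The rewritePaths key from the config above, as a Python raw string.
pattern = r"^spark/(.*[\/])(.*)"

# Hypothetical dataset file path; real part-file names are generated by Spark.
path = "spark/subdir/part-00000-abc123.c000.snappy.parquet"

# Group 1 captures "subdir/" (including the trailing slash), group 2 the
# file name; the replacement keeps the directory and renames the file.
rewritten = re.sub(pattern, r"\1export.parquet", path)
print(rewritten)  # subdir/export.parquet
```

If the replacement were written as $1/export.parquet instead, the output would contain a double slash (subdir//export.parquet), which some target systems treat as a distinct path.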