Tags: pyspark, palantir-foundry, foundry-code-repositories, foundry-code-workbooks

In Palantir Foundry, how do I parse a very large CSV file without OOMing the driver or executor?


Similar to How do I parse large compressed csv files in Foundry?, but with an uncompressed, system-generated (>10 GB) CSV file that needs to be parsed into a Foundry Dataset.

A dataset this size normally causes the driver to OOM, so how can I parse this file?


Solution

  • Using the filesystem, you can open the file and stream it row by row, yielding one record per line split on the separator (`,` in this case). Because each line is yielded as soon as it is read, only a single row is ever held in memory at once.

    from pyspark.sql import Row

    df = raw_dataset
    fs = df.filesystem()

    def process_file(fl):
        # Stream the file line by line so only one row is in memory at a time
        with fs.open(fl.path, "r") as f:
            header = [x.strip() for x in f.readline().split(",")]
            Log = Row(*header)
            for line in f:
                yield Log(*[x.strip() for x in line.split(",")])

    rdd = fs.files().rdd
    rdd = rdd.flatMap(process_file)
    df = rdd.toDF()
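The streaming idea above can be sketched outside Foundry with nothing but the Python standard library: read the header once, then yield one record per line so the full file is never materialized. The file contents and field names here are illustrative, not from the original dataset.

```python
import csv
import io

def stream_rows(fileobj):
    """Yield one dict per CSV row without loading the whole file into memory."""
    reader = csv.reader(fileobj)
    header = [h.strip() for h in next(reader)]  # consume the header line once
    for row in reader:
        yield dict(zip(header, row))

# In-memory file standing in for a very large CSV on disk
data = io.StringIO("id,name\n1,alice\n2,bob\n")
rows = list(stream_rows(data))
# rows[0] == {"id": "1", "name": "alice"}
```

Because `stream_rows` is a generator, a caller can process billions of rows with constant memory, which is the same property the `flatMap(process_file)` pattern gives each Spark executor.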