python · palantir-foundry

Can I access directories in Palantir and use a FOR loop to get the names of all tables inside a folder?


I need to simplify the process of downloading datasets from Palantir. My idea was to use it like a directory on a local PC, but the problem is that when I create a Code Workspace to run my own code, it seems to use a virtual Python environment, so I can't access the directories outside of the environment that contain the datasets I want to use.

So the process should be from my perspective:

  1. Get into a directory with datasets
  2. Run some kind of FOR loop based on the logic I need and insert the names of the files into a list
  3. Download all tables from the list

Is there some way to do it?

I tried to access the directory with the datasets, but as I am in a virtual Python environment, I don't know how.

I need to run the script inside Palantir. Right now we download datasets one by one through the Palantir UI, but that consumes a lot of time.


Solution

  • If what you want to do (best guess; I +1 the comment below your post that it would be great if you could clarify what is what exactly - datasets, files, etc.) is: "I have a lot of files on my local laptop, I need to upload them to Foundry, process them, and this will generate another dataset with a lot of files - how can I download them in bulk?"

    Then my guess is:

    1. You create a dataset in Foundry; you can bulk-upload by dragging and dropping all your files from your local laptop into the dataset. A dataset is primarily a "set of files", which can be of any type. A dataset does not need a schema to be processed.
    2. You pick the app of your choice (Code Workspace for a Jupyter-like experience, Code Repo for pro-code, Pipeline Builder for no-code/low-code). My preference is Code Repo, but Code Workspace is likely a good option as well, given it generates small code snippets for you.
    3. You process the files one by one. Here is a typical example: https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/ - below is a transform that simply copies the content of each file from the input dataset to the output dataset:
    import logging
    import time

    from transforms.api import transform, Input, Output

    logger = logging.getLogger(__name__)


    # @lightweight() # Optional - simply doesn't use Spark, as it is not needed here
    # @incremental() # Optional - only to process the new files on each run
    @transform(
        input_files=Input("/PATH/example_incremental_dataset"),
        output=Output("/PATH/example_incremental_lightweight_output"),
    )
    def compute_downstream(input_files, output):
        fs = input_files.filesystem()
        files = list(fs.ls())  # List all the files in the input dataset
        timestamp = int(time.time())

        logger.warning(f"These are the files that will be processed: {files}")

        # Copy each file's raw bytes to the output dataset,
        # suffixing the file name with the current timestamp
        for curr_input_file in files:
            with fs.open(curr_input_file.path, "rb") as f1:
                with output.filesystem().open(curr_input_file.path + f"_{timestamp}.txt", "wb") as f2:
                    f2.write(f1.read())
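
    If you only want to process a subset of the files (the "FOR loop based on the logic I need" from the question), ls() also accepts glob and regex filters. Below is an UNTESTED minimal sketch of the same copy transform; the dataset paths and the *.csv pattern are placeholders to adapt to your own logic:

    from transforms.api import transform, Input, Output


    @transform(
        input_files=Input("/PATH/example_dataset"),
        output=Output("/PATH/example_filtered_output"),
    )
    def compute_filtered(input_files, output):
        fs = input_files.filesystem()
        # ls() takes a glob (or a regex=... argument) to filter the listing
        for curr_input_file in fs.ls(glob="*.csv"):
            with fs.open(curr_input_file.path, "rb") as f1:
                with output.filesystem().open(curr_input_file.path, "wb") as f2:
                    f2.write(f1.read())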
    
    4. Now you want to download the output. This doesn't have a first-class solution, but you have a few alternatives depending on what exactly you want to download. For example, you could zip the files together so there is only one file to fetch:
    # UNTESTED
    # Note: if you want to read the files already written to your output, and then save the zip file to the same output,
    # you will need to add the @incremental() decorator, which acts a bit like an "advanced" mode
    # where you can read your own output - useful in this case
    
    import zipfile
    import os
    
    def compress_files(file_paths, output_zip):
        with zipfile.ZipFile(output_zip, 'w') as zipf:
            for file in file_paths:
                if os.path.isfile(file):  # Check if file exists
                    zipf.write(file, os.path.basename(file))
                else:
                    print(f"File {file} does not exist and will be skipped.")
                    
    # Example usage
    files_to_compress = ['file1.txt', 'file2.txt', 'file3.txt']
    output_zip_file = 'compressed_files.zip'
    
    compress_files(files_to_compress, output_zip_file)
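
    Also UNTESTED - if instead you want the zip to end up inside a Foundry dataset (so there is a single file to download from the UI), you can combine the two snippets above and write the archive straight into the output dataset's filesystem, since zipfile.ZipFile accepts any writable file object. The dataset paths and the archive name below are placeholders:

    import zipfile

    from transforms.api import transform, Input, Output


    @transform(
        input_files=Input("/PATH/example_incremental_lightweight_output"),
        output=Output("/PATH/example_zip_output"),
    )
    def compute_zip(input_files, output):
        fs = input_files.filesystem()
        # Write the archive directly into the output dataset - one file to download
        with output.filesystem().open("all_files.zip", "wb") as out_file:
            with zipfile.ZipFile(out_file, "w") as zipf:
                for file_status in fs.ls():
                    with fs.open(file_status.path, "rb") as f:
                        zipf.writestr(file_status.path, f.read())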
    

    Hope that helps

    EDIT: In case you have a dynamic set of files, see https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/

    in particular:

    file_statuses = list(your_input.filesystem().ls())
    # Result: [FileStatus(path='students.csv', size=688, modified=...)]
    paths = [f.path for f in file_statuses]
    # Result: ['students.csv', ...]
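
    And if the files are tabular (e.g. CSVs), you can parse each one directly from that listing. UNTESTED minimal sketch, assuming CSV files and that pandas is available in your environment; your_input is the same input handle as above:

    import pandas as pd

    def read_all_csvs(your_input):
        # Build a {path: DataFrame} mapping for every CSV in the input dataset
        fs = your_input.filesystem()
        dataframes = {}
        for file_status in fs.ls(glob="*.csv"):
            with fs.open(file_status.path, "r") as f:
                dataframes[file_status.path] = pd.read_csv(f)
        return dataframes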