I need to simplify the process of downloading datasets from Palantir. My idea was to use it like a directory on my local PC, but the problem is that when I create a codespace to run my own code, it seems to use a virtual Python environment, so I can't access the directories outside of the environment that hold the datasets I want to use.
So, from my perspective, the process should run entirely inside Palantir. Is there some way to do it?
I tried to access the directory with the datasets, but since I am in a virtual Python environment, I don't know how.
I need to run the script inside Palantir. Right now we download the datasets one by one through the Palantir UI, but that consumes a lot of time.
If what you want to do (best guess - I +1 the comment below your post that it would be great if you could clarify what is what exactly: datasets, files, etc.) is: "I have a lot of files on my local laptop, I need to upload them to Foundry, process them, and this will generate another dataset with a lot of files - how can I download them in bulk?"
Then my guess is:
import time
import logging

from transforms.api import transform, Input, Output

logger = logging.getLogger(__name__)


# @lightweight()  # Optional - simply doesn't use Spark, as it isn't needed here
# @incremental()  # Optional - only process the new files on each run
@transform(
    input_files=Input("/PATH/example_incremental_dataset"),
    output=Output("/PATH/example_incremental_lightweight_output"),
)
def compute_downstream(input_files, output):
    fs = input_files.filesystem()
    files = list(fs.ls())  # list all the files in the dataset
    timestamp = int(time.time())
    logger.warning(f"These are the files that will be processed: {files}")
    for curr_input_file in files:
        # Copy each input file to the output, suffixed with a timestamp
        with fs.open(curr_input_file.path, "rb") as f1:
            with output.filesystem().open(curr_input_file.path + f"_{timestamp}.txt", "wb") as f2:
                f2.write(f1.read())
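For the optional decorators, a rough sketch of how they would stack (imports and decorator order are from the transforms docs as I remember them, so double-check on your stack; the idea is that with @incremental(), ls() only returns the files added since the last build):

from transforms.api import transform, Input, Output, incremental, lightweight


@lightweight()   # Run on a single node without Spark
@incremental()   # ls() should then only return files added since the last build
@transform(
    input_files=Input("/PATH/example_incremental_dataset"),
    output=Output("/PATH/example_incremental_lightweight_output"),
)
def compute_downstream(input_files, output):
    # ... same file-copying logic as above ...
    pass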
To then download the processed files in bulk, you can either download the whole dataset as a CSV (top right in the dataset > Actions > Download as CSV) or download the individual files (dataset > Details > Files > Download). See https://www.palantir.com/docs/foundry/code-repositories/prepare-datasets-download/#access-the-file-for-download (untested).
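Another route, if you'd rather script the download from your laptop instead of clicking through the UI: the platform REST API has endpoints to list a dataset's files and read their content. A hedged sketch (hostname, token, and RID are placeholders, and the endpoint shapes are from memory of the Foundry API docs - verify them against your enrollment's API documentation before relying on this):

import requests

HOST = "https://your-stack.palantirfoundry.com"  # placeholder
TOKEN = "..."  # a user or third-party-app token
DATASET_RID = "ri.foundry.main.dataset.xxxx"  # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

# List the files of the dataset (assumed endpoint: GET /api/v2/datasets/{rid}/files)
resp = requests.get(f"{HOST}/api/v2/datasets/{DATASET_RID}/files", headers=headers)
resp.raise_for_status()

for f in resp.json()["data"]:
    path = f["path"]
    # Read each file's content (assumed endpoint: .../files/{path}/content)
    content = requests.get(
        f"{HOST}/api/v2/datasets/{DATASET_RID}/files/{path}/content",
        headers=headers,
    )
    content.raise_for_status()
    with open(path.replace("/", "_"), "wb") as out:
        out.write(content.content)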
import zipfile
import os

# Note: if you want to read the files already written to your output (to then
# save the zip file on your output as well), you will need to add the
# @incremental() decorator - which acts a bit like an "advanced" mode where
# you can read your own output, which is useful in that case.

def compress_files(file_paths, output_zip):
    with zipfile.ZipFile(output_zip, 'w') as zipf:
        for file in file_paths:
            if os.path.isfile(file):  # Check that the file exists before adding it
                zipf.write(file, os.path.basename(file))
            else:
                print(f"File {file} does not exist and will be skipped.")

# Example usage
files_to_compress = ['file1.txt', 'file2.txt', 'file3.txt']
output_zip_file = 'compressed_files.zip'
compress_files(files_to_compress, output_zip_file)
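Putting the two together, here is a rough, untested sketch of a transform that writes one zip archive containing all the input's files onto the output dataset (paths are placeholders, and it assumes zipfile can write to the file-like object returned by output.filesystem().open(), which works for non-seekable streams on modern Python):

import zipfile

from transforms.api import transform, Input, Output


@transform(
    input_files=Input("/PATH/example_dataset"),
    output=Output("/PATH/example_zipped_output"),
)
def zip_dataset_files(input_files, output):
    in_fs = input_files.filesystem()
    # Open the archive directly on the output filesystem, so nothing needs
    # to be staged on the local disk of the build
    with output.filesystem().open("all_files.zip", "wb") as out_f:
        with zipfile.ZipFile(out_f, "w") as zipf:
            for file_status in in_fs.ls():
                with in_fs.open(file_status.path, "rb") as in_f:
                    # Store each file in the archive under its original path
                    zipf.writestr(file_status.path, in_f.read())

You then have a single file to grab via dataset > Details > Files > Download.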
Hope that helps
EDIT: In case you have a dynamic set of files, see https://www.palantir.com/docs/foundry/transforms-python/unstructured-files/
in particular:
file_statuses = list(your_input.filesystem().ls())
# Result: [FileStatus(path='students.csv', size=688, modified=...)]
paths = [f.path for f in file_statuses]
# Result: ['students.csv', ...]
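And if the dataset mixes file types, ls() also takes glob/regex filters (per that same docs page, if memory serves), which saves you from filtering the paths yourself:

# Only list the CSV files of the dataset
csv_statuses = list(your_input.filesystem().ls(glob="*.csv"))
csv_paths = [f.path for f in csv_statuses]
# Result: ['students.csv', ...]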