palantir-foundry foundry-code-repositories

How to convert a Dataset of PDFs to a Media Set?


I have a dataset of PDFs. I want to convert this dataset of raw files to a MediaSet.

I tried to convert the dataset of files in Code Repository but I'm unclear how to proceed and which code to use to perform this operation.


Solution

  • There are 2 main components:

    Media Sets

    There are multiple ways to obtain a media set:

    1. Create a mediaset directly (Action > New > Mediaset). This creates a MediaSet which users with relevant permissions can upload to.
    2. Directly create a mediaset from data ingest or equivalent (e.g. virtual storage ...)
    3. Create a mediaset from a pipeline

    You are in case 3, because you already have a dataset of raw files. In a Code Repository, first import the transforms-media library (via the libraries icon on the left).

    Then write a transform like this:

    from transforms.api import transform, Input
    from transforms.mediasets import MediaSetOutput
    
    # The output is a media set, the input is the dataset holding the raw PDF files.
    @transform(
        output_mediaset=MediaSetOutput("<your path to mediaset>"),
        input_dataset=Input("<your path to dataset with raw files>")
    )
    def compute(input_dataset, output_mediaset):
        # Copy every file of the input dataset into the media set.
        output_mediaset.put_dataset_files(input_dataset)
    

    You can then move directly to the next step: creating a media references dataset from this raw dataset (the process would be the same starting from a media set).
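
    If the raw dataset also contains files that should not end up in the media set, a variation on the transform above could upload files one at a time instead of copying everything. The following is only a minimal sketch: the helper name upload_matching_files and the "*.pdf" glob are hypothetical, and it assumes that the input exposes filesystem().ls(glob=...) and filesystem().open(path, "rb") and that the media set output accepts put_media_item(file_like, path). Check these calls against the Foundry documentation before relying on them.

```python
def upload_matching_files(input_dataset, output_mediaset, glob="*.pdf"):
    """Copy only the files matching `glob` from a raw-file dataset into a media set.

    Returns the list of paths that were uploaded (hypothetical helper).
    """
    fs = input_dataset.filesystem()
    uploaded = []
    for file_status in fs.ls(glob=glob):
        # Open each raw file in binary mode and pass the stream to the media set.
        with fs.open(file_status.path, "rb") as f:
            # Assumption: put_media_item(file_like, path) exists on the output.
            output_mediaset.put_media_item(f, file_status.path)
        uploaded.append(file_status.path)
    return uploaded
```

    Under those assumptions, the body of compute would call upload_matching_files(input_dataset, output_mediaset) in place of put_dataset_files.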

    Media References

    Now comes the question: how do you convert a media set into media references?
    There are multiple ways:

    1. Via a Code Repository: a code example is available a bit further down on the same page; it reads a dataset of raw files and converts it to media references.
    2. Via Pipeline Builder: add the media set containing the raw files, then transform it using the Get Media References (Datasets) transform.

    You should try approach 2, as it will be a much simpler way to achieve what you need here.

    Do something with it

    Once you have media references, you can use them in your Ontology, for example.