python os.path pinecone

Unclear behaviour of pinecone's load_dataset


In the following program, I have three functions, get_dataset1, get_dataset2, and get_dataset3, that are all very similar. They differ only in when they call len(dataset) and when they execute os.path.join = tmp.

The functions get_dataset1 and get_dataset3 behave as intended: they load a dataset, and it has a length greater than 0. However, in the case of get_dataset2, the dataset has length 0. Why is that?

import copy
import os
import time

from pinecone_datasets import load_dataset

datasetName = "langchain-python-docs-text-embedding-ada-002"


def get_dataset1():
    os.path.join = lambda *s: "/".join(s)  # pinecone bug workaround
    dataset = load_dataset(datasetName)
    print("Dataset loaded:", len(dataset) != 0)  # dataset has length greater than 0


def get_dataset2():
    os.path.join = lambda *s: "/".join(s)  # pinecone bug workaround
    dataset = load_dataset(datasetName)
    os.path.join = tmp
    print("Dataset loaded:", len(dataset) != 0)  # dataset has length 0


def get_dataset3():
    os.path.join = lambda *s: "/".join(s)  # pinecone bug workaround
    dataset = load_dataset(datasetName)
    print("Dataset loaded:", len(dataset) != 0)  # dataset has length greater than 0
    os.path.join = tmp
    print("Dataset loaded:", len(dataset) != 0)  # dataset has length greater than 0


def main():
    get_dataset1()
    get_dataset2()
    get_dataset3()


if __name__ == "__main__":
    tmp = copy.deepcopy(os.path.join)
    main()

Solution

  • Short answer

    I ended up having to look through the pinecone-datasets source code, but the answer basically just comes down to lazy evaluation.

    When you initialize the Dataset, it doesn't really have any useful values assigned. All the calculation - including actually figuring out what data it's supposed to be referencing - happens when you try to do something interesting, such as finding its length.

    At that point, it finally uses os.path.join to figure out what data it's supposed to store - and if any changes have been made to os.path.join between initialization and that moment, it'll use the most recent value.

    To get the behavior you expect, make sure that os.path.join is defined properly when you do something interesting with the Dataset - not just when you construct it.
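
One way to follow that advice is to keep the workaround active until the data has actually been read, then restore the original function. The sketch below uses unittest.mock.patch.object for this. FakeDataset is a hypothetical stand-in for the real Dataset (so the example runs without pinecone-datasets), and "bucket/name" is a made-up path.

```python
import os
from unittest import mock


def slash_join(*parts):
    """The workaround from the question: always join with forward slashes."""
    return "/".join(parts)


class FakeDataset:
    """Hypothetical stand-in for the lazy Dataset; not the real class."""

    def __init__(self, dataset_path):
        self._dataset_path = dataset_path
        self._documents = None  # nothing has been read yet

    @property
    def documents(self):
        if self._documents is None:
            # os.path.join is looked up here, on first access.
            self._documents = os.path.join(
                self._dataset_path, "documents", "*.parquet"
            )
        return self._documents


original_join = os.path.join

with mock.patch.object(os.path, "join", slash_join):
    dataset = FakeDataset("bucket/name")
    materialized = dataset.documents  # force the lazy read while patched

# The patch is undone automatically when the with block exits.
restored = os.path.join is original_join
```

With the real library, the equivalent would presumably be to call len(dataset) (or otherwise touch the data) inside the with block, before os.path.join is restored.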

    Long answer

    When you write the line:

    dataset = load_dataset(datasetName)
    

    This eventually ends up calling the constructor for Dataset. Most of the constructor is irrelevant, but there are a couple lines that seem to be affecting your use case:

    self._dataset_path = dataset_path
    self._documents = None
    

    The Dataset stores the information for how it can get the data, but it delays accessing it until later. So by the time you try to find len(dataset), the dataset._documents attribute may or may not still be None. To handle both cases, len(dataset) accesses the documents property (not _documents) using this method:

    @property
    def documents(self) -> pd.DataFrame:
        if self._documents is None:
            self._documents = self._safe_read_from_path("documents")
        return self._documents
    

    This is the lazy-evaluation portion of the class. If _documents already exists, it'll just return its value. But if _documents is still None, it'll calculate its value using _safe_read_from_path, then permanently save that value and return it.
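
The caching pattern can be seen in isolation with a small sketch. LazyBox and its loader argument are hypothetical names used only for illustration; the point is that the loader runs once, on first access, and never again.

```python
class LazyBox:
    """Hypothetical class mirroring the cache-on-first-access pattern."""

    def __init__(self, loader):
        self._loader = loader
        self._value = None
        self.load_count = 0  # how many times the loader actually ran

    @property
    def value(self):
        if self._value is None:
            self.load_count += 1
            self._value = self._loader()
        return self._value


box = LazyBox(lambda: [1, 2, 3])
count_before = box.load_count  # still 0: constructing the box loads nothing
first = box.value              # the loader runs here
second = box.value             # the cached object is returned
```

Because the check is "is None" rather than a truthiness test, an empty result is cached just like a non-empty one. That is consistent with a zero-length dataset staying at zero length once it has been computed.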

    In turn, _safe_read_from_path contains the following line:

    read_path_str = os.path.join(self._dataset_path, data_type, "*.parquet")
    

    So the method's functionality will depend on how you most recently defined os.path.join.
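
This works because os.path.join is looked up on the os.path module every time the line runs, not when the method was defined. The sketch below shows the effect with a hypothetical build_pattern helper. It uses ntpath.join as a stand-in for a backslash-producing join; that is an assumption about what the workaround is compensating for, since the question does not say.

```python
import ntpath
import os


def build_pattern(dataset_path, data_type):
    # The name os.path.join is resolved each time this line runs,
    # not when build_pattern was defined.
    return os.path.join(dataset_path, data_type, "*.parquet")


original_join = os.path.join
try:
    os.path.join = lambda *parts: "/".join(parts)
    with_slashes = build_pattern("bucket/name", "documents")

    # Assumed stand-in for the unpatched behaviour on Windows.
    os.path.join = ntpath.join
    with_backslashes = build_pattern("bucket/name", "documents")
finally:
    os.path.join = original_join  # always undo the monkeypatch
```

The same function produces two different patterns depending only on what os.path.join happened to be at call time.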

    Putting it all together, this means that Dataset isn't storing very much until you call len, at which point it uses the current value of os.path.join to assign a useful value to its _documents attribute and continues using that value in all future calls.

    Given that, we can step through each of your functions:

    get_dataset1

    The function performs these steps:

    1. Define os.path.join as your lambda.
    2. Load a new Dataset, with the _documents attribute as None.
    3. Calculate the dataset's length, using the lambda stored in os.path.join to finally assign a value to _documents.

    This is pretty vanilla, which explains why this is working as expected.

    get_dataset2

    This one follows a similar structure, but diverges in an important way:

    1. As before, define os.path.join as the lambda.
    2. Also as before, load a new Dataset, with _documents as None.
    3. Redefine os.path.join as tmp.
    4. Calculate the dataset's length, using the tmp value stored in os.path.join to assign a value to _documents.

    By the end of this, you end up with a _documents value that was calculated using the wrong os.path.join implementation, which explains why you're getting a zero-length Dataset.

    get_dataset3

    This is where lazy evaluation really comes into play. The function follows these steps:

    1. Do all the steps in get_dataset1.
    2. Redefine os.path.join as tmp.
    3. Try to get the dataset's length. Since _documents is already known from the previous len call, it uses that old value instead of recalculating based on the new value of os.path.join.

    So even though you redefined os.path.join before the second len call, the Dataset uses the same _documents value that was calculated during the first len call and doesn't care about the new os.path.join value in the slightest.
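
The three walkthroughs can be reproduced without the real library. In this sketch everything is hypothetical: FakeDataset imitates the lazy property, STORAGE is an in-memory dict that only "finds" data under the forward-slash pattern, and ntpath.join plays the role of tmp (assuming the original join produces backslashes).

```python
import ntpath
import os

# Hypothetical storage: only the forward-slash pattern matches anything.
STORAGE = {"bucket/name/documents/*.parquet": ["row1", "row2"]}


class FakeDataset:
    """Hypothetical stand-in for the lazy Dataset; not the real class."""

    def __init__(self, path):
        self._dataset_path = path
        self._documents = None

    @property
    def documents(self):
        if self._documents is None:
            pattern = os.path.join(self._dataset_path, "documents", "*.parquet")
            self._documents = STORAGE.get(pattern, [])  # empty if no match
        return self._documents

    def __len__(self):
        return len(self.documents)


def slash_join(*parts):
    return "/".join(parts)


def run(steps):
    """Apply the steps in order and return every length that was measured."""
    lengths = []
    saved = os.path.join
    try:
        os.path.join = slash_join            # the workaround
        dataset = FakeDataset("bucket/name")
        for step in steps:
            if step == "restore":
                os.path.join = ntpath.join   # plays the role of tmp
            else:
                lengths.append(len(dataset))
    finally:
        os.path.join = saved
    return lengths


results = {
    "get_dataset1": run(["len"]),                     # [2]
    "get_dataset2": run(["restore", "len"]),          # [0]
    "get_dataset3": run(["len", "restore", "len"]),   # [2, 2]
}
```

The second scenario is the only one where the first read happens after the join function has been switched back, so it is the only one that caches an empty result.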