python · pandas · dataframe · embedding

How do I generate embeddings for a large Pandas dataframe?


I have an issue with generating embeddings for my dataset. It consists of about 16 000 000 reddit comments (their bodies + some negligible metadata). I have them stored as a CSV file, from which I'm generating a Pandas dataframe using pd.read_csv().

The hard part starts when I try to generate embeddings. The previous iteration of my program used a dataset that was 3 orders of magnitude smaller, so generating an embedding for each comment was trivial: I would just embed each one individually and add the results as a new column.

The df["embedding"] = self.embedder.embed_str(temp_df["author"]) approach, however, has proven to be insufficient for my new dataset. I've spent more than 6 hours waiting for it to finish processing, only to get back to Killed at the bottom of my terminal, seemingly due to heavy memory usage. I have also tried a parallel batched approach, however this resulted in an even more rapid killing of the process.

Is there some more efficient way of doing this, or should I just give up and leave embedding for the training process instead of making it a part of my dataset? I would appreciate any general guidance on this matter, since this is my first brush with data science.

To provide extra context, the aforementioned self.embedder.embed_str() method is as follows:


    def embed_str(self, data: str) -> torch.Tensor:
        """Generates an embedding for the given str data.

        Args:
            data: The data to be embedded.

        Returns:
            A PyTorch tensor containing the embedding.
        """
        return self.model.encode(data)

with self.model being a Jina Embeddings v2 SentenceTransformer, initialised like this:

self.model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en",
    trust_remote_code=True,
)

For more context, I am running the code on an Intel i5-10400F with 16 GB of RAM.


Solution

  • You can try using a simpler model, maybe jinaai/jina-embeddings-v2-small-en instead, i.e.:

    self.model = SentenceTransformer(
        "jinaai/jina-embeddings-v2-small-en",
        trust_remote_code=True,
    )

    It should be ~2.5 times faster than the base model you have been using. I'd also recommend adding the show_progress_bar argument to the encode call, like this: self.model.encode(data, show_progress_bar=True), so you can estimate roughly when it will finish and kill it at an early stage if needed. You may also need to normalize the embeddings, so the normalize_embeddings argument may be needed as well; your embed_str function would then look something like this:

    def embed_str(self, data: str) -> torch.Tensor:
        """Generates an embedding for the given str data.

        Args:
            data: The data to be embedded.

        Returns:
            A PyTorch tensor containing the embedding.
        """
        return self.model.encode(data, show_progress_bar=True, normalize_embeddings=True)
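    As a rough usage sketch (not part of your original code; the df variable and the "body" column name are assumptions), note that encode also accepts a list of strings and batches them internally, which is far cheaper than calling it once per row:

    # Sketch only: "body" is an assumed column name for the comment text.
    # encode() accepts a list of strings and batches internally.
    texts = df["body"].head(1_000).tolist()
    emb = model.encode(texts, show_progress_bar=True, normalize_embeddings=True)
    sample = df.head(1_000).assign(embedding=list(emb))  # one vector per row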

    If the comments are short, ~512 or ~1024 characters long, you can try even simpler models, such as "sentence-transformers/paraphrase-MiniLM-L6-v2" or "sentence-transformers/paraphrase-MiniLM-L3-v2". Both have the best speed based on this.

    self.model = SentenceTransformer(
        "sentence-transformers/paraphrase-MiniLM-L3-v2"
    )

    or

    self.model = SentenceTransformer(
        "sentence-transformers/paraphrase-MiniLM-L6-v2"
    )

    Based on the documentation, you will get 384-dimensional embeddings, which can be used for tasks like clustering or semantic search (see MiniLM-L6-v2 and MiniLM-L3-v2).

    However, the model you are trying to use does more: it maps to a 768-dimensional embedding space and supports a long context of up to 8k tokens, which makes it useful for processing long documents, semantic textual similarity, text reranking, recommendation, RAG and LLM-based generative search, etc. (see Jina-Embeddings).

    Taking into account that a Reddit comment can have at most 10k characters, you will probably need a model that supports maybe max_position_embeddings > 2048 (for more details, please have a look here). The model you are using has max_position_embeddings=8192, which is overkill.
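    If it helps, here is a small sketch (using one of the MiniLM models mentioned above) showing how to check how many tokens a loaded SentenceTransformer will actually consider before truncating:

    from sentence_transformers import SentenceTransformer

    # Prints the token limit of the loaded model; inputs longer than
    # this many tokens are truncated during encode().
    model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")
    print(model.max_seq_length)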

    So, the key takeaway: try to find a simpler model for your use case.

    Regarding multiprocessing, I agree with you. In my experience, SentenceTransformer by default makes better use of the available resources than multiprocessing when we have a single CPU, i.e. calling it like this on a single CPU generally takes longer:

    # On a single CPU, spawning a worker pool like this is usually slower than a plain encode() call.
    pool = model.start_multi_process_pool()
    emb = model.encode_multi_process(data, pool)
    model.stop_multi_process_pool(pool)

    Another approach would be to use Google Colab with a GPU, since it seems your machine doesn't have one. If you hit the limitations of the free tier, you can try splitting your dataset into maybe 100 chunks, running them separately, and then aggregating the results, as sketched below.
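    If you stay on your local machine, a minimal sketch of that chunked idea (the comments.csv path, the "body" column name, and the chunk size are all assumptions) could look like this:

    import numpy as np
    import pandas as pd
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(
        "jinaai/jina-embeddings-v2-small-en",
        trust_remote_code=True,
    )

    # Stream the CSV instead of loading all rows at once, and persist each
    # chunk's embeddings to disk so memory usage stays roughly constant.
    for i, chunk in enumerate(pd.read_csv("comments.csv", chunksize=100_000)):
        emb = model.encode(
            chunk["body"].tolist(),
            show_progress_bar=True,
            normalize_embeddings=True,
        )
        np.save(f"embeddings_{i:05d}.npy", emb)

    # Later, np.concatenate over the saved files to rebuild the full matrix.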