
ValueError: could not broadcast input array from shape (1536,) into shape (2000,)


I'm trying to create a Qdrant vector store and add my documents to it.

I'm getting the following error: ValueError: could not broadcast input array from shape (1536,) into shape (2000,)

I understand that the error comes from how I configure VectorParams, but I don't understand how these values should be determined.

Here's my complete code:

import os
from typing import List

from langchain.docstore.document import Document
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Qdrant, VectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

def load_documents(documents: List[Document]) -> VectorStore:
    """Create a vectorstore from documents."""
    collection_name = "my_collection"
    vectorstore_path = "data/vectorstore/qdrant"
    embeddings = OpenAIEmbeddings(
        model="text-embedding-ada-002",
        openai_api_key=os.getenv("OPENAI_API_KEY"),
    )
    qdrantClient = QdrantClient(path=vectorstore_path, prefer_grpc=True)
    qdrantClient.create_collection(
        collection_name=collection_name,
        vectors_config=VectorParams(size=2000, distance=Distance.EUCLID),
    )
    vectorstore = Qdrant(
        client=qdrantClient,
        collection_name=collection_name,
        embeddings=embeddings,
    )
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
    )

    sub_docs = text_splitter.split_documents(documents)
    vectorstore.add_documents(sub_docs)

    return vectorstore

Any ideas on how I should configure the vector params properly?


Solution

  • As far as I can tell, the value 1536 is fixed by the size of the vectors that OpenAIEmbeddings produces with text-embedding-ada-002, so the collection's vector size has to match it.

    Quoting from this article: https://openai.com/blog/new-and-improved-embedding-model

    The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost effective in working with vector databases.

    Thus, changing the above code to VectorParams(size=1536, distance=Distance.EUCLID) did the trick.
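
If you would rather not hard-code 1536, one option is to measure the size by embedding a single short string and taking the length of the result. The sketch below is only an illustration: infer_vector_size and FakeAdaEmbeddings are hypothetical names that do not exist in LangChain or qdrant-client, and the fake class just stands in for OpenAIEmbeddings so the snippet runs without an API key. The only real interface it relies on is the embed_query method that LangChain embedding classes provide.

```python
from typing import List, Protocol


class SupportsEmbedQuery(Protocol):
    """Anything exposing LangChain's embed_query(text) -> list of floats."""

    def embed_query(self, text: str) -> List[float]: ...


def infer_vector_size(embeddings: SupportsEmbedQuery, probe: str = "dimension probe") -> int:
    """Embed one short string and return the length of the resulting vector.

    Hypothetical helper, not part of LangChain or qdrant-client.
    """
    return len(embeddings.embed_query(probe))


class FakeAdaEmbeddings:
    """Hypothetical offline stand-in mimicking ada-002's 1536-dimensional output."""

    def embed_query(self, text: str) -> List[float]:
        return [0.0] * 1536


size = infer_vector_size(FakeAdaEmbeddings())
print(size)

# With the real objects from the question, this would become (not run here):
#   size = infer_vector_size(embeddings)
#   qdrantClient.create_collection(
#       collection_name=collection_name,
#       vectors_config=VectorParams(size=size, distance=Distance.EUCLID),
#   )
```

With the real OpenAIEmbeddings object this costs one extra embedding request, but the collection size then follows the model if you ever switch to one with a different dimensionality.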