deep-learning, nlp, openai-api, openai, embeddings

How does OpenAIEmbeddings() work? Does it create a single vector of size 1536 for the whole text corpus?


I'm working with the OpenAIEmbeddings() class, configured to use OpenAI's text-embedding-3-small model. According to the documentation, it generates a 1536-dimensional vector for any input text.

However, I'm a bit confused about how this works:

I was expecting this:

If there are 100 words in my input text, I expected OpenAIEmbeddings() to output 100 vectors, each of size 1536.

But the output is a single vector of size 1536 for the whole input text.
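
Roughly, this is what I'm running (a minimal sketch; it assumes the langchain_openai wrapper and an OPENAI_API_KEY in the environment):

    # Minimal sketch of what I'm doing (langchain_openai wrapper assumed).
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

    text = " ".join(["word"] * 100)        # a ~100-word input
    vector = embeddings.embed_query(text)  # one call for the whole text

    print(len(vector))  # 1536 -- a single vector, not 100 of them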

Why did I expect this?

Because, from what I've learned, embeddings like Word2Vec or GloVe provide a vector for each word in a corpus. How does this differ from the approach taken by OpenAIEmbeddings?
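
For example, this is the kind of per-word lookup I had in mind (a toy gensim Word2Vec sketch; the corpus and parameters are made up purely for illustration):

    # Toy example of per-word embeddings with gensim's Word2Vec.
    from gensim.models import Word2Vec

    corpus = [["the", "cat", "sat", "on", "the", "mat"],
              ["the", "dog", "chased", "the", "cat"]]

    model = Word2Vec(sentences=corpus, vector_size=50, min_count=1)

    print(model.wv["cat"].shape)  # (50,) -- one vector per word in the vocabulary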

I'm trying to understand whether there's a way to extract embeddings for individual words using this model or if the output is always a single vector representing the whole input.

Any insights or examples would be greatly appreciated!


Solution

  • Everything you described is 100% expected.

    Q: Is the 1536-dimensional vector generated for the entire input text?

    A: Yes.

    Q: If the 1536-dimensional vector represents the entire input text, how does the model handle individual words versus longer texts like sentences or paragraphs?

    A: First, the OpenAI embeddings model doesn't handle a single word any differently than a long text. To the model, it's just an input. The input can even be a single character (e.g., "a"), but it rarely makes sense to compute an embedding for it, since "a" carries hardly any semantic meaning on its own.

    Second, what you probably meant by this question is what happens when you actually use these embeddings, e.g., in a similarity search. Does it matter whether you embed individual words, sentences, paragraphs, or the whole text? Yes!

    This is called chunking. How you chunk your text depends on the use case, and the best approach is usually to simply try it and see. If a similarity search over your chunks returns meaningful results, your chunking is appropriate (even if that means embedding the whole text as a single chunk). If it doesn't, try a different granularity (e.g., instead of chunking by paragraph, chunk by sentence). There's a minimal sketch of this at the end of the answer.

    There's an excellent Stack Overflow blog post about this topic that you should read (the final quoted paragraph below is the best explanation):

    With RAG, you create text embeddings of the pieces of data that you want to draw from and retrieve. That allows you to place a piece of the source text within the semantic space that LLMs use to create responses.

    /.../

    When it comes to RAG systems, you’ll need to pay special attention to how big the individual pieces of data are. How you divide your data up is called chunking, and it’s more complex than embedding whole documents.

    /.../

    The size of the chunked data is going to make a huge difference in what information comes up in a search. When you embed a piece of data, the whole thing is converted into a vector. Include too much in a chunk and the vector loses the ability to be specific to anything it discusses. Include too little and you lose the context of the data.
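
    To make this concrete, here is a minimal sketch of chunking by sentence and running a similarity search over the chunk embeddings. It assumes the langchain_openai wrapper and an OPENAI_API_KEY in the environment; the sample text, the naive sentence splitter, and the cosine-similarity helper are purely illustrative, not a prescribed pipeline:

        # Minimal chunking + similarity-search sketch (illustrative only).
        import math

        from langchain_openai import OpenAIEmbeddings

        embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

        text = ("Embeddings map text to vectors. "
                "Chunk size affects retrieval quality. "
                "Cats are popular pets.")

        # 1. Chunk: here, one chunk per sentence (paragraphs or fixed-size
        #    windows are equally valid choices, depending on the use case).
        chunks = [s.strip() + "." for s in text.split(".") if s.strip()]

        # 2. Embed each chunk: one 1536-dimensional vector per chunk, no matter
        #    whether the chunk is a single word, a sentence, or a paragraph.
        chunk_vectors = embeddings.embed_documents(chunks)

        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            return dot / (norm_a * norm_b)

        # 3. Similarity search: embed the query, then rank chunks by similarity.
        query_vector = embeddings.embed_query("How does chunk size influence search results?")
        ranked = sorted(zip(chunks, chunk_vectors),
                        key=lambda pair: cosine(query_vector, pair[1]),
                        reverse=True)

        for chunk, vec in ranked:
            print(f"{cosine(query_vector, vec):.3f}  {chunk}")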