I’m currently building a Retrieval-Augmented Generation (RAG) system using Azure AI Search, and I've run into a problem with my index/indexer and skillset when handling chunked documents.
Overview of My Setup:
Issue:
When I try to feed my indexer with the chunked pages, I occasionally encounter a dimension mismatch error. Specifically, I receive the following error:
There's a mismatch in vector dimensions. The vector field 'content_embeddings', with dimension of '1536',
expects a length of '1536'. However, the provided vector has a length of '3072'.
Please ensure that the vector length matches the expected length of the vector field.
Observations:
Troubleshooting Steps I’ve Taken:
Assumption:
I suspect that the issue may be due to two vectors being concatenated, resulting in the oversized dimensions. However, I'm not sure how to solve this problem.
My Question:
Code Snippets:
Skillset Configuration:
"skills": [
  {
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "name": "SplitSkill",
    "description": "A skill that splits text into chunks",
    "context": "/document",
    "defaultLanguageCode": "en",
    "textSplitMode": "pages",
    "maximumPageLength": 2000,
    "pageOverlapLength": 500,
    "maximumPagesToTake": 0,
    "unit": "azureOpenAITokens",
    "inputs": [
      {
        "name": "text",
        "source": "/document/content"
      }
    ],
    "outputs": [
      {
        "name": "textItems",
        "targetName": "pages"
      }
    ],
    "azureOpenAITokenizerParameters": {
      "encoderModelName": "cl100k_base",
      "allowedSpecialTokens": [
        "[START]",
        "[END]"
      ]
    }
  },
  {
    "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
    "name": "ContentEmbeddingSkill",
    "description": "Connects to Azure OpenAI deployed embedding model to generate embeddings from content.",
    "context": "/document/pages/*",
    "resourceUri": "https://xxxx.openai.azure.com",
    "apiKey": "<redacted>",
    "deploymentId": "text-embedding-ada-002",
    "dimensions": 1536,
    "modelName": "text-embedding-ada-002",
    "inputs": [
      {
        "name": "text",
        "source": "/document/pages/*"
      }
    ],
    "outputs": [
      {
        "name": "embedding",
        "targetName": "content_embeddings"
      }
    ],
    "authIdentity": null
  }
]
Indexer Configuration:
{
  "@odata.context": "xxxxxxxxxxx",
  "@odata.etag": "xxxxxxxxxxx",
  "name": "xxxxxxxxxxx-vector",
  "description": null,
  "dataSourceName": "sharepoint-datasource",
  "skillsetName": "contentembedding",
  "targetIndexName": "sharepoint-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": 10,
    "maxFailedItems": 100,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions": ".csv, .docx, .pptx,.txt,.html,.pdf",
      "excludedFileNameExtensions": ".png, .jpg, .gif",
      "dataToExtract": "contentAndMetadata"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "content",
      "targetFieldName": "content",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/pages",
      "targetFieldName": "pages",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "/document/pages/*/content_embeddings/*",
      "targetFieldName": "content_embeddings",
      "mappingFunction": null
    }
  ],
  "cache": null,
  "encryptionKey": null
}
I appreciate any insights or suggestions on how to resolve this issue!
What could be causing this dimension mismatch specifically when I attempt to index the chunked embeddings?
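If several chunk embeddings (each a 1536-dimension vector) are concatenated before they reach the index, the resulting vector is too long: two concatenated vectors give 3072 values, which matches the length in the error message. This can happen when chunking produces multiple embeddings per document and those embeddings are flattened into one field instead of being handled individually. In the indexer above, the source path /document/pages/*/content_embeddings/* ends in a wildcard that enumerates the individual values of every chunk's vector, which is a likely source of that flattening. It would also explain why the error appears only occasionally: a document that yields a single chunk still produces exactly 1536 values.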
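The arithmetic behind the error message can be reproduced in a few lines. This is only an illustrative sketch in plain Python: the helper names (fake_embedding, flatten) are hypothetical, the vectors are zero-filled placeholders rather than real embeddings, and nothing here calls Azure AI Search.

```python
from typing import List

EXPECTED_DIM = 1536  # dimension of the 'content_embeddings' field in the error message


def fake_embedding(dim: int = EXPECTED_DIM) -> List[float]:
    """Placeholder for one chunk's embedding (hypothetical helper, zero-filled)."""
    return [0.0] * dim


def flatten(chunk_vectors: List[List[float]]) -> List[float]:
    """Concatenate per-chunk vectors into a single list of floats."""
    return [value for vector in chunk_vectors for value in vector]


# A document that produces one chunk fits the field; two chunks do not.
one_chunk = flatten([fake_embedding()])
two_chunks = flatten([fake_embedding(), fake_embedding()])

print(len(one_chunk), len(two_chunks))  # 1536 3072
```

A length that is an exact multiple of the field dimension, as 3072 is of 1536, is a strong hint that whole vectors were joined rather than that a different embedding model was used.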
The embedding skill should run once per chunk, with its context set to /document/pages/*:
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "ContentEmbeddingSkill",
  "context": "/document/pages/*",
  "dimensions": 1536,
  "inputs": [
    {
      "name": "text",
      "source": "/document/pages/*"
    }
  ],
  "outputs": [
    {
      "name": "embedding",
      "targetName": "content_embeddings"
    }
  ]
}
outputFieldMappings should map each chunk's embedding to a separate instance of the content_embeddings field.
{
  "dataSourceName": "your-datasource",
  "skillsetName": "your-skillset",
  "targetIndexName": "your-index",
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/pages/*/content_embeddings",
      "targetFieldName": "content_embeddings"
    }
  ]
}
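To locate candidate mappings in an existing indexer definition, a simple string check can help. The sketch below is a hypothetical helper (find_flattening_paths is not part of any Azure SDK) that flags source paths which enumerate a collection and then end in a second wildcard. It reproduces only the outputFieldMappings from the question, and it cannot tell whether removing the trailing wildcard is sufficient for a given index schema.

```python
from typing import Dict, List


def find_flattening_paths(indexer: Dict) -> List[str]:
    """Return outputFieldMappings source paths that enumerate a collection
    ('/*/') and then end in another wildcard ('/*')."""
    flagged = []
    for mapping in indexer.get("outputFieldMappings", []):
        path = mapping.get("sourceFieldName", "")
        if "/*/" in path and path.endswith("/*"):
            flagged.append(path)
    return flagged


# Only the mappings from the question's indexer definition are reproduced here.
indexer = {
    "outputFieldMappings": [
        {"sourceFieldName": "/document/pages", "targetFieldName": "pages"},
        {
            "sourceFieldName": "/document/pages/*/content_embeddings/*",
            "targetFieldName": "content_embeddings",
        },
    ]
}

flagged = find_flattening_paths(indexer)
print(flagged)  # ['/document/pages/*/content_embeddings/*']
```

The check is purely textual, so it only narrows down where to look; the indexer's execution history remains the place to confirm which documents actually fail.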