Tags: azure, indexing, azure-ai, rag, vector-search

Issue with Azure AI Search: Mismatch in Vector Dimensions When Indexing Chunked Documents


I’m currently building a Retrieval-Augmented Generation (RAG) system using Azure AI Search, and I've run into a problem with my index/indexer and skillset when handling chunked documents.

Overview of My Setup:

  • Data source: a SharePoint library (sharepoint-datasource), indexed into sharepoint-index.
  • Skillset (contentembedding): a SplitSkill chunks each document's content into pages (up to 2000 Azure OpenAI tokens per page with a 500-token overlap), then an AzureOpenAIEmbeddingSkill embeds each page with text-embedding-ada-002 (1536 dimensions).
  • Indexer: maps the generated embeddings to the content_embeddings vector field, which is defined with 1536 dimensions.

Issue:

When I run the indexer over the chunked pages, I occasionally hit a dimension mismatch error. Specifically, I receive the following error:

There's a mismatch in vector dimensions. The vector field 'content_embeddings', with dimension of '1536',
expects a length of '1536'. However, the provided vector has a length of '3072'. 
Please ensure that the vector length matches the expected length of the vector field.

Observations:

Troubleshooting Steps I’ve Taken:

  1. Output Validation: I log the dimensions of the vectors produced after embedding and before indexing (a sketch of this check follows the list). Typically they are correctly sized at 1536, but the vectors for chunked documents sometimes come out oversized.
  2. Chunking Logic: I’ve ensured that my chunking process doesn’t combine multiple chunks or overlap content, but the issue still persists.
  3. Indexer Configuration: I reviewed the indexer setup to ensure it correctly maps to the content_embeddings field.
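
For reference, this is roughly the kind of check described in step 1, reproduced outside the pipeline with the Azure OpenAI Python SDK. It is only a minimal sketch: the endpoint, key, API version, and chunk texts are placeholders, not my real values.

# Minimal sketch of the per-chunk dimension check (placeholders throughout).
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://xxxx.openai.azure.com",
    api_key="<redacted>",
    api_version="2024-02-01",
)

chunks = ["first chunked page ...", "second chunked page ..."]

response = client.embeddings.create(
    model="text-embedding-ada-002",  # deployment name of the embedding model
    input=chunks,
)

for i, item in enumerate(response.data):
    dims = len(item.embedding)
    print(f"chunk {i}: {dims} dimensions")
    assert dims == 1536, f"unexpected embedding length {dims} for chunk {i}"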

Assumption:

I suspect that the issue may be due to two vectors being concatenated, resulting in the oversized dimensions. However, I'm not sure how to solve this problem.

My Question:

  • What could be causing this dimension mismatch specifically when I attempt to index the chunked embeddings?

Code Snippets:

Skillset Configuration:

  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "SplitSkill",
      "description": "A skill that splits text into chunks",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "unit": "azureOpenAITokens",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "azureOpenAITokenizerParameters": {
        "encoderModelName": "cl100k_base",
        "allowedSpecialTokens": [
          "[START]",
          "[END]"
        ]
      }
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "ContentEmbeddingSkill",
      "description": "Connects to Azure OpenAI deployed embedding model to generate embeddings from content.",
      "context": "/document/pages/*",
      "resourceUri": "https://xxxx.openai.azure.com",
      "apiKey": "<redacted>",
      "deploymentId": "text-embedding-ada-002",
      "dimensions": 1536,
      "modelName": "text-embedding-ada-002",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "content_embeddings"
        }
      ],
      "authIdentity": null
    }
  ]
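
For completeness, the skillset above can be created or updated with a plain REST call along these lines. This is a sketch: the service name, admin key, and api-version are assumptions/placeholders (I am assuming a recent version that supports dimensions/modelName on the embedding skill and the token-based split settings).

# Sketch: push the skillset to Azure AI Search via the REST API (placeholders).
import requests

service = "https://<your-search-service>.search.windows.net"
api_key = "<admin-key>"
api_version = "2024-07-01"  # assumption: a version supporting the properties above

skillset = {
    "name": "contentembedding",
    "skills": [
        # ... the SplitSkill and AzureOpenAIEmbeddingSkill shown above ...
    ],
}

resp = requests.put(
    f"{service}/skillsets/{skillset['name']}",
    params={"api-version": api_version},
    headers={"api-key": api_key, "Content-Type": "application/json"},
    json=skillset,
)
resp.raise_for_status()
print("skillset updated:", resp.status_code)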

Indexer Configuration:

{
  "@odata.context": "xxxxxxxxxxx",
  "@odata.etag": "xxxxxxxxxxx",
  "name": "xxxxxxxxxxx-vector",
  "description": null,
  "dataSourceName": "sharepoint-datasource",
  "skillsetName": "contentembedding",
  "targetIndexName": "sharepoint-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": 10,
    "maxFailedItems": 100,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions": ".csv, .docx, .pptx,.txt,.html,.pdf",
      "excludedFileNameExtensions": ".png, .jpg, .gif",
      "dataToExtract": "contentAndMetadata"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "content",
      "targetFieldName": "content",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/pages",
      "targetFieldName": "pages",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "/document/pages/*/content_embeddings/*",
      "targetFieldName": "content_embeddings",
      "mappingFunction": null
    }
  ],
  "cache": null,
  "encryptionKey": null
}
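
When a run fails, the per-document errors can be inspected in the indexer's execution history. Roughly like this (again a sketch; service name, admin key, and api-version are placeholders):

# Sketch: trigger the indexer and read its last execution errors (placeholders).
import requests

service = "https://<your-search-service>.search.windows.net"
api_key = "<admin-key>"
api_version = "2024-07-01"
indexer_name = "xxxxxxxxxxx-vector"
headers = {"api-key": api_key}

# Trigger an on-demand run.
requests.post(
    f"{service}/indexers/{indexer_name}/run",
    params={"api-version": api_version},
    headers=headers,
).raise_for_status()

# Inspect the last execution result for per-document errors.
status = requests.get(
    f"{service}/indexers/{indexer_name}/status",
    params={"api-version": api_version},
    headers=headers,
).json()

for error in (status.get("lastResult") or {}).get("errors", []):
    print(error.get("key"), "->", error.get("errorMessage"))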

I appreciate any insights or suggestions on how to resolve this issue!


Solution

  • What could be causing this dimension mismatch specifically when I attempt to index the chunked embeddings?

    If multiple chunk embeddings (each a 1536-dimension vector) are concatenated before they reach the index, the resulting vector becomes too large: two concatenated vectors give exactly the 3072 reported in the error. This happens when a document is split into several chunks and their embeddings are merged into the single content_embeddings field of the parent document instead of being handled individually — for example, an output field mapping whose source path ends in /* (as in /document/pages/*/content_embeddings/* above) enumerates every chunk's embedding values and flattens them into one oversized vector.
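
    A toy illustration of that arithmetic (values are made up; the flattening mirrors what merging per-chunk embeddings into one field does):

    # Toy illustration: per-chunk embeddings vs. one flattened vector (made-up values).
    chunk_embeddings = [
        [0.01] * 1536,  # embedding for page/chunk 1
        [0.02] * 1536,  # embedding for page/chunk 2
    ]

    # Handled individually, each vector matches the 1536-dimension field.
    print([len(v) for v in chunk_embeddings])                # [1536, 1536]

    # Flattened into a single list, the vector is oversized.
    flattened = [x for vec in chunk_embeddings for x in vec]
    print(len(flattened))                                     # 3072 -> dimension mismatch

    To avoid this, keep the embedding skill scoped to one chunk at a time (context /document/pages/*) and map each chunk's embedding without a trailing /*, as in the excerpts below: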

    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "ContentEmbeddingSkill",
      "context": "/document/pages/*",
      "dimensions": 1536,
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "content_embeddings"
        }
      ]
    }
    
    {
      "dataSourceName": "your-datasource",
      "skillsetName": "your-skillset",
      "targetIndexName": "your-index",
      "outputFieldMappings": [
        {
          "sourceFieldName": "/document/pages/*/content_embeddings",
          "targetFieldName": "content_embeddings"
        }
      ]
    }
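
    It is also worth confirming that the index itself declares content_embeddings with 1536 dimensions, matching text-embedding-ada-002. A quick check against the REST API (a sketch; service name, admin key, and api-version are placeholders):

    # Sketch: verify the declared dimensions of the vector field (placeholders).
    import requests

    service = "https://<your-search-service>.search.windows.net"
    api_key = "<admin-key>"
    api_version = "2024-07-01"

    index = requests.get(
        f"{service}/indexes/sharepoint-index",
        params={"api-version": api_version},
        headers={"api-key": api_key},
    ).json()

    for field in index["fields"]:
        if field["name"] == "content_embeddings":
            # For vector fields, "dimensions" must equal the embedding length (1536 here).
            print(field["type"], field.get("dimensions"))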
    
