python azure azure-cognitive-search

How do I store vectors generated by AzureOpenAIEmbeddingSkill in an indexer, given my current setup?


This is a follow-up question to: Error in Azure Cognitive Search Service when storing document page associated to each chunk extracted from PDF in a custom WebApiSkill

How do I store the vectors generated by AzureOpenAIEmbeddingSkill in indexer given my current setup:

# pair each chunk with the page number it was extracted from
combined_list = [{'textItems': text, 'numberItems': number} for text, number in zip(chunks, page_numbers)]

# response record for a specific PDF
response_record = {
    "recordId": recordId,
    "data": {
        "subdata": combined_list
    }
}
response_body['values'].append(response_record)
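
For context, here is a minimal self-contained sketch of how the fragment above might sit inside a complete custom skill handler. The function names `build_skill_response` and `split_by_form_feed` are hypothetical, and the chunker is only a placeholder (it treats form feeds as page breaks); the input name `text` and the output name `subdata` are taken from the skillset.

```python
import json


def split_by_form_feed(text):
    """Placeholder chunker (assumption): one chunk per form-feed-separated page."""
    pages = text.split("\f")
    return pages, list(range(1, len(pages) + 1))


def build_skill_response(request_body, split_fn):
    """Build a custom WebApiSkill response with one record per input record."""
    response_body = {"values": []}
    for value in request_body["values"]:
        # "text" matches the input name declared for the skill in the skillset
        chunks, page_numbers = split_fn(value["data"]["text"])
        combined_list = [
            {"textItems": chunk, "numberItems": page}
            for chunk, page in zip(chunks, page_numbers)
        ]
        response_body["values"].append({
            "recordId": value["recordId"],
            "data": {"subdata": combined_list},
        })
    return response_body


request = {"values": [{"recordId": "0", "data": {"text": "first page\fsecond page"}}]}
response = build_skill_response(request, split_by_form_feed)
print(json.dumps(response, indent=2))
```

Each element of `subdata` then becomes one item under `/document/subdata/*`, which is the context the embedding skill and the index projection enumerate.
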

My skillset definition:

{
  ...
  "description": "Skillset to chunk documents and generating embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "splitclean",
      "description": "Custom split skill to chunk documents with specific chunk size and overlap",
      "context": "/document",
      "httpMethod": "POST",
      "timeout": "PT30S",
      "batchSize": 1,
      "degreeOfParallelism": null,
      "authResourceId": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "subdata",
          "targetName": "subdata"
        }
      ],
      "authIdentity": null
    },
    {
      "name": "#2",
      "description": "Skill to generate embeddings via Azure OpenAI",
      "context": "/document/subdata/*",
      "apiKey": "<redacted>",
      "deploymentId": "embedding-ada-002",
      "dimensions": null,
      "modelName": "experimental",
      "inputs": [
        {
          "name": "text",
          "source": "/document/subdata/*/textItems"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "vector"
        }
      ],
      "authIdentity": null
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": {
    "selectors": [
      {
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/subdata/*",
        "mappings": [
          {
            "name": "chunk",
            "source": "/document/subdata/*/textItems",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "vector",
            "source": "/document/subdata/*/vector",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "page_number",
            "source": "/document/subdata/*/numberItems",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  },
  "encryptionKey": null
}

I get the following error in AzureOpenAIEmbeddingSkill:

Web Api response status: 'Unauthorized', Web Api response details: '{"error":{"code":"401","message":"Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."}}'

Solution

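  • You need to add the resourceUri parameter to your Azure OpenAI embedding skill. Without it, the skill has no endpoint to send the key to, which is consistent with the "wrong API endpoint" part of the 401 message.

    See the AzureOpenAIEmbeddingSkill skill parameters documentation for more about it.

    To get the resource URI, open your Azure OpenAI resource in the Azure portal; the endpoint is listed in the Keys and Endpoint section.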

    The corrected skill definition is below.

        {
          "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
          "name": "#2",
          "description": "",
          "context": "/document/subdata/*",
          "resourceUri": "https://jgsopenai.openai.azure.com",
          "apiKey": "<redacted>",
          "deploymentId": "ada002",
          "dimensions": 1536,
          "modelName": "text-embedding-ada-002",
          "inputs": [
            {
              "name": "text",
              "source": "/document/subdata/*/textItems"
            }
          ],
          "outputs": [
            {
              "name": "embedding",
              "targetName": "vector"
            }
          ],
          "authIdentity": null
        }
    

    You also need to set dimensions, which is 1536 for text-embedding-ada-002; the skill documentation covers this parameter in more detail.
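
As a quick local check before uploading a skillset, the points above can be sketched as a small helper that reports which fields are absent or null in an embedding skill definition. The helper name `missing_embedding_skill_fields` and the list `CHECKED_FIELDS` are hypothetical; the list reflects only the fields discussed in this answer, not an official list of required parameters.

```python
# Fields this answer points at (an assumption, not an official "required" list).
CHECKED_FIELDS = ("resourceUri", "deploymentId", "modelName", "dimensions")


def missing_embedding_skill_fields(skill):
    """Return the checked field names that are absent, null, or empty."""
    return [name for name in CHECKED_FIELDS if not skill.get(name)]


# The embedding skill as posted in the question (no resourceUri, null dimensions).
original_skill = {
    "name": "#2",
    "context": "/document/subdata/*",
    "apiKey": "<redacted>",
    "deploymentId": "embedding-ada-002",
    "dimensions": None,
    "modelName": "experimental",
}

# The corrected skill from this answer.
corrected_skill = {
    "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
    "name": "#2",
    "context": "/document/subdata/*",
    "resourceUri": "https://jgsopenai.openai.azure.com",
    "apiKey": "<redacted>",
    "deploymentId": "ada002",
    "dimensions": 1536,
    "modelName": "text-embedding-ada-002",
}

print(missing_embedding_skill_fields(original_skill))   # ['resourceUri', 'dimensions']
print(missing_embedding_skill_fields(corrected_skill))  # []
```

This only checks that the fields are present; it does not verify that the key, endpoint, or deployment name are valid for your resource.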