azure indexing azure-blob-storage azure-ai

Add metadata to the output of an Azure AI Search skillset

I have the following Azure AI Search skillset, which takes a document from Azure Blob Storage and splits it into chunks that are later indexed for AI search.

{
  "@odata.context": "https://ahaisearch.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "...",
  "name": "ai-product-skillset",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "ai-product-skillset",
      "description": null,
      "context": "/document/content",
      "uri": "https://test.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview",
      "httpMethod": "POST",
      "timeout": "PT1M",
      "batchSize": 10,
      "degreeOfParallelism": 10,
      "authResourceId": null,
      "inputs": [
        {
          "name": "document_id",
          "source": "/document/document_id"
        },
        {
          "name": "filename",
          "source": "/document/filename"
        },
        {
          "name": "fieldname",
          "source": "='content'"
        },
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "url",
          "source": "/document/url"
        }
      ],
      "outputs": [
        {
          "name": "recordId",
          "targetName": "recordId"
        }
      ],
      "httpHeaders": {
        "ingestion-request-id": "...",
        "original-request-id": "ai-product",
        "original-internal-id": "...",
        "num-tokens": "1024",
        "api-key": "...",
        "connection-string": "DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net",
        "container-name": "ai-product-chunks"
      },
      "authIdentity": null
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}

I want to add a metadata field to the output, so that the resulting chunk files are created with metadata such as originalFile: file_name.pdf. Is this possible within a skillset, or is an additional layer required?

Solution

The Azure OpenAI preprocessing-jobs endpoint (https://test.openai.azure.com/openai/preprocessing-jobs?api-version=2023-03-31-preview) only performs chunking and returns only the chunks, so its output does not carry the metadata details.

Instead, you can use a custom Web API skill that accepts the document content and metadata and returns the chunks together with the metadata.
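To make the request and response shape concrete, here is a minimal sketch of the payload a custom Web API skill exchanges with the indexer, written as plain Python with no Azure dependency. The input names content and metadata_filename and the output names chunk and originalFile match the function code in this answer; the sample values, the make_response helper, and the sentence-based split are assumptions for illustration only.

```python
import json

# Hypothetical request body, shaped like what the indexer POSTs to a custom skill.
sample_request = {
    "values": [
        {
            "recordId": "0",
            "data": {
                "content": "First sentence. Second sentence.",
                "metadata_filename": "file_name.pdf",
            },
        }
    ]
}


def make_response(request: dict) -> dict:
    """Echo each recordId and attach the chunks plus the originating file name."""
    values = []
    for record in request["values"]:
        data = record["data"]
        values.append({
            "recordId": record["recordId"],  # must match the incoming recordId
            "data": {
                # Placeholder split on full stops; real chunking logic goes here.
                "chunk": [s.strip() for s in data["content"].split(".") if s.strip()],
                "originalFile": data["metadata_filename"],
            },
            "errors": [],
            "warnings": [],
        })
    return {"values": values}


if __name__ == "__main__":
    print(json.dumps(make_response(sample_request), indent=2))
```

Each response record carries the same recordId as the request record it answers, which is how the enrichment pipeline matches outputs to documents.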

Below is sample Azure Function app code for such a skill.

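import json
import logging

import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.ANONYMOUS)


def split_into_chunks(text, size=1000):
    # Placeholder: naive fixed-size split. Replace with your own chunking logic.
    return [text[i:i + size] for i in range(0, len(text), size)]


@app.route(route="http_trigger")
def http_trigger(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Get the content and metadata from the request.
    req_body = req.get_json()
    values = req_body.get('values', [])
    res = []

    for record in values:
        # Keep the recordId of the incoming record and replace its data with the skill output.
        out = dict(record)
        out['data'] = {
            'chunk': split_into_chunks(record['data']['content']),
            'originalFile': record['data']['metadata_filename'],
        }
        res.append(out)

    # Always return a response, even when the request contains no records.
    return func.HttpResponse(json.dumps({"values": res}), mimetype="application/json")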

Then map the skill outputs to index fields in the indexer.

Refer to this Stack Overflow solution for how to map the fields.
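As a rough illustration of the wiring, the sketch below builds a skill definition and the indexer's output field mappings as Python dictionaries and prints them as JSON. The skill name, the function URL, and the index field names are hypothetical placeholders. Sourcing the file name from /document/metadata_storage_name assumes a blob indexer; if your enrichment tree exposes /document/filename as in the question, use that instead.

```python
import json

# Hypothetical skill definition that calls the function app's http_trigger route.
custom_skill = {
    "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
    "name": "chunk-with-metadata",  # placeholder name
    "context": "/document",
    "uri": "https://<your-function-app>.azurewebsites.net/api/http_trigger",  # placeholder URL
    "httpMethod": "POST",
    "inputs": [
        {"name": "content", "source": "/document/content"},
        # Assumption: blob indexers expose the file name as metadata_storage_name.
        {"name": "metadata_filename", "source": "/document/metadata_storage_name"},
    ],
    "outputs": [
        {"name": "chunk", "targetName": "chunk"},
        {"name": "originalFile", "targetName": "originalFile"},
    ],
}

# Indexer-side mappings from enrichment tree nodes to index fields.
# The target field names are placeholders and must exist in your index.
output_field_mappings = [
    {"sourceFieldName": "/document/chunk", "targetFieldName": "chunk"},
    {"sourceFieldName": "/document/originalFile", "targetFieldName": "originalFile"},
]

if __name__ == "__main__":
    print(json.dumps({"skills": [custom_skill]}, indent=2))
    print(json.dumps({"outputFieldMappings": output_field_mappings}, indent=2))
```

With this layout the chunk field would need to be a collection type in the index, since the skill returns a list of chunks per document.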

Here, each recordId produces multiple chunks. To get each chunk as an individual search document with its file name, create a secondary index and use index projections. In that case, you return only the chunk and metadata; there is no need for recordId.
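A possible shape for the index projection is sketched below as a Python dictionary that serializes to the skillset's indexProjections property. The target index name and the parent key field are hypothetical, and the exact property names can differ between API versions, so treat this as an assumption to check against the version you call.

```python
import json

# Hypothetical projection: one search document per chunk in the secondary index.
index_projections = {
    "selectors": [
        {
            "targetIndexName": "ai-product-chunks-index",  # placeholder index name
            "parentKeyFieldName": "parent_id",             # placeholder key field
            "sourceContext": "/document/chunk/*",          # iterate over each chunk
            "mappings": [
                {"name": "chunk", "source": "/document/chunk/*"},
                {"name": "originalFile", "source": "/document/originalFile"},
            ],
        }
    ],
    # Skip indexing the parent document so only chunk documents are written.
    "parameters": {"projectionMode": "skipIndexingParentDocuments"},
}

if __name__ == "__main__":
    print(json.dumps({"indexProjections": index_projections}, indent=2))
```

Because sourceContext enumerates the chunks, every projected document gets its own chunk value while originalFile is repeated from the parent document.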

See here for more about index projections.