azure  azure-cognitive-services  azure-cognitive-search  azure-ai

Using date metadata from DOCX and PDF files for Azure Cognitive Search


I'm uploading a lot of DOCX and PDF files to blob storage to be used in Azure Cognitive Search. I'm already using it to experiment with some AI capabilities, and it works well, but I would like to try filtering by freshness. I'm not sure how the metadata for these files (e.g., 'author', 'date', 'title') can be added through a skill. This is my current skillset. Any advice would be appreciated. Thanks.

{
  "@odata.context": ... ,
  "@odata.etag": ... ,
  "name": "freshness",
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      ...
    },
    {
      "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
      "name": "#3",
      "description": "Extracts metadata from the document",
      "context": "/document",
      "inputs": [
        {
          "name": "metadata_creation_date",
          "source": "/document/metadata_creation_date"
        }
      ],
      "outputs": [
        {
          "name": "output",
          "targetName": "creationDate"
        }
      ]
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "freshness",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "creationDate",
            "source": "/document/creationDate",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  },
  "encryptionKey": null
}

Solution

  • If you already have the index, you can add a new field of type Edm.DateTimeOffset to it and mark it as filterable.

    After creating the field, map it in the indexer's fieldMappings:

    "fieldMappings": [
      {
        "sourceFieldName": "metadata_storage_path",
        "targetFieldName": "metadata_storage_path",
        "mappingFunction": {
          "name": "base64Encode",
          "parameters": null
        }
      },
      {
        "sourceFieldName": "metadata_storage_last_modified",
        "targetFieldName": "last_modified"
      }
    ]
    

    Alternatively, while importing data, go to the Customize target index step and check the Filterable box for the date field.

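
To make the steps above concrete, here is a minimal Python sketch of the three pieces involved: the date field definition, the indexer field mapping, and an OData filter string that keeps only recent documents. The helper names (`date_field`, `field_mapping`, `freshness_filter`) and the 30-day window are my own for illustration and are not part of any SDK. The JSON property names follow the index and indexer definitions shown above, and `last_modified` is the target field from the mapping. Nothing here calls the service; it only builds the payload fragments and the filter text.

```python
import json
from datetime import datetime, timedelta, timezone


def date_field(name: str) -> dict:
    """Index field definition for a date that can be filtered and sorted on."""
    return {
        "name": name,
        "type": "Edm.DateTimeOffset",
        "filterable": True,
        "sortable": True,
    }


def field_mapping(source: str, target: str) -> dict:
    """Indexer fieldMappings entry that copies a blob property into an index field."""
    return {"sourceFieldName": source, "targetFieldName": target}


def freshness_filter(field: str, max_age_days: int, now: datetime) -> str:
    """OData filter keeping documents whose date is within the last max_age_days.

    `now` must be timezone-aware; the cutoff is rendered in UTC.
    """
    cutoff = (now - timedelta(days=max_age_days)).astimezone(timezone.utc)
    return f"{field} ge {cutoff.strftime('%Y-%m-%dT%H:%M:%SZ')}"


if __name__ == "__main__":
    # Fragment to add to the index's "fields" array.
    print(json.dumps(date_field("last_modified"), indent=2))
    # Fragment to add to the indexer's "fieldMappings" array.
    print(json.dumps(
        field_mapping("metadata_storage_last_modified", "last_modified"),
        indent=2,
    ))
    # Filter expression for documents modified in the last 30 days.
    print(freshness_filter("last_modified", 30, datetime.now(timezone.utc)))
```

The filter string is what you would pass as the filter of a search request (the `$filter` query parameter in the REST API). It only works if the field was created as filterable.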