azure, azure-cognitive-search

Is it possible to index the CSV file name without extension + a literal string + a CSV field value in Azure AI Search?


I have CSV files on Azure Blob Storage which contain a field Type (among other fields). I use the indexer parsing mode delimitedText to split them into documents, one per row. The AI enrichment tree for such a document should contain the node /document/Type.

According to https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage, when indexing Azure blobs one also gets metadata nodes like /document/metadata_storage_name and /document/metadata_storage_path.

For example, suppose I have a file xyz.csv that contains a record with Type = abc. I assume this record will then have /document/Type = "abc" and /document/metadata_storage_name = "xyz.csv" in the AI enrichment tree.

If that is correct so far, can I generate a new index field that combines the type and the file name without its extension, using a literal string to separate the two?

In my example I would like this field to have the value abc @ xyz.

Can it be done without a custom skill?


Solution

  • To achieve this without a custom skill, you can use the Text Split skill (SplitSkill), the Text Merge skill (MergeSkill), and the extractTokenAtPosition mapping function, which returns the file name without its extension.

    First, add two new fields of type string to the index: filename and combinedText.

    Next, add the mapping function to the field mappings in the indexer definition, as shown below. It extracts only the file name.

    "fieldMappings": [
        {
          "sourceFieldName": "AzureSearch_DocumentKey",
          "targetFieldName": "AzureSearch_DocumentKey",
          "mappingFunction": {
            "name": "base64Encode",
            "parameters": null
          }
        },
        {
          "sourceFieldName": "metadata_storage_name",
          "targetFieldName": "filename",
          "mappingFunction": {
            "name": "extractTokenAtPosition",
            "parameters": {
              "delimiter": ".",
              "position": 0
            }
          }
        }
      ],
    

    See the Azure AI Search documentation on field mapping functions for details.
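To make the behavior of the mapping function concrete, here is a minimal Python sketch of what extractTokenAtPosition does conceptually: it splits the value on the delimiter and returns the token at the zero-based position. The function name `extract_token_at_position` and the file name `report.2024.csv` are hypothetical and used only for illustration; this is not the service's implementation.

```python
def extract_token_at_position(value: str, delimiter: str, position: int) -> str:
    """Approximate extractTokenAtPosition: split on the delimiter and
    return the token at the given zero-based position."""
    tokens = value.split(delimiter)
    return tokens[position]


# The file name from the question: the token before the first dot.
print(extract_token_at_position("xyz.csv", ".", 0))          # xyz

# Hypothetical file name with more than one dot: only the part
# before the FIRST dot is returned, not "report.2024".
print(extract_token_at_position("report.2024.csv", ".", 0))  # report
```

Note that with "." as the delimiter and position 0, a file name containing additional dots is cut at the first dot, so this approach fits best when the base name itself contains no dots.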

    Then use the Text Split skill to convert this string into an array of strings, because that is the type the Text Merge skill accepts for its itemsToInsert input.

    Create a skillset like the one below.

    {
      "name": "skillset1725958890028",
      "description": "",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "textSplitMode": "pages",
          "defaultLanguageCode": "en",
          "inputs": [
            {
              "name": "text",
              "source": "/document/filename"
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "arrayresult"
            }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
          "name": "CombineFieldsSkill",
          "description": "Combines multiple fields into a single string",
          "context": "/document",
          "insertPreTag": " @ ",
          "insertPostTag": "",
          "inputs": [
            {
              "name": "text",
              "source": "/document/Type"
            },
            {
              "name": "itemsToInsert",
              "source": "/document/arrayresult"
            }
          ],
          "outputs": [
            {
              "name": "mergedText",
              "targetName": "combined_field"
            }
          ]
        }
      ],
      "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices"
      }
    }
    
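As a rough illustration of how the two skills combine the values, the following Python sketch mimics the Text Merge skill when no offsets input is supplied, in which case the items are appended to the end of the text, each wrapped in the pre and post tags. The function `merge_text` is a hypothetical stand-in written for this example, and the real skill may differ in details such as whitespace handling.

```python
def merge_text(text, items_to_insert, pre_tag=" ", post_tag=" "):
    """Approximate the Text Merge skill without offsets: append every
    item to the text, surrounded by the pre and post tags."""
    merged = text
    for item in items_to_insert:
        merged += pre_tag + item + post_tag
    return merged


# /document/Type is "abc"; the split skill yields ["xyz"] for the file name.
print(merge_text("abc", ["xyz"], pre_tag="@", post_tag=""))    # abc@xyz
print(merge_text("abc", ["xyz"], pre_tag=" @ ", post_tag=""))  # abc @ xyz
```

Under this reading, a pre tag of "@" produces abc@xyz, while a pre tag of " @ " (with surrounding spaces) produces the abc @ xyz format asked for in the question.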

    Add output field mappings to the indexer so that the merged result populates the combinedText field. The full indexer definition is below.

    {
      "@odata.context": "https://xyx.search.windows.net/$metadata#indexers/$entity",
      "@odata.etag": "\"0x8DCD17A4933D27D\"",
      "name": "azureblob-indexer",
      "description": "",
      "dataSourceName": "ds",
      "skillsetName": "skillset1725958890028",
      "targetIndexName": "azureblob-index",
      "disabled": null,
      "schedule": null,
      "parameters": {
        "batchSize": null,
        "maxFailedItems": 0,
        "maxFailedItemsPerBatch": 0,
        "base64EncodeKeys": null,
        "configuration": {
          "dataToExtract": "contentAndMetadata",
          "parsingMode": "delimitedText",
          "firstLineContainsHeaders": true,
          "delimitedTextDelimiter": ",",
          "delimitedTextHeaders": ""
        }
      },
      "fieldMappings": [
        {
          "sourceFieldName": "AzureSearch_DocumentKey",
          "targetFieldName": "AzureSearch_DocumentKey",
          "mappingFunction": {
            "name": "base64Encode",
            "parameters": null
          }
        },
        {
          "sourceFieldName": "metadata_storage_name",
          "targetFieldName": "filename",
          "mappingFunction": {
            "name": "extractTokenAtPosition",
            "parameters": {
              "delimiter": ".",
              "position@odata.type": "#Int64",
              "position": 0
            }
          }
        }
      ],
      "outputFieldMappings": [
        {
          "sourceFieldName": "/document/combined_field",
          "targetFieldName": "combinedText"
        }
      ],
      "cache": null,
      "encryptionKey": null
    }
    
