I have CSV files on Azure Blob Storage which contain a field Type (among other fields). I use the indexer parsing mode delimitedText to split them into documents - one per row. The AI enrichment tree for such a document should contain the node /document/Type.
According to https://learn.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage, when indexing Azure Blobs one also gets metadata nodes like /document/metadata_storage_name and /document/metadata_storage_path.
For example, suppose I have a file xyz.csv and there is a record inside with Type = abc. Then this record presumably (I think) will have /document/Type = "abc" and /document/metadata_storage_name = "xyz.csv" in the AI enrichment tree.
Assuming I get it so far, can I somehow generate a new field for the index that would combine the file name without the extension and the type using a certain literal string to separate the two?
In my example I would like this field to have the value abc @ xyz.
Can it be done without a custom skill?
To achieve this without a custom skill, you need to use MergeSkill, SplitSkill and the mapping function extractTokenAtPosition to get only the file name without the extension.
First, create two new fields in the index, filename and combinedText, both of type string.
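As a rough sketch of what those two field entries might look like in the index definition, the Python snippet below builds them and prints them as JSON. The helper string_field and the attribute choices (searchable, retrievable) are assumptions for illustration only; adjust them to whatever your index needs.

```python
import json


def string_field(name, searchable=True):
    """Build a minimal Edm.String field entry for an index definition."""
    return {
        "name": name,
        "type": "Edm.String",
        "searchable": searchable,   # assumed attribute choice
        "retrievable": True,        # assumed attribute choice
    }


# The two fields this answer relies on.
new_fields = [string_field("filename"), string_field("combinedText")]

print(json.dumps(new_fields, indent=2))
```

The printed entries can be appended to the "fields" array of the existing index definition.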
Next, in the indexer definition, add the mapping function to the field mappings as below. This extracts only the file name.
"fieldMappings": [
  {
    "sourceFieldName": "AzureSearch_DocumentKey",
    "targetFieldName": "AzureSearch_DocumentKey",
    "mappingFunction": {
      "name": "base64Encode",
      "parameters": null
    }
  },
  {
    "sourceFieldName": "metadata_storage_name",
    "targetFieldName": "filename",
    "mappingFunction": {
      "name": "extractTokenAtPosition",
      "parameters": {
        "delimiter": ".",
        "position": 0
      }
    }
  }
],
For details, refer to the documentation on field mapping functions.
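To see what this mapping does to a file name, here is a small Python emulation of the split-and-pick behaviour. The function extract_token_at_position is a hypothetical local stand-in, not the service's implementation; it only mirrors the idea of splitting on the delimiter and taking the token at the given zero-based position.

```python
def extract_token_at_position(value, delimiter, position):
    """Split value on delimiter and return the token at the zero-based position."""
    tokens = value.split(delimiter)
    return tokens[position]


print(extract_token_at_position("xyz.csv", ".", 0))          # xyz
print(extract_token_at_position("report.2024.csv", ".", 0))  # report
```

Note the second call: with "." as the delimiter and position 0, a file name that contains more than one dot keeps only the part before the first dot.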
Use the text split skill to convert this string into an array of strings, since that is the type the merge skill accepts for its itemsToInsert input.
Create a skillset like the one below.
{
  "name": "skillset1725958890028",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "textSplitMode": "pages",
      "defaultLanguageCode": "en",
      "inputs": [
        {
          "name": "text",
          "source": "/document/filename"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "arrayresult"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "CombineFieldsSkill",
      "description": "Combines multiple fields into a single string",
      "context": "/document",
      "insertPreTag": "@",
      "insertPostTag": "",
      "inputs": [
        {
          "name": "text",
          "source": "/document/Type"
        },
        {
          "name": "itemsToInsert",
          "source": "/document/arrayresult"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "combined_field"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices"
  }
}
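To reason about what the merge step produces, the following Python sketch mimics the concatenation locally. It is a simplified model, not the service's code: merge_text is a hypothetical helper, and it assumes that when no offsets input is supplied, each item is appended to the end of the text wrapped in insertPreTag and insertPostTag. It also assumes the split skill returns a short file name as a single-element array.

```python
def merge_text(text, items_to_insert, pre_tag=" ", post_tag=" "):
    """Append each item to text, wrapped in pre_tag and post_tag.

    Assumption: with no offsets input, items are appended at the end.
    """
    merged = text
    for item in items_to_insert:
        merged += pre_tag + item + post_tag
    return merged


# Assumed output of the split skill for the file name "xyz".
pages = ["xyz"]

print(merge_text("abc", pages, pre_tag="@", post_tag=""))    # abc@xyz
print(merge_text("abc", pages, pre_tag=" @ ", post_tag=""))  # abc @ xyz
```

Under this model, the tags in the skillset above give abc@xyz. If you want the spaces around the separator, as in abc @ xyz, setting insertPreTag to " @ " may be worth trying.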
Add output field mappings to the indexer to populate the index with the results. Below is the full indexer definition.
{
  "@odata.context": "https://xyx.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"0x8DCD17A4933D27D\"",
  "name": "azureblob-indexer",
  "description": "",
  "dataSourceName": "ds",
  "skillsetName": "skillset1725958890028",
  "targetIndexName": "azureblob-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": null,
    "maxFailedItems": 0,
    "maxFailedItemsPerBatch": 0,
    "base64EncodeKeys": null,
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "delimitedText",
      "firstLineContainsHeaders": true,
      "delimitedTextDelimiter": ",",
      "delimitedTextHeaders": ""
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "AzureSearch_DocumentKey",
      "targetFieldName": "AzureSearch_DocumentKey",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_storage_name",
      "targetFieldName": "filename",
      "mappingFunction": {
        "name": "extractTokenAtPosition",
        "parameters": {
          "delimiter": ".",
          "position@odata.type": "#Int64",
          "position": 0
        }
      }
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/combined_field",
      "targetFieldName": "combinedText"
    }
  ],
  "cache": null,
  "encryptionKey": null
}
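If you prefer to push the definition through the REST API rather than the portal, a minimal Python sketch could look like the one below. It only builds the PUT request and does not send it. The service name my-service, the key placeholder, the helper build_indexer_request and the API version are all assumptions for illustration. The payload is a trimmed copy of the definition above and leaves out the @odata.context and @odata.etag properties on the assumption that they are not needed when creating the indexer.

```python
import json
import urllib.request

API_VERSION = "2024-07-01"  # assumed; use the version your service supports


def build_indexer_request(service_name, api_key, indexer):
    """Build (but do not send) a PUT request that creates or updates an indexer."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexers/{indexer['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(indexer).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


# Trimmed version of the indexer definition shown above.
indexer = {
    "name": "azureblob-indexer",
    "dataSourceName": "ds",
    "skillsetName": "skillset1725958890028",
    "targetIndexName": "azureblob-index",
    "parameters": {
        "configuration": {
            "parsingMode": "delimitedText",
            "firstLineContainsHeaders": True,
        }
    },
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_name",
            "targetFieldName": "filename",
            "mappingFunction": {
                "name": "extractTokenAtPosition",
                "parameters": {"delimiter": ".", "position": 0},
            },
        }
    ],
    "outputFieldMappings": [
        {
            "sourceFieldName": "/document/combined_field",
            "targetFieldName": "combinedText",
        }
    ],
}

# "my-service" and the key are placeholders, not real values.
request = build_indexer_request("my-service", "<admin-api-key>", indexer)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```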
Output: