I'm working on indexing static HTML content in blob storage. The documentation states that preprocessing analyzers will strip surrounding HTML tags when indexing content from that data source. However, the value of our content field is always the entire raw HTML document. I'm also unable to pull out the value of our "meta description" tags. According to the documentation on Indexing Blob Storage, HTML content should automatically produce a metadata_description property, but the value is always null.
I've tried many different indexer configurations, but thus far have not been able to tell if I have something misconfigured or if Azure Search doesn't recognize the content type properly.
All of the files in blob storage have a .html file extension, and the Content Type column shows text/html.
This is the indexer configuration (some bits <redacted>):
{
  "@odata.context": "https://<instance>.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"<tag>\"",
  "name": "<name>",
  "description": null,
  "dataSourceName": "<datasource name>",
  "skillsetName": null,
  "targetIndexName": "<target index>",
  "disabled": null,
  "schedule": {
    "interval": "PT2H",
    "startTime": "0001-01-01T00:00:00Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "text",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_description",
      "targetFieldName": "description",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url",
      "mappingFunction": {
        "name": "extractTokenAtPosition",
        "parameters": {
          "delimiter": "<delimiter>",
          "position": 1
        }
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null
}
This is most likely caused by the "parsingMode": "text" setting in your indexer configuration.
That parsing mode treats each blob as plain text and indexes its contents literally. In this case, that includes all of the HTML tags. Because the blob is not parsed as HTML, this would likely also explain why metadata_description is always null.
Change that setting to "parsingMode": "default" to strip the HTML tags from your documents.
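As a rough sketch of that change, the parameters block from the question would look like this with only parsingMode altered. Every other value is copied unchanged from the question, and the surrounding braces are added here only so the fragment is valid JSON on its own:

```json
{
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  }
}
```

Blobs that were already indexed may not be reprocessed just because the indexer definition changed, so you will probably need to reset the indexer and run it again before content and metadata_description are repopulated.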