[SOLVED] Indexing static HTML blob storage content with Azure Cognitive Search is not working as expected

I'm working on indexing static HTML content in blob storage. The documentation states that preprocessing analyzers will strip surrounding HTML tags when indexing content from that data source. However, our content value is always the entire raw HTML document. I'm also unable to pull out the value of our "meta description" tags. According to the documentation on Indexing Blob Storage, HTML content should automatically produce a metadata_description property, but the value is always null.

I've tried many different indexer configurations, but thus far have not been able to tell if I have something misconfigured or if Azure Search doesn't recognize the content type properly.

All of the files in blob storage have a .html file extension, and the Content Type column shows text/html.

This is the indexer configuration (some bits <redacted>):

{
  "@odata.context": "https://<instance>.search.windows.net/$metadata#indexers/$entity",
  "@odata.etag": "\"<tag>\"",
  "name": "<name>",
  "description": null,
  "dataSourceName": "<datasource name>",
  "skillsetName": null,
  "targetIndexName": "<target index>",
  "disabled": null,
  "schedule": {
    "interval": "PT2H",
    "startTime": "0001-01-01T00:00:00Z"
  },
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "text",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": {
        "name": "base64Encode",
        "parameters": null
      }
    },
    {
      "sourceFieldName": "metadata_description",
      "targetFieldName": "description",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "url",
      "mappingFunction": {
        "name": "extractTokenAtPosition",
        "parameters": {
          "delimiter": "<delimiter>",
          "position": 1
        }
      }
    }
  ],
  "outputFieldMappings": [],
  "cache": null
}

Solution

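This is most likely caused by the "parsingMode": "text" setting in your indexer configuration.

That parsing mode treats each blob as plain text and indexes its literal contents, which for your files includes all of the HTML tags.

Change the setting to "parsingMode": "default" so that the indexer strips the HTML tags from your documents. Because the blobs are then processed as HTML rather than as plain text, this should also allow metadata_description to be populated from the meta description tag.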
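
For reference, here is a sketch of how the "parameters" section of the indexer from the question might look after the change. Only the "parsingMode" value differs; every other value is carried over unchanged from the configuration posted above, so treat it as an illustration rather than a verified configuration.

```json
{
  "parameters": {
    "batchSize": null,
    "maxFailedItems": -1,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "parsingMode": "default",
      "dataToExtract": "contentAndMetadata",
      "excludedFileNameExtensions": ".png .jpg .mpg .pdf",
      "indexedFileNameExtensions": ".html"
    }
  }
}
```

Blobs that were already indexed and have not changed since will probably not be picked up again on the next scheduled run, so you may need to reset the indexer and run it again before the stripped content shows up in the index.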