elasticsearch regex-lookarounds amazon-opensearch

Avoid dot on ElasticSearch/OpenSearch


I am new to Elasticsearch and am currently working with OpenSearch on the AWS OpenSearch Service. In Dev Tools, I have the following query:

GET _search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must_not": [
        {
          "regexp": {
            "handler_id": "[^_*\\-.;?#$%^@!`,/?+()~<>:'\\[\\]{}]*`"
          }
        }
      ],
      "must": [
        {
          "regexp": {
            "handler_id": "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "handler_id.keyword": {
        "order": "asc"
      }
    }
  ]
}

The above query is supposed to return all handler_id values that contain no special characters and that also match the format in the must clause. It mostly works, but it always returns handler_id = .MP4137879580. I also tried the regex ^[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}(?![^.]+$), and then "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}" to exclude the dot, but the ID still showed up.

Please give me some pointers on how to troubleshoot this problem. Thank you!


Solution

  • TLDR:

    GET _search
    {
      "from": 0,
      "size": 10,
      "query": {
        "bool": {
          "must": [
            {
              "regexp": {
                "handler_id.keyword": "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}"
              }
            },
            {
              "regexp": {
                "handler_id.keyword": "[^_*\\-.;?#$%^@!`,/?+()~<>:'\\[\\]{}].*"
              }
            }
          ]
        }
      },
      "sort": [
        {
          "handler_id.keyword": {
            "order": "asc"
          }
        }
      ]
    }
    

    I tested this on Elasticsearch. I am not using OpenSearch and have no plans to start, but the query is simple enough that it should work there too.

    There are several problems in your query.

    The first one is that, with the default dynamic mapping, Elasticsearch indexes each string field twice: once in analyzed form and once in non-analyzed form. The analyzed form is stored in handler_id, and for your test string it is converted into mp4137879580 (lowercased, split into words, with punctuation removed). In handler_id.keyword, your original string is indexed as is. So when you use handler_id in a regexp, you are searching these converted strings instead of the original ones. The first fix is therefore to use handler_id.keyword in your query.
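
    One way to check what the analyzed field contains is to run the sample value through the _analyze API. This is a minimal sketch that assumes handler_id uses the default standard analyzer; the question does not show the mapping, so that part is a guess.

    ```
    # Assumption: handler_id is a text field analyzed with the standard analyzer
    GET _analyze
    {
      "analyzer": "standard",
      "text": ".MP4137879580"
    }
    ```

    If the assumption holds, the tokens in the response are what a regexp on handler_id is matched against, which is why the leading dot is invisible to it.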

    The second issue is that the regexp in must_not has an extra backtick at the end, which doesn't match. Just remove it.

    The third issue is that you are using a double negative. First you find all handler_ids that don't contain punctuation, and then you wrap that in must_not, essentially saying "I don't want any of these". So you need to either move your regex into must, or change your regex to match handlers with punctuation and keep it in must_not. I picked the first solution in my example.
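
    For comparison, the other option (a regex that matches IDs containing punctuation, kept in must_not) might look like the sketch below. It is untested. The must clause is copied unchanged from the query above, and the must_not pattern reuses the character class from the question without the leading ^. The surrounding .* is needed because Lucene regexps have to match the whole value, not just part of it.

    ```
    # Untested sketch: must_not excludes IDs containing any of the listed special characters
    GET _search
    {
      "query": {
        "bool": {
          "must": [
            {
              "regexp": {
                "handler_id.keyword": "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}"
              }
            }
          ],
          "must_not": [
            {
              "regexp": {
                "handler_id.keyword": ".*[_*\\-.;?#$%^@!`,/?+()~<>:'\\[\\]{}].*"
              }
            }
          ]
        }
      }
    }
    ```

    Both variants express the same intent. This one keeps the "reject anything with punctuation" rule as an explicit exclusion, which may read closer to the original query.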