I am new to Elasticsearch and am currently working with OpenSearch on the AWS OpenSearch Service. In Dev Tools, I have the following query:
GET _search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must_not": [
        {
          "regexp": {
            "handler_id": "[^_*\\-.;?#$%^@!`,/?+()~<>:'\\[\\]{}]*`"
          }
        }
      ],
      "must": [
        {
          "regexp": {
            "handler_id": "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "handler_id.keyword": {
        "order": "asc"
      }
    }
  ]
}
The query above is supposed to return every handler_id that has no special characters in it and that also matches the format in the must clause. It works, but it always returns handler_id = .MP4137879580. I also tried the regex ^[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}(?![^.]+$), then "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}" to escape the dot, but the ID still showed up.
Please give me some pointers on how to troubleshoot this problem. Thank you!
TLDR:
GET _search
{
  "from": 0,
  "size": 10,
  "query": {
    "bool": {
      "must": [
        {
          "regexp": {
            "handler_id.keyword": "~([^.])*[A-Za-z]{2}[a-zA-Z0-9]{2}[0-9]{8}"
          }
        },
        {
          "regexp": {
            "handler_id.keyword": "[^_*\\-.;?#$%^@!`,/?+()~<>:'\\[\\]{}].*"
          }
        }
      ]
    }
  },
  "sort": [
    {
      "handler_id.keyword": {
        "order": "asc"
      }
    }
  ]
}
I tested this on Elasticsearch. Sorry, I am not using OpenSearch and have no plans to start, but the query is simple enough that it should work there too.
There are several problems in your query.
The first one is that, with the default mapping, Elasticsearch indexes each string field twice: once in analyzed form and once in non-analyzed form. The analyzed form is stored in handler_id, and for your test string it is converted into mp4137879580 (lowercased, split into words, with punctuation removed). In handler_id.keyword your original string is indexed as is. So when you use handler_id in a regexp query, you are searching these converted strings instead of the original strings, and the leading dot is already gone by then. So the first fix is to use handler_id.keyword in your query.
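To make the difference between the two fields concrete, here is a rough Python sketch of what happens to the test string. It only approximates the standard analyzer (lowercase, then keep runs of letters and digits); the real tokenizer follows Unicode word-boundary rules, and the function name approx_standard_analyze is made up for illustration.

```python
import re

def approx_standard_analyze(text):
    """Rough stand-in for the standard analyzer: lowercase, keep runs of letters/digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

original = ".MP4137879580"
tokens = approx_standard_analyze(original)

print(tokens)    # roughly what a regexp on handler_id is matched against
print(original)  # what a regexp on handler_id.keyword is matched against
```

On a real cluster, the _analyze API with "analyzer": "standard" and the same text should show the actual tokens.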
The second issue is that your regexp has an extra backtick at the end. A regexp query has to match the whole value, so with the trailing backtick the pattern never matches your IDs. Just remove it.
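A quick way to check the effect of the backtick outside the cluster is Python's re.fullmatch, which, like a regexp query, has to match the whole string. This is only a sketch: it assumes the character class behaves the same in Python as in Lucene (it uses nothing beyond a negated class and a star), and the pattern is the one from the question with the JSON escaping undone.

```python
import re

# Character class from the question, JSON escaping undone (\\- becomes \-).
CLEAN = r"[^_*\-.;?#$%^@!`,/?+()~<>:'\[\]{}]*"

with_backtick = re.compile(CLEAN + "`")  # pattern as posted
without_backtick = re.compile(CLEAN)     # pattern with the backtick removed

for value in ["mp4137879580", ".MP4137879580"]:
    print(
        value,
        bool(with_backtick.fullmatch(value)),
        bool(without_backtick.fullmatch(value)),
    )
```

Since the posted pattern matches neither value, a must_not built on it excludes nothing, which would explain why the query seemed to work while still returning the dotted ID.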
The third issue is that you are using a double negative. Your regex finds all handler_ids that don't contain punctuation, and then you wrap it in must_not, essentially saying "I don't want any of these". So you need to either move your regex into must, or change it to match handlers with punctuation and keep it in must_not. I picked the first solution in my example.
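The double negative is easy to see with a few values in plain Python. This is a sketch, not the cluster's behaviour: re.fullmatch stands in for the regexp query, the pattern is the "no special characters" class from the question without the trailing backtick, and only the first ID comes from the question (the other two are made up).

```python
import re

# "No special characters" pattern from the question, trailing backtick removed.
CLEAN = re.compile(r"[^_*\-.;?#$%^@!`,/?+()~<>:'\[\]{}]*")

# First value is from the question; the other two are made-up examples.
ids = [".MP4137879580", "MP4137879580", "AB12#34567890"]

in_must_not = [i for i in ids if not CLEAN.fullmatch(i)]  # original placement
in_must = [i for i in ids if CLEAN.fullmatch(i)]          # corrected placement

print(in_must_not)  # IDs that survive when the clean pattern sits in must_not
print(in_must)      # IDs that survive when the clean pattern sits in must
```

With the pattern in must_not, only the IDs that do contain punctuation survive, which is the opposite of what the question asks for.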