Here is a sample JSON document indexed in OpenSearch:
{
  "_index": "filebeat-7.12.1-2024.08.28",
  "_type": "_doc",
  "_id": "RF64mZEBFMf-66jeR0WD",
  "_version": 1,
  "_score": null,
  "_source": {
    "cloud": {},
    "message": "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl : Query from ES took:1.5s",
    "event": {
      "created": "2024-08-28T18:01:15.557Z"
    }
  },
  "fields": {
    "event.created": [
      "2024-08-28T18:01:15.557Z"
    ]
  },
  "highlight": {
    "logger.type": [
      "@opensearch-dashboards-highlighted-field@WLS@/opensearch-dashboards-highlighted-field@"
    ],
    "message": [
      "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl : Query from ES took:@opensearch-dashboards-highlighted-field@1.5s@/opensearch-dashboards-highlighted-field@"
    ]
  },
  "sort": [
    1,
    1724868075557
  ]
}
I wish to filter on the message field with a regexp. Here is its mapping:
"message": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}
Using this DSL filter to regexp-match the time part of the message field works:
{
  "query": {
    "regexp": {
      "message": {
        "value": "[0-9]\\.?[0-9]*s"
      }
    }
  }
}
Using this DSL filter to regexp-match across the whole text of the message field fails:
{
  "query": {
    "regexp": {
      "message": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}
This DSL filter also fails:
{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}
The matched message field text value in the above sample:
"%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl : Query from ES took:1.5s"
The difference in the regexp patterns:
"value": "Q.*[0-9]\\.?[0-9]*s"
"value": "[0-9]\\.?[0-9]*s"
Please advise a DSL filter with a regular expression pattern like "Query from ES took:[0-9]\\.?[0-9]*s"
to match text like "Query from ES took:12.553s".
The time number can range from 0 to 999.999.
You are using this mapping for the message field:
{
  "message": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
If you are using, for example, the standard tokenizer, then with this query the message field is tokenized and the regex looks for a match among the individual tokens. One of those tokens is 1.5s, so there is a match:
{
  "query": {
    "regexp": {
      "message": {
        "value": "[0-9]\\.?[0-9]*s"
      }
    }
  }
}
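To see why the token-level query matches while the Q.* variant does not, here is a small Python sketch (not OpenSearch itself). It uses re.fullmatch to imitate Lucene's anchored matching against a hand-picked subset of the tokens the standard analyzer produces for this message; note that the standard analyzer also lowercases its tokens, which is another reason a capital Q can never match a token.

```python
import re

# Hand-picked subset of the tokens the standard analyzer emits for the
# sample message (the standard analyzer lowercases its tokens).
tokens = ["query", "from", "es", "took", "1.5s"]

# Lucene regexp queries are anchored: on a text field the pattern must
# match an ENTIRE token. Python's re.fullmatch imitates that anchoring
# for these simple patterns (the two regex dialects differ in general).
assert any(re.fullmatch(r"[0-9]\.?[0-9]*s", t) for t in tokens)         # "1.5s" matches
assert not any(re.fullmatch(r"Q.*[0-9]\.?[0-9]*s", t) for t in tokens)  # no single token matches
```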
If you are using this query:
{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": "Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}
You are searching in the keyword sub-field, which is not analyzed, and a Lucene regexp query is anchored: the pattern has to match the entire stored value, not a substring of it. The stored value here begins with "%xwEx...", so a pattern that starts at Q can never match. Make the pattern cover the whole value by updating the regex to:
{
  "query": {
    "regexp": {
      "message.keyword": {
        "value": ".*Q.*[0-9]\\.?[0-9]*s"
      }
    }
  }
}
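The same whole-value anchoring can be checked with a quick Python sketch, using re.fullmatch as a stand-in for Lucene's match against the entire keyword value:

```python
import re

message = ("%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] "
           "c.b.k.s.s.i.ScorerServiceImpl : Query from ES took:1.5s")

# On message.keyword the pattern must cover the ENTIRE stored value.
# The value starts with "%xwEx...", so a pattern starting at "Q" fails:
assert re.fullmatch(r"Q.*[0-9]\.?[0-9]*s", message) is None

# With a leading .* the pattern spans the whole value and matches
# (no trailing .* is needed here only because the value happens to
# end in "1.5s"):
assert re.fullmatch(r".*Q.*[0-9]\.?[0-9]*s", message) is not None
```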
If there is more text after the final s character, you can match the rest of the line with:
"value": ".*Q.*[0-9]\\.?[0-9]*s.*"
Note that you can inspect what the tokens look like with the _analyze API, by sending it a POST request with this payload:
{
  "analyzer": "standard",
  "text": "%xwEx2024-08-28 18:01:15.557 DEBUG 24220 --- [7781-exec-28719] c.b.k.s.s.i.ScorerServiceImpl : Query from ES took:1.5s"
}
Then you will see that the output contains a token "1.5s".
The docs state:
The standard tokenizer provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.
There is a section about "Word Boundary Rules" https://unicode.org/reports/tr29/#Word_Boundary_Rules where it mentions:
Do not break within sequences, such as “3.2” or “3,456.789”.
So your initial regex for the message field, [0-9]\\.?[0-9]*s, matches the whole token 1.5s.
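Putting it together, a pattern closer to what the question asks for, restricted to times from 0 to 999.999, could look like the sketch below. This is an assumption-laden example, not the only correct answer: Lucene's regexp syntax supports {m,n} repetition and optional groups, and Python's re.fullmatch is again used to imitate the whole-value anchoring that applies on message.keyword.

```python
import re

# 1-3 integer digits, an optional fractional part of 1-3 digits, then
# "s"; leading/trailing .* because on message.keyword the pattern must
# cover the entire stored value.
pattern = r".*Query from ES took:[0-9]{1,3}(\.[0-9]{1,3})?s.*"

assert re.fullmatch(pattern, "prefix Query from ES took:12.553s") is not None
assert re.fullmatch(pattern, "prefix Query from ES took:999.999s suffix") is not None
assert re.fullmatch(pattern, "prefix Query from ES took:0s") is not None
# Four integer digits exceed the 0-999.999 range and do not match:
assert re.fullmatch(pattern, "prefix Query from ES took:1234.5s") is None
```

In the DSL body this would be, against message.keyword: "value": ".*Query from ES took:[0-9]{1,3}(\\.[0-9]{1,3})?s.*" (the literal dot escaped as \\. inside the JSON string).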