For example:
Analyze description in the document with the uax_url_email tokenizer/analyzer;
If description contains any URL, put the URLs into another array field named urls;
Now I can check whether the urls field is empty to know whether description has any URL.
Is this possible? Or does an analyzer only contribute to the inverted index, and not to other fields?
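An analyzer only affects the tokens written to the inverted index; it does not change _source or populate other fields. You can do this with an ingest pipeline instead, using a Script processor with a Painless script. I hope this helps.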
POST _ingest/pipeline/_simulate?verbose
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "description": "Extract URLs from 'content' into 'urls'",
          "lang": "painless",
          "source": """
            // Collect every URL found in 'content'; 'urls' stays empty if there are none.
            ArrayList urls = new ArrayList();
            if (ctx['content'] != null) {
              def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])/.matcher(ctx['content']);
              while (m.find()) {
                urls.add(m.group());
              }
            }
            ctx['urls'] = urls;
          """
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
      }
    }
  ]
}
The pipeline above produces a result like the following:
{
  "docs": [
    {
      "processor_results": [
        {
          "processor_type": "script",
          "status": "success",
          "description": "Extract URLs from 'content' into 'urls'",
          "doc": {
            "_index": "_index",
            "_id": "_id",
            "_source": {
              "urls": [
                "https://apple.com",
                "https://google.com"
              ],
              "content": "My name is Sagar patel and i visit https://apple.com and https://google.com"
            },
            "_ingest": {
              "pipeline": "_simulate_pipeline",
              "timestamp": "2022-07-13T12:45:00.3655307Z"
            }
          }
        }
      ]
    }
  ]
}
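The simulate call only previews the result. To apply the extraction at index time and then check which documents contain a URL, the same processor can be stored as a named pipeline and referenced when indexing. The following is a minimal sketch, not a definitive setup: the pipeline id extract-urls, the index name my-index, and the sample documents are hypothetical names chosen for illustration, and it assumes regular expressions are permitted in Painless on your cluster (the script.painless.regex.enabled setting).

```
# Store the script processor as a reusable pipeline (id "extract-urls" is an assumption)
PUT _ingest/pipeline/extract-urls
{
  "description": "Extract URLs from 'content' into 'urls'",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": """
          ArrayList urls = new ArrayList();
          if (ctx['content'] != null) {
            def m = /(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])/.matcher(ctx['content']);
            while (m.find()) {
              urls.add(m.group());
            }
          }
          ctx['urls'] = urls;
        """
      }
    }
  ]
}

# Index two documents through the pipeline (index "my-index" is an assumption)
PUT my-index/_doc/1?pipeline=extract-urls
{
  "content": "I visit https://apple.com and https://google.com"
}

PUT my-index/_doc/2?pipeline=extract-urls
{
  "content": "No links in this one"
}

# Find documents whose 'urls' array has at least one entry
GET my-index/_search
{
  "query": {
    "exists": { "field": "urls" }
  }
}
```

The exists query treats an empty array the same as a missing field, so document 2 (where urls is []) should not match. To find documents without any URL, the same exists clause can be wrapped in a bool query under must_not.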