The task is to find the part of expression, which matches to selected regex and put this part in to value of new field.
The source file is in csv format:
id;description
1234;The latitude is 12324.24
2345;12324.24 this value
12234;12324.24 is latitude for 12234
234;No information about latitude
My CSVReader in UpdateRecords:
The "input_schema" is:
{
"type": "record",
"name": "geo_data",
"fields": [
{
"name": "id",
"type": [
"int",
"null"
]
},
{
"name": "description",
"type": [
"string",
"null"
]
}
]
}
CSVRecordSetWriter is:
I need to extract "12324.24"-value from "description" and, using UpdateRecord processor, create new field:
"latitude = 12324.24"
I was looking towards matchesRegex. For example:
/description[matchesRegex(., '(\d+)(\.)(\d+)')]
But it only checks, if the entire expression matches to regex. For example, when "description" includes only "12324.24", matchesRegex works fine, but the field include not only latitude-value.
One important notice: field value "description" doesn't necessarily contain latitude. So it may include for example: "No information about latitude".
Therefore, substring, substringAfter, substringBefore and similar methods, based on the position (or values) of tokens in text, do not fit for my task.
I tried to use suggested by Endzeit approach:
replaceRegex(/description, '.*?(\\d+.\\d+).*', '$1')
But it returns the entire value for description fields, which contains matched expression. For example, it returns "12324.24 is latitude" instead of expected: "12324.24"
Or it's better to look forward ExtractText? But problem is that this processor, based on the result of extraction, create only one attribute "latitude" for all records. The possible workaround is to use additionaly SplitText (before extraction) and MergeRecords (after ExtractText). But may be it is somethins overcomplicated?
You are using a "Filter", as described in the "Record Path Guide" inside the documentation.
Your expression /description[matchesRegex(., '(\d+)(\.)(\d+)')]
means:
description
If the field's value does not match the query, the expression does not resolve to any field (and its value) but null
instead.
Just a side note, you can apply such filter expressions to multiple fields simultaneously, e.g. imagine description
was an array holding multiple values, your filter could retain only array items that match.
Instead of filtering values, you want to make use of a "Standalone Function" to transform your original input.
The following expression replaceRegex(/description, '.*?(\d+[.]\d+).*', '$1')
:
description
The substitution in turn can reference capture groups, either by index ($1
) or by name ${groupName}
.
A complete configuration of the UpdateRecord
used contains:
{
"/latitude": "replaceRegex(/description[matchesRegex(., '.*?(\\d+.\\d+).*')], '^.*?(\\d+.\\d+).*$', '$1')",
"Record Writer": "d79fc7b7-872c-3fd0-8a2a-cf4ff21bfa2e",
"Record Reader": "44045758-d88e-3c93-ae9f-2173c7860a26",
"Replacement Value Strategy": "record-path-value"
}
I've uploaded a Flow definition to GitHub that provides a working example. It works both for NiFi 2.0 and 1.22.0 (just needs some property name adjustments in the latter). You can find it here.