I'm trying to get everything that is not inside square brackets with a regex in OpenRefine, but I'm stuck.
I have tried this:
/^([a-z]+)\[[a-z]+\]([a-z]+)/
but I can't "repeat" my rule so that it applies to all of these cases.
Here are my test strings:
abcd[zz]efgh[zz]ijkl[zz]
# I want: abcd efgh ijkl
abcd[zz]efgh[zz]ijkl
# I want: abcd efgh ijkl
abcd[zz]efgh
# I want: abcd efgh
abcd[zz]
# I want: abcd
[zz]abcd
# I want: abcd
Thank you in advance
You can extract strings that do not contain ] and [ and that are not immediately followed by zero or more chars other than square brackets and then a ] char:
(?=([^\]\[]+))\1(?![^\]\[]*])
The trick is also to make the first pattern atomic, so that backtracking cannot return just a part of a match. In JavaScript regex, an atomic pattern can be emulated with a positive lookahead that captures a pattern, followed right away by a backreference to the captured text.
Details:
(?=([^\]\[]+)) - a positive lookahead that captures into Group 1 one or more chars other than [ and ]
\1 - the backreference to Group 1 that consumes the text captured into Group 1
(?![^\]\[]*]) - a negative lookahead that fails the match if, immediately to the right, there are zero or more chars other than [ and ] and then a ].
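To see the idea in action, here is a minimal JavaScript sketch run against the test strings from the question. It assumes the negative lookahead uses the negated class `[^\]\[]*`, as the description says ("chars other than [ and ]"). The helper name `outsideBrackets` and the `\d+` comparison at the end are made up for illustration only. OpenRefine itself is not involved here, so treat this as a way to check the pattern rather than as a ready-made GREL expression.

```javascript
// Emulated atomic group (lookahead + backreference), then a negative lookahead
// that rejects text still sitting inside a [...] pair.
const OUTSIDE_BRACKETS = /(?=([^\]\[]+))\1(?![^\]\[]*])/g;

// Hypothetical helper: collect every chunk outside square brackets
// and join the chunks with a single space.
function outsideBrackets(input) {
  // With the g flag, match() returns all full matches, or null if there are none.
  const parts = input.match(OUTSIDE_BRACKETS) || [];
  return parts.join(" ");
}

const samples = [
  "abcd[zz]efgh[zz]ijkl[zz]",
  "abcd[zz]efgh[zz]ijkl",
  "abcd[zz]efgh",
  "abcd[zz]",
  "[zz]abcd",
];

for (const s of samples) {
  console.log(`${s} -> ${outsideBrackets(s)}`);
}

// Why atomicity matters, shown on a simpler illustrative pattern:
// a plain quantifier can give back a char to satisfy the lookahead,
// while the lookahead + backreference form cannot.
const plain = "123x".match(/\d+(?!x)/);          // backtracks and matches "12"
const atomic = "123x".match(/(?=(\d+))\1(?!x)/); // no match at all (null)
console.log(plain && plain[0], atomic);
```

The `|| []` fallback is there because `match()` returns `null` when nothing matches, which would happen for an input such as `[zz]` alone.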