I am trying to write rule-based logic to extract information from text, assigning each extracted string to a specific case. However, I am stuck on a negative lookahead use case. I need to find the word "cash", followed by "rp" or "idr" and then digits that may contain ".", "," or whitespace within the number, but the number must not be followed by a standalone "juta", "jt" or "m".
Here's my work so far:
cash\s*[\:,.-]?\s*(rp|idr)[\.,]?\s*([\d\s,.]+)(?!juta|jt|m)\b
These are the test cases:
harga cash: rp 130jt (nego alu
harga cash: rp 230juta (nego alu
harga cash: rp 330 juta (nego alu
harga cash: rp 430,000,000 juta (nego alu
harga cash: rp 530m (nego alu
harga cash: rp 630 (nego alu
harga cash: rp 730000000 (nego alu
harga cash: rp 830,000,000 (nego alu
harga cash: rp 930 000 000 (nego alu
The regex erroneously matches all these lines, while it should only match the last four and yield:
cash: rp 630
cash: rp 730000000
cash: rp 830,000,000
cash: rp 930 000 000
So, all strings with juta, jt or m after the digits should not have been matched. Can anyone point out where I went wrong?
The (?!juta|jt|m)\b pattern fails the match if there are words starting with juta, jt or m immediately on the right, but the preceding pattern, [\d\s,.]+, allows backtracking: the regex engine can give back characters until the lookahead is checked at a position inside the text that [\d\s,.]+ originally matched, where the check passes, and thus you get extra matches. Also, [\d\s,.]+ can consume whitespace on the right, so if the words you want to exclude appear after whitespace, these strings will also be matched, because the lookahead is then checked right after the digits, where the next character is a space. Besides, relying on a word boundary won't help here, since backtracking can stop at the digits before a comma or dot.
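The backtracking effect can be observed directly. The following is a minimal sketch, assuming Python's re module (the question does not name a regex flavor, and the names ORIGINAL and samples are only illustrative). It runs the pattern from the question against three of the lines that should have been rejected and prints the whole match and Group 2:

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"cash\s*[\:,.-]?\s*(rp|idr)[\.,]?\s*([\d\s,.]+)(?!juta|jt|m)\b"
)

samples = [
    "harga cash: rp 130jt (nego alu",
    "harga cash: rp 330 juta (nego alu",
    "harga cash: rp 430,000,000 juta (nego alu",
]

for text in samples:
    m = ORIGINAL.search(text)
    # repr() makes whitespace-only captures visible.
    print(repr(m.group(0)), repr(m.group(2)))
```

For "130jt", Group 2 shrinks to just the space before the number: the lookahead is then checked in front of "130jt", which does not start with an excluded word, and \b holds between the space and "1". For "330 juta" and "430,000,000 juta", Group 2 gives back the trailing space, so the lookahead is checked in front of " juta" rather than "juta".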
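So, there are two main suggestions to fix the regex:
1. Match the number with a pattern that starts and ends with a digit, \d(?:[\d\s,.]*\d)?, and allow optional whitespace inside the lookahead, (?!\s*(?:juta|jt|m)\b), so that trailing whitespace can no longer hide the excluded words.
2. Use a right-hand whitespace boundary, (?!\S), instead of the word boundary so as to match numbers with commas/dots.
The pattern will look like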
cash\s*[:,.-]?\s*(rp|idr)[.,]?\s*(\d(?:[\d\s,.]*\d)?)(?!\S)(?!\s*(?:juta|jt|m)\b)
Details:
cash - the string cash
\s* - zero or more whitespaces
[:,.-]? - an optional occurrence of ":", ",", "." or "-"
\s* - zero or more whitespaces
(rp|idr) - Group 1: the rp or idr string
[.,]? - an optional occurrence of "." or ","
\s* - zero or more whitespaces
(\d(?:[\d\s,.]*\d)?) - Group 2: a digit, then an optional occurrence of zero or more digits, whitespaces, commas or dots followed by a digit
(?!\S) - a right-hand whitespace boundary: immediately on the right, there must be either a whitespace or the end of the string
(?!\s*(?:juta|jt|m)\b) - also, immediately on the right, there should be no zero or more whitespaces followed by juta, jt or m as whole words (they are now followed by a word boundary). Remove the word boundary if you also want to reject words that merely start with juta, jt or m.
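To check the corrected pattern against the test cases from the question, here is a minimal sketch, assuming Python's re module (the question does not state which regex flavor is in use; FIXED, test_cases and matches are illustrative names):

```python
import re

# The corrected pattern from the answer.
FIXED = re.compile(
    r"cash\s*[:,.-]?\s*(rp|idr)[.,]?\s*(\d(?:[\d\s,.]*\d)?)"
    r"(?!\S)(?!\s*(?:juta|jt|m)\b)"
)

test_cases = [
    "harga cash: rp 130jt (nego alu",
    "harga cash: rp 230juta (nego alu",
    "harga cash: rp 330 juta (nego alu",
    "harga cash: rp 430,000,000 juta (nego alu",
    "harga cash: rp 530m (nego alu",
    "harga cash: rp 630 (nego alu",
    "harga cash: rp 730000000 (nego alu",
    "harga cash: rp 830,000,000 (nego alu",
    "harga cash: rp 930 000 000 (nego alu",
]

# Keep only the lines where the pattern finds a match.
matches = [m.group(0) for m in map(FIXED.search, test_cases) if m]
for hit in matches:
    print(hit)
# cash: rp 630
# cash: rp 730000000
# cash: rp 830,000,000
# cash: rp 930 000 000
```

One case that is not among the test lines: a space-grouped number followed by an excluded word, such as "rp 930 000 000 juta". There, backtracking can still settle on the shorter "cash: rp 930 000", because the text after it is " 000" rather than an excluded word, so that input would need extra handling.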