regex regex-group

Negative lookahead for standalone words


I am trying to write rule-based logic to extract information from text, assigning each extracted string to a specific case. However, I am stuck on a negative lookahead use case. I need to find the word "cash", followed by "rp" or "idr", and then digits that may contain ".", "," or whitespace between them, but the number must not be followed by a standalone "juta", "jt" or "m".

Here's my work so far: cash\s*[\:,.-]?\s*(rp|idr)[\.,]?\s*([\d\s,.]+)(?!juta|jt|m)\b

These are the test cases:

harga cash: rp 130jt (nego alu
harga cash: rp 230juta (nego alu
harga cash: rp 330 juta (nego alu
harga cash: rp 430,000,000 juta (nego alu
harga cash: rp 530m (nego alu
harga cash: rp 630 (nego alu
harga cash: rp 730000000 (nego alu
harga cash: rp 830,000,000 (nego alu
harga cash: rp 930 000 000 (nego alu

The regex erroneously matches all these lines, while it should only match the last four and yield:

cash: rp 630
cash: rp 730000000
cash: rp 830,000,000
cash: rp 930 000 000

So, none of the strings with juta, jt or m after the digits should have been matched. Can anyone point out what I did wrong?


Solution

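  • The (?!juta|jt|m)\b pattern fails the match if juta, jt or m appears immediately to the right of the current position, but the preceding pattern, [\d\s,.]+, can backtrack: when the lookahead fails, the engine gives back characters one at a time and retries the lookahead at an earlier position, inside the text that [\d\s,.]+ had already matched, and thus you get extra matches. Also, [\d\s,.]+ can consume the whitespace after the number, and since the lookahead only checks the text directly at the current position, the words you want to exclude are not detected when they come after whitespace (the engine simply backtracks to the end of the digits, where the next character is a space). Besides, relying on a word boundary won't help here, since backtracking can stop at any word boundary, such as after a digit that precedes a comma or dot, or right before the first digit.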
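The backtracking described above can be observed directly. The following is a minimal Python sketch (Python's re module is an assumption here; the question does not name a regex flavor) that runs the pattern from the question on three of the test lines and prints what the number group actually captured. The names ORIGINAL, samples and captured are illustrative only.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = re.compile(
    r"cash\s*[\:,.-]?\s*(rp|idr)[\.,]?\s*([\d\s,.]+)(?!juta|jt|m)\b"
)

samples = [
    "harga cash: rp 130jt (nego alu",
    "harga cash: rp 330 juta (nego alu",
    "harga cash: rp 630 (nego alu",
]

# Record what the second capturing group (the "number") matched per line.
captured = {s: ORIGINAL.search(s).group(2) for s in samples}

for line, group in captured.items():
    print(repr(group), "<-", line)
# ' '   <- harga cash: rp 130jt (nego alu
# '330' <- harga cash: rp 330 juta (nego alu
# '630' <- harga cash: rp 630 (nego alu
```

In the first line the "number" group ends up holding a single space: the engine backtracks out of the digits entirely and stops right before "1", where the lookahead passes and \b is satisfied. In the second line it stops after "330", because the lookahead only sees the space, not "juta".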

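    So, there are two main suggestions to fix the regex: match the number with a pattern that must start and end with a digit, so that it cannot shrink to whitespace or stop in the middle of the number, and check for the excluded words after any optional whitespace rather than only directly after the number.

    The pattern will look like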

    cash\s*[:,.-]?\s*(rp|idr)[.,]?\s*(\d(?:[\d\s,.]*\d)?)(?!\S)(?!\s*(?:juta|jt|m)\b)
    

    See the regex demo.
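As a quick check outside the demo, the pattern can be run against the nine test lines from the question. This is a sketch assuming Python's re module; the names FIXED, lines and matches are illustrative and not part of the original answer.

```python
import re

# The suggested pattern, copied as-is.
FIXED = re.compile(
    r"cash\s*[:,.-]?\s*(rp|idr)[.,]?\s*"
    r"(\d(?:[\d\s,.]*\d)?)"          # number: starts and ends with a digit
    r"(?!\S)"                        # next char is whitespace or end of string
    r"(?!\s*(?:juta|jt|m)\b)"        # no standalone juta/jt/m after whitespace
)

lines = [
    "harga cash: rp 130jt (nego alu",
    "harga cash: rp 230juta (nego alu",
    "harga cash: rp 330 juta (nego alu",
    "harga cash: rp 430,000,000 juta (nego alu",
    "harga cash: rp 530m (nego alu",
    "harga cash: rp 630 (nego alu",
    "harga cash: rp 730000000 (nego alu",
    "harga cash: rp 830,000,000 (nego alu",
    "harga cash: rp 930 000 000 (nego alu",
]

# Keep the full match text for every line where the pattern is found.
matches = [m.group(0) for m in map(FIXED.search, lines) if m]

for text in matches:
    print(text)
# cash: rp 630
# cash: rp 730000000
# cash: rp 830,000,000
# cash: rp 930 000 000
```

Only the last four lines match. Because the number group must end with a digit and (?!\S) forbids a non-whitespace character right after it, backtracking into the middle of the number no longer produces a match.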

    Details: