java regex

Regex - how to match a certain number of arbitrary words followed by a specific word


I was wondering if somebody could help me with this topic. I'm currently trying to do a kind of fuzzy matching.

Basically, I want to derive relationships from unstructured text, and I have identified common patterns for these relationships. However, the input strings are somewhat arbitrary, as is usual for human-produced input.

E.g. these two input strings:

ENTITY is typically bigger than ENTITY

ENTITY is ... a few other words... bigger than ENTITY

I've successfully matched those two strings with the following regex:

(ENTITY) is (.+?(?=bigger))bigger than (ENTITY)

But since .+? matches everything until it reaches bigger, there can be an arbitrary number of words between "is" and "bigger". This leads to false matches in certain cases, so I want to limit the number of "words" between "is" and "bigger".
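To illustrate the problem described above, here is a minimal Java sketch. The class name `UnboundedGapDemo`, the method `gap`, and the sample sentence are made up for this illustration; only the pattern itself comes from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnboundedGapDemo {
    // The unbounded pattern from the question; it contains no backslashes,
    // so no extra escaping is needed in the Java string literal.
    private static final Pattern UNBOUNDED =
            Pattern.compile("(ENTITY) is (.+?(?=bigger))bigger than (ENTITY)");

    /** Returns the text captured between "is " and "bigger", or null if nothing matches. */
    static String gap(String input) {
        Matcher m = UNBOUNDED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Hypothetical sentence with many words between "is" and "bigger".
        String longInput =
                "ENTITY is not related to anything at all but something else entirely is bigger than ENTITY";
        // The lazy .+? keeps expanding until "bigger" follows, so this still matches.
        System.out.println("[" + gap(longInput) + "]");
    }
}
```

Because `.+?` has no upper bound, the length of the gap never prevents a match, which is exactly where the false positives come from.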

I've defined a word as at least one non-whitespace character followed by at least one whitespace character. I know that this is not actually a word, but for my purposes it should be OK. If I want to match e.g. up to 5 words, this would be

(\S+\s+){0,5}
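As a quick check of this word definition on its own, the following Java sketch tests whether a string consists of at most five such "words". The class and method names are hypothetical, and the group is written as non-capturing, `(?:...)`, which is an assumption of this sketch and does not change what is matched.

```java
import java.util.regex.Pattern;

final class WordRunDemo {
    // Up to five "words": non-whitespace run followed by a whitespace run.
    // Backslashes are doubled because this is a Java string literal.
    private static final Pattern UP_TO_FIVE_WORDS =
            Pattern.compile("(?:\\S+\\s+){0,5}");

    /** True if the whole input is made of zero to five word-plus-whitespace units. */
    static boolean isAtMostFiveWords(String s) {
        return UP_TO_FIVE_WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAtMostFiveWords("typically "));
        System.out.println(isAtMostFiveWords("one two three four five six "));
    }
}
```

Note that under this definition every word, including the last one, must be followed by whitespace, so `"typically"` without a trailing space does not count as a word.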

Combining this with the previous regex leads me to

(ENTITY) is ((\S+\s+){0,5}?(?=bigger))bigger than (ENTITY)

But this does not work. Can somebody give me advice on this? Can I actually match this with a regex?

This is a Java project. For readability, I've removed the escaping backslashes from the regex patterns.


Solution

  • This regex should work for you:

    ^(ENTITY) is ((?:\S+\s+){0,5})bigger than \1$
    

    RegEx Demo
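Since the question mentions a Java project, here is a sketch of how the suggested regex might be used from Java, with the backslashes doubled for the string literal. The class `BoundedGapDemo` and the method `gapWords` are names invented for this example. Note that `\1` is a backreference to group 1, so the second entity must be the same text as the first.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BoundedGapDemo {
    // The regex from the answer, escaped for a Java string literal.
    private static final Pattern BOUNDED =
            Pattern.compile("^(ENTITY) is ((?:\\S+\\s+){0,5})bigger than \\1$");

    /** Returns the words between "is " and "bigger" if the input matches, else empty. */
    static Optional<String> gapWords(String input) {
        Matcher m = BOUNDED.matcher(input);
        // Group 2 holds zero to five "words", each with its trailing whitespace.
        return m.find() ? Optional.of(m.group(2)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(gapWords("ENTITY is typically bigger than ENTITY"));
        System.out.println(gapWords("ENTITY is ... a few other words... bigger than ENTITY"));
        System.out.println(gapWords("ENTITY is one two three four five six bigger than ENTITY"));
    }
}
```

The first two inputs are the ones from the question and stay within the five-word limit; the third has six words in the gap and is rejected. Because the pattern is anchored with `^` and `$`, it only matches when the whole input has this shape.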