I was wondering if somebody could help me with this topic. I'm currently trying to do a kind of fuzzy matching.
Basically, I want to derive relationships from unstructured text and have identified common patterns for these relationships. However, the input strings are somewhat arbitrary, as is usual for human-produced input.
E.g. these two input strings:
ENTITY is typically bigger than ENTITY
ENTITY is ... a few other words... bigger than ENTITY
I've successfully matched those two strings with the following regex:
(ENTITY) is (.+?(?=bigger))bigger than (ENTITY)
But since .+? matches everything until it reaches "bigger", there can be an arbitrary number of words between "is" and "bigger". This leads to false matches in certain cases, so I want to limit the number of "words" between "is" and "bigger".
I've defined a word as at least one non-whitespace character followed by at least one whitespace character. I know that this is not actually a word, but for my purpose it should be OK. If I want to match e.g. up to 5 words, this would be
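To illustrate the over-matching described above, here is a minimal Java sketch using the first pattern. The class name `UnboundedGapDemo`, the helper `gap`, and the second sample sentence are made up for illustration; they are not part of my actual project.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are for illustration only.
final class UnboundedGapDemo {
    // The first pattern from above; it contains no backslashes, so no escaping is needed.
    static final Pattern UNBOUNDED =
            Pattern.compile("(ENTITY) is (.+?(?=bigger))bigger than (ENTITY)");

    // Returns the text captured between "is " and "bigger", or null if nothing matches.
    static String gap(String input) {
        Matcher m = UNBOUNDED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        // Short gap: the kind of match that is wanted.
        System.out.println(gap("ENTITY is typically bigger than ENTITY"));
        // Long gap: still matches, although the two entities are only loosely related.
        System.out.println(gap(
                "ENTITY is red and the box that came with it was far bigger than ENTITY"));
    }
}
```

The second call also produces a match, because the lazy .+? is free to consume any number of words before "bigger".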
(\S+\s+){0,5}
Combining this with the previous regex leads me to
(ENTITY) is ((\S+\s+){0,5}?(?=bigger))bigger than (ENTITY)
But this does not work out. Can somebody give me advice on this? Can I actually match this with regex?
This is a Java project. For readability, I've removed the escaping backslashes from the regex patterns.
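For reference, a minimal sketch of the combined pattern as a Java string literal with the backslashes restored. The class name `BoundedGapDemo` and the method `hasRelation` are hypothetical, and the sketch assumes `Matcher.find()` is used; `matches()` would additionally require the pattern to cover the entire input.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are for illustration only.
final class BoundedGapDemo {
    // Combined pattern with Java string escaping restored (\S becomes \\S, \s becomes \\s).
    static final Pattern BOUNDED = Pattern.compile(
            "(ENTITY) is ((\\S+\\s+){0,5}?(?=bigger))bigger than (ENTITY)");

    // Assumption: find() is used, so the pattern may match anywhere in the input.
    static boolean hasRelation(String input) {
        return BOUNDED.matcher(input).find();
    }

    public static void main(String[] args) {
        // The two sample inputs from the question.
        System.out.println(hasRelation("ENTITY is typically bigger than ENTITY"));
        System.out.println(hasRelation("ENTITY is ... a few other words... bigger than ENTITY"));
        // Six "words" between "is" and "bigger", one more than the limit.
        System.out.println(hasRelation("ENTITY is one two three four five six bigger than ENTITY"));
    }
}
```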