javaregexgreedy

Regex on Java: avoiding unnecessary "greedy" strategy by Matcher class


I'm trying to develop a regex to look for a sequence of tags inside a string. For example, I can have the tag (NP .*) at least one time (can be multiple times), followed by a punctuation symbol (in this case a ./.). If there's another tag between de (NP) and the ./. (like the VP in the example below), the Matcher must not find anything. The problem is that even if I use the question mark after the .* it will keep looking for a ) that makes the expression match something in the string. Here is my method:

public void myMethod() {
    String input = "(NP first tag) (VP second tag) ./.";
    String regex = "(\\(NP .*?\\)( )?)+" + "\\./\\.";

    Pattern pattern = Pattern.compile("(" + regex + ")");
    Matcher matcher = pattern.matcher(input);

    if (matcher.find()) {
        System.out.println("<S "+matcher.group(0)+">");
    } else {
        System.out.println("sem grupos.");
    }
}

The method will still match the regex, but it shouldn't. I need it to tell me that no group was found, since the "VP" tag shouldn't be there. I believe the problem relies on the greedy strategy adopted by the Regex in Java. It tries to find some combination of characters that attend to the pattern described by the regex. I have no idea on how to rewrite this expression.

Any Help?

EDITED:

1) I noticed that my question was a bit confusing, so I changed a bit the example to make it clearer.

2) Thanks Aan Moore. I Agree that I was using more groups than necessary, but this happened because of the operators like +. I tried to cut away the unnecessary groups. Also your simple idea of replacing the .*? by a [^)]*? was great! The only thing that I adjusted was that I escaped the ) symbol by using [^\\)]*?. Below I show the final REGEX used.

String regex = "(\\(NP [^\\)]*?\\) ?)+\\./\\.";

Thanks a lot! :)


Solution

  • ((\(NP .*?\)( )?)+\./\.) is the compiled pattern.

    Simplify:

    \(NP .*?\) ?+\./\. removing unused capturing groups.

    Now, lets look at the example strings you have:

    In (NP first tag) (VP second tag) ./., the .*? matches first tag) (VP second tag.
    In (NP first tag) (VP second tag) (MISC tag that must not be catch) ./., .*? matches first tag) (VP second tag) (MISC tag that must not be catch.

    Why? I mean, it's non-greedy right? Right, but...

    .*?\) starts off matching first tag), what you want. However, the rest of the regex fails the match and the regex engine throws that out as a possible answer and keeps looking.

    If you don't have tags in tags like (NP (tag)), you can change the pattern: \(NP [^)]*?\)

    To match the string you described in your question: \(NP [^)]*?\) ?\(VP [^)]*?\) \./\.

    With Java escaping it becomes \\(NP [^)]*?\\) ?\\(VP [^)]*?\\) \./\..

    For further reading, there is a great Stack Overflow question covering more of the theory and practice around this.