javaregex

How to determine where a regex failed to match using Java APIs


I have tests where I validate the output with a regex. When it fails it reports that output X did not match regex Y.

I would like to add some indication of where in the string the match failed. E.g. what is the farthest the matcher got in the string before backtracking. Matcher.hitEnd() is one case of what I'm looking for, but I want something more general.

Is this possible to do?


Solution

  • If a match fails, then Match.hitEnd() tells you whether a longer string could have matched. In addition, you can specify a region in the input sequence that will be searched to find a match. So if you have a string that cannot be matched, you can test its prefixes to see where the match fails:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class LastMatch {
        private static int indexOfLastMatch(Pattern pattern, String input) {
            Matcher matcher = pattern.matcher(input);
            for (int i = input.length(); i > 0; --i) {
                matcher.region(0, i);
                if (matcher.matches() || matcher.hitEnd()) {
                    return i;
                }
            }
    
            return 0;
        }
    
        public static void main(String[] args) {
            Pattern pattern = Pattern.compile("[A-Z]+[0-9]+[a-z]+");
            String[] samples = {
                    // partial matches, need more text
                    "",
                    "ABC",
                    "ABC123",
    
                    // full matches
                    "A1x",
                    "ABC123xyz",
    
                    // mismatches
                    "-BC123xyz",
                    "A-C123xyz",
                    "AB-123xyz",
                    "ABC-23xyz",
                    "ABC1-3xyz",
                    "ABC12-xyz",
                    "ABC123-yz",
                    "ABC123x-z",
                    "ABC123xy-"
            };
    
            for (String sample : samples) {
                System.out.printf("`%s`: last match at %d of %d\n",
                        sample, indexOfLastMatch(pattern, sample), sample.length());
            }
        }
    }
    

    The output of this class is:

    ``: last match at 0 of 0
    `ABC`: last match at 3 of 3
    `ABC123`: last match at 6 of 6
    `A1x`: last match at 3 of 3
    `ABC123xyz`: last match at 9 of 9
    `-BC123xyz`: last match at 0 of 9
    `A-C123xyz`: last match at 1 of 9
    `AB-123xyz`: last match at 2 of 9
    `ABC-23xyz`: last match at 3 of 9
    `ABC1-3xyz`: last match at 4 of 9
    `ABC12-xyz`: last match at 5 of 9
    `ABC123-yz`: last match at 6 of 9
    `ABC123x-z`: last match at 7 of 9
    `ABC123xy-`: last match at 8 of 9
    

    As suggested by Don Hatch, using binary search reduces the number of required matches from O(N) to O(log N), where N is the length of the input string -- that is the overall complexity is O(N log N) instead of O(N * N) for the linear search in indexOfLastMatch.

    private static int binarySearchLastMatch(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        int lastMatch = 0;
    
        int left = 0;
        int right = input.length();
        while (left <= right) {
            int mid = left + (right - left) / 2;
            matcher.region(0, mid);
            if (matcher.matches() || matcher.hitEnd()) {
                lastMatch = mid;
                left = mid + 1;
            } else {
                right = mid - 1;
            }
        }
    
        return lastMatch;
    }