javaregexiterator

Java iterator of regex matches from large source


The basic problem is that I need to output all matches of a regex within a file, but there's a few properties that make a solution tricky:

  1. The file could be very large, so the whole file cannot be brought into memory and scanned all at once.
  2. The regex is user-provided, so there is no estimation of how large a match could be
  3. The regex can be multiline, making possible matches even longer, and scanning per-line could cause a match that crosses lines to be missed.
  4. Every match should be returned with its position in the whole file.
  5. Matches should be evaluated lazily (no need to find all matches in the file at once)

As an example for 3, this algorithm

while (not end of file):
  line = file.readNextLine()
  for (each match in regex.find(line)):
    results.append(match)
return results

breaks in the case of

// inline modifier to turn on DOTALL
regex = "(?s)start.*end"
content = "start
end"

because start then end would be presented individually, when the dot can match a newline character.

My current idea is to implement Iterator<MatchResult>, and read into a CharBuffer as a sliding window. java.util.regex.Matcher also has hitEnd(), which can signal if more input would have changed the result of the last find() operation (if the regex is "foobar" and the matcher was evaluated on "foo", hitEnd() would return true. If the matcher was evaluated on "bar" instead, it returns false). If the buffer is full and more input could find a match, the size of the buffer is doubled until a match is found. I could implement Spliterator instead, but I can't tell if that's more or less complicated.

So far I have this basic structure (exception handling omitted for brevity):

public class RegexStreamIterator implements Iterator<MatchResult> {
  private static final int MIN_BUFFER_SIZE_BYTES = 8192;
  private static final int MAX_BUFFER_EXPANSIONS = 4;
  private static final int MAX_BUFFER_SIZE_BYTES = (int) Math.pow(2, MAX_BUFFER_EXPANSIONS) * MIN_BUFFER_SIZE_BYTES;

  private final Reader reader;
  private final Pattern pattern;

  private int readerIndex = 0;
  private Matcher matcher;
  private CharBuffer buffer = CharBuffer.allocate(MIN_BUFFER_SIZE_BYTES);
  private MatchResult currentMatch;

  public RegexStreamIterator(Reader reader, Pattern pattern) {
    this.reader = reader;
    this.pattern = pattern;
  }

  public boolean hasNext() {
    if (currentMatch != null) return true;
    advance();
    return currentMatch != null;
  }

  public MatchResult next() {
    if (currentMatch == null)
      advance();
    if (currentMatch == null)
      throw new NoSuchElementException();

    MatchResult ret = currentMatch;
    currentMatch = null;
    return ret;
  }

  private void advance() {
    // pseudocode while I'm stuck on actual implementation
    if matcher is null
      fill buffer
      if buffer is null
        return
      matcher = pattern.matcher(buffer)
    
    while not end of file:
      if matcher found result and not matcher.hitEnd() // match found and won't grow or get rejected with more input
        this.currentMatch = match
        advance matcher to next match
        return

      if matcher.hitEnd()
        if no match was found in buffer (matcher start index at 0)
          double buffer size 
            if new size bigger than max size, throw exception
          fill from file 
          reset matcher
        else (part of buffer consumed)
          move remaining buffer to new buffer 
          fill from file
          reset matcher
      else
        clear buffer (resetting size)
        fill from file
        reset matcher
  }
}

I'm very stuck on problems like null handling and making a loop that can both seek to the next match and is stable between multiple calls to advance(). This absolutely seems like a problem that has been solved before as well, if common libraries like Apache or Guava have something that cover this that would be fantastic.


Solution

  • The findAll method of Scanner can do this. It will look for regular expression matches in any character stream, without loading all of the content into memory.