The basic problem is that I need to output all matches of a regex within a file, but there's a few properties that make a solution tricky:
As an example for 3, this algorithm
while (not end of file):
line = file.readNextLine()
for (each match in regex.find(line)):
results.append(match)
return results
breaks in the case of
// inline modifier to turn on DOTALL
regex = "(?s)start.*end"
content = "start
end"
because start
then end
would be presented individually, when the dot can match a newline character.
My current idea is to implement Iterator<MatchResult>
, and read into a CharBuffer
as a sliding window. java.util.regex.Matcher
also has hitEnd()
, which can signal if more input would have changed the result of the last find()
operation (if the regex is "foobar" and the matcher was evaluated on "foo", hitEnd()
would return true. If the matcher was evaluated on "bar" instead, it returns false). If the buffer is full and more input could find a match, the size of the buffer is doubled until a match is found. I could implement Spliterator
instead, but I can't tell if that's more or less complicated.
So far I have this basic structure (exception handling omitted for brevity):
public class RegexStreamIterator implements Iterator<MatchResult> {
private static final int MIN_BUFFER_SIZE_BYTES = 8192;
private static final int MAX_BUFFER_EXPANSIONS = 4;
private static final int MAX_BUFFER_SIZE_BYTES = (int) Math.pow(2, MAX_BUFFER_EXPANSIONS) * MIN_BUFFER_SIZE_BYTES;
private final Reader reader;
private final Pattern pattern;
private int readerIndex = 0;
private Matcher matcher;
private CharBuffer buffer = CharBuffer.allocate(MIN_BUFFER_SIZE_BYTES);
private MatchResult currentMatch;
public RegexStreamIterator(Reader reader, Pattern pattern) {
this.reader = reader;
this.pattern = pattern;
}
public boolean hasNext() {
if (currentMatch != null) return true;
advance();
return currentMatch != null;
}
public MatchResult next() {
if (currentMatch == null)
advance();
if (currentMatch == null)
throw new NoSuchElementException();
MatchResult ret = currentMatch;
currentMatch = null;
return ret;
}
private void advance() {
// pseudocode while I'm stuck on actual implementation
if matcher is null
fill buffer
if buffer is null
return
matcher = pattern.matcher(buffer)
while not end of file:
if matcher found result and not matcher.hitEnd() // match found and won't grow or get rejected with more input
this.currentMatch = match
advance matcher to next match
return
if matcher.hitEnd()
if no match was found in buffer (matcher start index at 0)
double buffer size
if new size bigger than max size, throw exception
fill from file
reset matcher
else (part of buffer consumed)
move remaining buffer to new buffer
fill from file
reset matcher
else
clear buffer (resetting size)
fill from file
reset matcher
}
}
I'm very stuck on problems like null handling and making a loop that can both seek to the next match and is stable between multiple calls to advance()
. This absolutely seems like a problem that has been solved before as well, if common libraries like Apache or Guava have something that cover this that would be fantastic.
The findAll method of Scanner can do this. It will look for regular expression matches in any character stream, without loading all of the content into memory.