I want to enumerate all the possible values of a finite regular expression in Java for testing purposes.
For some context, I have a regular expression that I'm using to match allowable color values in words. Here's a shortened version of it as an example:
(white|black)|((light|dark) )?(red|green|blue|gray)
I wanted to create a unit test that would enumerate all these values and pass each of them to my utility class which produces a Color
object from these, that way if I change the regular expression, my unit tests will fail if an error occurs (i.e. the new color value is unsupported).
I know enumeration is possible, of course (see this question), but is there an existing library for Java which will enumerate all the possible matches for a regex?
Edit: I've implemented a library that does this. See my answer below for links.
You are right, didn't find such a tool online as well but you can try Xeger from google
it can create a random matching string from a regexp, and with some code tweaking might do what you want. generation a random match:
String regex = "[ab]{4,6}c";
Xeger generator = new Xeger(regex);
String result = generator.generate();
assert result.matches(regex);
Xeger code is very simple, it consists of 2 files which contain 5 methods between them..
it uses dk.brics.automaton to conver the regex to an automaton, then goes over the automaton transitions making random choices in every node.
the main function is generate:
private void generate(StringBuilder builder, State state) {
List<Transition> transitions = state.getSortedTransitions(true);
if (transitions.size() == 0) {
assert state.isAccept();
return;
}
int nroptions = state.isAccept() ? transitions.size() : transitions.size() - 1;
int option = XegerUtils.getRandomInt(0, nroptions, random);
if (state.isAccept() && option == 0) { // 0 is considered stop
return;
}
// Moving on to next transition
Transition transition = transitions.get(option - (state.isAccept() ? 1 : 0));
appendChoice(builder, transition);
generate(builder, transition.getDest());
}
you can see that in order to change it so you get all possible matches, you need to iterate over all possible combinations in every possible node (like incrementing a multi digit counter) you will need a hash to prevent loops, but that shouldn't take more than 5 senconds to code..
i would also suggest first checking that the regex is actually finate, by checking that it doesn't have *,+ and other symbols that make this action impossible (just to make this a complete tool for reuse)...