I have written a lexical analyzer in Java. I currently split the input string on whitespace or a semicolon, but what I really want is to treat a semicolon or a comma as a separate token even when there is no space before it.
Right now I have:
arr = str.split("[; ]");
But when I give it the input string
return a = 25;
the string is split like this: [return, a, =, 25]
The ; is not recognized as a token; it is only used as a delimiter and then discarded. I tried to find a regex for this but couldn't find one that works. I am trying to avoid StringTokenizer, but if it is what should be used and can't be avoided, please let me know. I am fairly new to regex, so I haven't been able to figure it out.
I also tried the regex ((?<=;)|(?=;))
but it only splits around ; and does not split on whitespace at all.
Rather than using complex, costly lookaheads and lookbehinds, I would just use a word boundary:
String[] tokens = str.split(" *\\b *");
This will also return some empty strings as tokens, but you can remove them easily enough:
// needs java.util.Arrays; String.isBlank() requires Java 11 or later
tokens = Arrays.stream(tokens)
    .filter(t -> !t.isBlank()).toArray(String[]::new);
You can also do it in one step using splitAsStream:
// needs java.util.regex.Pattern
String[] tokens = Pattern.compile(" *\\b *").splitAsStream(str)
    .filter(t -> !t.isBlank()).toArray(String[]::new);
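For reference, here is a minimal, self-contained sketch of the one-step version. The class name TokenSplitDemo, the tokenize helper, and the second pattern (EXPLICIT) are illustrative names I made up, not anything from the question. EXPLICIT is an assumed alternative that splits on whitespace and around ; and , only, in case the word-boundary behaviour is too broad for your lexer. Treat it as a starting point rather than a complete tokenizer.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
class TokenSplitDemo {

    // The word-boundary pattern from the answer above.
    static final Pattern WORD_BOUNDARY = Pattern.compile(" *\\b *");

    // Assumed alternative: split on whitespace runs, or at the zero-width
    // positions just before and just after a ';' or ','.
    static final Pattern EXPLICIT = Pattern.compile("\\s+|(?=[;,])|(?<=[;,])");

    // Splits the input and drops empty or whitespace-only pieces.
    // String.isBlank() requires Java 11 or later.
    static String[] tokenize(Pattern pattern, String input) {
        return pattern.splitAsStream(input)
                .filter(t -> !t.isBlank())
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        String input = "return a = 25;";
        // Both calls print [return, a, =, 25, ;]
        System.out.println(Arrays.toString(tokenize(WORD_BOUNDARY, input)));
        System.out.println(Arrays.toString(tokenize(EXPLICIT, input)));
    }
}
```

One thing to keep in mind: \b only matches between a word character and a non-word character, so adjacent punctuation such as ); stays together as a single token with the word-boundary pattern. The EXPLICIT pattern has the opposite trade-off: it separates ; and , reliably but leaves other punctuation attached to its neighbours.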