java regex lexical-analysis

Regex in Java to split on whitespace while keeping semicolons and commas as separate tokens


I have written a lexical analyzer in Java. I currently split the input string on whitespace or a semicolon, but what I really want is to treat a semicolon or a comma as a separate token, even when there is no space before it.

Right now I have arr = (str).split("[; ]");

But when I give it the input string return a = 25;

the result is [return, a, =, 25]. The ; is consumed as a delimiter instead of being kept as a token. I tried to find a regex for this but could not find one that works. I would like to avoid StringTokenizer, but if it is the right tool here and can't be avoided, please let me know. I am fairly new to regex, so I have not been able to figure it out.

I tried the regex ((?<=;)|(?=;)), but it splits only around ; and does not split on whitespace.


Solution

  • Rather than using complex, costly lookaheads and lookbehinds, I would just split on a word boundary (\b), optionally surrounded by spaces:

    String[] tokens = str.split(" *\\b *");
    

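    This can leave empty strings among the tokens (for example, where a space is followed directly by a word boundary), but you can remove them easily enough (String.isBlank requires Java 11 or later):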

    tokens = Arrays.stream(tokens)
        .filter(t -> !t.isBlank()).toArray(String[]::new);
    

    You can also do it in one step using Pattern.splitAsStream (from java.util.regex):

    String[] tokens = Pattern.compile(" *\\b *").splitAsStream(str)
        .filter(t -> !t.isBlank()).toArray(String[]::new);
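
To see the pieces working together, here is a minimal runnable sketch. The class name TokenizerSketch, the method names, and the sample inputs are assumptions made for illustration. The first method wraps the word-boundary split from the answer. The second is an alternative that is not part of the answer: it matches tokens with Matcher.find instead of splitting, which may be worth considering because \b never matches between two non-word characters, so adjacent punctuation such as ); stays together as one token when splitting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizerSketch {
    // Pattern from the answer: a word boundary with optional spaces around it.
    private static final Pattern BOUNDARY = Pattern.compile(" *\\b *");

    // Alternative (an assumption, not from the answer): a token is either a run
    // of word characters or a single non-word, non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Splits on word boundaries and drops the empty pieces the split leaves behind.
    public static String[] splitOnBoundaries(String input) {
        return BOUNDARY.splitAsStream(input)
                .filter(t -> !t.isBlank())   // String.isBlank needs Java 11+
                .toArray(String[]::new);
    }

    // Collects every match of TOKEN, so each punctuation character is its own token.
    public static String[] matchTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // prints [return, a, =, 25, ;]
        System.out.println(Arrays.toString(splitOnBoundaries("return a = 25;")));
        // Here the closing ); is split into two tokens, ) and ;
        System.out.println(Arrays.toString(matchTokens("return f(a, b);")));
    }
}
```

For the input from the question both methods give the same five tokens. They differ on input like return f(a, b); where the split keeps ); as a single token and the matcher yields ) and ; separately. Note that \w treats only letters, digits, and underscore as word characters, so a real lexer would likely need more specific patterns for operators such as == or for string literals.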