java regex split ocpjp

How exactly does the String.split() method in Java work when a regex is provided?


I'm preparing for the OCPJP exam and I ran into the following example:

class Test {
   public static void main(String args[]) {
      String test = "I am preparing for OCPJP";
      String[] tokens = test.split("\\S");
      System.out.println(tokens.length);
   }
}

This code prints 16. I was expecting something like no_of_characters + 1. Can someone explain what the split() method actually does in this case? I just don't get it...


Solution

  • It splits on every match of "\\S", which the regex engine sees as \S: a single non-whitespace character.

    So let's try splitting "x x" on non-whitespace (\S). This regex matches exactly one character at a time, so we can go through the characters one by one and mark each place where a split happens (we will use a pipe | for that): the first x matches, the space does not, and the last x matches.

    So we need to split the string at its start and at its end, which initially gives this array:

    ["", " ", ""]
       ^    ^ - here we split
    

    But since trailing empty strings are removed, the result is

    [""," "]     <- result
            ,""] <- removed trailing empty string
    

    So split returns the array ["", " "], which contains only two elements.

    BTW, to keep the trailing empty strings you need to use split(regex, limit) with a negative limit, for example split("\\S", -1).
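
The "x x" case above can be checked directly. This is only a minimal sketch: the class name SplitLimitDemo and the helper names splitDefault and splitKeepAll are made up for illustration, and the only library calls are String.split(String) and String.split(String, int).

```java
class SplitLimitDemo {
    // Default split (limit 0): trailing empty strings are dropped.
    static String[] splitDefault(String input) {
        return input.split("\\S");
    }

    // Negative limit: every piece is kept, including trailing empty strings.
    static String[] splitKeepAll(String input) {
        return input.split("\\S", -1);
    }

    public static void main(String[] args) {
        System.out.println(splitDefault("x x").length); // prints 2 -> ["", " "]
        System.out.println(splitKeepAll("x x").length); // prints 3 -> ["", " ", ""]
    }
}
```

Note that the leading empty string survives in both calls; only trailing ones are affected by the limit.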


    Now let's get back to your example. With your data you are splitting on each of the characters marked with |:

    I am preparing for OCPJP
    | || ||||||||| ||| |||||
    

    which means

     ""|" "|""|" "|""|""|""|""|""|""|""|""|" "|""|""|" "|""|""|""|""|""
    

    So this represents this array

    [""," ",""," ","","","","","","","",""," ","",""," ","","","","",""]  
    

    but since trailing empty strings "" are removed (if they were produced by the split itself; more info at: Confusing output from String.split)

    [""," ",""," ","","","","","","","",""," ","",""," ","","","","",""]  
                                                         ^^ ^^ ^^ ^^ ^^
    

    you get back an array that contains only this part:

    [""," ",""," ","","","","","","","",""," ","",""," "]  
    

    which is exactly 16 elements.
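
The counts in this walkthrough can be reproduced with a short program. Treat it as a sketch rather than anything official: the class name SplitCountDemo and the helper countTrailingEmpty are hypothetical names introduced here, and the number of delimiter matches is obtained by simply stripping whitespace with replaceAll.

```java
class SplitCountDemo {
    // Counts how many empty strings sit at the end of the array.
    static int countTrailingEmpty(String[] pieces) {
        int count = 0;
        for (int i = pieces.length - 1; i >= 0 && pieces[i].isEmpty(); i--) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String test = "I am preparing for OCPJP";

        // Every non-whitespace character is one delimiter match.
        int matches = test.replaceAll("\\s", "").length();
        String[] all = test.split("\\S", -1); // keeps trailing empty strings
        String[] trimmed = test.split("\\S"); // default: drops them

        System.out.println(matches);                 // prints 20
        System.out.println(all.length);              // prints 21 (matches + 1)
        System.out.println(countTrailingEmpty(all)); // prints 5
        System.out.println(trimmed.length);          // prints 16 (21 - 5)
    }
}
```

So the expected "matches + 1" pieces do exist; the default split just drops the five empty strings produced by the trailing "OCPJP".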