Tags: java, string, pattern-matching, stringtokenizer

How to find a whole word in a String in Java?


I have a String that I have to parse for different keywords. For example, I have the String:

"I will come and meet you at the 123woods"

And my keywords are

'123woods'
'woods'

I should report whenever I have a match and where. Multiple occurrences should also be accounted for.

However, for this one, I should get a match only on '123woods', not on 'woods'. This rules out String.contains(). I should also be able to check a list/set of keywords for occurrences in a single pass. In this example, if my keywords are '123woods' and 'come', I should get two occurrences. The method should be reasonably fast on large texts.

My idea is to use StringTokenizer but I am unsure if it will perform well. Any suggestions?


Solution

  • The example below is based on your comments. It searches a given String for a List of keywords, using word boundaries (\b) so that only whole words match. StringUtils from Apache Commons Lang joins the keywords into a single regular expression, and each matched group is printed.

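    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.lang3.StringUtils; // Commons Lang 3; 2.x uses org.apache.commons.lang

    String text = "I will come and meet you at the woods 123woods and all the woods";

    List<String> tokens = new ArrayList<String>();
    tokens.add("123woods");
    tokens.add("woods");

    // \b anchors the alternation at word boundaries, so "woods" cannot match inside "123woods".
    // Keywords are inserted verbatim, so they must not contain regex metacharacters.
    String patternString = "\\b(" + StringUtils.join(tokens, "|") + ")\\b";
    Pattern pattern = Pattern.compile(patternString);
    Matcher matcher = pattern.matcher(text);

    // Print each matched keyword together with the index where it starts.
    while (matcher.find()) {
        System.out.println(matcher.group(1) + " at " + matcher.start());
    }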
    

    If you need more performance, have a look at StringSearch: high-performance pattern matching algorithms in Java.
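
For reference, here is a minimal JDK-only sketch of the same word-boundary idea that also reports where each match starts, as the question asks. The class name `WholeWordFinder`, the method `findWholeWords`, and the "keyword@offset" result format are made up for illustration. Each keyword is passed through `Pattern.quote`, so regex metacharacters in a keyword are treated literally. The sketch assumes every keyword begins and ends with a word character (letter, digit or underscore); otherwise `\b` will not behave as a whole-word test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class, not part of any library.
class WholeWordFinder {

    /** Returns "keyword@offset" for every whole-word occurrence of any keyword in text. */
    static List<String> findWholeWords(String text, List<String> keywords) {
        // Quote each keyword so characters such as '.' or '+' are matched literally.
        String alternation = keywords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));

        // One compiled pattern checks all keywords in a single scan of the text.
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b");
        Matcher matcher = pattern.matcher(text);

        List<String> hits = new ArrayList<>();
        while (matcher.find()) {
            // matcher.start() is the zero-based index of the match in text.
            hits.add(matcher.group() + "@" + matcher.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "I will come and meet you at the 123woods";
        List<String> keywords = Arrays.asList("123woods", "woods", "come");
        System.out.println(findWholeWords(text, keywords));
    }
}
```

With the question's sentence and the keywords '123woods', 'woods' and 'come', this reports "come" at index 7 and "123woods" at index 32. "woods" is not reported, because there is no word boundary between '3' and 'w'.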