java regex string

Remove alphanumeric words from a string


I am trying to remove alphanumeric words (any word that contains a digit) from a string:

    String[] sentenceArray = {"India123156 hel12lo 10000 cricket 21355 sport news 000Fifa"};
    for (String s : sentenceArray) {
        System.out.println("before regex : " + s);
        String regex = "(\\d?[,/%]?\\d|^[a-zA-Z0-9_]*)";
        // Replace every match with a space, then collapse runs of spaces
        String finalResult1 = s.replaceAll(regex, " ");
        String finalResult = finalResult1.trim().replaceAll(" +", " ");
        System.out.println("after regex : " + finalResult);
    }

Actual output: hel lo cricket sport news Fifa

But my required output is: cricket sport news

Please help. Thank you in advance.


Solution

  • To match the words you want to exclude, together with any whitespace that follows them, you can use the following regex in case-insensitive mode:

    \b(?=[a-z]*\d+)\w+\s*\b
    

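    In Java, you can remove the matches with replaceAll ((?i) turns on case-insensitive mode; trim() drops the space left behind when the last word is removed):

    String replaced = your_original_string.replaceAll("(?i)\\b(?=[a-z]*\\d+)\\w+\\s*\\b", "").trim();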
    

    Token-by-Token Explanation

    \b                       # word boundary: the position between a word
                             # character (\w) and a non-word character
    (?=                      # look ahead to check that what follows is:
      [a-z]*                 #   zero or more letters (A-Z too, because of
                             #   case-insensitive mode), greedy
      \d+                    #   one or more digits (0-9), greedy
    )                        # end of look-ahead
    \w+                      # one or more word characters (a-z, A-Z, 0-9,
                             # _), greedy
    \s*                      # zero or more whitespace characters (\n, \r,
                             # \t, \f, and " "), greedy
    \b                       # word boundary, as above
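
For completeness, here is a minimal runnable sketch of the approach above, applied to the sentence from the question. The class name AlnumWordFilter and the method name removeWordsWithDigits are made up for illustration. The trim() call is there because removing the last word of a sentence leaves the space that preceded it.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class AlnumWordFilter {

    // The pattern from the answer, compiled once with the CASE_INSENSITIVE
    // flag instead of an inline (?i).
    private static final Pattern WORD_WITH_DIGIT =
            Pattern.compile("\\b(?=[a-z]*\\d+)\\w+\\s*\\b", Pattern.CASE_INSENSITIVE);

    // Removes every word matched by the pattern (letters followed by at
    // least one digit, or digits first), along with the whitespace after it.
    static String removeWordsWithDigits(String input) {
        // trim() drops the space left over when the final word is removed.
        return WORD_WITH_DIGIT.matcher(input).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String s = "India123156 hel12lo 10000 cricket 21355 sport news 000Fifa";
        System.out.println("before regex : " + s);
        System.out.println("after regex : " + removeWordsWithDigits(s));
    }
}
```

Compiling the Pattern once, rather than calling String.replaceAll each time, avoids recompiling the regex on every call, which may matter if you process many sentences in a loop.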