java regex string

Regular expression - Searching a string using regex and extracting the match from the original string


Let's say I have the following string:

Lorem ipsum XYZ1234-123456-12 lorem ipsum

I want to search the string for any occurrence of a substring that has the pattern XXXDDDDDDDDDDDD (i.e. 3 letters followed by 12 digits, ignoring any non-alphanumeric characters).

To achieve this I do something like this:

String incomingId = "Lorem ipsum XYZ1234-123456-12 lorem ipsum";

private final static Pattern NONCHARACTER = Pattern.compile("[^a-zA-Z0-9]");
String removedNonChars = NONCHARACTER.matcher(incomingId).replaceAll("");      // yields LoremipsumXYZ123412345612loremipsum

I then run another regex to search for the sequence I want (i.e. XXXDDDDDDDDDDDD):

private final static Pattern IDENTIFIERPATTERN = Pattern.compile("([a-zA-Z]{3})(\\d{4})(\\d{6})(\\d{2})");
Matcher matcher = IDENTIFIERPATTERN.matcher(removedNonChars);
String extractedString = matcher.find() ? matcher.group() : null;     // matches XYZ123412345612

Once I get the string that has the format I am looking for (i.e. XYZ123412345612), I want to extract that string from the original, unmodified string (i.e. the value XYZ1234-123456-12).

Note - The hyphens are just an example; the separator could be any non-alphanumeric character. Examples:

Lorem ipsum XYZ1234-123456-12 lorem ipsum
Lorem ipsum XYZ123412345612 lorem ipsum
Lorem ipsum XYZ1234 123456 12 lorem ipsum
Lorem ipsum XYZ1234!123456#12 lorem ipsum
Lorem ipsum XYZ1234--123456#12 lorem ipsum

Basically, what I am doing is searching a string for identifiers. The identifiers usually have a defined format, but sometimes people don't follow the rules for the identifier, hence I am searching with the non-alphanumeric characters removed from the string. After I have found the identifier without those characters, I want to extract the original text WITH them.

How can I extract the identifier from the original string using the string that was returned as a match in the initial search?

Edit

The separators are always non-alphanumeric, i.e. not a digit and not a letter (only special characters such as -,#£$"(!__£($&£^" and the space character).

Thanks in advance.


Solution

  • By stripping the non-alphanumeric characters first, you're making your task harder. Instead, write a regex that extracts that part directly from the original string.

    The issue here is that you can't directly use \\d{12}, as the digits are not contiguous. So, let's modify that part. Since there can be 0 or more non-digit characters between the digits, you can use \\d\\D* instead of \\d, repeat it 11 times, and then match a single final digit.

    So you can use the following regex:

    "[a-zA-Z]{3}(\\d\\D*){11}\\d"
    

    Use it with the Matcher#find() method, and read the whole match with group().

    String str = "Lorem ipsum XYZ1234-123456-12 lorem ipsum";
    
    Pattern pattern = Pattern.compile("[a-zA-Z]{3}(\\d\\D*){11}\\d");
    Matcher matcher = pattern.matcher(str);
    
    if (matcher.find()) {
        System.out.println(matcher.group());
    }
    

    Output:

    XYZ1234-123456-12
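
As a quick check, the same pattern can be run against all five variants listed in the question. This is only a minimal sketch: the class name `IdentifierSamples` and the helper `extract` are hypothetical names chosen for illustration, and the helper returns only the first match (or null).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class IdentifierSamples {
    // Three letters, then 12 digits, each digit optionally followed by non-digit separators.
    private static final Pattern IDENTIFIER =
            Pattern.compile("[a-zA-Z]{3}(\\d\\D*){11}\\d");

    // Returns the first identifier exactly as written in the input, or null if none is found.
    static String extract(String text) {
        Matcher matcher = IDENTIFIER.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Lorem ipsum XYZ1234-123456-12 lorem ipsum",
            "Lorem ipsum XYZ123412345612 lorem ipsum",
            "Lorem ipsum XYZ1234 123456 12 lorem ipsum",
            "Lorem ipsum XYZ1234!123456#12 lorem ipsum",
            "Lorem ipsum XYZ1234--123456#12 lorem ipsum"
        };
        for (String sample : samples) {
            // Prints the identifier with its original separators intact.
            System.out.println(extract(sample));
        }
    }
}
```

Because the pattern ends on a digit, the match stops at the last digit of the identifier and never swallows the trailing space.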
    

    Update:

    If the separators between digits are always non-alphanumeric, then you can use [\\W_] instead of \\D (which would also accept letters between the digits), as already pointed out by @Pshemo in the comments:

    "[a-zA-Z]{3}(\\d[\\W_]*){11}\\d"
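
To illustrate the difference between the two patterns, here is a small sketch. The class name `SeparatorComparison`, the helper `firstMatch`, and the input with a letter between the digit groups are made up for illustration and are not from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, not part of the original answer.
class SeparatorComparison {
    // \D allows ANY non-digit between digits, including letters.
    static final Pattern LOOSE = Pattern.compile("[a-zA-Z]{3}(\\d\\D*){11}\\d");
    // [\W_] allows only non-alphanumeric separators (non-word characters plus underscore).
    static final Pattern STRICT = Pattern.compile("[a-zA-Z]{3}(\\d[\\W_]*){11}\\d");

    // Returns the first match of the given pattern in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        // Assumed input: a letter ('x') sits between the digit groups.
        String withLetter = "Lorem ipsum ABC1234x123456-12 lorem ipsum";
        System.out.println(firstMatch(LOOSE, withLetter));   // accepted by the loose pattern
        System.out.println(firstMatch(STRICT, withLetter));  // rejected by the strict pattern (null)

        // Underscore and space are both valid separators for the strict pattern.
        System.out.println(firstMatch(STRICT, "Lorem ipsum XYZ1234_123456 12 lorem ipsum"));
    }
}
```

Which variant is appropriate depends on how strictly the separators should be validated; the strict form matches the rule stated in the question's edit.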