Let's say I have the following string:
Lorem ipsum XYZ1234-123456-12 lorem ipsum
I want to search the string for any occurrence of a substring that has the pattern XXXDDDDDDDDDDDD
(i.e. 3 letters followed by 12 digits, ignoring any non-alphanumeric characters).
To achieve this, I do something like this:
String incomingId = "Lorem ipsum XYZ1234-123456-12 lorem ipsum";
private final static Pattern NONCHARACTER = Pattern.compile("[^a-zA-Z0-9]");
String removedNonChars = NONCHARACTER.matcher(incomingId).replaceAll(""); // returns LoremipsumXYZ123412345612loremipsum
I then run another regex to search for the sequence I want (i.e. XXXDDDDDDDDDDDD):
private final static Pattern IDENTIFIERPATTERN = Pattern.compile("([a-zA-Z]{3})(\\d{4})(\\d{6})(\\d{2})");
Matcher matcher = IDENTIFIERPATTERN.matcher(removedNonChars);
String extractedString = matcher.find() ? matcher.group() : null; // matches XYZ123412345612
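For reference, the two steps above can be put together as a self-contained sketch. The class name TwoStepSearch and the method findNormalized are made up for illustration, and the pattern assumes exactly three letters followed by twelve digits (grouped 4-6-2):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepSearch {
    // Step 1: matches everything that is not an ASCII letter or digit.
    private static final Pattern NON_ALPHANUMERIC = Pattern.compile("[^a-zA-Z0-9]");
    // Step 2: three letters followed by twelve digits (grouped 4-6-2).
    private static final Pattern IDENTIFIER =
            Pattern.compile("([a-zA-Z]{3})(\\d{4})(\\d{6})(\\d{2})");

    // Returns the identifier found in the stripped text, or null if there is none.
    public static String findNormalized(String input) {
        String stripped = NON_ALPHANUMERIC.matcher(input).replaceAll("");
        Matcher m = IDENTIFIER.matcher(stripped);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(findNormalized("Lorem ipsum XYZ1234-123456-12 lorem ipsum"));
    }
}
```

This only yields the normalized form (XYZ123412345612); it does not tell me where that text sits in the original string, which is the part I am missing.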
Once I get the string that has the format I am looking for (i.e. XYZ123412345612), I want to extract the corresponding text from the original, unmodified string (i.e. the value XYZ1234-123456-12).
Note: the hyphens are just an example; the separator could be any non-alphanumeric character. Examples:
Lorem ipsum XYZ1234-123456-12 lorem ipsum
Lorem ipsum XYZ123412345612 lorem ipsum
Lorem ipsum XYZ1234 123456 12 lorem ipsum
Lorem ipsum XYZ1234!123456#12 lorem ipsum
Lorem ipsum XYZ1234--123456#12 lorem ipsum
Basically, what I am doing is searching a string for identifiers. The identifiers usually have a defined format, but sometimes people don't follow the formatting rules, hence I am searching the string with the non-alphanumeric characters removed. After I have found the identifier without those characters, I want to extract the original text WITH them.
How can I extract the identifier from the original string using the string that was returned as a match in the initial search?
The separators are always non-alphanumeric, i.e. neither a digit nor a letter (only special characters such as -,#£$"(!__£($&£^" and the space character).
Thanks in advance.
By stripping the non-alphanumeric characters first, you're making your task more difficult. Instead, you should write a regex that extracts that part directly from the original string.
The issue here is that you can't directly use \\d{12}, as the digits are not contiguous. So, let's modify that part. Since you can have 0 or more non-digit characters between the digits, you can use \\d\\D* instead of \\d, match that 11 times, and then match a single digit at the end.
So you can use the following regex:
"[a-zA-Z]{3}(\\d\\D*){11}\\d"
Use it with the Matcher#find() method, and get the entire match out of it with group():
String str = "Lorem ipsum XYZ1234-123456-12 lorem ipsum";
Pattern pattern = Pattern.compile("[a-zA-Z]{3}(\\d\\D*){11}\\d");
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
    System.out.println(matcher.group());
}
Output:
XYZ1234-123456-12
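As a quick check, here is a minimal sketch that runs the same pattern over all five sample inputs from the question. The class name ExtractOriginal and the helper extract are hypothetical names chosen for this example; the regex is the one shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractOriginal {
    // Three letters, then 11 digits each optionally followed by non-digits, then a final digit.
    private static final Pattern ID = Pattern.compile("[a-zA-Z]{3}(\\d\\D*){11}\\d");

    // Returns the first identifier exactly as it appears in the input, or null if none is found.
    public static String extract(String input) {
        Matcher m = ID.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Lorem ipsum XYZ1234-123456-12 lorem ipsum",
            "Lorem ipsum XYZ123412345612 lorem ipsum",
            "Lorem ipsum XYZ1234 123456 12 lorem ipsum",
            "Lorem ipsum XYZ1234!123456#12 lorem ipsum",
            "Lorem ipsum XYZ1234--123456#12 lorem ipsum"
        };
        for (String s : samples) {
            // Prints the identifier with its original separators intact.
            System.out.println(extract(s));
        }
    }
}
```

Because the match is taken from the original string, the separators come back untouched, so no mapping from the stripped string back to the original is needed.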
Update:
If the separators between digits are non-alphanumeric, then you can use [\\W_] instead of \\D, as already pointed out by @Pshemo in the comments:
"[a-zA-Z]{3}(\\d[\\W_]*){11}\\d"