java regex string search

Regular Expressions - Using Regex to search for a specific string within another string


I want to search a string for an identifier. The identifier can have 4 variations:

REF964758362562
REF964-758362-562
964758362562
964-758362-562

The identifier can be located anywhere in a string or on its own. Examples:

Lorem ipsum REF964-758362-562
Lorem ipsum ABCD964-758362-562 lorem ipsum
Lorem ipsum REF964-758362-562 lorem ipsum
REF964-758362-562 Lorem ipsum 1234-123456-22
Lorem ipsum 964-758362-562 lorem ipsum
REF964758362562
REF964-758362-562
964758362562
964-758362-562

When hyphens are used in the identifier, they always appear after the third and ninth digits, as shown in the examples.

Here is what I have come up with, but I suspect that the regular expression is getting too long and can probably be shortened. It also does not work well when the identifier is not at the beginning of the string. Any tips/ideas?

^[A-Z]*REF[A-Z]*([12]\d{3})(\d{6})(\d{2})$|^([12]\d{3})(\d{6})(\d{2})[A-Z]*REF[A-Z]*|^([12]\d{3})(\d{6})(\d{2})$

I have put them in groups because, once I have extracted an identifier, I want to add the hyphens if it does not have them. For example, if the identifier extracted is 964758362562, I want to save it as 964-758362-562.

Here are some tests I have run and, as you can see, not a lot of them match:

testRegex = "^[A-Z]*REF[A-Z]*([12]\\d{3})(\\d{6})(\\d{2})$|^([12]\\d{3})(\\d{6})(\\d{2})[A-Z]*REF[A-Z]*|^([12]\\d{3})(\\d{6})(\\d{2})$";
        PATTERN = Pattern.compile(testRegex, Pattern.CASE_INSENSITIVE);

        m = PATTERN.matcher("Lorem ipsum REF964-758362-562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("REF964-758362-562 Lorem ipsum 1234-123456-22");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("Lorem ipsum 964-758362-562 lorem ipsum");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("Lorem ipsum ABCD964-758362-562 lorem ipsum");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("REF964758362562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("REF964-758362-562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("964758362562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

        m = PATTERN.matcher("964-758362-562");
        if(m.matches()) {
            System.out.println("Match = " + m.group());
        }else{
            System.out.println("No match");
        }

Output

No match
Match = Not known
No match
No match
No match
No match
No match
No match
No match
No match

Solution

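  • It looks like the identifier follows this general pattern: an optional REF prefix, three digits, an optional hyphen, six digits, a second hyphen only if the first one was present, and three more digits.

    That being the case, this pattern will work (written as a Java string literal, hence the doubled backslashes):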

    (?>REF)?(\\d{3}+)(-?)(\\d{6}+)\\2(\\d{3}+)
    

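    Breaking down the pattern:

      (?>REF)?    an optional literal REF, in an atomic group that does not capture
      (\\d{3}+)   group 1: exactly three digits (possessive quantifier)
      (-?)        group 2: an optional hyphen
      (\\d{6}+)   group 3: exactly six digits
      \\2         back-reference to whatever group 2 matched (a hyphen or nothing)
      (\\d{3}+)   group 4: exactly three digits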

    The nifty trick is to capture the optional hyphen and then back-reference it, so that if the first hyphen is present then the second must be too; conversely, if the first hyphen is not present, the second cannot be.
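
The back-reference behaviour can be checked in isolation. The following is a minimal sketch, not part of the original answer: the class name BackrefDemo and the helper isConsistent are made up for illustration, and the pattern is simplified by dropping the optional REF prefix and the possessive quantifiers so that only the hyphen logic remains. It uses matches() on purpose, because each input here is a bare identifier.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BackrefDemo {
    // Group 2 captures the optional first hyphen; \2 then demands the same
    // text (a hyphen or the empty string) at the second hyphen position.
    private static final Pattern ID =
            Pattern.compile("(\\d{3})(-?)(\\d{6})\\2(\\d{3})");

    // True when the whole string is an identifier with consistent hyphens.
    static boolean isConsistent(String s) {
        return ID.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isConsistent("964-758362-562")); // true: both hyphens
        System.out.println(isConsistent("964758362562"));   // true: no hyphens
        System.out.println(isConsistent("964-758362562"));  // false: first hyphen only
        System.out.println(isConsistent("964758362-562"));  // false: second hyphen only
    }
}
```

The two mixed forms are rejected because the second hyphen position must repeat exactly what group 2 captured.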

    Testcase in Java:

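    import java.text.MessageFormat;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The class name is arbitrary; it only makes the test case compile on its own.
    public class IdentifierSearch {
        public static void main(String[] args) throws Exception {
            final String[] test = {"Lorem ipsum REF964-758362-562",
                "Lorem ipsum ABCD964-758362-562 lorem ipsum",
                "REF964-758362-562 Lorem ipsum 1234-123456-22",
                "Lorem ipsum 964-758362-562 lorem ipsum",
                "REF964758362562",
                "REF964-758362-562",
                "964-758362562",   // mixed hyphenation: must not match
                "964758362-562",   // mixed hyphenation: must not match
                "964758362562",
                "964-758362-562"};
            final Pattern patt = Pattern.compile("(?>REF)?(\\d{3}+)(-?)(\\d{6}+)\\2(\\d{3}+)");
            // Rebuilds the identifier with hyphens, whether or not the input had them.
            final MessageFormat format = new MessageFormat("{0}-{1}-{2}");
            for (final String in : test) {
                final Matcher mat = patt.matcher(in);
                // find() scans for the next occurrence anywhere in the input.
                while (mat.find()) {
                    // Groups 1, 3 and 4 are the digit blocks; group 2 is the optional hyphen.
                    final String id = format.format(new Object[]{mat.group(1), mat.group(3), mat.group(4)});
                    System.out.println(id);
                }
            }
        }
    }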
    

    Output:

    964-758362-562
    964-758362-562
    964-758362-562
    964-758362-562
    964-758362-562
    964-758362-562
    964-758362-562
    964-758362-562
    

    Your main problem is the use of Matcher.matches(), which requires the whole input to match the pattern. What you actually want is to find the pattern within the input. For this purpose there is the while (matcher.find()) idiom, which finds each occurrence of the pattern in the input in turn.
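
To make the difference between matches() and find() concrete, here is a small sketch using the same pattern. The class name FindVsMatches and the helpers wholeInputMatches and findAll are assumptions made for this illustration and do not appear in the code above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class FindVsMatches {
    private static final Pattern ID =
            Pattern.compile("(?>REF)?(\\d{3}+)(-?)(\\d{6}+)\\2(\\d{3}+)");

    // matches() succeeds only if the pattern covers the entire input.
    static boolean wholeInputMatches(String input) {
        return ID.matcher(input).matches();
    }

    // find() is called repeatedly; each call moves on to the next occurrence.
    static List<String> findAll(String input) {
        final List<String> hits = new ArrayList<>();
        final Matcher m = ID.matcher(input);
        while (m.find()) {
            hits.add(m.group()); // the full matched text, including any REF prefix
        }
        return hits;
    }

    public static void main(String[] args) {
        final String in = "Lorem ipsum REF964-758362-562";
        System.out.println(wholeInputMatches(in)); // false: surrounding text is not part of the pattern
        System.out.println(findAll(in));           // [REF964-758362-562]
    }
}
```

With matches(), only inputs that consist of nothing but the identifier (such as REF964758362562) succeed, which is why most of the tests in the question print "No match".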