java regex doi

What's the correct format of a Java String regex to identify a DOI?


I am conducting some research on identifying DOIs in free-format text.

I am using Java 8 and regular expressions.

I have found these regexes, which are supposed to fulfil my requirements:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

The code I am trying is:

private static final Pattern pattern_one = Pattern.compile("/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i", Pattern.CASE_INSENSITIVE);

Matcher matcher = pattern_one.matcher("http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1");
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    System.out.println(matcher.group());
}

However, the matcher doesn't find anything.

Where have I gone wrong?

UPDATE

I have encountered a valid DOI that my set of regexes does not match.

Here's an example DOI: 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2

Why doesn't this pattern work?

/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i

Solution

  • In Java, a regex is written as a String. In some other languages (JavaScript and Perl, for example), a regex literal is delimited by /.../, with options such as i given after the closing /. So what is written as /XXX/i is done in Java like this:

    // Using flags parameter
    Pattern p = Pattern.compile("XXX", Pattern.CASE_INSENSITIVE);
    
    // Using embedded flags
    Pattern p = Pattern.compile("(?i)XXX");
    

    In most languages, regexes are used to find a matching substring. Java can do that too, using the find() method (or any of the replaceXxx() regex methods). However, Java also has the matches() method, which matches against the entire string and so eliminates the need for the begin and end boundary matchers ^ and $.

    Anyway, your problem is that the regex has both ^ and $ boundary matchers, which means it will only work if the string is nothing but the text you want to match. Since you actually want to find a substring, remove those matchers, along with the /.../i delimiters, which Java would otherwise treat as literal characters to match.
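
    The difference between find() and matches(), and the effect of the anchors and the /.../i delimiters, can be sketched as follows. This is only an illustration: the class name FindVersusMatches and the helpers found() and matchesAll() are made up for this example, and the pattern is the first (generic) one from the question.

```java
import java.util.regex.Pattern;

class FindVersusMatches {
    // Generic DOI pattern from the question, without the /.../i delimiters.
    static final String DOI = "10.\\d{4,9}/[-._;()/:A-Z0-9]+";

    // True if the regex matches some substring of the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).find();
    }

    // True only if the regex matches the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input).matches();
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        System.out.println(found("/^" + DOI + "$/i", url));       // false: the slashes and the i are literals
        System.out.println(found("^" + DOI + "$", url));          // false: ^ pins the match to index 0
        System.out.println(found(DOI, url));                      // true: substring search
        System.out.println(matchesAll(DOI, url));                 // false: the whole string must match
        System.out.println(matchesAll(DOI, "10.1175/JPO3002.1")); // true
    }
}
```

    In other words, use find() with an unanchored pattern to pull DOIs out of free text, and matches() when the input is expected to be a DOI and nothing else.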

    To search for one of multiple patterns, use the | alternation operator.

    Finally, since a Java regex is given as a String literal, any characters that are special in string literals, most notably \, need to be escaped.

    So, to build a single regex that can find substrings matching any of the following:

    /^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
    /^10.1002/[^\s]+$/i
    /^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
    /^10.1021/\w\w\d++$/i
    /^10.1207/[\w\d]+\&\d+_\d+$/i
    

    You would write it like this:

    String regex = "10.\\d{4,9}/[-._;()/:A-Z0-9]+" +
                  "|10.1002/[^\\s]+" +
                  // \\( and \\) match literal parentheses; a bare (\\d+) would be a capturing group
                  "|10.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d" +
                  "|10.1021/\\w\\w\\d++" +
                  "|10.1207/[\\w\\d]+\\&\\d+_\\d+";
    Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    
    String input = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
    Matcher m = p.matcher(input);
    while (m.find()) {
        System.out.println("Start index: " + m.start() +
                           " End index: " + m.end() +
                           " " + m.group());
    }
    

    Output

    Start index: 37 End index: 54 10.1175/JPO3002.1
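
    As for the pattern in the UPDATE: in a regex, unescaped ( and ) delimit a capturing group rather than matching literal parentheses, so (\d+) as written in the question can never match the "(2002)" in the example DOI. The sketch below compares that pattern with a version that escapes the parentheses and the literal dots. It also shows that the order of alternatives matters: Java tries them left to right, so a generic alternative listed first stops at the < and returns only part of the DOI. The class name DoiUpdateExample and the helper firstMatch() are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiUpdateExample {
    // Third pattern as written in the question: ( ) form a capturing group.
    static final String AS_WRITTEN =
            "10.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d";
    // Same pattern with the parentheses and dots escaped so they match literally.
    static final String ESCAPED =
            "10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d";
    // Generic pattern from the question (first in the list).
    static final String GENERIC = "10.\\d{4,9}/[-._;()/:A-Z0-9]+";

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String doi = "10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2";
        System.out.println(firstMatch(AS_WRITTEN, doi));              // null
        System.out.println(firstMatch(ESCAPED, doi));                 // the full DOI
        System.out.println(firstMatch(GENERIC + "|" + ESCAPED, doi)); // 10.1175/1520-0485(2002)032
        System.out.println(firstMatch(ESCAPED + "|" + GENERIC, doi)); // the full DOI
    }
}
```

    When combining patterns with |, it is therefore safer to list the more specific alternatives before the generic one.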