php regex doi

Regular expression to match DOIs written in several formats


I'm not good with regular expressions and don't really understand them, but I want my code to work well. My task is to extract every DOI from a large text, and I immediately ran into a problem: DOIs are written in several different ways.

DOI: 10.1...
DOI:10.1...
http(s)://dx.doi.org/10.1...
http(s)://doi.org/10.1...
DOI: 11.11111/aaa.111&1111&11

Right now I have a regular expression that only handles a bare 10.1... identifier: $doiPattern = "/\b(10\.[0-9]{4,}(?:\.[0-9]+)*\/(?:(?![\"&\'])\S)+)\b/";

Unfortunately, my code can't handle all of the variants listed above. It only pulls the bare 10.1... out of full links, and it does not handle input such as https://doi.org/11.11111/aaa.111111111.

The code I have tried:

$element = "Lorem Lorem Lorem Lorem Lorem Lorem Lorem `https://dx.doi.org/11.11111/aaa.111111111` Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem `https://doi.org/11.11111/aaa.111111111` Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem `DOI: 11.11111/aaa.111&1111&11` Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem `DOI:11.11111/aaa.111&1111&11` Lorem Lorem Lorem ";

$doiPattern = "/\b(10\.[0-9]{4,}(?:\.[0-9]+)*\/(?:(?![\"&\'])\S)+)\b/";

$doiData = preg_match_all($doiPattern, $element, $doiMatches);

foreach ($doiMatches[0] as $doiMatch) {
       //...
}

How do I update the regular expression so that it handles these cases, for example an optional space after DOI: as well as full links?


Solution

  • Here is a working solution with an updated regex pattern and PHP code.

    <?php
    
    $string = "Lorem Lorem Lorem Lorem Lorem Lorem Lorem https://dx.doi.org/11.11111/aaa.111111111 Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem https://doi.org/11.11111/aaa.111111111 Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem DOI: 11.11111/aaa.111&1111&11 Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem DOI:11.11111/aaa.111&1111&11 Lorem Lorem Lorem";
    
    // Define regex pattern for DOI
    $pattern = '/((https?:\/\/(?:dx\.)?doi\.org\/|DOI:\s*)(\d+(\.\d+)*\/\S+))/i';
    
    // Perform the match
    preg_match_all($pattern, $string, $matches);
    
    // Print all matches
    print_r($matches[0]);
    

    Output

    Array
    (
        [0] => https://dx.doi.org/11.11111/aaa.111111111
        [1] => https://doi.org/11.11111/aaa.111111111
        [2] => DOI: 11.11111/aaa.111&1111&11
        [3] => DOI:11.11111/aaa.111&1111&11
    )
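
If only the identifier itself is needed, without the `DOI:` or URL prefix, one option is to capture it in its own group. The following is a minimal sketch, not part of the original answer: the sample text, the named group `doi`, and the set of trailing punctuation characters to trim are all assumptions chosen for illustration.

```php
<?php

// Hypothetical sample text; the identifiers are placeholders, not real DOIs.
$text = "See https://dx.doi.org/11.11111/aaa.111111111, then DOI: 11.11111/aaa.111&1111&11. "
      . "Also doi:10.1000/xyz123 and http://doi.org/10.1234/abc.def.";

// The prefix is a non-capturing group; the named group "doi" holds only the identifier.
// Using ~ as the delimiter avoids escaping every forward slash.
$pattern = '~(?:https?://(?:dx\.)?doi\.org/|doi:\s*)(?<doi>\d+(?:\.\d+)*/\S+)~i';

preg_match_all($pattern, $text, $matches);

// \S+ also swallows sentence punctuation that directly follows a DOI, so trim it.
// Assumption: a DOI in this text never legitimately ends with one of these characters.
$dois = array_map(
    function ($doi) {
        return rtrim($doi, '.,;:)');
    },
    $matches['doi']
);

// Remove duplicates and reindex the array.
$dois = array_values(array_unique($dois));

print_r($dois);
```

With the sample text above, `$dois` should contain the four bare identifiers, with the trailing comma and full stops removed. Whether trimming is safe depends on the data, since a DOI suffix may itself contain punctuation.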
    

    Explanation of the Regex Pattern Used Above