I'm not good with regular expressions and don't really understand them, but I want my code to work well. My task is to extract every DOI from a large text, and I immediately ran into a problem: DOIs appear in several different formats:
DOI: 10.1...
DOI:10.1...
http(s)://dx.doi.org/10.1...
http(s)://doi.org/10.1...
DOI: 11.11111/aaa.111&1111&11
Right now I have a regular expression that only handles a bare 10.1... value:
$doiPattern = "/\b(10\.[0-9]{4,}(?:\.[0-9]+)*\/(?:(?![\"&\'])\S)+)\b/";
Unfortunately, it can't handle all of the variants listed above; it only pulls the 10.1... part out of full links, e.g. https://doi.org/11.11111/aaa.111111111
The code I have tried:
$element = "Lorem Lorem Lorem Lorem Lorem Lorem Lorem `https://dx.doi.org/11.11111/aaa.111111111` Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem `https://doi.org/11.11111/aaa.111111111` Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem `DOI: 11.11111/aaa.111&1111&11` Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem `DOI:11.11111/aaa.111&1111&11` Lorem Lorem Lorem ";
$doiPattern = "/\b(10\.[0-9]{4,}(?:\.[0-9]+)*\/(?:(?![\"&\'])\S)+)\b/";
$doiData = preg_match_all($doiPattern, $element, $doiMatches);
foreach ($doiMatches[0] as $doiMatch) {
//...
}
How do I update the regular expression so that it covers all of these cases, for example an optional space after DOI: as well as the link forms?
Here is a working solution with an updated regex pattern and PHP code.
<?php
$string = "Lorem Lorem Lorem Lorem Lorem Lorem Lorem https://dx.doi.org/11.11111/aaa.111111111 Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem https://doi.org/11.11111/aaa.111111111 Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem DOI: 11.11111/aaa.111&1111&11 Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem Lorem DOI:11.11111/aaa.111&1111&11 Lorem Lorem Lorem";
// Define regex pattern for DOI
$pattern = '/((https?:\/\/(?:dx\.)?doi\.org\/|DOI:\s*)(\d+(\.\d+)*\/\S+))/i';
// Perform the match
preg_match_all($pattern, $string, $matches);
// Print all matches
print_r($matches[0]);
Output
Array
(
[0] => https://dx.doi.org/11.11111/aaa.111111111
[1] => https://doi.org/11.11111/aaa.111111111
[2] => DOI: 11.11111/aaa.111&1111&11
[3] => DOI:11.11111/aaa.111&1111&11
)
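If you only need the DOI itself, without the URL or DOI: prefix, the same pattern already captures it in its third group. A minimal sketch, assuming a shortened sample string of my own (not the original text):

```php
<?php
// Shortened, hypothetical sample text; the pattern is the one used above.
$string = "See https://dx.doi.org/11.11111/aaa.111111111 and DOI: 11.11111/aaa.111&1111&11 for details";
$pattern = '/((https?:\/\/(?:dx\.)?doi\.org\/|DOI:\s*)(\d+(\.\d+)*\/\S+))/i';

preg_match_all($pattern, $string, $matches);

// Group 1 = whole match, group 2 = prefix (URL or "DOI:"),
// group 3 = bare DOI, group 4 = last ".digits" part of the prefix number.
$dois = $matches[3];

print_r($dois);
```

With the default PREG_PATTERN_ORDER, $matches[3] holds the third group of every match, so $dois should contain the two bare DOI strings.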
Explanation of the Regex Pattern Used Above
Protocol Matching: https?:\/\/ matches a URL that starts with either http:// or https://.
Domain and Path: (?:dx\.)?doi\.org\/ matches the domain doi.org, optionally preceded by dx., followed by a slash.
OR Operator: | means "or": match either the URL prefix on its left or the DOI: prefix on its right.
DOI Identifier: DOI:\s* matches the literal string DOI: followed by zero or more whitespace characters.
DOI Value: \d+(\.\d+)*\/\S+ captures the actual DOI value:
\d+ matches one or more digits.
(\.\d+)* allows zero or more occurrences of a dot followed by one or more digits. This covers the dot-separated parts of the DOI prefix (e.g. the .11111 in 11.11111).
\/ matches the forward slash that separates the DOI prefix from the suffix.
\S+ matches one or more non-whitespace characters, i.e. the rest of the DOI up to the next whitespace. Because it is greedy, it also picks up punctuation that directly follows the DOI, such as a trailing period.
Case Insensitivity: /i, the i flag after the closing delimiter, makes the pattern case-insensitive, so doi: and DOI: both match.
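One caveat with \S+: in running text a DOI is often followed directly by a period, comma, or closing bracket, and those characters end up in the match. Below is a rough sketch of one way to deal with that. It is a variant of the pattern, not a replacement for it. The sample text, the named group doi, and the list of stripped characters are my own assumptions.

```php
<?php
// Hypothetical sample: each DOI is directly followed by sentence punctuation.
$text = "See doi: 10.1000/xyz123. Also (https://doi.org/10.1000/abc456), and DOI:10.1000/def789;";

// Same idea as the pattern above, but with "~" as the delimiter so "/" needs no escaping,
// non-capturing groups for the prefix, and a named group for the DOI itself.
$pattern = '~(?:https?://(?:dx\.)?doi\.org/|DOI:\s*)(?<doi>\d+(?:\.\d+)*/\S+)~i';

preg_match_all($pattern, $text, $matches);

// \S+ is greedy, so strip trailing punctuation that most likely belongs to the sentence.
// Assumption: no real DOI in your data ends with one of these characters.
$dois = array_map(function ($doi) {
    return rtrim($doi, '.,;:)]');
}, $matches['doi']);

print_r($dois);
```

Some real DOIs do contain brackets or end in unusual characters, so adjust the rtrim() list to whatever your data actually looks like.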