php regex php-7 php-7.3

Difference in matching end of line with PHP regex


Given the code:

$my_str = '
Rollo is*
My dog*
And he\'s very*
Lovely*
';

preg_match_all('/\S+(?=\*$)/m', $my_str, $end_words);
print_r($end_words);

In PHP 7.3.2 (XAMPP) I get this unexpected output:

Array ( [0] => Array ( ) )

Whereas in PhpFiddle, on PHP 7.0.33, I get what I expected:

Array ( [0] => Array ( [0] => is [1] => dog [2] => very [3] => Lovely ) )

Why am I getting this difference? Did something change in regular expression behaviour after 7.0.33?


Solution

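  • It seems that in your environment the PCRE library was compiled without the PCRE_NEWLINE_ANY option, so $ in multiline mode only matches before the LF symbol and . matches any symbol but LF. If the script file is saved with Windows (CRLF) line endings, which is likely under XAMPP, each * in the string is followed by CR rather than LF, so \*$ never matches and the result is empty.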
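
One way to check this diagnosis is to look for CR characters in the string itself. The snippet below is a minimal sketch: the string is a shortened, hypothetical stand-in for the one in the question, written with explicit CRLF line endings, which is an assumption about how the file was saved.

```php
<?php
// Hypothetical stand-in for the question's string, written with explicit
// CRLF line endings (an assumption about how the script file was saved).
$my_str = "\r\nRollo is*\r\nMy dog*\r\n";

// True if at least one Windows-style line ending is present.
$has_crlf = strpos($my_str, "\r\n") !== false;

// json_encode() escapes control characters, which makes CR and LF visible.
$visible = json_encode($my_str);

var_dump($has_crlf);
echo $visible, PHP_EOL;
```

If the check reports CRLF for the real string, a $ that only recognizes LF cannot match directly after the *.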

    You can fix it by using the PCRE (*ANYCRLF) verb:

    '~(*ANYCRLF)\S+(?=\*$)~m'
    

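    (*ANYCRLF) specifies a newline convention in which any of CR, LF or CRLF is treated as a line break, and is equivalent to the PCRE_NEWLINE_ANYCRLF option (PCRE_NEWLINE_ANY is the broader setting that corresponds to the (*ANY) verb). See the PCRE documentation:

    PCRE_NEWLINE_ANYCRLF specifies that any of the three preceding sequences should be recognized. Setting PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be recognized.

    In the end, this PCRE verb makes . match any character except CR and LF, and makes $ match right before either of these two characters.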
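
The effect can be reproduced independently of how PCRE was built by setting the newline convention explicitly in both patterns. This is a sketch that assumes the question's string has CRLF line endings; the variable names $lf_only and $any_crlf are illustrative.

```php
<?php
// Same text as in the question, with explicit Windows (CRLF) line endings.
$my_str = "\r\nRollo is*\r\nMy dog*\r\nAnd he's very*\r\nLovely*\r\n";

// (*LF): only "\n" is a line break, so "$" cannot match between "*" and "\r".
preg_match_all('~(*LF)\S+(?=\*$)~m', $my_str, $lf_only);

// (*ANYCRLF): "\r", "\n" and "\r\n" all count as line breaks,
// so "$" matches right after each "*".
preg_match_all('~(*ANYCRLF)\S+(?=\*$)~m', $my_str, $any_crlf);

print_r($lf_only[0]);   // empty array
print_r($any_crlf[0]);  // is, dog, very, Lovely
```

Using (*LF) for the failing case, rather than relying on the build default, keeps the comparison the same on any installation.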

    See more about this and other verbs at rexegg.com:

    By default, when PCRE is compiled, you tell it what to consider to be a line break when encountering a . (as the dot it doesn't match line breaks unless in dotall mode), as well the ^ and $ anchors' behavior in multiline mode. You can override this default with the following modifiers:

    (*CR) Only a carriage return is considered to be a line break
    (*LF) Only a line feed is considered to be a line break (as on Unix)
    (*CRLF) Only a carriage return followed by a line feed is considered to be a line break (as on Windows)
    (*ANYCRLF) Any of the above three is considered to be a line break
    (*ANY) Any Unicode newline sequence is considered to be a line break

    For instance, (*CR)\w+.\w+ matches Line1\nLine2 because the dot is able to match the \n, which is not considered to be a line break.
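
The (*CR) example can be tried in PHP as follows. This is a small sketch; the (*LF) pattern is added here only for contrast and is not part of the quoted text.

```php
<?php
$text = "Line1\nLine2";

// (*CR): only "\r" is a line break, so "." is free to match "\n".
preg_match('~(*CR)\w+.\w+~', $text, $with_cr);

// (*LF): "\n" is a line break, so "." cannot cross it and the
// match has to stay inside the first line.
preg_match('~(*LF)\w+.\w+~', $text, $with_lf);

var_dump($with_cr[0] === "Line1\nLine2"); // bool(true)
var_dump($with_lf[0] === "Line1");        // bool(true)
```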