Setting the substitute character has no effect in mb_convert_encoding() for characters outside the input code page range.
My PHP version is 7.4.26 (cli). Here is a simple script I wrote:
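<?php
// Ask mbstring for the "long" substitution format.
$setting_result = mb_substitute_character('long');
// 0x81 is outside the 7-bit ASCII range.
$decodedText = 'a'.chr(0x81).'b';
print(mb_convert_encoding($decodedText, 'UTF-8', 'ASCII'));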
When I run this script in a Linux console, I expect the long description in place of the chr(0x81) character, because it is beyond the ASCII code page range. Instead, the result looks like "ab", with nothing printable in between.
I inspected all of the generated bytes with:
$ php test.php | od -c -t x1
0000000 a 302 201 b
61 c2 81 62
It looks like the substitute was the two-byte pair (0xc2, 0x81).
I checked the $setting_result variable, and it is "true". I also checked the configuration after the change by reading the result of calling mb_substitute_character() with no arguments. It returns 'long'. But in my mb_convert_encoding() example it has no effect.
Up to and including PHP 8.0, ext-mbstring converted "\x81" to "\xC2\x81" when going from ASCII in to UTF-8 out. I suspect this was not a bug but intentional: the byte 0x81 (from the C1 range) in the input stream is turned into U+0081 in UTF-8, so it stays outside ASCII.
$ php7.4 -r 'mb_substitute_character(0x1A); echo mb_convert_encoding("a\x81b", "UTF-8", "ASCII");' | xxd
00000000: 61c2 8162 a..b
Example: mbstring in PHP <= 8.0 treats \x81 as a character from the extended character set
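To see where the bytes c2 81 come from, here is a minimal sketch of the standard two-byte UTF-8 encoding applied to the code point U+0081. The helper name utf8TwoByte is made up for this illustration and is not part of mbstring; it only covers the range U+0080 to U+07FF.

```php
<?php
// Hypothetical helper for illustration only (not an mbstring function).
// Encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes:
//   110xxxxx 10xxxxxx
function utf8TwoByte(int $cp): string
{
    if ($cp < 0x80 || $cp > 0x7FF) {
        throw new InvalidArgumentException('Code point outside the two-byte range');
    }
    // Lead byte carries the upper 5 bits, continuation byte the lower 6 bits.
    return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(utf8TwoByte(0x81)), "\n"; // c281
```

So if the raw byte 0x81 is read as the code point U+0081, the UTF-8 output is exactly the pair seen in the xxd dump.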
$ php8.4 -r 'mb_substitute_character(0x1A); echo mb_convert_encoding("a\x81b", "UTF-8", "ASCII");' | xxd
00000000: 611a 62 a.b
Example: mbstring in PHP >= 8.1 substitutes \x81
This is naturally a flaw: the input should have been rejected entirely, but mbstring is not designed that way.
Since PHP 8.1 the flaw is clearly a bug: with ASCII as the input encoding and "long" configured, the illegal byte is replaced with a question mark (?). That turns it into a printable character and destroys the ASCII information, because the output can no longer be told apart from input that really contained a question mark.
It is strongly recommended to reject such input instead of performing any substitution. In PHP <= 8.0 you are safe if you filter out the bytes with the high bit set (since your source is ASCII). In PHP >= 8.1 you have to reject the illegal input beforehand, whether or not you still want to make use of mbstring afterwards. There are several ways to do that in PHP.
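One way to reject the input up front, as a minimal sketch: scan for any byte with the high bit set before converting. The function name asciiToUtf8Strict and the choice of exception are assumptions made for this example, not something mbstring provides. It relies only on preg_match() without the /u modifier, which matches on bytes, so it should behave the same across the PHP versions discussed here.

```php
<?php
// Hypothetical helper (not part of mbstring): reject non-ASCII bytes
// instead of letting any substitution happen.
function asciiToUtf8Strict(string $bytes): string
{
    // Without the /u modifier PCRE matches single bytes, so this finds
    // the first byte with the high bit set, if there is one.
    if (preg_match('/[\x80-\xFF]/', $bytes, $m, PREG_OFFSET_CAPTURE) === 1) {
        throw new InvalidArgumentException(sprintf(
            'Non-ASCII byte 0x%02X at offset %d',
            ord($m[0][0]),
            $m[0][1]
        ));
    }
    // Pure 7-bit ASCII is already valid UTF-8, so no conversion is needed.
    return $bytes;
}

try {
    asciiToUtf8Strict("a\x81b");
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n"; // Non-ASCII byte 0x81 at offset 1
}
```

Because valid ASCII is a subset of UTF-8, the sketch returns the input unchanged once it has passed the check. You could still pass it through mb_convert_encoding() afterwards if you prefer to keep that call in place.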
The question mark "?" is an ordinary printable character, not a dedicated substitution character. On this point the mbstring library seems deliberately broken, and confused as a result.
$ php7.4 -r 'mb_substitute_character("long"); echo mb_convert_encoding("a\x81b", "UTF-8", "ASCII");' | xxd
00000000: 61c2 8162 a..b
$ php8.0 -r 'mb_substitute_character("long"); echo mb_convert_encoding("a\x81b", "UTF-8", "ASCII");' | xxd
00000000: 61c2 8162 a..b
# Change in 8.1:
$ php8.1 -r 'mb_substitute_character("long"); echo mb_convert_encoding("a\x81b", "UTF-8", "ASCII");' | xxd
00000000: 613f 62 a?b
$ php8.4 -r 'mb_substitute_character("long"); echo mb_convert_encoding("a\x81b", "UTF-8", "ASCII");' | xxd
00000000: 613f 62 a?b