The iconv function sometimes gives me an error:
Notice: iconv() [function.iconv]: Detected an incomplete multibyte character in input string in [...]
Is there a way to detect that there are illegal characters in a UTF-8 string before sending data to iconv()?
First, note that it is not possible to detect whether text belongs to a specific undesired encoding. You can only check whether a string is valid in a given encoding.
You can make use of the UTF-8 validity check that has been available in preg_match [PHP Manual] since PHP 4.3.5. It returns an empty value¹ (with no additional information²) if an invalid string is given:
$validUTF8 = (bool) preg_match('//u', $string);
Another possibility is mb_check_encoding [PHP Manual]:
$validUTF8 = mb_check_encoding($string, 'UTF-8');
Another function you can use is mb_detect_encoding [PHP Manual]:
$validUTF8 = ! (false === mb_detect_encoding($string, 'UTF-8', true));
It's important to set the strict parameter to true.
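The checks above can be wrapped in a single helper. This is only a minimal sketch: the function name is_valid_utf8 is hypothetical, the fallback order is an assumption, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper; the name and the fallback order are illustrative assumptions.
function is_valid_utf8(string $string): bool
{
    if (function_exists('mb_check_encoding')) {
        // Only available when the mbstring extension is loaded.
        return mb_check_encoding($string, 'UTF-8');
    }
    // Fallback: the empty pattern with the /u modifier only matches valid UTF-8 subjects.
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // complete two-byte sequence for "é"
var_dump(is_valid_utf8("caf\xE9"));     // lone ISO-8859-1 byte, incomplete in UTF-8
```

Comparing the preg_match result with === 1 keeps the fallback independent of whether the installed PHP version returns 0 or false for an invalid subject.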
Additionally, iconv [PHP Manual] allows you to change or drop invalid sequences on the fly. (However, if iconv encounters such a sequence, it emits a notice; this behavior cannot be changed.)
echo 'TRANSLIT : ', iconv("UTF-8", "ISO-8859-1//TRANSLIT", $string), PHP_EOL;
echo 'IGNORE : ', iconv("UTF-8", "ISO-8859-1//IGNORE", $string), PHP_EOL;
You can suppress the notice with the @ operator and compare the length of the returned string with the original:
strlen($string) === strlen(@iconv('UTF-8', 'UTF-8//IGNORE', $string));
Check the examples on the iconv manual page as well.
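To avoid the notice altogether, the validity check can be run before iconv is called. The following is a sketch under assumptions: the wrapper name utf8_to_latin1 and the convention of returning null for invalid input are hypothetical, and the behavior of //TRANSLIT for unrepresentable characters may differ between iconv implementations.

```php
<?php
// Hypothetical wrapper; the name and the null-on-failure convention are assumptions.
function utf8_to_latin1(string $string): ?string
{
    // Reject invalid input up front so iconv() never sees an incomplete sequence.
    if (preg_match('//u', $string) !== 1) {
        return null;
    }
    // //TRANSLIT asks iconv to approximate characters ISO-8859-1 cannot represent.
    $converted = iconv('UTF-8', 'ISO-8859-1//TRANSLIT', $string);
    return $converted === false ? null : $converted;
}

var_dump(utf8_to_latin1("caf\xC3\xA9")); // valid input is converted
var_dump(utf8_to_latin1("caf\xE9"));     // invalid input yields null, without a notice
```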
Remarks:
¹ preg_match() empty return value: 0 up to and including 5.3.3; false since 5.3.4. (Before 4.3.5, that is up to 4.3.4, the //u test is not useful because it returns 1 on the subject string "\x80", which is not a complete byte sequence in UTF-8, only a continuation byte at best; ref.)
² with no additional information: The original 0 return value does not carry any additional information by itself, nor does preg_match() yield a diagnostic message.
As outlined earlier in the comments, some more information can be obtained; in particular, a PREG_*_ERROR code is available when the match fails with an error.
This works by calling preg_last_error() (PHP >= 5.2) after preg_match() and testing the returned integer against PREG_BAD_UTF8_ERROR to identify that the subject string is not UTF-8.
For the diagnostic message, use preg_last_error_msg() (PHP >= 8); it returns the string "Malformed UTF-8 characters, possibly incorrectly encoded" (without the quotes) when the last error is PREG_BAD_UTF8_ERROR. (same ref)
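The second remark can be sketched as follows. The function name describe_utf8_problem is hypothetical, and the fallback to the constant name on PHP versions without preg_last_error_msg() is an assumption made for illustration.

```php
<?php
// Hypothetical helper: returns a diagnostic string for invalid UTF-8, or null if valid.
function describe_utf8_problem(string $string): ?string
{
    // The result is ignored on purpose; only the error state set by the call matters.
    preg_match('//u', $string);

    if (preg_last_error() === PREG_BAD_UTF8_ERROR) {
        // preg_last_error_msg() only exists in PHP >= 8; fall back to the constant name.
        return function_exists('preg_last_error_msg')
            ? preg_last_error_msg()
            : 'PREG_BAD_UTF8_ERROR';
    }
    return null;
}

var_dump(describe_utf8_problem("\x80")); // lone continuation byte
var_dump(describe_utf8_problem("abc"));  // plain ASCII is valid UTF-8
```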