There are a lot of questions and documentation about converting HTML entities and special characters to UTF-8 text in PHP. There is also the PHP documentation itself, such as the pages for htmlspecialchars_decode() and html_entity_decode(). However, I could not find any function or solution that clearly describes how to convert any HTML entities and special characters to UTF-8 text. All of them state something like "if you want to do this, then do that", etc. But no solution ever states "to get pure UTF-8 text that can be read by humans, do this".
My DB contains text. I would like to convert that text (which contains HTML entities and special characters) to UTF-8 text that I can display to the end user on the webpage. The text in the database is written in multiple languages (such as French, Arabic, English, etc.). All of those can contain HTML entities for special characters. So how can I convert all of that to UTF-8 text that can be read by humans who understand those languages? I would like to replace those entities with the actual characters they stand for.
The reason I am asking is that I really don't have a test case. I am reading from a database, and it is multilingual. The only guarantee is that the special characters are stored as HTML entities, and I need to convert those to UTF-8 in a way that can be read by humans who understand those languages. How can I do that? What is the proper way to sanitize/decode the input so it is pure text?
This works for me for decoding entities to UTF-8:
html_entity_decode($str, ENT_QUOTES | ENT_HTML5, 'UTF-8');
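Edit:
The "trick" is the combination of flags in the second parameter, plus passing the encoding explicitly in the third parameter. If you just call html_entity_decode($str); the result depends on the PHP version and configuration. Before PHP 5.4 the default encoding was ISO-8859-1, so the result would not be UTF-8, and before PHP 8.1 the default flags (ENT_COMPAT | ENT_HTML401) leave single-quote entities such as &#039; undecoded.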
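As a rough illustration of the call above, here is a minimal sketch that wraps it in a helper and runs it on a few made-up multilingual strings. The function name decode_entities_to_utf8 and the sample strings are assumptions for the example only; they are not from the question, and real database content may need extra handling (for instance, text that was encoded twice).

```php
<?php
// Hypothetical helper (the name is illustrative): decode named and numeric
// HTML entities, including both quote styles, into UTF-8 text.
function decode_entities_to_utf8(string $text): string
{
    // ENT_QUOTES decodes both double- and single-quote entities,
    // ENT_HTML5 selects the HTML5 entity table (which includes &apos;),
    // and 'UTF-8' makes the output encoding explicit.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Made-up sample rows standing in for multilingual database content.
$samples = [
    'Caf&eacute; &amp; cr&egrave;me',                 // French, named entities
    '&#1605;&#1585;&#1581;&#1576;&#1575;',            // Arabic, numeric entities
    'It&#039;s &quot;quoted&quot; &apos;text&apos;',  // quote entities
];

foreach ($samples as $sample) {
    echo decode_entities_to_utf8($sample), PHP_EOL;
}
```

Passing all three arguments explicitly keeps the behaviour the same regardless of the PHP version or the default_charset setting.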