Tags: php, html, text, htmlpurifier

How to identify if a text is HTML or not? (in PHP)


I want to read text entries from a database. Some of them are actual HTML; others are plain text that might contain HTML markup, and that markup should be displayed literally as text.

Those that are plain text should then be converted to HTML, by first calling PHP's htmlspecialchars() function and then running the result through HTMLPurifier.

Or in other words, I'm looking for some tips on how to implement the isHTML() function:

$text = getTextFromDatabase();
if (!isHTML($text)) {
    $text = htmlspecialchars($text);
}
$purifier = new HTMLPurifier();
$clean_html = $purifier->purify($text);

So, for example, the following text would be run through htmlspecialchars:

The <p> tag of HTML has to be followed by a </p> tag to end the paragraph.

And the following text would not be run through htmlspecialchars:

<p>These are few lines of HTML.</p>
<div>There might be multiple independent</div>
<p>but valid HTML blocks in it.</p>

It seems like there should already be an isHTML() function out there, but I just can't seem to find it, and I don't want to reinvent the wheel :-). Maybe it's even possible to do this with some kind of HTMLPurifier setting?

Note that if the HTML code is buggy, this should be handled by HTMLPurifier, and the code should not be run through htmlspecialchars. :-) An example would be an opening <p> tag where there really should be a closing </p> tag.

Any help is appreciated, thanks already :-),
Robert.


Solution

  • You can only check whether the string contains characters specific to HTML, i.e. something that looks like a tag:

    function is_html($string)
    {
      // Heuristic: true if the string contains anything shaped like a tag, e.g. <p> or </div>.
      return preg_match("/<[^<]+>/", $string) === 1;
    }
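
Note that the pattern above also matches the plain-text example from the question ("The <p> tag of HTML ..."), because that sentence contains a tag as well. A stricter variant, sketched below, only treats the text as HTML when it starts with a tag. The function name starts_with_tag() is made up for illustration; it is not part of PHP or HTMLPurifier. It also rests on an assumption that may not hold for every database: entries stored as HTML begin with a tag, while plain-text entries begin with ordinary words.

```php
<?php
// Hypothetical helper, not part of PHP or HTMLPurifier.
// Assumption: an entry stored as HTML starts with a tag, while a plain-text
// entry that merely mentions tags starts with ordinary words.
function starts_with_tag(string $text): bool
{
    // Optional leading whitespace, then "<name", optional attributes, then ">".
    return preg_match('/^\s*<[a-z][a-z0-9]*\b[^>]*>/i', $text) === 1;
}

// The two examples from the question:
var_dump(starts_with_tag('The <p> tag of HTML has to be followed by a </p> tag.')); // bool(false)
var_dump(starts_with_tag("<p>These are few lines of HTML.</p>\n<div>More</div>")); // bool(true)
```

Buggy HTML such as an unclosed <p> still counts as HTML here and is left for HTMLPurifier to repair. Plain text that happens to begin with a tag would be misclassified, so this remains a heuristic.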