php html parsing pre phpquery

Parsing HTML with phpQuery: how to handle C++ code inside a pre tag?


In the database I have some content like this:

Some text
<pre>
#include <cstdio> 

int x = 1;
</pre>
Some text

When I try to parse this with phpQuery, it fails because <cstdio> is interpreted as a tag.
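To see the problem in isolation, here is a minimal sketch that uses PHP's built-in DOMDocument instead of phpQuery. That substitution is an assumption made so the snippet needs no extra library; as far as I know phpQuery sits on top of the same DOM extension, so the behaviour should be comparable. The snippet counts the elements named "cstdio" that the parser creates from the include line.

```php
<?php
// Assumption: DOMDocument (libxml) stands in for phpQuery in this sketch.
$html = "Some text\n<pre>\n#include <cstdio>\n\nint x = 1;\n</pre>\nSome text";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

// If the include line was read as markup, the tree contains a "cstdio" element.
$count = $doc->getElementsByTagName('cstdio')->length;
echo "cstdio elements found: $count\n";
```

Any non-zero count means the C++ header name ended up in the tree as an element rather than as text.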

I could use htmlspecialchars, but to apply it only inside pre tags I would still need to do some parsing. I could use a regex, but that gets much harder (I would need to handle the possible attributes of the pre tag), and the whole point of using a parser was to avoid that kind of regex work.

What's the best way to do this?


Solution

  • I finally went the regex way, handling only simple attributes on the pre and code tags (no '>' inside the attributes):

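      foreach (array('pre', 'code') as $sTag) {
         // Match <tag attrs>content</tag>; the attributes must not contain '>'.
         $s = preg_replace_callback("#<($sTag)\\b([^>]*)>(.*?)</$sTag>#si",
         function ($matches) {
            // Decode entities that are already present so they are not encoded twice.
            // '&amp;' is handled last so that '&amp;lt;' becomes '&lt;' and not '<'.
            $content = str_replace(array('&lt;', '&gt;', '&amp;'), array('<', '>', '&'), $matches[3]);
            // $matches[2] already carries its leading space, so no separator is added.
            return "<{$matches[1]}{$matches[2]}>".htmlentities($content, ENT_COMPAT, "UTF-8")."</{$matches[1]}>";
         },
         $s);
      }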
    

    It also deals with characters that have already been converted to HTML entities (we don't want to encode them twice).

    Not a perfect solution, but given the data I need to apply it to, it does the job.
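For reuse, the same idea can be wrapped in a function. This is only a sketch under a few assumptions: the name escapeCodeBlocks is made up for illustration, and it swaps the manual str_replace/htmlentities pair for htmlspecialchars_decode and htmlspecialchars, which round-trip the same set of characters (&, <, > and double quotes). Like the regex above, it handles only attributes without '>' and does not cope with nested pre/code blocks.

```php
<?php
// Hypothetical helper: escapes the content of the given tags so that an
// HTML parser no longer sees things like <cstdio> as markup.
function escapeCodeBlocks(string $html, array $tags = ['pre', 'code']): string
{
    foreach ($tags as $tag) {
        // <tag attrs>content</tag>, case-insensitive, content may span lines.
        $pattern = '#<(' . preg_quote($tag, '#') . ')\b([^>]*)>(.*?)</\1>#si';
        $html = preg_replace_callback($pattern, function (array $m): string {
            // Undo existing entities first so nothing ends up encoded twice.
            $raw = htmlspecialchars_decode($m[3], ENT_COMPAT);
            return "<{$m[1]}{$m[2]}>"
                . htmlspecialchars($raw, ENT_COMPAT, 'UTF-8')
                . "</{$m[1]}>";
        }, $html);
    }
    return $html;
}

$input  = "Some text\n<pre class=\"cpp\">\n#include <cstdio>\n\nint x = 1;\n</pre>\nSome text";
$output = escapeCodeBlocks($input);
echo $output, "\n";
```

Because existing entities are decoded before being re-encoded, running the function a second time on its own output leaves the text unchanged, so it is safe to apply to rows that were already processed.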