php html htmlpurifier

How do I stop htmlPurifier from automatically decoding html entities?


I have a strange issue. I use CKEditor 4 to collect formatted text from users as HTML. The HTML content is then filtered on the server using HTML Purifier.

When the user types curly quotes such as ”, ’, and “, CKEditor converts them into HTML entities like &rdquo;, &rsquo;, and &ldquo;, which is fine. The issue is that when I filter the content with HTML Purifier, these entities are automatically decoded back into the literal characters. This prevents the content from being presented to the user for later editing, because the quotes then show up garbled in strange ways in place of characters such as “.

How do I fix this? I think that if I could stop HTML Purifier from automatically decoding entities, this would work, but I am new to HTML Purifier and can't find a way.

I have tried running htmlentities on the content before passing it to HTML Purifier, but that encodes the whole HTML, tags included, so HTML Purifier no longer purifies anything.
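To illustrate why the htmlentities attempt fails, here is a minimal sketch. The sample input string is an assumption made up for the demonstration; only the htmlentities call itself comes from the attempt described above.

```php
<?php
// Hypothetical sample input: a paragraph containing curly quotes.
$html = "<p>\u{201C}Hello\u{201D}</p>";

// htmlentities() escapes the markup characters (< and >) as well as the
// curly quotes, so a purifier run afterwards sees plain text, not tags.
$encoded = htmlentities($html, ENT_QUOTES, 'UTF-8');

echo $encoded, PHP_EOL;
```

The opening tag comes out as &lt;p&gt;, so there is no markup left for HTML Purifier to filter.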


Solution

  • After CBroe's comment, I found out that my application is not using UTF-8 all the way through.

    I can't fix that either. For those in a similar situation, I found a workaround: HTML Purifier supports a configuration option that encodes all non-ASCII characters, with some trade-offs. That is fine in my case (I think).

    You can enable the HTML Purifier config option Core.EscapeNonASCIICharacters like so:

    $config->set('Core.EscapeNonASCIICharacters', true);
    

    That did the trick for me.


    This is the full function:

    /**
     * Purifies dirty html
     *
     * @param string $dirty_html
     * @return string
     */
    function purifyHtml($dirty_html)
    {
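        $config = HTMLPurifier_Config::createDefault();
        // Encoding of the input and output strings.
        $config->set('Core.Encoding', 'UTF-8');
        // Emit non-ASCII characters as numeric entities instead of raw bytes.
        $config->set('Core.EscapeNonASCIICharacters', true);
        $config->set('HTML.Doctype', 'HTML 4.01 Transitional');
        // Directory for the definition cache; getStoragePath() is an
        // application-specific helper, not part of HTML Purifier.
        $config->set('Cache.SerializerPath', getStoragePath('cache/html-purifier'));

        $htmlPurifier = new HTMLPurifier($config);
        return $htmlPurifier->purify($dirty_html);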
    }
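
To show roughly what escaping non-ASCII characters means for the output, here is a small sketch in plain PHP. It is not HTML Purifier's implementation: escapeNonAscii is a hypothetical helper written for illustration, and it assumes the mbstring extension is available and that the input is valid UTF-8.

```php
<?php
/**
 * Hypothetical helper (not part of HTML Purifier): replaces every
 * non-ASCII character in a UTF-8 string with a decimal numeric entity.
 */
function escapeNonAscii(string $utf8): string
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, full bit mask.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

echo escapeNonAscii("<p>\u{201C}Hi\u{201D}</p>"), PHP_EOL;
// prints: <p>&#8220;Hi&#8221;</p>
```

Because the result contains only ASCII bytes, it should survive storage and display layers that are not UTF-8 aware, which is why this kind of escaping works around the encoding mismatch. The tags are left untouched, unlike with htmlentities.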