sqlregexpostgresqlreplace

PostgreSQL - Replace HTML Entities


I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(

So I started writing a bunch of queries that look like;

UPDATE nodes SET name=regexp_replace(name, 'à', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, 'á', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, 'â', 'â', 'g') WHERE name LIKE '%#xe2%';

Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...

Shall I just press on with the queries? There will probably only be 40 or so of them.


Solution

  • Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities

    Of course you need to have perl installed and pl/perl available.

    1) First of all create the procedural language pl/perlu:

    CREATE EXTENSION plperlu;
    

    2) Then create a function like this:

    CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
        use HTML::Entities;
        return decode_entities($_[0]);
    $$ LANGUAGE plperlu;
    

    3) Then you can use it like this:

    select decode_html_entities('aaabbb&.... asasdasdasd …');
       decode_html_entities    
    ---------------------------
     aaabbb&.... asasdasdasd …
    (1 row)