postgresql stored-procedures utf-8 latin1 plperl

PostgreSQL: character of encoding "UTF8" has no equivalent in "LATIN1" in plperl stored procedure

I have this stored procedure written in Pl/Perl:

CREATE FUNCTION strip_html_tags(text) RETURNS TEXT AS $$
    use HTML::Strip;
    my $hs = HTML::Strip->new();
    my $clean_text = $hs->parse($_[0]);
    $hs->eof;
    return $clean_text;
$$ LANGUAGE plperlu;

I have some fields in my database (LATIN1 encodend) that probably have some invalid chars because i get things like:

db=# select strip_html_tags(field) from table;
ERROR:  character 0xe2809c of encoding "UTF8" has no equivalent in "LATIN1"
CONTEXT:  PL/Perl function "strip_html_tags"

I've tried using PostgreSQL's convert() and convert_from() to try to change the encoding but without any luck. Any ideas?

Thanks in advance.

Solution

I imagine what's happening here is that strip_html_tags is decoding HTML entities into native Unicode code-points expressed as utf-8 encoded text. 0xe2809c decoded as a utf-8 byte sequence is the unicode code point U+201c LEFT DOUBLE QUOTATION MARK - the character “, which is entirely believable as something you'd get from a decoded escape in HTML, especially HTML produced by GUI editors or MS Word. It'd be represented as “, “ (decimal) or “ (hex) in HTML.

Since your database encoding is latin-1, you cannot represent many of these decoded characters in the database.

You should really think about changing your database to utf-8 if you're going to work with full unicode data. This isn't usually hard if your DB is really in latin-1 not (ugh) SQL_ASCII; just dump the database, make a new DB with ENCODING 'utf-8' and load the data into it to verify and check it. Test your applications against the converted database and make sure they deal correctly with unicode text. When you're happy, stop your app, dump the db again, reload it again, rename the old DB and then rename the new DB to have the same name the old one used to.

If you're willing to mangle your HTML you can use Perl modules features to do a lossy encoding conversion from UTF-8 to Latin-1. There are Perl modules that'll do things like replace “ with ", — (em dash) with - (minus sign), etc, and can strip non-replacable chars or replace them with a substitute char like "?". This is a one-way, lossy conversion; you can't get the original data back if you don't keep a copy of the original unchanged version.

Your only other alternative is to return the data as bytea - byte strings in utf-8 encoding - and then decode them back to text in your application. I really do not recommend this.