html xml perl utf-8 latin1

Reading and Writing XML files with unknown encoding in Perl?


I am picking up pieces of someone else's large project and trying to right the wrongs. The problem is, I'm just not sure what the correct ways are.

So, I am cURLing a bunch of HTML pages, then writing them to files with simple commands like:

$src = `curl http://google.com`;
open FILE, ">output.html";
print FILE $src;
close FILE;

I want these files saved as UTF-8. What encoding are they actually saved in? I then read each HTML file back in using the same basic 'open' command, parse the HTML with regex calls, build one big string by concatenation, and write it to an XML file (using the same code as above). I have already started using XML::Writer instead, but now I must go through and fix the files that have the wrong encoding.

So, I don't have the HTML anymore, but I still have the XML files, which have to display the proper characters. Here is an example: http://filevo.com/wkkixmebxlmh.html

The main problem is detecting and replacing the character in question with a "\x{2019}" that displays in editors properly. But I can't figure out a regex to actually capture the character in the wild.

UPDATE:

I still cannot detect the ALT-0146 character that's in the XML file I uploaded to Filevo above. I've tried opening it as UTF-8 and searching for /\x{2019}/, /chr(0x2019)/, and just /’/, but nothing matches.


Solution

  • To make sure you are producing output in UTF-8, apply the utf8 layer to the output stream using binmode:

    open FILE, '>output.html';
    binmode FILE, ':utf8';
    

    or pass the layer in the 3-argument open call:

    open FILE, '>:utf8', 'output.html';
    

    Arbitrary input is trickier. If you are lucky, HTML input will tell you its encoding early on:

    wget http://www.google.com/ -O foo ; head -1 foo
    
    <!doctype html><html><head><meta http-equiv="content-type" content="text/html; 
    charset=ISO-8859-1"><title>Google</title><script>window.google=
    {kEI:"xgngTYnYIoPbgQevid3cCg",kEXPI:"23933,28505,29134,29229,29658,
    29695,29795,29822,29892,30111,30174,30215,30275,30562",kCSI:
    {e:"23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
    30174,30215,30275,30562",ei:"xgngTYnYIoPbgQevid3cCg",expi:
    "23933,28505,29134,29229,29658,29695,29795,29822,29892,30111,
    30174,30215,30275,30562"},authuser:0,ml:function(){},kHL:"en",
    time:function(){return(new Date).getTime()},
    

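    Ah, there it is: <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">. Now you can keep reading the input as raw bytes and decode those bytes with the known encoding. CPAN can help with this; for example, the Encode module (which also ships with Perl) provides a decode function that turns bytes in a named encoding into a Perl character string.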
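The decoding step described above could look roughly like the sketch below. It is only an illustration under a few assumptions: sniff_charset is a hypothetical helper (not part of any module), the sample bytes are made up, and a page labelled ISO-8859-1 is treated as Windows-1252 (cp1252), which is how browsers commonly handle that label. The last assumption matters for the character in the question: ALT-0146 produces byte 0x92, which cp1252 maps to U+2019, whereas strict ISO-8859-1 maps it to a control character, so /\x{2019}/ can only match after decoding as cp1252.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: pull the charset name out of the raw bytes
# (e.g. from a <meta> tag), falling back to a default if none is declared.
sub sniff_charset {
    my ($bytes, $default) = @_;
    return $1 if $bytes =~ /charset\s*=\s*["']?([\w.:-]+)/i;
    return $default;
}

# Made-up sample of raw bytes as curl might return them:
# 0xE9 is e-acute in Latin-1, 0x92 is the ALT-0146 "curly apostrophe" byte.
my $bytes = qq{<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">}
          . qq{<p>caf\xE9, it\x92s</p>};

my $declared = sniff_charset($bytes, 'UTF-8');

# Assumption: treat a declared ISO-8859-1 as Windows-1252, so that
# byte 0x92 decodes to U+2019 instead of a C1 control character.
my $charset = lc($declared) eq 'iso-8859-1' ? 'cp1252' : $declared;

# Bytes in the source encoding -> Perl character string.
my $text = decode($charset, $bytes);

# Only after decoding can a regex see the character as U+2019.
my $found_quote = $text =~ /\x{2019}/ ? 1 : 0;

# Character string -> UTF-8 bytes, ready to be written to a raw filehandle.
my $utf8_bytes = encode('UTF-8', $text);
```

Printing $text to a handle opened with '>:encoding(UTF-8)' should produce the same bytes as $utf8_bytes, so the explicit encode call is only needed when writing to a handle that has no encoding layer.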