delphi character-encoding ararat-synapse

HttpGetText(), autodetect charset, and convert source to UTF8


I'm using HttpGetText with Synapse for Delphi 7 Professional to get the source of a web page - but feel free to recommend any component and code.

The goal is to save some time by 'unifying' non-ASCII characters into a single charset, so I can process every page with the same Delphi code.

So I'm looking for something similar to "Select All" followed by "Convert to UTF-8 without BOM" in Notepad++, if you know what I mean. ANSI instead of UTF-8 would also be okay.

The web pages come in three encodings: UTF-8, ISO-8859-1 (roughly Windows-1252, i.e. "ANSI"), and plain HTML 4 without a charset declaration, i.e. ASCII with HTML-encoded characters such as &Aring; in the content.

If I need to code a PHP page that does the conversion, that's fine too. Whatever is the least code / time.


Solution

  • Instead, I did the reverse conversion with GpTextStream directly after retrieving the HTML. Converting the documents to ISO-8859-1 let me process them with plain Delphi string code, which avoided quite a few code changes. On output, all the data was converted to UTF-8 :)

    Here's some code. Perhaps not the prettiest solution, but it got the job done in less time. Note that it does the reverse conversion (UTF-8 to ISO-8859-1).

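    // Assumed uses clause: SysUtils, Classes, GpTextStream (adjust to your project).
    procedure UTF8FileTo88591(const fileName: string);
    const
      bufsize = 1024*1024;
    var
      fs1, fs2: TFileStream;
      ts1, ts2: TGpTextStream;
      buf: PChar;
      siz: Integer;

      procedure LG2(const ss: string);
      begin
        // don't log for now.
      end;

    begin
      // Start from nil so the finally block is safe even if a constructor raises.
      fs1 := nil; fs2 := nil; ts1 := nil; ts2 := nil; buf := nil;
      try
        fs1 := TFileStream.Create(fileName, fmOpenRead);
        fs2 := TFileStream.Create(fileName + '_ISO88591.txt', fmCreate);
        // ISO-8859-1 is compatible enough for my purposes with the default
        // 'Windows/Notepad' CP 1252 ANSI, the Swedish ANSI codepage, Latin1 etc.
        // It also works for ASCII sources with htmlencoded accent chars, naturally.
        LG2('Files opened OK.');
        GetMem(buf, bufsize);
        ts1 := TGpTextStream.Create(fs1, tsaccRead, [], CP_UTF8);
        ts2 := TGpTextStream.Create(fs2, tsaccWrite, [], ISO_8859_1);
        // Copy in chunks so input larger than one buffer is not cut off
        // (the first version did a single read; assumes Read returns 0 at end of stream).
        repeat
          siz := ts1.Read(buf^, bufsize);
          LG2(IntToStr(siz) + ' bytes read.');
          if siz > 0 then ts2.Write(buf^, siz);
        until siz <= 0;
        LG2('Bytes read and written OK.');
      finally
        // Free the text streams before the file streams they wrap.
        FreeAndNil(ts1); FreeAndNil(ts2);
        FreeAndNil(fs1); FreeAndNil(fs2);
        if buf <> nil then FreeMem(buf);
        LG2('Everything freed OK.');
      end;
    end; // UTF8FileTo88591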
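
    For the direction the question originally asked about (everything to UTF-8), a minimal sketch could look like the following. It is not what I ended up using. It assumes Delphi 7, where string is AnsiString and AnsiToUtf8 comes from the System unit; IsValidUtf8 and ToUtf8 are made-up helper names. The byte check is a simplified heuristic, and a charset from the Content-Type header or a meta tag would be a more reliable signal when it is present.

    ```pascal
    // Sketch only: IsValidUtf8 and ToUtf8 are hypothetical helper names.
    // Assumes Delphi 7 (string = AnsiString).

    // Returns True if S is structurally well-formed UTF-8 (pure ASCII also passes).
    // Simplified: overlong forms and surrogate code points are not rejected.
    function IsValidUtf8(const S: AnsiString): Boolean;
    var
      i, n, follow: Integer;
      b: Byte;
    begin
      Result := False;
      i := 1;
      n := Length(S);
      while i <= n do
      begin
        b := Ord(S[i]);
        if b < $80 then follow := 0                  // 0xxxxxxx: ASCII
        else if (b and $E0) = $C0 then follow := 1   // 110xxxxx: 2-byte sequence
        else if (b and $F0) = $E0 then follow := 2   // 1110xxxx: 3-byte sequence
        else if (b and $F8) = $F0 then follow := 3   // 11110xxx: 4-byte sequence
        else Exit;                                   // stray continuation or invalid lead byte
        Inc(i);
        while follow > 0 do
        begin
          // every continuation byte must look like 10xxxxxx
          if (i > n) or ((Ord(S[i]) and $C0) <> $80) then Exit;
          Inc(i);
          Dec(follow);
        end;
      end;
      Result := True;
    end;

    // Normalises a downloaded page to UTF-8.
    function ToUtf8(const Raw: AnsiString): AnsiString;
    begin
      if IsValidUtf8(Raw) then
        Result := Raw               // already UTF-8, or plain ASCII with HTML entities
      else
        Result := AnsiToUtf8(Raw);  // treat as the system ANSI code page, e.g. Windows-1252
    end;
    ```

    With Synapse this would be called on the downloaded text, for example ToUtf8(page.Text) after HttpGetText(url, page) has filled a TStringList. HTML entities pass through unchanged, since they are plain ASCII, so they still need decoding separately if you want real characters.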