
Can Delphi 6 convert UTF-8 Portuguese to WideString?


I am using Delphi 6.

I want to decode a Portuguese UTF-8 encoded string to a WideString, but I found that it isn't decoding correctly.

The original text is "ANÁLISE8". After calling UTF8Decode(), the result is "ANALISE8". The acute accent over the second "A" is lost.

Here is the code:

var
  f : textfile;
  s : UTF8String;
  w, test : WideString;    
begin
  while not eof(f) do
  begin
    readln(f,s);
    w := UTF8Decode(s);

How can I decode the Portuguese UTF-8 string to WideString correctly?


Solution

  • Note that the implementation of UTF8Decode() in Delphi 6 is incomplete. Specifically, it does not support 4-byte encoded sequences, which are needed to represent Unicode codepoints above U+FFFF. This means UTF8Decode() can only decode codepoints in the UCS-2 range, not the full Unicode repertoire, which makes it basically useless in Delphi 6 (and all the way up to Delphi 2007; it was finally fixed in Delphi 2009).

    Try using the Win32 MultiByteToWideChar() function instead, e.g.:

    uses
      ..., Windows;
    
    function MyUTF8Decode(const s: UTF8String): WideString;
    var
      Len: Integer;
    begin
      // First call: query how many WideChars the decoded text needs
      Len := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), nil, 0);
      SetLength(Result, Len);
      // Second call: decode into the allocated buffer
      if Len > 0 then
        MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(s), Length(s), PWideChar(Result), Len);
    end;
    
    var
      f : textfile;
      s : UTF8String;
      w, test : WideString;
    begin
      while not eof(f) do
      begin
        readln(f,s);
        w := MyUTF8Decode(s);
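
    To check a decoder independently of the file-reading code, one option is to feed it the known UTF-8 bytes of ANÁLISE8 and dump the resulting UTF-16 code units in hex. The following is only a minimal sketch, assuming a Delphi 6/7 console application where string literals are plain byte strings; the program name Utf8Check and the constant SampleUtf8 are made up for illustration. It calls the RTL's UTF8Decode(), but MyUTF8Decode() could be swapped in the same way.

    ```delphi
    program Utf8Check;

    {$APPTYPE CONSOLE}

    uses
      SysUtils;

    const
      // Hypothetical test input: the UTF-8 bytes for 'ANÁLISE8'.
      // Á is U+00C1, which UTF-8 encodes as the two bytes C3 81.
      SampleUtf8 = 'AN'#$C3#$81'LISE8';

    var
      s: UTF8String;
      w: WideString;
      i: Integer;
    begin
      s := SampleUtf8;
      w := UTF8Decode(s);
      // Print each code unit as 4 hex digits so that the console
      // code page cannot hide or alter the accented character.
      for i := 1 to Length(w) do
        Write(IntToHex(Ord(w[i]), 4), ' ');
      WriteLn;
    end.
    ```

    If the decode is correct, the third value printed should be 00C1 (Á) rather than 0041 (plain A). Printing hex instead of the text itself keeps the check from depending on how the console or a non-Unicode VCL control renders the WideString.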
    

    That being said, your ANÁLISE8 string falls within the UCS-2 range, so I tested UTF8Decode() in Delphi 6 and it decoded the UTF-8 encoded form of ANÁLISE8 just fine. I would conclude that either: