
Is there a way to get just the ANSI characters from a string? Utf8Decode fails when the string contains emojis


First I get a TMemoryStream from an HTTP request, which contains the body of the response. Then I load it into a TStringList and save the text in a WideString (I also tried AnsiString).

The problem is that I need to convert the string: the users' language is Spanish, so vowels with accent marks are very common, and I need to store the info.

lServerResponse := TStringList.Create;
lServerResponse.LoadFromStream(lResponseMemoryStream);

lStringResponse := lServerResponse.Text;
lDecodedResponse := Utf8Decode(lStringResponse);

If (part of) the response is "Hólá Múndó", lStringResponse holds the undecoded UTF-8 bytes, which display as "HÃ³lÃ¡ MÃºndÃ³", and lDecodedResponse will be "Hólá Múndó".

But if the user adds any emoji (for example, with 😀 the text is "Hólá Múndó 😀" and lStringResponse displays as "HÃ³lÃ¡ MÃºndÃ³ ðŸ˜€"), Utf8Decode fails and returns an empty string. Is there a way to get just the ANSI characters from a string (or MemoryStream), or to remove whatever Utf8Decode can't convert?

Thanks for your time.


Solution

  • TMemoryStream is just raw bytes. There is no reason to load that stream into a TStringList just to extract a (Wide|Ansi)String from it. You can assign the bytes directly to an AnsiString/UTF8String using SetString() instead, eg:

    var
      lStringResponse: UTF8String;
      lDecodedResponse: WideString;
    begin
      SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
      lDecodedResponse := UTF8Decode(lStringResponse);
    end;
    

    Just make sure the HTTP content really is encoded as UTF-8, or else this approach will not work.

    That being said, UTF8Decode() (and UTF8Encode()) in Delphi 7 DO NOT support Unicode codepoints above U+FFFF, which is where most emojis live (😀 is U+1F600, a 4-byte sequence in UTF-8), so they fail on such input. That was fixed in Delphi 2009.

    To work around that issue in earlier versions, you can use the Win32 API MultiByteToWideChar() function instead, eg:

    uses
      ..., Windows;
    
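    function My_UTF8Decode(const S: UTF8String): WideString;
    var
      WLen: Integer;
    begin
      // First call: ask how many WideChars the decoded text needs.
      WLen := MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), nil, 0);
      if WLen > 0 then
      begin
        SetLength(Result, WLen);
        // Second call: decode into the buffer that was just allocated.
        MultiByteToWideChar(CP_UTF8, 0, PAnsiChar(S), Length(S), PWideChar(Result), WLen);
      end else
        Result := ''; // empty input, or the conversion failed
    end;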
    
    var
      lStringResponse: UTF8String;
      lDecodedResponse: WideString;
    begin
      SetString(lStringResponse, PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
      lDecodedResponse := My_UTF8Decode(lStringResponse);
    end;
    

    Alternatively, pass the stream's memory directly and skip the intermediate UTF8String:

    uses
      ..., Windows;
    
    function My_UTF8Decode(const S: PAnsiChar; const SLen: Integer): WideString;
    var
      WLen: Integer;
    begin
      WLen := MultiByteToWideChar(CP_UTF8, 0, S, SLen, nil, 0);
      if WLen > 0 then
      begin
        SetLength(Result, WLen);
        MultiByteToWideChar(CP_UTF8, 0, S, SLen, PWideChar(Result), WLen);
      end else
        Result := '';
    end;
    
    var
      lDecodedResponse: WideString;
    begin
      lDecodedResponse := My_UTF8Decode(PAnsiChar(lResponseMemoryStream.Memory), lResponseMemoryStream.Size);
    end;
    

    Or, use a third-party Unicode conversion library, like ICU or libiconv, which handles this for you.
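
If dropping the emojis is acceptable (the question also asks about removing whatever Utf8Decode can't convert), another option is to filter the raw bytes before decoding. The following is only a minimal sketch: StripUTF8Supplementary is a hypothetical helper name, not an RTL function, and it assumes the input is otherwise well-formed UTF-8. It removes every 4-byte sequence (lead byte 11110xxx plus its continuation bytes), which is how codepoints above U+FFFF are encoded.

```delphi
// Hypothetical helper (not part of the Delphi RTL).
// Removes all 4-byte UTF-8 sequences, i.e. codepoints above U+FFFF,
// and copies every other byte unchanged.
function StripUTF8Supplementary(const S: UTF8String): UTF8String;
var
  I, J, K, N: Integer;
begin
  N := Length(S);
  SetLength(Result, N); // the output can never be longer than the input
  I := 1;
  J := 0;
  while I <= N do
  begin
    if (Ord(S[I]) and $F8) = $F0 then
    begin
      // Lead byte of a 4-byte sequence: skip it, then skip up to
      // three continuation bytes (bit pattern 10xxxxxx).
      Inc(I);
      K := 0;
      while (I <= N) and (K < 3) and ((Ord(S[I]) and $C0) = $80) do
      begin
        Inc(I);
        Inc(K);
      end;
    end
    else
    begin
      // Any other byte is kept as-is.
      Inc(J);
      Result[J] := S[I];
      Inc(I);
    end;
  end;
  SetLength(Result, J); // trim to the number of bytes actually copied
end;
```

With this in place, something like `lDecodedResponse := UTF8Decode(StripUTF8Supplementary(lStringResponse));` should decode the remaining text in Delphi 7, with the emojis silently removed. If the emojis must be preserved, use the MultiByteToWideChar() approach instead.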