utf-8iso-8859-1

How to detect and fix incorrect character encoding


A upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, applies ISO-8859-1 to UTF-8 encoding, and sends them to my service, labeled as UTF-8.

The upstream service is out of my control. They may fix it, it may never be fixed.

I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?

Is there any way to detect this issue and fix the encoding only when I find a bad encoding?

I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is perl so that encoding makes sense and each sample I've tried decoded correctly when I apply ISO-8859-1 encoding.


When the source sends e4 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â).

I can fix it applying upstream.encode('ISO-8859-1').force_encoding('UTF-8') but it will break as soon as the upstream issue is fixed.


Solution

  • Since you know how it is mangled, you can try to unmangle it by decoding the received UTF-8 bytes, encoding to latin1, and decoding as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 string combinations will successfully decode twice. If that decoding fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string will correctly decode with either method so there is no issue there as well. There are valid UTF-8-encoded sequences that survive a double-decode but they are unlikely to occur in normal text.

    Here's an example in Python (you didn't mention a language...):

        # Assume bytes are latin1, but return encoded UTF-8.
        def bad(b):
            return b.decode('latin1').encode('utf8')
        
        # Assume bytes are UTF-8, and pass them along.
        def good(b):
            return b
        
        def decoder(b):
            try:
                return b.decode('utf8').encode('latin1').decode('utf8')
            except UnicodeError:
                return b.decode('utf8')
        
        b = '✔'.encode('utf8')
        print(decoder(bad(b)))
        print(decoder(good(b)))
    

    Output:

    ✔
    ✔