An upstream service reads a stream of UTF-8 bytes, assumes they are ISO-8859-1, re-encodes them from ISO-8859-1 to UTF-8, and sends the result to my service, labeled as UTF-8.
The upstream service is out of my control. They may fix it, or it may never be fixed.
I know that I can fix the encoding by applying UTF-8 to ISO-8859-1 encoding then labeling the bytes as UTF-8. But what happens if my upstream fixes their issue?
Is there any way to detect this issue and fix the encoding only when I find a bad encoding?
I'm also not sure that the upstream encoding is ISO-8859-1. I think the upstream is written in Perl, so that encoding makes sense, and every sample I've tried decoded correctly when I applied ISO-8859-1 encoding.
When the source sends e2 9c 94 (✔) to my upstream, my upstream sends me c3 a2 c2 9c c2 94 (â followed by the two invisible control characters U+009C and U+0094).
✔ as bytes: e2 9c 94
e2 9c 94 as latin1 string: â (plus U+009C and U+0094)
that string as UTF-8 bytes: c3 a2 c2 9c c2 94
I can fix it by applying upstream.encode('ISO-8859-1').force_encoding('UTF-8'), but it will break as soon as the upstream issue is fixed.
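On the doubt about whether the upstream really assumes ISO-8859-1: the sample bytes above are enough to tell it apart from the similar Windows-1252, because the two differ in how they treat 0x9c and 0x94. A minimal sketch in Python, where the mangle helper and the choice of candidate codecs are illustrative assumptions and not anything known about the upstream:

```python
# Bytes received from the upstream for the check mark, taken from the sample above.
received = bytes.fromhex("c3a2c29cc294")
original = "\u2714".encode("utf-8")  # the check mark, e2 9c 94


def mangle(raw: bytes, assumed: str) -> bytes:
    """Decode raw bytes with the (wrongly) assumed codec, then re-encode as UTF-8."""
    return raw.decode(assumed).encode("utf-8")


for codec in ("iso-8859-1", "cp1252"):
    try:
        matches = mangle(original, codec) == received
    except UnicodeDecodeError:
        # Some bytes are undefined in cp1252; treat that as "no match".
        matches = False
    print(codec, matches)
```

For this one sample, only iso-8859-1 reproduces the received bytes. That supports the ISO-8859-1 guess but does not prove it for every input.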
Since you know how it is mangled, you can try to unmangle it: decode the received bytes as UTF-8, encode the result to latin1, and decode as UTF-8 again. Only your mangled strings, pure ASCII strings, or very unlikely latin-1 character combinations will make it through all three steps. If the latin1 encode or the second decode fails, assume the upstream was fixed and just decode once as UTF-8. A pure ASCII string gives the same result with either method, so there is no issue there. Some correctly encoded UTF-8 strings do survive the double decode, but they are unlikely to occur in normal text.
Here's an example in Python (you didn't mention a language...):
# Assume bytes are latin1, but return encoded UTF-8.
def bad(b):
    return b.decode('latin1').encode('utf8')

# Assume bytes are UTF-8, and pass them along.
def good(b):
    return b

# Try to undo the mangling; fall back to a plain UTF-8 decode.
def decoder(b):
    try:
        return b.decode('utf8').encode('latin1').decode('utf8')
    except UnicodeError:
        return b.decode('utf8')

b = '✔'.encode('utf8')
print(decoder(bad(b)))
print(decoder(good(b)))
Output:
✔
✔
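To make the caveat about false positives concrete, here is a small sketch of a correctly encoded string that the double decode silently changes. It redefines the same decoder so it runs on its own; the sample strings are illustrative choices, not data from your upstream:

```python
def decoder(b: bytes) -> str:
    """Undo one round of latin1 mangling if possible, else decode as plain UTF-8."""
    try:
        return b.decode('utf8').encode('latin1').decode('utf8')
    except UnicodeError:
        # Covers both the latin1 encode failing and the second decode failing.
        return b.decode('utf8')


# Correctly encoded text that happens to look like mojibake: U+00C3 U+00A9.
ambiguous = '\u00c3\u00a9'.encode('utf8')   # bytes c3 83 c2 a9
print(decoder(ambiguous))                    # comes back as U+00E9, not the original two characters

# Ordinary accented text from a fixed upstream is left alone, because
# its latin1 bytes are not valid UTF-8 and the fallback path is taken.
print(decoder('caf\u00e9'.encode('utf8')))
```

So once the upstream is fixed, only text that genuinely contains sequences such as Ã followed by © would be corrupted by this approach. If that risk matters for your data, you would need some out-of-band signal from the upstream instead of guessing from the bytes.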