I'm reading an HTML file using Java and am having some trouble with a Unicode character. The problematic statement is:
<span class="xml-lang" lang="cmn-Hant" xml:lang="cmn-Hant">𦮼</span>
The character in the file is 𦮼 (UTF-8 bytes f0 a6 ae bc),
whereas what I appear to read in is ম¼ (UTF-8 bytes e0 a6 ae c2 bc).
It's close but obviously wrong.
The file I'm reading is marked utf-8 (and I'm reading it in as utf-8) and has LOADS of other CJK strings that get read in perfectly.
I'm hoping someone can simply look at these byte sequences and explain how f0 became e0 and where the extra c2 came from.
Any ideas?
I wrote most of the code I was running 20 years ago and it has worked perfectly since then. The data was passing through several libraries that I had some amount of confidence in. I couldn't figure out who could be changing the data.
The first problem is that the character involved takes 4 bytes in UTF-8 and didn't print correctly in my output. In my code I try several fonts, and the last one I try is Unifont, which I was led to believe never fails since it contains all of the possible code points (HAHA, nope, not even close).
So I fired up the debugger in Eclipse to try to track down what was happening. When I looked at the data, that one character appeared to have been changed. But it really hadn't been: the data was perfect, and the debugger was showing me an inaccurate view of the world.
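For anyone who hits the same thing, one way to take both the debugger display and the font out of the picture is to print the raw values of the string. This is only a minimal sketch, not the code I was actually running; the class and method names (CodePointDump, utf8Hex, codePoints) are made up for illustration, and the string is built from the four bytes quoted in the question.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

class CodePointDump {
    /** Returns the UTF-8 encoding of s as space-separated lowercase hex bytes. */
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns the code points of s in U+XXXX notation, one entry per code point. */
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Decode the four bytes from the question exactly as a UTF-8 reader would.
        byte[] raw = {(byte) 0xf0, (byte) 0xa6, (byte) 0xae, (byte) 0xbc};
        String s = new String(raw, StandardCharsets.UTF_8);

        System.out.println(utf8Hex(s));                      // f0 a6 ae bc
        System.out.println(codePoints(s));                   // U+26BBC
        System.out.println(s.length());                      // 2 (a surrogate pair in UTF-16)
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```

If the hex dump and the code point match the file, the string in memory is intact, and whatever looks wrong is coming from how it is being displayed.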
It took a little while, but I finally figured out that my code and libraries were working perfectly and this ended up being a font problem. I'm not sure what the deal is with this 4-byte character, but none of the "typical" CJK fonts seems to have it. I eventually tracked down a font that contained the character, and now everything works fine.
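The font check itself can also be done in code rather than by eye. The sketch below assumes the fonts are handled through java.awt.Font, which may not match every setup; FontProbe and firstFontFor are hypothetical names, and the candidate list here is simply every font the JVM reports as installed.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class FontProbe {
    /** Returns the first candidate font that reports a glyph for the code point, if any. */
    static Optional<Font> firstFontFor(int codePoint, List<Font> candidates) {
        for (Font f : candidates) {
            // canDisplay(int) takes a full code point, so it works outside the BMP too.
            if (f.canDisplay(codePoint)) {
                return Optional.of(f);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        int cp = 0x26BBC; // the code point encoded by f0 a6 ae bc

        // Assumption: probing every installed font is acceptable; this can be slow.
        List<Font> installed = Arrays.asList(
                GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts());

        Optional<Font> hit = firstFontFor(cp, installed);
        System.out.println(hit.map(Font::getFontName)
                .orElse("no installed font covers this code point"));
    }
}
```

One caveat: if a font is requested by a name that is not installed, Java quietly substitutes a default font, so it is safer to probe Font objects that are known to be loaded than to trust the requested name.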