I am fetching a JSON string as a response and converting it to a JSON object.
In the image above, the description String shows a strange ? character on a colored background. Stepping through with the debugger, I found that the problem appears after converting the JSON string to a JsonObject. This is the code (mm is the JSON string):
JsonObject con = getCon(mm);

private JsonObject getCon(String mm) {
    var file = new String(mm.getBytes(), StandardCharsets.UTF_8);
    return new GsonBuilder().create()
            .fromJson(file, JsonObject.class)
            .getAsJsonObject("dict")
            .getAsJsonObject("con");
}
I changed the first line of the method to var file = new String(mm.getBytes("UTF-8"), StandardCharsets.UTF_8);
After this, the description String looks like the last line in the attached image. This is really confusing, and I am not sure what is going wrong. The actual String in the JSON is something like Post Approval - Completed, Post Approval - Pending
There are a lot of description attributes in the JSON string and this is happening only for a few of them. How can I debug this further?
Gson works only on chars, for example in the form of a String or from a Reader. So any encoding issues you encounter most likely happen before Gson is called.
The reason new String(mm.getBytes(), StandardCharsets.UTF_8); causes encoding issues is that String.getBytes() uses the platform default charset. On Java 17 and older this is derived from the OS settings, so it is often not UTF-8 and might not even support all Unicode characters (Java 18 and newer default to UTF-8 unless configured otherwise). Decoding those bytes as UTF-8 afterwards produces incorrect results. There is rarely a good reason to use String.getBytes() without a Charset parameter; code analysis tools often flag this as a warning. Policeman's Forbidden API Checker might be useful for you, since it detects usage of error-prone methods like this.
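A minimal sketch of that failure mode, using only the JDK. The class name, the sample text, and the choice of ISO-8859-1 as a stand-in for a non-UTF-8 platform default are assumptions made for illustration; the real default on your machine may differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetMismatchDemo {
    // Encodes the text with the given charset, then decodes the bytes as UTF-8.
    static String encodeThenDecodeAsUtf8(String text, Charset encodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical sample text containing non-ASCII characters.
        String original = "r\u00e9sum\u00e9";

        // ISO-8859-1 plays the role of a non-UTF-8 platform default (assumption).
        String broken = encodeThenDecodeAsUtf8(original, StandardCharsets.ISO_8859_1);
        // Using the same charset in both directions keeps the text intact.
        String intact = encodeThenDecodeAsUtf8(original, StandardCharsets.UTF_8);

        // The mismatched variant contains U+FFFD, the replacement character
        // that many tools render as a question mark in a diamond.
        System.out.println("mismatch contains U+FFFD: " + broken.contains("\uFFFD"));
        System.out.println("matching charsets intact: " + intact.equals(original));
    }
}
```

Only the non-ASCII characters are damaged, which would be consistent with just a few description values being affected.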
Your adjusted code new String(mm.getBytes("UTF-8"), StandardCharsets.UTF_8) is effectively a no-op: you first convert a String to byte[] using UTF-8 and then reverse this again. (The only effect this might have is that incomplete surrogate pairs are replaced.)
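The round trip can be checked in isolation. The sketch below uses hypothetical sample values; in particular, the en dash in the first string is an assumption about what the real data might contain.

```java
import java.nio.charset.StandardCharsets;

class Utf8RoundTripDemo {
    // Same operation as the adjusted code: encode to UTF-8, then decode as UTF-8.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Well-formed text, including a non-ASCII character (en dash, an assumption).
        String wellFormed = "Post Approval \u2013 Completed";
        // A high surrogate without its low surrogate: not a valid code point on its own.
        String loneSurrogate = "x\uD83D";

        System.out.println("well-formed unchanged: " + roundTrip(wellFormed).equals(wellFormed));
        System.out.println("lone surrogate unchanged: " + roundTrip(loneSurrogate).equals(loneSurrogate));
    }
}
```

If the String is already damaged before this line runs, the round trip preserves the damage; it cannot restore the original characters.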
To debug this further you would have to check where the value of mm is coming from and at which point (if any) it still has the correct value. If you are reading it from a file, make sure you specify the correct encoding. Possibly it is not using UTF-8; editors such as VS Code and Notepad++ can automatically detect the encoding and show it.
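One way to see at which point the value goes wrong is to log the String with every non-ASCII code point spelled out, so the output does not depend on the console or debugger font. This is only a sketch; the class and method names are hypothetical.

```java
class CodePointDump {
    // Returns the text with every character outside printable ASCII
    // written as <U+XXXX>, so damaged characters become visible in logs.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (cp < 0x20 || cp > 0x7E) {
                sb.append(String.format("<U+%04X>", cp));
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values: an intact en dash versus an already damaged one.
        System.out.println(describe("Approval \u2013 Done")); // Approval <U+2013> Done
        System.out.println(describe("Approval \uFFFD Done")); // Approval <U+FFFD> Done
    }
}
```

Calling describe(mm) right where mm is obtained, and again just before getCon, shows whether the replacement character U+FFFD is already present in the input.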
If the value comes from an HTTP response, verify that you are respecting the charset specified by the server in the Content-Type header. While the latest JSON specification says UTF-8 must be used, maybe the server is specifying a different encoding for whatever reason.
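As a rough sketch of what respecting the header could look like when you have the raw response bytes, the helper below picks the charset from a Content-Type value and falls back to UTF-8. The class and method names are hypothetical, the parsing is deliberately simplified, and many HTTP clients already do this for you.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class ContentTypeCharset {
    // Extracts the charset parameter from a Content-Type header value,
    // falling back to UTF-8 when it is missing or not supported.
    static Charset charsetFrom(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break; // illegal or unsupported charset name
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Decodes the raw response body using the charset announced by the server.
    static String decodeBody(byte[] body, String contentType) {
        return new String(body, charsetFrom(contentType));
    }

    public static void main(String[] args) {
        System.out.println(charsetFrom("application/json; charset=ISO-8859-1"));
        System.out.println(charsetFrom("application/json"));
    }
}
```

The decoded String can then be passed to Gson directly, without any further getBytes() conversion.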