I am downloading text from the web that was uploaded by users, so it can be encoded as UTF-8, ISO-8859-1, etc.
The problem is that I don't know which encoding each user used. If the user uploaded UTF-8 text it works perfectly, but if the user uploaded ISO-8859-1 text with accents (á, é, etc.), these characters are not shown correctly.
I tried forcing the encoding to UTF-8 (buffer.toString("UTF-8")), but that does not work in all cases.
This is my code:
javaUrl = new URL(URLParser.parse(textResource.getUrlStr()));
connection = javaUrl.openConnection();
connection.setConnectTimeout(2000);
connection.setReadTimeout(2000);
InputStream input = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int nRead;
try {
    byte[] data = new byte[1024];
    while ((nRead = input.read(data, 0, data.length)) != -1) {
        buffer.write(data, 0, nRead);
    }
    buffer.flush();
    total = buffer.toString();
} finally {
    input.close();
    buffer.close();
}
Since there are several possible encodings and you don't know which one is correct, you have little choice but to use a CharsetDecoder here.
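The plan:
- read the whole InputStream from the connection;
- store its content in a byte[] array;
- try to decode that array with each candidate encoding.
Here is one possible method to find the correct encoding: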
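import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Returns true if contents can be decoded with the given charset without errors.
public boolean isCharset(final Charset charset, final byte[] contents)
{
    // REPORT makes decode() throw on invalid input rather than replace it.
    final CharsetDecoder decoder = charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT);
    final ByteBuffer buf = ByteBuffer.wrap(contents);
    try {
        decoder.decode(buf);
        return true;
    } catch (CharacterCodingException ignored) {
        return false;
    }
}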
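Try this with a set of candidate encodings, preferably starting with UTF-8. ISO-8859-1 maps every possible byte to a character, so the check never fails for it; keep it last, as the fallback.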
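Putting the pieces together, here is a minimal sketch of the whole approach: read the stream into a byte[] array, then return the text decoded with the first candidate encoding that accepts it. The class name CharsetGuesser, the method names readAll and decodeWithFirstMatch, and the candidate list are assumptions made for illustration; adjust the list to the encodings you actually expect.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the original code.
final class CharsetGuesser {
    // Assumed candidate order: strict encodings first, ISO-8859-1 last,
    // because ISO-8859-1 accepts any byte sequence.
    private static final List<Charset> CANDIDATES =
        Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

    // Reads the stream to the end and returns the raw bytes, undecoded.
    static byte[] readAll(InputStream input) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] data = new byte[1024];
        int nRead;
        while ((nRead = input.read(data, 0, data.length)) != -1) {
            buffer.write(data, 0, nRead);
        }
        return buffer.toByteArray();
    }

    // Returns the text decoded with the first candidate that reports no error.
    static String decodeWithFirstMatch(byte[] contents) {
        for (Charset charset : CANDIDATES) {
            try {
                return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(contents))
                    .toString();
            } catch (CharacterCodingException e) {
                // This charset does not fit; try the next candidate.
            }
        }
        throw new IllegalArgumentException("No candidate charset could decode the input");
    }
}
```

In the question's code this would replace the read loop and the buffer.toString() call, for example: total = CharsetGuesser.decodeWithFirstMatch(CharsetGuesser.readAll(input)). Keep in mind that this is only a heuristic: a successful decode shows that the bytes are valid in that encoding, not that the user actually used it.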