java android encoding utf-8 iso-8859-1

Downloading online text with different encodings


I am downloading text files that users upload, so a text can be encoded as UTF-8, ISO-8859-1, etc.

The problem is that I don't know which encoding the users are using. If a user has uploaded UTF-8 text, it works perfectly, but if a user has uploaded ISO-8859-1 text with accents (á, é, etc.), those characters are not displayed correctly.

I tried forcing the text encoding to UTF-8 with buffer.toString("UTF-8"), but that does not work in all cases.

This is my code:

javaUrl = new URL(URLParser.parse(textResource.getUrlStr()));
connection = javaUrl.openConnection();
connection.setConnectTimeout(2000);
connection.setReadTimeout(2000);
InputStream input = new BufferedInputStream(connection.getInputStream());
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
int nRead;
try {
    byte[] data = new byte[1024];
    while ((nRead = input.read(data, 0, data.length)) != -1) {
        buffer.write(data, 0, nRead);
    }
    buffer.flush();
    total = buffer.toString();
} finally {
    input.close();
    buffer.close();
}

Solution

  • Since you have multiple possible encodings and you don't know which one is correct, you have little choice but to use a CharsetDecoder here.

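    The plan is to read the content as a byte array (as your code already does), then try to decode it with each candidate encoding in turn.

    Here is one possible method to check whether a byte array decodes cleanly in a given encoding: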

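    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public boolean isCharset(final Charset charset, final byte[] contents)
    {
        // REPORT makes the decoder throw instead of silently
        // substituting a replacement character for bad input.
        final CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        final ByteBuffer buf = ByteBuffer.wrap(contents);

        try {
            decoder.decode(buf);
            return true;
        } catch (CharacterCodingException ignored) {
            // The bytes are not valid in this charset.
            return false;
        }
    }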
    

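    Try this with a set of candidate encodings, preferably starting with UTF-8. Put ISO-8859-1 last: every byte sequence is valid ISO-8859-1, so the check always succeeds for it and it can only serve as a fallback.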
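
The "try a set of encodings" step could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name CharsetGuesser, the method decode, and the candidate list (UTF-8 first, ISO-8859-1 as the catch-all) are assumptions made for the example, and a real application may need more candidates or a different order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, named for illustration only.
final class CharsetGuesser {

    // Assumed candidate order: strict encodings first, ISO-8859-1 last,
    // because ISO-8859-1 accepts any byte sequence.
    private static final List<Charset> CANDIDATES = Arrays.asList(
        Charset.forName("UTF-8"),
        Charset.forName("ISO-8859-1"));

    // Returns the text decoded with the first candidate that accepts the bytes.
    static String decode(final byte[] contents) {
        for (final Charset charset : CANDIDATES) {
            try {
                return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(contents))
                    .toString();
            } catch (CharacterCodingException ignored) {
                // Not valid in this charset; try the next candidate.
            }
        }
        // Unreachable while ISO-8859-1 is in the list, kept for safety.
        throw new IllegalArgumentException("No candidate charset matched");
    }
}
```

In the code from the question, this would replace the call to buffer.toString(), for example total = CharsetGuesser.decode(buffer.toByteArray()). Keep in mind that this is a heuristic: it tells you the bytes are valid in an encoding, not that the user actually wrote the text in it.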