javaencodingjericho-html-parser

How do I convert a Windows-1251 text to something readable?


I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding() and the header of the respective HTML file, the encoding is Windows-1251.

How can I convert this string to something readable?

I tried this:

import java.io.UnsupportedEncodingException;

public class Program {
    public void run() throws UnsupportedEncodingException {
        final String windows1251String = getWindows1251String();
        System.out.println("String (Windows-1251): " + windows1251String);
        final String readableString = convertString(windows1251String);
        System.out.println("String (converted): " + readableString);
    }
    private String convertString(String windows1251String) throws UnsupportedEncodingException {
        return new String(windows1251String.getBytes(), "UTF-8");
    }
    private String getWindows1251String() {
        final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
        return new String(bytes);
    }
    public static void main(final String[] args) throws UnsupportedEncodingException {
        final Program program = new Program();
        program.run();
    }
}

The variable bytes contains the data shown in my debugger, it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes(). I just copy and pasted that array here.

This doesn't work - readableString contains garbage.

How can I fix it, i. e. make sure that the Windows-1251 string is decoded properly?

Update 1 (30.07.2015 12:45 MSK): When change the encoding in the call in convertString to Windows-1251, nothing changes. See the screenshot below.

Screenshot

Update 2: Another attempt:

Second screenshot

Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.

Expected result

Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251, but UTF-8.

public static String guessEncoding(byte[] bytes) {
    String DEFAULT_ENCODING = "UTF-8";
    org.mozilla.universalchardet.UniversalDetector detector =
        new org.mozilla.universalchardet.UniversalDetector(null);
    detector.handleData(bytes, 0, bytes.length);
    detector.dataEnd();
    String encoding = detector.getDetectedCharset();
    System.out.println("Detected encoding: " + encoding);
    detector.reset();
    if (encoding == null) {
        encoding = DEFAULT_ENCODING;
    }
    return encoding;
}

Solution

  • I fixed this problem by modifying the piece of code, which read the text from the web site.

    private String readContent(final String urlAsString) {
        final StringBuilder content = new StringBuilder();
        BufferedReader reader = null;
        InputStream inputStream = null;
        try {
            final URL url = new URL(urlAsString);
            inputStream = url.openStream();
            reader =
                new BufferedReader(new InputStreamReader(inputStream);
    
            String inputLine;
            while ((inputLine = reader.readLine()) != null) {
                content.append(inputLine);
            }
        } catch (final IOException exception) {
            exception.printStackTrace();
        } finally {
            IOUtils.closeQuietly(reader);
            IOUtils.closeQuietly(inputStream);
        }
        return content.toString();
    }
    

    I changed the line

    new BufferedReader(new InputStreamReader(inputStream);
    

    to

    new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));
    

    and then it worked.