I have a string, which is returned by the Jericho HTML parser and contains some Russian text. According to source.getEncoding()
and the header of the respective HTML file, the encoding is Windows-1251.
How can I convert this string to something readable?
I tried this:
import java.io.UnsupportedEncodingException;
public class Program {
public void run() throws UnsupportedEncodingException {
final String windows1251String = getWindows1251String();
System.out.println("String (Windows-1251): " + windows1251String);
final String readableString = convertString(windows1251String);
System.out.println("String (converted): " + readableString);
}
private String convertString(String windows1251String) throws UnsupportedEncodingException {
return new String(windows1251String.getBytes(), "UTF-8");
}
private String getWindows1251String() {
final byte[] bytes = new byte[] {32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32};
return new String(bytes);
}
public static void main(final String[] args) throws UnsupportedEncodingException {
final Program program = new Program();
program.run();
}
}
The variable bytes
contains the data shown in my debugger, it's the result of net.htmlparser.jericho.Element.getContent().toString().getBytes()
. I just copy and pasted that array here.
This doesn't work - readableString
contains garbage.
How can I fix it, i. e. make sure that the Windows-1251 string is decoded properly?
Update 1 (30.07.2015 12:45 MSK): When change the encoding in the call in convertString
to Windows-1251
, nothing changes. See the screenshot below.
Update 2: Another attempt:
Update 3 (30.07.2015 14:38): The texts that I need to decode correspond to the texts in the drop-down list shown below.
Update 4 (30.07.2015 14:41): The encoding detector (code see below) says that the encoding is not Windows-1251
, but UTF-8
.
public static String guessEncoding(byte[] bytes) {
String DEFAULT_ENCODING = "UTF-8";
org.mozilla.universalchardet.UniversalDetector detector =
new org.mozilla.universalchardet.UniversalDetector(null);
detector.handleData(bytes, 0, bytes.length);
detector.dataEnd();
String encoding = detector.getDetectedCharset();
System.out.println("Detected encoding: " + encoding);
detector.reset();
if (encoding == null) {
encoding = DEFAULT_ENCODING;
}
return encoding;
}
I fixed this problem by modifying the piece of code, which read the text from the web site.
private String readContent(final String urlAsString) {
final StringBuilder content = new StringBuilder();
BufferedReader reader = null;
InputStream inputStream = null;
try {
final URL url = new URL(urlAsString);
inputStream = url.openStream();
reader =
new BufferedReader(new InputStreamReader(inputStream);
String inputLine;
while ((inputLine = reader.readLine()) != null) {
content.append(inputLine);
}
} catch (final IOException exception) {
exception.printStackTrace();
} finally {
IOUtils.closeQuietly(reader);
IOUtils.closeQuietly(inputStream);
}
return content.toString();
}
I changed the line
new BufferedReader(new InputStreamReader(inputStream);
to
new BufferedReader(new InputStreamReader(inputStream, "Windows-1251"));
and then it worked.