android unicode utf-8

How to decode Russian-language content


I am trying to load several sites with content in different languages. Only the Russian content shows up as <?> characters. Please help me decode it into the correct symbols. My code samples:

RequestTask t = new RequestTask();
response = t.doIt("http://google.ru"); //troubles 
//response = t.doIt("http://stackoverflow.com"); //ok
//response = t.doIt("http://web.de/"); //ok
//response = t.doIt("http://www.china.com/"); // omg, it's ok too!

StatusLine statusLine = response.getStatusLine();

if (statusLine.getStatusCode() == HttpStatus.SC_OK) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    response.getEntity().writeTo(out);
    out.close();
    String response_string = new String(out.toByteArray(), "UTF-8");
}

Request code:

public class RequestTask {
    public HttpResponse doIt(String... uri) 
    throws ConnectTimeoutException, UnknownHostException, IOException{
        HttpParams params = new BasicHttpParams();
        HttpConnectionParams.setConnectionTimeout(params, 6000);
        HttpConnectionParams.setSoTimeout(params, 6000);
        HttpClient httpclient = new DefaultHttpClient(params);
        HttpResponse response = null;
        Log.d(this.toString(), "HTTP GET to " + uri[0]);
        response = httpclient.execute(new HttpGet(uri[0]));
        Log.d(this.toString(), "response: " + response.getStatusLine().getReasonPhrase());

        return response;
    }
}

Solution

  • I don't see any problem with google.ru itself; enca reports the page as Windows-1251 rather than UTF-8:

    $ wget google.ru
    [...skipped....]
    $ enca -L ru index.html 
    MS-Windows code page 1251
      LF line terminators
    

    Keep in mind that at least three other encodings are in more or less common use on pages with Russian content. Besides "UTF-8", I would most definitely check for "KOI8-R", "Windows-1251" and (not very popular) "Mac Cyrillic".

    You might be better off using something like this:

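    encodings = ("cp1251", "koi8-r")  # maybe some others...

    name = "index.html"  # the file saved by wget above
    with open(name, "rb") as f:
        data = f.read()

    result = ""
    for enc in encodings:
        try:
            result = data.decode(enc)  # Python 3: bytes -> str
            break
        except UnicodeDecodeError:
            continue

    if result:
        print(name + "\t: " + enc)
    else:
        print(name + "\t: unable to determine the encoding")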
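
The loop above guesses blindly, and single-byte code pages such as Windows-1251 and KOI8-R accept almost any byte sequence, so the first candidate usually "wins" even when it is wrong. A possible refinement, sketched here under assumptions, is to trust the charset declared in the HTTP `Content-Type` header first and fall back to guessing only when it is absent. The helper names `charset_from_content_type` and `decode_body` are hypothetical, as is the choice to try strict UTF-8 before the Cyrillic code pages.

```python
import re

# Fallback candidates, tried in order. UTF-8 goes first because it is strict;
# cp1251 and koi8-r rarely raise, so they only make sense as later guesses.
FALLBACK_ENCODINGS = ("utf-8", "cp1251", "koi8-r")


def charset_from_content_type(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type or "", re.IGNORECASE)
    return match.group(1) if match else None


def decode_body(body, content_type=None):
    """Decode body bytes with the declared charset, else with the fallbacks.

    Returns a (text, encoding_used) tuple.
    """
    declared = charset_from_content_type(content_type)
    candidates = ([declared] if declared else []) + list(FALLBACK_ENCODINGS)
    for enc in candidates:
        try:
            return body.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not recognise.
            continue
    raise ValueError("unable to determine the encoding")
```

The same idea carries over to the Java code in the question: read the charset from the response's Content-Type header instead of hardcoding "UTF-8" when building the string.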