
How can I retrieve / decode HTML UTF-8 characters as Unicode?


When I visit any website that contains Unicode हिंदी (Hindi) text, the browser displays the content like this: ¤ªà¤•à¥�षी à¤•े à¤ªà¤¾à¤¸ à¤µà¥‹à¤¸à¤¾à¤°à¥€ à¤¸à¥�ख à¤¸à¥�विधाà¤�à¤� हैं, à¤œà¥‹ à¤‰à¤¨à¤•े à¤œà

How can I decode these characters and convert them into proper Unicode?


Solution

  • This is UTF-8 encoded Devanagari wrongly displayed as Windows-1252. If you reverse the direction, e.g.

    piconv -f utf-8 -t windows-1252 -s '¤ªà¤•à¥�षी के पास वोसारी सà¥�ख सà¥�विधाà¤�à¤� हैं, जो उनके जà'
    

    then you get parts of the original text back:

    ��क��?षी के पास वोसारी स��?ख स��?विधा��?��? हैं, जो उनके ज�
    

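    Your copy-paste operation made the decoding lossy here: bytes that have no Windows-1252 mapping (such as 0x81 and 0x8D) had already been turned into � before the text was pasted, so they cannot be recovered. Save the raw page to a file and feed that file to piconv instead of copy-pasting, so that no bytes are damaged along the way.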

    piconv ships with Perl.
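
If Perl is not at hand, the same reversal can be sketched in Python instead of piconv. This is only an illustration of the idea above, not a drop-in fix: the function name `fix_mojibake` is made up for this example, and it assumes the text was UTF-8 that got decoded as Windows-1252 (Python's `cp1252` codec).

```python
def fix_mojibake(garbled: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Hypothetical helper for illustration; assumes cp1252 was the wrong codec.
    """
    # Step 1: turn the garbled characters back into the original bytes.
    # Characters that are not in cp1252 (e.g. U+FFFD left behind by a
    # lossy copy-paste) cannot be mapped back and become b"?".
    raw = garbled.encode("cp1252", errors="replace")
    # Step 2: decode those bytes the way they were meant to be decoded.
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    original = "हिंदी"
    # Simulate the browser's mistake: UTF-8 bytes read as Windows-1252.
    garbled = original.encode("utf-8").decode("cp1252")
    print(garbled)
    print(fix_mojibake(garbled))  # prints हिंदी

    # A word whose UTF-8 bytes include 0x81, which cp1252 does not define.
    # Once that byte has been replaced by U+FFFD, the round trip is lossy.
    lossy = "सुख".encode("utf-8").decode("cp1252", errors="replace")
    print(fix_mojibake(lossy))
```

The first case round-trips cleanly because none of its bytes fall on an undefined Windows-1252 position. The second case shows why working from the raw bytes in a file is safer than working from pasted text.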