python unicode character-encoding python-unicode ncr

Converting Hex NCR text representations to Unicode in Python

I have a string parsed from a web page, originally in Japanese:

若き葉末には風が立ち 森は翡翠の息を返す 雲の切れ間から注ぐ 光に君を見初めん

碧き瞳のほほえむとき そは鐘のひびき胸に打つ さしのべた腕に絡む 蔦の葉に君を逃す

残る　香り 水面をかけゆく恋の舟 つかの間の波に　揺られ

やさしき幻影　心に映るその姿よ 永遠なる君に　想いを捧げん

若き葉末には風は眠り 森は密やかに息を止む 抱きしめた腕のなかで 静かに君は消えゆく

月は　満ちて 黄金の羽根が舞いおちる 我はただひとり森に

祈りたまえや

However, during parsing it was converted into a string of hexadecimal numeric character references (hex NCRs) in the following form:

&#x82E5;&#x304D;&#x8449;&#x672B;&#x306B;&#x306F;&#x98A8;&#x304C;&#x7ACB;&#x3061;\n&#x68EE;&#x306F;&#x7FE1;&#x7FE0;&#x306E;&#x606F;&#x3092;&#x8FD4;&#x3059;\n&#x96F2;&#x306E;&#x5207;&#x308C;&#x9593;&#x304B;&#x3089;&#x6CE8;&#x3050;\n&#x5149;&#x306B;&#x541B;&#x3092;&#x898B;&#x521D;&#x3081;&#x3093;\n\n&#x78A7;&#x304D;&#x77B3;&#x306E;&#x307B;&#x307B;&#x3048;&#x3080;&#x3068;&#x304D;\n&#x305D;&#x306F;&#x9418;&#x306E;&#x3072;&#x3073;&#x304D;&#x80F8;&#x306B;&#x6253;&#x3064;\n&#x3055;&#x3057;&#x306E;&#x3079;&#x305F;&#x8155;&#x306B;&#x7D61;&#x3080;\n&#x8526;&#x306E;&#x8449;&#x306B;&#x541B;&#x3092;&#x9003;&#x3059;\n\n&#x6B8B;&#x308B;&#x3000;&#x9999;&#x308A;\n&#x6C34;&#x9762;&#x3092;&#x304B;&#x3051;&#x3086;&#x304F;&#x604B;&#x306E;&#x821F;\n&#x3064;&#x304B;&#x306E;&#x9593;&#x306E;&#x6CE2;&#x306B;&#x3000;&#x63FA;&#x3089;&#x308C;\n\n&#x3084;&#x3055;&#x3057;&#x304D;&#x5E7B;&#x5F71;&#x3000;&#x5FC3;&#x306B;&#x6620;&#x308B;&#x305D;&#x306E;&#x59FF;&#x3088;\n&#x6C38;&#x9060;&#x306A;&#x308B;&#x541B;&#x306B;&#x3000;&#x60F3;&#x3044;&#x3092;&#x6367;&#x3052;&#x3093;\n\n&#x82E5;&#x304D;&#x8449;&#x672B;&#x306B;&#x306F;&#x98A8;&#x306F;&#x7720;&#x308A;\n&#x68EE;&#x306F;&#x5BC6;&#x3084;&#x304B;&#x306B;&#x606F;&#x3092;&#x6B62;&#x3080;\n&#x62B1;&#x304D;&#x3057;&#x3081;&#x305F;&#x8155;&#x306E;&#x306A;&#x304B;&#x3067;\n&#x9759;&#x304B;&#x306B;&#x541B;&#x306F;&#x6D88;&#x3048;&#x3086;&#x304F;\n\n&#x6708;&#x306F;&#x3000;&#x6E80;&#x3061;&#x3066;\n&#x9EC4;&#x91D1;&#x306E;&#x7FBD;&#x6839;&#x304C;&#x821E;&#x3044;&#x304A;&#x3061;&#x308B;\n&#x6211;&#x306F;&#x305F;&#x3060;&#x3072;&#x3068;&#x308A;&#x68EE;&#x306B;\n\n&#x7948;&#x308A;&#x305F;&#x307E;&#x3048;&#x3084;

I want to convert this string back into proper Unicode text.

From my research I have gathered that, for example, &#x4E00; maps to the Unicode string b'\\u4e00'.

This can be done manually by stripping the &#x prefix and the trailing semicolon, prefixing \\u, lowercasing the whole thing, and converting it to a bytestring by adding a b before the string. This is done in this repo, but through the inefficient eval function, with code such as eval("b'\\u4e00'").

[EDIT: The above paragraph is incorrect. It is not a bytestring but a Unicode string, as written in Python 2. The correct mapping would be &#x4E00; -> u'\u4e00']

Is there a better way to do this, one that also covers edge cases where these hex references appear in the middle of regular text, as here:

Je me levais t&#xF4;t
Travailler en homme
Je me souviens du go&#xFB;t
Du caf&#xE9; br&#xFB;lant
Dans la tasse rouge
Et la femme qui dort
Les portes ouvertes de la grande usine
Bouffaient nos fils le jour de leurs quinze ans
On se levait t&#xF4;t
Sortis de nos draps
On se retrouvait en bas
Les rues du village s'allumaient d'un coup
A six heures moins le quart
Les portes ouvertes de la grande usine
Bouffaient nos fils bien avant leurs quinze ans
On se l&#xE8;ve trop t&#xF4;t
On sait plus quoi faire
Dans le caf&#xE9; des vieux
Les mains dans nos poches
Cachent nos poings noirs
Y'a plus qu'&#xE0; qui change pas
Les portes sont ferm&#xE9;es
Y'a plus de feu qui gronde
L'usine a tout vomi d'un seul coup
Pourquoi on fait &#xE7;a
Pourquoi &#xE7;a m'fait &#xE7;a
Pourquoi on nous fait &#xE7;a &#xE0; nous

I am dealing with a large set of data where such characters can be strewn anywhere, and I need a meaningful way to deal with them.

So is there a better way to do this? Ideally one that Python supports natively.

If someone has a solution to my problem here, I will be immensely grateful. Thanks in advance.

Solution

Have a look at the html module in the standard library:

>>> import html
>>> html.unescape('Je me levais t&#xF4;t')
'Je me levais tôt'
>>> html.unescape('&#x82E5;&#x304D;&#x8449;&#x672B;&#x306B;&#x306F;')
'若き葉末には'
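
For the mixed-text case from the question, the same call can be applied line by line. This is only a minimal sketch: the list name `lines` and the three sample lines are picked from the question for illustration, and a real data set would be read from wherever it is stored.

```python
import html

# Sample lines from the question: ordinary text with hex NCRs mixed in.
lines = [
    "Je me levais t&#xF4;t",
    "Du caf&#xE9; br&#xFB;lant",
    "Les portes ouvertes de la grande usine",  # no references at all
]

# html.unescape leaves plain text alone and replaces only the references.
decoded = [html.unescape(line) for line in lines]

for line in decoded:
    print(line)
```

Note that html.unescape converts every named and numeric character reference, not just the hexadecimal ones, so a string like '&amp;' also becomes '&'. If the data contains escaped markup that should stay escaped, that may not be what you want.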

The result is a Unicode string (type str in Python 3). Note that the b'...' notation is for byte strings. The literal b'\\u4e00' in your example does not make much sense, since it is a byte string with 6 characters (\, u, 4, e, 0, 0). You probably meant '\u4e00' (or u'\u4e00' in Python 2), which is a single-character Unicode string.
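
To make the difference concrete, and to show that the manual approach from the question needs no eval, here is a small sketch. The helper name `decode_hex_ncr` and the pattern `HEX_NCR` are made up for this example. It handles only hexadecimal references of the form &#x...; and does not guard against out-of-range values, for which chr raises ValueError. In most cases html.unescape is the simpler choice.

```python
import re

# Matches hexadecimal numeric character references such as &#x4E00;
HEX_NCR = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def decode_hex_ncr(text):
    """Replace each hex NCR in text with the character it names.

    Hypothetical helper for illustration; only hex references are handled.
    """
    # int(..., 16) parses the hex digits, chr() turns the code point into a str.
    return HEX_NCR.sub(lambda m: chr(int(m.group(1), 16)), text)


# The byte string holds six separate bytes: \ u 4 e 0 0
print(len(b'\\u4e00'))

# The Unicode string holds a single character, U+4E00
print(len('\u4e00'))

# The regex-based helper yields that same single character
print(decode_hex_ncr("&#x4E00;") == '\u4e00')
```

The regex route is useful mainly when only the hex references should be touched and other entities must be left as they are.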