I'm trying to read UTF-16 text from a GSM module (SIM800L). It gives me:
0633064406270645 06280647 0647064506af06cc
rather than:
\u0633\u0644\u0627\u0645 \u0628\u0647 \u0647\u0645\u06af\u06cc
I tried many ways to add '\u' to the first string, or even to convert it to bytes, but every time Python treats the result as plain ASCII characters.
For example:
> Str = r'\u' + Str
Result: \\u0633064406270645 06280647 0647064506af06cc
Because of the double backslash, Python doesn't recognize it as UTF-16.
I am looking for any method to convert the output of GSM module to Unicode.
Use a combination of the modules re (to replace each hexadecimal quadruplet with the corresponding character) and json (to handle surrogate pairs correctly).
Note: I added a valid surrogate pair (D83DDE0E) as well as a noncharacter (FFFE) to the hard-coded string, merely for debugging purposes:
import re
import json

def repl_unicode(matchobj):
    # Turn one group of four hex digits into the character with that code.
    mo_int = int(matchobj.group(0), 16)
    return chr(mo_int)

text_16 = '0633064406270645 06280647 0647064506af06cc D83DDE0E FFFE'
pattern = '[0-9A-Fa-f]{4}'  # exactly four hexadecimal digits
# The dumps/loads round trip merges the two surrogate halves into one character.
text_u8 = json.loads(json.dumps(re.sub(pattern, repl_unicode, text_16)))
print(text_16)
print(text_u8)
print(json.dumps(text_u8, ensure_ascii=True).strip('"'))
Output: .\SO\78628940.py
0633064406270645 06280647 0647064506af06cc D83DDE0E FFFE
سلام به همگی 😎
\u0633\u0644\u0627\u0645 \u0628\u0647 \u0647\u0645\u06af\u06cc \ud83d\ude0e \ufffe
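A possibly simpler alternative, offered only as a sketch: if the module's output really is hex-encoded big-endian UTF-16 code units (an assumption based on the sample above, not something confirmed for every SIM800L mode), each space-separated group can be turned into bytes with bytes.fromhex and decoded with the 'utf-16-be' codec, which also combines surrogate pairs. The function name decode_utf16_hex is hypothetical, chosen for illustration.

```python
def decode_utf16_hex(hex_text: str) -> str:
    """Decode space-separated groups of hex digits as UTF-16BE text.

    Assumes every group holds a whole number of 4-digit code units.
    """
    words = []
    # Split on single spaces so the word boundaries survive the decoding.
    for group in hex_text.split(' '):
        raw = bytes.fromhex(group)            # '0633' -> b'\x06\x33'
        words.append(raw.decode('utf-16-be'))  # big-endian, no BOM expected
    return ' '.join(words)


if __name__ == '__main__':
    sample = '0633064406270645 06280647 0647064506af06cc D83DDE0E'
    print(decode_utf16_hex(sample))
```

A group with an odd number of hex digits, or a non-hex character, raises ValueError from bytes.fromhex, so malformed input from the module fails loudly instead of being silently skipped.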