
Python3 - Convert unicode literals string to unicode string


From the command line parameters (sys.argv) I receive a string of Unicode escape sequences like this: '\u041f\u0440\u0438\u0432\u0435\u0442\u0021'

For example this script uni.py:

import sys
print(sys.argv[1])

command line:

python uni.py \u041f\u0440\u0438\u0432\u0435\u0442\u0021

output:

\u041f\u0440\u0438\u0432\u0435\u0442\u0021

I want to convert it to the Unicode string 'Привет!'


Solution

  • You don't have to convert it to Unicode, because it already is Unicode. In Python 3.x, strings are Unicode by default. You only have to convert them (to or from bytes) when you want to read or write bytes, for example, when writing to a file.

    If you just print the string, you'll get the correct result, assuming your terminal supports the characters.

    print('\u041f\u0440\u0438\u0432\u0435\u0442\u0021')
    

    This will print:

    Привет!

    UPDATE

    After you updated your question, it became clear to me that the string is not really a string literal (or Unicode literal) in source code, but input from the command line, so the backslash escapes arrive as literal characters. In that case you could use the "unicode-escape" codec to get the result you want. Note that encoding goes from Unicode to bytes, and decoding goes from bytes to Unicode. In this case you want a transformation from Unicode to Unicode, so you have to add a "dummy" encoding step using latin-1, which maps the code points U+0000 to U+00FF one-to-one to bytes.

    The following code will print the correct result for your example:

    import sys

    # Encode to bytes with latin-1, then let unicode-escape interpret the \uXXXX sequences.
    text = sys.argv[1].encode('latin-1').decode('unicode-escape')
    print(text)
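
    As a minimal sketch of the same round trip without the command line, the snippet below wraps it in a helper. The name decode_escapes is hypothetical and not part of the original answer, and a raw string stands in for sys.argv[1]. One caveat that follows from the latin-1 step: it can only encode code points up to U+00FF, so input that already contains characters such as Cyrillic letters raises UnicodeEncodeError.

    ```python
    def decode_escapes(raw: str) -> str:
        """Turn literal backslash escapes such as '\\u041f' into the characters they name.

        Hypothetical helper for illustration only. The latin-1 step maps code
        points U+0000..U+00FF one-to-one to bytes, so characters above U+00FF
        in the input raise UnicodeEncodeError.
        """
        return raw.encode('latin-1').decode('unicode-escape')


    # Raw string: the backslashes are literal characters, as they would be in sys.argv[1].
    raw = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'
    text = decode_escapes(raw)
    print(text)  # Привет!
    ```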
    

    UPDATE 2

    Alternatively, you could use ast.literal_eval() to parse the string from the input. However, this function expects a proper Python literal, including the quotes. You could do something like this to work around that:

    import ast
    text = ast.literal_eval("'" + sys.argv[1] + "'")
    

    But note that this breaks if your input string contains a quote. I think it's a bit of a hack, since the function is probably not intended for this purpose. The unicode-escape approach is simpler and more robust. However, which solution is best depends on what you're building.
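
To make the quote problem concrete, here is a small comparison sketch. The two function names and the sample inputs are made up for illustration; raw strings again stand in for sys.argv[1].

```python
import ast


def parse_with_literal_eval(raw: str) -> str:
    """Wrap raw in single quotes and parse it as a Python string literal."""
    return ast.literal_eval("'" + raw + "'")


def parse_with_unicode_escape(raw: str) -> str:
    """Decode backslash escapes via the latin-1 / unicode-escape round trip."""
    return raw.encode('latin-1').decode('unicode-escape')


plain = r'\u041f\u0440\u0438\u0432\u0435\u0442\u0021'
quoted = r"it's \u0021"  # contains a single quote

# Both approaches agree on input without quotes.
print(parse_with_literal_eval(plain))
print(parse_with_unicode_escape(plain))

# The unicode-escape approach does not care about the quote.
print(parse_with_unicode_escape(quoted))

# The literal_eval approach sees a prematurely closed string and fails to parse.
try:
    parse_with_literal_eval(quoted)
except SyntaxError as exc:
    print('literal_eval failed:', type(exc).__name__)
```

If the literal_eval route is still preferred, the input would need its quotes escaped first, which adds more special cases than the codec approach requires.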