EDIT:
The following print shows my intended value.
(both sys.stdout.encoding and sys.stdin.encoding are 'UTF-8').
Why is the variable value different than its print value? I need to get the raw value into a variable.
>>username = 'Jo\xc3\xa3o'
>>username.decode('utf-8').encode('latin-1')
'Jo\xe3o'
>>print username.decode('utf-8').encode('latin-1')
João
Original question:
I'm having an issue querying a DB and decoding the values into Python.
I confirmed my DB NLS_LANG using
select property_value from database_properties where property_name='NLS_CHARACTERSET';
'''AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines
UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate
characters encoded using UTF-8 (or six bytes per character)'''
os.environ["NLS_LANG"] = ".AL32UTF8"
....
conn_data = str('%s/%s@%s') % (db_usr, db_pwd, db_sid)
sql = "select user_name from apex.users where user_id = '%s'" % userid
...
cursor.execute(sql)
ldap_username = cursor.fetchone()
...
where
print ldap_username
>>'Jo\xc3\xa3o'
I've tried both of the following (which return the same result):
ldap_username.decode('utf-8')
>>u'Jo\xe3o'
unicode(ldap_username, 'utf-8')
>>u'Jo\xe3o'
where
u'João'.encode('utf-8')
>>'Jo\xc3\xa3o'
How do I get the query's result back to the proper 'João'?
You already have the proper 'João', methinks. The difference between >>> 'Jo\xc3\xa3o' and >>> print 'Jo\xc3\xa3o' is that the former calls repr on the object, while the latter calls str (or probably unicode, in your case). It's just how the string is represented.
Some examples might make this more clear:
>>> print 'Jo\xc3\xa3o'.decode('utf-8')
João
>>> 'Jo\xc3\xa3o'.decode('utf-8')
u'Jo\xe3o'
>>> print repr('Jo\xc3\xa3o'.decode('utf-8'))
u'Jo\xe3o'
Notice how the second and third results are identical. The original ldap_username is currently a byte string (str) holding UTF-8 encoded bytes, not a Unicode object. You can see this at the Python prompt: a byte string is displayed as 'byte string', while Unicode objects are shown as u'Unicode string'; the key is the leading u.
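A quick way to confirm which kind of object you are holding is to compare its type and length before and after decoding. This is only a sketch: `raw` and `text` are hypothetical names standing in for the fetched value, and the explicit `b''` prefix is used so the snippet runs on both Python 2.7 and Python 3.

```python
# Hypothetical stand-in for the value returned by the query.
raw = b'Jo\xc3\xa3o'         # byte string: "ã" takes two bytes in UTF-8
text = raw.decode('utf-8')   # Unicode string: "ã" is a single character

# The type names differ between Python 2 (str/unicode) and Python 3 (bytes/str),
# but the lengths are the same on both: 5 bytes versus 4 characters.
print('%s %d' % (type(raw).__name__, len(raw)))
print('%s %d' % (type(text).__name__, len(text)))
```

The differing lengths show that the two objects hold the same name in different forms: one as encoded bytes, the other as characters.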
So, as your ldap_username reads as 'Jo\xc3\xa3o', and is a UTF-8 encoded byte string, the following applies:
>>> 'Jo\xc3\xa3o'.decode('utf-8')
u'Jo\xe3o'
>>> print 'Jo\xc3\xa3o'.decode('utf-8') # To Unicode...
João
>>> u'João'.encode('utf-8') # ... back to a UTF-8 byte string
'Jo\xc3\xa3o'
Summed up: you need to determine the type of the string (use type when unsure), and based on that, decode to Unicode, or encode to a byte string.
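That rule of thumb can be wrapped in a pair of small helpers. This is a minimal sketch, not part of any library: `to_text` and `to_bytes` are hypothetical names, UTF-8 is assumed as the default because that is what the database returns here, and the code is written to run on both Python 2.7 and Python 3.

```python
def to_text(value, encoding='utf-8'):
    """Decode byte strings to Unicode; return Unicode strings unchanged."""
    if isinstance(value, bytes):   # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Encode Unicode strings to bytes; return byte strings unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


name = to_text(b'Jo\xc3\xa3o')           # value as fetched from the database
assert name == u'Jo\xe3o'                # decoded to Unicode
assert to_text(name) == name             # already Unicode: left alone
assert to_bytes(name) == b'Jo\xc3\xa3o'  # round-trips back to UTF-8 bytes
```

Decoding once at the boundary (right after the fetch) and encoding only when writing out keeps the rest of the program working with Unicode throughout.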