I noticed that int(unicode_string) sometimes gives surprising results, e.g. int('᪐᭒') == 2.
>>> bytes('᪐᭒', 'utf-8')
b'\xe1\xaa\x90\xe1\xad\x92'
>>> [f'U+{ord(c):04X}' for c in '᪐᭒']
['U+1A90', 'U+1B52']
I would expect this to fail, because the string does not contain a number.
Is there some explanation for this behaviour?
...the string does not contain a number.
The two characters you show are numbers; they're both in the Unicode decimal number category (Nd), and per Python's documentation:
The values 0–9 can be represented by any Unicode decimal digit.
Specifically, ᪐ is Tai Tham Tham Digit Zero and ᭒ is Balinese Digit Two. So int('᪐᭒') is effectively int('02'), which is indeed 2.
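For example, you can confirm the character names with the standard unicodedata module:

>>> import unicodedata
>>> [unicodedata.name(c) for c in '᪐᭒']
['TAI THAM THAM DIGIT ZERO', 'BALINESE DIGIT TWO']
>>> int('᪐᭒')
2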
Does that also mean that a simple int(...) [scans] all possible number representations of all languages?
In CPython, when you call int on a string, PyLong_FromUnicodeObject first uses _PyUnicode_TransformDecimalAndSpaceToASCII, which:
Converts a Unicode object holding a decimal value to an ASCII string for using in int, float and complex parsers.
Transforms code points that have decimal digit property to the corresponding ASCII digit code points.
Transforms spaces to ASCII.
Transforms code points starting from the first non-ASCII code point that is neither a decimal digit nor a space to the end into '?'.
So you're getting something a little like:
>>> import unicodedata
>>> [unicodedata.digit(c, "?") for c in "᪐᭒"]
[0, 2]
prior to being parsed in the specified base. So it's not so much "scans all possible number representations" as looking up each character in the input to see whether it's considered a decimal digit; if it is, the character's Unicode properties include which digit value it represents.
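For illustration, here's a rough pure-Python sketch of that transform step, following the docstring quoted above (the function name and logic are an approximation, not the actual C implementation):

import unicodedata

def to_ascii_digits(s):
    # Rough Python approximation of the documented behaviour of
    # _PyUnicode_TransformDecimalAndSpaceToASCII (not the real C code).
    out = []
    for i, ch in enumerate(s):
        if ord(ch) < 128:
            out.append(ch)                          # ASCII passes through unchanged
        elif ch.isspace():
            out.append(' ')                         # Unicode spaces become ASCII spaces
        else:
            decimal = unicodedata.decimal(ch, None) # decimal-digit value, if any
            if decimal is None:
                out.append('?' * (len(s) - i))      # everything from here on becomes '?'
                break
            out.append(str(decimal))                # e.g. '᭒' -> '2'
    return ''.join(out)

print(to_ascii_digits('᪐᭒'))       # '02'
print(int(to_ascii_digits('᪐᭒')))  # 2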