I'm trying to understand how NumPy determines the dtype
when creating an array with mixed types. I noticed that the inferred dtype
for strings can vary significantly depending on the order and type of elements in the list.
print(np.array([1.0, True, 'is']))
# Output: array(['1.0', 'True', 'is'], dtype='<U32')
print(np.array(['1.0', True, 'is']))
# Output: array(['1.0', 'True', 'is'], dtype='<U5')
print(np.array(['1.0', 'True', 'is']))
# Output: array(['1.0', 'True', 'is'], dtype='<U4')
I understand that NumPy upcasts everything to a common type, usually the most general one, and that strings tend to dominate. But why does the resulting dtype (<U32, <U5, <U4) differ so much when the content looks almost the same?
Specifically:

- Why does np.array([1.0, True, 'is']) result in <U32?
- Why do the other cases produce much shorter dtypes (e.g., <U4 vs. <U5)?
- What exactly determines the dtype and string length in such cases?

Looking at your array contents and the NumPy type promotion rules, I believe the following applies:
For some operations, NumPy will promote almost any other data type to strings; this applies, among others, to array creation and concatenation.
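This promotion can also be inspected without building an array, via np.promote_types, which reports the string length NumPy reserves for each scalar type (a quick check; the exact lengths may vary slightly across NumPy versions):

```python
import numpy as np

# Promoting a scalar dtype with a minimal string dtype reveals how many
# characters NumPy reserves so that no value of that type is truncated.
print(np.promote_types(np.bool_, 'U1'))    # booleans: room for 'False'
print(np.promote_types(np.float64, 'U1'))  # floats: 32 characters
print(np.promote_types(np.int64, 'U1'))    # integers: sign plus digits
```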
This leaves us with the question of the string lengths. For the complete array, NumPy needs to choose a string length such that all values can be represented without loss of information. In your examples, the contents have the following data types:

- Strings ('is', 'True', '1.0'): for these, NumPy just needs to reserve their actual length (thus, if there are multiple strings in the same array, the maximum length of all occurring strings).
- Booleans (True): for converting them to a string, NumPy reserves a string length of 5, since the only possible converted values are 'True' (length 4) and 'False' (length 5). We can easily verify this:

  np.array(True).astype(str)  # >>> array('True', dtype='<U5')

- Floats (1.0): for converting them to a string, NumPy reserves a string length of 32. I assume this is for round-trip safety (i.e. to get back the exact same value when converting the string representation to a float). I would have expected a shorter length (somewhere between 20 and 30) to be enough, but maybe 32, a power of 2, was chosen for better memory alignment properties. In any case, again, we can verify this:

  np.array(1.).astype(str)  # >>> array('1.0', dtype='<U32')
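The same reasoning applies to integers, by the way: for a 64-bit integer, NumPy reserves 21 characters, enough for a sign plus the digits of the most extreme values (a small check, analogous to the ones above):

```python
import numpy as np

# int64 values need at most 21 characters when converted to a string
# (a sign plus up to 20 digits, reserved conservatively).
print(np.array(1, dtype=np.int64).astype(str).dtype)  # <U21
```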
Now to your examples:

- np.array([1.0, True, 'is']): we have a float 1.0 (→ string length 32), a boolean True (→ string length 5), and a string 'is' of length 2. The maximum length needed to represent all values is 32.
- np.array(['1.0', True, 'is']): we have a string '1.0' of length 3, a boolean True (→ string length 5), and a string 'is' of length 2. The maximum length needed is 5.
- np.array(['1.0', 'True', 'is']): we have a string '1.0' of length 3, a string 'True' of length 4, and a string 'is' of length 2. The maximum length needed is 4.

In all of this, the order of the elements does not play a role, by the way.
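To back up that last point, we can brute-force all orderings of the three example lists and confirm that every permutation yields the same dtype (a small check script using only the values from the question):

```python
from itertools import permutations

import numpy as np

# For each example list, collect the dtypes inferred from every
# permutation of its elements; each set should contain a single dtype.
for elems in ([1.0, True, 'is'], ['1.0', True, 'is'], ['1.0', 'True', 'is']):
    dtypes = {np.array(list(p)).dtype for p in permutations(elems)}
    print(elems, '->', dtypes)
```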