Tags: python, arrays, numpy

Why does NumPy assign different string dtypes when mixing types in np.array()?


I'm trying to understand how NumPy determines the dtype when creating an array with mixed types. I noticed that the inferred dtype for strings can vary significantly depending on the order and type of elements in the list.

import numpy as np

print(np.array([1.0, True, 'is']))
# Output: array(['1.0', 'True', 'is'], dtype='<U32')

print(np.array(['1.0', True, 'is']))    
# Output: array(['1.0', 'True', 'is'], dtype='<U5')

print(np.array(['1.0', 'True', 'is']))  
# Output: array(['1.0', 'True', 'is'], dtype='<U4')

I understand that NumPy upcasts everything to a common type, usually the most general one, and that strings tend to dominate. But why does the resulting dtype (<U32, <U5, <U4) differ so much when the content looks almost the same?


Solution

  • Looking at your array contents and NumPy's type-promotion rules, I think the following applies:

    When types are mixed, NumPy promotes almost any other dtype to a string dtype; this happens during array creation and during concatenation, among other operations.
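    This is easy to observe: a single Python str element is enough to pull the whole array to a fixed-width Unicode string dtype (kind 'U'). A small sketch:

```python
import numpy as np

# One string element pulls the whole array to a string dtype;
# the numeric values are converted to their string form.
mixed = np.array([1, 2.5, 'x'])
print(mixed.dtype.kind)  # 'U': fixed-width Unicode string dtype
print(mixed)             # array of the strings '1', '2.5', 'x'
```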

    This leaves us with the question of the string lengths. For the complete array, NumPy needs to choose a string length such that every value can be represented without loss of information. Each dtype has a fixed maximum length when promoted to a string: float64 becomes <U32 (32 characters are enough for any float64 representation), bool becomes <U5 (the longest value, 'False', has five characters), and a Python str of length n becomes <Un.
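    These per-type maximum lengths can be queried with np.promote_types; a sketch (the int64 line is an extra illustration beyond your examples):

```python
import numpy as np

U = np.dtype(str)  # a zero-length Unicode dtype ('<U0') used as the promotion target
print(np.promote_types(np.float64, U))  # <U32: any float64 repr fits in 32 chars
print(np.promote_types(np.bool_, U))    # <U5:  'False' has 5 characters
print(np.promote_types(np.int64, U))    # <U21: fits the sign and digits of any int64
```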

    Now to your examples:

    • [1.0, True, 'is']: the float64 dominates, so the result is <U32, which also covers the bool (<U5) and 'is' (<U2).

    • ['1.0', True, 'is']: there is no float, so the bool's <U5 dominates the literal string lengths <U3 and <U2.

    • ['1.0', 'True', 'is']: all elements are plain strings, and the longest one, 'True', has four characters, hence <U4.
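    The same lengths fall out of np.result_type applied to the element dtypes; a sketch, where 'U3', 'U4', 'U2' stand for the literal lengths of '1.0', 'True', 'is':

```python
import numpy as np

# The array dtype is the promotion of the element dtypes.
print(np.result_type(np.float64, np.bool_, 'U2'))  # example 1 -> <U32
print(np.result_type('U3', np.bool_, 'U2'))        # example 2 -> <U5
print(np.result_type('U3', 'U4', 'U2'))            # example 3 -> <U4
```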

    In all of this, the order of the elements plays no role: dtype promotion is commutative, so only the set of element types and string lengths matters.
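    A quick check of the order claim (a sketch):

```python
import numpy as np

# Promotion is order-independent: reversing the elements yields the same dtype.
a = np.array([1.0, True, 'is'])
b = np.array(['is', True, 1.0])
print(a.dtype, b.dtype)  # both <U32
```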