python arrays python-3.x numpy structured-array

Is there a way to declare a structured array that has a string field of arbitrary length?

This has bugged me for some time now, but I haven't really found a satisfying solution.

If you declare a structured array with a field that contains strings, how can you set the dtype of that field to something so that you don't have to worry about the length of the strings in that field?

With floats and ints it is so much easier. So far, I've always used 'i4' or 'f4' as the respective dtypes and never had any issues (although I am not sure if this is bad practice, feel free to point it out). And in the unlikely case that numbers are actually too long for these dtypes, Python tells me so by raising an OverflowError. But if a string is too long, it is just silently cut off.
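A minimal sketch of the silent truncation described above. The one-field dtype and the field name `name` are made up for illustration:

```python
import numpy as np

# Hypothetical one-field structured array with a 5-character string field.
demo = np.zeros(1, dtype=[('name', 'U5')])

# The value is longer than 5 characters; NumPy stores only the first 5
# and raises no error or warning.
demo['name'] = 'pineapple'

truncated = demo['name'][0]
print(repr(truncated))
```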

Is there any way to declare the string dtype so that you don't have to know exactly how long your strings are (going to be) that you want to store in the structured array prior to creating it? I mean you could always guesstimate and assume that, say, 'U30' is probably going to be enough and hope for the best, but I don't really like that. So far, my workaround has always been to use the object dtype 'O' because it just takes whatever, but I never really liked that either.

I think in the case of ints or floats, you could use int and float as dtypes just as well, without having to worry about the number of bits necessary to store the data. Why is it not implemented in the same way for strings when using str as the dtype? I followed this chain of posts, and in the GitHub issue it is explained that the str dtype defaults to an empty string, if I am not mistaken.

According to the numpy documentation on data type objects:

To use actual strings in Python 3 use U or np.unicode_.

So I thought I'd give a couple of things a try in the example below, but (as expected) none of them work.

import numpy as np


array = np.array(
    [
        ('Apple', 'green', 'round', 'fresh', 'good', 10e4, np.pi)], dtype=[
        # note: np.unicode_ was an alias of np.str_ and was removed in NumPy 2.0
        ('fruit', np.str_), ('color', np.unicode_), ('shape', np.dtype(str)),
        ('state', str), ('taste', 'U2'), ('weight', 'i4'), ('radius', float)
    ]
)

# this causes OverflowError: Python int too large to convert to C long
# array[0]['weight'] = 10e10

# this is just 'ignored'
array[0]['color'] = 'red'

print(array)

Solution

All the variants you tried do the same thing: each defines a 'U0' field, a zero-length string field that has no room for any characters. This isn't just a structured array issue.

dtype=[('fruit', '<U'), ('color', '<U'), ('shape', '<U'), ('state', '<U'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')]
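To check this yourself, the following sketch inspects the item size of each variant. The two-field dtype here is a shortened, hypothetical version of the one in the question:

```python
import numpy as np

# str, np.str_ and 'U' all describe a unicode string with no length given.
variants = [str, np.str_, np.dtype(str), 'U']
sizes = [np.dtype(v).itemsize for v in variants]
print(sizes)

# Inside a structured dtype the field keeps that zero width,
# while 'U2' reserves 2 characters at 4 bytes each.
dt = np.dtype([('fruit', str), ('taste', 'U2')])
print(dt['fruit'].itemsize, dt['taste'].itemsize)
```

An item size of zero means the field cannot hold any characters, which is why the assignment in the question appears to be ignored.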

Either specify an explicit length such as 'U10', or use 'O' (object dtype):

In [239]: arr = np.array( 
     ...:     [ 
     ...:         ('Apple', 'green', 'round', 'fresh', 'good', 10e4, np.pi)], dtype=[ 
     ...:         ('fruit', 'U10'), ('color', 'O'), ('shape', 'O'), 
     ...:         ('state', 'S10'), ('taste', 'U2'), ('weight', 'i4'), ('radius', float) 
     ...:     ] 
     ...: )                                                                                            
In [240]: arr                                                                                          
Out[240]: 
array([('Apple', 'green', 'round', b'fresh', 'go', 100000, 3.14159265)],
      dtype=[('fruit', '<U10'), ('color', 'O'), ('shape', 'O'), ('state', 'S10'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])
In [241]: arr['color']                                                                                 
Out[241]: array(['green'], dtype=object)
In [242]: arr['color']='yellow_green'                                                                  
In [243]: arr['fruit']                                                                                 
Out[243]: array(['Apple'], dtype='<U10')
In [244]: arr['fruit']='pineapple'                                                                     
In [245]: arr                                                                                          
Out[245]: 
array([('pineapple', 'yellow_green', 'round', b'fresh', 'go', 100000, 3.14159265)],
      dtype=[('fruit', '<U10'), ('color', 'O'), ('shape', 'O'), ('state', 'S10'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])

pandas has traditionally opted for object dtype for all of its strings. The numpy fixed-length string dtype is fine when the strings tend to be about the same size and the size is known ahead of time, e.g. np.array(['one','two','three', 'four', 'five'])
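If you want to stay with fixed-length strings, one possible approach (a sketch, not the only way) is to size the field from the data before building the array, and to copy into a wider dtype if longer strings show up later. The sample data and the width of 20 are assumptions for illustration:

```python
import numpy as np

# Hypothetical sample data; in practice this would be your own records.
fruits = ['Apple', 'fig', 'pineapple']
weights = [100, 30, 900]

# Size the string field from the longest value actually present.
width = max(len(s) for s in fruits)
dt = np.dtype([('fruit', f'U{width}'), ('weight', 'i4')])
arr = np.array(list(zip(fruits, weights)), dtype=dt)

# If longer strings turn up later, copy into a dtype with a wider field.
wider = arr.astype([('fruit', 'U20'), ('weight', 'i4')])
wider['fruit'][1] = 'dragon fruit'
print(arr.dtype, wider.dtype)
```

The trade-off is that every element reserves space for the widest string, so this suits data whose lengths do not vary much; for widely varying lengths the object dtype remains the simpler choice.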