This has bugged me for some time now, but I haven't really found a satisfying solution.
If you declare a structured array with a field that contains strings, how can you set the dtype
of that field to something so that you don't have to worry about the length of the strings in that field?
With floats and ints it is so much easier. So far, I've always used 'i4' or 'f4' as the respective dtypes and never had any issues (although I am not sure if this is bad practice, feel free to point it out). And in the unlikely case that numbers are actually too long for these dtypes, Python tells me so by raising an OverflowError. But if a string is too long, it is just silently cut off.
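To make the silent truncation concrete, here is a minimal sketch. The field names and the 'U2' width are only illustrative choices, not anything numpy prescribes.

```python
import numpy as np

# One-element structured array with a 2-character unicode field (illustrative names).
arr = np.zeros(1, dtype=[('taste', 'U2'), ('weight', 'i4')])

# Assigning a longer string raises no error; only the first 2 characters are stored.
arr['taste'] = 'good'
print(arr['taste'][0])  # go
```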
Is there any way to declare the string dtype so that you don't have to know exactly how long your strings are (going to be) that you want to store in the structured array prior to creating it? I mean you could always guesstimate and assume that, say, 'U30' is probably going to be enough and hope for the best, but I don't really like that. So far, my workaround has always been to use the object dtype 'O' because it just takes whatever, but I never really liked that either.
I think in the case of ints or floats, you could use int and float as dtypes just as well, without having to worry about the number of bits necessary to store the data. Why is it not implemented in the same way for strings when using str as the dtype? I followed this chain of posts, and in the GitHub issue, it is explained that the str dtype defaults to an empty string, if I am not mistaken.
According to the numpy documentation on data type objects:
To use actual strings in Python 3 use U or np.unicode_.
So I thought I'd give a couple of things a try in the example below, but (as expected) none of them work.
import numpy as np
array = np.array(
    [('Apple', 'green', 'round', 'fresh', 'good', 10e4, np.pi)],
    dtype=[
        ('fruit', np.str_), ('color', np.unicode_), ('shape', np.dtype(str)),
        ('state', str), ('taste', 'U2'), ('weight', 'i4'), ('radius', float)
    ]
)
# this causes OverflowError: Python int too large to convert to C long
# array[0]['weight'] = 10e10
# this is just 'ignored'
array[0]['color'] = 'red'
print(array)
All the variants that you tried do the same thing: they define a 'U0', a zero-length string field. This isn't just a structured array issue.
dtype=[('fruit', '<U'), ('color', '<U'), ('shape', '<U'), ('state', '<U'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])
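A small check of the zero-width claim, as a sketch. The field names are only illustrative, and the itemsize values assume numpy's usual 4 bytes per unicode character.

```python
import numpy as np

# str with no explicit length is a zero-width unicode dtype.
print(np.dtype(str).itemsize)  # 0
print(np.dtype('U').itemsize)  # 0

# np.array can infer a width when it is handed plain string data with dtype=str ...
plain = np.array(['Apple', 'Fig'], dtype=str)
print(plain.dtype.itemsize // 4)  # 5 characters, the longest input

# ... but a field inside a structured dtype keeps exactly the width it was declared with.
dt = np.dtype([('fruit', str), ('taste', 'U2')])
print(dt['fruit'].itemsize, dt['taste'].itemsize)  # 0 8
```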
Either specify an explicit length such as 'U10', or use 'O' (object dtype):
In [239]: arr = np.array(
...: [
...: ('Apple', 'green', 'round', 'fresh', 'good', 10e4, np.pi)], dtype=[
...: ('fruit', 'U10'), ('color', 'O'), ('shape', 'O'),
...: ('state', 'S10'), ('taste', 'U2'), ('weight', 'i4'), ('radius', float)
...: ]
...: )
In [240]: arr
Out[240]:
array([('Apple', 'green', 'round', b'fresh', 'go', 100000, 3.14159265)],
dtype=[('fruit', '<U10'), ('color', 'O'), ('shape', 'O'), ('state', 'S10'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])
In [241]: arr['color']
Out[241]: array(['green'], dtype=object)
In [242]: arr['color']='yellow_green'
In [243]: arr['fruit']
Out[243]: array(['Apple'], dtype='<U10')
In [244]: arr['fruit']='pineapple'
In [245]: arr
Out[245]:
array([('pineapple', 'yellow_green', 'round', b'fresh', 'go', 100000, 3.14159265)],
dtype=[('fruit', '<U10'), ('color', 'O'), ('shape', 'O'), ('state', 'S10'), ('taste', '<U2'), ('weight', '<i4'), ('radius', '<f8')])
pandas opts for using object dtype for all of its strings. The numpy fixed string length is OK when the strings tend to be all the same size and are known ahead of time, e.g. np.array(['one','two','three', 'four', 'five'])
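If the data is available before the array is created, one option is to measure the longest string per column and build the dtype from that. This is only a sketch: the records and field names below are made up for illustration, and anything assigned later is still cut to the declared width unless the array is widened first.

```python
import numpy as np

# Made-up sample records: (fruit, color, radius).
records = [('Apple', 'green', 1.5), ('Pineapple', 'yellow_green', 6.0)]

# Measure the longest string in each text column before declaring the dtype.
fruit_len = max(len(r[0]) for r in records)
color_len = max(len(r[1]) for r in records)

dt = np.dtype([('fruit', f'U{fruit_len}'), ('color', f'U{color_len}'), ('radius', 'f8')])
arr = np.array(records, dtype=dt)
print(arr['fruit'])

# To store longer strings later, copy into a wider dtype first (20 is an arbitrary guess).
wider = arr.astype([('fruit', 'U20'), ('color', 'U20'), ('radius', 'f8')])
wider['fruit'][0] = 'Dragon fruit'
print(wider['fruit'][0])
```

The astype call makes a full copy, so this suits arrays that are built once and occasionally widened rather than ones that grow string by string; for the latter, object dtype is the simpler route.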