This problem came up in a larger codebase, but here is a simple example:
from io import StringIO
import numpy as np
example_data = "A B\na b\na b"
data1 = np.genfromtxt(StringIO(example_data), usecols=["A", "B"], names=True, dtype=None)
print(data1["A"], data1["B"]) # ['a' 'a'] ['b' 'b'] which is correct
data2 = np.genfromtxt(StringIO(example_data), usecols=["B", "A"], names=True, dtype=None)
print(data2["A"], data2["B"]) # ['b' 'b'] ['a' 'a'] which is not correct
As you can see, if the order of the columns I pass differs from the column order in the file, I get wrong results. What's interesting is that the dtypes are the same:
print(data1.dtype) # [('A', '<U1'), ('B', '<U1')]
print(data2.dtype) # [('A', '<U1'), ('B', '<U1')]
In this example it's not hard to sort the column names before passing them, but in my case the column names come from another part of the system, and there is no guarantee that they will be in the same order as in the file. I can probably work around that, but I'm wondering whether there is something wrong with my logic in this example or whether this is a bug.
Any help is appreciated.
Update:
Playing around a bit, I just realized the following: if I add one or more columns to the example data (it doesn't matter where) and pass a subset of the columns to np.genfromtxt in whichever order I want, it gives the correct result.
Example:
example_data = "A B C\na b c\na b c"
data1 = np.genfromtxt(StringIO(example_data), usecols=["A", "B"], names=True, dtype=None)
print(data1["A"], data1["B"]) # ['a' 'a'] ['b' 'b'] which is correct
data2 = np.genfromtxt(StringIO(example_data), usecols=["B", "A"], names=True, dtype=None)
print(data2["A"], data2["B"]) # ['a' 'a'] ['b' 'b'] which is correct
In [62]: text = "A B\na b\na b".splitlines()
In [63]: np.genfromtxt(text,dtype=None, usecols=[1,0],names=True)
Out[63]: array([('b', 'a'), ('b', 'a')], dtype=[('A', '<U1'), ('B', '<U1')])
In [64]: np.genfromtxt(text,dtype=None, usecols=[1,0])
Out[64]:
array([['B', 'A'],
       ['b', 'a'],
       ['b', 'a']], dtype='<U1')
So it uses the columns in the order you specify in usecols, but takes the structured array dtype from the names.
In [65]: text3="A B C\na b c\na b c".splitlines()
In [66]: np.genfromtxt(text3,dtype=None, usecols=[1,0])
Out[66]:
array([['B', 'A'],
       ['b', 'a'],
       ['b', 'a']], dtype='<U1')
In [67]: np.genfromtxt(text3,dtype=None, usecols=[1,0],names=True)
Out[67]: array([('b', 'a'), ('b', 'a')], dtype=[('B', '<U1'), ('A', '<U1')])
In the subset case it pays attention to usecols when constructing the dtype.
From the genfromtxt code (read from [source] or with ipython's ??): first_values is the list of names derived from the first line, and nbcols is their count (or the length of usecols, when that is given).
After making sure usecols is a list, and converting names to numbers if needed, it does the following:
nbcols = len(usecols or first_values)
...
if usecols:
    for (i, current) in enumerate(usecols):
        # if usecols is a list of names, convert to a list of indices
        if _is_string_like(current):
            usecols[i] = names.index(current)
        elif current < 0:
            usecols[i] = current + len(first_values)
    # If the dtype is not None, make sure we update it
    if (dtype is not None) and (len(dtype) > nbcols):
        descr = dtype.descr
        dtype = np.dtype([descr[_] for _ in usecols])
        names = list(dtype.names)
    # If `names` is not None, update the names
    elif (names is not None) and (len(names) > nbcols):
        names = [names[_] for _ in usecols]
So with usecols, nbcols is the number of columns to use. In the subset case it selects from the names, but if it isn't a subset, then names isn't modified, in number or order.
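To make the two cases concrete, here is a small pure-Python paraphrase of that branch for the dtype=None case. The function name resolve_usecols is hypothetical and not part of NumPy; it only mirrors the quoted snippet, so treat it as a sketch of the behaviour rather than the library's implementation.

```python
def resolve_usecols(names, usecols):
    """Hypothetical helper: mimic how the quoted genfromtxt branch
    resolves usecols and names when dtype is None."""
    names = list(names)
    usecols = list(usecols)
    nbcols = len(usecols or names)
    for i, current in enumerate(usecols):
        if isinstance(current, str):
            # a column given by name becomes its index in the header
            usecols[i] = names.index(current)
        elif current < 0:
            # negative indices count from the end
            usecols[i] = current + len(names)
    # names are only reordered when fewer columns are used than exist
    if len(names) > nbcols:
        names = [names[i] for i in usecols]
    return usecols, names


# All columns requested, swapped: the indices swap, the names do not.
print(resolve_usecols(["A", "B"], ["B", "A"]))       # ([1, 0], ['A', 'B'])
# Proper subset: the names follow the usecols order.
print(resolve_usecols(["A", "B", "C"], ["B", "A"]))  # ([1, 0], ['B', 'A'])
```

The first call shows the mismatch from the question: data is read in the order B, A but labelled A, B.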
For a structured array you really don't need to specify the order:
In [79]: data=np.genfromtxt(text,dtype=None, names=True); data
Out[79]: array([('a', 'b'), ('a', 'b')], dtype=[('A', '<U1'), ('B', '<U1')])
In [80]: data['B'], data['A']
Out[80]: (array(['b', 'b'], dtype='<U1'), array(['a', 'a'], dtype='<U1'))
Columns can be reordered after loading with indexing:
In [87]: data[['A','B']]
Out[87]: array([('a', 'b'), ('a', 'b')], dtype=[('A', '<U1'), ('B', '<U1')])
In [88]: data[['B','A']]
Out[88]:
array([('b', 'a'), ('b', 'a')],
      dtype={'names': ['B', 'A'], 'formats': ['<U1', '<U1'], 'offsets': [4, 0], 'itemsize': 8})
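If the column names arrive from elsewhere in arbitrary order, one possible workaround is to load every column by name and select the wanted fields afterwards. The helper name load_columns is made up for this sketch, and passing encoding="utf-8" is my assumption to get str fields regardless of the NumPy version.

```python
from io import StringIO

import numpy as np


def load_columns(text, wanted):
    """Hypothetical helper: read all columns by header name, then pick
    the wanted fields in the requested order."""
    data = np.genfromtxt(StringIO(text), names=True, dtype=None,
                         encoding="utf-8")
    # multi-field indexing keeps each name attached to its own data
    return data[list(wanted)]


example_data = "A B\na b\na b"
subset = load_columns(example_data, ["B", "A"])
print(subset["A"], subset["B"])  # ['a' 'a'] ['b' 'b']
```

This reads columns that may never be used, so it trades some memory for not depending on how usecols and names interact.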
I suppose this could be raised as an issue. The logic in applying usecols, names, etc. is complicated as it is :)
With an explicit dtype:
In [96]: dt=[('B','U1'),('A','U1')]
In [97]: data=np.genfromtxt(text,dtype=dt, usecols=[1,0], skip_header=1); data
Out[97]: array([('b', 'a'), ('b', 'a')], dtype=[('B', '<U1'), ('A', '<U1')])
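The same idea can be wrapped up so that the dtype and the column indices are both built from the requested names. This is only a sketch: load_in_order and its fmt parameter are hypothetical, it assumes a whitespace-delimited header on the first line, and it assumes every requested column fits the single format given (U1 here, as in the example above).

```python
from io import StringIO

import numpy as np


def load_in_order(text, wanted, fmt="U1"):
    """Hypothetical helper: read only the wanted columns, with the field
    names and the column indices built from the same list."""
    header = text.splitlines()[0].split()
    usecols = [header.index(name) for name in wanted]
    dt = [(name, fmt) for name in wanted]
    # skip_header=1 because the names come from dt, not from the file
    return np.genfromtxt(StringIO(text), dtype=dt, usecols=usecols,
                         skip_header=1, encoding="utf-8")


data = load_in_order("A B C\na b c\na b c", ["C", "A"])
print(data.dtype.names)
print(data["A"], data["C"])
```

Because the names and indices are derived together, they cannot get out of step, whether or not the request covers every column in the file.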