python numpy genfromtxt

Wrong column assignment with np.genfromtxt if passed column order is not the same as in file


This problem appeared in some larger code, but here is a simple example:

from io import StringIO
import numpy as np

example_data = "A B\na b\na b"

data1 = np.genfromtxt(StringIO(example_data), usecols=["A", "B"], names=True, dtype=None)
print(data1["A"], data1["B"]) # ['a' 'a'] ['b' 'b'] which is correct

data2 = np.genfromtxt(StringIO(example_data), usecols=["B", "A"], names=True, dtype=None)
print(data2["A"], data2["B"]) # ['b' 'b'] ['a' 'a'] which is not correct

As you can see, if the column order I pass differs from the column order in the file, I get wrong results. Interestingly, the dtypes are the same:

print(data1.dtype) # [('A', '<U1'), ('B', '<U1')]
print(data2.dtype) # [('A', '<U1'), ('B', '<U1')]

In this example it's not hard to sort the column names before passing them, but in my case the column names come from another part of the system, and there is no guarantee that they will be in the same order as in the file. I can probably work around that, but I'm wondering whether something is wrong with my logic in this example or whether this is a bug.

Any help is appreciated.

Update:
After playing around a bit, I realized the following: if I add one or more columns to the example data (it doesn't matter where) and pass a subset of the columns to np.genfromtxt in whichever order I want, it gives the correct result.

Example:

example_data = "A B C\na b c\na b c"

data1 = np.genfromtxt(StringIO(example_data), usecols=["A", "B"], names=True, dtype=None)
print(data1["A"], data1["B"]) # ['a' 'a'] ['b' 'b'] which is correct
data2 = np.genfromtxt(StringIO(example_data), usecols=["B", "A"], names=True, dtype=None)
print(data2["A"], data2["B"]) # ['a' 'a'] ['b' 'b'] which is correct

Solution

  •  In [62]: text = "A B\na b\na b".splitlines()
    
    In [63]: np.genfromtxt(text,dtype=None, usecols=[1,0],names=True)
    Out[63]: array([('b', 'a'), ('b', 'a')], dtype=[('A', '<U1'), ('B', '<U1')])
    
    In [64]: np.genfromtxt(text,dtype=None, usecols=[1,0])
    Out[64]: 
    array([['B', 'A'],
           ['b', 'a'],
           ['b', 'a']], dtype='<U1')
    

    So it reads the columns in the order you specify in usecols, but takes the field names of the structured dtype from the header line, in file order.

    In [65]: text3="A B C\na b c\na b c".splitlines()
    
    In [66]: np.genfromtxt(text3,dtype=None, usecols=[1,0])
    Out[66]: 
    array([['B', 'A'],
           ['b', 'a'],
           ['b', 'a']], dtype='<U1')
    
    In [67]: np.genfromtxt(text3,dtype=None, usecols=[1,0],names=True)
    Out[67]: array([('b', 'a'), ('b', 'a')], dtype=[('B', '<U1'), ('A', '<U1')])
    

    In the subset case it pays attention to the usecols when constructing the dtype.

    From the genfromtxt code (read from [source] or with ipython's ??):

    first_values is the list of names derived from the first line, and nbcols is their count (or the length of usecols, when that is given).

    After making sure usecols is a list, it does the following, including converting names to indices if needed:

    nbcols = len(usecols or first_values)
    ...
            if usecols:
                for (i, current) in enumerate(usecols):
                    # if usecols is a list of names, convert to a list of indices
                    if _is_string_like(current):
                        usecols[i] = names.index(current)
                    elif current < 0:
                        usecols[i] = current + len(first_values)
                # If the dtype is not None, make sure we update it
                if (dtype is not None) and (len(dtype) > nbcols):
                    descr = dtype.descr
                    dtype = np.dtype([descr[_] for _ in usecols])
                    names = list(dtype.names)
                # If `names` is not None, update the names
                elif (names is not None) and (len(names) > nbcols):
                    names = [names[_] for _ in usecols]
    

    So with usecols, nbcols is the number of columns to use. In the subset case (more names than used columns) it selects the names in usecols order, but if usecols covers every column, the names are not modified, neither in number nor in order.
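
As a rough illustration of the name-handling branch quoted above, here is a stripped-down sketch in plain Python. The function `selected_names` is hypothetical (it is not part of NumPy), it covers only the dtype=None path, and `isinstance(..., str)` stands in for `_is_string_like`:

```python
def selected_names(names, usecols):
    """Mimic only the name-update branch of the excerpt (dtype=None case).

    Hypothetical helper for illustration; not NumPy's actual function.
    """
    # Convert column names to indices, as the excerpt's loop does.
    usecols = [names.index(c) if isinstance(c, str) else c for c in usecols]
    nbcols = len(usecols)
    # Names are reordered only when there are MORE names than used columns.
    if len(names) > nbcols:
        names = [names[i] for i in usecols]
    return names


print(selected_names(["A", "B"], ["B", "A"]))       # ['A', 'B']
print(selected_names(["A", "B", "C"], ["B", "A"]))  # ['B', 'A']
```

With two file columns and two requested columns the length test fails, so the header order is kept even though the data columns were swapped.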

    For a structured array you really don't need to specify the order, since fields are accessed by name:

    In [79]: data=np.genfromtxt(text,dtype=None, names=True); data
    Out[79]: array([('a', 'b'), ('a', 'b')], dtype=[('A', '<U1'), ('B', '<U1')])
    
    In [80]: data['B'], data['A']
    Out[80]: (array(['b', 'b'], dtype='<U1'), array(['a', 'a'], dtype='<U1'))
    

    Columns can be reordered after loading with indexing:

    In [87]: data[['A','B']]
    Out[87]: array([('a', 'b'), ('a', 'b')], dtype=[('A', '<U1'), ('B', '<U1')])
    
    In [88]: data[['B','A']]
    Out[88]: 
    array([('b', 'a'), ('b', 'a')],
          dtype={'names': ['B', 'A'], 'formats': ['<U1', '<U1'], 'offsets': [4, 0], 'itemsize': 8})
    

    I suppose this could be raised as an issue. The logic in applying usecols, names, etc, is complicated as it is :)

    edit

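    With an explicit dtype whose fields are listed in usecols order (and skip_header=1 to skip the header line), the names line up with the data: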

    In [96]: dt=[('B','U1'),('A','U1')]
    
    In [97]: data=np.genfromtxt(text,dtype=dt, usecols=[1,0], skip_header=1); data
    Out[97]: array([('b', 'a'), ('b', 'a')], dtype=[('B', '<U1'), ('A', '<U1')])
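
For the situation in the question, where the requested names arrive in an arbitrary order, one possible workaround is to let genfromtxt read the columns in file order and reorder afterwards. A minimal sketch, assuming NumPy 1.14 or later (for the `encoding` argument); `load_named_columns` is a hypothetical helper, not a NumPy function:

```python
from io import StringIO

import numpy as np


def load_named_columns(source, wanted):
    """Return the fields named in `wanted`, in that order (hypothetical helper)."""
    # No usecols here: every column is read, so the header names stay
    # aligned with the data columns.
    data = np.genfromtxt(source, names=True, dtype=None, encoding=None)
    # Multi-field indexing selects and reorders the fields by name.
    return data[list(wanted)]


example_data = "A B\na b\na b"
result = load_named_columns(StringIO(example_data), ["B", "A"])
print(result["A"], result["B"])  # ['a' 'a'] ['b' 'b']
```

This reads every column before selecting, which may be wasteful for very wide files; in that case sorting the requested names into file order before passing them as usecols would be an alternative.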