python numpy csv converters genfromtxt

Reading csv with numpy has extra row of 1's


I am using np.genfromtxt to read a csv file, and trying to use the converters argument to preprocess each column.

CSV:

"","Col1","Col2","Col3"
"1","Cell.1",NA,1
"2","Cell.2",NA,NA
"3","Cell.3",1,NA
"4","Cell.4",NA,NA
"5","Cell.5",NA,NA
"6","Cell.6",1,NA

Code:

import numpy as np

filename = 'b.csv'
h = ("", "Col1", "Col2", "Col3")

def col1_converter(v):
    print(f'col1_converter {v = }')
    return v

def col2_converter(v):
    print(f'col2_converter {v = }')
    return v

def col3_converter(v):
    print(f'col3_converter {v = }')
    return v

a = np.genfromtxt(
    filename,
    delimiter=',',
    names=True,
    dtype=[None, np.dtype('U8'), np.dtype('U2'), np.dtype('U2')],
    usecols=range(1, len(h)),
    converters={1: col1_converter, 2: col2_converter, 3: col3_converter},
    deletechars='',
)
print()
print(a)

When I put print statements in the converters, the output begins with an extraneous row of 1's that does not appear in the resulting array. Why am I seeing this row of 1's?

col1_converter v = b'1'
col2_converter v = b'1'
col3_converter v = b'1'
col1_converter v = b'"Cell.1"'
col1_converter v = b'"Cell.2"'
col1_converter v = b'"Cell.3"'
col1_converter v = b'"Cell.4"'
col1_converter v = b'"Cell.5"'
col1_converter v = b'"Cell.6"'
col2_converter v = b'NA'
col2_converter v = b'NA'
col2_converter v = b'1'
col2_converter v = b'NA'
col2_converter v = b'NA'
col2_converter v = b'1'
col3_converter v = b'1'
col3_converter v = b'NA'
col3_converter v = b'NA'
col3_converter v = b'NA'
col3_converter v = b'NA'
col3_converter v = b'NA'

[('"Cell.1"', 'NA', '1') ('"Cell.2"', 'NA', 'NA') ('"Cell.3"', '1', 'NA')
 ('"Cell.4"', 'NA', 'NA') ('"Cell.5"', 'NA', 'NA') ('"Cell.6"', '1', 'NA')]

Solution

  • TL;DR: Before doing any of the actual conversions, NumPy "tests" each converter function by calling it once with the argument '1' (printed as b'1' in the output above) to find a reasonable default value for the column. This doesn't affect the output, except possibly by changing the default value for a given column.

    Explanation

    I thought it was strange that each converter gets called once, and only then does the column 1 converter get called for every row, followed by the column 2 converter, and so on. This suggested that the invocations were coming from different places in the code. I used Python's traceback module to confirm:

    import traceback

    def col1_converter(v):
        print(f'col1_converter {v = }')
        # Print the call stack to see where this invocation comes from.
        traceback.print_stack()
        return v
    

    Sure enough, all of the calls to col1_converter had identical stack traces, except the first one. I looked through that stack trace and found this interesting bit of code:

      File "/Users/rpmccarter/Library/Python/3.8/lib/python/site-packages/numpy/lib/_iotools.py", line 804, in update
        tester = func(testing_value or '1')
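If the full stack traces are too noisy to compare by eye, recording only the immediate caller of each invocation can make the odd call stand out. The following is a minimal sketch using only the standard library; `probe` and `convert_rows` are hypothetical stand-ins for the library's one-off test call and its per-row loop, not NumPy functions.

```python
import traceback

call_sites = []

def converter(v):
    # extract_stack() lists frames oldest first; the last entry is this
    # function, so the second-to-last entry is whoever called it.
    caller = traceback.extract_stack()[-2]
    call_sites.append((v, caller.name))
    return v

def probe():
    # Hypothetical stand-in for a single test call made before conversion.
    converter('1')

def convert_rows():
    # Hypothetical stand-in for the loop that converts the actual data.
    for v in ['NA', '1']:
        converter(v)

probe()
convert_rows()
for value, site in call_sites:
    print(f'{value!r} called from {site}')
```

Used inside a real converter, the same `caller.name` lookup would show which library function triggered each call.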
    

    Because NumPy is open source, I went to the GitHub repo and opened the _iotools.py file. The docstring of the update function from the stack trace briefly explains why the converter is invoked, and the invocation itself appears further down:

        testing_value : str, optional
            A string representing a standard input value of the converter.
            This string is used to help defining a reasonable default
            value.
    
    ...
    
        try:
            tester = func(testing_value or '1')
        except (TypeError, ValueError):
            tester = None
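The effect of that try/except can be reproduced in isolation. The sketch below is a simplified model, not NumPy's implementation: `probe_converter` and `convert_column` are hypothetical helpers that mirror only the quoted lines, and the converter records every value it receives.

```python
calls = []

def recording_converter(v):
    # Record each value passed in, then return it unchanged.
    calls.append(v)
    return v

def probe_converter(func, testing_value=None):
    # Hypothetical helper mirroring the quoted snippet: call the converter
    # once with a sample value, falling back to '1' when none is given.
    try:
        return func(testing_value or '1')
    except (TypeError, ValueError):
        return None

def convert_column(func, values):
    # Hypothetical model: one probe call first, then the real conversions.
    probe_converter(func)
    return [func(v) for v in values]

converted = convert_column(recording_converter, ['NA', 'NA', '1'])
print(calls)      # ['1', 'NA', 'NA', '1']
print(converted)  # ['NA', 'NA', '1']
```

The probe value shows up in the converter's call log but not in the converted result, which matches the extra row of 1's seen in the printed output of the question. A converter that raises TypeError or ValueError on '1' is tolerated by the probe, which then yields None.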