pythonnumpygenfromtxt

numpy genfromtxt not applying missing_values


I am currently struggling with a really simple problem, but cannot seem to solve it. You can reproduce the issue with the following file and code:

test.csv

2020081217,28.6
2020081218,24.7
2020081219,-999.0
2020081220,-999.0
2020081221,-999.0

code

data = np.genfromtxt("C:/Users/col/Downloads/test.csv", delimiter=',', missing_values=["-999", "-999.0", -999, -999.0])
print(data)

output

[[ 2.02008122e+09  2.86000000e+01]
 [ 2.02008122e+09  2.47000000e+01]
 [ 2.02008122e+09 -9.99000000e+02]
 [ 2.02008122e+09 -9.99000000e+02]
 [ 2.02008122e+09 -9.99000000e+02]]

Why does none of the versions for missing_values catch the -999 in the file and replace them with NaNs or something alike? I feel like this should be simple (and probably already answered somewhere on this website), but I cannot figure it out... Thanks for any help.


Solution

  • There are two types of missing values. One is where the value is represent only by the delimiter. Default fill is nan, but we can define a separate fill:

    In [93]: txt1="""2020081217,28.6
        ...: 2020081218,24.7
        ...: 2020081219,
        ...: 2020081220,
        ...: 2020081221,"""
    In [94]: np.genfromtxt(txt1.splitlines(),delimiter=',',encoding=None)
    Out[94]: 
    array([[2.02008122e+09, 2.86000000e+01],
           [2.02008122e+09, 2.47000000e+01],
           [2.02008122e+09,            nan],
           [2.02008122e+09,            nan],
           [2.02008122e+09,            nan]])
    In [95]: np.genfromtxt(txt1.splitlines(),delimiter=',',encoding=None,filling_val
        ...: ues=999)
    Out[95]: 
    array([[2.02008122e+09, 2.86000000e+01],
           [2.02008122e+09, 2.47000000e+01],
           [2.02008122e+09, 9.99000000e+02],
           [2.02008122e+09, 9.99000000e+02],
           [2.02008122e+09, 9.99000000e+02]])
    

    Your case has a specific string:

    In [96]: txt="""2020081217,28.6
        ...: 2020081218,24.7
        ...: 2020081219,-999.0
        ...: 2020081220,-999.0
        ...: 2020081221,-999.0"""
    

    The other answer suggests using usemask, returning a masked_array:

    In [100]: np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None, missing_values=-999.0, usemask=True)
    Out[100]: 
    masked_array(
      data=[[2020081217.0, 28.6],
            [2020081218.0, 24.7],
            [2020081219.0, --],
            [2020081220.0, --],
            [2020081221.0, --]],
      mask=[[False, False],
            [False, False],
            [False,  True],
            [False,  True],
            [False,  True]],
      fill_value=1e+20)
    

    Looking at the code, I deduce that it's doing a string match, rather than a numeric one. It can also take one value per column (I don't think it does a per-row test):

    In [106]: np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None, 
        missing_values=['2020081217','-999.0'], usemask=True, dtype=None)
    Out[106]: 
    masked_array(data=[(--, 28.6), (2020081218, 24.7), (2020081219, --),
                       (2020081220, --), (2020081221, --)],
                 mask=[( True, False), (False, False), (False,  True),
                       (False,  True), (False,  True)],
           fill_value=(999999, 1.e+20),
                dtype=[('f0', '<i8'), ('f1', '<f8')])
    

    Here I gave it dtype=None, so it returned a structured array.

    missing_values can also be dict, but I haven't figured out what it expects.

    I haven't figured out how to make it replace the missing values with something (such as from the filling_values).

    You do the replace after load

    In [110]: data = np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None)
    In [111]: data
    Out[111]: 
    array([[ 2.02008122e+09,  2.86000000e+01],
           [ 2.02008122e+09,  2.47000000e+01],
           [ 2.02008122e+09, -9.99000000e+02],
           [ 2.02008122e+09, -9.99000000e+02],
           [ 2.02008122e+09, -9.99000000e+02]])
    
    In [114]: data[data==-999] = np.nan
    In [115]: data
    Out[115]: 
    array([[2.02008122e+09, 2.86000000e+01],
           [2.02008122e+09, 2.47000000e+01],
           [2.02008122e+09,            nan],
           [2.02008122e+09,            nan],
           [2.02008122e+09,            nan]])
    

    It looks like genfromtxt constructs a converters from the missing and filling values, but I haven't followed the details. Here's a way of using our converter

    In [138]: converters={1:lambda x: np.nan if x=='-999.0' else float(x)}
    In [139]: data = np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None, 
        converters=converters)
    In [140]: data
    Out[140]: 
    array([[2.02008122e+09, 2.86000000e+01],
           [2.02008122e+09, 2.47000000e+01],
           [2.02008122e+09,            nan],
           [2.02008122e+09,            nan],
           [2.02008122e+09,            nan]])