Tags: arrays, numpy, python-import, delimiter, genfromtxt

numpy.genfromtxt: are uneven spaces between columns causing dtype errors?


The data I'm working with can be found at this gist, and looks like:

07-11-2018 18:34:35 -2.001   5571.036 -1.987
07-11-2018 18:34:50 -1.999   5570.916 -1.988

[Image: code and output in Jupyter Notebook]

When calling

TB_CAL_array = np.genfromtxt('calbath_data/TB118192.TXT',
                             skip_header=10,
                             dtype=[("date", "<U10"), ("time", "<U8"), ("bathtemp", "<f8"),
                                    ("SBEfreq", "<f8"), ("SBEtemp", "<f8")])

the output array is:

array([('07-11-2018', '18:34:35', -2.001e+00, 5571.036, -1.987),
   ('07-11-2018', '18:34:50', -1.999e+00, 5570.916, -1.988),

The data is output as a structured ndarray of tuples and is a non-homogeneous array because it contains both strings and floats (see "numpy.genfromtxt produces array of what looks like tuples, not a 2D array: why?").

NOTE: The third column of the output appears to have been treated as something other than the specified dtype: the value should be -2.001, but instead it is displayed as -2.001e+00.

NOTE: The fifth column has the same input format and dtype designation, yet no such transformation occurred there during the genfromtxt call.

The only difference I can find between "bathtemp" and "SBEtemp" is that there are two extra blank spaces after the "bathtemp" column.

However, based on the numpy.genfromtxt I/O documentation, this shouldn't matter, because consecutive whitespace should automatically be treated as a delimiter:

delimiter : str, int, or sequence, optional The string used to separate values. By default, any consecutive whitespaces act as delimiter. An integer or sequence of integers can also be provided as width(s) of each field.

Is the extra whitespace after the "bathtemp" column causing the error? If so, how do I work around it?


Solution

  • With your sample:

    In [136]: txt="""07-11-2018 18:34:35 -2.001   5571.036 -1.987 
         ...: 07-11-2018 18:34:50 -1.999   5570.916 -1.988"""                       
    In [137]: np.genfromtxt(txt.splitlines(), dtype=None, encoding=None)            
    Out[137]: 
    array([('07-11-2018', '18:34:35', -2.001, 5571.036, -1.987),
           ('07-11-2018', '18:34:50', -1.999, 5570.916, -1.988)],
          dtype=[('f0', '<U10'), ('f1', '<U8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<f8')])
    

    and with your dtype:

    In [139]: np.genfromtxt(txt.splitlines(),
         ...:               dtype=[("date", "<U10"), ("time", "<U8"), ("bathtemp", "<f8"),
         ...:                      ("SBEfreq", "<f8"), ("SBEtemp", "<f8")],
         ...:               encoding=None)
    Out[139]: 
    array([('07-11-2018', '18:34:35', -2.001, 5571.036, -1.987),
           ('07-11-2018', '18:34:50', -1.999, 5570.916, -1.988)],
          dtype=[('date', '<U10'), ('time', '<U8'), ('bathtemp', '<f8'), ('SBEfreq', '<f8'), ('SBEtemp', '<f8')])
    

    Values like -2.001e+00 are the same as -2.001; only the display differs. NumPy switches to scientific notation when the range of values in an array is wide enough, or when some values are too small to show well otherwise.

    For example, if I change one of the values to something much smaller:

    In [140]: data = _                                                              
    In [141]: data['bathtemp']                                                      
    Out[141]: array([-2.001, -1.999])
    In [142]: data['bathtemp'][1] *= 0.001                                          
    In [143]: data['bathtemp']                                                      
    Out[143]: array([-2.001e+00, -1.999e-03])
    

    The -2.001 is unchanged (except for its display style).

    My guess is that some of the bathtemp values (that you don't show) are much closer to zero.
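
One way to check that guess, and to change only the display if the exponent form is unwanted, is sketched below. The third data row is made up for illustration (it is not from the original file), and the variable names are arbitrary. `np.printoptions(suppress=True)` asks NumPy to print floats in fixed-point notation; it does not alter the stored values.

```python
import numpy as np

# Two rows from the question plus one HYPOTHETICAL row whose bathtemp is
# close to zero (an assumption, used only to reproduce the display effect).
lines = [
    "07-11-2018 18:34:35 -2.001   5571.036 -1.987",
    "07-11-2018 18:34:50 -1.999   5570.916 -1.988",
    "07-11-2018 18:35:05 -0.0005  5570.800 -1.989",
]

dtype = [("date", "<U10"), ("time", "<U8"), ("bathtemp", "<f8"),
         ("SBEfreq", "<f8"), ("SBEtemp", "<f8")]

# Uneven runs of whitespace are treated as a single delimiter by default.
data = np.genfromtxt(lines, dtype=dtype, encoding=None)
bath = data["bathtemp"]

# The parsed value is the same float as the literal -2.001.
print(bath[0] == -2.001)

# Smallest magnitude in the column; a value far below the others is the
# kind of thing that makes NumPy pick the exponent display.
print(np.abs(bath).min())

# Default display of this column.
default_repr = repr(bath)
print(default_repr)

# Display-only change: request fixed-point output inside this block.
with np.printoptions(suppress=True):
    fixed_repr = repr(bath)
print(fixed_repr)
```

Running the same `np.abs(...).min()` check on the real `TB_CAL_array['bathtemp']` should show whether the file contains such near-zero readings. To change the display for a whole session rather than one block, `np.set_printoptions(suppress=True)` can be used instead of the context manager.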