pythonpandascsv

Adding leading zeros to data columns when loading from CSV files using pandas


I have a script to load and combine time-series data from two .csv files that have the same base filename (specified as a path using Pathlib), but different suffixes. Given that the importing of the data from the .csv files is working correctly, I include here a minimal working example with data provided using StringIO:

import pandas as pd
from io import StringIO

def load_data():
    headers_0 = ['a', 'b', 'c']  # Headers for first file. May have more entries than columns in file
    headers_1 = ['d', 'e']       # Headers for second file.

    data_0 = pd.read_csv(
        StringIO(
          """
            0 1 2
            3 4 5
            6 7 8
          """
          )
        , header=None, delim_whitespace=True)
    data_0.columns = headers_0[0:data_0.shape[1]]

    data_1 = pd.read_csv(
        StringIO(
          """
            A B
            C D
          """
          )
        , header=None, delim_whitespace=True)
    data_1.columns = headers_1[0:data_1.shape[1]]

    data = data_0.join(data_1)
    data.fillna(0, inplace=True)

    print(data)

Thus far, I have only been using load_data for datasets where both data_0 and data_1 have the same length of columns (same length of time-series). However, I am now encountering a situation where data_1 has a shorter column length than data_0; this is because the data in data_1 only starts getting recorded at some later time than data_0.

How do I use pandas to fill the columns of data_1 with leading zeros, such that the column length in both data_0 and data_1 is the same? I believe that the line data.fillna(0, inplace=True) is filling the length mismatch with trailing zeros; is there an obvious way to change this to leading zeros? Note that I do not know the length of either dataset a priori, so I would appreciate help towards a solution that works based on the length of the data loaded using pandas. Running the example above gives an output

   a  b  c  d  e
0  0  1  2  A  B
1  3  4  5  C  D
2  6  7  8  0  0

which is evidently not the desired effect (the zeros in columns d and e would be in row 0, rather than 2).

I have tried different options for DataFrame.fillna such as method=backfill, but none of these attempts have yielded the expected result.


Solution

  • Here's one approach:

    import pandas as pd
    from io import StringIO
    
    def load_data(filename):
        
        headers_0 = ['a', 'b', 'c']
        headers_1 = ['d', 'e']
    
        # using StringIO for mre 
        # + `sep="\s+"` for `delim_whitespace=True` (deprecated since 2.2.0)
        data_0 = pd.read_csv(StringIO(c_0), header=None, sep="\s+")
        data_0.columns = headers_0[0:data_0.shape[1]]
    
        data_1 = pd.read_csv(StringIO(c_1), header=None, sep="\s+")
        data_1.columns = headers_1[0:data_1.shape[1]]
    
        data = (
            pd.concat(
                [df[::-1].reset_index(drop=True) for df in [data_0, data_1]], 
                axis=1
            )[::-1]
            .reset_index(drop=True)
            .fillna(0)
        )
    
        return data
    
    c_0 = """0 1 2
    3 4 5
    6 7 8"""
    
    c_1 = """A B
    C D
    """
    
    load_data('c')
    
       a  b  c  d  e
    0  0  1  2  0  0
    1  3  4  5  A  B
    2  6  7  8  C  D
    

    Will work regardless of which one is longer.


    Another option could be to use df.shift with fill_value:

    def load_data2(filename):
        
        # `data_0` and `data_1` same as above
    
        diff = len(data_0) - len(data_1)
        
        data = pd.concat([data_0, data_1], axis=1)
        if diff > 0:
            data[data_1.columns] = data[data_1.columns].shift(diff, fill_value=0)
        elif diff < 0:
            data[data_0.columns] = data[data_0.columns].shift(abs(diff), fill_value=0)
        
        return data
    

    The nice thing about this option is that it will only fill NaN values in the shifted area. E.g.:

    c_0 = """0 1 2
    3 4 5
    6 7 8"""
    
    c_1 = """A B
    C 
    """
    
    load_data2('c')
    
       a  b  c  d    e
    0  0  1  2  0    0
    1  3  4  5  A    B
    2  6  7  8  C  NaN # with my `load_data` above, this `NaN` gets filled too!
    

    Of course, if you want all NaN values filled anyway, just use fillna here as well.