pythonpandascsvfile

Convert TSV file data to a dataframe, which can be pushed to database


We have TSV files which holds IOT data, want to convert to table like structure using pandas. I have worked on TSV data, similar to given below, were the logics goes like

  1. read the file
  2. Add new column names
  3. do transpose
  4. reindex

This is bit challenging as explained, col1 to col3 is dimension data and remaining is fact data

tsv file data looks as below

col1 qweqweq
col2 345435
col3 01/01/2024 35:08:09
col4 1
col5 0
col4 0
col5 0
col4 1
col5 1
col4 0
col5 1

Want to project as table like structure

col1 col2 col3 col4 col5
qweqweq 345435 01/01/2024 35:08:09 1 0
qweqweq 345435 01/01/2024 35:08:09 0 0
qweqweq 345435 01/01/2024 35:08:09 1 1
qweqweq 345435 01/01/2024 35:08:09 0 1

col4 and col5 can differ in each IOT file. How to achieve with python, pandas?


Solution

  • Assuming you can rely on "col1" to define the groups, you can use a pivot after de-duplicating the rows with cumsum and groupby.cumcount, and groupby.ffill:

    df = (pd.read_csv('input_file.tsv', sep='\t', header=None)
            .assign(index=lambda x: x[0].eq('col1').cumsum(),
                    n=lambda x: x.groupby(['index', 0]).cumcount())
            .pivot(index=['index', 'n'], columns=0, values=1)
            .groupby(level='index').ffill()
            .reset_index(drop=True).rename_axis(columns=None)
         )
    

    Output:

          col1    col2                 col3 col4 col5
    0  qweqweq  345435  01/01/2024 35:08:09    1    0
    1  qweqweq  345435  01/01/2024 35:08:09    0    0
    2  qweqweq  345435  01/01/2024 35:08:09    1    1
    3  qweqweq  345435  01/01/2024 35:08:09    0    1
    

    Reproducible input:

    import io
    
    input_file = io.StringIO('''col1\tqweqweq
    col2\t345435
    col3\t01/01/2024 35:08:09
    col4\t1
    col5\t0
    col4\t0
    col5\t0
    col4\t1
    col5\t1
    col4\t0
    col5\t1''')
    

    Intermediates:

    # before pivot
           0                    1  index  n
    0   col1              qweqweq      1  0
    1   col2               345435      1  0
    2   col3  01/01/2024 35:08:09      1  0
    3   col4                    1      1  0
    4   col5                    0      1  0
    5   col4                    0      1  1
    6   col5                    0      1  1
    7   col4                    1      1  2
    8   col5                    1      1  2
    9   col4                    0      1  3
    10  col5                    1      1  3
    
    # before the cleanup-step:
    0           col1    col2                 col3 col4 col5
    index n                                                
    1     0  qweqweq  345435  01/01/2024 35:08:09    1    0
          1  qweqweq  345435  01/01/2024 35:08:09    0    0
          2  qweqweq  345435  01/01/2024 35:08:09    1    1
          3  qweqweq  345435  01/01/2024 35:08:09    0    1