pythonpandasdataframereshapepy-datatable

Python datatable/pandas reshaping problem


I need to reshape my df.

This is my input df:

import pandas as pd
import datatable as dt

DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
             date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
             type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
             value=[1, 2, 3, 4, 5, 6, 7, 8])

   | name   date        type  value
-- + -----  ----------  ----  -----
 0 | name1  2021-01-01  a         1
 1 | name1  2021-01-02  b         2
 2 | name1  2021-01-03  a         3
 3 | name1  2021-01-04  b         4
 4 | name2  2021-01-05  b         5
 5 | name2  2021-01-06  a         6
 6 | name2  2021-01-07  b         7
 7 | name2  2021-01-08  a         8

This is the desired output df:

DF_out = dt.Frame(name=['name1', 'name1', 'name2', 'name2'],
              date_a=['2021-01-01', '2021-01-03', '2021-01-06', '2021-01-08'],
              date_b=['2021-01-02', '2021-01-04', '2021-01-07', None],
              value_a=[1, 3, 6, 8],
              value_b=[2, 4, 7, None])

   | name   date_a      date_b      value_a  value_b
-- + -----  ----------  ----------  -------  -------
 0 | name1  2021-01-01  2021-01-02        1        2
 1 | name1  2021-01-03  2021-01-04        3        4
 2 | name2  2021-01-06  2021-01-07        6        7
 3 | name2  2021-01-08  NA                8       NA

If necessary the datatable Frames can be converted into a pandas DataFrame:

DF_in = DF_in.to_pandas()

Transformation:

I hope this explanation is understandable.

Thank you in advance


Solution

  • Thank you all very much for your answers. In the meantime I developed a solution that uses only datatable package a uses some workarounds for the current limitations:

    1. define a function to create id for adjacent rows: 1,1,2,2,...
    2. create column id that contains row index
    3. get id of rows to be deleted as list
    4. subtract row id's to be deleted from all row id's
    5. subset the Frame based on the remaining row id's
    6. get number of rows per group
    7. use the function for each group and use the number of rows as input, create a list with all results (same length as Frame after subset). Bind this to the Frame
    8. create two subset Frames based on column type ('a' or 'b')
    9. join df2 on df1

    code:

    import math
    import datatable as dt
    from datatable import dt, f, by, update, join
    
    DF_in = dt.Frame(name=['name1', 'name1', 'name1', 'name1', 'name2', 'name2', 'name2', 'name2'],
                     date=['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04', '2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08'],
                     type=['a', 'b', 'a', 'b', 'b', 'a', 'b', 'a'],
                     value=[1, 2, 3, 4, 5, 6, 7, 8])
    
    
    
    def group_id(n):
        l = [x for x in range(0, math.floor(n / 2))]
        l = sorted(l * 2)
        if n % 2 != 0:
            try:
                l.append(l[-1] + 1)
            except IndexError:
                l.append(0)
        return l
    
    
    DF_in['id'] = range(DF_in.nrows)
    first_row = f.id==dt.min(f.id)
    row_eq_b = dt.first(f.type)=="b"
    remove_rows = first_row & row_eq_b
    DF_in[:, update(remove_rows = ~remove_rows), 'name']
    DF_in = DF_in[f[-1]==1, :-1]
    group_count = DF_in[:, {"Count": dt.count()}, by('name')][:, 'Count'].to_list()[0]
    group_id_column = []
    
    for x in group_count:
        group_id_column = group_id_column + group_id(x)
    
    DF_in['group_id'] = dt.Frame(group_id_column)
    df1 = DF_in[f.type == 'a', ['name', 'date', 'value', 'group_id']]
    df2 = DF_in[f.type == 'b', ['name', 'date', 'value', 'group_id']]
    
    df2.key = ['name', 'group_id']
    DF_out = df1[:, :, join(df2)]
    DF_out.names = {'date': 'date_a', 'value': 'value_a', 'date.0': 'date_b', 'value.0': 'value_b'}
    
    DF_out[:, ['name', 'date_a', 'date_b', 'value_a', 'value_b']]
    
       | name   date_a      date_b      value_a  value_b
    -- + -----  ----------  ----------  -------  -------
     0 | name1  2021-01-01  2021-01-02        1        2
     1 | name1  2021-01-03  2021-01-04        3        4
     2 | name2  2021-01-06  2021-01-07        6        7
     3 | name2  2021-01-08  NA                8       NA