pythonnumpytile

Using numpy tile to group rows by column in a python data frame


I have a dataframe that looks like this.
enter image description here

I would like to add a column "Tile" that looks like this
enter image description here

is numpy.tile() the correct tool for this?

Ultimately what I am trying to do is count the month gaps each person has between the CRP event. So up until there is an event, the event is null. That would be a zero. When the next event happens, the tile would be 1. then the next few events if null would be a 1 until another event happens. if there is another CRP in consecutive order, that would trigger a new tile i.e. 3 if the previous was 2.

EDIT: The tile is supposed to start over per member. I forgot that important detail. So I will try to use the cumsum() with a window.


Solution

  • you can compare column with 'CRP to get True/False and later use .cumsum()
    which will treat True as 1 and False as 0

    Minimal working code:

    import pandas as pd
    
    df = pd.DataFrame({
       'event':[None, None, 'CRP', None, None, None, 'CRP', 'CRP', None, None, None, None]
    })
    print(df)
    
    df['tile'] = (df['event'] == 'CRP').cumsum()
    
    print(df)
    

    Result:

       event  tile
    0   None     0
    1   None     0
    2    CRP     1
    3   None     1
    4   None     1
    5   None     1
    6    CRP     2
    7    CRP     3
    8   None     3
    9   None     3
    10  None     3
    11  None     3
    

    Other idea: because you have only CRP and Null so you can use exanding(1) which will get rows similar to cumsum() (and other cumulative functions) and use .count() which will count not-null values

    df['tile'] = df['event'].expanding(1).count().astype(int)
    

    UPDATE: (after adding details in question)

    If you want it for every member then you can use groupby()

    df.groupby('member')['event'].expanding(1).count().astype(int).values
    

    or

    df.groupby('member')['event'].expanding(1).count().astype(int).reset_index(drop=True)
    

    Minimal working code used for tests:

    import pandas as pd
    
    df = pd.DataFrame({
        'member': ["A"]*12 + ["B"]*12,
        'event': [None, None, 'CRT', None, None, None, 'CRT', 'CRT', None, None, None, None]*2,
    })
    print(df)
    
    # --- old version ---
    
    #df['tile'] = (df['event'] == 'CRT').cumsum()
    #print(df)
    
    #df['tile2'] = df['event'].expanding(1).count().astype(int)
    #print(df)
    
    # --- grouping by memeber ---
    
    df['tile'] = (df['event'] == 'CRT') # create column with True/False
    df['tile'] = df.groupby('member')['tile'].cumsum()  # count True/False in this column
    print(df)
    
    df['tile2'] = df.groupby('member')['event'].expanding(1).count().astype(int).values
    print(df)
    
    df['tile3'] = df.groupby('member')['event'].expanding(1).count().astype(int).reset_index(drop=True)
    print(df)
    

    Result:

       member event  tile
    0       A  None     0
    1       A  None     0
    2       A   CRT     1
    3       A  None     1
    4       A  None     1
    5       A  None     1
    6       A   CRT     2
    7       A   CRT     3
    8       A  None     3
    9       A  None     3
    10      A  None     3
    11      A  None     3
    12      B  None     0
    13      B  None     0
    14      B   CRT     1
    15      B  None     1
    16      B  None     1
    17      B  None     1
    18      B   CRT     2
    19      B   CRT     3
    20      B  None     3
    21      B  None     3
    22      B  None     3
    23      B  None     3