pythonpandasdataframeindexingstringio

get index per ID - from example data using text input to pandas dataframe? easily reproducible


Text can be used as input to pandas dataframes to make easily reproducible examples for testing solutions.1

import pandas as pd
from io import StringIO

txt= """
ID,datetime,value
AB-CL-34,07/10/2022 10:00:00,5 
AB-CL-34,07/10/2022 11:15:10,7 
AB-CL-34,09/10/2022 15:30:30,13 
BX-RT-55,06/10/2022 11:30:22,0 
BX-RT-55,10/10/2022 22:44:11,1 
BX-RT-55,10/10/2022 23:30:22,6 
"""

df = pd.read_csv(StringIO(txt), parse_dates=[1], dayfirst=True)

How do I index df such that there is an integer index according to each ID, keeping the ID variable?

imagined output

        ID            datetime value
0 AB-CL-34 07/10/2022 10:00:00     5 
0 AB-CL-34 07/10/2022 11:15:10     7 
0 AB-CL-34 09/10/2022 15:30:30    13 
1 BX-RT-55 06/10/2022 11:30:22     0 
1 BX-RT-55 10/10/2022 22:44:11     1 
1 BX-RT-55 10/10/2022 23:30:22     6 

edit ID values are now chr/str, not int, as they should have been (apologies to Marcus_CH.


Solution

  • You could try:

    df = pd.read_csv(StringIO(txt),\
                 parse_dates=[1],\
                 dayfirst=True)\
                 .assign(id_index= lambda df: df\
                 .groupby('ID', sort=False).ngroup())\
                 .set_index("id_index")\
                 .rename_axis(index=None)
    

    Output

    +----+----------+---------------------+---------+
    |    |       ID |            datetime |   value |
    |----+----------+---------------------+---------|
    |  0 | AB-CL-34 | 2022-10-07 10:00:00 |       5 |
    |  0 | AB-CL-34 | 2022-10-07 11:15:10 |       7 |
    |  0 | AB-CL-34 | 2022-10-09 15:30:30 |      13 |
    |  1 | BX-RT-55 | 2022-10-06 11:30:22 |       0 |
    |  1 | BX-RT-55 | 2022-10-10 22:44:11 |       1 |
    |  1 | BX-RT-55 | 2022-10-10 23:30:22 |       6 |
    +----+----------+---------------------+---------+
    

    References:

    1. indexing, please have a look at the official reference. This is a way better explanation as I could do.
    2. index=None to remove the index name (try it without this statement)
    3. lambda is an anonymous function. You could also do something like df["id_col"] = df["ID"] - 1. lambda makes it easier and faster.
    4. the brackets are so that you can do line breaks without \ for linebreak (so it's just for styling)