pythonpandasloops

Counter that starts over for a specific variable


I'm writing a script to identify clusters of data in one dimension (depth) defined by a user specified gap in depth for multiple oil wells. The script looks at the difference in depth (d_dep) and then uses a cumsum() to number unique cluster_id. I want to modify the cluster_id to start over at "1" for each new occurrence of a well (id). My "for" loop in the list of IDs is not working. It seems like it should be a simple solution, but I seem to be missing something.

# Create Data
data = {'id': [1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3],
        'depth': [1,2,10,11,70, 0,20,75,150,155, 0,100,105,200,210],
        'temp':  [1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3],
        'surfT': [9,9,9,9,9,9,9,9,9,9,9,9,9,9,9]}

# Create DataFrame
df = pd.DataFrame(data)
   
# SORT by id and depth
df.sort_values(by=['id', 'depth'], inplace=True, ascending=True)
# df.head(20)
   
# Create a list of unique IDs
IDs = df['id'].unique()
print('Unique IDs: ', IDs)

# IDENTIFY DATA CLUSTERS BY DEPTH
#===============================================================================================
#1) Calculate the difference in depth between samples, and the first occurence of id is zero
df['d_dep']= np.abs(np.where(df['id'].shift(1) != df['id'], 0, df['depth'].shift(1) - df['depth']))

#2) Flag clusters by depth using the gap threshold
gap= 50 # Gap Theshold

# Flag clusters by the gap threshold, where the d_dep > gap or the first observation in id
df['cluster'] = np.where((df['d_dep']>=gap) | (df['id'].shift(1) != df['id']), 1, 0)

# Assign unique numeric id for clusters of data based on the gap threshold
for i in IDs:
    df['cluster_id'] = df['cluster'].cumsum()
    
display(df)

I tried:

for i in IDs:
    df['cluster_id'] = df['cluster'].cumsum()

And I get the following Table:

Output of code

The cluster_id for the first id is correct, but for id numbers 2 and 3, the first cluster_id should be 1 and start counting over again.


Solution

  • If I understand what you require you can use:

    df['cluster_id'] = df.groupby('id')['cluster'].transform('cumsum')
    

    which gives:

        id  depth  temp  surfT  d_dep  cluster  cluster_id
    0    1      1     1      9    0.0        1           1
    1    1      2     1      9    1.0        0           1
    2    1     10     1      9    8.0        0           1
    3    1     11     1      9    1.0        0           1
    4    1     70     1      9   59.0        1           2
    5    2      0     2      9    0.0        1           1
    6    2     20     2      9   20.0        0           1
    7    2     75     2      9   55.0        1           2
    8    2    150     2      9   75.0        1           3
    9    2    155     2      9    5.0        0           3
    10   3      0     3      9    0.0        1           1
    11   3    100     3      9  100.0        1           2
    12   3    105     3      9    5.0        0           2
    13   3    200     3      9   95.0        1           3
    14   3    210     3      9   10.0        0           3