pythonpandasreplacesuppression

Python - Complementary suppression of 2nd/lowest value row-wise


I'm working on a data suppression script in python where I need to 1) suppress small values (between 1 and 5) and 2) make sure that there are at least 2 values suppressed at the smallest level of aggregation. I've done the first step, replacing small values with -1 (which I'll later recode to "s"). And I created a new helper column that counts how many suppressed values there are per row ('sup_cnt'). That yields something like this:

Subgroup   cat1    cat2    cat3    sup_cnt
Group1      0      -1       0       1
Group2     -1      22       6       1
Group3     -1      14      -1       2
Group4     -1      -1       0       2

data = {'group':['group1','group2','group3','group4'],'cat1':[0,-1,-1,-1],'cat2':[-1,22,14,-1],'cat3':[0,0,-1,0],'sup_cnt':[1,1,2,3]}
df = pd.DataFrame(data)

So for Group1 and Group2, which only have one value suppressed, I want a second value -- the lowest (including zeroes) -- to be replaced with -1. In Group1, one of the zeroes would be replaced; in Group2, 6 would be replaced. So the result would be like this:

Subgroup   cat1    cat2    cat3    sup_cnt
Group1     -1      -1       0       1
Group2     -1      22      -1       1
Group3     -1      14      -1       2
Group4     -1      -1       0       2

If there are more than one columns with the same lowest value (like with Group1, which has 2 zeroes), I only want one of those to be replaced (doesn't matter which).

Originally started this in R and switched to python/pandas (but I'm new to pandas). My idea was to write a function that takes the cat values as arguments, determines the minimum non-negative integer among those, loops through the data columns in a row and replaces the first instance of that min value in the row, then breaks. Not sure if that's the right approach though (or exactly how to carry it out). Any ideas?


Solution

  • I hope I've understood your question right:

    def fn(x):
        cols = x.filter(regex=r"^cat")
        x = cols[cols >= 0].sort_values()[: 2 - x["sup_cnt"]]
        df.loc[x.name, x.index] = -1
    
    
    df[df.sup_cnt < 2].apply(fn, axis=1)
    print(df)
    

    Prints:

      Subgroup  cat1  cat2  cat3  sup_cnt
    0   Group1    -1    -1     0        1
    1   Group2    -1    22    -1        1
    2   Group3    -1    14    -1        2
    3   Group4    -1    -1     0        2