[SOLVED] rate of change between 2 matching values (across many different rows)

rate of change between 2 matching values (across many different rows)

I have a pandas dataframe that looks like:

  A    B    count
---   ---   -----
  X    A1      10
  Y    A1       5
  X    A2       1
  Y    A2      50
  X    A3       2
  Y    A4      90

There are only 2 different possible values in column A, but there are many different values for column B. I am interested in the percent change of matching values of column B between X and Y rows.

So, for example, in the above, since A1 appears in column B for both X and Y rows, and the value goes from 10 to 5, the percent change for A1 is 100* (5-10)/10) = -50%

Also note that the rows with B=A3 and B=A4 only occur a single time, so they should be reported in some special way (since the percent change formula doesn't work with a single value).

From looking at the pct_change function it seems like it expects more than just 2 different values in column A. And I'm new to pandas, so not sure about good ways of comparing values of B across the entire frame.

I could iterate over it in python, create a dict for X rows and a dict for Y rows, then do the caclucation for each entry (and report entries that are only in X or only in Y), but was hoping there is some panda-fu that will churn it out more quickly.

Solution

Code

Since you want the values compared for each item anyway, pivoting seems better. You could use the pivot function, but it throws an error on bad data where there are more than 2 rows in the combination of A and B, so I used the pivot_table function.

out = (df.pivot_table(values='count', index='B', columns='A', aggfunc='sum')
         .pct_change(axis=1, fill_method=None)['Y']
)

out

B
A1   -0.50
A2   49.00
A3     NaN
A4     NaN
Name: Y, dtype: float64

The code above uses a sum when there are more than 2 rows(becuz column name is count), but you can change this depending on how you want to handle duplicate data.