I'm trying to replicate the IBM SPSS function @SINCE using Python and Pandas, but unfortunately, I've got stuck in a part of my process. Is there a direct function that replicates IBM SPSS CLEM @SINCE using python?
Here is the link for more info: Link
IMB @SINCE function description
"This function returns the offset of the last record where this condition was true--that is, the number of records before this one in which the condition was true. If the condition has never been true, @SINCE returns @INDEX + 1." (IBM, 2020)
I've been trying to replicate this function from scratch, but I haven't find the right way to do it:
How do I do this with pandas and Python?
Here the problem,
My data looks like this:
+------+----------+
| Type | Flag |
+------+----------+
| d | |
+------+----------+
| A | myStatus |
+------+----------+
| c | |
+------+----------+
| B | myStatus |
+------+----------+
| c | |
+------+----------+
| c | myStatus |
+------+----------+
| c | |
+------+----------+
| d | |
+------+----------+
| d | |
+------+----------+
| A | myStatus |
+------+----------+
In IBM SPSS I use this formula to get this data:
if Type = 'A' or Type = 'B' then @SINCE(Flag = 'myStatus') else -1 endif
And this is the output:
+------+----------+----------------+
| Type | Flag | Expected Count |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| A | myStatus | 0 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| B | myStatus | 2 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| c | myStatus | -1 |
+------+----------+----------------+
| c | | -1 |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| d | | -1 |
+------+----------+----------------+
| A | myStatus | 4 |
+------+----------+----------------+
So, I found the way to fix this problem: here is the code:
df = pd.DataFrame({"Type":["d", "A", "c", "B", "c", "c", "c", "d", "d", "A"],
"Flag":[np.nan, "myStatus", np.nan, "myStatus", np.nan, "myStatus", np.nan, np.nan, np.nan, "myStatus"]})
Function to solve the problem:
def spssSince(df):
df_temp = df
df_temp = df[df.Flag=="myStatus"]
df_temp['last_ind'] = df_temp.index
df_temp['last_ind'] = df_temp.last_ind.shift(1)
df_temp['last_ind'] = df_temp['last_ind'].fillna(1)
df_temp["Expected Count"] = df_temp.index - df_temp.last_ind
df_temp.loc[~df_temp.Type.isin(["A", "B"]), "Expected Count"] = -1
DFreturn = pd.merge(left=df, right=df_temp.drop(['Type', 'Flag', 'last_ind'], axis=1), how="left", left_index=True, right_index=True)
DFreturn["Expected Count"] = DFreturn["Expected Count"].fillna(-1)
return DFreturn
Basically the function calculates the last SINCE value from a condition, calculating the actual index among the index that has the validation (using shift()).