pythonpandasdataframenumpyshift

python pandas Problem repeating the previous value


is my code

import pandas as pd

columns1 = ['Student ID', 'Course ID', 'Marks']
data1 = [(1, 10, 100), (2, 400, 200), (3, 30, 300), (3, 30, 300), (3, 30, 300), (3, 30, 300), (3, 30, 300), (3, 30, 300)]
df1 = pd.DataFrame(data1, columns=columns1)
Student ID Course ID Marks
1 10 100
2 400 200
3 30 300
3 30 300
3 30 300
3 30 300
3 30 300
3 30 300
df1['s']  = np.where((df1['Course ID']  > df1['Marks'])  == True, df1['Student ID'],  df1['s'].shift(1)) 
df1
Student ID Course ID Marks s
1 10 100 NaN
2 400 200 2
3 30 300 2
3 30 300 NaN
3 30 300 NaN
3 30 300 NaN
3 30 300 NaN
3 30 300 NaN

As you can see, only the information of two rows has changed and the rest are null. This is the result I expect because after column 2 condition "df1['Course ID'] > df1['Marks']" is true

Student ID Course ID Marks s
1 10 100 NaN
2 400 200 2
3 30 300 2
3 30 300 2
3 30 300 2
3 30 300 2
3 30 300 2
3 30 300 2

Thank you for your help


Solution

  • If you want to assign the Student ID for rows matching the df1['Course ID'] > df1['Marks'] condition, and for other rows take the previous value, use ffill:

    df1['s'] = (df1['Student ID']
                .where(df1['Course ID'] > df1['Marks'])
                .ffill()
                .convert_dtypes() # optional
               )
    

    Output (with a slightly different input):

       Student ID  Course ID  Marks     s
    0           1         10    100  <NA>
    1           2        400    200     2
    2           3         30    300     2
    3           3         30    300     2
    4           3        400    300     3
    5           3         30    300     3
    6           3         30    300     3
    7           3         30    300     3
    

    If you only want to apply this logic per Student ID, which might make more sense to avoid "leaking" values from one student to another rather use groupby.ffill:

    df1['s'] = (df1['Student ID']
                .where(df1['Course ID'] > df1['Marks'])
                .groupby(df1['Student ID']).ffill()
                .convert_dtypes() # optional
               )
    

    Or:

    df1['s'] = (df1['Student ID']
                .where(df1['Course ID'].gt(df1['Marks'])
                       .groupby(df1['Student ID']).cummax())
                .convert_dtypes() # optional
               )
    

    Output:

       Student ID  Course ID  Marks     s
    0           1         10    100  <NA>
    1           2        400    200     2
    2           3         30    300  <NA>
    3           3         30    300  <NA>
    4           3        400    300     3
    5           3         30    300     3
    6           3         30    300     3
    7           3         30    300     3