pandasdataframelong-format-datawide-format-data

Converting Pandas dataframe from long to wide format


I have a Pandas dataframe that records the performance of students in exams in different classes and it looks like:

Class_ID   Class_size   Student_Number   IQ   Hours   Score   Place
1          3            1                101  10      98      1
1          3            2                99   19      80      3
1          3            3                130  3       95      2
2          5            1                93   5       50      5
2          5            2                103  9       88      3
2          5            3                112  12      99      2
2          5            4                200  10      100     1
2          5            5                90   19      78      4
3          2            1                100  12      84      2
3          2            2                102  13      88      1

and I would like to convert the above long format into wide format, using Student_Number as columns and Class_ID, Class_size as index, with the last column Top being the student who comes first in that class, so the desire outcome looks like:

Class_ID Class_size IQ_1 IQ_2 IQ_3 IQ_4 IQ_5 Hours_1 Hours_2 Hours_3 Hours_4 Hours_5 Score_1 Score_2 Score_3 Score_4 Score_5 Top
1        1          101  99   130  NaN  NaN  10      19      3       NaN     NaN     98      80      95      NaN     NaN     1
2        5          93   103  112  200  90   5       9       12      10      19      50      88      99      100     78      4
3        2          100  102  NaN  NaN  NaN  12      13      NaN     NaN     NaN     84      88      NaN     NaN     NaN     2

And here is what I have tried:

out = df.pivot_table(index=['Class_ID', 'Class_size'],
                     columns='Student_Number',
                     values=['IQ', 'Hours', 'Score'])
out.columns = [f'{x}_{y}' for x,y in out.columns]
out_dummy = out.reset_index()
df_wide = out_dummy

However, I have no idea how to create the last column Top. And also, the above code seems to be rather slow, and since my original dataframe is quite huge (~300,000rows), I would like to ask is there any quicker way to do that too. Thank you so much in advance.


Solution

  • A simple method using apply:

    out["Top"] = out.apply(lambda row: 1 + np.argmax(row["Score_1":"Score_5"]) , axis=1)