Tags: python, pandas, multilevel-analysis

Pandas: how to deal with data associated with higher levels of hierarchical index?


I have a Pandas dataframe similar to (but much larger than) :

                         dob name                     x                     y
subject_code datapoint                                                       
subject1     0          2000  foo    1.0130617046645218   -0.9197554534168588
             1          2000  foo    1.6268826597606547    0.9235889343439213
             2          2000  foo  -0.27934495719523383   -0.7649336229271105
subject2     3          1980  bar     0.799986663648364   0.17376727917253385
             4          1980  bar   0.05939596855122923    1.6401662246395807
             5          1980  bar   -0.6496807099005506   -0.9591735410133281
             6          1980  bar   -0.6953680580655669  -0.15390936342856965

Which I have read in from a flat CSV file, where all of the subject-level attributes (subject_code, dob and name here) are repeated for every row, using the command df = pd.read_csv(PATH, index_col=["subject_code", "datapoint"]).

Currently the dob and name are repeated in every row, which seems to be bad practice. My questions are twofold:

  1. Is there a way to associate the columns dob and name with the higher level index subject_code, such that they are in the same table but not repeated?
  2. What's the best way to pull out the subject_code, dob and name into a separate, smaller table?

With respect to (2), I've got this far:

# Flag columns whose value is constant within every subject
subject_cols = df.groupby("subject_code")\
                 .nunique()\
                 .max() == 1

# Keep only the names of those columns
subject_cols = subject_cols.index[subject_cols]

with df[subject_cols] giving:

                         dob name
subject_code datapoint           
subject1     0          2000  foo
             1          2000  foo
             2          2000  foo
subject2     3          1980  bar
             4          1980  bar
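For what it's worth, the partial approach above can be finished off by grouping the constant columns and taking the first value per subject. A minimal sketch (the toy frame is rebuilt with arbitrary random values, since the original CSV isn't available):

```python
import pandas as pd
import numpy as np

# Rebuild a toy frame shaped like the one in the question (values are arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "subject_code": ["subject1"] * 3 + ["subject2"] * 4,
        "datapoint": range(7),
        "dob": [2000] * 3 + [1980] * 4,
        "name": ["foo"] * 3 + ["bar"] * 4,
        "x": rng.standard_normal(7),
        "y": rng.standard_normal(7),
    }
).set_index(["subject_code", "datapoint"])

# Flag columns whose value is constant within every subject
subject_cols = df.groupby("subject_code").nunique().max() == 1
subject_cols = subject_cols.index[subject_cols]

# One row per subject: take the first value of each constant column
subject_table = df[subject_cols].groupby("subject_code").first()
print(subject_table)
```

This yields a two-row table indexed by subject_code with just the dob and name columns.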

Solution

  • Running your toy example, I get the following dataframe:

    print(df)
    # Output
                             dob name                     x                     y
    subject_code datapoint
    subject1     0          2000  foo     0.513162983337374  0.007833678673477415
                 1          2000  foo    1.5251357488726798   -0.2107782114549924
                 2          2000  foo    -1.213546898092921   -0.9955113729604301
    subject2     3          1980  bar  -0.19001014005197553    -0.863345548112192
                 4          1980  bar   -0.7446186914220339    -0.388115467600475
                 5          1980  bar  -0.04464842762041622   0.08245820701430223
                 6          1980  bar    0.9529697829289847   -0.7613603252199205
    

    You can accomplish the first task using Pandas groupby and explode:

    # Collect the per-datapoint values into lists so that dob and name become
    # index levels, then explode the lists back into one row per value
    df = df.reset_index().groupby(["subject_code", "dob", "name", "datapoint"]).agg(list)
    
    for col in ["x", "y"]:
        df = df.explode(col)
    
    print(df)
    # Output
                                                         x                     y
    subject_code dob  name datapoint
    subject1     2000 foo  0             0.513162983337374  0.007833678673477415
                           1            1.5251357488726798   -0.2107782114549924
                           2            -1.213546898092921   -0.9955113729604301
    subject2     1980 bar  3          -0.19001014005197553    -0.863345548112192
                           4           -0.7446186914220339    -0.388115467600475
                           5          -0.04464842762041622   0.08245820701430223
                           6            0.9529697829289847   -0.7613603252199205
    
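An equivalent and arguably simpler route, which skips the agg(list)/explode round-trip (and so preserves the original column dtypes), is to append dob and name to the index directly and reorder the levels. A sketch, again rebuilding a toy frame with arbitrary values:

```python
import pandas as pd
import numpy as np

# Toy frame shaped like the one in the question (values are arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "subject_code": ["subject1"] * 3 + ["subject2"] * 4,
        "datapoint": range(7),
        "dob": [2000] * 3 + [1980] * 4,
        "name": ["foo"] * 3 + ["bar"] * 4,
        "x": rng.standard_normal(7),
        "y": rng.standard_normal(7),
    }
).set_index(["subject_code", "datapoint"])

# Append dob and name to the existing MultiIndex, then put the
# levels in the desired order
df2 = (
    df.set_index(["dob", "name"], append=True)
      .reorder_levels(["subject_code", "dob", "name", "datapoint"])
)
print(df2)
```

The result has the same four-level index as the groupby/explode version, with x and y left as the only columns.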

    And the second task with Pandas from_records and reset_index:

    # Rebuild a frame from the subject-level index levels, then deduplicate
    other_df = pd.DataFrame.from_records(
        df.reset_index(level=3).index, columns=df.reset_index(level=3).index.names
    ).drop_duplicates(ignore_index=True)
    
    print(other_df)
    # Output
      subject_code   dob name
    0     subject1  2000  foo
    1     subject2  1980  bar
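Once dob and name live in the index, the same subject table can also be extracted with a plain reset_index and column selection, which some may find more readable. A sketch, assuming df already has the four-level index produced above (here rebuilt with arbitrary values):

```python
import pandas as pd
import numpy as np

# Toy frame with the four-level index from the first step (values are arbitrary)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "subject_code": ["subject1"] * 3 + ["subject2"] * 4,
        "dob": [2000] * 3 + [1980] * 4,
        "name": ["foo"] * 3 + ["bar"] * 4,
        "datapoint": range(7),
        "x": rng.standard_normal(7),
        "y": rng.standard_normal(7),
    }
).set_index(["subject_code", "dob", "name", "datapoint"])

# Pull the subject-level attributes back out of the index and deduplicate
other_df = (
    df.reset_index()[["subject_code", "dob", "name"]]
      .drop_duplicates(ignore_index=True)
)
print(other_df)
```

This gives the same two-row subject table as the from_records version.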