I have a Pandas dataframe similar to (but much larger than) this:
dob name x y
subject_code datapoint
subject1 0 2000 foo 1.0130617046645218 -0.9197554534168588
1 2000 foo 1.6268826597606547 0.9235889343439213
2 2000 foo -0.27934495719523383 -0.7649336229271105
subject2 3 1980 bar 0.799986663648364 0.17376727917253385
4 1980 bar 0.05939596855122923 1.6401662246395807
5 1980 bar -0.6496807099005506 -0.9591735410133281
6 1980 bar -0.6953680580655669 -0.15390936342856965
I have read this in from a flat CSV file, where all of the subject-level attributes (like subject_code, dob and name here) are repeated for every row, using the command df = pd.read_csv(PATH, index_col=["subject_code", "datapoint"]).
Currently the dob and name are repeated in every row, which seems to be bad practice. My questions are twofold:

1. Can I associate dob and name with the higher-level index subject_code, such that they are in the same table but not repeated?
2. Or should I break subject_code, dob and name out into a separate, smaller table?

With respect to (2), I'm currently this far:
subject_cols = df.groupby("subject_code") \
                 .nunique() \
                 .max() == 1
subject_cols = subject_cols.index[subject_cols]
with df[subject_cols] giving:
dob name
subject_code datapoint
subject1 0 2000 foo
1 2000 foo
2 2000 foo
subject2 3 1980 bar
4 1980 bar
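To make this step self-contained, here is a runnable sketch of the same column-selection idea on a small toy dataframe (the values and the final groupby("subject_code").first() step are my own illustration, not taken from the question):

```python
import pandas as pd

# Toy data mirroring the structure above (values are illustrative)
df = pd.DataFrame(
    {
        "subject_code": ["subject1"] * 3 + ["subject2"] * 4,
        "datapoint": range(7),
        "dob": [2000] * 3 + [1980] * 4,
        "name": ["foo"] * 3 + ["bar"] * 4,
        "x": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
        "y": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6],
    }
).set_index(["subject_code", "datapoint"])

# Columns whose value is constant within each subject_code group
subject_cols = df.groupby("subject_code").nunique().max() == 1
subject_cols = subject_cols.index[subject_cols]

# One row per subject: take the first (constant) value per group
subjects = df[subject_cols].groupby("subject_code").first()
print(subjects)
```

Using groupby(...).first() rather than drop_duplicates() keeps one row per subject_code even if two different subjects happen to share the same dob and name.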
Starting from the following dataframe, which I get by running your toy example:
print(df)
# Output
dob name x y
subject_code datapoint
subject1 0 2000 foo 0.513162983337374 0.007833678673477415
1 2000 foo 1.5251357488726798 -0.2107782114549924
2 2000 foo -1.213546898092921 -0.9955113729604301
subject2 3 1980 bar -0.19001014005197553 -0.863345548112192
4 1980 bar -0.7446186914220339 -0.388115467600475
5 1980 bar -0.04464842762041622 0.08245820701430223
6 1980 bar 0.9529697829289847 -0.7613603252199205
You can accomplish the first task using pandas groupby, agg and explode:
df = df.reset_index().groupby(["subject_code", "dob", "name", "datapoint"]).agg(list)
for col in ["x", "y"]:
    df = df.explode(col)
print(df)
# Output
x y
subject_code dob name datapoint
subject1 2000 foo 0 0.513162983337374 0.007833678673477415
1 1.5251357488726798 -0.2107782114549924
2 -1.213546898092921 -0.9955113729604301
subject2 1980 bar 3 -0.19001014005197553 -0.863345548112192
4 -0.7446186914220339 -0.388115467600475
5 -0.04464842762041622 0.08245820701430223
6 0.9529697829289847 -0.7613603252199205
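As an aside, the same reshaping can usually be done without the agg(list)/explode round trip by moving the columns straight into the index with set_index. A minimal sketch, assuming a toy dataframe in the same shape as the question (values illustrative):

```python
import pandas as pd

# Toy data shaped like the question's dataframe
df = pd.DataFrame(
    {
        "subject_code": ["subject1"] * 2 + ["subject2"] * 2,
        "datapoint": [0, 1, 2, 3],
        "dob": [2000, 2000, 1980, 1980],
        "name": ["foo", "foo", "bar", "bar"],
        "x": [0.1, 0.2, 0.3, 0.4],
        "y": [1.0, 1.1, 1.2, 1.3],
    }
).set_index(["subject_code", "datapoint"])

# Move dob and name out of the columns and into the index in one step
reshaped = df.reset_index().set_index(
    ["subject_code", "dob", "name", "datapoint"]
)
print(reshaped)
```

This produces the same four-level index without materializing intermediate lists, which matters on a much larger dataframe.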
And the second task with Pandas from_records and reset_index:
idx = df.reset_index(level="datapoint").index  # drop the datapoint level
other_df = pd.DataFrame.from_records(
    idx, columns=idx.names
).drop_duplicates(ignore_index=True)
print(other_df)
# Output
subject_code dob name
0 subject1 2000 foo
1 subject2 1980 bar
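If you later need the subject-level attributes alongside the measurements again, a left merge on subject_code recombines the two tables. A sketch with hypothetical split tables mirroring other_df and the measurement data above:

```python
import pandas as pd

# Hypothetical subject table, as produced by the drop_duplicates step
subjects = pd.DataFrame(
    {
        "subject_code": ["subject1", "subject2"],
        "dob": [2000, 1980],
        "name": ["foo", "bar"],
    }
)

# Hypothetical measurement table keyed by subject_code and datapoint
measurements = pd.DataFrame(
    {
        "subject_code": ["subject1", "subject1", "subject2"],
        "datapoint": [0, 1, 2],
        "x": [0.1, 0.2, 0.3],
        "y": [1.0, 1.1, 1.2],
    }
)

# Re-attach the subject attributes to every measurement row
combined = measurements.merge(subjects, on="subject_code", how="left")
print(combined)
```

This is the usual payoff of splitting into two tables: the subject attributes are stored once, and the join reconstructs the denormalized view only when needed.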