I have a DF with three columns named A, B, and C. My goal is to see whether groupby stores a copy of the DF. My test code snippet is as follows:
# Make Df with columns A, B, C.
grp = df.groupby(by=['A', 'B'])
del df
print(grp.transform(lambda x: x)) # This line outputs the whole DF.
The above snippet seems to indicate that grp contains the DF, because the original DF has been deleted and grp can still produce it. Is this conclusion true?
Maybe grp maintains a pointer to the DF, and after the del operation the reference count does not go to zero, so the data hangs around in memory for grp to use. Can this be true?
My pandas version is 2.2.2. Thanks in advance for any clarification.
The original df gets referenced in groupby.obj:
import pandas as pd

data = {'A': [*'ABC'],
        'B': [*'DEF'],
        'C': range(3)}
df = pd.DataFrame(data)
grp = df.groupby(by=['A', 'B'])
Output:
grp.obj
   A  B  C
0  A  D  0
1  B  E  1
2  C  F  2
Equality check:
grp.obj.equals(df)
# True
To be sure, grp.obj is a reference, not a copy:
id(df) == id(grp.obj)
# True
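The same check can be written with the `is` operator, which compares object identity directly. A minimal sketch, assuming the same example data as above (the name same_object is only for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [*'ABC'], 'B': [*'DEF'], 'C': range(3)})
grp = df.groupby(by=['A', 'B'])

# `is` tests identity: both names point at one and the same DataFrame object.
same_object = grp.obj is df
print(same_object)
```

If groupby had stored a copy instead, equals() could still return True while the identity check would be False.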
That also means that any changes to df would be reflected in grp.obj and affect the result of groupby.transform. E.g.:
grp['C'].transform('sum')
0    0
1    1
2    2
Name: C, dtype: int64
But if I change the index values of df:
df.index = [5, 10, 15]
grp['C'].transform('sum')
5     0
10    1
15    2
Name: C, dtype: int64
Note that a change in df does not alter the groups:
grp['C'].value_counts()
A  B  C
A  D  0    1
B  E  1    1
C  F  2    1
Name: count, dtype: int64
If I change the values of columns ['A', 'B'], I still get:
df[['A', 'B']] = 'B'
grp['C'].value_counts()
A  B  C
A  D  0    1
B  E  1    1
C  F  2    1
Name: count, dtype: int64
Not:
A  B  C
B  B  0    1
      1    1
      2    1
Name: count, dtype: int64
Finally, if you do del df, this only removes one reference (the name df). The actual pd.DataFrame still exists in memory as long as the other reference to it (grp.obj) still exists.
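One way to observe this is with a weak reference, which does not keep the object alive by itself. A minimal sketch, assuming the same example data (the names ref, alive_after_del and same_as_grp_obj are only for illustration):

```python
import weakref

import pandas as pd

df = pd.DataFrame({'A': [*'ABC'], 'B': [*'DEF'], 'C': range(3)})
grp = df.groupby(by=['A', 'B'])

# A weak reference lets us look at the object without keeping it alive.
ref = weakref.ref(df)

# Remove the name `df`; grp.obj still holds a strong reference.
del df

alive_after_del = ref() is not None   # the DataFrame was not freed
same_as_grp_obj = ref() is grp.obj    # and it is the object grp points to
print(alive_after_del, same_as_grp_obj)
```

Had grp.obj been the only thing keeping a separate copy, the weak reference to the original would be dead after del df; here it is still alive and identical to grp.obj.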