python pandas dataframe group-by

Does a groupby object in pandas store the contents of the original DataFrame?


I have a DataFrame with three columns named A, B, and C. My goal is to see whether groupby stores a copy of the DataFrame. My test snippet is as follows:

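import pandas as pd

# Make a DataFrame with columns A, B, C (illustrative values).
df = pd.DataFrame({'A': [*'ABC'], 'B': [*'DEF'], 'C': range(3)})
grp = df.groupby(by=['A', 'B'])
del df
print(grp.transform(lambda x: x))  # This line still outputs the data.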

The above snippet seems to indicate that grp contains the DF because the original DF has been deleted and grp can still produce it. Is this conclusion true?

Maybe grp maintains a pointer to the DataFrame, so that after the del operation the reference count does not drop to zero and the data stays in memory for grp to use. Is that what happens?

I am using pandas 2.2.2. Thanks in advance for any clarification.


Solution

  • The original df is referenced by the groupby object's obj attribute (grp.obj below):

    import pandas as pd

    data = {'A': [*'ABC'],
            'B': [*'DEF'],
            'C': range(3)}
    df = pd.DataFrame(data)
    
    grp = df.groupby(by=['A', 'B'])
    

    Output:

    grp.obj
    
       A  B  C
    0  A  D  0
    1  B  E  1
    2  C  F  2
    

    Equality check:

    grp.obj.equals(df)
    # True
    

    To be sure, grp.obj is a reference, not a copy:

    id(df) == id(grp.obj)
    # True
    

    That also means that changes made to df are reflected in grp.obj and can affect the result of groupby.transform. E.g., before any change:

    grp['C'].transform('sum')
    
    0    0
    1    1
    2    2
    Name: C, dtype: int64
    

    But if I change the index values of df, the same call picks up the new index:

    df.index = [5, 10, 15]
    
    grp['C'].transform('sum')
    
    5     0
    10    1
    15    2
    Name: C, dtype: int64
    

    Note that a change in df does not alter the groups:

    grp['C'].value_counts()
    
    A  B  C
    A  D  0    1
    B  E  1    1
    C  F  2    1
    Name: count, dtype: int64
    

    If I change the values of columns ['A', 'B'], I still get:

    df[['A', 'B']] = 'B'
    
    grp['C'].value_counts()
    
    A  B  C
    A  D  0    1
    B  E  1    1
    C  F  2    1
    Name: count, dtype: int64
    

    Not:

    A  B  C
    B  B  0    1
          1    1
          2    1
    Name: count, dtype: int64
    

    Finally, del df only removes the name df, which is one reference to the object. The actual pd.DataFrame stays in memory as long as another reference to it (grp.obj) still exists.
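
The effect of del can be checked directly. The following is a minimal sketch, not part of the original test: it reuses the sample data from above and assumes a weak reference (the name df_ref is purely illustrative) to observe the DataFrame without keeping it alive from our own code.

```python
import weakref

import pandas as pd

df = pd.DataFrame({'A': [*'ABC'], 'B': [*'DEF'], 'C': range(3)})
grp = df.groupby(by=['A', 'B'])

# A weak reference does not count as a reference that keeps the object alive.
df_ref = weakref.ref(df)

del df  # removes the name `df`, not the DataFrame object itself

# The object survives because grp.obj still refers to it.
still_alive = df_ref() is not None
same_object = df_ref() is grp.obj
print(still_alive, same_object)
```

If both values print as True, the DataFrame that grp uses after del df is the very same object that df named before, not a copy made by groupby.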