Tags: python, pandas, cython, chained-assignment, future-warning

Cython and Pandas ChainedAssignmentError Issue: Handling Reference Count Discrepancies


I am using Cython (installed via pip) to speed up my pandas operations.

However, I encounter a ChainedAssignmentError due to a discrepancy in reference counts between Cython and standard Python code. When running the following Cythonized script:

import cython
import pandas as pd
import sys

def main():
    df = pd.DataFrame({"A": [1, 2, 3]})
    df["A"] = df["A"].astype(object)
    return df

I get this warning:

FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.

The same code runs without any warning in regular (uncompiled) Python.

I can hide the warning but that's not ideal because it will break when python 3.0 is released.

The issue stems from how pandas detects chained assignment.

pandas checks the reference count of the object being assigned to, but Cythonized code holds one fewer reference by default.

You can see this by calling sys.getrefcount(df) in Cythonized vs. non-Cythonized code.

You can also see it by adding the following line before the call to astype; the extra reference makes the code run without raising the warning:

references = [df]
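
The effect of that extra reference can be checked in plain Python. This is only a minimal sketch: the names before and after are made up for illustration, and the absolute values reported by sys.getrefcount differ between interpreters and between compiled and uncompiled code, so only the difference between the two counts is meaningful.

```python
import sys
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# Count before holding an extra reference. The number itself depends on
# the interpreter; only the change below matters.
before = sys.getrefcount(df)

# Holding df in a list adds exactly one more reference to the same object.
references = [df]
after = sys.getrefcount(df)

print(before, after)
```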

Does anyone know a way a - fix this internally b - alert the pandas team since their user guide says they support Cython

To reproduce

  1. Install Cython
pip install Cython
  2. Create a setup.py file (with the script above saved as run.pyx)
from setuptools import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("run.pyx")
)
  3. Generate the Cythonized file
python setup.py build_ext --inplace
  4. Run the file using
import run
run.main()

Solution

  • For reference, ChainedAssignmentError exists to warn you about cases like this:

    df["A"][0:3] = 10
    

    where you're essentially doing some_temp[0:3] = 10, which currently changes df but won't in the future.
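
For comparison, the usual way to avoid that pattern is a single indexing step with .loc, so that no temporary object is involved. A minimal sketch, reusing the three-row frame from the question; note that .loc slices by label and includes the end label, so 0:2 covers the same three rows as [0:3] above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})

# One indexing operation on df itself: rows labelled 0 through 2
# (inclusive) in column "A". No intermediate Series is assigned to.
df.loc[0:2, "A"] = 10

print(df["A"].tolist())
```

This is unrelated to the code in the question, which is a plain column assignment and not a chained one.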

    I can hide the warning but that's not ideal because it will break when python 3.0 is released.

    1. Pandas 3.0, not Python 3.0 (probably just a typo).
    2. It's a non-issue. The reference counting check is just what triggers the warning - it doesn't actually affect the behaviour.

    Does anyone know a way a - fix this internally

    I don't think there's a way to fix it except by silencing the warning.
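
If you do silence it, a targeted filter is probably safer than ignoring every FutureWarning. A minimal sketch using the standard warnings module, assuming the warning is raised as a FutureWarning whose message starts with "ChainedAssignmentError" (as in the output quoted in the question); the helper name silence_chained_assignment_warning is made up for illustration.

```python
import warnings


def silence_chained_assignment_warning():
    # "message" is a regular expression matched against the start of the
    # warning text, so other FutureWarnings are still shown.
    warnings.filterwarnings(
        "ignore",
        message="ChainedAssignmentError",
        category=FutureWarning,
    )


# Demonstration with hand-made warnings (no pandas needed): only the
# warning that does not match the filter is recorded.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_chained_assignment_warning()
    warnings.warn(
        "ChainedAssignmentError: behaviour will change in pandas 3.0!",
        FutureWarning,
    )
    warnings.warn("some other deprecation", FutureWarning)

print([str(w.message) for w in caught])
```

Call the helper once before importing or running the compiled module. The category and message may differ in other pandas versions, so check the exact text of the warning you see.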

    b - alert the pandas team [...]

    They know.


    The other point to make is: this kind of code isn't the sort of code that Cython will accelerate much because it's just a bunch of Python-style calls to Pandas. The examples in their documentation for using Cython are very much about fast access to individual array elements. So for the code you show, you may not be achieving much by compiling it in Cython. It's possible you have other code that does benefit though.