python pandas rapids cudf

What is the most efficient way to do `diff` on a `cudf` DataFrame?


The rapids.ai cudf type is somewhat compatible with pandas, but here is a strange incompatibility: cudf.Series has a .diff() method, but cudf.DataFrame does not appear to. This is super-annoying (consider, for example, a data frame of stock prices, with columns corresponding to instruments). There are, of course, kludgy ways to get around this (converting to a pandas data frame and back comes to mind), but I wonder what the canonical way is. Any advice?


Solution

  • cuDF Python covers a large segment of the pandas API, but there are some gaps (as you've run into here).

    Today, the easiest way to run diff on every column and return a dataframe would be the following:

    cudf.DataFrame({col: df[col].diff() for col in df.columns})
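For anyone who wants to try this end to end, here is a minimal sketch of the same per-column approach wrapped in a helper. The helper name `diff_columns`, the tickers `AAA` and `BBB`, and the prices are made up for illustration. The pandas fallback is only there so the snippet also runs on a machine without a GPU, and it is not part of the cuDF answer itself. It may also be worth checking the docs for your cuDF version, since newer releases may provide `DataFrame.diff` directly.

```python
try:
    import cudf as xdf  # GPU path; requires a working RAPIDS install
except ImportError:
    import pandas as xdf  # assumption: CPU fallback so the sketch still runs


def diff_columns(df, periods=1):
    """Apply Series.diff to each column and reassemble a DataFrame.

    Hypothetical helper: it only wraps the dict comprehension above.
    """
    return xdf.DataFrame({col: df[col].diff(periods) for col in df.columns})


# Made-up prices, one column per instrument.
prices = xdf.DataFrame(
    {
        "AAA": [10.0, 11.0, 13.0],
        "BBB": [20.0, 19.5, 19.0],
    }
)

changes = diff_columns(prices)
print(changes)
```

The first row of every column is null, as with `Series.diff`, because there is no previous row to subtract. With cuDF the data stays on the GPU the whole time, so this avoids the round trip through pandas mentioned in the question.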