pandas numpy cudf

Searching the index of a cuDF DataFrame doesn't work with NumPy


I loaded a CSV file with cuDF (RAPIDS AI) to reduce the load time. An issue comes up when I try to find the rows that match a condition such as df['X'] == A.

Here is my code example:

import cudf, io, requests
import numpy as np
df = cudf.read_csv('fileA.csv')

# X is an existing column
# A is the value to look for
df['X'] = np.where(df['X'] == A, 1, 0)

# With pandas, this finds the rows where df['X'] is equal to the value A
# and sets them to 1; every other row is set to 0.

However, an error is shown like this:

if len(cond) != len(self):
    raise ValueError("""Array conditional must be same shape as self""")
input_col = self._data[self.name]

ValueError: Array conditional must be same shape as self

I don't see why this happens, since I've never had this issue with pandas.


Solution

  • cuDF is trying to dispatch from numpy.where to cupy.where via NumPy's __array_function__ protocol. For one reason or another, cuDF is not able to run the dispatched function successfully in this case.

    In general, the recommendation is to use CuPy explicitly rather than NumPy here.

    import cudf
    import cupy as cp
    A = 2
    df = cudf.DataFrame({"X": [0, 1, 2]})
    df['X'] = cp.where(df['X'] == A, 1, 0)
    df
       X
    0  0
    1  0
    2  1
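
If only a 0/1 flag is needed, another option is to skip where() altogether and cast the boolean mask to an integer dtype. The following is a minimal sketch, not taken from the answer above; it assumes the same example column and value, and the fallback to pandas is only an assumption added so that the snippet also runs on a machine without a GPU.

```python
# Prefer cuDF when it is installed; fall back to pandas so the sketch
# also runs without a GPU (the mask-and-cast idiom is the same in both).
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

A = 2  # example value, as in the answer above
df = xdf.DataFrame({"X": [0, 1, 2]})

# df["X"] == A is a boolean Series; casting it to an integer dtype
# maps True to 1 and False to 0 without dispatching to any where().
df["X"] = (df["X"] == A).astype("int32")
print(df)
```

This only covers the case where the two replacement values are 1 and 0. For arbitrary replacement values, cp.where as shown above is the more general tool.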