python random-forest cudf

How To Pass cuDF Dataframe to cuML.ensemble.RandomForestClassifier?


I'm trying to fit data with cuml.ensemble.RandomForestClassifier, but I keep getting the error: "The labels need to be consecutive values from 0 to the number of unique label values"

I'm passing cudf.DataFrame objects into the function. They have the same number of rows but different numbers of columns. The column labels start at 0 and increase by 1 up to the final column (108 in the example below). What am I doing wrong? Below is some code for context, followed by a printout of the dataframes I'm passing in:

clf1 = modelClass(max_depth=D1, random_state=random.randrange(0, 1024, 1), n_bins=15, n_streams=4, split_criterion=criterion, bootstrap=bootstrap, n_estimators=trs1)

clf1.fit(X1, Y1)

X1's dataframe looks like this:

0 1 2
0 1.000000e-11 1.000000e-11 1.647421e-01
1 1.000000e-11 1.000000e-11 1.760000e-02
2 1.000000e-11 1.000000e-11 -1.772000e-01
3 1.000000e-11 1.000000e-11 8.254000e-01
4 1.000000e-11 1.000000e-11 2.587000e-01
... ... ... ...
5402 1.000000e-11 1.000000e-11 1.704444e-01
5403 1.000000e-11 1.000000e-11 -1.860000e-01
5404 0.000000e+00 1.000000e-11 1.229714e-01
5405 1.000000e-11 1.959500e-01 1.984667e-01
5406 1.000000e-11 1.000000e-11 1.000000e-11

[5407 rows x 3 columns]; dtype=('0', dtype('float64')); <cudf.core.dataframe._DataFrameLocIndexer object at 0x7f9c3d0f3070>

Y1's Dataframe looks like this:

0
0 -2.0
1 4.0
2 -3.0
3 1.0
4 0.0
... ...
5402 0.0
5403 -2.0
5404 0.0
5405 0.0
5406 0.0

[5407 rows x 1 columns]; dtype=('0', dtype('float64')); <cudf.core.dataframe._DataFrameLocIndexer object at 0x7f9c1b847b50>

System Information: Ubuntu 20.04, Titan RTX, CUDA 11.5, Rapids 21.12 built-in Conda, Python 3.8


Solution

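  • It turned out that Y1 has to be label-encoded before it is passed to fit. The "labels" in the error message are the class values in Y1, not the column names, and Y1 holds values such as -2.0, 4.0 and -3.0 rather than consecutive values starting at 0:

    import cuml.preprocessing

    # Map each distinct value in Y1 to a consecutive label starting at 0
    enc = cuml.preprocessing.LabelEncoder()
    
    Y1 = enc.fit_transform(Y1)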
    

    Shoutout to @beckernick for helping me out with this!
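
To make it clearer what the encoding step changes, here is a minimal CPU-only sketch in plain Python. It is not cuML's implementation; the helper name encode_labels is hypothetical, and the sample values are the first five rows of Y1 from the question. It assumes the encoder assigns codes in sorted order of the distinct values, which is the usual convention for label encoders.

```python
def encode_labels(labels):
    """Map each distinct label to a consecutive integer code starting at 0.

    Hypothetical helper for illustration only; not part of cuML.
    """
    classes = sorted(set(labels))                      # distinct labels, ascending
    index_of = {value: code for code, value in enumerate(classes)}
    codes = [index_of[value] for value in labels]      # one code per input row
    return codes, classes


# First five rows of Y1 from the question
y = [-2.0, 4.0, -3.0, 1.0, 0.0]

codes, classes = encode_labels(y)

# Codes are what the classifier is trained on; classes lets you map
# predicted codes back to the original label values afterwards.
decoded = [classes[code] for code in codes]

print(codes)
print(classes)
print(decoded)
```

The same idea applies to predictions: the model returns the encoded values, so they have to be mapped back through the stored classes to recover labels such as -2.0 or 4.0.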