python xgboost xgbclassifier

xgboost with categorical data - parser error


I'm trying to train an XGBoost model that also has categorical variables. I'd like to avoid one-hot encoding, and I saw that this is now possible using enable_categorical=True. I formatted my dataframe, but when I try to create the DMatrix I get the error below. I've also attached a very simple example that reproduces the error.

import xgboost as xgb
import numpy as np
import pandas as pd

test = pd.DataFrame({'out': ["a","b"],'features': [np.array(["house","horse","something", "NA" ]), np.array(["house","NA","NA", "NA" ]) ]})

X_train = test['features'].to_json()
y_train = test['out'].to_json()

xgb.DMatrix(X_train, label=y_train)

Then I get this warning/error:

[14:32:11] WARNING: ../src/data/data.cc:868: No format parameter is provided in input uri.  Choosing default parser in dmlc-core.  Consider providing a uri parameter like: filename?format=csv
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 620, in inner_f
    return func(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 743, in __init__
    handle, feature_names, feature_types = dispatch_data_backend(
  File "/usr/local/lib/python3.8/site-packages/xgboost/data.py", line 964, in dispatch_data_backend
    return _from_uri(data, missing, feature_names, feature_types)
  File "/usr/local/lib/python3.8/site-packages/xgboost/data.py", line 880, in _from_uri
    _check_call(_LIB.XGDMatrixCreateFromFile(c_str(data),
  File "/usr/local/lib/python3.8/site-packages/xgboost/core.py", line 279, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [14:32:11] ../src/data/data.cc:874: Encountered parser error:
[14:32:11] ../dmlc-core/src/io/local_filesys.cc:86: LocalFileSystem.GetPathInfo: {"0":["house","horse","something","NA"],"1":["house","NA","NA","NA"]} error: No such file or directory
Stack trace:
  [bt] (0) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83a293) [0x7f01eb5ea293]
  [bt] (1) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x83c13c) [0x7f01eb5ec13c]
  [bt] (2) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x822929) [0x7f01eb5d2929]
  [bt] (3) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x822e1e) [0x7f01eb5d2e1e]
  [bt] (4) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x812ca6) [0x7f01eb5c2ca6]
  [bt] (5) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x81312e) [0x7f01eb5c312e]
  [bt] (6) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x7f2210) [0x7f01eb5a2210]
  [bt] (7) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x7d4141) [0x7f01eb584141]
  [bt] (8) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x214294) [0x7f01eafc4294]


Stack trace:
  [bt] (0) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x20b233) [0x7f01eafbb233]
  [bt] (1) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0xfc7ad) [0x7f01eaeac7ad]
  [bt] (2) /usr/local/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(XGDMatrixCreateFromFile+0xdf) [0x7f01eaef762f]
  [bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.7(+0x6d1d) [0x7f0248f58d1d]
  [bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.7(+0x6289) [0x7f0248f58289]
  [bt] (5) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(_ctypes_callproc+0x336) [0x7f0248f75477]
  [bt] (6) /usr/local/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so(+0xae9e) [0x7f0248f70e9e]
  [bt] (7) /usr/local/bin/../lib/libpython3.8.so.1.0(_PyObject_MakeTpCall+0x87) [0x7f024f004437]
  [bt] (8) /usr/local/bin/../lib/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x41f7) [0x7f024f031d07]

Does anyone have suggestions on how it can be solved? Is the format ok?

-----
numpy               1.23.5
pandas              2.0.2
xgboost             1.7.5
-----
Python 3.8.16 (default, May 23 2023, 14:26:40) [GCC 10.2.1 20210110]
Linux-5.10.104-linuxkit-x86_64-with-glibc2.2.5
-----

EDIT: I didn't give much context in my original question, but I'd prefer not to split the features into different columns because of the data itself. The features do not appear in any fixed order, so the same feature could be in column1 in one row and in columnN in another. I imagined this problem could be overcome if all the features were part of the same array. Could something similar be achieved for categorical values? I tried X_train = np.vstack(test['features'].apply(lambda x: x.astype('category') )) but then I get the error: ValueError: could not convert string to float: 'chain' when building my DMatrix. Is training on an array achievable?


Solution

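  • The data parameter of xgb.DMatrix accepts several input types, such as a file path, a NumPy array, a SciPy sparse matrix or a pandas DataFrame, but not a JSON string. A plain string is treated as a file path (URI), which is why the traceback goes through _from_uri and ends with "No such file or directory".

    So you have to pass a DataFrame or a NumPy array, and its values have to be numeric (or of category dtype). Try something like: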

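    X_train = (pd.DataFrame(np.vstack(test['features']))
                 .replace('NA', np.nan)                  # mark 'NA' as missing
                 .add_prefix('feat_')                    # feat_0, feat_1, ...
                 .apply(lambda x: pd.factorize(x)[0]))   # integer codes per column, NaN -> -1
    y_train = pd.factorize(test['out'])[0]               # 'a' -> 0, 'b' -> 1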
    
    dmat = xgb.DMatrix(X_train, label=y_train)
    

    Output:

    >>> dmat
    <xgboost.core.DMatrix at 0x7f225b6483d0>
    
    >>> X_train
       feat_0  feat_1  feat_2  feat_3
    0       0       0       0      -1
    1       0      -1      -1      -1
    
    >>> y_train
    array([0, 1])
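
If you would rather use the native categorical handling that the question mentions than plain integer codes, a minimal sketch might look like the following. It assumes the columns are converted to the pandas category dtype before being handed to XGBoost. The helper names to_categorical_frame and to_dmatrix are hypothetical, and the xgboost import is deferred into the helper so that the pandas part can be run on its own. Note that each array position still becomes its own column here, so the position of a value still matters.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame({
    'out': ["a", "b"],
    'features': [np.array(["house", "horse", "something", "NA"]),
                 np.array(["house", "NA", "NA", "NA"])],
})

def to_categorical_frame(features):
    """Hypothetical helper: one 'category' column per array position."""
    wide = pd.DataFrame(np.vstack(features))
    wide = wide.where(lambda d: d != 'NA')       # 'NA' placeholders become missing
    return wide.add_prefix('feat_').astype('category')

def to_dmatrix(X, y):
    """Hypothetical helper: build a DMatrix with categorical support switched on."""
    import xgboost as xgb  # imported here so the preparation step runs without it
    return xgb.DMatrix(X, label=y, enable_categorical=True)

X_cat = to_categorical_frame(test['features'])
y_codes = pd.factorize(test['out'])[0]
print(X_cat.dtypes)

# With xgboost installed:
# dmat = to_dmatrix(X_cat, y_codes)
```

As far as I know, categorical support in XGBoost 1.7 is still marked experimental and training on such a DMatrix needs a histogram-based tree method (for example tree_method='hist'), so check the documentation for your version.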
    

    EDIT:

    I'd prefer not to split features into different columns because of the data itself. The reason for this is that the way features appears is not ordered, so I could have the same feature in column1 at times and columnN in other cases

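    XGBoost is a decision-tree based algorithm, so each feature needs its own column. Instead of pd.factorize, you can use pd.get_dummies to build one indicator column per distinct value, which makes the position of a value inside the array irrelevant: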

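    # One row per (sample, value) pair, with the 'NA' placeholders dropped
    values = test['features'].explode().loc[lambda x: x != 'NA']
    # One 0/1 column per distinct value, collapsed back to one row per sample
    X_train = pd.get_dummies(values).astype(int).groupby(level=0).max()
    print(X_train)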
    
    # Output
       horse  house  something
    0      1      1          1
    1      0      1          0
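
Because the indicator columns are derived from whatever values occur in the data, a new dataset can end up with a different set of columns than the one used for training. A minimal sketch of one way to keep them aligned, assuming a hypothetical helper multi_hot and a vocabulary taken from the training columns (the new Series and its 'castle' value are made up for illustration):

```python
import numpy as np
import pandas as pd

def multi_hot(features, vocabulary=None, missing='NA'):
    """Hypothetical helper: one 0/1 column per distinct value, order-independent."""
    exploded = features.explode()
    exploded = exploded[exploded != missing]
    dummies = pd.get_dummies(exploded).astype(int).groupby(level=0).max()
    # Restore samples whose values were all missing.
    dummies = dummies.reindex(features.index, fill_value=0)
    if vocabulary is not None:
        # Fix the column set: unseen values are dropped, absent ones become 0.
        dummies = dummies.reindex(columns=vocabulary, fill_value=0)
    return dummies

train = pd.Series([np.array(["house", "horse", "something", "NA"]),
                   np.array(["house", "NA", "NA", "NA"])])
new = pd.Series([np.array(["NA", "horse", "NA", "castle"]),
                 np.array(["NA", "NA", "NA", "NA"])])

X_train = multi_hot(train)
X_new = multi_hot(new, vocabulary=list(X_train.columns))
print(X_new)
```

Dropping values that were not seen during training is a design choice made for this sketch; depending on the data, you may prefer to collect them in a separate "other" column instead.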