python classification scikit-multilearn

Resolve ValueError "The number of classes has to be greater than one" in BinaryRelevance


I'm attempting to follow this tutorial on my own dataset. After binarizing my data, I tried to fit scikit-multilearn's BinaryRelevance classifier, but got the error: ValueError: The number of classes has to be greater than one; got 1 class

These are the suggestions I've tried, with links:

  1. Getting rid of categories with only one instance. This took my data from 34 labels to 32. I made sure to get rid of the two columns containing these genres. I also exploded the genres column (from a delimited string to just the genres) so that I could get rid of rows containing the sparsely seen genres.

  2. Since I exploded the column, I could use a stratified train/test split:

train, test = train_test_split(movies, random_state=42, train_size=20000, test_size=1000, shuffle=True, stratify=movies['genre'])

I checked the number of distinct genres in the training split using len(np.unique(train['genre'])), which returned 32.

  3. I checked whether np.unique(y_train) returned 0 and 1, which it did, meaning I do not just have one class.

  4. (EDIT) I also checked the shape of my x_train and y_train and got x_train.shape = (20000, 10000) (10,000 is my max number of parameters) and y_train.shape = (20000, 32).
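
A note on the np.unique(y_train) check: run on the whole matrix, it only shows that 0 and 1 both occur somewhere. Binary Relevance trains one binary classifier per label column, so the narrower question is whether every individual column contains both values. Here is a minimal sketch of that per-column check, assuming y_train is a dense 0/1 NumPy array (a scipy sparse matrix would need .toarray() first). The helper name label_class_report and the toy matrix are made up for illustration.

```python
import numpy as np

def label_class_report(y):
    """Return (single_class_columns, positives_per_label) for a dense 0/1 label matrix."""
    y = np.asarray(y)
    n_rows = y.shape[0]
    positives = y.sum(axis=0)  # number of 1s in each label column
    # A column is degenerate if it is all 0s or all 1s
    single = np.flatnonzero((positives == 0) | (positives == n_rows))
    return single.tolist(), positives.tolist()

# Toy label matrix (made-up data): column 2 is all zeros
y_demo = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0]])
single_cols, positives = label_class_report(y_demo)
print(single_cols)  # [2]
print(positives)    # [2, 2, 0]
```

If the first list comes back empty for the real y_train, every label has both classes in the training split, which would point away from the data and toward the environment.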

I'm beginning to think that the sparser categories are the issue, and not the code. I have over 300,000 rows, but my smallest categories have only 6 instances. Is it just not possible to use Binary Relevance to make predictions with such sparse cases, or is there another potential solution I'm missing?


Solution

  • The issue is with scikit-multilearn. It is not compatible with my version of Python (3.11) and does not integrate well with newer versions of numpy and scipy. Using scikit-multilearn-ng solved this issue.
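
For anyone hitting the same error, it may help to first confirm which Python version and which distributions are actually installed in the environment. The sketch below uses only the standard library. The distribution names are assumed to match the PyPI names (scikit-multilearn, scikit-multilearn-ng), and installed_version is a hypothetical helper name.

```python
import sys
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Python version of the running interpreter, e.g. (3, 11)
print("python:", sys.version_info[:2])

# Distribution names assumed to match their PyPI names
for name in ("numpy", "scipy", "scikit-learn",
             "scikit-multilearn", "scikit-multilearn-ng"):
    print(f"{name}: {installed_version(name)}")
```

A result of None for scikit-multilearn-ng alongside a version for scikit-multilearn would suggest the original package is still the one in use.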