multilabel-classification, multiclass-classification

How to determine whether a dataset is a multiclass or a multilabel problem


I am really confused by the multiclass and multilabel terminology. I have a dataset like the one below:

fe_1, fe_2, lb_1, lb_2
12, 34, A, A
34, 56, C, C
...

The requirement is that within lb_1 or lb_2, A and C cannot happen at the same time, and between lb_1 and lb_2, combinations like A,A or C,C can show up, which means my final label should be one of [A,A], [A,C], [C,A], [C,C]. I think this should be a multiclass-multilabel problem; my colleague is more confident that it is a multiclass problem only, while the sklearn package detects it as multilabel-indicator. My questions are:

  1. If it's a multilabel problem, can the model know which A or C belongs to which label column? Will it accidentally make A and C in lb_1, or A and C in lb_2, happen at the same time?

  2. For multilabel problems, label columns are allowed to be active at the same time and label values can only be 0/1, where the 0/1 means "present or not", right? What if the label columns have more than two distinct values overall, where values within a column cannot happen at the same time? For example, I replace A,C in lb_2 with B,D, and I need to make sure that A,C in lb_1 and B,D in lb_2 cannot happen at the same time, which means a row of data should have one of 4 possible outcomes {[A,B],[A,D],[C,B],[C,D]}. This time the sklearn package detects a multiclass-multioutput problem (the snippet after these questions shows what sklearn reports in both cases). If that's true, can multiclass-multioutput make sure that lb_1 and lb_2 show up at the same time, and that within lb_1 or within lb_2 different labels cannot happen at the same time?

  3. If multiclass-multioutput is what I need, can anyone suggest a way to deal with the imbalance issue in this multiclass-multioutput dataset? SMOTE Tomek Link seems to only handle 1D arrays, I have no idea how to use ADASYN, and MLSMOTE also doesn't work if I reassign my original lb_2 labels from [A,C] to [B,D].
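For reference, here is a minimal example of the sklearn detection I am referring to (assuming the A/C labels are binarised to 0/1 in the first case, and left as strings in the second):

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Two label columns, A/C encoded as 0/1 -> sklearn sees a multilabel indicator matrix
y_two_values = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(type_of_target(y_two_values))   # 'multilabel-indicator'

# Two label columns with more than two distinct labels overall (A/C vs B/D)
y_four_values = np.array([["A", "B"], ["A", "D"], ["C", "B"], ["C", "D"]])
print(type_of_target(y_four_values))  # 'multiclass-multioutput'
```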

Appreciate your help and patience!


Solution

  • From what I understood, you have two outputs (L1, L2) per sample whose labels can be: (A, A), (A, C), (C, A), (C, C). sklearn defines this as multilabel classification (2 or more outputs, where each output is binary).

    Each output takes on one class from A or C (binary classification). The multilabel aspect is that each sample has two binary labels assigned to it. L1 can only predict one class at a time, because sklearn will see that it's binary. Same for L2 - it can only predict one class for each sample. The outputs from the model could be AA, AC, CA, CC. If you fit the model on data where the first column is L1 and the second column is L2, then the model's outputs will follow that order - i.e. the first output will be for L1, and the second will be for L2.

    L1 can't predict two classes at the same time. So if you call .predict(), you might get L1=A, L2=C, for example. You could also get the underlying probabilities using .predict_proba(): L1=[0.7, 0.3], L2=[0.01, 0.99]. (A minimal sketch of this setup is the first one at the end of this answer.)

    In your case, if AA and CC are not valid combinations and you want to strictly prohibit them, you may reframe the problem as simple binary classification, where the only two allowed outputs are 0, representing "AC", and 1, representing "CA". You'd need to delete the invalid L1-L2 combinations, map the two valid combinations to 0 ("AC") and 1 ("CA"), and fit a binary classifier to this new binary target (the second sketch at the end of this answer shows this).


    Alternatively, if you don't want to delete the (A, A) and (C, C) samples and would rather keep the multilabel setup, some options for minimising AA and CC predictions are below.

    My initial thought is that the model will adapt to your data... so if your data reflects those constraints, a good model will mostly behave in the desired way.

    One way to influence the model in sklearn is to use the sample_weight= parameter when fitting. You can set the weight of the (A, A) and (C, C) samples to zero, causing the model to ignore them, so it will only be influenced by the (A, C) and (C, A) labels (the third sketch at the end of this answer shows this).

    A potential alternative is to use a framework like PyTorch, where you can penalise the model heavily for (A, A) and (C, C) predictions, pushing it away from making such predictions, though this won't guarantee their absence (the last sketch below outlines this).
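A minimal sketch of the multilabel setup described above, assuming the A/C labels are encoded as 0/1 and using a RandomForestClassifier purely for illustration (any estimator that supports multilabel targets would do):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data mirroring the question: two features, two binary outputs (lb_1, lb_2),
# with A encoded as 0 and C encoded as 1
X = np.array([[12, 34], [34, 56], [20, 40], [50, 60]])
Y = np.array([[0, 0],   # A, A
              [1, 1],   # C, C
              [0, 1],   # A, C
              [1, 0]])  # C, A

clf = RandomForestClassifier(random_state=0).fit(X, Y)

# One prediction per output column, in the same order as the columns of Y
print(clf.predict(X[:2]))

# predict_proba returns one (n_samples, n_classes) array per output
for name, proba in zip(["lb_1", "lb_2"], clf.predict_proba(X[:2])):
    print(name, proba)
```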
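A sketch of the binary reframing, assuming a pandas DataFrame with the fe_1, fe_2, lb_1, lb_2 layout from the question (the toy values and the LogisticRegression choice are made up for illustration):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy frame mirroring the fe_1, fe_2, lb_1, lb_2 layout
df = pd.DataFrame({
    "fe_1": [12, 34, 20, 50, 11, 45],
    "fe_2": [34, 56, 40, 60, 30, 55],
    "lb_1": ["A", "C", "A", "C", "A", "C"],
    "lb_2": ["C", "A", "A", "C", "C", "A"],
})

# Drop the combinations to be prohibited ((A, A) and (C, C)) ...
valid = df[df["lb_1"] != df["lb_2"]].copy()

# ... and collapse the two remaining combinations into a single binary target
y = (valid["lb_1"] == "C").astype(int)   # 0 -> (A, C), 1 -> (C, A)
X = valid[["fe_1", "fe_2"]]

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))
```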
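A sketch of the sample_weight approach, zeroing out the weight of rows where lb_1 and lb_2 agree (again assuming the 0/1 encoding and toy values above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[12, 34], [34, 56], [20, 40], [50, 60], [11, 30], [45, 55]])
Y = np.array([[0, 0], [1, 1], [0, 1], [1, 0], [0, 1], [1, 0]])  # 0 = A, 1 = C

# Zero weight for the (A, A) and (C, C) rows so they do not influence training
weights = np.where(Y[:, 0] == Y[:, 1], 0.0, 1.0)

clf = RandomForestClassifier(random_state=0)
clf.fit(X, Y, sample_weight=weights)
print(clf.predict(X))
```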
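Finally, a rough PyTorch sketch of the penalty idea: treat each output as the probability of C, and add a term that punishes probability mass on the (A, A) and (C, C) combinations. The tiny network, toy data, and penalty weight are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

# Tiny two-output network: each output is the logit for "C" in lb_1 / lb_2
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 2))
bce = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

X = torch.tensor([[12., 34.], [34., 56.], [20., 40.], [50., 60.]])
Y = torch.tensor([[0., 1.], [1., 0.], [0., 1.], [1., 0.]])  # only (A, C) / (C, A) rows

penalty_weight = 5.0  # hypothetical strength of the "no (A, A) / no (C, C)" penalty

for _ in range(200):
    opt.zero_grad()
    logits = model(X)
    p = torch.sigmoid(logits)  # p[:, 0] = P(lb_1 = C), p[:, 1] = P(lb_2 = C)
    # probability mass the model puts on (C, C) plus (A, A)
    agree = p[:, 0] * p[:, 1] + (1 - p[:, 0]) * (1 - p[:, 1])
    loss = bce(logits, Y) + penalty_weight * agree.mean()
    loss.backward()
    opt.step()
```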