python, machine-learning, scikit-learn, weka, naivebayes

WEKA's performance on a nominal dataset


I used WEKA for classification, with the breast cancer dataset that ships in WEKA's data folder. The dataset is a nominal dataset. The .arff file can be found at this link.

I did classification using the Naive Bayes classifier and received a classification report with metrics such as accuracy, precision, recall, ROC, and others.

I am familiar with scikit-learn, the Python package. I know that when the input features are nominal, we need to convert them into numerical values using a label encoder or some other encoding technique; only after that can we perform classification.
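
For example, something along these lines (a rough sketch, reusing a couple of the dataset's category values as toy inputs):

    # Sketch: nominal features must be encoded as numbers before a
    # scikit-learn classifier will accept them (an ordinal encoder here).
    from sklearn.preprocessing import OrdinalEncoder
    from sklearn.naive_bayes import MultinomialNB

    X = [["40-49", "premeno"], ["50-59", "ge40"]]
    y = ["recurrence-events", "no-recurrence-events"]

    X_encoded = OrdinalEncoder().fit_transform(X)  # strings -> non-negative floats
    MultinomialNB().fit(X_encoded, y)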

All of those machine learning methods do some kind of mathematics in the background to produce the prediction result.

Therefore, I am confused: how can any classifier in WEKA give us prediction results on a nominal dataset?


Solution

  • TL;DR: when designing software, complexity will exist somewhere.

    scikit-learn assumes the user can write code to handle complexity; WEKA assumes the user can explain complexity with metadata.


    The 2009 WEKA "update" publication describes some of the design motivations behind the software:

    4.1 Core

    .... "Another addition to the core of WEKA is the 'Capabilities' meta-data facility. This framework allows individual learning algorithms and filters to declare what data characteristics they are able to handle. This, in turn, enables WEKA's user interfaces to present this information and provide feedback to the user about the applicability of a scheme for the data at hand."

    In other words, it is assumed that the user can describe the attributes of the data. The breast-cancer dataset includes extensive annotation about the variables (age, menopause, ..., Class) and the ordinal values that they may take ('10-19', '20-29', ...):

    @relation breast-cancer
    @attribute age {'10-19','20-29','30-39','40-49','50-59','60-69','70-79','80-89','90-99'}
    @attribute menopause {'lt40','ge40','premeno'}
    ...
    @attribute 'Class' {'no-recurrence-events','recurrence-events'}
    @data
    '40-49','premeno','15-19','0-2','yes','3','right','left_up','no','recurrence-events'
    '50-59','ge40','15-19','0-2','no','1','right','central','no','no-recurrence-events'
    

    This provides an adequate amount of context for what the inputs look like, which in turn implies which methods are appropriate.
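
    As a side note, that declared metadata is machine-readable outside of WEKA too. Here is a minimal sketch in Python (it assumes the .arff file has been saved locally as "breast-cancer.arff"); scipy's ARFF reader exposes the declared attribute types:

    # Sketch: parse the ARFF header and inspect the declared attribute types.
    # Assumes the file shown above is saved locally as "breast-cancer.arff".
    from scipy.io import arff

    data, meta = arff.loadarff("breast-cancer.arff")

    print(meta.names())  # ['age', 'menopause', ..., 'Class']
    print(meta.types())  # ['nominal', 'nominal', ..., 'nominal']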


    The 2013 scikit-learn API Design publication does not explicitly rule out ordinal string inputs like this. Nonetheless, the core API design principle of "Consistency" suggests some constraints.

    Consider this:

    # Attempt to fit directly on nominal (string-valued) features,
    # with no encoding step in between.
    from sklearn.naive_bayes import MultinomialNB

    clf = MultinomialNB()
    clf.fit([["a", "b"], ["b", "a"], ["a", "a"]], [0, 0, 1])


    In scikit-learn==1.2.0, this produces an error:

    ValueError: dtype='numeric' is not compatible with arrays of bytes/strings.
                Convert your data to numeric values explicitly instead.
    

    One could imagine a version of Multinomial Naive Bayes where this code does not raise an error. However, it would be inconsistent if some estimators allowed these inputs while others did not. What if we were to apply Logistic Regression to this data? Should we one-hot encode the values or ordinal encode them? It's best to leave that detail to the user.
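
    For instance, a user-side version of that decision could look roughly like this (a sketch reusing the toy data above; none of it is prescribed by scikit-learn):

    # Sketch: the same nominal data, explicitly encoded in two different ways.
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression

    X = [["a", "b"], ["b", "a"], ["a", "a"]]
    y = [0, 0, 1]

    # Choice 1: one-hot encode, then fit Multinomial Naive Bayes
    # (the encoder returns a sparse matrix, which MultinomialNB accepts).
    X_onehot = OneHotEncoder().fit_transform(X)
    MultinomialNB().fit(X_onehot, y)

    # Choice 2: ordinal-encode, then fit a logistic regression.
    X_ordinal = OrdinalEncoder().fit_transform(X)
    LogisticRegression().fit(X_ordinal, y)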

    The authors briefly address this as a "data representation" problem, and (perhaps) suggest WEKA is an interesting alternative model:

    2.2 Data representation

    "In scikit-learn, we chose a representation of data that is as close as possible to the matrix representation: datasets are encoded as NumPy multidimensional arrays .... While these may seem rather unsophisticated data representations when compared to more object-oriented constructs, such as the ones used by Weka (Hall et al., 2009), they bring the prime advantage of allowing us to rely on efficient NumPy and SciPy vectorized operations while keeping the code short and readable."