I would like to get a regression output instead of classification. For instance: instead of n classes I want a floating-point output value from 0 to 1.
Here is the minimalistic example from the package's GitHub page:
import spacy
from spacy.util import minibatch
import random
import torch

is_using_gpu = spacy.prefer_gpu()
if is_using_gpu:
    torch.set_default_tensor_type("torch.cuda.FloatTensor")

nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names)  # ["sentencizer", "trf_wordpiecer", "trf_tok2vec"]
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
for label in ("POSITIVE", "NEGATIVE"):
    textcat.add_label(label)
nlp.add_pipe(textcat)

# TRAIN_DATA is assumed here: a list of (text, annotations) pairs, e.g.
# [("some text", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}), ...]
optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=8):
        texts, cats = zip(*batch)
        nlp.update(texts, cats, sgd=optimizer, losses=losses)
    print(i, losses)
nlp.to_disk("/bert-textcat")
Is there an easy way to make trf_textcat work as a regressor? Or would it mean extending the library?
I have figured out a workaround: extract the vector representation from the nlp pipeline as follows:
vector_repres = nlp('Test text').vector
After doing so for all the text entries, you end up with a fixed-dimensional representation of the texts. Assuming you have continuous output values, feel free to use any estimator, including a neural network with a linear output.
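For illustration, here is a minimal sketch of the full workaround; texts and targets (floats from 0 to 1) are hypothetical stand-ins for your data, and scikit-learn's Ridge is just one possible choice of estimator:

import numpy as np
from sklearn.linear_model import Ridge

texts = ["first example text", "second example text"]  # your input texts
targets = [0.2, 0.9]  # continuous labels in [0, 1]

# Encode each text as a fixed-dimensional vector via the spaCy pipeline
X = np.array([nlp(text).vector for text in texts])
y = np.array(targets)

# Fit any regressor on top of the extracted features
reg = Ridge().fit(X, y)
print(reg.predict(X))  # floating-point predictions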
Note that the vector representation is an average of the vector embeddings of all the words in the text, so it might be a suboptimal solution for your case.
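To see the pooling explicitly, here is a small sketch (assuming the pipeline averages the token vectors; the exact pooling can differ between models):

import numpy as np

doc = nlp("Test text")
token_vectors = np.array([token.vector for token in doc])
# Mean pooling over all tokens; for average-pooled pipelines this
# should be close to doc.vector
pooled = token_vectors.mean(axis=0)
print(np.allclose(pooled, doc.vector))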