Tags: nlp, bert-language-model, huggingface-transformers, johnsnowlabs-spark-nlp

BERT embeddings in Spark NLP or BERT for token classification in Hugging Face


I am currently working on productionizing an NER model on Spark. My current implementation uses Hugging Face DistilBERT with a token-classification head, but since inference is somewhat slow and costly, I am trying to find ways to optimize it.

I have looked at the Spark NLP implementation, which lacks a pretrained DistilBERT and, as far as I can tell, takes a different approach, so a few questions came up:

  1. Hugging Face uses the entire BERT model and adds a head for token classification. Is this the same as obtaining the BERT embeddings and just feeding them to another NN?
  2. I ask because this is the Spark NLP approach: a class that helps obtain those embeddings and use them as features for another, more complex NN. Doesn't this lose some of the knowledge inside BERT?
  3. Does Spark NLP have any Spark-specific optimization that helps with inference time, or is it just another BERT implementation?

Solution

  • To answer your Question no. 1:

    Hugging Face uses a different head for each task, which is essentially what the authors of BERT did with their model: they added a task-specific layer on top of the existing model and fine-tuned it for the particular task. The important point is that when you add a task-specific layer (a new layer), you jointly learn the new layer and update the already-learnt weights of the BERT model. In other words, the BERT model itself is part of the gradient updates. This is quite different from obtaining the embeddings and then using them as input to a separate neural network (see the first sketch after this answer).

    Question 2: When you obtain the embeddings and use them in another, more complex model, I am not sure how to quantify the loss of information, because you are still using the information BERT extracted from your data to build that other model. So we cannot really call it losing information, but the performance need not be the best compared with learning another model on top of BERT (and jointly with BERT).

    Often, people obtain the embeddings and feed them as input to another classifier due to resource constraints, where it may not be feasible to train or fine-tune BERT; the two sketches below contrast the two approaches.
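
A minimal sketch of the fine-tuning setup described in the first answer, assuming the `distilbert-base-uncased` checkpoint, a hypothetical 5-label NER scheme, and dummy labels purely to illustrate the forward and backward pass:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Adds a randomly initialized token-classification head on top of the encoder.
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=5  # 5 labels is a hypothetical NER scheme
)

inputs = tokenizer("John lives in Madrid", return_tensors="pt")
# Dummy labels, one per sub-word token, just to make the loss computable.
labels = torch.zeros(inputs["input_ids"].shape, dtype=torch.long)

outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # gradients flow into BOTH the new head and the encoder

# Every encoder parameter participates in the update, not just the head.
print(all(p.grad is not None for p in model.distilbert.parameters()))
```

Because the encoder weights are part of the computation graph, every gradient step adapts BERT itself to the NER task, not just the newly added head.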
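For contrast, here is a sketch of the embeddings-as-features pattern discussed above (roughly the Spark NLP style, though not its actual implementation), with the same hypothetical checkpoint and label count; the encoder is frozen and only the small classifier on top is trained:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False  # the BERT weights are never updated

# Small downstream classifier over per-token embeddings (5 hypothetical labels).
classifier = torch.nn.Linear(encoder.config.hidden_size, 5)

inputs = tokenizer("John lives in Madrid", return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**inputs).last_hidden_state  # (1, seq_len, hidden)

logits = classifier(token_embeddings)
labels = torch.zeros(inputs["input_ids"].shape, dtype=torch.long)  # dummy labels
loss = torch.nn.functional.cross_entropy(
    logits.view(-1, logits.size(-1)), labels.view(-1)
)
loss.backward()  # only the classifier receives gradients; the encoder does not
```

Here the encoder runs purely in inference mode, which is cheaper, but its weights can never adapt to the NER labels, which is the trade-off the answer describes.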