Tags: nlp, bert-language-model, fine-tuning

Do I need to retrain BERT for NER to create new labels?


I am very new to natural language processing and I was thinking about working on named entity recognition (NER). A friend of mine who works with NLP advised me to check out BERT, which I did. When reading the documentation and checking out the CoNLL-2003 data set, I noticed that the only labels are person, organization, location, miscellaneous, and outside. What if, instead of tagging everything else as outside, I want the model to recognize dates, times, and other labels? I get that I would need a dataset labelled as such, so, assuming that I have that, do I need to retrain BERT from scratch, or can I somehow fine-tune the existing model without restarting the whole process?


Solution

  • Yes, you need a model trained on the specific labels you require, but that does not mean pretraining BERT from scratch: fine-tuning a pretrained model on a dataset annotated with your labels is enough. The OntoNotes dataset may be better suited to what you are trying to do, as it includes the 18 entity types listed below (see the OntoNotes 5.0 Release Notes for further info).

    The flair/ner-english-ontonotes-large and flair/ner-english-ontonotes-fast models, hosted on the Hugging Face Hub, are trained on this dataset and will likely produce results closer to what you want. As a demo (make sure to pip install flair first):

    from flair.data import Sentence
    from flair.models import SequenceTagger
    
    tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")  # load tagger
    sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")  # example sentence
    tagger.predict(sentence)  # predict NER tags
    
    # Print sentence and NER spans
    print(sentence)
    print('The following NER tags are found:')
    # iterate over entities and print
    for entity in sentence.get_spans('ner'):
        print(entity)
    
    # Output
    # Span [2,3]: "September 1st"   [− Labels: DATE (1.0)]
    # Span [4]: "George"   [− Labels: PERSON (1.0)]
    # Span [6,7]: "1 dollar"   [− Labels: MONEY (1.0)]
    # Span [10,11,12]: "Game of Thrones"   [− Labels: WORK_OF_ART (1.0)]
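
Since the question is specifically about dates and times, it can help to group the recognized spans by label. This is a minimal sketch in plain Python that uses the (text, label) pairs from the demo output above. `group_by_label` and `demo_entities` are hypothetical names, and the commented line showing how to build the pairs from Flair spans is an assumption that may need adjusting for your Flair version.

```python
from collections import defaultdict


def group_by_label(entities):
    """Group (text, label) pairs into a dict mapping label -> list of texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied by hand from the demo output above.
# With Flair they could be built roughly like this (assumption, version-dependent):
#   [(span.text, span.get_labels("ner")[0].value) for span in sentence.get_spans("ner")]
demo_entities = [
    ("September 1st", "DATE"),
    ("George", "PERSON"),
    ("1 dollar", "MONEY"),
    ("Game of Thrones", "WORK_OF_ART"),
]

by_label = group_by_label(demo_entities)
print(by_label.get("DATE", []))  # texts tagged DATE
print(by_label.get("TIME", []))  # empty list: no TIME span in this sentence
```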
    

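    OntoNotes 5.0 Named Entities (descriptive names; the tags in the data and in the tagger output use short forms for some of these, such as FAC, ORG, LOC, and WORK_OF_ART)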

    1. PERSON (People, including fictional)
    2. NORP (Nationalities or religious or political groups)
    3. FACILITY (Buildings, airports, highways, bridges, etc.)
    4. ORGANIZATION (Companies, agencies, institutions, etc.)
    5. GPE (Countries, cities, states)
    6. LOCATION (Non-GPE locations, mountain ranges, bodies of water)
    7. PRODUCT (Vehicles, weapons, foods, etc. (Not services))
    8. EVENT (Named hurricanes, battles, wars, sports events, etc.)
    9. WORK OF ART (Titles of books, songs, etc.)
    10. LAW (Named documents made into laws)
    11. LANGUAGE (Any named language)
    12. DATE (Absolute or relative dates or periods)
    13. TIME (Times smaller than a day)
    14. PERCENT (Percentage (including “%”))
    15. MONEY (Monetary values, including unit)
    16. QUANTITY (Measurements, as of weight or distance)
    17. ORDINAL (“first”, “second”)
    18. CARDINAL (Numerals that do not fall under another type)
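
If none of the pretrained taggers covers your labels, the usual route is to fine-tune a pretrained BERT checkpoint with a new classification head sized for your label set, rather than pretraining from scratch. The sketch below shows only the label setup, and it rests on several assumptions: a BIO tagging scheme, the Hugging Face transformers library, and the `bert-base-cased` checkpoint. `ENTITY_TYPES`, `build_bio_labels`, and `load_model_for_fine_tuning` are hypothetical names, and the entity types are placeholders for whatever your annotated dataset contains.

```python
# Placeholder entity types; replace with the types in your own dataset.
ENTITY_TYPES = ["DATE", "TIME", "PERSON", "ORG", "GPE", "MONEY"]


def build_bio_labels(entity_types):
    """Return the BIO tag list: 'O' plus a B- and I- tag for each type."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


labels = build_bio_labels(ENTITY_TYPES)
label2id = {label: idx for idx, label in enumerate(labels)}
id2label = {idx: label for label, idx in label2id.items()}


def load_model_for_fine_tuning(checkpoint="bert-base-cased"):
    """Load pretrained weights with a fresh token-classification head.

    Assumes `pip install transformers torch`; downloads the checkpoint on
    first use. The encoder weights are reused and only the new head starts
    from random initialization.
    """
    from transformers import AutoModelForTokenClassification

    return AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )


print(len(labels), labels[:3])
```

The model returned by `load_model_for_fine_tuning()` would still need to be trained on your labelled sentences, with the word-level tags aligned to BERT's subword tokens, before it predicts anything useful.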