huggingface-transformers, bert-language-model, named-entity-recognition, punctuation

BERT tokenizer punctuation for named entity recognition task


I'm working on a named entity recognition task, where I need to identify person names, book titles, etc.

I am using the Hugging Face Transformers package and BERT with PyTorch. Generally it works very well; however, my issue is that for some first names a dot "." is part of the name and shouldn't be separated from it. For example, for the person name "Paul Adam", the first name in the training data is shortened to one letter combined with a dot: "P. Adam". The tokenizer tokenizes it as ["P", ".", "Adam"], which later hurts the trained NER model's performance, since "P." (and not just "P") is what appears in the training data. The model is able to recognize full names but fails on the shortened ones. I used the spaCy tokenizer before and didn't face this issue. Here are more details:

from transformers import BertTokenizer
path_pretrained_model='/model/bert/'
tokenizer = BertTokenizer.from_pretrained(path_pretrained_model)

print(tokenizer.tokenize("P. Adam is a scientist."))

Output:
['p', '.', 'adam', 'is', 'a', 'scientist', '.']

The desired output would be:
['p.', 'adam', 'is', 'a', 'scientist', '.']

Solution

  • Not sure whether this might be a viable solution for you, but here's a possible hack.

    from transformers import BertTokenizer
    import string
    
    # Note the argument name is do_basic_tokenize; lowercase letters are enough
    # because the uncased tokenizer lowercases "P." before checking never_split.
    tokenizer = BertTokenizer.from_pretrained(
        'bert-base-uncased', do_basic_tokenize=True,
        never_split=[f"{letter}." for letter in string.ascii_lowercase])
    
    print(tokenizer.tokenize("P. Adam is a scientist."))   
    # ['p.', 'adam', 'is', 'a', 'scientist', '.']
    

    Indeed, from the documentation:

    never_split (Iterable, optional) — Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True
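
    A follow-up caveat, not covered in the answer above: never_split only keeps "p." from being split apart. If "p." is absent from the model's WordPiece vocabulary (worth checking), the intact token will still be encoded as [UNK] when converted to ids, which would also hurt the NER model. Below is a minimal sketch of how one might check for this and, if needed, register the abbreviations as real tokens; the abbreviations list and the resize_token_embeddings step are illustrative assumptions, not part of the original answer.

    from transformers import BertTokenizer
    import string
    
    abbreviations = [f"{letter}." for letter in string.ascii_lowercase]
    tokenizer = BertTokenizer.from_pretrained(
        'bert-base-uncased', do_basic_tokenize=True,
        never_split=abbreviations)
    
    # If "p." is not in the vocabulary, this prints the [UNK] id
    # (compare with tokenizer.unk_token_id).
    print(tokenizer.convert_tokens_to_ids(["p."]))
    
    # One possible fix: add the abbreviations as new tokens ...
    num_added = tokenizer.add_tokens(abbreviations)
    print(f"Added {num_added} tokens")
    
    # ... and, before fine-tuning, resize the model's embedding matrix to match:
    # model.resize_token_embeddings(len(tokenizer))

    Tokens added this way start with randomly initialized embeddings, so the model has to learn them during fine-tuning; whether that helps more than simply training on the split ["p", ".", ...] form is something to validate on your data.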