python, tokenize, bert-language-model, named-entity-recognition, roberta

NER Classification Deberta Tokenizer error: You need to instantiate DebertaTokenizerFast


I'm trying to perform an NER classification task using DeBERTa, but I'm stuck on a tokenizer error. This is my code (my input sentence has to be split word by word, so I receive it as a list of words):

from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)  # model_checkpoint points to my DeBERTa model

import transformers
assert isinstance(tokenizer, transformers.PreTrainedTokenizerFast)

tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."])

This is the result I get:

{'input_ids': [[1, 31414, 2], [1, 6, 2], [1, 9226, 2], [1, 354, 2], [1, 1264, 2], [1, 19530, 4086, 2], [1, 44154, 2], [1, 12473, 2], [1, 30938, 2], [1, 4, 2]], 'token_type_ids': [[0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]], 'attention_mask': [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]]} 

Then I proceed, but I get this error:

tokenized_input = tokenizer(example["tokens"])  # example is a record from my NER dataset
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
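
(A quick check, using the tokenized_input from above: input_ids is a list of lists, one inner list per word, because every word was encoded as its own sequence, while convert_ids_to_tokens expects a flat list of ids. Converting word by word does run, but it is not the single sequence I want:)

# each inner list holds the ids of one word, so conversion only works per word
tokens_per_word = [tokenizer.convert_ids_to_tokens(ids) for ids in tokenized_input["input_ids"]]
print(tokens_per_word)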

So I think I need the tokenizer output to be in the following format (which is not possible here, because my sentence is already split into words):

tokenizer("Hello, this is one sentence!")

{'input_ids': [1, 31414, 6, 42, 16, 65, 3645, 328, 2], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

So I tried both of the following, but I'm stuck and don't know what to do. There is very little documentation about DeBERTa online.

tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True)

AssertionError: You need to instantiate DebertaTokenizerFast with add_prefix_space=True to use it with pretokenized inputs.

tokenizer(["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."], is_split_into_words=True, add_prefix_space=True)

And the error is still the same. Thank you so much!


Solution

  • Let's try joining the list back into a single string before tokenizing:

    # joining a list of ids into one comma-separated string
    input_ids = [1, 31414, 6, 42, 16, 65, 3645, 328, 2]
    input_ids = ','.join(map(str, input_ids))

    # the same idea applied to the pre-split words
    input_ids = ["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."]
    input_ids = ','.join(map(str, input_ids))
    input_ids
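
  • Alternatively, the assertion message in the question spells out the fix: the tokenizer has to be created with add_prefix_space=True before it can be called on pre-tokenized input. A minimal sketch, assuming the microsoft/deberta-base checkpoint (substitute your own model_checkpoint):

    from transformers import AutoTokenizer

    # add_prefix_space must be set when the tokenizer is instantiated, not in the call
    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-base", add_prefix_space=True)

    words = ["Hello", ",", "this", "is", "one", "sentence", "split", "into", "words", "."]
    encoding = tokenizer(words, is_split_into_words=True)

    # the ids now form a single flat sequence, so convert_ids_to_tokens works
    print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))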