PyTorch's torchtext can convert tokens into integers via sentencepiece_numericalizer, e.g. "is" -> 17.
What about the inverse operation, from an integer to a token, e.g. 17 -> "is"? How can I do that? It's not listed in the https://pytorch.org/text/stable/data_functional.html API doc.
Looking through the code for the sentencepiece_numericalizer in PyTorch, it looks like this (docs removed):
def sentencepiece_numericalizer(sp_model):
    def _internal_func(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsIds(line)
    return _internal_func
Note the call to sp_model.EncodeAsIds(line).
Based on this, it appears that the sp_model argument (the same object returned by load_sp_model) is actually an instance of the SentencePieceProcessor class. Looking through the code, there is an additional DecodeIds method.
With the above information we can write the inverse function to sentencepiece_numericalizer:
def sentencepiece_denumericalizer(sp_model):
    def _internal_func(id_iter):
        for ids in id_iter:
            yield sp_model.DecodeIds(ids)
    return _internal_func
This can be used like below:
from torchtext.data.functional import (
    load_sp_model,
    sentencepiece_numericalizer,
)
sp_model = load_sp_model("path_to_model")
sp_nm = sentencepiece_numericalizer(sp_model)
sp_dnm = sentencepiece_denumericalizer(sp_model)
in_strs = ["sentencepiece encode as pieces", "examples to try!"]
ids = list(sp_nm(in_strs))
out_strs = list(sp_dnm(ids)) # should approx eq. in_strs
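Note that DecodeIds joins a whole list of IDs back into one string. For the single-ID case from the question (17 -> "is"), a per-ID lookup may be closer to what is wanted. The sketch below assumes the model object also exposes an IdToPiece method, as the sentencepiece SentencePieceProcessor does; I have not checked that for every torchtext version. The names sentencepiece_id_to_piece and _StubModel are hypothetical, and the stub's vocabulary is made up so the example runs without a model file.

```python
def sentencepiece_id_to_piece(sp_model):
    """Return a function that maps each list of IDs to a list of pieces.

    Assumes sp_model has an IdToPiece(int) -> str method (an assumption,
    modelled on sentencepiece's SentencePieceProcessor).
    """
    def _internal_func(id_iter):
        for ids in id_iter:
            # Look up each ID individually instead of decoding the whole list.
            yield [sp_model.IdToPiece(i) for i in ids]
    return _internal_func


class _StubModel:
    """Hypothetical stand-in for a loaded model, with a made-up vocabulary."""
    _vocab = {17: "\u2581is", 42: "\u2581this"}

    def IdToPiece(self, idx):
        return self._vocab[idx]


to_pieces = sentencepiece_id_to_piece(_StubModel())
pieces = list(to_pieces([[42, 17]]))
print(pieces)
```

With a real model, pass the object returned by load_sp_model instead of the stub. SentencePiece marks word boundaries with a leading "▁" character, so expect a piece like "▁is" rather than a bare "is".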