[SOLVED] PyTorch: How to convert SentencePiece numbers into tokens

PyTorch's torchtext can convert tokens into integers via sentencepiece_numericalizer, e.g. "is" -> 17.

What about the inverse operation, from an integer back to a token, e.g. 17 -> "is"? How can I do that? It is not listed in the API docs at https://pytorch.org/text/stable/data_functional.html.

Solution

The source of sentencepiece_numericalizer in torchtext looks like this (docstring removed):

def sentencepiece_numericalizer(sp_model):
    def _internal_func(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsIds(line)

    return _internal_func

Note the call to sp_model.EncodeAsIds(line).
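The function is a small factory: it closes over sp_model and returns a generator function, so nothing is encoded until the result is iterated. Here is a minimal sketch of that behavior that runs without a trained model. FakeModel and its word-to-ID table are invented for illustration and only mimic the one method used here. The factory itself is copied from above.

```python
def sentencepiece_numericalizer(sp_model):
    # Same factory as above: closes over sp_model, returns a generator function.
    def _internal_func(txt_iter):
        for line in txt_iter:
            yield sp_model.EncodeAsIds(line)

    return _internal_func


class FakeModel:
    """Hypothetical stand-in for a loaded model (not the real SentencePiece class)."""

    def __init__(self):
        self.calls = 0
        self._vocab = {"this": 5, "is": 17, "fine": 42}  # invented IDs

    def EncodeAsIds(self, line):
        # Count calls so the lazy evaluation is observable.
        self.calls += 1
        return [self._vocab[word] for word in line.split()]


model = FakeModel()
sp_nm = sentencepiece_numericalizer(model)

id_gen = sp_nm(["this is fine", "is this"])
calls_before = model.calls  # creating the generator has not encoded anything yet
ids = list(id_gen)          # iterating triggers one EncodeAsIds call per line
calls_after = model.calls
```

Wrapping the call in list(...), as the usage example further down does, is what actually runs the encoding.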

Since EncodeAsIds is a method of SentencePieceProcessor, the sp_model argument (the same object returned by load_sp_model) appears to be an instance of that class. The class also has a DecodeIds method, which performs the reverse mapping from a list of IDs back to text.

With the above information we can write the inverse function to sentencepiece_numericalizer:

def sentencepiece_denumericalizer(sp_model):
    def _internal_func(id_iter):
        for ids in id_iter:
            yield sp_model.DecodeIds(ids)

    return _internal_func

It can be used as follows:

from torchtext.data.functional import (
    load_sp_model,
    sentencepiece_numericalizer,
)
sp_model = load_sp_model("path_to_model")
sp_nm = sentencepiece_numericalizer(sp_model)
sp_dnm = sentencepiece_denumericalizer(sp_model)

in_strs = ["sentencepiece encode as pieces", "examples to   try!"]
ids = list(sp_nm(in_strs))
out_strs = list(sp_dnm(ids))  # approximately equal to in_strs; SentencePiece normalizes text, so repeated spaces may be collapsed
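
DecodeIds joins the pieces back into text. If the individual tokens are wanted instead (the literal 17 -> "is" case from the question), SentencePieceProcessor also has an IdToPiece method that maps a single ID to its piece string. Here is a minimal sketch along the same lines. The name sentencepiece_id_to_pieces is made up here and is not part of torchtext, and it assumes sp_model is the object returned by load_sp_model.

```python
def sentencepiece_id_to_pieces(sp_model):
    """Return a function mapping each list of IDs to a list of piece strings.

    Hypothetical helper, not part of torchtext. Assumes sp_model is the
    SentencePieceProcessor returned by load_sp_model.
    """
    def _internal_func(id_iter):
        for ids in id_iter:
            # IdToPiece is called once per ID, so each token stays separate.
            yield [sp_model.IdToPiece(i) for i in ids]

    return _internal_func


# Usage, reusing sp_model and ids from the example above:
# sp_pieces = sentencepiece_id_to_pieces(sp_model)
# pieces = list(sp_pieces(ids))
```

The returned pieces keep SentencePiece's "▁" word-boundary marker (for example "▁is"), so they are not identical to the surface words.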