
PyTorch: How to convert SentencePiece numbers into tokens


PyTorch's torchtext can convert tokens into integers via sentencepiece_numericalizer, e.g. "is" -> 17.

What about the inverse operation, from an integer back to a token, e.g. 17 -> "is"? How can I do that? It is not listed in the https://pytorch.org/text/stable/data_functional.html API docs.


Solution

  • The source of sentencepiece_numericalizer in torchtext looks like this (docstring removed):

    def sentencepiece_numericalizer(sp_model):
        def _internal_func(txt_iter):
            for line in txt_iter:
                yield sp_model.EncodeAsIds(line)
    
        return _internal_func
    

    Note the call to sp_model.EncodeAsIds(line).

    Based on this, it appears that the sp_model argument (the same object returned by load_sp_model) is actually an instance of the SentencePieceProcessor class. That class also has a DecodeIds method, which turns a list of ids back into text.

    With this information we can write the inverse of sentencepiece_numericalizer:

    def sentencepiece_denumericalizer(sp_model):
        def _internal_func(id_iter):
            for ids in id_iter:
                yield sp_model.DecodeIds(ids)
    
        return _internal_func
    

    It can be used as follows:

    from torchtext.data.functional import (
        load_sp_model,
        sentencepiece_numericalizer,
    )
    sp_model = load_sp_model("path_to_model")
    sp_nm = sentencepiece_numericalizer(sp_model)
    sp_dnm = sentencepiece_denumericalizer(sp_model)
    
    in_strs = ["sentencepiece encode as pieces", "examples to   try!"]
    ids = list(sp_nm(in_strs))
    out_strs = list(sp_dnm(ids))  # should approximately equal in_strs (e.g. repeated spaces may be collapsed)
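
Note that DecodeIds joins the ids back into a whole string. If you want one piece per id (17 -> "is", as in the question), a per-id lookup is closer to what was asked. The sketch below assumes the object returned by load_sp_model also exposes IdToPiece, as SentencePieceProcessor does in the sentencepiece Python package; I have not checked this against every torchtext version. The names sentencepiece_ids_to_pieces and ToySentencePiece are made up for illustration. The toy class is only a stand-in so the example runs without a trained model file, and it is not how SentencePiece actually tokenizes.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece uses for a word start


def sentencepiece_ids_to_pieces(sp_model):
    """Hypothetical helper: map each list of ids to a list of pieces.

    Assumes sp_model has an IdToPiece(int) -> str method.
    """
    def _internal_func(id_iter):
        for ids in id_iter:
            yield [sp_model.IdToPiece(i) for i in ids]

    return _internal_func


class ToySentencePiece:
    """Hypothetical stand-in for a loaded model, with a fixed tiny vocabulary.

    It only mimics the three method names used in this answer.
    """

    def __init__(self, pieces):
        self._pieces = list(pieces)
        self._ids = {piece: i for i, piece in enumerate(self._pieces)}

    def EncodeAsIds(self, text):
        # Whole-word lookup only; a real model splits into subword pieces.
        return [self._ids[WORD_BOUNDARY + word] for word in text.split()]

    def IdToPiece(self, idx):
        return self._pieces[idx]

    def DecodeIds(self, ids):
        joined = "".join(self._pieces[i] for i in ids)
        return joined.replace(WORD_BOUNDARY, " ").strip()


toy = ToySentencePiece(
    [WORD_BOUNDARY + w for w in ("this", "is", "a", "test")]
)
ids = toy.EncodeAsIds("this is a test")
pieces = next(sentencepiece_ids_to_pieces(toy)([ids]))
print(ids)
print(pieces)
print(toy.DecodeIds(ids))
```

With a real model you would pass the sp_model from load_sp_model instead of the toy object. The pieces keep the "▁" word-boundary marker, so strip or replace it if you want plain words.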