pytorch, machine-translation, fairseq

How to inspect values in binarized FairSeq datasets?


Running the fairseq-preprocess script produces binary files with integer indices corresponding to token ids in a dictionary.

When I no longer have the original tokenized texts, what is the simplest way to explore the binarized dataset? The documentation does not say much about how a dataset can be loaded for debugging purposes.


Solution

  • I worked around this by loading the trained model and using it to decode the binarized sentences back to strings:

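    import os

    from fairseq.models.transformer import TransformerModel

    model_dir = "???"  # directory containing the checkpoint and the SentencePiece model
    data_dir = "???"   # directory written by fairseq-preprocess

    model = TransformerModel.from_pretrained(
        model_dir,
        checkpoint_file='checkpoint_best.pt',
        data_name_or_path=data_dir,
        bpe='sentencepiece',
        sentencepiece_model=os.path.join(model_dir, 'sentencepiece.joint.bpe.model'),
    )

    # Load the binarized training split through the model's task.
    model.task.load_dataset('train')
    data_bin = model.task.datasets['train']

    # Each item holds tensors of token ids; decode() maps them back to strings.
    train_pairs = [
        (model.decode(item['source']), model.decode(item['target']))
        for item in data_bin
    ]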
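To make clear what the decoding step does with the integer ids, here is a minimal pure-Python sketch of the id-to-token mapping. It is not fairseq's own implementation. It assumes the dictionary file lists one `token count` pair per line and that four special symbols (`<s>`, `<pad>`, `</s>`, `<unk>`) occupy indices 0 to 3, which I believe is fairseq's default layout. The helper names `load_vocab` and `ids_to_text`, the dictionary contents, and the token ids are all made up for illustration. Reading the ids out of the binary files themselves would still need fairseq's dataset readers.

```python
import io

# Assumption: fairseq's default dictionary reserves these four symbols at
# indices 0-3, ahead of the entries listed in the dictionary file.
SPECIAL_SYMBOLS = ["<s>", "<pad>", "</s>", "<unk>"]


def load_vocab(dict_file):
    """Build an id -> token list from a dictionary file with "token count" lines."""
    vocab = list(SPECIAL_SYMBOLS)
    for line in dict_file:
        line = line.rstrip("\n")
        if not line:
            continue
        # The count is not needed for decoding, so only the token is kept.
        token, _count = line.rsplit(" ", 1)
        vocab.append(token)
    return vocab


def ids_to_text(ids, vocab):
    """Map token ids back to a plain string, undoing SentencePiece segmentation."""
    tokens = []
    for i in ids:
        # Ids outside the vocabulary are shown as the unknown symbol.
        token = vocab[i] if 0 <= i < len(vocab) else "<unk>"
        # Drop sentence boundary and padding markers from the output.
        if token in ("<s>", "<pad>", "</s>"):
            continue
        tokens.append(token)
    # SentencePiece pieces are concatenated; U+2581 marks a word boundary.
    return "".join(tokens).replace("\u2581", " ").strip()


# Hypothetical dictionary contents and token ids, for illustration only.
example_dict = io.StringIO("\u2581the 120\n\u2581cat 40\ns 35\n\u2581sleep 12\n")
vocab = load_vocab(example_dict)
sentence_ids = [4, 5, 6, 7, 2]  # the final id 2 is the end-of-sentence symbol
print(ids_to_text(sentence_ids, vocab))
```

A mapping like this is handy for spot-checking a few sentences by hand, but for a whole dataset the model-based approach above is less error-prone because it uses the same dictionary and BPE settings the model was trained with.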