OpenFst - creating FSTs from a list of words


I'm reading the first example on http://www.openfst.org/twiki/bin/view/FST/FstExamples, the one about tokenization.

In the example, they create three FSTs (Mars.fst, Martian.fst, and man.fst) and manually run some fst commands to merge them into one big transducer. The words "Mars", "Martian", and "man" come from wotw.syms, which has 7102 words.

My question: is there a smart way to create a word FST for each of the 7102 words, so that all of them can be combined into one big automaton? Or does it have to be done manually, as they did for the three words Martian, Mars, and man?


Solution

  • The example page provides a script for this: https://www.openfst.org/twiki/pub/FST/FstExamples/makelex.py.txt. Save it as makelex.py, then simply run:

    cat wotw.syms | python2 makelex.py > lexicon_text.fst
    fstcompile --isymbols=ascii.syms --osymbols=wotw.syms lexicon_text.fst lexicon.fst
    fstrmepsilon lexicon.fst | fstdeterminize | fstminimize >lexicon_opt.fst
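
If you would rather see how such a text-format lexicon can be generated, here is a minimal Python 3 sketch. It is not the contents of makelex.py, only an illustration under these assumptions: each line of wotw.syms starts with the word followed by its integer id, every character of every word has an entry in ascii.syms, and the word is emitted on the first arc with `<epsilon>` on the rest, as in the hand-written Mars.fst. The names `lexicon_lines` and `read_words` are made up for this example.

```python
import sys

def lexicon_lines(words, eps="<epsilon>"):
    """Yield AT&T text-format lines: one path per word, all leaving state 0."""
    next_state = 1  # state 0 is the shared start state
    for word in words:
        src = 0
        for i, ch in enumerate(word):
            dst = next_state
            next_state += 1
            # Output the whole word on the first arc, epsilon on the others.
            out = word if i == 0 else eps
            yield f"{src} {dst} {ch} {out}"
            src = dst
        yield str(src)  # the last state of each path is final

def read_words(lines, skip=("<epsilon>",)):
    """Take the first column of a symbol table, skipping blanks and epsilon."""
    for line in lines:
        fields = line.split()
        if fields and fields[0] not in skip:
            yield fields[0]

if __name__ == "__main__":
    # Assumed usage: python3 words_to_fst.py < wotw.syms > lexicon_text.fst
    for line in lexicon_lines(read_words(sys.stdin)):
        print(line)
```

Because every path starts at state 0, the output is already the union of all the word transducers, so the same fstcompile and optimization commands shown above should apply to it. It does not add the closure or the punctuation handling that the tokenization example builds separately.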