c++ nlp n-gram language-model kenlm

Cannot allocate memory Failed to allocate when using KenLM build_binary


I have an ARPA file that I created with the following command:

 ./lmplz -o 4 -S 1G <tmp_100M.txt >100m.arpa

Now I want to convert this ARPA file to a binary file:

./build_binary 100m.arpa 100m.bin

But I get this error:

mmap.cc:225 in void util::HugeMalloc(std::size_t, bool, util::scoped_memory&) threw ErrnoException because `!to.get()'.
Cannot allocate memory Failed to allocate 106122412848 bytes Byte: 80
ERROR

I tried adding the -S parameter:

./build_binary -S 1G 100m.arpa 100m.bin

but I got the same error.

  1. How can I convert the ARPA file to a binary file?

  2. Why am I getting this error?


Solution

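  • See https://aclanthology.org/W16-4618 for a brief explanation.

    Try this instead. It builds a trie with quantization and pointer compression, which needs less memory than the default probing data structure: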

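    LM_ORDER=4
    CORPUS_LM="tmp_100M"
    LANG_E="txt"
    # The ARPA output is piped through gzip below, so name it accordingly.
    LM_ARPA="100m.arpa.gz"
    LM_FILE="100m.bin"
    
    # Estimate the model; -S caps lmplz memory, -T sets the temp directory.
    ${MOSES_BIN_DIR}/lmplz --order ${LM_ORDER} -S 80% -T /tmp \
    < ${CORPUS_LM}.${LANG_E} | gzip > ${LM_ARPA}
    
    # Build a trie with 8-bit quantized probabilities (-q) and backoffs (-b),
    # and pointer compression of up to 22 bits (-a).
    ${MOSES_BIN_DIR}/build_binary trie -a 22 -b 8 -q 8 ${LM_ARPA} ${LM_FILE}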
    

    MOSES_BIN_DIR is the directory where the binaries you compiled (lmplz, build_binary) are stored.
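Before building, it can help to ask build_binary how much memory each format would need. When given only an ARPA file and no output file name, build_binary prints memory estimates for the probing and trie variants instead of writing a binary. The wrapper below is a minimal sketch: the function name estimate_lm_memory is hypothetical, and it assumes build_binary lives in MOSES_BIN_DIR (falling back to the current directory).

```shell
# Hypothetical helper: print KenLM's memory estimates for an ARPA file.
# Assumes build_binary is in $MOSES_BIN_DIR (or the current directory).
estimate_lm_memory() {
    bin="${MOSES_BIN_DIR:-.}/build_binary"
    if [ ! -x "$bin" ]; then
        echo "build_binary not found at $bin" >&2
        return 1
    fi
    # ARPA file only, no output name: build_binary reports size estimates.
    "$bin" "$1"
}
```

For the file in the question this would be called as `estimate_lm_memory 100m.arpa`; comparing the reported sizes against the machine's RAM shows whether the trie options are enough.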


    If you still run out of memory with the trie and quantization options, you may need to move to a machine or instance with enough RAM to load your language model and produce the binary.
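To judge whether a machine is large enough, the byte count from the error message can be compared with the installed RAM. The sketch below converts the 106122412848 bytes reported in the question to GiB and, on Linux only, reads MemTotal from /proc/meminfo. The helper name bytes_to_gib is made up for illustration.

```shell
# Hypothetical helper: convert a byte count to GiB with one decimal place.
bytes_to_gib() {
    LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# Byte count copied from the error message in the question.
requested=106122412848
echo "build_binary asked for $(bytes_to_gib "$requested") GiB"

# Assumption: a Linux system, where /proc/meminfo reports MemTotal in kB.
if [ -r /proc/meminfo ]; then
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    echo "this machine has $(bytes_to_gib $((total_kb * 1024))) GiB of RAM"
fi
```

The failed request works out to roughly 98.8 GiB, which is why passing -S 1G made no difference: the allocation that fails is far larger than that limit.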