
Extending Kaldi Aspire: bad variable error while Recompiling HCLG.fst using new lexicon and grammar files


I have successfully set up and run the Kaldi Aspire recipe on WSL. I am now working on a proof of concept in which I want to extend the ASPIRE recipe by building a new corpus, dictionary, and language model and merging them with the original HCLG.fst. I followed this blog post. I was able to create the new dictionary and language model and to merge the input files. However, I get the following error when I try to recompile HCLG.fst with the new lexicon and grammar.

Checking update-model/local/dict/silence_phones.txt ...
--> reading update-model/local/dict/silence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/silence_phones.txt is OK

Checking update-model/local/dict/optional_silence.txt ...
--> reading update-model/local/dict/optional_silence.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/optional_silence.txt is OK

Checking update-model/local/dict/nonsilence_phones.txt ...
--> reading update-model/local/dict/nonsilence_phones.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/nonsilence_phones.txt is OK

Checking disjoint: silence_phones.txt, nonsilence_phones.txt
--> disjoint property is OK.

Checking update-model/local/dict/lexicon.txt
--> reading update-model/local/dict/lexicon.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/lexicon.txt is OK

Checking update-model/local/dict/lexiconp.txt
--> reading update-model/local/dict/lexiconp.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/lexiconp.txt is OK

Checking lexicon pair update-model/local/dict/lexicon.txt and update-model/local/dict/lexiconp.txt
--> lexicon pair update-model/local/dict/lexicon.txt and update-model/local/dict/lexiconp.txt match

Checking update-model/local/dict/extra_questions.txt ...
--> reading update-model/local/dict/extra_questions.txt
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/local/dict/extra_questions.txt is OK
--> SUCCESS [validating dictionary directory update-model/local/dict]

fstaddselfloops update-model/dict/phones/wdisambig_phones.int update-model/dict/phones/wdisambig_words.int
prepare_lang.sh: validating output directory
utils/validate_lang.pl update-model/dict
Checking existence of separator file
separator file update-model/dict/subword_separator.txt is empty or does not exist, deal in word case.
Checking update-model/dict/phones.txt ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/dict/phones.txt is OK

Checking words.txt: #0 ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> update-model/dict/words.txt is OK

Checking disjoint: silence.txt, nonsilence.txt, disambig.txt ...
--> silence.txt and nonsilence.txt are disjoint
--> silence.txt and disambig.txt are disjoint
--> disambig.txt and nonsilence.txt are disjoint
--> disjoint property is OK

Checking sumation: silence.txt, nonsilence.txt, disambig.txt ...
--> found no unexplainable phones in phones.txt

Checking update-model/dict/phones/context_indep.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 20 entry/entries in update-model/dict/phones/context_indep.txt
--> update-model/dict/phones/context_indep.int corresponds to update-model/dict/phones/context_indep.txt
--> update-model/dict/phones/context_indep.csl corresponds to update-model/dict/phones/context_indep.txt
--> update-model/dict/phones/context_indep.{txt, int, csl} are OK

Checking update-model/dict/phones/nonsilence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 156 entry/entries in update-model/dict/phones/nonsilence.txt
--> update-model/dict/phones/nonsilence.int corresponds to update-model/dict/phones/nonsilence.txt
--> update-model/dict/phones/nonsilence.csl corresponds to update-model/dict/phones/nonsilence.txt
--> update-model/dict/phones/nonsilence.{txt, int, csl} are OK

Checking update-model/dict/phones/silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 20 entry/entries in update-model/dict/phones/silence.txt
--> update-model/dict/phones/silence.int corresponds to update-model/dict/phones/silence.txt
--> update-model/dict/phones/silence.csl corresponds to update-model/dict/phones/silence.txt
--> update-model/dict/phones/silence.{txt, int, csl} are OK

Checking update-model/dict/phones/optional_silence.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.int corresponds to update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.csl corresponds to update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.{txt, int, csl} are OK

Checking update-model/dict/phones/disambig.{txt, int, csl} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in update-model/dict/phones/disambig.txt
--> update-model/dict/phones/disambig.int corresponds to update-model/dict/phones/disambig.txt
--> update-model/dict/phones/disambig.csl corresponds to update-model/dict/phones/disambig.txt
--> update-model/dict/phones/disambig.{txt, int, csl} are OK

Checking update-model/dict/phones/roots.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 43 entry/entries in update-model/dict/phones/roots.txt
--> update-model/dict/phones/roots.int corresponds to update-model/dict/phones/roots.txt
--> update-model/dict/phones/roots.{txt, int} are OK

Checking update-model/dict/phones/sets.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 43 entry/entries in update-model/dict/phones/sets.txt
--> update-model/dict/phones/sets.int corresponds to update-model/dict/phones/sets.txt
--> update-model/dict/phones/sets.{txt, int} are OK

Checking update-model/dict/phones/extra_questions.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 10 entry/entries in update-model/dict/phones/extra_questions.txt
--> update-model/dict/phones/extra_questions.int corresponds to update-model/dict/phones/extra_questions.txt
--> update-model/dict/phones/extra_questions.{txt, int} are OK

Checking update-model/dict/phones/word_boundary.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 176 entry/entries in update-model/dict/phones/word_boundary.txt
--> update-model/dict/phones/word_boundary.int corresponds to update-model/dict/phones/word_boundary.txt
--> update-model/dict/phones/word_boundary.{txt, int} are OK

Checking optional_silence.txt ...
--> reading update-model/dict/phones/optional_silence.txt
--> update-model/dict/phones/optional_silence.txt is OK

Checking disambiguation symbols: #0 and #1
--> update-model/dict/phones/disambig.txt has "#0" and "#1"
--> update-model/dict/phones/disambig.txt is OK

Checking topo ...

Checking word_boundary.txt: silence.txt, nonsilence.txt, disambig.txt ...
--> update-model/dict/phones/word_boundary.txt doesn't include disambiguation symbols
--> update-model/dict/phones/word_boundary.txt is the union of nonsilence.txt and silence.txt
--> update-model/dict/phones/word_boundary.txt is OK

Checking word-level disambiguation symbols...
--> update-model/dict/phones/wdisambig.txt exists (newer prepare_lang.sh)
Checking word_boundary.int and disambig.int
sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> generating a 88 word/subword sequence
sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> ERROR: number of reconstructed words 0 does not match real number of words 88; indicates problem in L.fst or word_boundary.int.  phoneseq = , wordseq = finches pei reservations rambo mommy courtship dawdling divas vox reorient boomtown whore protectorate hurt rayner topeka adamant mugs fouls birth a._k. stand discontents amazed laurels buttering sidetrack boundary lamport occasional suspicion shortcut melons until threats droppings tourette's greece boo competence fire's throat reimburse buffington waged griffith's meshes twiddling forecasting peters catastrophe tiptoe psychoanalysis statewide polar diluting bandit acronyms alvarez snatching nolte dreary fonder snacked navigate foolish severe barbara influenza shelled manuel adulterous antisocial army palace dollars whiff chalice paws injuries pop legume hyped invalids chide goodridge crappie raving
--> generating a 48 word/subword sequence
sh: 2: export: (x86)/Intel/Intel(R): bad variable name

Checking update-model/dict/oov.{txt, int} ...
--> text seems to be UTF-8 or ASCII, checking whitespaces
--> text contains only allowed whitespaces
--> 1 entry/entries in update-model/dict/oov.txt
--> update-model/dict/oov.int corresponds to update-model/dict/oov.txt
--> update-model/dict/oov.{txt, int} are OK

sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> ERROR: update-model/dict/L.fst is not olabel sorted
sh: 2: export: (x86)/Intel/Intel(R): bad variable name
--> ERROR: update-model/dict/L_disambig.fst is not olabel sorted
--> ERROR (see error messages above)
prepare_lang.sh: error validating output

I also asked this question on the Kaldi help group. Dan Povey suggested that this might be a local issue in which a spawned subshell is throwing this error.

My pwd output is as follows:

/home/nitin/kaldi/egs/aspire/s5

My path.sh is as follows:

export KALDI_ROOT=`pwd`/../../..
export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
. $KALDI_ROOT/tools/config/common_path.sh
export PATH=$KALDI_ROOT/tools/sctk/bin:$PATH
export LC_ALL=C
source ../../../tools/env.sh

My cmd.sh, which according to the linked blog post needs to be sourced before running the subsequent commands, is:

# "queue.pl" uses qsub.  The options to it are
# options to qsub.  If you have GridEngine installed,
# change this to a queue you have access to.
# Otherwise, use "run.pl", which will run jobs locally
# (make sure your --num-jobs options are no more than
# the number of cpus on your machine.

#a) JHU cluster options
export train_cmd="queue.pl"
export decode_cmd="queue.pl --mem 4G"
export mkgraph_cmd="queue.pl --mem 8G"


#b) BUT cluster options
#export train_cmd="queue.pl -q all.q@@blade -l ram_free=1200M,mem_free=1200M"
#export decode_cmd="queue.pl -q all.q@@blade -l ram_free=1700M,mem_free=1700M"
#export decodebig_cmd="queue.pl -q all.q@@blade -l ram_free=4G,mem_free=4G"

#export cuda_cmd="queue.pl -q long.q@@pco203 -l gpu=1"
#export cuda_cmd="queue.pl -q long.q@pcspeech-gpu"
#export mkgraph_cmd="queue.pl -q all.q@@servers -l ram_free=4G,mem_free=4G"

#c) run it locally...
#export train_cmd=run.pl
#export decode_cmd=run.pl
#export cuda_cmd=run.pl
#export mkgraph_cmd=run.pl

Any Linux captain here who might help me with this?


Solution

  • The error message

    sh: 2: export: (x86)/Intel/Intel(R): bad variable name
    

    indicates a word splitting problem resulting from missing quoting.

    The text (x86)/Intel/Intel(R) looks like part of a directory path that contains spaces, as is common on Windows. It could be something like

    C:/Program Files (x86)/Intel/Intel(R) something
    

    You can probably find this in the value of your PATH variable.
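
    One way to check this is to print the PATH entries that contain a space. The following is a minimal sketch; list_path_entries_with_spaces is a hypothetical helper name, and it assumes the usual colon-separated PATH.

```shell
# Print each entry of a colon-separated path list that contains a space.
# list_path_entries_with_spaces is a hypothetical helper name.
list_path_entries_with_spaces() {
    # $1: the path list to inspect (defaults to the current PATH)
    printf '%s\n' "${1:-$PATH}" | tr ':' '\n' | grep ' '
}

# grep exits non-zero when nothing matches, so report that case explicitly.
list_path_entries_with_spaces || echo "no PATH entries contain spaces"
```

    If this prints an entry that contains (x86)/Intel/Intel(R), that entry is the likely source of the text in the error message.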

    According to the referenced thread in the KALDI help group, the problem can be in your path.sh file.

    With your current working directory /home/nitin/kaldi/egs/aspire/s5 the problem will not occur in the line

    export KALDI_ROOT=`pwd`/../../..
    

    but to avoid possible problems it should be

    export KALDI_ROOT="$(pwd)"/../../..
    

    or

    export KALDI_ROOT="$(pwd)/../../.."
    

    The problem seems to occur in line 2 of the script (which matches the error message):

    export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH
    

    I guess your PATH contains directories with spaces (including the piece shown in the error message). In this case the shell will split the line on every space and you will get something like

    export PATH=something maybe_something_else (x86)/Intel/Intel(R) maybe_again_something
    

    This would (try to) export variables PATH, maybe_something_else, (x86)/Intel/Intel(R) and maybe_again_something... which is not what you want. You want all this to be in the value of PATH.

    You are lucky to get an error message from the shell about the invalid variable name (x86)/Intel/Intel(R). If all parts were valid variable names you would get a wrong PATH and a few unwanted environment variables but no error message.
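
    The splitting described above can be reproduced in isolation. This is only an illustrative sketch: the value is made up, and it uses set -- rather than export, because some shells (bash among them) do not split the argument of export NAME=$value, while set -- shows the effect in any POSIX-style shell with the default IFS.

```shell
# A made-up value with spaces, in the style of a Windows directory name.
value='/usr/bin:/mnt/c/Program Files (x86)/Intel/Intel(R) something'

# Unquoted: the shell splits the expanded value at each space.
set -- $value
unquoted_count=$#

# Quoted: the value stays a single word.
set -- "$value"
quoted_count=$#

echo "unquoted: $unquoted_count words, quoted: $quoted_count word"
# prints: unquoted: 4 words, quoted: 1 word
```

    The third of the four unquoted words is (x86)/Intel/Intel(R), which is exactly the piece the shell rejected as a variable name.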

    So you should also quote this line and in general the expansion of all variables that may contain spaces.

    I suggest changing path.sh to

    export KALDI_ROOT="$(pwd)/../../.."
    export PATH="$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH"
    [ ! -f "$KALDI_ROOT/tools/config/common_path.sh" ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present -> Exit!" && exit 1
    . "$KALDI_ROOT/tools/config/common_path.sh"
    export PATH="$KALDI_ROOT/tools/sctk/bin:$PATH"
    export LC_ALL=C
    source ../../../tools/env.sh
    

    I don't know KALDI. Is this file generated or did you manually create it?

    The line

    export KALDI_ROOT=`pwd`/../../..
    

    can be problematic because it depends on the current working directory when you run the script. I don't know if there is a mechanism that makes sure you run it only from the directory where this script is located. If there is none, running it from another directory would result in a wrong value of KALDI_ROOT.

    I don't know if there is a reason to do it the way it is, but it may make sense to use an absolute path instead of a path that depends on your working directory.

    The current directory /home/nitin/kaldi/egs/aspire/s5 will result in

    export KALDI_ROOT=/home/nitin/kaldi/egs/aspire/s5/../../..
    

    I would replace the line in the script with

    export KALDI_ROOT=/home/nitin/kaldi
    

    You might ask in the KALDI help group about this suggestion.
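
    If a hard-coded path is not wanted, the ../../.. components can at least be collapsed into a clean absolute path. This is a sketch, not something taken from Kaldi: resolve_dir is a hypothetical helper, the temporary directory tree only stands in for kaldi/egs/aspire/s5, and the result still depends on the directory you start from.

```shell
# resolve_dir is a hypothetical helper: print the absolute, normalized
# form of an existing directory.
resolve_dir() {
    # The subshell keeps the cd from changing the caller's working directory.
    (cd "$1" && pwd)
}

# Throwaway directory tree standing in for kaldi/egs/aspire/s5.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/egs/aspire/s5"

# Three levels up from .../egs/aspire/s5 is the top of the tree again.
resolved="$(resolve_dir "$demo_root/egs/aspire/s5/../../..")"
echo "$resolved"

rm -rf "$demo_root"
```

    In path.sh the same idea would read export KALDI_ROOT="$(cd "$(pwd)/../../.." && pwd)", again assuming the script is only run from the s5 directory.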

    Edit:

    If adding quotes to path.sh is not sufficient, check also $KALDI_ROOT/tools/config/common_path.sh and $KALDI_ROOT/tools/env.sh and other scripts that may exist.

    As a starting point you can search for files that contain a line with both export and $PATH. (Of course this could occur with other variables as well.) Example:

    find /home/nitin/kaldi -type f -exec grep 'export.*\$PATH' {} /dev/null \;
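
    To narrow the search to export lines whose value is not quoted, a heuristic pattern can be used. The sketch below is an assumption-laden example: find_unquoted_exports is a hypothetical helper, it relies on grep -r (a GNU/BSD extension), and the pattern only catches values that start without a quote and contain a $ or a backquote, so it can miss cases or flag harmless ones.

```shell
# find_unquoted_exports is a hypothetical helper name, not part of Kaldi.
find_unquoted_exports() {
    # $1: directory to search recursively.
    # Matches "export NAME=" followed by an unquoted $ or backquote.
    pattern="^[[:space:]]*export[[:space:]]+[A-Za-z_][A-Za-z0-9_]*=[^\"'[:space:]]*[\$\`]"
    grep -rnE "$pattern" "$1"
}

# Demo on a throwaway file with lines taken from path.sh plus a quoted variant.
demo_dir="$(mktemp -d)"
printf '%s\n' \
    'export KALDI_ROOT=`pwd`/../../..' \
    'export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH' \
    'export PATH="$KALDI_ROOT/tools/sctk/bin:$PATH"' \
    'export LC_ALL=C' > "$demo_dir/path.sh"

# Only the first two lines should be reported.
flagged="$(find_unquoted_exports "$demo_dir")"
echo "$flagged"

rm -rf "$demo_dir"
```

    Against the real tree the call would be find_unquoted_exports /home/nitin/kaldi; each reported line is a candidate for adding double quotes.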
    

    I just noticed that path.sh is a bit inconsistent. It uses both command substitution of pwd and the variable $PWD, and both $KALDI_ROOT and a hard-coded ../../.., as if the lines were written by different people.