python-3.x  tensorflow  lstm  handwriting-recognition  ctc

CTC + BLSTM Architecture Stalls/Hangs before 1st epoch


I am working on code for online handwriting recognition. It uses a CTC loss function and Word Beam Search decoding (custom implementation by githubharald).

TF Version: 1.14.0

Following are the parameters used:

batch_size: 128
total_epoches: 300
hidden_unit_size: 128
num_layers: 2
input_dims: 10 (number of input features)
num_classes: 80 (CTC output logits)
save_freq: 5
learning_rate: 0.001
decay_rate: 0.99
momentum: 0.9
max_length: 1940.0 (maximum sequence length; the BLSTM handles variable-length time steps)
label_pad: 63
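
For reference, here is a minimal sketch of how these hyperparameters wire into a TF 1.x BLSTM + CTC graph (a simplified illustration, not my exact code; the placeholder names are made up):

    import tensorflow as tf

    batch_size = 128; hidden_unit_size = 128; num_layers = 2
    input_dims = 10; num_classes = 80  # last class is the CTC blank
    learning_rate = 0.001; decay_rate = 0.99; momentum = 0.9

    inputs = tf.placeholder(tf.float32, [None, None, input_dims])  # [batch, time, features]
    seq_len = tf.placeholder(tf.int32, [None])                     # true length per sample
    labels = tf.sparse_placeholder(tf.int32)                       # sparse CTC labels

    def stacked_cells():
        return tf.nn.rnn_cell.MultiRNNCell(
            [tf.nn.rnn_cell.LSTMCell(hidden_unit_size) for _ in range(num_layers)])

    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        stacked_cells(), stacked_cells(), inputs,
        sequence_length=seq_len, dtype=tf.float32)
    rnn_out = tf.concat([out_fw, out_bw], axis=2)

    logits = tf.layers.dense(rnn_out, num_classes)
    logits_tm = tf.transpose(logits, [1, 0, 2])  # ctc_loss expects time-major input

    loss = tf.reduce_mean(tf.nn.ctc_loss(labels, logits_tm, seq_len))
    train_op = tf.train.RMSPropOptimizer(
        learning_rate, decay=decay_rate, momentum=momentum).minimize(loss)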

The problem I'm facing is that after changing the decoder from the CTC greedy decoder to Word Beam Search, my code stalls after a particular step. It never reaches the output of the first epoch and has now been stuck for about 5-6 hours.

The last log line before it stalls: tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10

I am training on an NVIDIA DGX-2 (name: Tesla V100-SXM3-32GB).


Solution

  • Here is the paper describing word beam search; maybe it contains some useful information for you (I'm the author of the paper).

    I would look at your task as two separate parts:

    1. optical model, i.e. train a model that is as good as possible at reading text just by "looking" at it
    2. language model, i.e. use a large enough text corpus and a fast enough mode of the decoder

    To select the best model for part (1), best path (greedy) decoding is good enough for validation. If the best path already contains wrong characters, chances are high that beam search cannot recover either (even with a language model).
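
    As a minimal sketch (TF 1.x), validating with the greedy decoder looks roughly like this; logits_tm (time-major logits) and seq_len are assumed to come from your BLSTM and are illustrative names, not your actual variables:

        import tensorflow as tf

        # decoded[0] is a SparseTensor of label indices per batch element
        decoded, neg_log_prob = tf.nn.ctc_greedy_decoder(logits_tm, seq_len)
        # densify for easy comparison against ground-truth labels
        dense_decoded = tf.sparse.to_dense(decoded[0], default_value=-1)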

    Now to part (2). Regarding the runtime of word beam search: you are using the "NGramsForecast" mode, which is the slowest of all modes. Its running time is O(W*log(W)), with W being the number of words in the dictionary, while "NGrams" runs in O(log(W)). If you look at Table 1 in the paper, you will see that runtime gets much worse with the forecast modes ("NGramsForecast" or "NGramsForecastAndSample"), while the character error rate may or may not improve (e.g. on the IAM dataset, "Words" mode takes 90ms while "NGramsForecast" takes over 16s).
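
    If you use the TF custom-op build of the decoder, switching to a faster mode is a one-argument change. A rough sketch based on the repository's README follows; mat (time-major softmax output with the blank as last class), corpus, chars, and word_chars are assumed to already exist, and the exact signature may differ in your version of the op:

        import tensorflow as tf

        # load the compiled word beam search custom op
        word_beam_search_module = tf.load_op_library('TFWordBeamSearch.so')

        # beam width 25; 'NGrams' mode (O(log W)) instead of the much slower
        # 'NGramsForecast' (O(W*log W)); 0.0 is the LM smoothing parameter
        decoded = word_beam_search_module.word_beam_search(
            mat, 25, 'NGrams', 0.0,
            corpus.encode('utf8'), chars.encode('utf8'), word_chars.encode('utf8'))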

    For practical use cases, I suggest the following: