Tags: python, cntk

Problems implementing CRNN with CNTK


I'm quite new to machine learning, and as a learning exercise I'm trying to implement a convolutional recurrent neural network (CRNN) in CNTK to recognize variable-length text from images. The basic idea is to take the output of a CNN, turn it into a sequence, feed it to an RNN, and use CTC as the loss function. I followed the 'CNTK 208: Training Acoustic Model with Connectionist Temporal Classification (CTC) Criteria' tutorial, which shows the basics of CTC usage. Unfortunately, during training my network converges to outputting only blank labels and nothing else, because for some reason that gives the smallest loss.

I'm feeding my network images with dimensions (1, 32, 96), which I generate on the fly to show some random letters. As labels I give it a sequence of one-hot encoded letters, with the blank required by CTC at index 0 (all of this as numpy arrays, because I use custom data loading). It turns out that for the forward_backward() function to work, both of its inputs need to use the same dynamic axis with the same length. I achieve this by making my label string the same length as the network output and by using to_sequence_like() in the code below (I don't know how to do this better; a side effect of using to_sequence_like() here is that I need to pass dummy label data when evaluating the model).
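My label encoding looks roughly like this (a simplified sketch of the custom loader, using the alphabet and num_output_classes defined in the code below; the +1 offset for character indices and the blank padding up to the 23-step output length are assumptions of the sketch, since blank occupies index 0):

import numpy as np

def encode_label(text, seq_len=23):
    #one-hot matrix of shape (seq_len, num_output_classes), blank-padded at the end
    onehot = np.zeros((seq_len, num_output_classes), dtype=np.float32)
    for i in range(seq_len):
        idx = alphabet.index(text[i]) + 1 if i < len(text) else 0  # 0 = CTC blank
        onehot[i, idx] = 1.0
    return onehot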

import cntk as C

alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"
input_dim_model = (1, 32, 96)    # images are 96 x 32 with 1 channel of color (gray)
num_output_classes = len(alphabet) + 1
ltsm_hidden = 256

def bidirectionalLTSM(features, nHidden, nOut):
    a = C.layers.Recurrence(C.layers.LSTM(nHidden))(features)                     #forward direction
    b = C.layers.Recurrence(C.layers.LSTM(nHidden), go_backwards=True)(features)  #backward direction
    c = C.splice(a, b)                                                            #concatenate both directions
    r = C.layers.Dense(nOut)(c)
    return r

def create_model_rnn(features):
    h = features
    h = bidirectionalLTSM(h, ltsm_hidden, ltsm_hidden)
    h = bidirectionalLTSM(h, ltsm_hidden, num_output_classes)
    return h

def create_model_cnn(features):
    with C.layers.default_options(init=C.glorot_uniform(), activation=C.relu):
        h = features

        h = C.layers.Convolution2D(filter_shape=(3,3), 
                                    num_filters=64, 
                                    strides=(1,1), 
                                    pad=True, name='conv_0')(h)

        #more layers...

        h = C.layers.BatchNormalization(name="batchnorm_6")(h)

        return h

x = C.input_variable(input_dim_model, name="x")
label = C.sequence.input((num_output_classes), name="y")

def create_model(features):
    #Composite(x: Tensor[1,32,96]) -> Tensor[512,1,23]
    a = create_model_cnn(features) 
    a = C.reshape(a, (512, 23))
    #Composite(x: Tensor[1,32,96]) -> Tensor[23,512]
    a = C.swapaxes(a, 0, 1) 

    #is there a better way to convert to sequence and still be compatible with forward_backwards() ?
    #Composite(x: Tensor[1,32,96], y: Sequence[Tensor[37]]) -> Sequence[Tensor[512]]
    a = C.to_sequence_like(a, label) 

    #Composite(x: Tensor[1,32,96], y: Sequence[Tensor[37]]) -> Sequence[Tensor[37]]
    a = create_model_rnn(a) 
    return a

#Composite(x: Tensor[1,32,96], y: Sequence[Tensor[37]]) -> Sequence[Tensor[37]]
z = create_model(x)

#LabelsToGraph(y: Sequence[Tensor[37]]) -> Sequence[Tensor[37]]
graph = C.labels_to_graph(label)

#Composite(y: Sequence[Tensor[37]], x: Tensor[1,32,96]) -> np.float32
criteria = C.forward_backward(graph, z, blankTokenId=0)

err = C.edit_distance_error(z, label, squashInputs=True, tokensToIgnore=[0])
lr = C.learning_rate_schedule(0.01, C.UnitType.sample)
learner = C.adadelta(z.parameters, lr)

progress_printer = C.logging.progress_print.ProgressPrinter(50, first=10, tag='Training')
trainer = C.Trainer(z, (criteria, err), learner, progress_writers=[progress_printer])

#some more custom code ...
#below is how I'm feeding the data

while True:
    x1, y1 = custom_datareader.next_minibatch()
    #x1 is a list of numpy arrays containing training images
    #y1 is a list of numpy arrays with one hot encoded labels

    trainer.train_minibatch({x: x1, label: y1})
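
For completeness, the dummy-label evaluation workaround mentioned above looks roughly like this (a sketch; test_image stands for a single (1, 32, 96) numpy array, and the 23-step length and index-to-character mapping are the same assumptions as in the encoding sketch above):

#dummy labels only define the dynamic axis required by to_sequence_like(); their values are ignored
dummy = np.zeros((23, num_output_classes), dtype=np.float32)
out = z.eval({x: [test_image], label: [dummy]})[0]
#greedy per-step argmax; repeats and blanks still need to be collapsed to get the final text
decoded = "".join("-" if i == 0 else alphabet[i - 1] for i in np.argmax(out, axis=1))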

The network converges very quickly, although not to where I want it (on the left is the network output, on the right the labels I'm giving it):

Minibatch[  11-  50]: loss = 3.506087 * 58880, metric = 176.23% * 58880;
lllll--55leym---------- => lllll--55leym----------, gt: aaaaaaaaaaaaaaaaaaaayox
-------bbccaqqqyyyryy-q => -------bbccaqqqyyyryy-q, gt: AAAAAAAAAAAAAAAAAAAJPTA
tt22yye------yqqqtll--- => tt22yye------yqqqtll---, gt: tttttttttttttttttttyliy
ceeeeeeee----eqqqqqqe-q => ceeeeeeee----eqqqqqqe-q, gt: sssssssssssssssssssskht
--tc22222al55a5qqqaa--q => --tc22222al55a5qqqaa--q, gt: cccccccccccccccccccaooa
yyyyyyiqaaacy---------- => yyyyyyiqaaacy----------, gt: cccccccccccccccccccxyty
mcccyya----------y---qq => mcccyya----------y---qq, gt: ppppppppppppppppppptjnj
ylncyyyy--------yy--t-y => ylncyyyy--------yy--t-y, gt: sssssssssssssssssssyusl
tt555555ccc------------ => tt555555ccc------------, gt: jjjjjjjjjjjjjjjjjjjmyss
-------eeeaadaaa------5 => -------eeeaadaaa------5, gt: fffffffffffffffffffciya
eennnnemmtmmy--------qy => eennnnemmtmmy--------qy, gt: tttttttttttttttttttajdn
-rcqqqqaaaacccccycc8--q => -rcqqqqaaaacccccycc8--q, gt: aaaaaaaaaaaaaaaaaaaixvw
------33e-bfaaaaa------ => ------33e-bfaaaaa------, gt: uuuuuuuuuuuuuuuuuuupfyq
r----5t5y5aaaaa-------- => r----5t5y5aaaaa--------, gt: fffffffffffffffffffapap
deeeccccc2qqqm888zl---t => deeeccccc2qqqm888zl---t, gt: hhhhhhhhhhhhhhhhhhhlvjx
 Minibatch[  51- 100]: loss = 1.616731 * 73600, metric = 100.82% * 73600;
----------------------- => -----------------------, gt: kkkkkkkkkkkkkkkkkkkakyw
----------------------- => -----------------------, gt: ooooooooooooooooooopwtm
----------------------- => -----------------------, gt: jjjjjjjjjjjjjjjjjjjqpny
----------------------- => -----------------------, gt: iiiiiiiiiiiiiiiiiiidspr
----------------------- => -----------------------, gt: fffffffffffffffffffatyp
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvmccf
----------------------- => -----------------------, gt: dddddddddddddddddddsfyo
----------------------- => -----------------------, gt: yyyyyyyyyyyyyyyyyyylaph
----------------------- => -----------------------, gt: kkkkkkkkkkkkkkkkkkkacay
----------------------- => -----------------------, gt: uuuuuuuuuuuuuuuuuuujuqs
----------------------- => -----------------------, gt: sssssssssssssssssssovjp
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvibma
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvaajt
----------------------- => -----------------------, gt: tttttttttttttttttttdhfo
----------------------- => -----------------------, gt: yyyyyyyyyyyyyyyyyyycmbh
 Minibatch[ 101- 150]: loss = 0.026177 * 73600, metric = 100.00% * 73600;
----------------------- => -----------------------, gt: iiiiiiiiiiiiiiiiiiiavoo
----------------------- => -----------------------, gt: lllllllllllllllllllaara
----------------------- => -----------------------, gt: pppppppppppppppppppmufu
----------------------- => -----------------------, gt: sssssssssssssssssssaacd
----------------------- => -----------------------, gt: uuuuuuuuuuuuuuuuuuujulx
----------------------- => -----------------------, gt: vvvvvvvvvvvvvvvvvvvoaqy
----------------------- => -----------------------, gt: dddddddddddddddddddvjmr
----------------------- => -----------------------, gt: oooooooooooooooooooxlvl
----------------------- => -----------------------, gt: dddddddddddddddddddqqlo
----------------------- => -----------------------, gt: wwwwwwwwwwwwwwwwwwwwrvx
----------------------- => -----------------------, gt: pppppppppppppppppppxuxi
----------------------- => -----------------------, gt: bbbbbbbbbbbbbbbbbbbkbqv
----------------------- => -----------------------, gt: ppppppppppppppppppplpha
----------------------- => -----------------------, gt: dddddddddddddddddddilol
----------------------- => -----------------------, gt: dddddddddddddddddddqnwf

My question is how to get the network to learn to output the proper captions. I would like to add that I successfully managed to train a model using the same technique implemented in PyTorch, so it's unlikely that the images or labels are the problem. Also, is there a better way to convert the output of the convolutional layers into a sequence with a dynamic axis so that it can still be used with the forward_backward() function?


Solution

  • There are a bunch of things that make training CRNN models difficult in CNTK (the correct way to format labels is tricky, the whole LabelsToGraph conversion, no transcription error metric, etc.). Here is an implementation of the model that works correctly:

    https://github.com/BenjaminTrapani/SceneTextOCR/tree/master

    It relies on a fork of CNTK that fixes an image reader bug, provides a transcription error function, and improves the performance of the text format reader. It also provides an app that generates text-format labels from the mjsynth dataset. For reference, here is how to format your labels:

    513528 |textLabel 7:2
    513528 |textLabel 26:1
    513528 |textLabel 0:2
    513528 |textLabel 26:1
    513528 |textLabel 20:2
    513528 |textLabel 26:1
    513528 |textLabel 11:2
    513528 |textLabel 26:1
    513528 |textLabel 8:2
    513528 |textLabel 26:1
    513528 |textLabel 4:2
    513528 |textLabel 26:1
    513528 |textLabel 17:2
    513528 |textLabel 26:1
    513528 |textLabel 18:2
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    513528 |textLabel 26:1
    

    513528 is the sequence ID; it should match the sequence IDs of the corresponding image data for the same sample. textLabel is the name used to create the stream for the minibatch source. In C++ you create the stream as follows:

    StreamConfiguration textLabelConfig(L"textLabel", numClasses, true, L"textLabel");
    

    26 is the index of the blank character for CTC decoding. The other values before the ":" are the character codes for your labels. The 1 is there to one-hot encode each vector in the sequence. There are a bunch of trailing blank characters to ensure that the sequence is as long as the maximum supported sequence length, since variable-length sequences were not supported by the CTC loss function implementation at the time of writing.
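
    If you are consuming these labels from Python rather than C++, a roughly equivalent stream definition with the stock CTF deserializer might look like the sketch below (the file name labels.ctf and the 27-class count are assumptions based on the example above; the fork's image reader and transcription error function are not covered here):

    import cntk as C

    #27 classes in this example: character codes 0-25 plus the blank at index 26
    num_classes = 27
    source = C.io.MinibatchSource(C.io.CTFDeserializer('labels.ctf', C.io.StreamDefs(
        textLabel=C.io.StreamDef(field='textLabel', shape=num_classes, is_sparse=True))))
    labels = source.streams.textLabel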