So I'm trying to translate Morse code signals into their string representation. Some forms of preprocessing yield one-dimensional arrays of normalized floats in [0, 1] that serve as input to a C/RNN. Example:
This image is stretched along the y-axis for better visibility, but the inputs to the NN are 1D. I'm looking for a smart way to translate the contents of the image; in this example the correct translation would be "WPM = TEXT I". My current model uses Keras' CTC loss as in this tutorial. However, it detects the letter "E" for every single timestep ("E" is the Morse equivalent of "." or a small bar in the image), so I figure that the "step size" is too small. This is reinforced by another attempt that classifies every timestep above some threshold as "E" and everything else as [UNK]/blank.
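For context, my setup is roughly along these lines (a minimal sketch, not my exact code; the layer sizes, signal length, and class count are placeholders):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 30          # letters, digits, "=", etc. + 1 for the CTC blank (placeholder)
signal_len = 1024         # number of timesteps in the 1D input (placeholder)

inputs = keras.Input(shape=(signal_len, 1), name="signal")
x = layers.Conv1D(32, 5, padding="same", activation="relu")(inputs)
x = layers.Conv1D(64, 5, padding="same", activation="relu")(x)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)  # per-timestep class probabilities

model = keras.Model(inputs, outputs)

def ctc_loss(y_true, y_pred):
    # y_true: padded label sequences, y_pred: per-timestep softmax outputs
    batch = tf.shape(y_true)[0]
    input_len = tf.fill([batch, 1], tf.shape(y_pred)[1])
    label_len = tf.fill([batch, 1], tf.shape(y_true)[1])
    return keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)

model.compile(optimizer="adam", loss=ctc_loss)
```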
I think the main problem is the vast difference in size between, for example, an "E" (one thin line) and other characters such as "=", represented by the small lines framed by two thick ones as seen in the middle (-...-). This should be less of a problem in voice recognition, because there you can make phonetic sense of very short time segments (like hearing the "i" sound in "thin" and "gym"), which is not possible in this context.
Perhaps someone can come up with a smart solution, either for this implementation or through a different representation of the inputs, or something along those lines.
I have also used CTC loss successfully for extracting textual information from traffic sign plates.
Intuitively, unless you have many examples so that the CNN (encoder) can learn that different sizes can point to the same letter, you will not be able to learn this successfully.
Indeed, the theoretical foundation of CTC does imply that the loss function can handle different sizes, but in your particular case a thicker line can easily be classified as a repetition of the previous, thinner letter.
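To make the collapse rule concrete, here is a tiny toy example (the class indices are made up; the last class acts as the CTC blank, which is what the greedy decoder in Keras/TensorFlow assumes). A run of identical non-blank predictions collapses to a single symbol, so if the network outputs "E" for every timestep, only blanks can split that run into separate letters:

```python
import numpy as np
from tensorflow import keras

# One sample, 6 timesteps, 2 classes: index 0 = "E", index 1 = blank (last class).
probs = np.array([[
    [0.9, 0.1],      # E
    [0.9, 0.1],      # E
    [0.9, 0.1],      # E
    [0.1, 0.9],      # blank
    [0.9, 0.1],      # E
    [0.9, 0.1],      # E
]], dtype=np.float32)

decoded, _ = keras.backend.ctc_decode(probs, input_length=np.array([6]), greedy=True)
print(decoded[0].numpy())  # -> [[0 0]], i.e. "EE": repeats collapse, the blank splits the runs
```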
One possible approach I would try is to reduce the number of timesteps / the maximum length of the words you are processing. Intuitively, this would (provided we keep the same width of the image) enforce a larger classification frame for the RNN. In your particular case this could prove to be a helpful approach, since you are interested in the network's capacity to interpret a broader region (unlike the CAPTCHA example in the tutorial).
So in the image below, the bins would be wider, allowing for a better grasp of each symbol (the pink rectangles would be wider).
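In Keras terms, one way to get that effect is to downsample the time axis before the RNN, e.g. with pooling or strided convolutions, so that each remaining timestep covers a wider slice of the original signal. This is only a sketch with made-up sizes:

```python
from tensorflow import keras
from tensorflow.keras import layers

signal_len = 1024
num_classes = 30

inputs = keras.Input(shape=(signal_len, 1), name="signal")
x = layers.Conv1D(32, 7, padding="same", activation="relu")(inputs)
x = layers.MaxPooling1D(pool_size=2)(x)   # 1024 -> 512 timesteps
x = layers.Conv1D(64, 7, padding="same", activation="relu")(x)
x = layers.MaxPooling1D(pool_size=2)(x)   # 512 -> 256 timesteps
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
outputs = layers.Dense(num_classes, activation="softmax")(x)

model = keras.Model(inputs, outputs)
# Note: with CTC, the input_length fed to ctc_batch_cost must now be the
# downsampled length (here signal_len // 4), and it must still be at least
# as long as the longest label sequence.
```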
Another important aspect to consider is the size and diversity of the dataset. Ensure that you use augmentation and have enough samples for training. What I have also noticed with CTC is that, for a successful result, you also need a variety of text to be analyzed (not only the number of samples, but also the text within each sample). Here, the amount of data plays an even greater role: it is easier for a network to distinguish between A and X, but much harder to differentiate between thicker and thinner lines.
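For the augmentation part, something along these lines might be a starting point for 1D Morse signals (the ranges are guesses, not tuned values): random time-stretching simulates different keying speeds, and noise/amplitude jitter simulates imperfect preprocessing.

```python
import numpy as np

def augment_signal(signal, rng=None):
    """Simple, illustrative augmentations for a 1D Morse signal in [0, 1]."""
    rng = rng or np.random.default_rng()

    # Random time-stretch: resample to a new length so dits/dahs vary in width.
    stretch = rng.uniform(0.8, 1.25)
    new_len = max(1, int(len(signal) * stretch))
    old_idx = np.linspace(0.0, 1.0, num=len(signal))
    new_idx = np.linspace(0.0, 1.0, num=new_len)
    stretched = np.interp(new_idx, old_idx, signal)

    # Amplitude jitter plus additive Gaussian noise, clipped back to [0, 1].
    noisy = stretched * rng.uniform(0.8, 1.0) + rng.normal(0.0, 0.02, size=new_len)
    return np.clip(noisy, 0.0, 1.0)
```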
Image source: https://towardsdatascience.com/intuitively-understanding-connectionist-temporal-classification-3797e43a86c