python keras deep-learning tokenize seq2seq

Can't Initialise Two Different Tokenizers with Keras


For a spelling correction task, I built a seq2seq model with LSTM layers and an attention mechanism. I do char-level tokenisation with Keras. I initialised two different tokenizers, one for the sentences with typos and the other for the corrected sentences.

After testing, I saw that the model produced empty strings, and I believe the problem lies in the tokenisation: if the two tokenizers were identical, there would be no need to initialise separate ones for typos and corrected sentences.

When I inspected the word_index of these tokenizers, I noticed that the index assigned to each character appears to be the same in both. For example:

Tokenized sentence with typo tokenizer: [[20, 17, 2, 24, 8, 1, 13, 3, 5, 24, 4, 7, 7, 3, 6, 1, 4, 12, 3, 1, 27, 2, 13, 4, 1, 25, 2, 13, 2, 13, 1, 2, 19, 3, 23, 4, 1, 3, 27, 24, 2, 5, 3, 1, 23, 21]]

Vocabulary: {' ': 1, 'a': 2, 'e': 3, 'i': 4, 'n': 5, 'r': 6, 'l': 7, 'ı': 8, 'd': 9, 'k': 10, 't': 11, 's': 12, 'm': 13, 'u': 14, 'y': 15, 'o': 16, 'b': 17, 'ü': 18, 'ş': 19, '<': 20, '>': 21, 'g': 22, 'v': 23, 'z': 24, 'h': 25, 'p': 26, 'c': 27, 'ç': 28, 'ğ': 29, 'ö': 30, 'f': 31, '1': 32, '0': 33, '2': 34, '9': 35, 'j': 36, 'w': 37, '8': 38, '3': 39, '5': 40, '4': 41, '6': 42, '7': 43, 'x': 44, 'q': 45}

Tokenized sentence with corrects tokenizer: [[20, 17, 2, 24, 8, 1, 13, 3, 5, 24, 4, 7, 7, 3, 6, 1, 4, 12, 3, 1, 27, 2, 13, 4, 1, 25, 2, 13, 2, 13, 1, 2, 19, 3, 23, 4, 1, 3, 27, 24, 2, 5, 3, 1, 23, 21]]

Vocabulary: {' ': 1, 'a': 2, 'e': 3, 'i': 4, 'n': 5, 'r': 6, 'l': 7, 'ı': 8, 'd': 9, 'k': 10, 't': 11, 's': 12, 'm': 13, 'u': 14, 'y': 15, 'o': 16, 'b': 17, 'ü': 18, 'ş': 19, '<': 20, '>': 21, 'g': 22, 'v': 23, 'z': 24, 'h': 25, 'p': 26, 'c': 27, 'ç': 28, 'ğ': 29, 'ö': 30, 'f': 31, '1': 32, '0': 33, '2': 34, '9': 35, '8': 36, '3': 37, 'j': 38, '5': 39, '4': 40, '6': 41, '7': 42, 'w': 43, 'x': 44, 'q': 45}

I initialised them like this:

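import tensorflow as tf

# Char-level tokenizer for the sentences containing typos.
# NUM_WORDS is a constant defined elsewhere in my code.
typos_tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=NUM_WORDS,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    char_level=True
)

# Separate char-level tokenizer for the corrected sentences.
corrects_tokenizer = tf.keras.preprocessing.text.Tokenizer(
    num_words=NUM_WORDS,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    char_level=True
)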

Why can't I initialise two different tokenisers with this method?

I was expecting two different word_index dictionaries for typos_tokenizer and corrects_tokenizer.


Solution

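  • They are not the same: e.g. in the first word_index 36 is 'j', while in the second 36 is '8'. The two dictionaries agree on indices 1 to 35 (and on 'x' and 'q') but differ on 36 to 43. The example sentence only uses indices up to 27, which is why it encodes identically with both tokenizers. Keras builds word_index by ranking characters by descending frequency in the texts each tokenizer was fitted on, and a typo corpus and its corrected counterpart have almost the same character frequencies, so most ranks coincide.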
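To illustrate the point, here is a minimal pure-Python sketch of frequency-ranked character indexing. It only mimics the ranking idea (most frequent character gets index 1) and is not the Keras implementation. The helper names build_char_index and diff_index and the two toy sentences are made up for this example.

```python
from collections import Counter


def build_char_index(texts, lower=True):
    """Rank characters by descending frequency; index 1 is the most frequent.

    Hypothetical helper that imitates a char-level, frequency-ranked vocabulary.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower() if lower else text)
    # sorted() is stable, so characters with equal counts keep first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {ch: i for i, (ch, _) in enumerate(ranked, start=1)}


def diff_index(a, b):
    """Return {char: (index_in_a, index_in_b)} for shared chars whose index differs."""
    return {ch: (a[ch], b[ch]) for ch in sorted(a.keys() & b.keys()) if a[ch] != b[ch]}


# Toy corpora (assumed data): one sentence with typos, one corrected.
typo_index = build_char_index(["<helo wrld>"])
correct_index = build_char_index(["<hello world>"])

differing = diff_index(typo_index, correct_index)
print(typo_index == correct_index)  # False: the two dictionaries are distinct
print(differing)                    # only some characters change rank
```

Running the same kind of diff on typos_tokenizer.word_index and corrects_tokenizer.word_index should list exactly the characters whose indices disagree, which is easier than comparing the printed dictionaries by eye.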