I'm trying to build a character-level language model with NLTK's KneserNeyInterpolated class. What I have is a frequency list of words in a pandas DataFrame, whose only column is the word's frequency (the word itself is the index). Based on the average word length, I've determined that a 9-gram model would be appropriate.
from nltk.util import ngrams
from nltk.lm.models import KneserNeyInterpolated

lm = KneserNeyInterpolated(9)
for i in range(df.shape[0]):
    # fit on the character 9-grams of each word, one word at a time
    lm.fit([list(ngrams(df.index[i], n=9))])
lm.generate(num_words=9)
# ValueError: Can't choose from empty population
Attempt at debugging:
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 9  # order of the ngram model
train_data, padded_sents = padded_everygram_pipeline(4, 'whatisgoingonhere')
model = KneserNeyInterpolated(n)
model.fit(train_data, padded_sents)
model.generate(num_words=10)
# ['r', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>', '</s>']
This works (I guess?), but I can't seem to extend it to training the model on successive new words, and I still can't generate realistic-looking words. I feel like I'm missing something basic about how this module is supposed to work. What has made this difficult is that all the tutorials seem to be based on word-level ngrams.
You need to tokenize your input; apart from that, your approach works.
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.models import KneserNeyInterpolated

n = 3
# .split() wraps the string in a list, so the pipeline treats the word
# as one "sentence" whose tokens are its characters
train, vocab = padded_everygram_pipeline(n, 'whatisgoingonhere'.split())
model = KneserNeyInterpolated(n)
model.fit(train, vocab)
model.generate(num_words=10, random_seed=5)
# => ['i', 's', 'g', 'o', 'n', 'h', 'e', 'r', 'e', '</s>']
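To train on your whole frequency list, don't call fit once per word: make each word one "sentence" of characters and fit a single time over the full corpus. A minimal sketch, assuming df is the frequency DataFrame from your question with the words in df.index:

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.models import KneserNeyInterpolated

n = 3
# one "sentence" of characters per word; df.index is assumed to hold the words
words = [list(word) for word in df.index]
train, vocab = padded_everygram_pipeline(n, words)
model = KneserNeyInterpolated(n)
model.fit(train, vocab)  # a single fit call over the whole corpus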
How you transform your input depends on the kind of source you're working with. For a more realistic case, say your input is a sequence of words from a text:
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

n = 3
# prep inputs
text = "Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."
tokenized = word_tokenize(text)
# each word token becomes one character "sentence" for the pipeline
train, vocab = padded_everygram_pipeline(n, tokenized)
# fit model & generate characters
model = KneserNeyInterpolated(n)
model.fit(train, vocab)
model.generate(num_words=5, random_seed=5)
# => ['o', 'r', 'e', 's', 't']
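Since the model emits one character per step, generating a realistic "word" means sampling until the end-of-sentence pad appears. A minimal sketch of such a helper (the name generate_word and the max_chars cap are my own, not part of NLTK):

def generate_word(model, max_chars=20, random_seed=None):
    # sample up to max_chars characters, stopping at the sentence-end pad
    chars = []
    for ch in model.generate(max_chars, random_seed=random_seed):
        if ch == '</s>':
            break
        if ch == '<s>':  # skip any padding symbols the model emits
            continue
        chars.append(ch)
    return ''.join(chars)

word = generate_word(model, random_seed=5)  # returns a plain string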