dictionary nlp words n-gram pytorch

Google Ngram Viewer - English One Million


I'm training a language model in PyTorch and need the one million most common English words to serve as a dictionary.

From what I understand, the Google Ngram English One Million (1-grams) dataset might suit this task. However, after downloading every part (0-9) and running tail on each one to check that they were what I expected, I found that no part appears to contain words beyond the letter F.

As far as I understand, every Version 1 file has its n-grams sorted alphabetically and then chronologically. Is it possible that the one million most common words do not go beyond F?

Or am I missing the point of this dataset, and it isn't the one million most common words at all?


Solution

  • Try shuf <file> to get a random ordering, and you will see that the data covers all letters. What you see at the end of the files is not an f but a ligature (a single character such as ﬁ that merely looks like it starts with f).
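To check this without relying on how the terminal renders the character, something like the following sketch may help. It assumes each line is tab-separated with the n-gram in the first column; the helper names (first_char_counts, describe) and the sample rows are made up for illustration and are not part of the dataset or any library.

```python
import unicodedata
from collections import Counter


def first_char_counts(lines):
    """Count the first character of the n-gram on each tab-separated line."""
    counts = Counter()
    for line in lines:
        # Assumption: the n-gram is the first tab-separated field.
        token = line.rstrip("\n").split("\t", 1)[0]
        if token:
            counts[token[0]] += 1
    return counts


def describe(ch):
    """Return the Unicode name, which exposes ligatures that look like plain letters."""
    return unicodedata.name(ch, "UNKNOWN")


# Hypothetical sample rows; "\ufb01" is the single-character ligature "fi".
sample = [
    "apple\t1990\t12\t10\t5",
    "zebra\t1990\t3\t3\t2",
    "\ufb01sh\t1990\t1\t1\t1",
]

# Sorting by code point puts the ligature after "z", so it ends up last.
for ch, n in sorted(first_char_counts(sample).items()):
    print(repr(ch), describe(ch), n)

# NFKC normalization folds the ligature back into the plain letters "fi".
print(unicodedata.normalize("NFKC", "\ufb01sh"))
```

To run it on a real part, pass an open file object instead of the sample list, for example first_char_counts(open(path, encoding="utf-8")), where path points at one of the downloaded files. If the ligature forms are unwanted in the vocabulary, applying NFKC normalization to each token before counting is one possible way to merge them with their plain spellings.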