Having read Peter Norvig's "How to Write a Spelling Corrector", I tried to make the code work for Persian. I rewrote the code like this:
import re, collections

def normalizer(word):
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    word = word.replace('ٔ', '')
    return word

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(normalizer(open("text.txt", encoding="UTF-8").read()))

alphabet = 'ا آ ب پ ت ث ج چ ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی ء'
In Norvig's original code, NWORDS is the dictionary that records the words and their number of occurrences in the text. I tried print(NWORDS) to see whether it works with Persian characters, but the result is not what I expected: it doesn't count words, it counts occurrences of individual letters.
Does anyone have any idea where the code went wrong?
P.S. 'text.txt' is actually a long concatenation of Persian texts, like its equivalent in Norvig's code.
You are applying normalizer (and then train) to the entire file contents as one long string, so train iterates over it one character at a time instead of one word at a time.
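To see the difference, note that iterating over a string yields single characters, while splitting it on whitespace yields words. A minimal illustration (the sample string is just an example, not from your corpus):

text = 'این یک متن فارسی است'

# Iterating over the raw string gives one character at a time,
# which is why your NWORDS ends up keyed by letters.
print(list(text)[:4])    # ['ا', 'ی', 'ن', ' ']

# Splitting on whitespace gives whole words, which is what train() should count.
print(text.split()[:2])  # ['این', 'یک']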
I suppose you actually want to do something like this:
with open('text.txt', encoding='utf-8') as file:
    NWORDS = train(normalizer(word) for line in file for word in line.split())
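With that change, NWORDS is keyed by whole words rather than single letters, so a lookup such as NWORDS['کتاب'] (an arbitrary example word) gives a usable frequency. Unknown words still fall back to the defaultdict's default of 1, which is the smoothing the rest of the corrector expects.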
I would also look into using collections.Counter, which is documented in the standard library and is made for exactly this kind of counting.
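A minimal sketch of that approach, reusing your normalizer and assuming the same text.txt (the name train_counter is just for illustration):

import collections

def train_counter(words):
    # Counter does the tallying that the hand-rolled defaultdict version does.
    return collections.Counter(words)

with open('text.txt', encoding='utf-8') as file:
    NWORDS = train_counter(normalizer(word) for line in file for word in line.split())

# Note: unlike defaultdict(lambda: 1), Counter returns 0 for unseen words,
# so the scoring part of the corrector may need a small adjustment.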