Tags: python, regex, python-3.x, spelling

Spelling corrector for non-English characters


Having read Peter Norvig's "How to Write a Spelling Corrector", I tried to make the code work for Persian. I rewrote the code like this:

import re, collections

def normalizer(word):
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    word = word.replace('ٔ', '')
    return word

def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

NWORDS = train(normalizer(open("text.txt", encoding="UTF-8").read()))

alphabet = 'ا آ ب پ ت ث ج چ ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ه ی ء'

In Norvig's original code, NWORDS is the dictionary that records the words and their number of occurrences in the text. I tried print(NWORDS) to see if it works with the Persian characters, but the output is not what I expected: instead of counting words, it counts occurrences of individual letters.

Does anyone have any idea where the code went wrong?

P.S. 'text.txt' is actually a long concatenation of Persian texts, like its equivalent in Norvig's code.


Solution

  • You are applying normalizer to the entire contents of the file as one string, and train then iterates over that string character by character (iterating over a Python string yields individual characters), which is why separate letters are counted instead of words.

    I suppose you actually want to do something like this:

    with open('text.txt', encoding='UTF-8') as file:
        NWORDS = train(normalizer(word) for line in file for word in line.split())
    

    I would also look into using collections.Counter (see the documentation); a sketch follows below.
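
For reference, here is a minimal sketch of the Counter variant, assuming the same text.txt and the whitespace splitting used above. One difference to keep in mind: Norvig's train gives every word a baseline count of 1 via defaultdict(lambda: 1), while a Counter returns 0 for unseen words, so the rest of the corrector would need to account for that.

import collections

def normalizer(word):
    # Map Arabic letter variants to their Persian equivalents and drop the hamza mark
    word = word.replace('ي', 'ی')
    word = word.replace('ك', 'ک')
    word = word.replace('ٔ', '')
    return word

with open('text.txt', encoding='UTF-8') as file:
    # Counter does the counting that train() implements by hand
    NWORDS = collections.Counter(normalizer(word) for line in file for word in line.split())

print(NWORDS.most_common(10))  # the ten most frequent words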