python-3.x nlp persian

How to normalize Persian texts with Hazm


I have a folder containing some other folders, each of which contains a lot of text files. I have to extract 5 words before and after a specific word, and the following code works fine.

The problem is that because I did not normalize the texts, it only returns a few of the matching sentences when there are more. For Persian there is a module called Hazm for normalizing texts. How can I use it in this code?

As an example of normalizing: "ك" should change to "ک" and "ؤ" should change to "و", because the former characters are actually Arabic letters that are sometimes used in Persian text. Without normalizing, the code only returns the words written in the second (Persian) form and does not recognize the words written in the first (Arabic) form.
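
For instance, Hazm's Normalizer on its own seems to do this kind of character replacement (a minimal sketch, assuming hazm is installed; the exact replacements depend on the hazm version):

from hazm import Normalizer

normalizer = Normalizer()
# The Arabic kaf (ك) and yeh (ي) should come out as Persian ک and ی.
print(normalizer.normalize("ديك"))

Here is my current code: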

import os
from hazm import Normalizer


def getRollingWindow(seq, w):
    # Yield a sliding window of w consecutive words from the iterator seq.
    win = [next(seq) for _ in range(w)]
    yield win
    for e in seq:
        win[:-1] = win[1:]
        win[-1] = e
        yield win


def extractSentences(rootDir, searchWord):
    with open("پاکت", "w", encoding="utf-8") as outfile:
        for root, _dirs, fnames in os.walk(rootDir):
            for fname in fnames:
                print("Looking in", os.path.join(root, fname))
                with open(os.path.join(root, fname), encoding="utf-8") as infile:
                    # normalizer = Normalizer()
                    # fname = normalizer.normalize(fname)
                    for window in getRollingWindow((word for line in infile for word in line.split()), 11):
                        if window[5] != searchWord:
                            continue
                        outfile.write(' '.join(window) + "\n")

Solution

  • I have no experience with Hazm, but it is easy to normalize the text yourself with the following piece of code (note that here we just replace Arabic characters with their Persian equivalents):

    import re


    def clean_sentence(sentence):
        sentence = arToPersianChar(sentence)
        sentence = arToPersianNumb(sentence)
        # more_normalization_function()
        return sentence
    
    
    def arToPersianNumb(number):
        # Map Arabic-Indic digits to their Persian (Extended Arabic-Indic) forms.
        dic = {
            '١': '۱',
            '٢': '۲',
            '٣': '۳',
            '٤': '۴',
            '٥': '۵',
            '٦': '۶',
            '٧': '۷',
            '٨': '۸',
            '٩': '۹',
            '٠': '۰',
        }
        return multiple_replace(dic, number)
    
    
    def arToPersianChar(userInput):
        # Replace Arabic letters with their Persian forms and strip kasra marks.
        dic = {
            'ك': 'ک',
            'دِ': 'د',
            'بِ': 'ب',
            'زِ': 'ز',
            'ذِ': 'ذ',
            'شِ': 'ش',
            'سِ': 'س',
            'ى': 'ی',
            'ي': 'ی'
        }
        return multiple_replace(dic, userInput)
    
    
    def multiple_replace(dic, text):
        # Combine all keys into one regex alternation and replace every match
        # with its mapped value in a single pass.
        pattern = "|".join(map(re.escape, dic.keys()))
        return re.sub(pattern, lambda m: dic[m.group()], str(text))
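
    For example, a sentence mixing Arabic letters and digits should come back in its Persian form (a quick check based on the mappings above; expected output shown as a comment):

    print(clean_sentence('كتاب ٢'))  # -> 'کتاب ۲'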
    

    You just need to read each line of your document and pass it to clean_sentence():

    def clean_all(document):
        # document is an iterable of sentences (e.g. the lines of a file).
        clean = ''
        for sentence in document:
            sentence = clean_sentence(sentence)
            clean += '\n' + sentence
        return clean
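
    To plug this into the extraction code from the question, pass each line through clean_sentence() before splitting it into words. A sketch, reusing getRollingWindow() and the output file name from the question:

    import os

    def extractSentences(rootDir, searchWord):
        with open("پاکت", "w", encoding="utf-8") as outfile:
            for root, _dirs, fnames in os.walk(rootDir):
                for fname in fnames:
                    with open(os.path.join(root, fname), encoding="utf-8") as infile:
                        # Normalize each line before splitting it into words.
                        words = (word
                                 for line in infile
                                 for word in clean_sentence(line).split())
                        for window in getRollingWindow(words, 11):
                            if window[5] != searchWord:
                                continue
                            outfile.write(' '.join(window) + "\n")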