python replace pywikibot

How to add space in python's search/find string?


I have been running a pywikibot bot on Marathi Wikipedia for almost a month now. The bot's only task is find and replace. General information about pywikibot is on the pywikibot page. Details of the find-and-replace operation itself are documented under replace.py and fixes.py, and further examples of fixes are available as well.

The following is a part of my source code. When running the bot on Marathi Wikipedia, I ran into a difficulty related to the Marathi script: all of the replacements work fine except one. For this example, I will use English words instead of Marathi.

The first fix in the following code, "name", searches for "{{PAGENAME}}" and replaces it with "{{subst:PAGENAME}}". The msg parameter is the edit summary.

The second fix, "man", finds "man" and replaces it with "gent". The problem is that it also turns "human" into "hugent", "craftsmanship" into "craftsgentship" and so on.

fixes = {
    'name': {
        'regex': True,
        'nocase': True,
        'msg': {'mr': '{{PAGENAME}} → पानाचे मूळ नाव (base name of page)'},
        'replacements': [
            ( r'{{ *PAGENAME *}}', '{{subst:PAGENAME}}' ),
        ],
    },
    'man': {
        'regex': True,
        'msg': {'mr': 'man → gent'},
        'replacements': [
            ('man', 'gent'),
        ],
    },
}
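
The same effect shows up with plain re.sub. This is only a minimal sketch, assuming pywikibot hands the pattern to Python's re module more or less unchanged; the sample sentence is made up for illustration:

```python
import re

text = 'He was a good man - a true humanitarian with fine craftsmanship.'

# The bare pattern matches 'man' anywhere, including inside longer words.
result = re.sub('man', 'gent', text)
print(result)
```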

So I tried changing the find and replace pair from ('man', 'gent') to ('man ', 'gent ') (with a trailing space) and then to (' man ', ' gent ') (with a space at both ends). But neither change replaced anything, not even the standalone "man".

So how do I change the instance of "He was a good man - a true humanitarian" to "He was a good gent - a true humanitarian" without making it hugentitarian?


Solution

  • You want occurrences of 'man', but only by itself - in other words, only if it's not preceded or followed by other letters or symbols that would be part of a word.

    I don't know if Marathi contains symbols like '-' that could be part of a word, for example 'He was a real man-child', in which case you may or may not want to replace it.

    In English, since you're using regex, you can do this:

    'man': {
        'regex': True,
        'msg': {'mr': 'man → gent'},
        'replacements': [
            # Python's re only accepts fixed-width look-behinds, so there
            # the '|^' alternative has to go.
            (r'(?<=[^\w]|^)man(?=[^\w]|$)', 'gent'),
        ],
    }
    

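    The regular expression '(?<=[^\w]|^)man(?=[^\w]|$)' there means: match 'man' only where it is preceded by a non-word character or the start of the text (the look-behind (?<=[^\w]|^)) and followed by a non-word character or the end of the text (the look-ahead (?=[^\w]|$)). The look-arounds only check the neighbouring characters and are not part of the match, so only 'man' itself gets replaced.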

    Note that this doesn't cover Man, unless your regex engine is already set to be case-insensitive.
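
A rough sketch of the case-insensitive variant in plain Python, using a look-behind without the |^ alternative (which Python's re accepts). The sample sentence is invented, and the fixed replacement 'gent' does not preserve a capital letter. In fixes.py, the 'nocase': True key from your first fix presumably plays the same role as the flag here:

```python
import re

text = 'Every Man is a man, human or not.'
pattern = r'(?<=[^\w])man(?=[^\w]|$)'

# re.IGNORECASE lets the same pattern match 'Man' as well as 'man'.
result = re.sub(pattern, 'gent', text, flags=re.IGNORECASE)
print(result)
```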

    If your regex engine doesn't consider the characters that make up Marathi words to be part of \w, you could replace that with a string of all the characters that make up the language, if that's achievable (unlike it would be in logographic languages like Chinese).
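
A hedged sketch of that idea: instead of \w, spell out the characters yourself. It assumes that everything belonging to a Marathi word lies in the Unicode Devanagari block (U+0900 to U+097F); that range is my assumption and may need extending. It also swaps in negative look-arounds, (?<!...) and (?!...), which succeed at the very start and end of the text too:

```python
import re

# Assumption: every character of a Marathi word is in the Devanagari block.
DEVANAGARI = r'\u0900-\u097F'
pattern = rf'(?<![{DEVANAGARI}])माणूस(?![{DEVANAGARI}])'

text = 'तो एक चांगला माणूस होता माणूसला'
result = re.sub(pattern, 'व्यक्ती', text)

# A word standing alone in the text is matched as well.
alone = re.sub(pattern, 'व्यक्ती', 'माणूस')
print(result)
print(alone)
```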

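    Note that whether the |^ and |$ alternatives are usable depends on the regex engine: some environments accept them, while others do not. Python's re module, for one, rejects the |^ inside the look-behind, because a look-behind there must have a fixed width.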
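
To illustrate the engine difference, here is a small sketch for Python's re as I know it (the sample text is invented). It checks whether the pattern with |^ compiles, and then uses negative look-arounds as one possible alternative that needs no |^ or |$ at all:

```python
import re

text = 'man, human, craftsman and man'

# Python's re only accepts look-behinds of one fixed width, so '|^' is rejected.
try:
    re.compile(r'(?<=[^\w]|^)man(?=[^\w]|$)')
    lookbehind_accepted = True
except re.error:
    lookbehind_accepted = False

# Negative look-arounds need no alternatives and also work at the text edges.
result = re.sub(r'(?<!\w)man(?!\w)', 'gent', text)
print(lookbehind_accepted, result)
```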

    In pure Python, this works:

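    import re
    
    # Raw strings keep the backslash in \w intact for the regex engine.
    text = 'He was a good man, a true humanitarian.'
    print(re.sub(r'(?<=[^\w])man(?=[^\w])', 'gent', text))
    
    # The same pattern applied to Marathi text.
    text = 'तो एक चांगला माणूस होता माणूसला'
    print(re.sub(r'(?<=[^\w])माणूस(?=[^\w])', 'व्यक्ती', text))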
    

    Output:

    He was a good gent, a true humanitarian.
    तो एक चांगला व्यक्ती होता माणूसला
    

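    So (?<=[^\w])man(?=[^\w]) may be all you need, as long as the word never sits at the very start or end of the text, where there is no neighbouring character for the look-arounds to match. (I hope the Marathi here isn't accidentally rude - I blame Google Translate)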
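
Put together, the fixes entry could look roughly like this. The apply_fix helper is hypothetical and not part of pywikibot; it only mimics the replacement with re.sub so the entry can be tried outside the bot:

```python
import re

# Sketch of the fixes.py entry with the fixed-width look-around pattern.
fixes = {
    'man': {
        'regex': True,
        'msg': {'mr': 'man → gent'},
        'replacements': [
            (r'(?<=[^\w])man(?=[^\w])', 'gent'),
        ],
    },
}


def apply_fix(fix, text):
    """Hypothetical helper (not pywikibot API): apply each pair with re.sub."""
    for pattern, replacement in fix['replacements']:
        text = re.sub(pattern, replacement, text)
    return text


text = 'He was a good man - a true humanitarian'
result = apply_fix(fixes['man'], text)
print(result)
```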