Replace spaces with non-breaking spaces according to a specific criterion


I want to clean up files that contain bad formatting, more precisely, replace "normal" spaces with non-breaking spaces according to a given criterion.

For example:

If in a sentence, I have:

"You need to walk 5 km."

I need to replace the space between 5 and km with a non-breaking space.

So far, I have managed to do this:

import os

unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# iterate and read all files in the directory
for file in os.listdir():
    # check if the file is a file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # search for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')

But this replaces all the spaces in the document, not just the ones I want. Can you help me?


EDIT

After UlfR's answer, which was very useful and relevant, I would like to take my criteria further and make my search/replace more complex.

Now I would like to look at the characters before/after a word in order to replace spaces with non-breaking spaces. For example:

I've tried this:

units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']

nbsp = '\u00A0'

rgx = re.sub(r'(\b\d+)(%s) (%s)\b'%(units, units_before_after),r'\1%s\2'%nbsp,text))

print(rgx)

But I'm having some trouble. Do you have any ideas to share?


Solution

  • You should use the re module to do the replacement, like so:

    import re
    
    text = "You need to walk 5 km or 500000 cm."
    units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
    nbsp = '\u00A0'
    
    print(re.sub(r'(\b\d+) (%s)\b'%'|'.join(units),r'\1%s\2'%nbsp,text))
    

    Both the search and replacement patterns are built dynamically, but basically the search pattern matches:

    1. A word boundary \b, so the match starts at the beginning of a number
    2. One or more digits \d+
    3. One space
    4. One of the units km|m|cm|...
    5. A word boundary \b, so the unit is not just the start of a longer word

    The whole match is then replaced by the two captured groups, with the nbsp string between them.

    See the re module documentation for more on how to use regular expressions in Python. It's well worth the time to learn the basics, since regular expressions are a very powerful and useful tool!

    Have fun :)
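
If you want to reuse this, the pattern described above could be wrapped in a small helper. This is only a sketch: the names NBSP, build_unit_pattern and add_nbsp are made up for illustration and are not part of the answer's code. It also passes each unit through re.escape, assuming a unit might one day contain a regex metacharacter, and sorts the units longest first as a precaution (with the trailing \b the order does not change the result for this list).

```python
import re

NBSP = "\u00A0"  # hypothetical constant name for the non-breaking space


def build_unit_pattern(units):
    """Compile a pattern matching '<digits><space><unit>' as whole words."""
    # re.escape guards against units that contain regex metacharacters.
    # Longest-first ordering is a precaution; the trailing \b already
    # prevents 'm' from matching the start of 'mm' or 'mi'.
    ordered = sorted(units, key=len, reverse=True)
    alternation = "|".join(re.escape(u) for u in ordered)
    return re.compile(r"\b(\d+) (%s)\b" % alternation)


def add_nbsp(text, units):
    """Replace the space between a number and a unit with a non-breaking space."""
    pattern = build_unit_pattern(units)
    # Group 1 is the number, group 2 is the unit.
    return pattern.sub(r"\1" + NBSP + r"\2", text)


units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
print(repr(add_nbsp("You need to walk 5 km or 500000 cm.", units)))
```

Text such as "5 miles" or "5 minutes" is left alone, because the final \b requires the unit to end there. To clean up files as in the question, the same function could be applied to the content read inside the loop, with the result written back to the file afterwards.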