I want to clean up files that contain bad formatting. More precisely, I want to replace "normal" spaces with non-breaking spaces according to a given criterion.
For example, if a sentence contains:
"You need to walk 5 km."
I need to replace the space between 5 and km with a non-breaking space.
So far, I have managed to do this:
import os

unites = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# iterate over and read all files in the directory
for file in os.listdir():
    # check that the entry is a regular file
    if os.path.isfile(file):
        # open the file
        with open(file, 'r', encoding='utf-8') as f:
            # read the file
            content = f.read()
            # search for each unit in the file
            for i in unites:
                if i in content:
                    # find the next character after the unit
                    next_char = content[content.find(i) + len(i)]
                    # check if the next character is a space
                    if next_char == ' ':
                        # replace the space with a non-breaking space
                        content = content.replace(i + ' ', i + '\u00A0')
But this replaces all the spaces in the document, not just the ones I want. Can you help me?
EDIT
After UlfR's answer, which was very useful and relevant, I would like to push my criteria further and make my search/replace more complex.
Now I would like to look at the characters before and after a word in order to replace spaces with non-breaking spaces. For example:
I've tried this:
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
units_before_after = ['{']
nbsp = '\u00A0'
rgx = re.sub(r'(\b\d+)(%s) (%s)\b'%(units, units_before_after),r'\1%s\2'%nbsp,text))
print(rgx)
But I'm having some trouble. Do you have any ideas to share?
You should use re to do the replacement, like so:
import re
text = "You need to walk 5 km or 500000 cm."
units = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
nbsp = '\u00A0'
# join the units into an alternation (km|m|cm|...) and put nbsp between the two groups
print(re.sub(r'(\b\d+) (%s)\b' % '|'.join(units), r'\1%s\2' % nbsp, text))
Both the search and replace patterns are dynamically built, but basically you have a pattern that matches:
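\b (a word boundary)
\d+ (one or more digits, captured as group 1)
a single regular space
km|m|cm|... (one of the units, captured as group 2)
\b (another word boundary, so that "5 m" does not match inside "5 minutes")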
Then we replace all of that with the two groups, with the nbsp string between them.
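To make the matching rules above concrete, here is a minimal sketch that wraps the same idea in a reusable function. The names NBSP, UNIT_PATTERN and add_nbsp are made up for this example. Escaping the units with re.escape and sorting them longest first are defensive additions, not something the one-liner needs for this particular unit list.

```python
import re

NBSP = '\u00A0'
UNITS = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']

# Build the alternation once. re.escape only matters if a unit ever
# contains a regex metacharacter; longest-first ordering is a precaution.
UNIT_PATTERN = re.compile(
    r'(\b\d+) (%s)\b'
    % '|'.join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True))
)


def add_nbsp(text):
    """Replace the space between a number and a unit with a non-breaking space."""
    return UNIT_PATTERN.sub(lambda m: m.group(1) + NBSP + m.group(2), text)


result = add_nbsp("You need to walk 5 km or 500000 cm, then go in 5 minutes.")
print(repr(result))
```

In this sketch, "5 km" and "500000 cm" get a non-breaking space, while "5 minutes" is left alone because the trailing \b rejects a unit that is only the start of a longer word.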
See re for more info on how to use regular expressions in Python. It's well worth the time invested to learn the basics, since it's a very powerful and useful tool!
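Since the original goal was to clean up files rather than a single string, the same substitution could be applied per file roughly as follows. This is only a sketch: fix_file and fix_directory are hypothetical helper names, and it assumes every file in the directory is UTF-8 text that may safely be rewritten in place.

```python
import re
from pathlib import Path

NBSP = '\u00A0'
UNITS = ['km', 'm', 'cm', 'mm', 'mi', 'yd', 'ft', 'in']
PATTERN = re.compile(r'(\b\d+) (%s)\b' % '|'.join(UNITS))


def fix_file(path):
    """Rewrite one file in place; return True if its content changed."""
    path = Path(path)
    original = path.read_text(encoding='utf-8')
    fixed = PATTERN.sub(r'\1%s\2' % NBSP, original)
    if fixed != original:
        path.write_text(fixed, encoding='utf-8')
        return True
    return False


def fix_directory(directory='.'):
    """Apply fix_file to every regular file directly inside directory.

    Returns the names of the files that were modified.
    """
    return [
        p.name
        for p in sorted(Path(directory).iterdir())
        if p.is_file() and fix_file(p)
    ]
```

Calling fix_directory() with no argument would process the current working directory, so it is worth trying it on a copy of the files first.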
Have fun :)