I iterate over the document by paragraphs, then split each paragraph's text into sentences on '. ' (a dot followed by a space).
I split the paragraph text into sentences so that the text search is more effective than searching the whole paragraph text at once.
The code then looks for errors in each word of the sentence, with the errors taken from an error-correction database. Simplified code is shown below:
from docx.enum.text import WD_BREAK

# document (a docx Document) and error_dictionary (error -> correction)
# are defined elsewhere
current_page_number = 1
replace_counter = 0

for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error, correction in error_dictionary.items():
                if error in word:
                    # (A) make simple replacement
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we may fetch the page break attribute and, knowing the
                        # current number, find out on what page the replacement took place
                        if run.page_break == WD_BREAK:  # pseudocode: is there something like this?
                            current_page_number += 1
                    replace_counter += 1
                    # write to a report what paragraph and what page
                    write_report(error, correction, sentence, current_page_number)
                    # for that I need to know where the page breaks are
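To make the search step concrete, here is a minimal sketch of the matching logic on its own, with no python-docx involved. The dictionary contents, the sample text, and the function name `find_replacements` are made up for illustration and are not part of my real code.

```python
# Hypothetical error -> correction mapping, for illustration only
error_dictionary = {"teh": "the", "recieve": "receive"}


def find_replacements(text, errors):
    """Yield (error, correction, sentence) for every word containing a known error."""
    for sentence in text.split('. '):
        for word in sentence.split(' '):
            for error, correction in errors.items():
                if error in word:
                    yield error, correction, sentence


text = "I recieve teh mail. Nothing wrong here"
for hit in find_replacements(text, error_dictionary):
    print(hit)
```

Each hit carries the sentence it was found in, which is what the report needs; the page number is the missing piece.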
The problem is: how do I identify whether a run (or another paragraph element) contains a page break? Does a check like run.page_break == WD_BREAK work?
@scanny has shown how to add a page break, but how do I identify one?
Ideally, I would also like to identify line breaks in a paragraph.
I could do:
for run in paragraph.runs:
    # br_lst holds the <w:br> children of the run
    if run._element.br_lst:
        for br in run._element.br_lst:
            print(br.type)
Yet this code shows only hard breaks, that is, breaks inserted with Ctrl+Enter. Soft page breaks are not detected. (A soft page break is formed when the user keeps typing until the current page runs out and the text flows on to the next page.)
Any hints?
For soft and hard page breaks I now use the following:
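for run in paragraph.runs:
    # soft break: marker Word stores where it last rendered a page boundary
    if 'lastRenderedPageBreak' in run._element.xml:
        print('soft page break found at run:', run.text[:20])
    # hard break: explicit <w:br w:type="page"/> inserted by the user
    if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
        print('hard page break found at run:', run.text[:20])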
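The substring checks above could in principle be fooled by run text that happens to contain those strings, and they do not tell line breaks apart from page breaks. A possible alternative is to parse the run's XML and look at the elements themselves. The following is only a sketch using the standard library; the function name `classify_breaks` and the sample XML are my own illustration, and it assumes the string passed in is what `run._element.xml` returns for a single run.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, as used in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def classify_breaks(run_xml):
    """Return the kinds of break found in one <w:r> element, in document order."""
    root = ET.fromstring(run_xml)
    kinds = []
    for child in root.iter():
        if child.tag == W + "lastRenderedPageBreak":
            kinds.append("soft_page")
        elif child.tag == W + "br":
            # w:type is optional; when absent the break is a plain line break
            br_type = child.get(W + "type", "textWrapping")
            if br_type == "page":
                kinds.append("hard_page")
            elif br_type == "column":
                kinds.append("column")
            else:
                kinds.append("line")
    return kinds


# Hand-written sample run, for illustration only
sample = (
    '<w:r xmlns:w="%s">'
    '<w:lastRenderedPageBreak/><w:t>first</w:t>'
    '<w:br/><w:t>second</w:t>'
    '<w:br w:type="page"/>'
    '</w:r>' % W_NS
)
print(classify_breaks(sample))  # ['soft_page', 'line', 'hard_page']
```

Inside the loop over `paragraph.runs` this would be called as `classify_breaks(run._element.xml)`, and the page counter incremented on 'soft_page' or 'hard_page'. Two caveats as far as I understand it: the soft markers are only present if the application that last saved the file wrote them, so they may be missing or stale, and counting both kinds could double-count if a rendered break is recorded right after a hard one, so this is worth checking against a real document.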