[SOLVED] Python-docx: identify a page break in paragraph

I iterate over the document paragraph by paragraph, then split each paragraph's text into sentences on '. ' (a period followed by a space). Splitting into sentences makes the text search more efficient than searching the whole paragraph text.

The code then looks for errors in each word of a sentence, with the errors taken from an error-correction database. Simplified code is shown below:

from docx.enum.text import WD_BREAK

for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            for error in error_dictionary:
                if error in word:
                    # (A) make a simple replacement
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we could read a page-break attribute and, knowing the
                        # current page number, find out on which page the replacement happened
                        if run.page_break == WD_BREAK:
                            current_page_number += 1
                    replace_counter += 1
                    # write to a report which paragraph and which page;
                    # for that I need to detect page breaks
                    write_report(error, correction, sentence, current_page_number)

The problem is: how do I identify whether a run (or another paragraph element) contains a page break? Does run.page_break == WD_BREAK work? @scanny has shown how to add a page break, but how do I detect one?

Ideally, it would also be possible to identify a line break in a paragraph.

I could do this:

for run in paragraph.runs:
    # br_lst holds the <w:br> children of the run's XML element
    for br in run._element.br_lst:
        br_counter += 1
        print(br.type)

However, this code only reports hard breaks, that is, breaks inserted with Ctrl+Enter. Soft page breaks are not detected. (A soft page break occurs when the user keeps typing until the current page runs out and the text flows onto the next page.)

Any hints?

Solution

For soft and hard page breaks I now use the following:

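for run in paragraph.runs:
    # soft break: marker Word stored the last time it laid out the pages
    if 'lastRenderedPageBreak' in run._element.xml:
        print('soft page break found at run:', run.text[:20])
    # hard break: explicit <w:br w:type="page"/> (Ctrl+Enter)
    if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
        print('hard page break found at run:', run.text[:20])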
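
The substring checks above can misfire, for example when 'type="page"' belongs to some other element in the same run. As a rough sketch of a stricter variant, the run XML can be parsed and the elements counted by name. The helper names count_page_breaks and pages_after_runs, and the hand-written sample runs, are made up for this illustration and are not part of python-docx. The page numbers are only approximate: lastRenderedPageBreak is a hint saved by Word when it last laid out the document, so it can be missing or stale.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w:" prefix in .docx files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_breaks(run_xml):
    """Return (soft, hard) page-break counts for one serialized <w:r> element."""
    root = ET.fromstring(run_xml)
    # Soft breaks: markers cached by Word at its last layout pass.
    soft = sum(1 for _ in root.iter("{%s}lastRenderedPageBreak" % W_NS))
    # Hard breaks: <w:br> elements whose w:type attribute is "page".
    hard = sum(
        1
        for br in root.iter("{%s}br" % W_NS)
        if br.get("{%s}type" % W_NS) == "page"
    )
    return soft, hard


def pages_after_runs(run_xmls, start_page=1):
    """Return the approximate page number reached after each run."""
    page = start_page
    pages = []
    for run_xml in run_xmls:
        soft, hard = count_page_breaks(run_xml)
        page += soft + hard
        pages.append(page)
    return pages


# Hand-written sample runs standing in for run._element.xml strings.
RUN_TEMPLATE = '<w:r xmlns:w="%s">%%s</w:r>' % W_NS
sample_runs = [
    RUN_TEMPLATE % '<w:t>plain text</w:t>',
    RUN_TEMPLATE % '<w:br/><w:t>line break only</w:t>',
    RUN_TEMPLATE % '<w:br w:type="page"/><w:t>after hard break</w:t>',
    RUN_TEMPLATE % '<w:lastRenderedPageBreak/><w:t>after soft break</w:t>',
]
print(pages_after_runs(sample_runs))  # [1, 1, 2, 3]
```

With a real document, the same functions could presumably be fed the strings from run._element.xml for each run in paragraph.runs, as in the loop above. Note that _element is a private attribute of python-docx, so this relies on internals that may change.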