
python-docx: Extracting text along with heading and sub-heading numbers


I have a Word document that is structured as follows:

1. Heading
    1.1. Sub-heading
        (a) Sub-sub-heading

When I load the document with python-docx using this code:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
print(getText("a.docx"))

I get the following output.

Heading
Sub-heading
Sub-sub-heading

How can I extract the heading and sub-heading numbers along with the text? I tried simplify_docx, but it only works with the standard MS Word heading styles, not with custom heading styles.


Solution

  • Unfortunately, the numbers are not part of the text: Word generates them itself based on the heading style (Heading 1, Heading 2, and so on), and I don't think python-docx exposes any way to get them.

    However, you can retrieve the style (and hence the level) using para.style and then walk through the document to recompute the numbering. This is cumbersome, though, as it doesn't take into account any custom styles you might be using. There might be a way to read the numbering scheme from the styles.xml part of the document, but I don't know how.

    import docx

    # Map the built-in style names 'Heading 1' .. 'Heading 9' to their level.
    level_from_style_name = {f'Heading {i}': i for i in range(1, 10)}

    def format_levels(cur_lev):
        # Skip unused (zero) counters and join the rest with dots.
        levs = [str(lev) for lev in cur_lev if lev != 0]
        return '.'.join(levs)  # Customize your format here

    d = docx.Document('my_doc.docx')

    current_levels = [0] * 10  # index 0 is unused; indices 1..9 hold one counter per level
    full_text = []

    for p in d.paragraphs:
        if p.style.name not in level_from_style_name:
            full_text.append(p.text)
        else:
            level = level_from_style_name[p.style.name]
            current_levels[level] += 1
            # A new heading resets the counters of all deeper levels.
            for deeper in range(level + 1, 10):
                current_levels[deeper] = 0
            full_text.append(format_levels(current_levels) + ' ' + p.text)

    for line in full_text:
        print(line)
    

    which, for a test document mixing headings and plain paragraphs, gives me

    Hello world
    1 H1 foo
    1.1 H2 bar
    1.1.1 H3 baz
    Paragraph are really nice !
    1.1.2 H3 bibou
    Something else
    2 H1 foofoo
    You got the drill…
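
Since the question is about custom heading styles and a "1. / 1.1. / (a)" layout, the same counting idea can be adapted by supplying your own style-to-level mapping and a per-level formatter. The following is only a minimal sketch under assumptions: the style names MyHeading, MySubHeading and MySubSubHeading are hypothetical placeholders, not names from the asker's document, and the helpers format_number and number_paragraphs are made up for illustration. The numbers are still recomputed by counting, not read from Word, so they will differ from what Word displays if the document restarts or skips numbering.

```python
import string

# Hypothetical mapping from custom style names to 1-based heading levels.
# Replace these placeholder names with the style names used in your document.
LEVEL_FROM_STYLE_NAME = {
    "MyHeading": 1,
    "MySubHeading": 2,
    "MySubSubHeading": 3,
}


def format_number(counters, level):
    """Format the number for a heading at `level` as '1.', '1.1.' or '(a)'."""
    if level <= 2:
        return ".".join(str(n) for n in counters[:level]) + "."
    # Level 3 and deeper: a lowercase letter. This only covers 26 items per
    # parent heading and raises IndexError beyond that.
    return "(" + string.ascii_lowercase[counters[level - 1] - 1] + ")"


def number_paragraphs(paragraphs, level_from_style_name=LEVEL_FROM_STYLE_NAME):
    """Take (style_name, text) pairs and return the lines with recomputed numbers."""
    max_level = max(level_from_style_name.values())
    counters = [0] * max_level  # counters[i] belongs to level i + 1
    lines = []
    for style_name, text in paragraphs:
        level = level_from_style_name.get(style_name)
        if level is None:
            # Not a heading style we know about: keep the text unchanged.
            lines.append(text)
            continue
        counters[level - 1] += 1
        # A new heading resets the counters of all deeper levels.
        for deeper in range(level, max_level):
            counters[deeper] = 0
        lines.append(format_number(counters, level) + " " + text)
    return lines


# Usage with python-docx (requires the package and a real file):
#   import docx
#   doc = docx.Document("a.docx")
#   pairs = [(p.style.name, p.text) for p in doc.paragraphs]
#   print("\n".join(number_paragraphs(pairs)))
```

Working on plain (style_name, text) pairs keeps the numbering logic independent of python-docx, so it can be checked without a .docx file at hand.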