Iterate through Table of Contents in docx using python-docx


I have a document with an auto-generated table of contents at the beginning, and I would like to parse that table of contents. Is this possible using python-docx? If I iterate through doc.paragraphs and read each paragraph's text, the text that is in the table of contents does not show up.

I tried the following: iterating through the paragraphs and checking whether paragraph.style.name is "toc 1". That tells me I am in the ToC, but I am unable to get the actual text. I tried this:

if para.style.name == "toc 1":
    print(para.text)

But para.text is giving me a blank string. Why would this be the case?

Thanks


Solution

  • I believe you'll find that the actual generated contents of the TOC are "wrapped" in a non-paragraph element. python-docx won't get you there directly, as document.paragraphs only finds paragraphs that are direct children of the w:document/w:body element.

    To get at these you'll need to go down to the lxml level, using python-docx to get you as close as possible. You can get to (and print) the body element with this:

    from docx import Document

    document = Document('my-doc.docx')
    body_element = document._body._body  # the underlying w:body lxml element
    print(body_element.xml)  # this will be big if your document is
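
    If the full XML dump is too large to read, a summary of the body's direct children can narrow the search. The following is only a sketch: child_tag_counts is a hypothetical helper name, not part of python-docx, and it relies on nothing more than iterating an element's children.

```python
from collections import Counter


def child_tag_counts(element):
    """Count an element's direct children by tag name, with the namespace stripped."""
    return Counter(
        child.tag.rsplit("}", 1)[-1]
        for child in element
        if isinstance(child.tag, str)  # skip comments and processing instructions
    )
```

    Calling print(child_tag_counts(body_element)) shows which element types sit directly under w:body. Anything other than p, tbl and sectPr is a candidate wrapper; a TOC inserted by Word often shows up as an sdt (content control) child, but that is an assumption to verify against your own document.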
    

    From there you can identify the specific XML location of the parts you want and use lxml/XPath to access them. Then you can wrap them in python-docx Paragraph objects for ready access:

    from docx.text.paragraph import Paragraph
    
    ps = body_element.xpath('./w:something/w:something_child/w:p')
    paragraphs = [Paragraph(p, None) for p in ps]
    

    This is not an exact recipe and will require some research on your part to work out what w:something etc. are, but if you want it badly enough to surmount those hurdles, this approach will work.

    Once you get it working, posting your exact solution may be of help to others on search.
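
    As one possible way to fill in the w:something placeholders, here is a minimal sketch. It assumes the TOC was inserted by Word as a content control, i.e. a w:sdt element whose w:docPartGallery value is "Table of Contents"; check your own document's XML before relying on that. All helper names below are hypothetical. The text is read straight from the w:t elements because TOC entries are usually wrapped in w:hyperlink, and in some python-docx versions Paragraph.text skips hyperlink runs, which may be why para.text came back empty.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(tag):
    """Return a WordprocessingML tag or attribute name in Clark notation."""
    return f"{{{W_NS}}}{tag}"


def is_toc_sdt(sdt):
    """True if a w:sdt content control is marked as a 'Table of Contents' part."""
    gallery = sdt.find(f"{w('sdtPr')}/{w('docPartObj')}/{w('docPartGallery')}")
    return gallery is not None and gallery.get(w("val")) == "Table of Contents"


def toc_paragraph_elements(body):
    """Return the w:p elements nested inside every TOC content control."""
    return [
        p
        for sdt in body.iter(w("sdt"))
        if is_toc_sdt(sdt)
        for p in sdt.iter(w("p"))
    ]


def paragraph_text(p):
    """Join all w:t text in a paragraph, including runs inside w:hyperlink.

    Tabs (w:tab) are not w:t elements, so they are not part of the result.
    """
    return "".join(t.text or "" for t in p.iter(w("t")))


def toc_texts_from_docx(path):
    """Read the TOC entry texts from a .docx file via python-docx."""
    # Imported here so the helpers above work on any element tree.
    from docx import Document

    body_element = Document(path)._body._body
    return [paragraph_text(p) for p in toc_paragraph_elements(body_element)]
```

    With python-docx installed, print(toc_texts_from_docx('my-doc.docx')) lists one string per TOC line, with the heading text and page number run together because the tab between them is skipped. The elements returned by toc_paragraph_elements(body_element) can also be wrapped as Paragraph(p, None), as shown above, if you need the style name or other paragraph properties.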