I have a document with an auto-generated table of contents at the beginning, and I would like to parse that table of contents. Is this possible using python-docx? If I iterate through doc.paragraphs and read each paragraph's text, the text that is in the table of contents does not show up.
I tried the following: iterating through the paragraphs and checking for paragraph.style.name being "toc 1".
Then I know that I am in a ToC. But I am unable to get the actual text. I tried this:
if para.style.name == "toc 1":
    print(para.text)
But para.text is giving me a blank string. Why would this be the case?
Thanks
I believe you'll find that the actual generated contents of the TOC are "wrapped" in a non-paragraph element. python-docx won't get you there directly, as it only finds paragraphs that are direct children of the w:document/w:body element.
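To see which wrapper you are dealing with, it can help to list the direct children of w:body first. The following is a minimal sketch using only the standard library rather than python-docx; the function name body_child_tags and the trimmed SAMPLE_XML are hypothetical stand-ins, and the w:sdt wrapper shown in the sample is an assumption about what Word typically generates, not something guaranteed for your file.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the WordprocessingML "w:" prefix
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def body_child_tags(document_xml):
    """Return the local tag names of the direct children of w:body."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    # Each tag looks like "{namespace}name"; keep only "name"
    return [child.tag.split("}", 1)[1] for child in body]


# Hypothetical, heavily trimmed stand-in for word/document.xml
SAMPLE_XML = f"""<w:document xmlns:w="{W_NS}"><w:body>
  <w:sdt><w:sdtContent>
    <w:p><w:r><w:t>Introduction</w:t></w:r></w:p>
  </w:sdtContent></w:sdt>
  <w:p><w:r><w:t>Body text</w:t></w:r></w:p>
  <w:sectPr/>
</w:body></w:document>"""

print(body_child_tags(SAMPLE_XML))
```

For a real file, the same function should accept the bytes returned by zipfile.ZipFile('my-doc.docx').read('word/document.xml'). Any child that is not a p or tbl is a candidate for the wrapper holding the TOC paragraphs.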
To get at these you'll need to go down to the lxml level, using python-docx to get you as close as possible. You can get to (and print) the body element with this:
from docx import Document

document = Document('my-doc.docx')
body_element = document._body._body  # the underlying lxml w:body element
print(body_element.xml)  # this will be big if your document is
From there you can identify the specific XML location of the parts you want and use lxml/XPath to access them. Then you can wrap them in python-docx Paragraph objects for ready access:
from docx.text.paragraph import Paragraph

# w:something and w:something_child are placeholders for the real element names
ps = body_element.xpath('./w:something/w:something_child/w:p')
paragraphs = [Paragraph(p, None) for p in ps]
This is not an exact recipe and will require some research on your part to work out what w:something etc. are, but if you want it bad enough to surmount those hurdles, this approach will work.
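As one possible starting point for that research: an auto-generated TOC is often stored inside a structured document tag, so the placeholders may turn out to be w:sdt and w:sdtContent. Treat that as an assumption to check against your own XML. Under it, the XPath above would read './w:sdt/w:sdtContent/w:p'. The sketch below illustrates the same lookup with the standard library only, on a hypothetical trimmed sample; toc_entry_texts and paragraph_text are illustrative names, not python-docx functions.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def paragraph_text(p):
    """Join the text of all runs in a w:p, rendering w:tab as a tab."""
    parts = []
    # iter() also reaches runs nested in wrappers such as w:hyperlink
    for run in p.iter(f"{{{W_NS}}}r"):
        for child in run:
            if child.tag == f"{{{W_NS}}}t":
                parts.append(child.text or "")
            elif child.tag == f"{{{W_NS}}}tab":
                parts.append("\t")
    return "".join(parts)


def toc_entry_texts(document_xml):
    """Return the text of each paragraph found under w:body/w:sdt/w:sdtContent."""
    root = ET.fromstring(document_xml)
    # Assumption: the TOC paragraphs live at this path in your document
    ps = root.findall("./w:body/w:sdt/w:sdtContent/w:p", NS)
    return [paragraph_text(p) for p in ps]


# Hypothetical, heavily trimmed stand-in for word/document.xml
SAMPLE_TOC_XML = f"""<w:document xmlns:w="{W_NS}"><w:body>
  <w:sdt><w:sdtContent>
    <w:p><w:hyperlink><w:r><w:t>Introduction</w:t></w:r><w:r><w:tab/></w:r><w:r><w:t>1</w:t></w:r></w:hyperlink></w:p>
    <w:p><w:r><w:t>Methods</w:t></w:r><w:r><w:tab/></w:r><w:r><w:t>3</w:t></w:r></w:p>
  </w:sdtContent></w:sdt>
  <w:p><w:r><w:t>Body text</w:t></w:r></w:p>
</w:body></w:document>"""

for entry in toc_entry_texts(SAMPLE_TOC_XML):
    print(repr(entry))
```

If your document nests the TOC differently, only the path passed to findall (or the XPath given to body_element.xpath) should need to change.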
Once you get it working, posting your exact solution may be of help to others on search.