java groovy xml-parsing libreoffice opendocument

OpenDocument format: parse & split text by lines

I'm parsing (using Groovy) the content.xml obtained from a LibreOffice .odt (Writer) file.

I want to make sure I hoover up all the text in the file, splitting by line breaks.

In Java's org.w3c.dom.Node (or Groovy's groovy.util.Node) there is a method to pick up all the text under any node (dom.Node.getTextContent/util.Node.text). For the highest node this will print all the text in the file, but it ignores line breaks.
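To illustrate the problem, here is a minimal sketch in plain Java using only the JDK's DOM parser. The class name TextContentDemo and the tiny SAMPLE fragment are made up for illustration (a real content.xml has far more wrapping elements); only the two namespace URIs are taken from the OpenDocument format.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class TextContentDemo {
    // Hypothetical stand-in for content.xml: one heading, then one paragraph
    // containing a soft line break (text:line-break).
    static final String SAMPLE =
          "<office:text xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
        + " xmlns:text='urn:oasis:names:tc:opendocument:xmlns:text:1.0'>"
        + "<text:h>Title</text:h>"
        + "<text:p>first<text:line-break/>second</text:p>"
        + "</office:text>";

    /** Returns getTextContent() of the root element of the given XML string. */
    static String allText(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sample XML", e);
        }
    }

    public static void main(String[] args) {
        // Heading, paragraph and line break all run together.
        System.out.println(allText(SAMPLE)); // prints "Titlefirstsecond"
    }
}
```

Both the paragraph boundary and the line break are lost, because getTextContent() only concatenates text nodes and neither boundary is represented by one.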

This led me to suppose I would instead have to walk (depth-first) through the structure, identifying individual lines.

Parsing through such a structure, I find that the "local parts" of the names of the nodes which tend to have text are "p" (paragraph) and "h" (heading).

I'm also assuming that a "p" or "h" can't nest another "p" or "h" (although with some complicated embedded structure I'm sure they can...). But clearly examining any spans under a given "p" will generate text which you've already obtained from its ancestor "p" node.
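The walk described above might be sketched as follows, again in plain Java with the JDK's DOM parser. This is only a sketch under assumptions, not a complete ODF text extractor: the class and method names (OdtLineWalker, walk, emitParagraph, collect) are made up. It treats text:line-break as a newline, text:tab as a tab and text:s as one or more spaces, and it emits a "p" or "h" nested inside another paragraph (for example inside a draw:text-box) as a separate entry, so that its text is not also counted in the enclosing paragraph.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class OdtLineWalker {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    static boolean isTextElement(Node n, String localName) {
        return n.getNodeType() == Node.ELEMENT_NODE
            && TEXT_NS.equals(n.getNamespaceURI())
            && localName.equals(n.getLocalName());
    }

    static boolean isParagraphOrHeading(Node n) {
        return isTextElement(n, "p") || isTextElement(n, "h");
    }

    /** Depth-first search for text:p / text:h; text outside them is ignored. */
    static void walk(Node node, List<String> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (isParagraphOrHeading(c)) {
                emitParagraph(c, out);
            } else {
                walk(c, out);
            }
        }
    }

    /** Adds one entry for this paragraph, placed before any paragraphs nested in it. */
    static void emitParagraph(Node para, List<String> out) {
        int slot = out.size();
        out.add("");                       // reserve the position first
        StringBuilder sb = new StringBuilder();
        collect(para, sb, out);
        out.set(slot, sb.toString());
    }

    /** Gathers the text belonging to the current paragraph only. */
    static void collect(Node node, StringBuilder sb, List<String> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            short type = c.getNodeType();
            if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
                sb.append(c.getNodeValue());
            } else if (isParagraphOrHeading(c)) {
                emitParagraph(c, out);     // nested paragraph: separate entry, no duplication
            } else if (isTextElement(c, "line-break")) {
                sb.append('\n');
            } else if (isTextElement(c, "tab")) {
                sb.append('\t');
            } else if (isTextElement(c, "s")) {
                // text:s stands for text:c spaces (one if the attribute is absent)
                String count = ((Element) c).getAttributeNS(TEXT_NS, "c");
                int n = count.isEmpty() ? 1 : Integer.parseInt(count);
                for (int i = 0; i < n; i++) sb.append(' ');
            } else {
                collect(c, sb, out);       // spans, frames, links, ...
            }
        }
    }

    /** Parses an XML string and returns one entry per paragraph or heading. */
    static List<String> paragraphs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);   // needed for getNamespaceURI/getLocalName
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            List<String> out = new ArrayList<>();
            walk(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String sample =
              "<office:text xmlns:office='urn:oasis:names:tc:opendocument:xmlns:office:1.0'"
            + " xmlns:text='urn:oasis:names:tc:opendocument:xmlns:text:1.0'"
            + " xmlns:draw='urn:oasis:names:tc:opendocument:xmlns:drawing:1.0'>"
            + "<text:p>outer<draw:frame><draw:text-box><text:p>boxed</text:p>"
            + "</draw:text-box></draw:frame> tail</text:p>"
            + "</office:text>";
        for (String p : paragraphs(sample)) {
            System.out.println("[" + p + "]");
        }
    }
}
```

Because every text node is appended to exactly one paragraph's buffer, nothing inside a "p" or "h" is reported twice. Whether text can also occur outside any "p" or "h" is something this sketch does not settle; it simply skips such text.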

But are "p" and "h" the only QNames that I need to look at? And how should I deal with the possibility of embedded structures (e.g. a graphic containing some text)?

Is there some technique whereby I can get a comprehensive listing of all text, node by node, ensuring that no text is missed out and none duplicated?

Failing this, is there some aspect of the OpenDocument format which might let me work this out? Interestingly, the example in the brief overview on Wikipedia, under "content.xml", uses just these two QNames, "p" and "h".

Solution

Tim Yates' comment seems the best way to go.

Unless anyone objects, I shall not delete this question, though, because there doesn't seem to be another one like it.

From first experiments it appears that org.odftoolkit.simple.TextDocument.getParagraphIterator() will iterate through all paras, including "h" QNames (= headings), and also including empty paragraphs. A good sign.

NB: bear in mind that these "paragraphs" may in fact be multi-line paragraphs: in a Writer file there is a difference between a "paragraph mark" and a "newline". The solution to this is very simple, however: just split the String returned by Paragraph.getTextContent() (the textContent property, for Groovy people) on the newline character...
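A minimal sketch of that last step, in plain Java so that it runs without the toolkit on the classpath: the strings passed in stand in for the values returned by Paragraph.getTextContent(), and the class and method names are made up. The regex \R matches any line-break sequence, which is an assumption made to be safe if the separator turns out to be something other than a bare "\n".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphLineSplitter {
    /** Splits each paragraph's text into lines; an empty paragraph yields one empty line. */
    static List<String> toLines(List<String> paragraphTexts) {
        List<String> lines = new ArrayList<>();
        for (String text : paragraphTexts) {
            // limit -1 keeps trailing empty lines instead of silently dropping them
            lines.addAll(Arrays.asList(text.split("\\R", -1)));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-ins for three paragraphs: a heading, a two-line paragraph, an empty one.
        List<String> paragraphs = Arrays.asList("Title", "one\ntwo", "");
        System.out.println(toLines(paragraphs)); // prints [Title, one, two, ]
    }
}
```

With the toolkit, the input list would presumably be filled by looping over getParagraphIterator() and collecting each paragraph's getTextContent().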