
python-docx get words position and attributes


I'm looking for a way to extract the position (x, y) and attributes (font, size) of every word in a document.

From the python-docx docs, I know that:

Conceptually, Word documents have two layers, a text layer and a drawing layer. In the text layer, text objects are flowed from left to right and from top to bottom, starting a new page when the prior one is filled. In the drawing layer, drawing objects, called shapes, are placed at arbitrary positions. These are sometimes referred to as floating shapes.

A picture is a shape that can appear in either the text or drawing layer. When it appears in the text layer it is called an inline shape, or more specifically, an inline picture.

[...] At the time of writing, python-docx only supports inline pictures.

Even though that is not quite the same thing, I'm wondering whether something like the following exists:

from docx import Document
main_file = Document("/tmp/file.docx")
for paragraph in main_file.paragraphs:
    for word in paragraph.text:  # <= Non-existing (yet wished) functionality, IMHO
        print(word.x, word.y)  # <= Non-existing (yet wished) functionality, IMHO

Does anybody have an idea? Best, Arthur


Solution

  • for word in paragraph.text:  # <= Non-existing (yet wished) functionality, IMHO
    

    This functionality is already provided by Python's built-in str.split(). The two compose easily:

    for word in paragraph.text.split():
        ...
    

    Regarding

    print(word.x, word.y)  # <= Non-existing (yet wished) functionality, IMHO
    

    I think it's safe to say this functionality will never appear in python-docx, and if it did, it could not look like this.

    What such a feature would be doing is asking the page renderer for the location at which the renderer was going to place those characters. python-docx has no rendering engine (because it does not render documents); it is simply a fancy XML editor that selectively modifies XML files in the WordprocessingML vocabulary.

    It may be possible to get these values from Word itself, because Word does have a rendering engine (which it uses for screen display and printing).

    If there were such a function, I expect it would take a paragraph and a character offset within that paragraph, or something along those lines, such as document.position(paragraph, offset=42) or perhaps paragraph.position(offset=42).
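
Unlike position, the font and size asked about in the question are stored in the XML, so python-docx can report them per run. Below is a minimal sketch of that, not a definitive implementation. The names WordInfo, iter_words and words_in_file are hypothetical helpers made up for illustration; the only python-docx API it relies on is paragraph.runs, run.text, run.font.name and run.font.size.

```python
from typing import Iterator, NamedTuple, Optional


class WordInfo(NamedTuple):
    """Hypothetical record: one whitespace-separated token and its run's font."""
    text: str
    font_name: Optional[str]
    font_size_pt: Optional[float]


def iter_words(paragraph) -> Iterator[WordInfo]:
    """Yield the tokens of each run together with that run's font name and size."""
    for run in paragraph.runs:
        size = run.font.size  # a Length, or None when inherited from the style
        size_pt = size.pt if size is not None else None
        for token in run.text.split():
            yield WordInfo(token, run.font.name, size_pt)


def words_in_file(path: str) -> Iterator[WordInfo]:
    """Walk every body paragraph of the .docx file at `path`."""
    from docx import Document  # needs the python-docx package installed

    for paragraph in Document(path).paragraphs:
        yield from iter_words(paragraph)
```

Two caveats under these assumptions: a font name or size of None means the value is not set on the run itself and is inherited from the style hierarchy, and a word that Word has split across several runs comes back as several fragments rather than one word.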