javascript python ms-word python-docx wordprocessingml

Reading a .docx file to extract the text along with font and other formatting information of the text


As the title states, my goal is to find a Python library that extracts both the text and its font information from a .docx file. For example, for the text "hello world", I need to be able to read that the string "hello" is bold and not italic, and that the string "world" is italic and not bold. Beyond bold and italic, I also need other information such as size, color, and typeface (Arial, Times New Roman, etc.). I need to be able to read the entire .docx file and extract this information.

I have tried the python-docx library and was able to extract the text, but not the relevant font information. For example, with the following code:

import docx
doc = docx.Document('cg0002.docx')
for para in doc.paragraphs:
    for run in para.runs:
        font = run.font
        is_bold = font.bold

I would get font and is_bold as None. Upon further research I learned that you cannot use the library to read the .docx font information; you have to assign it yourself. Is there any other library that I can use to achieve my goal?

Compromises I am willing to make: I am not particularly adamant about using Python. I can use any other language, such as Java, JavaScript, C/C++, or PowerShell. I can also convert the documents into another format such as PDF if that makes it easier to extract the information, provided the document stays intact. (For example, I could upload it to Google Docs and use Apps Script to extract the text, but some fonts are not retained once the document is viewed in Google Docs, so I don't want to do that.)


Solution

  • For DOCX, the most reliable route is probably to use VBA to gather the details.

    However, a potential alternative route might be to simply drop any style overrides by exporting from WordPad to basic RTF, then look at the redefined characteristics of the target blocks.

    NOTE: depending on the conversion, this may not be 100% reliable for achieving your goal.

    Whilst we can convert DOCX to PDF using WordPad from the command line, we cannot convert DOCX to RTF without using a VBS macro, but that is a different question.


    From the header we can see that the code page is 1252 and that the language ID is 2057 = English (United Kingdom) :-)

    Breakdown by eye of \b\f0\fs24\lang9 Hello \b0\i World\ul\i0 !\ulnone\fs22\par

    \b - Start bold
    \f0 - Select font 0 from the header's font table, here Calibri (BEWARE: here 0 is an index, NOT a stop)
    \fs24 - Font size in half-points, so the text here is 12 point
    \lang9 - Language of the text that follows; 9 is the generic English language ID
     Hello - Has both a leading and a trailing space (the leading one only ends the control word before it and is to be ignored)
    \b0 - My BAD: bold STOPS here, AFTER the space between the words
    \i - Start italics (ignore the space before World)
    \ul - Start underlining
    \i0 - Stop italics (ignore the space before !)
    \ulnone - Stop underline (don't ask me why not \ul0)
    \fs22 - I will let you guess the default page font height but by now you know it is not 22
    
    \par - THE END, "That's all Folks!" ™
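
The breakdown above can be sketched in Python as a tiny state tracker. This is only a rough illustration, not a real RTF parser: the names `TOKEN` and `rtf_runs` are made up for this example, it handles just the control words listed above, and it assumes a flat fragment with no groups, font table or escapes.

```python
import re

# One token is either a control word (backslash, lowercase letters, optional
# signed number, then ONE optional delimiter space) or a stretch of plain text.
TOKEN = re.compile(r"\\([a-z]+)(-?\d+)? ?|([^\\{}]+)")


def rtf_runs(fragment):
    """Return a list of (text, formatting) pairs for a flat RTF fragment.

    Illustrative only: groups, font tables, escapes and every control
    word not listed in the breakdown are ignored.
    """
    state = {"bold": False, "italic": False, "underline": False,
             "font_index": None, "size_pt": None}
    runs = []
    for word, number, text in TOKEN.findall(fragment):
        if text:
            runs.append((text, dict(state)))    # snapshot the current state
        elif word in ("b", "i", "ul"):
            key = {"b": "bold", "i": "italic", "ul": "underline"}[word]
            state[key] = number != "0"          # \b turns on, \b0 turns off
        elif word == "ulnone":
            state["underline"] = False
        elif word == "f" and number:
            state["font_index"] = int(number)   # index into the font table
        elif word == "fs" and number:
            state["size_pt"] = int(number) / 2  # \fs is given in half-points
    return runs


line = r"\b\f0\fs24\lang9 Hello \b0\i World\ul\i0 !\ulnone\fs22\par"
for text, fmt in rtf_runs(line):
    print(repr(text), fmt)
```

For the sample line this reports three runs: "Hello " (with its trailing space) as bold 12 point, "World" as italic, and "!" as underlined. The font index would still have to be looked up in the header's font table to get the typeface name.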
    

    P.S.

    I revisited the source to make 2 corrections; see if you can work out both changes. "My" clue to the second is above, and it can easily trip you up when using regex.

    \b\f0\fs22\lang9 Hello,\i \b0 World\ul\i0 !\ulnone\par

    whilst it should have finally been

    \b\f0\fs22\lang9 Hello,\b0 \i World\ul\i0 !\ulnone\par
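
One way a regex can trip over those spaces, as a hedged sketch: in RTF a single space directly after a control word is a delimiter, not text. The two patterns below are my own illustration (not taken from any library) and only cope with simple control words like the ones in the line above.

```python
import re

final = r"\b\f0\fs22\lang9 Hello,\b0 \i World\ul\i0 !\ulnone\par"

# Naive: delete the control words only; their delimiter spaces stay behind.
naive = re.sub(r"\\[a-z]+-?\d*", "", final)

# Delimiter-aware: also consume ONE space directly after each control word.
aware = re.sub(r"\\[a-z]+-?\d* ?", "", final)

print(repr(naive))  # ' Hello,  World !'
print(repr(aware))  # 'Hello,World!'
```

Under that delimiter rule, a space that should be visible between the two words has to be present in the file as a character of its own, in addition to the space that ends the control word.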