I am new to Python and to coding. Now I have a problem and need your help. I tried to read a docx document using python-docx, but all of the text I wanted is inside ContentControls. When I try to print the text of a paragraph that contains a ContentControl, an error occurs.
For example, I try to print the first paragraph using
import docx
doc = docx.Document(r"C:\ContentControl.docx")
p = doc.paragraphs
print(p[0].text)
then I get an error like:
UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position 8: illegal multibyte sequence
So what should I do to get the text in ContentControl? Thanks a lot for your help!
You cannot, with Python-docx.
If you check https://github.com/python-openxml/python-docx/blob/master/docx/oxml/text/paragraph.py (the code that reads paragraphs and their contents), you can see that it only parses two sub-elements of <w:p>: its formatting from <w:pPr>, and its text runs from <w:r>. The contents of a text run are parsed with text/run.py, which iterates over its elements and stores data for rPr (local text run formatting), t (the plain text itself), tab (a literal Tab), and a handful more.
But Word's "contentControl" is stored in another tag, which is not parsed!
<w:p> <!-- paragraph -->
<w:r> <!-- text runs -->
<w:t>Editions :</w:t> <!-- plain text -->
</w:r> <!-- end text run -->
<w:sdt>
<w:sdtPr/> <!-- properties elided -->
<w:sdtContent> <!-- something else! -->
<w:r>
<w:t>Henry</w:t>
</w:r>
</w:sdtContent>
</w:sdt>
<w:r> <!-- next text run; just a tab -->
<w:tab/>
<w:t xml:space="preserve"> </w:t>
</w:r> <!-- end of that text run -->
</w:p>
(from your sample document; some markup is elided for brevity)
As you can see, the ContentControl data is inside a <w:sdt> tag, which in turn is a direct child of <w:p>. So the code to read its data should be in paragraph.py, but it is not.
You can clone python-docx and add proper handling of <w:sdt> yourself (and here is all the information you need for that), but it may just be easier to use Word itself and run a VBA macro to convert these controls to plain text.
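If you would rather stay in Python, a rough workaround is to walk the paragraph's XML yourself and pick up every text run, including the ones nested inside <w:sdt>. The sketch below is only an illustration, not something python-docx provides: the helper name paragraph_text is made up, it uses only the standard library, and it handles just <w:t> and <w:tab> (no line breaks, deleted text, and so on). The sample XML is the fragment shown above, with the elided properties written as an empty element.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, written in the {namespace}tag form ElementTree uses
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_R = f"{{{W_NS}}}r"
W_T = f"{{{W_NS}}}t"
W_TAB = f"{{{W_NS}}}tab"


def paragraph_text(p_element):
    """Hypothetical helper: join the text of every run below a <w:p>,
    including runs nested inside <w:sdt>/<w:sdtContent>."""
    parts = []
    # iter() walks all descendants, so runs inside a content control are found too
    for run in p_element.iter(W_R):
        # only look at direct children of the run, in document order
        for child in run:
            if child.tag == W_T:
                parts.append(child.text or "")
            elif child.tag == W_TAB:
                parts.append("\t")
    return "".join(parts)


SAMPLE = f"""
<w:p xmlns:w="{W_NS}">
  <w:r><w:t>Editions :</w:t></w:r>
  <w:sdt>
    <w:sdtPr/>
    <w:sdtContent>
      <w:r><w:t>Henry</w:t></w:r>
    </w:sdtContent>
  </w:sdt>
  <w:r><w:tab/><w:t xml:space="preserve"> </w:t></w:r>
</w:p>
"""

text = paragraph_text(ET.fromstring(SAMPLE.strip()))
print(repr(text))  # 'Editions :Henry\t '
```

With python-docx itself, the same function should work if you pass it the paragraph's underlying XML element, for example paragraph_text(doc.paragraphs[0]._element). Note that _element is a private attribute, so treat that as an assumption that may break in other versions.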
By the way, your error has nothing to do with this. The "offending" character is the non-breaking space in the "Editions" line, stored as &#160;. Your text codec should really not have had any problem with it. The problem is likely caused by you using the gbk codec instead of UTF-8. There are some Chinese characters in the document, but they are also written as decimal escaped Unicode characters; the file itself contains no non-ASCII characters.
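To see the encoding problem in isolation, here is a minimal sketch. The string is an assumption modelled on the "Editions" line (a non-breaking space at index 8, matching the position in your error message); it is not read from your document.

```python
# Assumed sample text: "Editions", a non-breaking space (U+00A0), then a colon
text = "Editions\xa0:"

try:
    text.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError as exc:
    gbk_ok = False
    print(f"gbk cannot encode {text[exc.start]!r} at position {exc.start}")

# UTF-8 can represent every Unicode character, so this always succeeds
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'Editions\xc2\xa0:'

# Lossy fallback: unencodable characters are replaced with '?'
fallback = text.encode("gbk", errors="replace")
print(fallback)  # b'Editions?:'
```

If the error comes from print() writing to a gbk console, one option on Python 3.7+ is to call sys.stdout.reconfigure(encoding="utf-8") before printing, or to set the PYTHONIOENCODING environment variable to utf-8. Whether the console then displays the characters correctly depends on your terminal.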