I'm trying to detect text that has a coloured background in an MS Word .docx file, to separate it from the "normal" text.
from docx import Document
...
# Load the document
doc = Document(docx_path)
highlighted_text = []
normal_text = []
# Iterate through all paragraphs
for para in doc.paragraphs:
    # Iterate through all runs in the paragraph
    for run in para.runs:
        print(run.text + " - " + str(run.font.highlight_color))
        # Check if the run has a highlight color set
        if run.font.highlight_color is not None:
            highlighted_text.append(run.text)
            print(f"Found highlighted text: '{run.text}' with highlight color: {run.font.highlight_color}")
return highlighted_text
However, in my test document it only finds the grey highlights.
These are the results from the print statements:
Text (normal) - None
Text in grey - GRAY_25 (16)
Found highlighted text: 'Text in grey ' with highlight color: GRAY_25 (16)
Text in yellow - None
Text in green - None
So I'm not sure where I'm going wrong. I don't think the text has been shaded, as shading runs across a whole line.
Addendum: It only works for grey, which I highlighted myself in MS Office. The other highlights, which are getting missed, were done by someone else. This might have been done with an old copy of Office, docx-compatible software, or some other method of highlighting the text that isn't "highlighting".
Any ideas?
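One thing worth checking, given the addendum: python-docx's run.font.highlight_color only reflects the w:highlight element in a run's properties. Word can also colour the background of individual characters with run-level shading (w:shd with a w:fill attribute), and that would show up as None here. Whether that is what happened in this document is an assumption, not something the output above proves. The sketch below does not use python-docx; it reads word/document.xml directly with the standard library and reports both values for every run. The function names run_backgrounds and has_background are made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def run_backgrounds(docx_path):
    """Return (text, highlight, shading_fill) for every run in the main document part.

    Only direct run formatting is inspected; colours inherited from styles
    or from paragraph-level shading are not resolved.
    """
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    results = []
    # Unlike doc.paragraphs, this also visits runs nested in tables.
    for run in root.iter(W + "r"):
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        highlight = fill = None
        rpr = run.find(W + "rPr")
        if rpr is not None:
            hl = rpr.find(W + "highlight")
            if hl is not None:
                highlight = hl.get(W + "val")
            shd = rpr.find(W + "shd")
            if shd is not None:
                fill = shd.get(W + "fill")
        results.append((text, highlight, fill))
    return results


def has_background(highlight, fill):
    """True if either a real highlight or a real shading fill is set."""
    return highlight not in (None, "none") or fill not in (None, "auto")
```

If the yellow and green runs come back with a highlight of None but a fill such as a hex colour, they were shaded rather than highlighted, and a check like has_background(highlight, fill) would be needed to catch both cases.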
This script performs well for me:
from docx import Document

def extract_highlighted_text(docx_path):
    doc = Document(docx_path)
    highlighted_texts = []
    for para in doc.paragraphs:
        for run in para.runs:
            if run.font.highlight_color is not None:
                highlighted_texts.append(run.text)
    return highlighted_texts

docx_file = "text.docx"
highlighted_texts = extract_highlighted_text(docx_file)
print("Highlighted Texts:")
for text in highlighted_texts:
    print(text)
Result: