How can I extract font color from text within a PDF?
I already tried exploring the LTText or LTChar objects using PDFMiner, but it seems that this module only allows extracting font size and style, not color.
PDFMiner's LTChar object has a 'graphicstate' attribute, which in turn has 'scolor' (stroking color) and 'ncolor' (non-stroking color) attributes that can be used to obtain text color information. Here's a working code snippet (based on the code from one of the answers) that outputs the font info for each text container:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
import sys

# The path to the PDF is taken from the first command-line argument.
with open(sys.argv[1], 'rb') as scr_file:
    for page_layout in extract_pages(scr_file):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                # Collect the distinct font names, sizes and colors used in this element.
                fontinfo = set()
                for text_line in element:
                    for character in text_line:
                        if isinstance(character, LTChar):
                            fontinfo.add(character.fontname)
                            fontinfo.add(character.size)
                            fontinfo.add(character.graphicstate.scolor)
                            fontinfo.add(character.graphicstate.ncolor)
                print("\n", element.get_text(), fontinfo)
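The values in 'scolor' and 'ncolor' are raw PDF color components, so their shape depends on the color space: typically a single number for gray, three for RGB, or four for CMYK. If you want one comparable value per character, something like the helper below might work. This is only a rough sketch: the name color_to_rgb is made up for illustration, it assumes the components are numbers in the range 0 to 1, it uses a naive CMYK formula with no color management, and it returns None for anything it does not recognize (pattern colors, for example).

```python
from numbers import Real


def color_to_rgb(color):
    """Convert a raw PDF color value to an (r, g, b) tuple of ints in 0..255.

    Hypothetical helper, not part of PDFMiner. Returns None when the value
    cannot be interpreted as gray, RGB or CMYK components.
    """
    if color is None:
        return None

    # A bare number is treated as a gray level.
    if isinstance(color, Real):
        components = [float(color)]
    else:
        try:
            components = [float(c) for c in color]
        except (TypeError, ValueError):
            return None  # e.g. a pattern name instead of numeric components

    def to_byte(x):
        # Clamp to 0..1 (assumed range) and scale to 0..255.
        return round(min(max(x, 0.0), 1.0) * 255)

    if len(components) == 1:  # gray
        g = to_byte(components[0])
        return (g, g, g)
    if len(components) == 3:  # RGB
        return tuple(to_byte(c) for c in components)
    if len(components) == 4:  # CMYK, naive conversion
        c, m, y, k = components
        return tuple(to_byte((1 - v) * (1 - k)) for v in (c, m, y))
    return None
```

In the loop above you could then call color_to_rgb(character.graphicstate.ncolor). Ordinary filled text is painted with the non-stroking color, so 'ncolor' is usually the one you want; 'scolor' only matters for outlined text.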