I have thousands of excel workbooks and need to create a list of which ones contain shapes that might have text (ie a text box). How would I use either openpyxl or the underlying XML to generate a list of all shapes with text? I tried using both xlwings and pywin32, but those libraries both parse the shapes too slowly to be useful.
Current slow code:
for shape in ws_xlwings.shapes:
if shape.type == "text_box" or shape.type == "auto_shape":
try:
str_cell_value = shape.text.lower()
str_cell_value_upper = shape.text
except: #shape might not have text
continue
Here's a POC on how to handle excel as zip files, extract drawing xml and detect text boxes. It's probably faster than parsing the whole spreadsheet.
from zipfile import ZipFile
from lxml import etree
def main():
z = ZipFile('/home/luis/tmp/shapes.xlsx')
drawings = [ drw for drw in z.filelist if "drawings" in drw.filename.split('/') ]
for drw in drawings:
with z.open(drw.filename) as sstr:
ssdata = sstr.read()
tree = etree.fromstring(ssdata)
ns = {"xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
"a": "http://schemas.openxmlformats.org/drawingml/2006/main",
"r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
}
shapes = tree.xpath('//xdr:txBody[preceding-sibling::xdr:spPr]/a:p/a:r/a:t/text()', namespaces = ns)
print(shapes)
if __name__ == '__main__':
main()
Result for a simple xslx with 2 sheets and 1 shape per sheet
['Text box 2']
['Sheet2 ellipse text 3']
To just detect text boxes
shapes_cnt = tree.xpath('count(//xdr:txBody[preceding-sibling::xdr:spPr]/a:p/a:r/a:t)', namespaces = ns)
Based on https://rsmith.home.xs4all.nl/howto/reading-xlsx-files-with-python.html