excelxmlopenpyxlpywin32xlwings

Is there a way to see if an excel workbook contains shapes using either openpxyl or by looking at the underlying XML?


I have thousands of excel workbooks and need to create a list of which ones contain shapes that might have text (ie a text box). How would I use either openpyxl or the underlying XML to generate a list of all shapes with text? I tried using both xlwings and pywin32, but those libraries both parse the shapes too slowly to be useful.

Current slow code:

for shape in ws_xlwings.shapes:
    if shape.type == "text_box" or shape.type == "auto_shape":
        try:
            str_cell_value = shape.text.lower()
            str_cell_value_upper = shape.text
        except: #shape might not have text
            continue

Solution

  • Here's a POC on how to handle excel as zip files, extract drawing xml and detect text boxes. It's probably faster than parsing the whole spreadsheet.

    from zipfile import ZipFile
    from lxml import etree
    
    def main():
        z = ZipFile('/home/luis/tmp/shapes.xlsx')
        drawings = [ drw for drw in z.filelist if "drawings" in drw.filename.split('/') ]
        for drw in drawings:
            with z.open(drw.filename) as sstr:
                ssdata = sstr.read()
                
                tree = etree.fromstring(ssdata)
                ns = {"xdr": "http://schemas.openxmlformats.org/drawingml/2006/spreadsheetDrawing",
                      "a": "http://schemas.openxmlformats.org/drawingml/2006/main", 
                      "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
                      }
                shapes = tree.xpath('//xdr:txBody[preceding-sibling::xdr:spPr]/a:p/a:r/a:t/text()', namespaces = ns)
                
                print(shapes)
    
    if __name__ == '__main__':
        main()
    

    Result for a simple xslx with 2 sheets and 1 shape per sheet

    ['Text box 2']
    ['Sheet2 ellipse text 3']
    

    To just detect text boxes

    shapes_cnt = tree.xpath('count(//xdr:txBody[preceding-sibling::xdr:spPr]/a:p/a:r/a:t)', namespaces = ns)
    

    Based on https://rsmith.home.xs4all.nl/howto/reading-xlsx-files-with-python.html