python powerpoint python-pptx

How to use python-pptx to extract infrequent tables?


I have a pipeline that needs to ingest PowerPoint (pptx) files using Python. These files mostly contain text, occasionally contain tables, and don't share a consistent format or design. I need to extract this data, including the (mostly text) cell values of any tables, and eventually load it into a table with the presentation name, the presentation date, and a free-text field holding all of the ppt content.

I've been exploring the python-pptx module, and extracting most of the data is easy enough with the code below, but it skips any table on a slide:

for slide_number, slide in enumerate(presentation.slides):
    print(f"Slide {slide_number + 1}:")
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            print(shape.text)

The question is: what's the best way to grab tables with this module (or another lightweight tool)? I've been reading the module's documentation, but no obvious solution has presented itself, given that the tables can appear anywhere.
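
To make the target concrete, the record described above (presentation name, presentation date, and one free-text field) might look like the following minimal sketch. The names `PresentationRecord` and `build_record` are hypothetical, and taking the presentation name from the file name is an assumption. Where the date comes from is also left open: python-pptx exposes a `core_properties.created` timestamp that could serve, if the files carry one.

```python
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional


@dataclass
class PresentationRecord:
    name: str                 # assumed: file name without the extension
    date: Optional[datetime]  # assumed source, e.g. core_properties.created; may be missing
    content: str              # all extracted text joined into one field


def build_record(path: str, fragments: Iterable[str],
                 date: Optional[datetime] = None) -> PresentationRecord:
    """Combine already-extracted text fragments into one record."""
    # Drop empty fragments so blank shapes do not add empty lines.
    content = "\n".join(f.strip() for f in fragments if f and f.strip())
    return PresentationRecord(name=Path(path).stem, date=date, content=content)
```

The extraction step is kept out of this sketch on purpose, so the record can be built from whatever text the shape loop ends up collecting.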


Solution

  • Try this:

    for slide_number, slide in enumerate(presentation.slides):
        print(f"Slide {slide_number + 1}:")
        for shape in slide.shapes:
            if hasattr(shape, "text"):
                print(shape.text)
            # Check whether the shape contains a table
            if shape.has_table:
                # iter_cells() yields the cells left to right, top to bottom
                for cell in shape.table.iter_cells():
                    print(cell.text)

    Tested against a pptx file with a title slide and a second slide holding a four-column table.

    The output is:

    Slide 1:
    File title
    Table read testing
    Slide 2:
    Column A
    Column B
    Column C
    Column D
    Cell A1
    Cell B1
    Cell C1
    Cell D1
    Cell A2
    Cell B2
    Cell C2
    Cell D2
    Cell A3
    Cell B3
    Cell C3
    Cell D3
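
If the goal is one free-text field per presentation rather than printed output, the same loop can be wrapped in a small helper. This is a minimal sketch and not part of the original answer. The function names `iter_shape_text` and `presentation_text` are hypothetical, python-pptx is assumed to be installed, and descending into grouped shapes through their `shapes` attribute is an extra assumption about what the input files may contain. Unlike the loop above, it skips empty strings.

```python
def iter_shape_text(shape):
    """Yield every non-empty text fragment in one shape, including table cells."""
    # Grouped shapes expose their members through a `shapes` attribute.
    if hasattr(shape, "shapes"):
        for child in shape.shapes:
            yield from iter_shape_text(child)
        return
    # Tables live in graphic-frame shapes, which have no `text` attribute,
    # so they need their own branch.
    if getattr(shape, "has_table", False):
        for cell in shape.table.iter_cells():
            if cell.text:
                yield cell.text
    elif hasattr(shape, "text") and shape.text:
        yield shape.text


def presentation_text(path):
    """Return all slide text of the .pptx file at `path` as one string."""
    # Imported here so iter_shape_text stays usable without python-pptx.
    from pptx import Presentation

    presentation = Presentation(path)
    fragments = []
    for slide in presentation.slides:
        for shape in slide.shapes:
            fragments.extend(iter_shape_text(shape))
    return "\n".join(fragments)
```

Calling `presentation_text("example.pptx")` (a placeholder file name) would return the slide text in shape order, with table cells read left to right, top to bottom.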