
python-docx: Parse a table into a pandas DataFrame


I'm using the python-docx library to extract content from an MS Word document. I can get all the tables from the document with this library. However, I'd like to parse each table into a pandas DataFrame. Is there any built-in functionality for that, or do I have to do it manually? Also, is it possible to find out the name of the heading under which each table lies? Thank you.

from docx import Document
document = Document('test.docx')

tabs = document.tables

Solution

  • You can extract the tables from the document into pandas DataFrames with this code:

    from docx import Document  # Import the Document class from the docx module to work with Word documents
    import pandas as pd  # Import pandas for data manipulation and analysis
    
    # Load the Word document
    document = Document('test.docx')
    
    # Initialize an empty list to store tables
    tables = []
    
    # Iterate through each table in the document
    for table in document.tables:
        # Build a rows x columns grid of empty strings (a plain list of lists, not yet a DataFrame)
        data = [['' for _ in range(len(table.columns))] for _ in range(len(table.rows))]
        
        # Iterate through each row in the current table
        for i, row in enumerate(table.rows):
            # Copy each cell's text into the matching grid position
            for j, cell in enumerate(row.cells):
                data[i][j] = cell.text
        
        # Convert the grid to a pandas DataFrame and add it to the tables list
        tables.append(pd.DataFrame(data))
    
    # Print the list of DataFrames representing the tables
    print(tables)
    

    The tables list then holds one DataFrame per table, in document order.