artificial-intelligencelangchain

Extracting tables and cleaning it from spreadsheets


I am creating RAG using langchain. Everything is working fine when it comes to PDF.

But, I have a task to ingest excel .xlsx spreadsheets, find the table inside of them and then parse that table for the RAG.

I am unable to automatically identify the empty rows, columns, merged rows/columns for text etc, and remove them. Are there open source libraries that can help with it? Or is there something in langchain itself that can be of help?


Solution

  • You can use UnstructuredExcelLoader to read excel, it will automatically remove empty rows, columns, merged rows/columns.

    here is the example excel:

    info.xlsx

    install UnstructuredExcelLoader:

    pip install --upgrade --quiet langchain-community unstructured openpyxl
    

    here is the code:

    from langchain_community.document_loaders import UnstructuredExcelLoader
    
    loader = UnstructuredExcelLoader("info.xlsx", mode="elements")
    docs = loader.load()
    print(docs)
    

    output:

    [
        Document(
            metadata={
                'source': 'info.xlsx', 
                'filename': 'info.xlsx', 
                'last_modified': '2025-02-26T12:27:20', 
                'page_name': 'sheet1', 
                'page_number': 1, 
                'text_as_html': '...', 
                'languages': ['eng', 'slk'], 
                'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet', 
                'category': 'Table', 
                'element_id': 'f49eba60e57d559f9af34a28d1a01e51'}, 
            page_content='id name email phone a1 nick 12345678 a2 jack 2@mail.com 12345678 a3 john 3@mail.com 12345678 a4 4@mail.com 12345678 a5 may 5@mail.com 12345678'
        )
    ]
    

    ref : https://python.langchain.com/docs/integrations/document_loaders/microsoft_excel/