I am creating RAG using langchain. Everything is working fine when it comes to PDF.
But, I have a task to ingest excel .xlsx spreadsheets, find the table inside of them and then parse that table for the RAG.
I am unable to automatically identify the empty rows, columns, merged rows/columns for text etc, and remove them. Are there open source libraries that can help with it? Or is there something in langchain itself that can be of help?
You can use UnstructuredExcelLoader to read excel, it will automatically remove empty rows, columns, merged rows/columns.
here is the example excel:
install UnstructuredExcelLoader:
pip install --upgrade --quiet langchain-community unstructured openpyxl
here is the code:
from langchain_community.document_loaders import UnstructuredExcelLoader
loader = UnstructuredExcelLoader("info.xlsx", mode="elements")
docs = loader.load()
print(docs)
output:
[
Document(
metadata={
'source': 'info.xlsx',
'filename': 'info.xlsx',
'last_modified': '2025-02-26T12:27:20',
'page_name': 'sheet1',
'page_number': 1,
'text_as_html': '...',
'languages': ['eng', 'slk'],
'filetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
'category': 'Table',
'element_id': 'f49eba60e57d559f9af34a28d1a01e51'},
page_content='id name email phone a1 nick 12345678 a2 jack 2@mail.com 12345678 a3 john 3@mail.com 12345678 a4 4@mail.com 12345678 a5 may 5@mail.com 12345678'
)
]
ref : https://python.langchain.com/docs/integrations/document_loaders/microsoft_excel/