The load_document function below can load .docx, .txt, and .pdf files into LangChain. I would also like to load PowerPoint documents, so I added a loader I found at https://python.langchain.com/docs/integrations/document_loaders to the function.
However, the function cannot read .pptx files, because I am unable to pip install the unstructured package that UnstructuredPowerPointLoader depends on. Can somebody please suggest a way to install it, or a way to change the function below so I can load .pptx files?
The Python function follows:
def load_document(file):
    """Load a .pdf, .docx, .txt, or .pptx file into a list of LangChain documents."""
    import os

    # Compare the extension case-insensitively so 'SLIDES.PPTX' also matches.
    _, extension = os.path.splitext(file)
    extension = extension.lower()

    # Loaders are imported lazily so only the needed dependency must be installed.
    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    elif extension == '.txt':
        from langchain.document_loaders import TextLoader
        print(f'Loading {file}')
        loader = TextLoader(file)
    elif extension == '.pptx':
        # Requires the 'unstructured' package with its pptx extra.
        from langchain_community.document_loaders import UnstructuredPowerPointLoader
        print(f'Loading {file}')
        loader = UnstructuredPowerPointLoader(file)
    else:
        print('Document format is not supported!')
        return None

    data = loader.load()
    return data
The error I am getting is because !pip install unstructured is failing. I also tried !pip install -q unstructured["all-docs"]==0.12.0, but that failed as well. I appreciate any help!
Try installing only the extras you need instead of the full set: pip install "unstructured[docx,pptx]"
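If unstructured still cannot be installed, one possible fallback is to skip it entirely. A .pptx file is a zip archive whose slides are stored as XML, so the text can be pulled out with the standard library alone. The sketch below is not what UnstructuredPowerPointLoader does internally; it is a much simpler substitute that only collects the text runs on each slide and ignores notes, tables layout, and images. The function name extract_pptx_text is made up for this example, and slides are ordered by their file number, which usually but not always matches the presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace: slide text lives in <a:t> runs inside <a:p> paragraphs.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
# Slide parts are stored as ppt/slides/slide1.xml, slide2.xml, ...
SLIDE_RE = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_pptx_text(path):
    """Return a sorted list of (slide_number, text) tuples for a .pptx file.

    Hypothetical helper for illustration; not part of LangChain or unstructured.
    """
    slides = []
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            match = SLIDE_RE.fullmatch(name)
            if not match:
                continue  # skip relationships, layouts, media, etc.
            root = ET.fromstring(zf.read(name))
            paragraphs = []
            for para in root.iter(f"{A_NS}p"):
                # Join the runs of one paragraph into a single line.
                line = "".join(t.text or "" for t in para.iter(f"{A_NS}t")).strip()
                if line:
                    paragraphs.append(line)
            slides.append((int(match.group(1)), "\n".join(paragraphs)))
    return sorted(slides)
```

To use this inside load_document, the '.pptx' branch could call extract_pptx_text(file) and wrap each slide's text in a LangChain Document (for example Document(page_content=text, metadata={'source': file, 'slide': number}), with Document imported from langchain_core.documents), returning that list directly instead of going through loader.load().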