python powerpoint loader langchain

LangChain loader not working with PowerPoint files


The load_document function below loads various document types, such as .docx, .txt, and .pdf, into LangChain. I would also like to load PowerPoint documents, and I found a loader here: https://python.langchain.com/docs/integrations/document_loaders that I added to the function.

However, the function cannot read .pptx files because I am unable to pip install the package that UnstructuredPowerPointLoader needs. Can somebody please suggest a way to do this, or a way to change the function so that I can load .pptx files?

Python function follows below:

def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    elif extension == '.txt':
        from langchain.document_loaders import TextLoader
        print(f'Loading {file}')
        loader = TextLoader(file)
    elif extension == '.pptx':
        from langchain_community.document_loaders import UnstructuredPowerPointLoader
        print(f'Loading {file}')
        loader = UnstructuredPowerPointLoader(file)
    else:
        print('Document format is not supported!')
        return None

    data = loader.load()
    return data

The error occurs because !pip install unstructured is failing. I also tried !pip install -q unstructured["all-docs"]==0.12.0, but that failed as well. I would appreciate any help!


Solution

  • Try installing only the extras you need instead of all-docs: pip install "unstructured[docx,pptx]"
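
Once the install succeeds, the function from the question could be written as a lookup table, which also makes it easy to check up front whether the PowerPoint dependencies are present. The following is only a sketch under some assumptions: the helper names resolve_loader and missing_pptx_dependencies are made up for illustration, all four loaders are imported from langchain_community.document_loaders (the question imports three of them from langchain.document_loaders), and "pptx" is assumed to be the import name of the python-pptx package that the pptx extra pulls in.

```python
import importlib
import importlib.util
import os

# Extension -> (module path, loader class name). The loader classes are the
# ones used in the question; importing them all from langchain_community is
# an assumption of this sketch.
LOADERS = {
    ".pdf": ("langchain_community.document_loaders", "PyPDFLoader"),
    ".docx": ("langchain_community.document_loaders", "Docx2txtLoader"),
    ".txt": ("langchain_community.document_loaders", "TextLoader"),
    ".pptx": ("langchain_community.document_loaders", "UnstructuredPowerPointLoader"),
}


def resolve_loader(file):
    """Return (module, class_name) for the file's extension, or None."""
    _, extension = os.path.splitext(file)
    # Lower-case the extension so that "deck.PPTX" is handled like "deck.pptx".
    return LOADERS.get(extension.lower())


def missing_pptx_dependencies():
    """Return the import names needed for .pptx loading that are not installed."""
    required = ("unstructured", "pptx")  # "pptx" is assumed to be python-pptx
    return [name for name in required if importlib.util.find_spec(name) is None]


def load_document(file):
    """Load a file with the matching loader, or return None if unsupported."""
    target = resolve_loader(file)
    if target is None:
        print("Document format is not supported!")
        return None
    module_name, class_name = target
    # Import lazily so a missing optional package only matters when it is used.
    loader_cls = getattr(importlib.import_module(module_name), class_name)
    print(f"Loading {file}")
    return loader_cls(file).load()
```

Calling missing_pptx_dependencies() before loading a .pptx file gives a quick way to tell whether the pip install actually worked: an empty list means both packages can be found in the current environment.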