python · langchain · py-langchain · weaviate

Error "'tuple' object has no attribute 'page_content'" when using Weaviate.add_documents


I have the following piece of code:

if file.filename.lower().endswith('.pdf'):
    pdf = ep.PDFLoad(file_path)  # this is the loader from langchain
    doc = pdf.load()
    archivo = crear_archivo(doc, file)

Inside the crear_archivo function I split the document and send it to Weaviate.add_documents:

    cliente = db.NewVect()  # This one creates the weaviate.client
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    docs = text_splitter.split_documents(document)

    embeddings = OpenAIEmbeddings()
    # using add_documents instead of from_documents since I don't want to initialize a new vectorstore
    return Weaviate.add_documents(docs, embeddings, client=cliente, weaviate_url=EnvVect.Host,
                                  by_text=False, index_name="LangChain")

# Some more logic to save the doc to another database

Whenever I run the code, it breaks inside the Weaviate.add_documents() call with the error: 'tuple' object has no attribute 'page_content'. I checked the type of docs, and it doesn't seem wrong: it's a List[Document], which is the type the function accepts.
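For context, this exact error is what you'd see if the list passed in contained plain (text, metadata) tuples rather than Document objects, since a tuple has no page_content attribute. A minimal reproduction sketch, using a hypothetical FakeDocument dataclass as a stand-in for langchain.schema.Document:

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in (hypothetical) mirroring langchain.schema.Document's shape."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def read_contents(docs):
    # What a vectorstore's add_documents effectively does first:
    # read .page_content off every item.
    return [d.page_content for d in docs]

good = [FakeDocument("chunk one"), FakeDocument("chunk two")]
print(read_contents(good))  # ['chunk one', 'chunk two']

bad = [("chunk one", {}), ("chunk two", {})]  # tuples, not Documents
try:
    read_contents(bad)
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'page_content'
```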

How can I make it work? I roughly followed this approach, but the difference is that I am loading files such as PDF, txt, etc.


Solution

  • (I'm also new here.)

    The error you're getting is likely because the items in docs aren't Document objects.

    AFAIK, what LangChain calls a "document" is a list of Document objects. If you run type(docs[0]), you should get langchain.schema.document.Document. Each Document object holds two fields: page_content, which accepts a string, and metadata, which accepts only a dictionary: {page_content: str, metadata: dict}. This isn't explained very well in LangChain's documentation.

    My suggestions to tackle your problem:

    1. Make sure that the document you're splitting in docs = text_splitter.split_documents(document) is actually a list of LangChain Document objects. Use print(document); the first line of the output should look like [Document(page_content='your text etc..., and the end of the output should look like ...end of your text', metadata={'...
    2. If document isn't a LangChain Document, you'll need to check how you created it.
    3. If document is a LangChain Document, try Weaviate.from_documents() instead.

    Hope this helps, and a hug!
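    Suggestion 1 can be turned into a small pre-flight check before calling add_documents. A sketch that only duck-types against the {page_content: str, metadata: dict} shape described above (an isinstance check against the real langchain.schema.Document would be stricter); the guard comment shows a hypothetical use inside crear_archivo:

    ```python
    def looks_like_documents(docs):
        """Return True if every item has the Document shape:
        a string page_content and a dict metadata."""
        return all(
            isinstance(getattr(d, "page_content", None), str)
            and isinstance(getattr(d, "metadata", None), dict)
            for d in docs
        )

    # e.g. guard the call in crear_archivo before add_documents:
    # assert looks_like_documents(docs), "split_documents did not return Documents"
    ```

    A tuple fails the check immediately (getattr returns None), which surfaces the problem before Weaviate ever sees the data.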