I have the following piece of code:

```python
if file.filename.lower().endswith('.pdf'):
    pdf = ep.PDFLoad(file_path)  # this is the loader from langchain
    doc = pdf.load()
    archivo = crear_archivo(doc, file)
```
Inside the `crear_archivo` function I am splitting the document and sending it to `Weaviate.add_documents`:

```python
cliente = db.NewVect()  # This one creates the weaviate.client
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(document)
embeddings = OpenAIEmbeddings()
# using this instead of from_documents since I don't want to initialize a new vectorstore
return Weaviate.add_documents(docs, embeddings, client=cliente, weaviate_url=EnvVect.Host, by_text=False, index_name="LangChain")
# Some more logic to save the doc to another database
```
Whenever I try to run the code, it breaks during the `Weaviate.add_documents()` call with the following error:

```
'tuple' object has no attribute 'page_content'
```

I tried checking the type of `docs`, but that doesn't seem wrong, since it returns a `List[Document]`, which is the same type the function accepts.

How can I make it work? I roughly followed this approach, but the difference is that I am loading files such as PDF, txt, etc. (I'm also new here.)
The error you're getting is likely because the elements of `docs` aren't `Document` objects.

As far as I know, what a loader gives you in LangChain is a list of `Document` objects. If you run `type(docs[0])` you should get `langchain.schema.document.Document`. This `Document` object is essentially a dictionary with two keys: `page_content`, which takes a string value, and `metadata`, which only accepts a dictionary: `{page_content: str, metadata: dict}`. This isn't very well explained in LangChain's documentation.
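You can reproduce your exact error message with a plain tuple, which is what happens when an element of the list isn't a `Document` (no LangChain needed for this check):

```python
# A tuple has no .page_content attribute, unlike a LangChain Document.
chunk = ("some text", {"source": "file.pdf"})

try:
    chunk.page_content
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'page_content'
```

So somewhere in your pipeline, a `(text, metadata)`-style tuple is being passed where a `Document` is expected.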
My suggestions to tackle your problem:

1. Check whether the `document` you're splitting in `docs = text_splitter.split_documents(document)` is effectively a LangChain Document object. Use `print(document)`, and you should see this in the first line: `[Document(page_content='your text etc...`, and at the end of the output, you should see `...end of your text', metadata={'...`.
2. If `document` isn't a LangChain Document, you'll need to check how you created it.
3. If `document` is a LangChain Document, try `Weaviate.from_documents()` instead.

Hope this helps, and a hug!
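P.S. Note that `add_documents` is an instance method, not a classmethod, so calling it as `Weaviate.add_documents(docs, embeddings, ...)` binds `docs` to `self` and shifts every other argument over by one. A toy class (not the real Weaviate wrapper) shows the kind of error this shift produces:

```python
class Store:  # toy stand-in for a vectorstore class
    def add_documents(self, documents):
        # expects each element of `documents` to have .page_content
        return [doc.page_content for doc in documents]

docs = [("text one", {}), ("text two", {})]

# Called on the class, `docs` becomes `self` and the next argument
# becomes `documents` -- the arguments land one position off.
try:
    Store.add_documents(docs, [("text", {})])
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'page_content'
```

`Weaviate.from_documents(docs, embeddings, client=..., index_name=...)` is a real classmethod, so with it the arguments land where they should; alternatively, build a `Weaviate(...)` instance once and call `add_documents` on that instance.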