I'm trying to read legacy Word documents (.doc) in a CustomWordLoader for LangChain. I can already read .docx files using the python-docx package.
The stream is created by reading a Word document from a SharePoint site.
Here is the code for .docx files:
from docx import Document as DocxDocument
# The import path for BaseLoader varies by LangChain version
# (older releases expose it as langchain.document_loaders.base.BaseLoader)
from langchain_core.document_loaders import BaseLoader


class CustomWordLoader(BaseLoader):
    """
    A custom loader for Word (.docx) documents. It extends BaseLoader and overrides load_and_split.
    It uses the python-docx library to parse the document and optionally splits the text into smaller documents.

    Attributes:
        stream (io.BytesIO): A binary stream of the Word document.
        filename (str): The name of the Word document.
    """

    def __init__(self, stream, filename: str):
        # Initialize with a binary stream and filename
        self.stream = stream
        self.filename = filename

    def load_and_split(self, text_splitter=None):
        # Use python-docx to parse the Word document from the binary stream
        doc = DocxDocument(self.stream)
        # Extract and concatenate all paragraph texts into a single string
        text = "\n".join(p.text for p in doc.paragraphs)
        # Check if a text splitter utility is provided
        if text_splitter is not None:
            # Use the provided splitter to divide the text into smaller documents
            split_text = text_splitter.create_documents([text])
        else:
            # Without a splitter, treat the entire text as one document
            split_text = [{'text': text, 'metadata': {'source': self.filename}}]
        # Add source metadata to each resulting document
        # (loop variable renamed so it no longer shadows the parsed `doc`)
        for item in split_text:
            if isinstance(item, dict):
                item['metadata'] = {**item.get('metadata', {}), 'source': self.filename}
            else:
                item.metadata = {**item.metadata, 'source': self.filename}
        return split_text
My solution will be deployed in a Docker container using the "3.11.8-alpine3.18" image (a slim Linux distribution).
For security reasons, I can't download the file locally, so I really need to be able to read the stream directly, as in my example: doc = DocxDocument(self.stream)
python-docx can read .docx files but not .doc files, so I looked for an equivalent package that can read .doc.
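Since the stream may hold either format, one way to decide which parser to hand it to is to look at its first bytes. The sketch below is only an illustration: sniff_word_format is a hypothetical helper, not part of python-docx or LangChain. It relies on the fact that legacy .doc files are OLE2 compound files and .docx files are ZIP archives. The check identifies the container only, so an .xls file would also report "doc" and any ZIP archive would report "docx".

```python
import io

# Leading bytes that identify the two container formats.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary Office files (.doc, .xls, ...)
ZIP_MAGIC = b"PK\x03\x04"  # ZIP archives, which includes .docx


def sniff_word_format(stream: io.BytesIO) -> str:
    """Return 'doc', 'docx', or 'unknown' based on the stream's first bytes.

    Hypothetical helper for illustration; it only inspects the container signature.
    """
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)  # rewind so the real parser still sees the whole stream
    if header.startswith(OLE2_MAGIC):
        return "doc"
    if header.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"
```

A loader could call this before parsing and fall back to the filename extension when the result is "unknown".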
I was able to read the .doc file using the textract package. I have to save the stream to a local file first, but that's the only way I found.
Here is my code:
import os
import uuid

import textract
# The import path for BaseLoader varies by LangChain version
# (older releases expose it as langchain.document_loaders.base.BaseLoader)
from langchain_core.document_loaders import BaseLoader


class CustomWordLoader(BaseLoader):
    """
    A custom loader for Word documents, extending BaseLoader. It reads Word documents from a binary stream,
    writes them temporarily to disk, and uses textract to extract text. If textract fails, an exception is raised.
    """

    def __init__(self, stream, filename: str):
        self.stream = stream
        self.filename = filename

    def load_and_split(self, text_splitter=None):
        # Generate a unique filename
        temp_filename = str(uuid.uuid4()) + '.doc'
        # Create a temporary directory
        temp_dir = os.path.join(os.getcwd(), 'temp')
        os.makedirs(temp_dir, exist_ok=True)
        # Full path to the temporary file
        temp_file_path = os.path.join(temp_dir, temp_filename)
        try:
            # Write the content of the stream into the temporary file
            with open(temp_file_path, 'wb') as f:
                f.write(self.stream.read())
            # Use textract to extract the text from the file
            text = textract.process(temp_file_path).decode('utf-8')
        finally:
            # Remove the temporary file, even if writing or extraction fails
            if os.path.exists(temp_file_path):
                os.remove(temp_file_path)
        if text_splitter is not None:
            split_text = text_splitter.create_documents([text])
        else:
            split_text = [{'text': text, 'metadata': {'source': self.filename}}]
        for item in split_text:
            if isinstance(item, dict):
                item['metadata'] = {**item.get('metadata', {}), 'source': self.filename}
            else:
                item.metadata = {**item.metadata, 'source': self.filename}
        return split_text
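A possible variation on the temporary-file step, as a sketch rather than a tested replacement: Python's tempfile module can create the file in the system temp directory with owner-only permissions on POSIX, and a try/finally guarantees cleanup. The helper name extract_via_temp_file and its extractor parameter are hypothetical, introduced here only for illustration.

```python
import os
import tempfile
from typing import BinaryIO, Callable


def extract_via_temp_file(
    stream: BinaryIO,
    extractor: Callable[[str], str],
    suffix: str = ".doc",
) -> str:
    """Write the stream to a temp file, run extractor(path), and always delete the file."""
    # delete=False so the file can be closed before the extractor opens it by path
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        with tmp:  # closes the file on exit, flushing the written bytes
            tmp.write(stream.read())
        return extractor(tmp.name)
    finally:
        # Runs whether the extractor succeeded or raised
        os.remove(tmp.name)
```

With textract, the extractor could be lambda path: textract.process(path).decode('utf-8'), matching the call used in the loader above.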
I hope this can help someone!