I am currently trying to implement LangChain functionality to chat with PDF documents. I have a bunch of PDF files stored in Azure Blob Storage, and I am trying to use LangChain's PyPDFLoader to load them into an Azure ML notebook. However, I have not been able to get it working. If a PDF is stored locally there is no problem, but to scale up I have to connect to the blob store. I have not found any documentation on this on the LangChain or Azure websites. I am wondering if any of you have had a similar problem.
Thank you
Below is an example of the code I am trying:
from azureml.fsspec import AzureMachineLearningFileSystem
from langchain.document_loaders import PyPDFLoader

fs = AzureMachineLearningFileSystem("<path to datastore>")

with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = PyPDFLoader(fd)
    data = loader.load()
Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject
Another example I tried:
from langchain.document_loaders import UnstructuredFileLoader

with fs.open('*/.../file.pdf', 'rb') as fd:
    loader = UnstructuredFileLoader(fd)
    documents = loader.load()
Error: TypeError: expected str, bytes or os.PathLike object, not StreamInfoFileObject
If you still need an answer: PyPDFLoader and the other LangChain file loaders expect a local file path, not a file-like object, which is why passing the stream fails. You have to pull the blob data into a BytesIO object and save it locally (whether temporarily or permanently) before processing the files. Here is how I do it:
import io

def az_load_files(storage_acc_name, container_name, filenames=None):
    # get_blob_container_client returns an Azure ContainerClient for the given container
    container_client = get_blob_container_client(container_name, storage_acc_name)
    blob_data = []
    for filename in filenames:
        blob_client = container_client.get_blob_client(filename)
        if blob_client.exists():
            # download the blob contents into an in-memory BytesIO object
            blob_data.append(io.BytesIO(blob_client.download_blob().readall()))
    return blob_data
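get_blob_container_client is a helper of my own and is not shown above; a minimal sketch of what it could look like, assuming you authenticate with DefaultAzureCredential (swap in a connection string or account key if that is how you connect):

from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

def get_blob_container_client(container_name, storage_acc_name):
    # build the account URL from the storage account name
    account_url = f"https://{storage_acc_name}.blob.core.windows.net"
    # DefaultAzureCredential is an assumption here; any valid credential works
    service_client = BlobServiceClient(account_url=account_url,
                                       credential=DefaultAzureCredential())
    return service_client.get_container_client(container_name)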
Then create a temp folder where the BytesIO objects can be written out and 'converted' into their respective document types:
import os
import tempfile

temp_pdfs = []
temp_dir = tempfile.mkdtemp()
# ss['loaded_files'] holds the BytesIO objects returned by az_load_files,
# ss['selected_files'] the matching file names
for i, byteio in enumerate(ss['loaded_files']):
    file_path = os.path.join(temp_dir, ss['selected_files'][i])
    with open(file_path, 'wb') as file:
        # write the in-memory bytes out to a real file on disk
        file.write(byteio.getbuffer())
    temp_pdfs.append(file_path)
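ss here is just a dictionary-like session state holding the selected file names and their downloaded bytes; in a plain notebook you could wire it together like this (the storage account, container, and file names below are placeholders):

# hypothetical values; replace with your own storage account, container, and blobs
selected = ['report1.pdf', 'report2.pdf']
ss = {
    'selected_files': selected,
    'loaded_files': az_load_files('<storage_account_name>', '<container_name>', filenames=selected),
}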
And use DirectoryLoader to load whatever document types you may have:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import (
    PyPDFLoader,
    DirectoryLoader,
    CSVLoader,
    Docx2txtLoader,
    TextLoader,
    UnstructuredExcelLoader,
    UnstructuredHTMLLoader,
    UnstructuredPowerPointLoader,
    UnstructuredMarkdownLoader,
    JSONLoader,
)
file_type_mappings = {
    '*.txt': TextLoader,
    '*.pdf': PyPDFLoader,
    '*.csv': CSVLoader,
    '*.docx': Docx2txtLoader,
    '*.xls': UnstructuredExcelLoader,
    '*.xlsx': UnstructuredExcelLoader,
    '*.html': UnstructuredHTMLLoader,
    '*.pptx': UnstructuredPowerPointLoader,
    '*.ppt': UnstructuredPowerPointLoader,
    '*.md': UnstructuredMarkdownLoader,
    '*.json': JSONLoader,
}
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)

docs = []
for glob_pattern, loader_cls in file_type_mappings.items():
    try:
        # JSONLoader needs a jq schema; the other loaders take no extra kwargs
        loader_kwargs = {'jq_schema': '.', 'text_content': False} if loader_cls == JSONLoader else None
        loader_dir = DirectoryLoader(
            temp_dir, glob=glob_pattern, loader_cls=loader_cls, loader_kwargs=loader_kwargs)
        documents = loader_dir.load()
        # for each glob pattern, split the loaded documents and add the chunks
        docs += text_splitter.split_documents(documents)
    except Exception:
        # skip file types that fail to load (e.g. missing optional dependencies)
        continue
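Since the files only needed to live in the temp folder long enough to be loaded, you can remove it once everything has been split into docs:

import shutil

# delete the temporary directory now that the documents are loaded and split
shutil.rmtree(temp_dir, ignore_errors=True)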