I'm struggling to read my PDF files asynchronously. I tried using aiofiles, which is open source on GitHub. I want to extract the text from PDFs, and I want to do it with pdfminer because pypdf currently does not render math (Greek letters) or double letters (e.g. ff) properly.
The routine that works is:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

with open(pdf_filename, 'rb') as file:
    resource_manager = PDFResourceManager(caching=False)
    # Create a string buffer object for text extraction
    text_io = StringIO()
    # Create a text converter object
    text_converter = TextConverter(resource_manager, text_io, laparams=LAParams())
    # Create a PDF page interpreter object
    page_interpreter = PDFPageInterpreter(resource_manager, text_converter)
    # Process each page in the PDF file
    async for page in extract_pages(file):
        page_interpreter.process_page(page)
    text = text_io.getvalue()
But if I replace
    with open(pdf_filename, 'rb') as file:
with
    async with aiofiles.open(pdf_filename, 'rb') as file:
then the line
    async for page in extract_pages(file):
fails with this error:
    async for page in extract_pages(file):
    TypeError: 'async for' requires an object with __aiter__ method, got generator
So how do I get the file returned by aiofiles to behave like a normal file that also supports __aiter__?
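The error itself does not depend on pdfminer or aiofiles. A minimal sketch that reproduces it with only the standard library (the names numbers and consume are made up for this illustration):

```python
import asyncio


def numbers():
    # A regular (synchronous) generator: it supports __iter__ but not __aiter__.
    yield 1
    yield 2


async def consume():
    try:
        # 'async for' needs an asynchronous iterable, so this raises TypeError.
        async for n in numbers():
            print(n)
    except TypeError as exc:
        return exc
    return None


error = asyncio.run(consume())
print(type(error).__name__)  # TypeError
```

Any plain generator passed to async for fails this way, regardless of how the underlying file was opened.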
I also replaced the original extract_pages function with the following, to try to make it work asynchronously:
async def extract_pages(file):
    with file:
        for page in PDFPage.get_pages(file, caching=False):
            yield page
Many thanks if you can help me read a PDF file asynchronously in Python with pdfminer, or with something equivalent that can read math.
PDFPage.get_pages is a regular (synchronous) generator, so it must be wrapped in an asynchronous generator before it can be used with async for. I haven't found a ready-made solution for this, so here is my own:
import asyncio


class WrappedStopIteration(Exception):
    """ "StopIteration" can't be transferred through a Future, so we need our own replacement"""
    pass


def nextwrap(it):
    try:
        return next(it)
    except StopIteration as e:
        raise WrappedStopIteration(e)


async def agen(it):
    loop = asyncio.get_running_loop()
    try:
        while True:
            v = await loop.run_in_executor(None, nextwrap, it)
            yield v
    except WrappedStopIteration:
        pass
(Caveat: Fails if thread-local variables are used or the generator/iterator otherwise assumes that it is executed completely in the same thread.)
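The reason for the caveat is that run_in_executor resumes the generator body in a worker thread of the default executor, not in the thread that runs the event loop. A small sketch using only the standard library shows this (thread_reporting_gen and collect_thread_ids are names made up for this illustration):

```python
import asyncio
import threading


def thread_reporting_gen(n):
    # Each yielded value is the id of the thread that resumed the generator body.
    for _ in range(n):
        yield threading.get_ident()


async def collect_thread_ids(n=3):
    loop = asyncio.get_running_loop()
    gen = thread_reporting_gen(n)
    ids = []
    for _ in range(n):
        # next(gen) runs in a worker thread of the default executor.
        ids.append(await loop.run_in_executor(None, next, gen))
    return ids


main_id = threading.get_ident()
worker_ids = asyncio.run(collect_thread_ids())
print(main_id in worker_ids)  # False
```

Successive steps are not guaranteed to land on the same worker thread either, so state kept in thread-local storage may not be visible from one step to the next.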
In your case, agen can be used as follows:
async def extract_pages(file):
    # "with file:" can be omitted because there is already the outer "with"
    # enclosing the whole execution
    async for page in agen(PDFPage.get_pages(file, caching=False)):
        yield page
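A possible variant of the same wrapper, as a sketch only: instead of a custom exception, pass a sentinel as the default argument of next(), so that exhaustion never raises inside the executor. Here slow_pages is a hypothetical stand-in for a blocking generator such as PDFPage.get_pages, and agen_sentinel and _DONE are names made up for this illustration; I have not run it against pdfminer.

```python
import asyncio
import time

_DONE = object()  # unique sentinel that marks the end of iteration


def slow_pages(n):
    # Hypothetical stand-in for a blocking generator such as PDFPage.get_pages.
    for i in range(n):
        time.sleep(0.01)  # simulate blocking work per page
        yield f"page {i}"


async def agen_sentinel(it):
    loop = asyncio.get_running_loop()
    while True:
        # next(it, _DONE) returns _DONE instead of raising StopIteration.
        value = await loop.run_in_executor(None, next, it, _DONE)
        if value is _DONE:
            return
        yield value


async def main():
    return [page async for page in agen_sentinel(slow_pages(3))]


pages = asyncio.run(main())
print(pages)  # ['page 0', 'page 1', 'page 2']
```

The same thread caveat applies: each step of the wrapped generator still runs in an executor thread.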