I have PDF files and I want to extract information from the first page only. My solution is to extract the first page, save it to a new file, read that file back, and pass its bytes to Textract.
It works, but I do not like this solution. Why do I need to save a file and then immediately read the exact same file back? Can I not use the data directly in memory at runtime?
Here is what I have done that I don't like:
    from PyPDF2 import PdfReader, PdfWriter
    from io import BytesIO
    import boto3

    def analyse_first_page(bucket_name, file_name):
        s3 = boto3.resource("s3")
        obj = s3.Object(bucket_name, file_name)
        fs = obj.get()['Body'].read()
        pdf = PdfReader(BytesIO(fs), strict=False)
        writer = PdfWriter()
        page = pdf.pages[0]
        writer.add_page(page)

        # Here is the part I do not like
        with open("first_page.pdf", "wb") as output:
            writer.write(output)
        with open("first_page.pdf", "rb") as pdf_file:
            encoded_string = bytearray(pdf_file.read())

        # Analyse text
        textract = boto3.client('textract')
        response = textract.detect_document_text(Document={"Bytes": encoded_string})
        return response

    analyse_first_page(bucket, file_name)
Is there no AWS way to do this? Is there no better way to do this?
You can use BytesIO as an in-memory stream, so there is no need to write to a file and then read it back.

    with BytesIO() as bytes_stream:
        writer.write(bytes_stream)
        # boto3 base64-encodes the Bytes blob itself, so pass the raw bytes
        encoded_string = bytes_stream.getvalue()
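
For completeness, here is a minimal sketch of how the whole function from the question might look with the in-memory stream in place of the temporary file. The helper name `writer_to_bytes` is my own invention for illustration, not part of PyPDF2 or boto3, and the sketch assumes the same bucket layout and credentials as the question. I have not run it against a real bucket.

```python
from io import BytesIO


def writer_to_bytes(writer):
    """Serialize any object exposing write(stream) into raw bytes, in memory."""
    # Hypothetical helper: works with PdfWriter because it has write(stream).
    with BytesIO() as stream:
        writer.write(stream)
        return stream.getvalue()


def analyse_first_page(bucket_name, file_name):
    # Third-party imports are kept local so writer_to_bytes stays usable alone.
    import boto3
    from PyPDF2 import PdfReader, PdfWriter

    # Download the PDF from S3 straight into memory.
    body = boto3.resource("s3").Object(bucket_name, file_name).get()["Body"].read()

    # Copy only the first page into a new, single-page PDF.
    writer = PdfWriter()
    writer.add_page(PdfReader(BytesIO(body), strict=False).pages[0])

    # Pass raw bytes: boto3 handles the base64 encoding of the Bytes field.
    textract = boto3.client("textract")
    return textract.detect_document_text(Document={"Bytes": writer_to_bytes(writer)})
```

You would call it exactly as before, `analyse_first_page(bucket, file_name)`, and nothing is written to disk along the way.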