I tried to convert a PDF to JPEG on Google Cloud Functions using the Python module pdf2image, but I have no idea how to solve the errors No such file or directory: 'pdfinfo' and "Unable to get page count. Is poppler installed and in PATH?" in the cloud function.
The error is very similar to the one in this question. pdf2image is a wrapper around poppler's "pdftoppm" and "pdftocairo". But how can I install the poppler package on Google Cloud Functions and add it to PATH? I can't find any relevant references for it. Is it even possible? If not, what could be done instead?
There is also this question, but it isn't useful.
The code looks something like the following. The entry point is process_image.
import requests
from pdf2image import convert_from_path

def process_image(event, context):
    # Download sample pdf file
    url = 'https://www.adobe.com/support/products/enterprise/knowledgecenter/media/c4611_sample_explain.pdf'
    r = requests.get(url, allow_redirects=True)
    open('/tmp/sample.pdf', 'wb').write(r.content)

    # Error occurs on this line
    pages = convert_from_path('/tmp/sample.pdf')

    # Save pages to /tmp
    for idx, page in enumerate(pages):
        output_file_path = f"/tmp/{str(idx)}.jpg"
        page.save(output_file_path, 'JPEG')
        # To be saved to cloud storage
requirements.txt:
requests==2.25.1
pdf2image==1.14.0
This is the error I get:
Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 441, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 858, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/opt/python3.8/lib/python3.8/subprocess.py", line 1706, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/functions_framework/__init__.py", line 149, in view_func
    function(data, context)
  File "/workspace/main.py", line 11, in process_image
    pages = convert_from_path('/tmp/sample.pdf')
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 97, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/layers/google.python.pip/pip/lib/python3.8/site-packages/pdf2image/pdf2image.py", line 467, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
Thanks in advance for any help.
Cloud Functions does not support installing custom system-level packages (even though it supports third-party libraries for the relevant programming language through a package manager such as npm or pip). As shown in https://cloud.google.com/functions/docs/reference/system-packages, there is no “poppler” package among the pre-installed ones.
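To see for yourself which command-line tools the runtime provides, you can probe PATH from inside the function. This is only a minimal sketch: find_missing_binaries is a hypothetical helper name, and the list of binaries is an assumption based on the tools named in this thread (the poppler tools pdf2image calls, plus gs for Ghostscript).

```python
import shutil


def find_missing_binaries(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# pdf2image shells out to the poppler tools; gs is the Ghostscript binary.
missing = find_missing_binaries(["pdfinfo", "pdftoppm", "pdftocairo", "gs"])
print("Missing binaries:", missing)
```

Logging this once from a deployed function shows whether the PDFInfoNotInstalledError comes from a missing binary or from something else.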
However, you can still make use of the other pre-installed packages. Ghostscript can be used to convert PDFs to images.
First, save the PDF file to the cloud function's file system (e.g. by downloading it from Cloud Storage). You only have disk write access to /tmp (https://cloud.google.com/functions/docs/concepts/exec#file_system).
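Since /tmp is the only writable location, and files left there can persist between invocations of a warm instance, it may be worth doing the work in a temporary directory that is removed afterwards. A rough sketch, where with_scratch_pdf and the work callback are hypothetical names introduced only for illustration:

```python
import os
import tempfile


def with_scratch_pdf(pdf_bytes, work, base_dir="/tmp"):
    """Write pdf_bytes to a temporary directory under base_dir, call
    work(input_path), and delete the directory when done."""
    with tempfile.TemporaryDirectory(dir=base_dir) as workdir:
        input_path = os.path.join(workdir, "input.pdf")
        with open(input_path, "wb") as f:
            f.write(pdf_bytes)
        # The conversion and any upload should happen inside `work`,
        # because the files are gone once this block exits.
        return work(input_path)
```

For example, with_scratch_pdf(r.content, convert) would pass the path of the downloaded file to a convert function of your own.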
An example terminal command to convert a PDF to JPEG looks like this:
gs -dSAFER -dNOPAUSE -dBATCH -sDEVICE=jpeg -dJPEGQ=100 -r300 -sOutputFile=output/file/path input/file/path
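Sample code to run the command from Python (this goes inside your function; bucket_name, file_name, input_file_path and output_file_path are yours to define):
import logging
import subprocess

from google.cloud import storage

# download the file from google cloud storage
# (the project is inferred from the environment; GCP_PROJECT is not set on newer runtimes)
gcs = storage.Client()
bucket = gcs.bucket(bucket_name)
blob = bucket.blob(file_name)
blob.download_to_filename(input_file_path)

# run ghostscript; the arguments are passed as a list, so no shell quoting is needed
cmd = ['gs', '-dSAFER', '-dNOPAUSE', '-dBATCH', '-sDEVICE=jpeg', '-dJPEGQ=100', '-r300',
       f'-sOutputFile={output_file_path}', input_file_path]
p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
stdout, stderr = p.communicate()
# a non-zero exit status means the conversion failed
if p.returncode != 0:
    logging.error(stderr.decode('utf8'))
    return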
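The command above writes to a single output path. For a multi-page PDF, Ghostscript accepts a printf-style placeholder such as %03d in -sOutputFile, so that each page gets its own numbered file; without it, pages may end up overwriting one another. A minimal sketch of that variant, where build_gs_command and convert_pdf are hypothetical helper names and the file paths are only examples:

```python
import subprocess


def build_gs_command(input_path, output_pattern, dpi=300, quality=100):
    """Build a Ghostscript argument list that renders each page to a JPEG."""
    return [
        "gs", "-dSAFER", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=jpeg",
        f"-dJPEGQ={quality}",
        f"-r{dpi}",
        f"-sOutputFile={output_pattern}",
        input_path,
    ]


def convert_pdf(input_path, output_pattern):
    """Run Ghostscript and raise if it exits with a non-zero status."""
    cmd = build_gs_command(input_path, output_pattern)
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Ghostscript failed: {result.stderr}")
    return result.stdout


# With this pattern the pages would be named /tmp/page-001.jpg, /tmp/page-002.jpg, ...
print(build_gs_command("/tmp/sample.pdf", "/tmp/page-%03d.jpg"))
```

Building the argument list separately from running it keeps the command easy to inspect and log. Lowering the resolution (e.g. -r150) reduces both processing time and output size if 300 dpi is more than you need.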
Note: You might want to use the imagemagick package instead, which itself uses Ghostscript. However, as mentioned in Can't load PDF with Wand/ImageMagick in Google Cloud Function, PDF reading by ImageMagick has been disabled because of a security vulnerability in Ghostscript, and it was still disabled at the time of writing (2021-07-12). The solution provided there is essentially another way to run Ghostscript.
Reference: https://www.the-swamp.info/blog/google-cloud-functions-system-packages/