
Get pdftotext Python module running on Lambda


I need to get the pdftotext Python library running under Python 3.8.6 in an AWS Lambda function.

I have the library installed and running on an Amazon Linux AMI. However, when I copy the library files into a Lambda function, I get:

[ERROR] ModuleNotFoundError: No module named 'pdftotext' Traceback (most recent call last)

The Lambda function has the Python path set to the site-packages directory, which I have confirmed matches the layout on the Amazon Linux instance. Other libraries in the same directory import fine.

The Python package is a compiled binary (pdftotext.cpython-38-x86_64-linux-gnu.so), so I'm assuming the binary built on the Amazon Linux AMI isn't compatible with Lambda. Perhaps that is why it can't be imported.

I've also tried installing the library in the Amazon Linux Docker container (amazonlinux:2018.03), but importing the module there fails with the following error:

ImportError: /root/package/lib/pdftotext.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN7poppler8document18load_from_raw_dataEPKciRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_

Has anybody got this working? Or does anyone have ideas on things I can try, or steps I can take to troubleshoot?


Solution

  • Based on the comments.

    The issue was caused by building on Amazon Linux 1 (AL1) instead of Amazon Linux 2 (AL2). The Lambda environment for Python 3.8 is based on AL2, not AL1, so shared objects taken from AL1 do not match what the runtime provides.

    The solution was to take the shared objects needed for pdftotext from AL2 rather than from AL1.
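
    Before copying shared objects into a deployment package, it can help to confirm which Amazon Linux generation the build host (or container) actually is. The following is a minimal sketch, not part of the original answer. It assumes the host exposes a standard /etc/os-release file and that AL2 reports ID "amzn" with VERSION_ID "2", while AL1 images such as amazonlinux:2018.03 report a VERSION_ID like "2018.03". The function names are hypothetical helpers introduced for illustration.

```python
from pathlib import Path


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip().strip('"')
    return info


def is_amazon_linux_2(info):
    """Return True if the parsed os-release data looks like Amazon Linux 2.

    Assumption: AL2 reports ID=amzn and VERSION_ID=2.
    """
    return info.get("ID") == "amzn" and info.get("VERSION_ID") == "2"


def describe_build_host(path="/etc/os-release"):
    """Summarise whether this host looks suitable for building AL2 binaries."""
    os_release = Path(path)
    if not os_release.exists():
        return "No os-release file found; cannot identify the distribution."
    info = parse_os_release(os_release.read_text())
    name = info.get("PRETTY_NAME", "unknown distribution")
    if is_amazon_linux_2(info):
        return f"{name}: looks like AL2."
    return f"{name}: does not look like AL2; binaries built here may not load on Lambda."


if __name__ == "__main__":
    print(describe_build_host())
```

    Running the script inside the build container before compiling pdftotext gives a quick sanity check. If it does not report AL2, the build and the copied poppler shared objects likely come from the wrong base image.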