I'm trying to convert a PDF to images via pdf2image and poppler, so that I can then run some computer vision tasks on them.
The conversion itself works fine.
However, the conversion creates some artifacts for each page of the PDF as it is being converted, and I would like these to be deleted at the end of the function. To facilitate this, I am using tempfile.TemporaryDirectory(). The function looks as follows:
with tempfile.TemporaryDirectory() as path:
    images_from_path: [Image] = convert_from_path(
        os.path.join(path_superfolder, "calibration_target.pdf"),
        size=(2480, 3508),
        output_folder=path, poppler_path=r'E:\poppler-22.04.0\Library\bin')
    if len(images_from_path) >= page:
        images_from_path[page - 1].save(os.path.join(path_superfolder, "result.jpg"))
The trouble is that the program always crashes with the following errors after transforming the PDF and writing the required image to a file:
Traceback (most recent call last):
File "C:\Python310\lib\shutil.py", line 617, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file, because it is being used by another process: 'C:\\Users\\tobia\\AppData\\Local\\Temp\\tmp24c4bmzv\\bd76d834-672e-49fc-ac30-7751b7b660d0-01.ppm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python310\lib\tempfile.py", line 843, in onerror
_os.unlink(path)
PermissionError: [WinError 32] The process cannot access the file, because it is being used by another process: 'C:\\Users\\tobia\\AppData\\Local\\Temp\\tmp24c4bmzv\\bd76d834-672e-49fc-ac30-7751b7b660d0-01.ppm'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python310\lib\code.py", line 90, in runcode
exec(code, self.locals)
File "<input>", line 1, in <module>
File "E:\PyCharm 2022.2.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "E:\PyCharm 2022.2.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "D:\Dokumente\Uni\Informatik\BA_Thesis\tumexam-scheduling-codebase\generate_data.py", line 393, in <module>
extract_calibration_page_as_image_from_pdf()
File "D:\Dokumente\Uni\Informatik\BA_Thesis\tumexam-scheduling-codebase\generate_data.py", line 190, in extract_calibration_page_as_image_from_pdf
tmp_dir.cleanup()
File "C:\Python310\lib\tempfile.py", line 873, in cleanup
self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)
File "C:\Python310\lib\tempfile.py", line 855, in _rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\Python310\lib\shutil.py", line 749, in rmtree
return _rmtree_unsafe(path, onerror)
File "C:\Python310\lib\shutil.py", line 619, in _rmtree_unsafe
onerror(os.unlink, fullname, sys.exc_info())
File "C:\Python310\lib\tempfile.py", line 846, in onerror
cls._rmtree(path, ignore_errors=ignore_errors)
File "C:\Python310\lib\tempfile.py", line 855, in _rmtree
_shutil.rmtree(name, onerror=onerror)
File "C:\Python310\lib\shutil.py", line 749, in rmtree
return _rmtree_unsafe(path, onerror)
File "C:\Python310\lib\shutil.py", line 600, in _rmtree_unsafe
onerror(os.scandir, path, sys.exc_info())
File "C:\Python310\lib\shutil.py", line 597, in _rmtree_unsafe
with os.scandir(path) as scandir_it:
NotADirectoryError: [WinError 267] Directory name invalid: 'C:\\Users\\tobia\\AppData\\Local\\Temp\\tmp24c4bmzv\\bd76d834-672e-49fc-ac30-7751b7b660d0-01.ppm'
When stepping through the cleanup routine, everything seems fine at first: the path is correct and it starts deleting files. At some point, however, the internal path variable gets jumbled up and the routine crashes, because a file is obviously not a directory. To me, it looks like a race condition is causing the problem.
tmp_dir.cleanup()
While experimenting some more and writing this question, I found a working solution:
with tempfile.TemporaryDirectory() as path:
    images_from_path: [Image] = convert_from_path(
        os.path.join(path_superfolder, f"calibration_target_{exam_type}.pdf"),
        size=(2480, 3508),
        output_folder=path, poppler_path=r'E:\poppler-22.04.0\Library\bin')
    if len(images_from_path) >= page:
        images_from_path[page - 1].save(os.path.join(path_superfolder, "result.jpg"))
    images_from_path = []
It seems that the routine had trouble cleaning up because the converted images are actually the artifacts created by pdf2image, and they were still being held by my data structure. Resetting the data structure before the cleanup is implicitly triggered fixed the issue.
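A more explicit variant of the same idea would be to close every handle deterministically before the directory is removed, rather than relying on the list being reset. The following is only a minimal sketch using the standard library: plain file handles stand in for the images returned by pdf2image, and the function name `read_page_artifact`, the `n_pages` parameter, and the `page-NN.ppm` file names are made up for illustration.

```python
import os
import tempfile
from contextlib import ExitStack


def read_page_artifact(page: int, n_pages: int = 3) -> tuple[bytes, str]:
    """Return the content of one page artifact and the (already removed) temp dir path."""
    with tempfile.TemporaryDirectory() as path:
        # Stand-in for convert_from_path(..., output_folder=path):
        # write one artifact file per page into the temporary directory.
        names = []
        for i in range(1, n_pages + 1):
            name = os.path.join(path, f"page-{i:02d}.ppm")
            with open(name, "wb") as f:
                f.write(f"page {i}".encode())
            names.append(name)

        with ExitStack() as stack:
            # Each open handle plays the role of an image that still
            # references its artifact file on disk.
            handles = [stack.enter_context(open(name, "rb")) for name in names]
            data = handles[page - 1].read() if len(handles) >= page else b""
        # The ExitStack has closed every handle at this point, so nothing
        # keeps the files locked when the directory is cleaned up on exit.
        return data, path


data, tmp_path = read_page_artifact(2)
print(data, os.path.exists(tmp_path))
```

With real pdf2image output, the same pattern would presumably be to register each returned image with the stack (PIL images expose `close()`), and to save the required page inside the inner `with` block. I have not verified this against pdf2image itself, so treat it as an assumption.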
If there is a better way of tackling this issue, please do not hesitate to inform me.