
ocrmypdf - could not find source-pdf?


I would like to use ocrmypdf to convert a PDF file that contains only a scanned image into a searchable PDF.

I tried it with the following simple code (invoice.pdf is in the same folder as the Python script, and output.pdf should be generated):

import ocrmypdf
if __name__ == '__main__':
  fn = r"C:\Users\Polzi\Documents\DEV\Python-Diverses\PDFOCR\invoice.pdf"
  ocrmypdf.ocr(fn, 'output.pdf', deskew=True)

Unfortunately, I get this error message:

$ python exPDFOCR.py
[WinError 2] Das System kann die angegebene Datei nicht finden
Traceback (most recent call last):
  File "C:\Users\Polzi\Documents\DEV\Python-Diverses\PDFOCR\exPDFOCR.py", line 25, in <module>
    ocrmypdf.ocr('invoice.pdf', 'output.pdf', deskew=True)
  File "C:\Users\Polzi\Documents\DEV\.venv\testing\lib\site-packages\ocrmypdf\api.py", line 336, in ocr
    check_options(options, plugin_manager)
  File "C:\Users\Polzi\Documents\DEV\.venv\testing\lib\site-packages\ocrmypdf\_validation.py", line 271, in check_options
    ocr_engine_languages = plugin_manager.hook.get_ocr_engine().languages(options)
  File "C:\Users\Polzi\Documents\DEV\.venv\testing\lib\site-packages\ocrmypdf\builtin_plugins\tesseract_ocr.py", line 155, in languages
    return tesseract.get_languages()
  File "C:\Users\Polzi\Documents\DEV\.venv\testing\lib\site-packages\ocrmypdf\_exec\tesseract.py", line 143, in get_languages
    proc = run(
  File "C:\Users\Polzi\Documents\DEV\.venv\testing\lib\site-packages\ocrmypdf\subprocess\__init__.py", line 53, in run
    proc = subprocess_run(args, env=env, **kwargs)
  File "c:\users\polzi\appdata\local\programs\python\python39\lib\subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "c:\users\polzi\appdata\local\programs\python\python39\lib\subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "c:\users\polzi\appdata\local\programs\python\python39\lib\subprocess.py", line 1420, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] Das System kann die angegebene Datei nicht finden

Why can't it find the file, which is in the same folder as the .py file being executed?


Solution

  • The first error message can be misleading because it does not name its cause.

    In this case, the primary message "The system cannot find the specified file" (the English form of the German text in the output above) leads the user to concentrate on why the filename might be wrong.

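    What the error should report is that a required dependency was not found. The traceback shows the failure in tesseract.get_languages(), where ocrmypdf starts Tesseract as a subprocess, so the file Windows could not find is the Tesseract program, not invoice.pdf. This can happen when Tesseract, or the related Leptonica or language data files, are not in the expected location, either because they were never installed or because the install is incomplete.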

    It transpired that after installing Tesseract on Windows from https://github.com/UB-Mannheim/tesseract/wiki, "the script now works fine".

    Note that a missing dependency was also the cause of a similar message in the question "Import ocrmypdf in Visual Stdio Code in Python".
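To get a clearer message than WinError 2, the script can check for the external program before calling ocrmypdf. The following is only a minimal sketch: the helper names find_missing_tools and ocr_with_preflight are made up for illustration, and it assumes the Tesseract executable is named "tesseract" and is meant to be found through PATH.

```python
import shutil


def find_missing_tools(tools=("tesseract",)):
    """Return the names in `tools` that cannot be found on PATH.

    The default tuple is an assumption; extend it with any other
    external programs your setup needs.
    """
    return [name for name in tools if shutil.which(name) is None]


def ocr_with_preflight(input_pdf, output_pdf):
    """Hypothetical wrapper: fail early with a readable message."""
    missing = find_missing_tools()
    if missing:
        raise RuntimeError(
            "Missing external program(s): " + ", ".join(missing)
            + ". Install them and add their folder to PATH."
        )

    # Imported here so the preflight check runs before ocrmypdf is loaded.
    import ocrmypdf

    # Same call as in the question.
    ocrmypdf.ocr(input_pdf, output_pdf, deskew=True)
```

Calling ocr_with_preflight("invoice.pdf", "output.pdf") on a machine without Tesseract would then raise an error that names the missing program instead of pointing at the PDF. If Tesseract is installed but the check still fails, the install folder is probably not on PATH for the environment that runs the script.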