python python-tesseract python-watchdog

How come Python says my file doesn't exist, even though it has already been accessed earlier in the code?


My main question is: why am I getting an error saying that the file or directory doesn't exist? The file has already been accessed and modified earlier in the code.

I am writing a program to monitor a network and scan all new files to ensure that all PDFs are searchable. When the program finds a PDF that isn't searchable, it OCRs the document and replaces the original PDF with the new searchable one. It worked when I was using one specific folder and path, but I'm now modifying the code so that it can monitor the whole C: drive, and it keeps breaking when it tries to OCR the created JPGs.

I have this code running to monitor the folder:

import watchdog.events
import watchdog.observers

class Handler(watchdog.events.PatternMatchingEventHandler):
    def __init__(self):
        watchdog.events.PatternMatchingEventHandler.__init__(self, patterns=['*.pdf'],
                                                           ignore_patterns = None,
                                                           ignore_directories = False,
                                                           case_sensitive = False)
    def on_created(self, event):
        print(f"File was created at {event.src_path}")
        OCRscript(self, event)
    def on_deleted(self, event):
        print(f"File was deleted at {event.src_path}")

event_handler = Handler()
observer = watchdog.observers.Observer()
observer.schedule(event_handler, "C://Users//Installer//Desktop//tesseract test",
                  recursive = False)
observer.start()
observer.join()

It then sends the event through a few functions to determine whether the new file is a PDF and whether it is searchable. If it comes back as a PDF that is not searchable, the program creates a folder, converts each page into a JPG, and saves each JPG to the new folder. From there it's supposed to OCR each JPG and convert it into a searchable PDF using the OCR data. It then runs a function to merge the searchable PDFs.

I have tried changing the variables where I'm storing the path to use one or two '/' characters, in case the computer was having trouble processing the stored path because of that. It worked great when I typed the path in directly instead of storing it in a variable and then using that variable.
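As a quick check on the slash theory, here is a minimal sketch using only the standard library. It assumes nothing about the rest of the script; `mixed` and `clean` are illustrative names, and the path is the one printed in the output further down. `PureWindowsPath` only parses the string and never touches the disk, so it shows how Windows-style path handling treats mixed and doubled separators.

```python
from pathlib import PureWindowsPath

# The path as watchdog reported it: doubled forward slashes plus one backslash.
mixed = PureWindowsPath(
    "C://Users//Installer//Desktop//tesseract test\\File Destruction Policy (2).pdf"
)

# The same location written with single backslashes only.
clean = PureWindowsPath(
    r"C:\Users\Installer\Desktop\tesseract test\File Destruction Policy (2).pdf"
)

print(mixed)           # the normalized form pathlib settles on
print(mixed == clean)  # whether both spellings name the same path
```

If both spellings compare equal, the separators are unlikely to be what makes the file "not exist", and the file name itself is the next thing to check.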

EDIT: I just changed the OCR function a little and I'm now getting a new error. I realized I was still trying to open the JPG instead of converting it. Here is the updated code and the new error.

def OCR(pages):
    input_dir = f"{folder_save_point}\\{file_name}_page_{pages}.jpg"
    img = cv2.imread(input_dir, 1)
    result = pytesseract.image_to_pdf_or_hocr(img, lang = "eng",
                                              config = tessdata_dir_config)
    f = open((folder_save_point + '//' + file_name+ '_page_' + str(pages) + '.pdf'), "w+b")
    f.write(bytearray(result))
    f.close()
    print("Page " , pages, " OCRed.")  

Error

Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\observers\api.py", line 205, in run
    self.dispatch_events(self.event_queue)
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\observers\api.py", line 381, in dispatch_events
    handler.dispatch(event)
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\events.py", line 403, in dispatch
    super().dispatch(event)
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\events.py", line 272, in dispatch
    {
  File "C:\Users\Installer\Desktop\Scripts\tesseract_OCR_2.0.py", line 21, in on_created
    OCRscript(self, event)
  File "C:\Users\Installer\Desktop\Scripts\tesseract_OCR_2.0.py", line 128, in OCRscript
    OCR(pages)
  File "C:\Users\Installer\Desktop\Scripts\tesseract_OCR_2.0.py", line 68, in OCR
    result = pytesseract.image_to_pdf_or_hocr(img, lang = "eng",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 446, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 277, in run_and_get_output
    with save(image) as (temp_name, input_filename):
  File "C:\Program Files\Python311\Lib\contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 197, in save
    image, extension = prepare(image)
                       ^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 174, in prepare
    raise TypeError('Unsupported image object')
TypeError: Unsupported image object

EDIT 2: I just removed the old code. I realized the old error was related to me trying to open the file and not just from trying to OCR the file.

EDIT 3: I just added a few print statements to see the path a little better. I saw that it had a mix of \ and //, so I added a couple of .replace() calls to get the path into a single format. I then tried printing the img variable, and it's returning 'None'. This makes me think that cv2.imread() is still having trouble reading the path even with it formatted better. Here is the updated code and the updated error message.

        input_dir = f"{folder_save_point}\\{file_name}_page_{pages}.jpg"
        input_dir = input_dir.replace('''//''', "\\")
        print(input_dir)
        input_dir = input_dir.replace('''\\''', "//")
        print(input_dir)
        img = cv2.imread(input_dir, 1)
        print(img)
        result = pytesseract.image_to_pdf_or_hocr(img, lang = "eng",
                                                  config = tessdata_dir_config)
        input_dir = input_dir.replace('.jpg', '.pdf')
        print(input_dir)
        f = open(input_dir, "w+b")

Output:

File was created at C://Users//Installer//Desktop//tesseract test\File Destruction Policy (2).pdf
File Destruction Policy (2)  needs OCRed
Converting to JPG
Page 1 converted to JPG.
Page 2 converted to JPG.
Page 3 converted to JPG.
File Destruction Policy (2) converted to JPG.
C:\Users\Installer\Desktop\tesseract test\File Destruction Policy (2)_ocr_data\File Destruction Policy (2)_page_1.jpg
C://Users//Installer//Desktop//tesseract test//File Destruction Policy (2)_ocr_data//File Destruction Policy (2)_page_1.jpg
None
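The `None` printed above is what `cv2.imread()` returns when it cannot read the file; it does not raise an error itself, which is why the failure only surfaces later inside pytesseract. A small diagnostic along these lines can show what is actually in the folder. This is only a sketch: `explain_missing` is a made-up helper name, not part of OpenCV or pytesseract, and it uses just the standard library.

```python
from pathlib import Path

def explain_missing(path_str):
    """Report whether a file exists and, if not, what its folder contains.

    Hypothetical helper for debugging; not part of any library.
    """
    path = Path(path_str)
    if path.is_file():
        return f"found: {path.name}"
    if not path.parent.is_dir():
        return f"folder does not exist: {path.parent}"
    # The folder exists but the file does not: list what is really there,
    # so a mismatch in the file name is easy to spot.
    names = sorted(p.name for p in path.parent.iterdir())
    return f"not found: {path.name}; folder contains: {names}"
```

Calling `print(explain_missing(input_dir))` just before `cv2.imread(input_dir, 1)` would show the real JPG names next to the one the code is asking for.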

Error message:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Program Files\Python311\Lib\threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\observers\api.py", line 205, in run
    self.dispatch_events(self.event_queue)
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\observers\api.py", line 381, in dispatch_events
    handler.dispatch(event)
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\events.py", line 403, in dispatch
    super().dispatch(event)
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\watchdog\events.py", line 272, in dispatch
    {
  File "C:\Users\Installer\Desktop\Scripts\tesseract_OCR_2.0.py", line 20, in on_created
    OCRscript(self, event)
  File "C:\Users\Installer\Desktop\Scripts\tesseract_OCR_2.0.py", line 134, in OCRscript
    OCR(pages)
  File "C:\Users\Installer\Desktop\Scripts\tesseract_OCR_2.0.py", line 72, in OCR
    result = pytesseract.image_to_pdf_or_hocr(img, lang = "eng",
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 446, in image_to_pdf_or_hocr
    return run_and_get_output(*args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 277, in run_and_get_output
    with save(image) as (temp_name, input_filename):
  File "C:\Program Files\Python311\Lib\contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 197, in save
    image, extension = prepare(image)
                       ^^^^^^^^^^^^^^
  File "C:\Users\Installer\AppData\Roaming\Python\Python311\site-packages\pytesseract\pytesseract.py", line 174, in prepare
    raise TypeError('Unsupported image object')
TypeError: Unsupported image object

Solution

  • I solved it. When the program converted each page to a JPG, it saved each file with the folder name included in the filename. When I then tried to open each JPG, I wasn't accounting for the folder name in the new JPG name. As soon as I corrected the code to account for that addition to the filename, it ran on the first try.

    I changed the input_dir variable to include the additional information.

    input_dir = f"{folder_save_point}\\{folder_name}{file_name}_page_{pages}.jpg"
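One way to keep the two names from drifting apart again is to build the JPG path in a single helper that both the page-conversion step and the OCR step call. The sketch below is an assumption about how that could look, not the code from the actual script: `page_jpg_path` is a hypothetical name, the argument values in the example are made up, and `PureWindowsPath` is used so the result looks the same on any machine (on Windows, `pathlib.Path` could be used instead).

```python
from pathlib import PureWindowsPath

def page_jpg_path(folder_save_point, folder_name, file_name, page):
    """Build the path of one page's JPG (hypothetical helper).

    Mirrors the corrected f-string above: the folder name is part of
    the JPG's file name, directly in front of the original file name.
    """
    jpg_name = f"{folder_name}{file_name}_page_{page}.jpg"
    return PureWindowsPath(folder_save_point) / jpg_name

# Made-up example values, for illustration only.
jpg = page_jpg_path(r"C:\data\scan_ocr_data", "scan_ocr_data", "scan", 1)
pdf = jpg.with_suffix(".pdf")  # matching output name for the OCRed page

print(jpg)
print(pdf)
```

Because the writer and the reader both call the same function, any later change to the naming scheme only has to be made in one place.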