tesseract databricks azure-databricks python-tesseract

How to install Tesseract OCR on Databricks


I am trying to run the following script in a Databricks Python notebook:

pip install presidio-image-redactor
pip install pytesseract
python -m spacy download en_core_web_lg

from PIL import Image
from presidio_image_redactor import ImageRedactorEngine
import pytesseract

image = Image.open("images/ImageData.PNG")

engine = ImageRedactorEngine()

redacted_image = engine.redact(image, (255, 192, 203))

Upon running the last line, I'm getting the error below:

TesseractNotFoundError: tesseract is not installed or it's not in your PATH.

Am I missing anything?


Solution

  • pytesseract is only a Python wrapper; it needs the Tesseract binary installed on the machine. You can use %sh in a separate cell to run shell commands on the driver node. To install Tesseract there:

    %sh apt-get -f -y install tesseract-ocr 
    

    If you need it on every node of the cluster, use a cluster init script containing the same command (without %sh).
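
After the install cell has run, a quick check from a Python cell can confirm that the binary is actually visible before calling engine.redact again. The following is a minimal sketch using only the standard library. The helper names find_tesseract and build_init_script are hypothetical, and the shebang line in the generated script is an assumption added here. Where and how the script is uploaded and attached to the cluster depends on your workspace, so that step is left out.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the binary if it is on PATH, else None."""
    return shutil.which(binary)


def build_init_script():
    """Return the text of an init script that runs the same install command.

    The shebang line is an assumption; the apt-get line is the command
    from the answer, without the %sh prefix.
    """
    return "\n".join([
        "#!/bin/bash",
        "apt-get -f -y install tesseract-ocr",
        "",
    ])


path = find_tesseract()
if path is None:
    print("tesseract is not on the driver PATH; run the %sh install cell first")
else:
    # Run the binary once to confirm it starts, and show its version banner.
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    print(path)
    print(result.stdout or result.stderr)

# Show the script text that would go into the cluster init script.
print(build_init_script())
```

If the check prints a path but pytesseract still raises TesseractNotFoundError, the notebook's Python process may have a different PATH than the shell; in that case pytesseract.pytesseract.tesseract_cmd can be set to the printed path.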