javascriptnode.jstypescriptocrtesseract.js

How to improve tesseract.js accuracy?


Im using this piece of code from the website but its not accurate enough

 const worker1 = createWorker();
  const worker2 = createWorker();

  await worker1.load();
  await worker2.load();
  await worker1.loadLanguage("eng");
  await worker2.loadLanguage("eng");
  await worker1.initialize("eng");
  await worker2.initialize("eng");

  scheduler.addWorker(worker1);
  scheduler.addWorker(worker2);

  /** Add 10 recognition jobs */
  const {
    data: { text }
  } = await scheduler.addJob("recognize", image);

this is the type of image i'm trying to read its text:

enter image description here

thou it seems simple and easy ,sometimes tesseract fails to read it . is there any better alternatives to tesseract.js or any way to improve the accuracy?


Solution

  • When applying OCR using Tesseract, it is important to preprocess the image so that the desired text to detect is in black with the background in white. To do this, you can apply a simple threshold to obtain a binary image. Here's the image after preprocessing:

    enter image description here

    Result from Tesseract

    52024
    

    I implemented this approach in Python OpenCV, but you can adapt a similar strategy into Javascript!

    import cv2
    import pytesseract
    
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    
    # Load image and Otsu's Threshold to get a binary image
    image = cv2.imread('1.png', 0)
    thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    
    # Perform OCR
    data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
    print(data)
    
    cv2.imshow('thresh', thresh)
    cv2.waitKey()