
pytesseract results different from tesseract command line results


I am trying to convert a scanned page to text using both pytesseract and the tesseract command line on Ubuntu. The results are remarkably different (pytesseract performs far better than the tesseract command line), and I cannot understand why. I looked at the default parameter values and tried changing some of them on the tesseract command line (such as psm), but I cannot get the same result as pytesseract. Because pytesseract lacks proper documentation, I cannot figure out which default parameter values it uses.

Here is my pytesseract code:

    print(pytesseract.image_to_string(Image.open('test.tiff')))


Solution

  • Looking at the source code of pytesseract, it seems the image is always converted into a .bmp file before it is passed to Tesseract. Running Tesseract at the command line on a .bmp file with a psm of 6 gives the same result as pytesseract. Also, Tesseract can work with uncompressed .bmp files only. Hence, if ImageMagick is used to convert a .pdf to .bmp, the following will work:

    convert -density 300 -quality 100 mypdf.pdf BMP3:mypdf.bmp
    tesseract mypdf.bmp -psm 6 mypdf txt
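
To compare the two paths side by side, it can help to set the page segmentation mode explicitly on both instead of relying on defaults. The following is a minimal sketch, not pytesseract's own implementation: the helper names (`psm_option`, `cli_command`, `ocr_with_pytesseract`, `ocr_with_cli`) are made up for illustration, and it assumes the `tesseract` binary is on the PATH. It passes the mode through the `config` argument of `pytesseract.image_to_string`. The flag is spelled `-psm` in Tesseract 3.x (as in the command above) and `--psm` in 4.x and later, so the major version is a parameter here.

```python
import subprocess


def psm_option(psm: int, tesseract_major: int = 4) -> str:
    """Return the page segmentation option for a given Tesseract major version."""
    # Tesseract 3.x spells the flag -psm; 4.x and later spell it --psm.
    flag = "--psm" if tesseract_major >= 4 else "-psm"
    return f"{flag} {psm}"


def cli_command(image_path: str, output_base: str, psm: int = 6,
                tesseract_major: int = 4) -> list:
    """Build the tesseract command line that mirrors the pytesseract call below."""
    return ["tesseract", image_path, output_base,
            *psm_option(psm, tesseract_major).split()]


def ocr_with_pytesseract(image_path: str, psm: int = 6,
                         tesseract_major: int = 4) -> str:
    """Run OCR through pytesseract with an explicit page segmentation mode."""
    # Imported lazily so the helpers above work without pytesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), config=psm_option(psm, tesseract_major)
    )


def ocr_with_cli(image_path: str, output_base: str, psm: int = 6,
                 tesseract_major: int = 4) -> str:
    """Run the tesseract binary directly and read back the text it writes."""
    subprocess.run(cli_command(image_path, output_base, psm, tesseract_major),
                   check=True)
    # tesseract appends .txt to the output base name by default.
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read()
```

Calling `ocr_with_pytesseract('mypdf.bmp')` and `ocr_with_cli('mypdf.bmp', 'mypdf')` with the same `psm` and the same input file should then be comparable; any remaining difference would point to the input image rather than the segmentation mode.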