Using Tesseract
PS C:\Program Files\Tesseract-OCR> .\tesseract --version
tesseract v5.3.0.20221222
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
I have tested Tesseract successfully on the command line:
PS C:\Program Files\Tesseract-OCR> .\tesseract C:\ocr\target\31832_226140__0001-00002b.jpg C:\ocr\results\31832_226140__0001-00002bb6523dpi300fullest --dpi 300 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. '
Partial output
269 Wellington Road Wainumomats Marned 101 ARNOLD. Frank Witham ...............................15 Rossiter Avenue.Lower Hutt. Butcher
002 ANKER. Doreen Akson .............................4 Bledisioe Crescent. Wamuiomata. Teacher 102 ARONA. Amosa ...............0000...........3 Donnelley Drve.Wasnuiomata.Pub. Servant
004 ANKER. Robert James ..........................269 Wellington Road.WainuiomataBank Off 104 ARPS. Velde Lucia ................ ..........53 Westminster Road Wamnuomata Resch Intvr
005 ANNESLEV. Boyne Evan .............................. 13 Manurewa GroveWainwomata Clerk 105 ARPS. Wilkem David ..........................53 Westmnster Road. Waimuomata.Foreman
006 ANNESLEY. Janet Maree ....................13 Manurewa Grove Wainuomats Housewite 106 ARROWSMITH. Margaret Bessie .... ... . 4 Isabel Grove. Wainuiomata. Mamed
007 ANSELL. Anme Ena Elizabeth .........................3 Lewghton Av. Lower Hutt. Homemaker 107 ARROWSMITH. Morns Anthony ................ . 4 Isabel Gr Wamuomata Fetry Magr
O08 ANGELL. Eb se by oe ceeseceereeess 76 Bell Road. Lower Hutt. Housewrfe
I need to process hundreds of files, so I downloaded and installed pytesseract.
Successfully installed pytesseract-0.3.10
I upgraded pip
Successfully installed pip-23.0.1
I have run tox
PS C:\Program Files\Tesseract-OCR> tox
ROOT: No tox.ini or setup.cfg or pyproject.toml found, assuming empty tox.ini at C:\Program Files\Tesseract-OCR
py: OK (4.34 seconds)
congratulations :) (4.67 seconds)
However, when I run the following Python script, which uses the same tesseract.exe, interword spacing is not preserved:
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
image = 'C:\\ocr\\target\\31832_226140__0001-00002b.jpg'
target = pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. ')
print(target)
Partial output
.269WellngtonRoadWainumomatsMarned 101ARNOLD.FrankWitham...............................15RossiterAvenue.LowerHutt.Butcher
002ANKER.DoreenAkson.............................4BledisioeCrescent.Wamuiomata.Teacher 102ARONA.Amosa...............0000...........3DonnelleyDrve.Wasnuiomata.Pub.Servant
004ANKER.RobertJames..........................269WellingtonRoad.WainuiomataBankOff 104ARPS.ValdaLucis..........................53WestminsterRoadWamnuomataReschIntvr
005ANNESLEV.BoyneEvan..............................13ManurewaGroveWainwomataClerk 105ARPS.WilkemDavid..........................53WestmnsterRoad.Waimuomata.Foreman
006ANNESLEY.JanotMaree....................13ManurewaGroveWainuomatsHousewite 106ARROWSMITH.MargaretBessie........4IsabelGrove.Wainuiomata.Mamed
007ANSELL.AnmeEnaElizabeth.........................3LewghtonAv.LowerHutt.Homemaker 107ARROWSMITH.MornsAnthony.................4IsabelGrWamuomata.FetryMagr
O008ANMGELL.Ebsebyyceeseceereeess76BellRoad.LowerHutt.Housewrfe 108ARTHUR.BruceJames....................65MoohanStreet.WainuomataApp.Mouider
Can anyone see why this pytesseract image_to_string call is not honouring the preserve_interword_spaces=1 config parameter, as the tesseract command-line example does?
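The answer is to make sure that you are NOT omitting the space character from the whitelist. If the space is missing, spaces are effectively removed from the output, which makes it look as though the preserve_interword_spaces=1 parameter is not functioning. In the script above, the whitelist is not quoted, so its trailing space is evidently lost before it reaches Tesseract.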
For reference, the correct command should have been:
target = pytesseract.image_to_string(image, config='--dpi 96 --psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789. "')
print(target)
The use of single and double quotes is important: the single quotes surround the complete config string, and the double quotes delimit the literal whitelist, including its trailing space.
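To see why the quoting matters, here is a minimal sketch that splits both config strings shell-style with Python's standard shlex module. It only illustrates general POSIX-style tokenization; that pytesseract splits the config in a comparable way is an assumption on my part, and the exact behaviour may differ by platform and version.

```python
import shlex
import string

# Letters, digits, full stop and a trailing space, as in the question.
WHITELIST = string.ascii_lowercase + string.ascii_uppercase + string.digits + ". "

prefix = "--psm 6 -c preserve_interword_spaces=1 -c tessedit_char_whitelist="
unquoted = prefix + WHITELIST              # whitelist left bare
quoted = prefix + '"' + WHITELIST + '"'    # whitelist wrapped in double quotes

# POSIX-style splitting: unquoted whitespace separates arguments,
# so a bare trailing space never becomes part of the last argument.
unquoted_args = shlex.split(unquoted)
quoted_args = shlex.split(quoted)

print(repr(unquoted_args[-1][-3:]))  # '89.'  (space dropped)
print(repr(quoted_args[-1][-3:]))    # '9. '  (space kept)
```

Under that assumption, the unquoted version hands Tesseract a whitelist with no space in it, which matches the run-together output in the question.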
It would seem from this that the whitelist has precedence over the preserve_interword_spaces parameter. The preserve_interword_spaces parameter may be redundant if you are including a space in your whitelist.
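Since the original goal was to process hundreds of files, the corrected config can be reused in a loop. The following is a minimal sketch under stated assumptions, not a tested pipeline: the names build_config and ocr_folder, the default of 300 dpi, and the *.jpg pattern are all hypothetical choices for illustration. The OCR step is passed in as a callable so that pytesseract.image_to_string can be supplied where it is installed.

```python
import string
from pathlib import Path

# Whitelist from the answer: letters, digits, full stop and a space.
WHITELIST = string.ascii_lowercase + string.ascii_uppercase + string.digits + ". "


def build_config(dpi=300, psm=6, whitelist=WHITELIST):
    """Build the config string, double-quoting the whitelist so its space survives."""
    return (
        f"--dpi {dpi} --psm {psm} "
        "-c preserve_interword_spaces=1 "
        f'-c tessedit_char_whitelist="{whitelist}"'
    )


def ocr_folder(source_dir, result_dir, ocr, pattern="*.jpg"):
    """Call ocr(path_string, config) for each matching image; write one .txt per image.

    `ocr` is any callable returning text (hypothetical hook), for example
    lambda p, c: pytesseract.image_to_string(p, config=c).
    """
    result_dir = Path(result_dir)
    result_dir.mkdir(parents=True, exist_ok=True)
    config = build_config()
    written = []
    for image_path in sorted(Path(source_dir).glob(pattern)):
        text = ocr(str(image_path), config)
        out_path = result_dir / (image_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written
```

With pytesseract imported and tesseract_cmd set as in the question, a call might look like ocr_folder(r'C:\ocr\target', r'C:\ocr\results', lambda p, c: pytesseract.image_to_string(p, config=c)).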