I am using pdftools in R to extract text from both scanned and text-based PDF files. One problem is with the § character, which tesseract does not recognize.
I looked at the following links: CRAN tesseract package vignette
And I tried the following:
I found the configuration files using tesseract_info() and edited the digits file under configs.
The digits file content was like this:
tessedit_char_whitelist 0123456789.
After editing it looks like this:
tessedit_char_whitelist 0123456789-$§.
This did not change anything at all; I am still not able to extract §. The characters still appear as 8.
After the 1st step failed, I tried the following:
filepng <- pdftools::pdf_convert(filePathPDF, dpi = 600)
specs <- tesseract("deu", options = list(tessedit_char_whitelist = "1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = specs)
This one failed too. I am by no means an expert on OCR, and the tesseract documentation has room for improvement.
How can I add § to the list of characters to be recognized, so that the setting actually applies?
Update
The following works to recognize § when I remove language from the argument list:
charlist <- tesseract(options = list(tessedit_char_whitelist = " 1234567890-.,;:qwertzuiopüasdfghjklöäyxcvbnmQWERTZUIOPÜASDFGHJKLÖÄYXCVBNM@߀!$%&§/()=?+"))
text <- tesseract::ocr(filepng, engine = charlist)
But this time, I am losing German umlauts. I cannot find out how to specify the language and the char_whitelist at the same time. According to the documentation, tesseract() accepts a language argument and an options argument, but combining them does not seem to work. Any ideas?
Update: I tried using tesseract on the command line (macOS Catalina 10.15.7).
I first converted a scanned PDF file to an image, then ran this:
tesseract fileConverted.tiff fileToText
It creates fileToText.txt. It does recognize §; all of them are correctly recognized. But German umlauts are not recognized correctly, since I did not specify a language at all. When I use the same command with the language argument
tesseract fileConverted.tiff fileToText -l deu
German umlauts are recognized properly but § is not.
The digits config file I changed is here:
/usr/local/Cellar/tesseract/4.1.1/share/tessdata/configs
My understanding is that this problem is not specific to R but occurs with tesseract itself. Setting tessedit_char_whitelist and the language at the same time does not seem to be possible, or I am missing something important.
As said above, tesseract 4 does not support setting a whitelist. To work around that problem, you can use command-line switches: set the OCR engine mode to "Original Tesseract only" with --oem 0, then use -c tessedit_char_whitelist=abc... to pass your whitelist directly on the command line.
Overall, it should look something like this:
tesseract fileConverted.tiff fileToText --oem 0 -l deu -c tessedit_char_whitelist=0123456789-$§
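The same call can be wrapped in a small script. This is only a sketch: the file names are the ones used in the question, and it assumes that the installed deu.traineddata contains the legacy model that --oem 0 needs (LSTM-only language files will make tesseract stop with an error instead). The whitelist is placed in single quotes so that the shell passes the $ and § characters through unchanged.

```shell
#!/bin/sh
# File names taken from the question; adjust them to your own files.
input="fileConverted.tiff"
output_base="fileToText"

# Single quotes keep the shell from interpreting "$" in the whitelist.
whitelist='0123456789-$§'
whitelist_arg="tessedit_char_whitelist=$whitelist"

if command -v tesseract >/dev/null 2>&1 && [ -f "$input" ]; then
    # --oem 0 selects the legacy engine, -l deu the German language data,
    # -c sets a single config variable for this run.
    tesseract "$input" "$output_base" --oem 0 -l deu -c "$whitelist_arg"
else
    # Show the command instead of failing if tesseract or the image is missing.
    echo "would run: tesseract $input $output_base --oem 0 -l deu -c $whitelist_arg"
fi
```

If the run succeeds, the recognized text ends up in fileToText.txt, as with the plain command above.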