My workflow:
I need consistent extracted text in order to move from step 3 to step 4. If a page's text is garbled, I have to OCR that page, but OCRing every page is out of the question. How can I identify beforehand which pages should be OCRed? I've tried running pdffonts and pdftohtml on each page, but isn't it expensive to run subprocess.run twice per page?
A problematic page: a PDF page whose text cannot be extracted from its source, perhaps because of a broken or missing ToUnicode mapping.
I'm building an application that relies on extracting text from a thousand PDF files every day. The layout of the text in each PDF is somewhat structured, so calling pdftotext from Python works well in most cases. But some PDF files, coming from one or two sources, contain pages with problematic fonts, which results in garbled text. I think that running OCR only on the problematic pages would be enough to overcome the issue. So my problem is how to identify, before extracting the text, which pages are likely to come out as gibberish.
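For reference, this is roughly how the per-page pdftotext call looks from Python (a minimal sketch; extract_page_text is an illustrative helper and the path and page number are placeholders):

import subprocess

def extract_page_text(pdf_path, page):
    # Run pdftotext on a single page; -f/-l restrict the page range,
    # "-" sends the extracted text to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", "-f", str(page), "-l", str(page), pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout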
First, I tried to identify garbled text after extracting it, using a regex (\p{Cc} or unlikely characters outside the Latin alphabet), but that did not work because I also found corrupted text made of valid letters and digits, e.g. AAAAABS12 54c] $( JJJJ Pk.
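For illustration, this is the kind of check I tried (a sketch using the third-party regex module, which supports \p{...} classes; looks_garbled and the threshold are illustrative). It fails exactly on strings like the one above, because every character in them is a perfectly valid letter or digit:

import regex  # third-party module with Unicode property support

# Control / private-use characters, or characters outside Latin and common scripts.
SUSPECT = regex.compile(r"[\p{Cc}\p{Co}]|[^\p{Latin}\p{Common}]")

def looks_garbled(text, threshold=0.05):
    # Flag a page when the share of suspicious characters is high.
    stripped = "".join(text.split())  # ignore whitespace and line breaks
    if not stripped:
        return False
    hits = len(SUSPECT.findall(stripped))
    return hits / len(stripped) > threshold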
Second, I tried to identify garbled text by calling pdffonts on each page (to get each font's name, encoding, embedding status and whether it has a ToUnicode map) and parsing its output. In my tests this works reasonably well, but I also found it necessary to count how many characters use the likely problematic fonts; pdftohtml, which displays each text block in a p tag along with its font name, saved the day here. @LMC helped me figure out how to do it, take a look at the answer. The bad part is that I ended up calling subprocess.run twice for each PDF page, which is super expensive. It would be cheaper if I could just bind those tools.
I'd like to know whether it is possible and feasible to look at the PDF source and validate the CMap (uni yes and not a Custom-encoded font), if present, or to use other heuristics to find problematic fonts before extracting text or running OCR.
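One way to inspect the PDF source directly from Python, without a subprocess, might be PyMuPDF (fitz). The sketch below lists the fonts of each page and checks the raw font object for a ToUnicode entry; it is only an illustration of the idea (pages_with_suspect_fonts is a made-up name, and the exact encoding strings reported by get_fonts() may vary between versions):

import fitz  # PyMuPDF

def pages_with_suspect_fonts(pdf_path):
    # Return {page_number: [font names]} for fonts that look problematic:
    # no ToUnicode map combined with a Custom or Identity-H encoding.
    doc = fitz.open(pdf_path)
    report = {}
    for page in doc:
        bad = []
        for xref, ext, ftype, basefont, name, encoding in page.get_fonts():
            raw = doc.xref_object(xref) if xref > 0 else ""
            no_tounicode = "/ToUnicode" not in raw
            odd_encoding = encoding in ("", "Custom") or "Identity" in encoding
            if no_tounicode and odd_encoding:
                bad.append(basefont or "[none]")
        if bad:
            report[page.number + 1] = bad  # 1-based page numbers
    return report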
Example of garbled text in one of my PDF files:
0\n1\n2\n3\n4\n2\n0\n3\n0\n5 6\n6\nÿ\n89 ÿ\n4\n\x0e\n3\nÿ\n\x0f\x10\n\x11\n\x12\nÿ\n5\nÿ\n6\n6\n\x13\n\x11\n\x11\n\x146\n2\n2\n\x15\n\x11\n\x16\n\x12\n\x15\n\x10\n\x11\n\x0e\n\x11\n\x17\n\x12\n\x18\n\x0e\n\x17\n\x19\x0e\n\x1a\n\x16\n2 \x11\n\x10\n\x1b\x12\n\x1c\n\x10\n\x10\n\x15\n\x1d29 2\n\x18\n\x10\n\x16\n89 \x0e\n\x14\n\x13\n\x14\n\x1e\n\x14\n\x1f\n5 \x11\x1f\n\x15\n\x10\n! \x1c\n89 \x1f\n5\n3\n4\n"\n1\n1\n5 \x1c\n89\n#\x15\n\x1d\x1f\n5\n5\n1\n3\n5\n$\n5\n1 5\n2\n5\n%8&&#\'#(8&)\n*+\n\'#&*,\nÿ\n(*ÿ\n-\n./0)\n1\n*\n*//#//8&)\n*ÿ\n#/2#%)\n*,\nÿ\n(*/ÿ\n/#&3#40)\n*/ÿ\n#50&*-\n.()\n%)\n*)\n/ÿ\n+\nÿ\n*#/#\n&\x19\n\x12\nÿ\n\x1cÿ\n,\x1d\n\x12\n\x1b\x10\n\x15\n\x116\nÿ\n\x15\n7\nÿ\n8\n9\n4\n6\nÿ\n%\x10\n\x15\n\x11\n\x166\nÿ\n:\x12\x10;\n2\n*,\n%#26\nÿ\n<\n$\n3\n0\n3\n+\n3\n8\n3\nÿ\n+\nÿ\n=\x15\n\x10\n6\nÿ\n>\n9\n0\n?\nÿ\n4\n3\n3\n1\n+\n8\n9\n3\n<\n@A\nB\nC\nD\nEÿ\nGH\nI\nÿ\nJ\nJ\nK\nL\nJ\nM\nJ\nN\nO\nP\nO\nQ\nI\n#\x1bÿ\n0\n1\nÿ\n\x1c\n\x10\nÿ\n*\x1a\n\x16\n\x18\nÿ\n\x1c\n\x10\nÿ\n0\n3\n0\n5\n\x0e\n/\x10\n\x15\n\x13\x16\n\x12\nÿ\n/\x10\n\x16\n\x1d\x1c\x16\n\x12\n6\nÿ\n* \x19\n\x15\n\x116\nÿ\n\x12\n\x19\n\x11\n\x19\n\x12\n\x16\nÿ\n\x15ÿ\n/*-\n\x0e\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\n(\x10\nÿ\x16\n\x1c\n\x10\n\x1bÿ\n\x1c\n\x12\nÿ\n%\x13\n\x10\n9\n\x10\nÿ\n\x1c\n\x10\nÿ\n\'\x12\n\x1a\x15\n\x10\n\x11\n\x10\nÿ\n\x1c\n\x12\nÿ\n%\x16\n\x16\n\x10\nR\n\x10\n\x1c\x16\n\x12\nÿ\n\'\x10\n\x16\n\x12\n\x18\nÿ\n\x1c\n\x12\nÿ\n-\n\x19\x11\n1\n\x12\nÿ\n\x1cÿ\n#\x11\n\x12\n\x1cÿ\n\x1c\n\x10\nÿ\n*\x18\n\x12\nR\x126\nÿ\n/\x16\n\x12\n\x0e\n& \x10\n\x12\n\x15\n\x12\nÿ\n%\x10\n\x18\x11\n\x16\n\x10\nÿ\n:\x12\x13\n\x12\n\x1c\x0e\nÿ\n*\x19\n\x11\n\x19\n\x10\n+\x10\nÿ\n\x10\nÿ\n&\x10\nR\x11\n\x16\n\x10\n+\x10\nÿ\n\x15ÿ\n/*-\n2\n2\'<\nÿ\n+\nÿ\n#S\n\x11\n\x16\n\x12\n\x17\n\x19\n\x1c \x12\n\x18\nÿ\n*\x1c\n\x1b\x15\x11\n\x16\n\x12\n\x11\n\x1d\x0e\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\n*\x11\n\x10\n\x15 \x12\n\x1b\x10\n\x15\n\x11\n\x10\n6\nTU\nV\nWU\nXÿ\nYXÿ\nTU\nV\nW\nX\nXYZU\n[U\nT\\]X\\U\nW\nX\nVD\n^\n_\n`\nÿ\nab\nÿ\nXGb\nc\nE^\nd\nO\nP\nO\nQ\nP\ne\nO\nf\nP\nf\nJ\nf\nP\ne\ng\nGb\nh_\nEGI\niaA\nYjTk\nXlm@ YjTk\nXlmX] ]jTk@[Yj] U\nZk]U\nZU\n] X]noU\nW\nX] W@V\n\\\nX]\nÿ\n89\nÿ\n89\np ÿ\nq\n(\x10\x14\n\x12\x13\n8r\nIOV\x11\x03\x14\n(VWH\x03GRFXPHQWR\x03p\x03FySLD\x03GR\x03RULJLQDO\x03DVVLQDGR\x03GLJLWDOPHQWH\x03SRU\x03(00$18(/$\x030$5,$\x03&$/$\'2\x03\'(\x03)$5,$6\x036,/9$\x11\x033DUD\x03FRQIHULU\x03R\x03RULJLQDO\x0f\x03DFHVVH\x03R\x03VLWH\x03\x0f\x03LQIRUPH\x03R\x03SURFHVVR\x03\x13\x13\x13\x13\x16\x17\x18\x10\x1a\x18\x11\x15\x13\x15\x14\x11\x1b\x11\x13\x15\x11\x13\x13\x1a\x16\x03H\x03R\x03\nFyGLJR\x03\x17(\x14\x14\x16\x14\x13\x11\x03
The text above was extracted from page 25 of this document using pdftotext.
For that page, pdffonts outputs:
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no       13  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes     131  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no      137  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes     142  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no      148  0
Arial                                TrueType          WinAnsi          yes no  no      159  0
It's easy to identify the font named [none] as problematic. My take so far, given the data I've analysed, is to mark fonts with Custom or Identity-H encoding, with no ToUnicode map, or named [none] as likely problematic. But, as I said, I also found problematic fonts that have a ToUnicode table and a non-Custom encoding. As far as I know, a single character may also be defined with a broken font without affecting the overall readability of the page, so OCRing that page might not be necessary. In other words, if a font on a given page has no ToUnicode conversion, it does not mean that the page's text is entirely affected.
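As a rough Python version of that heuristic, one could parse the pdffonts output for a single page and flag the fonts described above (a sketch; suspect_fonts_on_page is illustrative, and each row is split from the right because the name and type columns may contain spaces):

import subprocess

def suspect_fonts_on_page(pdf_path, page):
    out = subprocess.run(
        ["pdffonts", "-f", str(page), "-l", str(page), pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    suspects = []
    for row in out[2:]:  # skip the two header lines
        tokens = row.split()
        if len(tokens) < 7:
            continue
        # Columns, right to left: object ID (2 tokens), uni, sub, emb, encoding.
        uni = tokens[-3]
        encoding = tokens[-6]
        name = tokens[0]
        if name == "[none]" or encoding in ("Custom", "Identity-H") or uni == "no":
            suspects.append((name, encoding, uni))
    return suspects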
I'm looking for a solution that is better than running a regex over the extracted text to spot garbling.
All the pages below contain text in Portuguese, but if you copy the text and paste it somewhere you will see universal gibberish.
I've avoided calling subprocess twice per page by creating a shell script that iterates over the pages and merges the pdftohtml and pdffonts output of each one into a single HTML document:
#!/bin/sh
# Usage: ./font_report.sh -a 1 -b 100 -c foo.pdf
while getopts "a:b:c:" arg; do
  case $arg in
    a) FIRST_PAGE=$OPTARG;;
    b) LAST_PAGE=$OPTARG;;
    c) FILENAME=$OPTARG;;
    *)
      echo 'Error: invalid options' >&2
      exit 1
  esac
done

: "${FILENAME:?Missing -c}"

if ! [ -f "$FILENAME" ]; then
  echo "Error: $FILENAME does not exist" >&2
  exit 1
fi

echo "<html xmlns='http://www.w3.org/1999/xhtml' lang='' xml:lang=''>"
for page in $(seq "$FIRST_PAGE" "$LAST_PAGE")
do
  {
    echo "<page number=$page>"
    echo "<pdffonts>"
    pdffonts -f "$page" -l "$page" "$FILENAME"
    echo "</pdffonts>"
    (
      pdftohtml -f "$page" -l "$page" -s -i -fontfullname -hidden "$FILENAME" -stdout |
      tail -n +35 | # skips the head tag and its content
      head -n -1    # skips the closing html tag
    )
    echo "</page>"
  }
done
echo "</html>"
The script above lets me call subprocess.run only once per file and parse the resulting HTML with lxml page by page (using the <page> tag). But I still need to look at the text content to get an idea of whether the text is broken.
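With that single call, the per-page parsing on the Python side can look roughly like this (a sketch; page_reports is a made-up name, font_report.sh is the script above, and the XPath expressions assume its output structure):

import subprocess
from lxml import html

def page_reports(pdf_path, first, last):
    merged = subprocess.run(
        ["./font_report.sh", "-a", str(first), "-b", str(last), "-c", pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    tree = html.fromstring(merged)
    for page in tree.xpath("//page"):
        number = page.get("number")
        fonts = page.xpath("./pdffonts/text()")[0]  # raw pdffonts table
        text = page.text_content()                  # all text inside the <page> element
        yield number, fonts, text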
Using pdftotext, here is a full (rewritten) function that scans for bad pages:
#!/bin/bash
findBadPages() {
local line opts progress=true usage="Usage: ${FUNCNAME[0]} [-f first page]"
usage+=' [-l last page] [-m min bad/page] [-q (quiet)]'
local -a pdftxt=(pdftotext -layout - -)
local -ia badpages=()
local -i page=1 limit=10 OPTIND
while getopts "ql:f:m:" opt;do
case $opt in
f ) pdftxt+=(-f $OPTARG); page=$OPTARG ;;
l ) pdftxt+=(-l $OPTARG) ;;
m ) limit=$OPTARG ;;
q ) progress=false ;;
    * ) printf >&2 '%s ERROR: Unknown option!\n%s\n' \
            "${FUNCNAME[0]}" "$usage"; return 1 ;;
esac
done
shift $((OPTIND-1))
while IFS= read -r line; do
[ "$line" = $'\f' ] && page+=1 && $progress && printf %d\\r $page
((${#line} > 1 )) && badpages[page]+=${#line}
done < <(
tr -d '0-9a-zA-Z\047"()[]{}<>,-./+?!$&@#:;%$=_ºÁÃÇÔàáâãçéêíóôõú– ' < <(
"${pdftxt[@]}" <"$1"
))
for page in ${!badpages[@]} ;do
(( ${badpages[page]} > limit )) && {
$progress && printf "There are %d strange characters in page %d\n" \
${badpages[page]} $page || echo $page ;}
done
}
Now:
findBadPages DJE_3254_I_18062021\(1\).pdf
There are 2237 strange characters in page 23
There are 258 strange characters in page 24
There are 20 strange characters in page 32
findBadPages -m 100 -f 40 -l 100 DJE_3254_I_18062021.pdf
There are 623 strange characters in page 80
There are 1068 strange characters in page 81
There are 1258 strange characters in page 82
There are 1269 strange characters in page 83
There are 1245 strange characters in page 84
There are 256 strange characters in page 85
findBadPages DJE_3254_III_18062021.pdf
There are 11 strange characters in page 125
There are 635 strange characters in page 145
findBadPages -qm100 DJE_3254_III_18062021.pdf
145
findBadPages -h
/bin/bash: illegal option -- h
findBadPages ERROR: Unknown option!
Usage: findBadPages [-f first page] [-l last page] [-m min bad/page] [-q (quiet)]
Usage:
findBadPages [-f INTEGER] [-l INTEGER] [-m INTEGER] [-q] <pdf file>
Where
-f lets you specify the first page,
-l the last page,
-m the minimum number of strange characters found on a page for it to be reported,
-q suppresses the page-number display during progress and shows only the bad pages' numbers.
Note:
The string used by tr -d:
0-9a-zA-Z\047"()[]{}<>,-./:;%$=_ºÁÃÇÔàáâãçéêíóôõú–
was built by sorting the characters used in your PDF files! It may not match other languages! Adding more accented characters or other printable characters that were missed may become necessary in future use.
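For the Python side of the workflow, the same idea can be expressed without the shell function (a sketch; find_bad_pages is illustrative and ALLOWED is the same kind of whitelist as the tr -d string, to be tuned to your documents):

import subprocess

ALLOWED = set(
    "0123456789abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "'\"()[]{}<>,-./+?!$&@#:;%=_ºÁÃÇÔàáâãçéêíóôõú– \n\t\f"
)

def find_bad_pages(pdf_path, limit=10):
    text = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    ).stdout
    bad = {}
    # pdftotext separates pages with a form feed character.
    for number, page in enumerate(text.split("\f"), start=1):
        strange = sum(1 for ch in page if ch not in ALLOWED)
        if strange > limit:
            bad[number] = strange
    return bad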