Tags: python, bash, pdf, mojibake

How to identify likely broken PDF pages before extracting their text?


TL;DR

My workflow:

  1. Download PDF
  2. Split it into pages using pdftk
  3. Extract text of each page using pdftotext
  4. Classify text and add metadata
  5. Send it to client in a structured format

I need the extracted text to be consistent in order to go from step 3 to step 4. If the text is garbled, I have to OCR that page, but OCRing every page is out of the question. How can I identify beforehand which pages should be OCRed? I've tried running pdffonts and pdftohtml on each page, but isn't it expensive to call subprocess.run twice per page?
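
For reference, steps 2 and 3 boil down to something like the sketch below (the working directory, file names and page-numbering pattern are illustrative, not my actual code):

# Sketch of steps 2-3: split the PDF with pdftk, then run pdftotext on each page.
import glob
import subprocess

def split_and_extract(pdf_path, workdir="."):
    """Split a PDF into single pages and return {page_file: extracted_text}."""
    subprocess.run(
        ["pdftk", pdf_path, "burst", "output", f"{workdir}/page_%04d.pdf"],
        check=True,
    )
    texts = {}
    for page_file in sorted(glob.glob(f"{workdir}/page_*.pdf")):
        result = subprocess.run(
            ["pdftotext", "-layout", page_file, "-"],   # "-" sends text to stdout
            capture_output=True, text=True, check=True,
        )
        texts[page_file] = result.stdout
    return texts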

What do I mean by broken page?

A PDF page whose text cannot be extracted correctly from its source, for example because of a missing or broken ToUnicode mapping.

Description

I'm building an application that relies on extracting text from a thousand PDF files every day. The text layout in each PDF is somewhat structured, so calling pdftotext from Python works well in most cases. But some PDF files from one or two sources contain pages with problematic fonts, which results in garbled text. I think that running OCR only on the problematic pages would be enough to overcome the issue. So my problem is how to identify, before extracting text, which pages are likely to produce gibberish.

First, I tried to identify garbled text after extracting it, using regex (\p{Cc} or unlikely characters outside the Latin alphabet), but that did not work because I also found corrupted text made only of valid letters and numbers, e.g. AAAAABS12 54c] $( JJJJ Pk.
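
That first attempt looked roughly like this (a sketch using the third-party regex module; the character class and the threshold are made up for illustration):

# Sketch of the regex approach: flag a page when too many characters are
# control characters or fall outside the Latin script.
import regex  # third-party module with \p{...} Unicode property support

SUSPICIOUS = regex.compile(r"[^\s\p{Latin}\p{N}\p{P}\p{S}]")

def looks_garbled(text, threshold=0.05):
    """Return True when the ratio of suspicious characters exceeds the threshold."""
    if not text.strip():
        return False
    return len(SUSPICIOUS.findall(text)) / len(text) > threshold

It catches pages full of control characters, but says nothing about garbled text made of valid characters, such as the AAAAABS12 54c] $( JJJJ Pk example above.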

Second, I tried to identify garbled text by calling pdffonts on each page - to get the font name, encoding, embedding status and the presence of a ToUnicode map - and parsing its output. In my tests this works reasonably well, but I also found it necessary to count how many characters use the likely problematic fonts; pdftohtml - which displays each text block in a p tag along with its font name - saved the day here. @LMC helped me figure out how to do it, take a look at the answer. The bad part is that I ended up calling subprocess.run twice for each PDF page, which is super expensive. It would be cheaper if I could call those tools through bindings instead of spawning subprocesses.

I'd like to know whether it's possible and feasible to look at the PDF source and validate the CMaps (uni yes and not a Custom font), if present, or to use other heuristics to find problematic fonts before extracting text or running OCR on a page.
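
For example, something along these lines (a sketch using pikepdf, which I don't currently use; it only inspects the page's own /Resources dictionary, ignores resources inherited from the page tree, and I'm not sure the top-level font dictionaries tell the whole story for CID fonts):

# Sketch: read the font dictionaries of one page straight from the PDF source,
# without extracting any text. Key names (/Resources, /Font, /ToUnicode,
# /Encoding, /BaseFont) follow the PDF specification.
import pikepdf

def fonts_without_tounicode(pdf_path, page_index):
    """Return BaseFont names on a page that lack a ToUnicode CMap or use Identity-H."""
    flagged = []
    with pikepdf.open(pdf_path) as pdf:
        page = pdf.pages[page_index].obj            # underlying page dictionary
        fonts = page.get("/Resources", {}).get("/Font", {})
        for res_name, font in fonts.items():
            encoding = str(font.get("/Encoding", ""))
            if "/ToUnicode" not in font or encoding == "/Identity-H":
                flagged.append(str(font.get("/BaseFont", res_name)))
    return flagged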

Example of garbled text in one of my PDF files:

0\n1\n2\n3\n4\n2\n0\n3\n0\n5 6\n6\nÿ\n89 ÿ\n4\n\x0e\n3\nÿ\n\x0f\x10\n\x11\n\x12\nÿ\n5\nÿ\n6\n6\n\x13\n\x11\n\x11\n\x146\n2\n2\n\x15\n\x11\n\x16\n\x12\n\x15\n\x10\n\x11\n\x0e\n\x11\n\x17\n\x12\n\x18\n\x0e\n\x17\n\x19\x0e\n\x1a\n\x16\n2 \x11\n\x10\n\x1b\x12\n\x1c\n\x10\n\x10\n\x15\n\x1d29 2\n\x18\n\x10\n\x16\n89 \x0e\n\x14\n\x13\n\x14\n\x1e\n\x14\n\x1f\n5 \x11\x1f\n\x15\n\x10\n! \x1c\n89 \x1f\n5\n3\n4\n"\n1\n1\n5 \x1c\n89\n#\x15\n\x1d\x1f\n5\n5\n1\n3\n5\n$\n5\n1 5\n2\n5\n%8&&#\'#(8&)\n*+\n\'#&*,\nÿ\n(*ÿ\n-\n./0)\n1\n*\n*//#//8&)\n*ÿ\n#/2#%)\n*,\nÿ\n(*/ÿ\n/#&3#40)\n*/ÿ\n#50&*-\n.()\n%)\n*)\n/ÿ\n+\nÿ\n*#/#\n&\x19\n\x12\nÿ\n\x1cÿ\n,\x1d\n\x12\n\x1b\x10\n\x15\n\x116\nÿ\n\x15\n7\nÿ\n8\n9\n4\n6\nÿ\n%\x10\n\x15\n\x11\n\x166\nÿ\n:\x12\x10;\n2\n*,\n%#26\nÿ\n<\n$\n3\n0\n3\n+\n3\n8\n3\nÿ\n+\nÿ\n=\x15\n\x10\n6\nÿ\n>\n9\n0\n?\nÿ\n4\n3\n3\n1\n+\n8\n9\n3\n<\n@A\nB\nC\nD\nEÿ\nGH\nI\nÿ\nJ\nJ\nK\nL\nJ\nM\nJ\nN\nO\nP\nO\nQ\nI\n#\x1bÿ\n0\n1\nÿ\n\x1c\n\x10\nÿ\n*\x1a\n\x16\n\x18\nÿ\n\x1c\n\x10\nÿ\n0\n3\n0\n5\n\x0e\n/\x10\n\x15\n\x13\x16\n\x12\nÿ\n/\x10\n\x16\n\x1d\x1c\x16\n\x12\n6\nÿ\n* \x19\n\x15\n\x116\nÿ\n\x12\n\x19\n\x11\n\x19\n\x12\n\x16\nÿ\n\x15ÿ\n/*-\n\x0e\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\n(\x10\nÿ\x16\n\x1c\n\x10\n\x1bÿ\n\x1c\n\x12\nÿ\n%\x13\n\x10\n9\n\x10\nÿ\n\x1c\n\x10\nÿ\n\'\x12\n\x1a\x15\n\x10\n\x11\n\x10\nÿ\n\x1c\n\x12\nÿ\n%\x16\n\x16\n\x10\nR\n\x10\n\x1c\x16\n\x12\nÿ\n\'\x10\n\x16\n\x12\n\x18\nÿ\n\x1c\n\x12\nÿ\n-\n\x19\x11\n1\n\x12\nÿ\n\x1cÿ\n#\x11\n\x12\n\x1cÿ\n\x1c\n\x10\nÿ\n*\x18\n\x12\nR\x126\nÿ\n/\x16\n\x12\n\x0e\n& \x10\n\x12\n\x15\n\x12\nÿ\n%\x10\n\x18\x11\n\x16\n\x10\nÿ\n:\x12\x13\n\x12\n\x1c\x0e\nÿ\n*\x19\n\x11\n\x19\n\x10\n+\x10\nÿ\n\x10\nÿ\n&\x10\nR\x11\n\x16\n\x10\n+\x10\nÿ\n\x15ÿ\n/*-\n2\n2\'<\nÿ\n+\nÿ\n#S\n\x11\n\x16\n\x12\n\x17\n\x19\n\x1c \x12\n\x18\nÿ\n*\x1c\n\x1b\x15\x11\n\x16\n\x12\n\x11\n\x1d\x0e\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\nÿ\n*\x11\n\x10\n\x15 \x12\n\x1b\x10\n\x15\n\x11\n\x10\n6\nTU\nV\nWU\nXÿ\nYXÿ\nTU\nV\nW\nX\nXYZU\n[U\nT\\]X\\U\nW\nX\nVD\n^\n_\n`\nÿ\nab\nÿ\nXGb\nc\nE^\nd\nO\nP\nO\nQ\nP\ne\nO\nf\nP\nf\nJ\nf\nP\ne\ng\nGb\nh_\nEGI\niaA\nYjTk\nXlm@ YjTk\nXlmX] ]jTk@[Yj] U\nZk]U\nZU\n] X]noU\nW\nX] W@V\n\\\nX]\nÿ\n89\nÿ\n89\np ÿ\nq\n(\x10\x14\n\x12\x13\n8r\nIOV\x11\x03\x14\n(VWH\x03GRFXPHQWR\x03p\x03FySLD\x03GR\x03RULJLQDO\x03DVVLQDGR\x03GLJLWDOPHQWH\x03SRU\x03(00$18(/$\x030$5,$\x03&$/$\'2\x03\'(\x03)$5,$6\x036,/9$\x11\x033DUD\x03FRQIHULU\x03R\x03RULJLQDO\x0f\x03DFHVVH\x03R\x03VLWH\x03\x0f\x03LQIRUPH\x03R\x03SURFHVVR\x03\x13\x13\x13\x13\x16\x17\x18\x10\x1a\x18\x11\x15\x13\x15\x14\x11\x1b\x11\x13\x15\x11\x13\x13\x1a\x16\x03H\x03R\x03\nFyGLJR\x03\x17(\x14\x14\x16\x14\x13\x11\x03

The text above was extracted from page 25 of this document using pdftotext.

For that page, pdffonts outputs:

name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
[none]                               Type 3            Custom           yes no  no      13  0
DIIDPF+ArialMT                       CID TrueType      Identity-H       yes yes yes    131  0
DIIEDH+Arial                         CID TrueType      Identity-H       yes yes no     137  0
DIIEBG+TimesNewRomanPSMT             CID TrueType      Identity-H       yes yes yes    142  0
DIIEDG+Arial                         CID TrueType      Identity-H       yes yes no     148  0
Arial                                TrueType          WinAnsi          yes no  no     159  0

It's easy to identify the font named [none] as problematic. My take so far, given the data I've analysed, is to mark fonts with a Custom or Identity-H encoding, no ToUnicode map, or a [none] name as likely problematic. But, as I said, I also found problematic cases with a ToUnicode table and a non-Custom encoding. As far as I know, it's also possible for a single character to be drawn with a broken font without affecting the overall readability of the page, in which case OCRing that page might not be necessary. In other words, the fact that a font on a given page has no ToUnicode conversion does not mean that the whole text of the page is affected.
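
For completeness, this is roughly how I apply that heuristic per page today (a sketch; fields are taken from the right-hand end of each line because the name and type columns may contain spaces):

# Sketch: one pdffonts call per page; flag the page when any font matches the
# heuristic ([none] name, Custom/Identity-H encoding, or no ToUnicode map).
import subprocess

def page_has_suspect_font(pdf_path, page):
    out = subprocess.run(
        ["pdffonts", "-f", str(page), "-l", str(page), pdf_path],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines()[2:]:              # skip the two header lines
        fields = line.split()
        if len(fields) < 7:
            continue
        name = fields[0]
        encoding, uni = fields[-6], fields[-3]     # counted from the right
        if name == "[none]" or encoding in ("Custom", "Identity-H") or uni == "no":
            return True
    return False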

I'm looking for a solution that is better than running a regex over the extracted text to detect garbling.

Examples of PDF pages that I had to OCR

All pages below contain text in Portuguese, but if you copy the text and paste it somewhere else you will see plain gibberish.

What I've done so far

I've avoided calling subprocess twice per page by creating a bash script that iterates over the pages and merges the pdftohtml and pdffonts output of each one into a single HTML document:

#!/bin/sh

# Usage: ./font_report.sh -a 1 -b 100 -c foo.pdf


while getopts "a:b:c:" arg; do
    case $arg in
        a) FIRST_PAGE=$OPTARG;;
        b) LAST_PAGE=$OPTARG;;
        c) FILENAME=$OPTARG;;
        *)
            echo 'Error: invalid options' >&2
            exit 1
    esac
done

: ${FIRST_PAGE:?Missing -a} ${LAST_PAGE:?Missing -b} ${FILENAME:?Missing -c}

if ! [ -f "$FILENAME" ]; then
    echo "Error: $FILENAME does not exist" >&2
    exit 1
fi

echo "<html xmlns='http://www.w3.org/1999/xhtml' lang='' xml:lang=''>" ;

for page in $(seq $FIRST_PAGE $LAST_PAGE)
do
   { 
       echo "<page number=$page>" ; 
       echo "<pdffonts>" ; 
       pdffonts -f "$page" -l "$page" "$FILENAME" ; 
       echo "</pdffonts>" ;  
       (
           pdftohtml -f "$page" -l "$page" -s -i -fontfullname -hidden "$FILENAME" -stdout | 
           tail -n +35 |  # skips head tag and its content
           head -n -1  # skips html ending tag
        ) ;
       echo "</page>"
    }
done

echo "</html>"

The script above lets me call subprocess only once per PDF and parse the resulting HTML with lxml page by page (using the <page> tag). But I still need to look at the text content to get an idea of whether the text is broken.
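
The parsing side looks roughly like this (simplified: it only counts characters per pdftohtml paragraph class; mapping those classes back to the font names declared in the <style> block that pdftohtml emits is omitted, and the class attribute is a placeholder for whatever attribute actually carries the font in the output):

# Sketch: parse the merged report produced by the script above and count how
# many characters each pdftohtml paragraph class contributes per page.
from lxml import html

def char_counts(report_path):
    tree = html.parse(report_path)
    counts = {}
    for page in tree.xpath("//page"):
        number = int(page.get("number"))
        per_class = {}
        for p in page.xpath(".//p"):
            cls = p.get("class", "?")   # placeholder: adjust to the attribute carrying the font
            per_class[cls] = per_class.get(cls, 0) + len(p.text_content())
        counts[number] = per_class
    return counts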


Solution

  • A quick function based on pdftotext

    Here is a full (rewritten) function that scans for bad pages:

    #!/bin/bash
    
    findBadPages() {
        local line opt progress=true usage="Usage: ${FUNCNAME[0]} [-f first page]"
        usage+=' [-l last page] [-m min bad/page] [-q (quiet)]'
        local -a pdftxt=(pdftotext -layout - -)
        local -ia badpages=()
        local -i page=1 limit=10 OPTIND
        while getopts "ql:f:m:" opt;do
            case $opt in
                f ) pdftxt+=(-f $OPTARG); page=$OPTARG ;;
                l ) pdftxt+=(-l $OPTARG) ;;
                m ) limit=$OPTARG ;;
                q ) progress=false ;;
                * ) printf >&2 '%s ERROR: Unknown option!\n%s\n' \
                               "${FUNCNAME[0]}" "$usage"; return 1 ;;
            esac
        done
        shift $((OPTIND-1))
        # Strip every character considered legitimate with tr -d; whatever is
        # left on a line counts as strange characters for the current page,
        # and a lone form feed marks a page break.
        while IFS= read -r line; do
            [ "$line" = $'\f' ] && page+=1 && $progress && printf %d\\r $page
            ((${#line} > 1 )) && badpages[page]+=${#line}
        done < <(
            tr -d '0-9a-zA-Z\047"()[]{}<>,-./+?!$&@#:;%$=_ºÁÃÇÔàáâãçéêíóôõú– ' < <(
                "${pdftxt[@]}" <"$1"
        ))
        for page in ${!badpages[@]} ;do
            (( ${badpages[page]} > limit )) && {
                $progress && printf "There are %d strange characters in page %d\n" \
                   ${badpages[page]} $page || echo $page ;}
        done
    }
    

    Now, running it:

    findBadPages DJE_3254_I_18062021\(1\).pdf
    There are 2237 strange characters in page 23
    There are 258 strange characters in page 24
    There are 20 strange characters in page 32
    
    findBadPages -m 100 -f 40 -l 100 DJE_3254_I_18062021.pdf 
    There are 623 strange characters in page 80
    There are 1068 strange characters in page 81
    There are 1258 strange characters in page 82
    There are 1269 strange characters in page 83
    There are 1245 strange characters in page 84
    There are 256 strange characters in page 85
    
    findBadPages DJE_3254_III_18062021.pdf
    There are 11 strange characters in page 125
    There are 635 strange characters in page 145
    
    findBadPages -qm100 DJE_3254_III_18062021.pdf 
    145
    
    findBadPages -h
    /bin/bash: illegal option -- h
    findBadPages ERROR: Unknown option!
    Usage: findBadPages [-f first page] [-l last page] [-m min bad/page] [-q (quiet)]
    

    Usage:

    findBadPages [-f INTEGER] [-l INTEGER] [-m INTEGER] [-q] <pdf file>
    

    Where:

      • -f INTEGER specifies the first page to scan
      • -l INTEGER specifies the last page to scan
      • -m INTEGER sets the minimum number of strange characters per page before the page is reported (default: 10)
      • -q prints only the page numbers (quiet mode)

    Note:

    The string used by tr -d: 0-9a-zA-Z\047"()[]{}<>,-./:;%$=_ºÁÃÇÔàáâãçéêíóôõú– was built by sorting the characters used in your PDF files! It may not match another language! Adding some more accented characters or other missed printable characters may become necessary for future use.
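
    To plug this back into the Python pipeline from the question, a thin wrapper along these lines could work (find_bad_pages.sh is a hypothetical executable file that defines findBadPages and then calls it with "$@"):

    # Sketch: run the shell scanner in quiet mode and collect the page numbers
    # that should be sent to OCR.
    import subprocess

    def pages_to_ocr(pdf_path, min_bad=100):
        result = subprocess.run(
            ["./find_bad_pages.sh", "-q", "-m", str(min_bad), pdf_path],
            capture_output=True, text=True,
        )
        return [int(line) for line in result.stdout.split() if line.isdigit()]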