tesseract hocr

Getting exact font size in hocr output


I'm using Tesseract to extract text and formatting from a large volume of pages that look like this:

[Image: sample page of OCR text with different line heights]

(My original images are 1200 DPI; I've reduced to 600 DPI and re-encoded to keep the file-size sane.)

When the book uses block-quotes (such as the ones that occupy most of the left column of this page), the most prominent difference is the slightly smaller font size.

The problem is that when I set hocr_font_info to 1 in my hocr config file, the xml output produces word-tags like this:

<span class='ocrx_word' id='word_1_131' title='bbox 561 3188 981 3278; x_wconf 89; x_font Century_Schoolbook_L_Medium; x_fsize 7' lang='fra' dir='ltr'>dération</span>

The x_fsize attribute is usually 6 on the small lines and 7 on the larger lines, but Tesseract will sometimes assign a value of 7 to a smaller line - and it will do so for the entire line, so I can't rely on neighboring words to fix the problem. (In some cases, I can use neighboring lines, but not always. Sometimes I'll be dealing with an isolated line of text, so I really need the exact size, if possible.)

What's the best approach to getting more granularity in my font-sizes? In a pinch, I could probably get by if I had the exact height and width of each character, although a font-size with decimal-places (e.g. "x_fsize='6.62'") would be a lot easier to work with.


Solution

  • The calculation of the font size is given in Tesseract in these three lines:

    *pointsize = scaled_yres_ > 0
        ? static_cast<int>(row_height * kPointsPerInch / scaled_yres_ + 0.5)
        : 0;
    

    What you want is to avoid the cast of this float to an integer. However, the structure and type are also defined as integers in several other places, which would need to be adjusted as well...

    The key value here is row_height, which is the same as the x_size parameter of each ocr_line in the hOCR file. Thus, you can simply go through the hOCR file and decide for each line, based on its x_size, whether it uses the smaller or the larger font size. For examples of how to walk through an hOCR file and process it, look at hocr-tools.

    To actually perform the calculation above, you just need to know your resolution (600 or 1200 dpi) and the value kPointsPerInch = 72. As a proof of concept, try this Perl one-liner:

    $ perl -ne 'print("$1 ", $2*72/600, "\n") if /^.*id=.([^ ]*). .*x_size ([0-9.]*);.*$/' h7.hocr
    line_1_1 8.62807344
    line_1_2 7.08
    line_1_3 6.36
    line_1_4 6.36
    line_1_5 6.36
    line_1_6 6.35710104
    line_1_7 6.48
    line_1_8 6.36
    line_1_9 6.24
    line_1_10 6.36
    ...
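
For anyone who would rather do this in Python, here is a minimal sketch of the same calculation. It is only an illustration, not part of Tesseract or hocr-tools: the function names `line_point_sizes` and `split_by_size` are made up for this example, and the regex assumes that each ocr_line tag has its id attribute before a title attribute containing "x_size <number>". The threshold for telling small lines from large ones is something you would have to pick by looking at your own data.

```python
import re

POINTS_PER_INCH = 72  # same value as kPointsPerInch in the Tesseract snippet

# Assumption: each ocr_line tag carries id='line_...' before its title
# attribute, and the title contains "x_size <number>".
LINE_RE = re.compile(
    r"""id=['"](line_[^'"]+)['"][^>]*?x_size\s+([0-9]+(?:\.[0-9]+)?)"""
)


def line_point_sizes(hocr_text, dpi):
    """Return (line_id, point_size) pairs, keeping the fractional part."""
    return [
        (line_id, float(x_size) * POINTS_PER_INCH / dpi)
        for line_id, x_size in LINE_RE.findall(hocr_text)
    ]


def split_by_size(sizes, threshold):
    """Label each line 'small' or 'large' relative to a hand-picked threshold."""
    return {
        line_id: ("small" if size < threshold else "large")
        for line_id, size in sizes
    }
```

You would call it with the contents of your hOCR file and the resolution of the image you fed to Tesseract, e.g. `line_point_sizes(open("h7.hocr").read(), 600)`, and then pass the result to `split_by_size` with a cut-off that sits between your two font sizes.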