
Reading images from a PDF and extracting text from them


Problem Statement: I have a PDF with n pages. Each page contains one image, and I need to read the text in that image and perform some operations on it.

What I tried: I have to do this in Python, and the only library I found that gives good results is pytesseract.

Here is the sample code I tried:

    fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
    zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
    if 3.3 < zoom < 3.5:
        zoom_config += ' --oem 3 --psm 4'
    elif 0 != page_number_list[0]:
        zoom_config += ' --psm 6'
    full_text, page_length = '', kw['doc'].pageCount
    if recursion and index >= 10:
        return fn.get('most_correct') or fn.get(page_number_list[0])
    mat = fitz.Matrix(zoom, zoom)  # increase resolution
    for page_no in page_number_list:
        page = kw['doc'].loadPage(page_no)  # number of page
        pix = page.getPixmap(matrix=mat)
        with Image.open(io.BytesIO(pix.getImageData())) as img:
            text_of_each_page = pytesseract.image_to_string(img, config=zoom_config).strip()
        fn[page_no] = text_of_each_page
        full_text = '\n'.join((full_text, text_of_each_page, '\n'))
    _logger.critical(f"full text in load image {full_text}")
    args = (full_text, page_number_list)
    load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
    if recursion and load:
        return self.load_image
    return full_text

The issue: My PDF contains dates like 1/13 and 1/7, but the library reads them as 143 and 1n, and in some places it reads 17 as 1). It also appends random symbols such as { & . , = after the text, even though none of these appear in the PDF.

To improve accuracy:

  1. I tried converting the image to .tiff format, but it didn't help.
  2. I tried adjusting the resolution of the image.

Solution

  • You can use the pdftoppm tool to convert your PDF pages to images quickly. When it is called through the pdf2image Python wrapper, you can enable multi-threading just by passing thread_count=(number of threads). You can refer to this link for more info on this tool. Also, better-quality images can increase the accuracy of Tesseract.
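
A minimal sketch of that approach, assuming the pdf2image package (which runs pdftoppm under the hood) and pytesseract are installed along with poppler and Tesseract. The function names, the 300 dpi default, the thread count, and the file name are my own placeholders, not values from the question, so treat this as a starting point to tune rather than a definitive fix. The Tesseract flags are the same ones used in the question's code.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build the Tesseract option string (same flags as in the question)."""
    return f"--oem {oem} --psm {psm} -c tessedit_do_invert=0"


def ocr_pdf(pdf_path, dpi=300, threads=4, psm=6):
    """Return a list holding the OCR text of each page of pdf_path."""
    # Third-party imports are local so the helper above works without them.
    # Assumed setup: pip install pdf2image pytesseract
    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image calls pdftoppm; thread_count controls how many conversions
    # run in parallel. A higher dpi and grayscale output often help OCR.
    pages = convert_from_path(
        pdf_path, dpi=dpi, thread_count=threads, grayscale=True
    )
    config = build_tesseract_config(psm=psm)
    return [
        pytesseract.image_to_string(page, config=config).strip()
        for page in pages
    ]


# Example usage ("scanned.pdf" is a placeholder file name):
# for number, text in enumerate(ocr_pdf("scanned.pdf"), start=1):
#     print(f"--- page {number} ---")
#     print(text)
```

If the dates are still misread, trying a different dpi or page segmentation mode (the psm argument) is a reasonable next experiment, since the right values depend on the scan.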