Problem statement: I have a PDF with n pages, and each page contains one image whose text I need to read and then process.
What I tried: I have to do this in Python, and the library that gave me the best results is pytesseract.
Here is the sample code I tried:
fn = kw['fn'] = self.env.context.get('wfg_pg', kw['fn'])
zoom, zoom_config = self.get_zoom_for_doc(index), ' -c tessedit_do_invert=0'
if 3.3 < zoom < 3.5:
    zoom_config += ' --oem 3 --psm 4'
elif 0 != page_number_list[0]:
    zoom_config += ' --psm 6'
full_text, page_length = '', kw['doc'].pageCount
if recursion and index >= 10:
    return fn.get('most_correct') or fn.get(page_number_list[0])
mat = fitz.Matrix(zoom, zoom)  # increase resolution
for page_no in page_number_list:
    page = kw['doc'].loadPage(page_no)  # number of page
    pix = page.getPixmap(matrix=mat)
    with Image.open(io.BytesIO(pix.getImageData())) as img:
        text_of_each_page = str(pytesseract.image_to_string(img, config='%s' % zoom_config)).strip()
    fn[page_no] = text_of_each_page
    full_text = '\n'.join((full_text, text_of_each_page, '\n'))
_logger.critical(f"full text in load image {full_text}")
args = (full_text, page_number_list)
load = recursion and self.run_recursion_to_load_new_image_to_text(*args, **kw)
if recursion and load:
    return self.load_image
return full_text
The issue: My PDF contains dates such as 1/13 and 1/7, but the library reads them as 143 and 1n, and in some places it reads 17 as 1). It also outputs random symbols such as { & . , = after the text, even though they do not appear anywhere in the PDF.
For accuracy
You can use the pdftoppm tool to convert your PDF pages to images really fast. When it is called through the pdf2image Python wrapper, it can work on several pages in parallel: just pass thread_count=(number of threads) to convert_from_path.
You can refer to this link for more info on this tool. Also, better images can increase the accuracy of Tesseract.
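As a rough sketch of that suggestion, the snippet below renders each page with pdf2image (which calls pdftoppm) and passes the result to pytesseract. It assumes pdf2image, pytesseract, poppler and Tesseract are installed. The names build_tesseract_config and ocr_pdf, the 300 DPI default, the 4 threads and the grayscale step are illustrative choices of mine, not something taken from the question; the config flags are the ones already used in the question's code.

```python
from typing import List


def build_tesseract_config(psm: int = 6, oem: int = 3) -> str:
    """Build a Tesseract config string (hypothetical helper).

    Uses the same flags as the question's code: --oem, --psm and
    tessedit_do_invert=0.
    """
    return f"--oem {oem} --psm {psm} -c tessedit_do_invert=0"


def ocr_pdf(pdf_path: str, dpi: int = 300, threads: int = 4) -> List[str]:
    """Return the OCR text of every page of pdf_path, one string per page."""
    # Imported lazily so the config helper can be used on its own.
    import pytesseract
    from pdf2image import convert_from_path

    # pdf2image shells out to pdftoppm; thread_count renders pages in parallel.
    # A higher DPI (an assumed 300 here) usually gives Tesseract cleaner glyphs.
    pages = convert_from_path(pdf_path, dpi=dpi, thread_count=threads)

    config = build_tesseract_config(psm=6)
    texts = []
    for page in pages:
        # Each page is a PIL image; converting to grayscale is an assumed
        # preprocessing step that may or may not help on a given scan.
        gray = page.convert("L")
        texts.append(pytesseract.image_to_string(gray, config=config).strip())
    return texts
```

Calling ocr_pdf("statement.pdf") (a placeholder file name) would return a list with one string per page. Whether 300 DPI and --psm 6 are the right settings depends on your scans, so it is worth trying a few values on the pages where dates like 1/13 are misread.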