python image-processing ocr document-layout-analysis

Text layout recognition with Python


I'm trying to go through several thousand scanned files and sort them into folders by type (i.e., if a file is a scanned copy of formA, it should go in the formA folder; if it's a scanned copy of formB, it should go in the formB folder; and so on). I feel like the best way to match files to types is by their text outlines, but I'm totally new to image processing, so if there's a better solution, I'm all ears.

I'm working in Python. Any ideas on the best way to do this? PIL? OpenCV? ImageMagick?

Thanks in advance...


Solution

  • This library is probably of interest to you:
    http://code.google.com/p/ocropus/
    It's made by Googlers and lets you do OCR and layout analysis from Python.
    I had some trouble installing it, but that was quite a while back, so things may have been fixed by now.
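
If a full OCR toolkit turns out to be more than you need, the "text outline" idea from the question can be roughed out directly. What follows is only a sketch and does not use OCRopus: it reduces each page to a coarse grid of ink densities and picks the closest reference form. The names `load_grayscale`, `layout_signature`, `distance`, and `classify`, the grid size, and the threshold are all made up for illustration. It also assumes that Pillow is installed for loading files and that the scans are roughly aligned and not heavily skewed.

```python
def load_grayscale(path, size=(170, 220)):
    """Load a scan as rows of grayscale values (0 = black, 255 = white).

    Assumes Pillow is installed. The page is shrunk to `size` first,
    which keeps the pure-Python loops below fast enough.
    """
    from PIL import Image  # third-party: pip install Pillow

    img = Image.open(path).convert("L").resize(size)
    width, height = img.size
    data = img.tobytes()  # one byte per pixel in mode "L"
    return [list(data[r * width:(r + 1) * width]) for r in range(height)]


def layout_signature(pixels, grid=(8, 8), threshold=128):
    """Return the fraction of dark pixels in each grid cell, row by row."""
    rows, cols = grid
    height, width = len(pixels), len(pixels[0])
    dark = [[0] * cols for _ in range(rows)]
    total = [[0] * cols for _ in range(rows)]
    for y, row in enumerate(pixels):
        gy = y * rows // height
        for x, value in enumerate(row):
            gx = x * cols // width
            total[gy][gx] += 1
            if value < threshold:
                dark[gy][gx] += 1
    return [
        dark[r][c] / total[r][c] if total[r][c] else 0.0
        for r in range(rows)
        for c in range(cols)
    ]


def distance(a, b):
    """Euclidean distance between two signatures of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def classify(signature, templates):
    """Return the name of the template whose signature is closest."""
    return min(templates, key=lambda name: distance(signature, templates[name]))
```

To use it, you would build `templates` from one clean scan per form, e.g. `{"formA": layout_signature(load_grayscale("formA_sample.png"))}` (the file name is a placeholder), then call `classify` on each file's signature and move the file to the matching folder. Forms with very similar layouts would probably need a finer grid or real OCR to tell apart.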