machine-learning, computer-vision, artificial-intelligence, conv-neural-network, object-recognition

Learning method for detecting relevant fields in forms (image format)


So I have been working on an application where users can upload a (scanned) PDF file representing some form, draw bounding boxes around the fields they are interested in, and have those contents OCR'd and returned in structured text format. Because drawing bounding boxes is kind of a drag, I was thinking about a way to decrease the work required by the user, i.e. already offer an auto-detected division of fields in the form. I started researching the matter and found some interesting approaches, mostly based on computer vision algorithms. However, as this application might be used frequently in the future, and thus a lot of bounding boxes will be drawn by users, it would almost seem like a waste to me not to try and apply a learning method to this dataset.

So I started looking over a lot of different forms and noticed that most of them are structured by borders in a way like this:

[image: example form structured as a grid of bordered boxes]

A few observations here: boxes that are filled 100% with text are usually not requested for extraction, as they represent terms/conditions/disclaimers/etc. Boxes that are (mostly) empty are also not requested, as they mostly indicate irrelevant fields. The only interesting boxes appear to be those with a label in the top/left and some content in the body of the box.

It should of course also be said that not every form is as nicely structured with borders as the one above. Some use only a single dividing border (either horizontal or vertical) between fields, and sometimes there are no borders at all.

Since we are working with images, I started looking into object recognition and tried out YOLOv2 (a convolutional neural network), which I trained overnight on a dataset of 100 forms (I realize this dataset is still too small, and since I trained on my CPU, that I probably haven't trained long enough either). Anyway, I was hoping that the fact that all training fields had a border and some content would quickly help the system find bordered boxes itself. The results so far have however been quite disappointing (avg loss/error = 9.6). Thinking about this, I realised that when users skip drawing certain fields that are perfectly fine bordered boxes, those boxes become false negatives in the training data, which would confuse the neural network in its learning process; am I right about this?

As for the remainder of my question: do you guys think object recognition is the way to go here, or is it way too confusing for the system given the nature of such forms? If it is too confusing, would that still be the case if I applied, for example, some filter(s) to try and "blur" text together, making the boxes look a lot more like each other? Or, given this dataset of coordinates of (most) relevant boxes per document, what would be a better learning method to apply instead? Perhaps even a method that does not rely so much on the presence of a border?
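To make the "blurring" idea concrete, this is roughly the kind of preprocessing I had in mind: binarize the scan and dilate it so neighbouring characters smear into uniform blobs. A minimal OpenCV sketch, where the file path and the kernel size are just placeholders I would still need to tune:

```python
import cv2

# Load the scanned form as grayscale ("form.png" is a placeholder path)
img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)

# Binarize with Otsu so text and lines become white (255) on black
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Dilate with a wide kernel so characters on a line merge into one blob;
# the 25x5 size is a guess that depends on the scan resolution
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
blobs = cv2.dilate(binary, kernel, iterations=1)

# Each connected blob is now a candidate text region (OpenCV 4.x API)
contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
candidate_boxes = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h)
```

The idea being that two boxes with completely different text would then look like near-identical blob patterns to the network.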

Keep in mind that the only hard requirement I have is being able to use the user-drawn bounding boxes as a dataset to continuously improve the system.
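For reference, this is roughly how I map the user-drawn boxes to darknet/YOLO label lines (class id, then box center and size normalized to [0, 1]); the (x, y, w, h) pixel-box input format is just my own convention:

```python
def to_yolo_label(box, img_w, img_h, class_id=0):
    """Turn a user-drawn pixel box (x, y, w, h) into a darknet label line:
    '<class> <x_center> <y_center> <width> <height>', all normalized."""
    x, y, w, h = box
    x_center = (x + w / 2) / img_w
    y_center = (y + h / 2) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {w / img_w:.6f} {h / img_h:.6f}"

# Example: a 200x40 px field at (50, 120) on an 850x1100 px scan
print(to_yolo_label((50, 120, 200, 40), 850, 1100))
# -> "0 0.176471 0.127273 0.235294 0.036364"
```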

Thank you all for your time!


Solution

  • As for the neural network strategy, it may be more interesting to first recognize individual pieces of text. This way, you'll have much more data to learn from given your 100 documents. Later, you can teach it to recognize specific headers. Once you have the bounding boxes of the text, it is easy to determine which text is close to a given header (a toy proximity heuristic is sketched at the end of this answer). If your desired output is a bounding box as big as the ones you show in your image, the network will have a much harder time finding the useful information than with small, tight boxes containing the text directly. Also, because your boxes are entered manually, their fuzziness will be a major source of accuracy loss when predicting them, so pixel-accurate input would help here as well.

    Also consider using version spaces as an alternative learning method. Learning boxes that contain features is one of their flagship use cases.

    Another strategy would be not to use machine learning at all. Math frameworks such as Matlab and Octave have powerful algorithms that can reduce an image to a binary, single-pixel-wide grid of detected lines (example). This would of course require some extra algorithmic work when there are no lines (finding the vertical/horizontal 'cuts' with the fewest black pixels) or only partial lines. Still, the result might be more accurate than a learner; a rough sketch of this line-extraction idea follows below.
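To make the no-machine-learning route concrete, here is a rough OpenCV equivalent (rather than Matlab/Octave): morphological opening extracts long horizontal and vertical strokes, and the holes in the resulting grid become candidate boxes. The file path and the kernel lengths are placeholders to tune:

```python
import cv2

img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)  # placeholder path
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Opening keeps only strokes at least 40 px long in each direction;
# 40 is a guess that depends on the scan resolution
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

# The union of both is the border grid; the white regions of its inverse
# are the cells (this assumes the borders actually close each cell)
grid = cv2.bitwise_or(h_lines, v_lines)
contours, _ = cv2.findContours(255 - grid, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
cells = [cv2.boundingRect(c) for c in contours]  # (x, y, w, h) per cell
```

Partial or missing borders will leave cells unclosed, which is exactly where the extra 'cut'-finding work mentioned above would come in.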
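And going back to the first suggestion: once you have tight text boxes and know which of them are headers, associating content with a header can start out as simple as "closest box below or to the right". The scoring below is a made-up heuristic of mine, not an established algorithm:

```python
def nearest_content(label_box, content_boxes):
    """For a header box, pick the content box nearest below or to the
    right of it. Boxes are (x, y, w, h) tuples; returns None if none fit."""
    lx, ly, lw, lh = label_box
    best, best_dist = None, float("inf")
    for cb in content_boxes:
        cx, cy, cw, ch = cb
        # Only consider content starting below or to the right of the label
        if cy >= ly + lh or cx >= lx + lw:
            dist = abs(cy - (ly + lh)) + abs(cx - lx)  # crude L1 distance
            if dist < best_dist:
                best, best_dist = cb, dist
    return best
```

This still satisfies your requirement: the user-drawn boxes keep feeding back in, since they tell you which label/content pairs users actually care about.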