python machine-learning deep-learning

Text extraction and text recognition with AI


Starting from text, I'd like to be able to identify specific pieces of information.

Example:

Input texts: "The invoice number is 18", "Inv : 75", "Inv N. : 84"

Identified invoice numbers: "18", "75", "84"

The concrete problem is that I have a lot of documents containing lots of this kind of information, and I would like to use an algorithm to identify and extract various types of fields.

My idea is to use some kind of framework or algorithm, feed it all of my documents, and train it by approving or rejecting its results, but I don't know where to start.

I looked into deep learning for unstructured text, machine learning, Stanford NER, Named Entity Recognition as a general concept, etc.

I would appreciate some guidance on where to start implementing such a solution.

Thanks


Solution

  • Depending on your specific use case, the main architecture I would recommend is AVEQA.

    NER was basically made for identifying occurrences of a certain entity type (for example, a country) in cases where the entity type is not explicitly named in the text (e.g. “Last summer in South Africa was colder than in other years”). It’s not a bad approach, but since your entity types appear explicitly in the text, you can take advantage of that.

    AVEQA was basically made for this use case. You ask a question, for example “What is the invoice number?”, and the model extracts the answer from the input text. It is trained on texts where the answer lies within the text itself; for each training example you just give the algorithm the start and end index positions of the answer.

    An example of the whole thing, extracting the invoice number from a sentence:

    It also has a module called no-answer to avoid false positives, for example when you ask for an invoice number and the text contains none.
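
To make the question / start-index / end-index format described above concrete, here is a minimal sketch in plain Python that builds training examples from the sentences in the question. The `QAExample` class, the `make_example` helper and the field names are hypothetical and only loosely follow a SQuAD-like layout; the exact input format AVEQA expects is not shown here and may differ (real models usually work with token indexes rather than character indexes). A text without any invoice number is stored with no span, which is the kind of case the no-answer module is meant to handle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """One extractive-QA training example (hypothetical, SQuAD-like layout)."""
    question: str
    context: str
    answer_start: Optional[int]  # character index where the answer begins
    answer_end: Optional[int]    # exclusive end index; None means "no answer"

    def answer(self) -> Optional[str]:
        """Recover the answer text from the stored span, or None if absent."""
        if self.answer_start is None or self.answer_end is None:
            return None
        return self.context[self.answer_start:self.answer_end]


def make_example(question: str, context: str, answer: Optional[str]) -> QAExample:
    """Locate `answer` in `context` and record its start/end character indexes.

    Uses the first occurrence of the answer string, which is an assumption
    that is good enough for short sentences like the ones below.
    """
    if answer is None:
        # Negative example: the text does not contain the requested field.
        return QAExample(question, context, None, None)
    start = context.find(answer)
    if start == -1:
        raise ValueError(f"answer {answer!r} does not occur in {context!r}")
    return QAExample(question, context, start, start + len(answer))


QUESTION = "What is the invoice number?"

examples = [
    make_example(QUESTION, "The invoice number is 18", "18"),
    make_example(QUESTION, "Inv : 75", "75"),
    make_example(QUESTION, "Inv N. : 84", "84"),
    # No invoice number in this text, so the span is left empty.
    make_example(QUESTION, "Please find the delivery note attached.", None),
]

for ex in examples:
    print(ex.context, "|", ex.answer_start, ex.answer_end, "|", ex.answer())
```

Labelling a few hundred documents this way, including some negative examples, is roughly the "approve or reject the results" loop from the question: you correct the spans the model gets wrong and add them back to the training set.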