node.js tesseract tesseract.js

How to read the text from a PAN card


I have a use case where I need to read the text from a PAN card. Ideally the application should have a screen for scanning the PAN card, and the text should be extracted from the scan. The extracted text will then be auto-populated on the subsequent screens.

I have read about the tesseract npm module, but I still have no clue where to start, as there are no complete blog posts covering this use case on the internet. I also tried the npm module okrabyte, but it does not give 100% accurate results. Any guidance or help would be appreciated.

I tried the AWS Textract service as well. It did not help with parsing the PAN card, as the extracted results were completely different from the text on the card.


Solution

  • You need OCR (optical character recognition) for this, and there are several options. Tesseract is open source. I hope this blog helps you get started with Tesseract on Node.js.

    You can also use OCR APIs from cloud providers, for example Microsoft Cognitive Services Vision API or ABBYY Cloud.

    Improving the quality of your image also helps extract text with higher accuracy. Personally, I've seen a big difference between 200 dpi and 600 dpi images.

    Hope this helps!
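
As a starting point, here is a minimal sketch of the Tesseract route in Node.js. It is not a complete solution. It assumes the tesseract.js package is installed (`npm install tesseract.js`, version 2 or later) and uses its `Tesseract.recognize(image, 'eng')` call. The helper names `readPanCard` and `extractPanFields` are made up for illustration. The parser relies on the PAN format of five letters, four digits, and one letter, and it assumes the date on the card is printed as DD/MM/YYYY, so adjust the patterns if your cards differ.

```javascript
// Sketch only: readPanCard and extractPanFields are hypothetical helper names.

// A PAN is 10 characters: five letters, four digits, one letter.
const PAN_PATTERN = /\b[A-Z]{5}[0-9]{4}[A-Z]\b/;
// Assumption: the date on the card is printed as DD/MM/YYYY.
const DATE_PATTERN = /\b\d{2}\/\d{2}\/\d{4}\b/;

// Pull the PAN number and the first date out of raw OCR text.
// Fields that cannot be found are returned as null.
function extractPanFields(ocrText) {
  const text = ocrText.toUpperCase();
  const pan = text.match(PAN_PATTERN);
  const date = text.match(DATE_PATTERN);
  return {
    panNumber: pan ? pan[0] : null,
    dateOfBirth: date ? date[0] : null,
  };
}

// Run OCR on an image file and parse the recognized text.
async function readPanCard(imagePath) {
  // Loaded lazily so the parser above works without the OCR dependency.
  const Tesseract = require('tesseract.js');
  const { data } = await Tesseract.recognize(imagePath, 'eng');
  return extractPanFields(data.text);
}
```

Calling `readPanCard('pan.jpg').then(console.log)` (with your own image path) would print the parsed fields, which you can then use to pre-fill the next screens. OCR output is often noisy, so treat a `null` field as a signal to ask the user to rescan or type the value in.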