I am working on a project that extracts the content of a given image, compares it with the other images in a repository, and lists the matching images.
What would be the right approach so that the search does not slow down as the repository grows?
As a first level of filtering, I was planning to use an image querying technique (CBIR, content-based image retrieval) to retrieve the images that match the pattern of the given image. I would then run OCR on those candidates to get their content and check for a match.
Please let me know if there is any better approach for this.
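To make the first-level filter concrete, here is a rough sketch of one simple CBIR-style technique, average hashing. This is only an assumption about what the filter could look like, not something I have settled on. It assumes each image has already been reduced to a small grayscale grid (for example 8x8) by some other tool, and the names average_hash, hamming and candidates are made up for illustration.

```python
from typing import Dict, List


def average_hash(pixels: List[List[int]]) -> int:
    """Pack a small grayscale grid into an integer, one bit per pixel.

    A bit is 1 when the pixel is brighter than the grid's mean value.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def candidates(query_hash: int, index: Dict[str, int], max_distance: int = 5) -> List[str]:
    """Return the names of stored images whose hash is close to the query.

    `index` maps an image name to its precomputed hash (hypothetical layout).
    """
    return [name for name, h in index.items() if hamming(query_hash, h) <= max_distance]
```

Only the images returned by candidates would need to go through OCR. The linear scan over the index is the simplest option; a larger repository would probably need a smarter lookup structure.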
Steps done
Software used:
1. Tesseract OCR
2. ImageMagick (for image cleaning)
3. Textcleaner script
Detected the image orientation using ImageMagick.
Ran OCR on the image to get the text, then filtered the text to extract the bill number, date and amount.
The extracted data is saved and used by the search feature to detect duplicates.
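The last two steps can be sketched roughly as below. The regular expressions and label words ("Bill No", "Total", "Rs." and so on) are assumptions about how the bills are laid out, and extract_fields and BillIndex are illustrative names, not my actual code. The point is that duplicates are found by a dictionary lookup on the extracted fields, so the check does not have to compare against every stored image.

```python
import re
from typing import Dict, Optional, Tuple

# Assumed label formats; real bills will need patterns tuned to their layout.
BILL_NO = re.compile(
    r"(?:bill|invoice)\s*(?:number|no|#)\.?\s*[:\-]?\s*([A-Z0-9\-/]+)", re.IGNORECASE
)
DATE = re.compile(r"\b(\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4})\b")
AMOUNT = re.compile(
    r"(?:total|amount)\s*[:\-]?\s*(?:rs\.?|inr|\$)?\s*([\d,]+\.\d{2})", re.IGNORECASE
)


def _first(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the first captured group of the first match, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None


def extract_fields(ocr_text: str) -> Dict[str, Optional[str]]:
    """Pull bill number, date and amount out of raw OCR text."""
    bill_no = _first(BILL_NO, ocr_text)
    amount = _first(AMOUNT, ocr_text)
    return {
        "bill_no": bill_no.upper() if bill_no else None,
        "date": _first(DATE, ocr_text),
        # Drop thousands separators so "1,250.00" and "1250.00" compare equal.
        "amount": amount.replace(",", "") if amount else None,
    }


class BillIndex:
    """In-memory duplicate check keyed on (bill_no, date, amount)."""

    def __init__(self) -> None:
        self._seen: Dict[Tuple[str, str, str], str] = {}

    def add(self, image_name: str, fields: Dict[str, Optional[str]]) -> Optional[str]:
        """Store the record; return the earlier image name if it is a duplicate."""
        bill_no, date, amount = fields["bill_no"], fields["date"], fields["amount"]
        if bill_no is None or date is None or amount is None:
            raise ValueError("incomplete OCR fields, cannot index this image")
        key = (bill_no, date, amount)
        existing = self._seen.get(key)
        if existing is not None:
            return existing
        self._seen[key] = image_name
        return None
```

In a real setup the dictionary would be replaced by a database table with a unique index on the same three columns, which keeps the lookup fast as the number of bills grows.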