c++ ocr tesseract bounding-box document-layout-analysis

Tesseract: How to export text and bounding boxes?


I'd like to convert document images to XML and also export the location of each word within a page. To access bounding box information, Tesseract's layout analysis can be used:

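 tess.SetImage(...);
 tess.SetPageSegMode(tesseract::PSM_AUTO_OSD);
 tesseract::PageIterator* it = tess.AnalyseLayout();
 if (it != NULL)
 {
      do  // do/while so that the first word is not skipped
      {
           int top, bottom, left, right;
           // Box of the current word in image coordinates.
           it->BoundingBox(tesseract::RIL_WORD, &left, &top, &right, &bottom);
      } while (it->Next(tesseract::RIL_WORD));
      delete it;  // the iterator returned by AnalyseLayout() is owned by the caller
 }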

At that point, however, I don't know the actual content of a bounding box. Executing the following code performs OCR on the whole image, so text contains the text of the entire page.

 tess.Recognize(0);
 char* raw = tess.GetUTF8Text();  // the caller must free this buffer with delete[]
 std::string text(raw);
 delete[] raw;

Currently I store all bounding boxes in a vector, cut the corresponding subimage out of the original image for each box, and run OCR on each subimage separately. This basically works, but compared with the Tesseract command line tool it produces far more OCR errors.

Therefore I'd like to know how I can iterate through the OCR result word by word and get the corresponding bounding box for each word.


Solution

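  • tess.Recognize(0);
    
    // page_res_ is a protected member of TessBaseAPI, so this loop has to
    // live in a class derived from it.
    PAGE_RES_IT resultIter(page_res_);
    
    for (resultIter.start_page(false); resultIter.block() != NULL; resultIter.forward())
    {
        WERD_RES* wordResult = resultIter.word();
        WERD_CHOICE* word = wordResult->best_choice;  // recognized text of this word
    
        // bounding_box() returns a TBOX by value, so store a copy.
        // TBOX uses Tesseract's internal coordinates (origin at the bottom left).
        TBOX box = wordResult->word->bounding_box();
    }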
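
As an alternative to the internal PAGE_RES_IT, the same word-by-word walk can probably be done through the public iterator interface and then written out as XML, which was the goal in the question. The following is only a sketch under assumptions: WordBox, collectWords, escapeXml and toXml are hypothetical helper names, the XML element and attribute names are made up for illustration, and the iterator is assumed to be the tesseract::ResultIterator returned by tess.GetIterator() after tess.Recognize(0), used at level tesseract::RIL_WORD. The walk is written as a template so that it does not depend on the Tesseract headers.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One recognized word plus its bounding box (hypothetical helper type).
struct WordBox {
    std::string text;
    int left, top, right, bottom;
};

// Walks any iterator that offers GetUTF8Text, BoundingBox and Next at the
// given level. Assumption: with Tesseract this is the ResultIterator from
// tess.GetIterator(), and level is tesseract::RIL_WORD.
template <typename Iterator, typename Level>
std::vector<WordBox> collectWords(Iterator& it, Level level) {
    std::vector<WordBox> words;
    do {
        char* raw = it.GetUTF8Text(level);  // buffer is owned by the caller
        if (raw == nullptr) continue;       // no text here, move to the next element
        WordBox w{raw, 0, 0, 0, 0};
        delete[] raw;
        if (it.BoundingBox(level, &w.left, &w.top, &w.right, &w.bottom))
            words.push_back(w);
    } while (it.Next(level));
    return words;
}

// Escapes the characters that are not allowed verbatim in XML text.
std::string escapeXml(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Serializes the words; the <page>/<word> layout is an illustrative choice.
std::string toXml(const std::vector<WordBox>& words) {
    std::ostringstream xml;
    xml << "<page>\n";
    for (const WordBox& w : words) {
        xml << "  <word left=\"" << w.left << "\" top=\"" << w.top
            << "\" right=\"" << w.right << "\" bottom=\"" << w.bottom
            << "\">" << escapeXml(w.text) << "</word>\n";
    }
    xml << "</page>\n";
    return xml.str();
}
```

With a real Tesseract instance, the call would presumably look like `collectWords(*ri, tesseract::RIL_WORD)` where `ri` is the non-null result of `tess.GetIterator()`, and `ri` should be deleted afterwards. Because recognition runs once on the whole page, this avoids re-running OCR on cropped subimages.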