I have been reading OCR papers such as this one https://arxiv.org/pdf/1704.08628.pdf , and I am have trouble finding out how these datasets are actually generated.
In the linked paper, they use a regressor to predict the start location (a point) and height of a line of text. Then, based on that starting point and height, a second network performs OCR and end of line detection. I realize this is a very simplified explanation, but it follows that their dataset consists (at least in part) of full page text 'images' annotated with where each line begins, and then a transcription of the text on a given line. Alternatively, they could have just used the lower left point of bounding boxes as the start point and the height of the box as the word height (avoiding the need to re-annotate if the data was previously prepared using bounding boxes).
So how is a dataset like this actually created? Looking at other datasets it seems like there is some software that can create XML files containing the ground truths relevant to each image, can someone point me in the right direction? I've been googling around and finding lots of tools for annotating text with sentiment etc and other tools for annotating images for segmentation (for something like a YOLO network), but I'm coming up empty for the creation something like the Maurdoor dataset used in the linked paper.
Thank you
So after submitting this, the related threads window showed me many threads that my googling did not turn up. This http://www.prima.cse.salford.ac.uk/tools software seems to be what I was looking for, but I would still love to hear other ideas.