I have a string of text highlighted in an ebook. New, revised versions of this ebook come out every couple of years. I want to programmatically re-locate this highlight across all of these updated versions. How would I approach this problem? (Assume I have read access to the original ebook in which the highlight was made.)
Here is what the data structures look like. loc is just a character index with respect to the entire text of the book laid out as a single string. toc is the table of contents.
// a single highlight
{
  "start_loc": 5000,
  "end_loc": 5044,
  "end_loc_of_book": 10000,
  "highlighted_text": "The quick brown fox jumps over the lazy dog.",
  "toc_path": ["Chapter 5: Animal Relationships", "Foxes and dogs"],
}
// an ebook
{
  "toc": [
    {
      "heading_title": "Chapter 1: All work and no play makes Jack a dull boy",
      "heading_start_loc": 0,
      "heading_end_loc": 2000,
      // each heading can have nested subheadings within
      // the range of its start_loc and end_loc
      "subheadings": [
        {
          "heading_title": "Jack is still a dull boy",
          "heading_start_loc": 300,
          "heading_end_loc": 500,
          // each heading can have nested subheadings within
          // the range of its start_loc and end_loc
          "subheadings": []
        },
        // ...
      ]
    },
    // ...
    {
      "heading_title": "Chapter 5: Animal Relationships",
      "heading_start_loc": 4000,
      "heading_end_loc": 6000,
      "subheadings": [
        {
          "heading_title": "Foxes and dogs",
          "heading_start_loc": 4500,
          "heading_end_loc": 5500,
          "subheadings": []
        },
        // ...
      ]
    },
    // ...
  ],
  "full_book_text": "Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore
et dolore magna aliqua. In fermentum et sollicitudin ac
orci phasellus.
...
The quick brown fox jumps over the lazy dog.
...
Praesent semper feugiat nibh sed pulvinar proin. Augue
eget arcu dictum varius duis at consectetur lorem donec.
Adipiscing elit duis tristique sollicitudin."
}
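To make the relationship between toc_path and toc concrete, here is a minimal Python sketch of resolving a highlight's toc_path to a heading and its character range. It assumes the structures above have been loaded as plain dicts and lists; find_heading is a hypothetical helper name used only for illustration.

```python
def find_heading(toc, toc_path):
    """Walk the nested headings by title.

    Returns the heading dict that toc_path points at, or None if any
    title along the path is missing (or if toc_path is empty).
    """
    headings, found = toc, None
    for title in toc_path:
        # Look for the next title among the current level's headings.
        found = next(
            (h for h in headings if h["heading_title"] == title), None
        )
        if found is None:
            return None
        # Descend into that heading's subheadings for the next title.
        headings = found.get("subheadings", [])
    return found


# Trimmed-down version of the example ebook above.
toc = [
    {
        "heading_title": "Chapter 5: Animal Relationships",
        "heading_start_loc": 4000,
        "heading_end_loc": 6000,
        "subheadings": [
            {
                "heading_title": "Foxes and dogs",
                "heading_start_loc": 4500,
                "heading_end_loc": 5500,
                "subheadings": [],
            }
        ],
    }
]

section = find_heading(
    toc, ["Chapter 5: Animal Relationships", "Foxes and dogs"]
)
```

With the example data, section covers locs 4500 to 5500, which contains the highlight at 5000 to 5044.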
The solution to this problem is fuzzy anchoring, as described by Hypothesis (hypothes.is).
In short, save several selectors that do not depend on the document's structure, then use an approximate-matching strategy to make an educated guess about the highlight's location in the new document.
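As a rough illustration of that idea, the sketch below stores the quoted text plus a little surrounding context, then looks for it in a new version: exact match first, fuzzy match as a fallback. It is not the Hypothesis implementation. The names make_anchor, relocate, CONTEXT and min_ratio are made up for this example, the 0.7 threshold is an arbitrary assumption, and the fuzzy step uses Python's standard difflib rather than a dedicated approximate-search library.

```python
from difflib import SequenceMatcher

CONTEXT = 32  # characters of context kept on each side (arbitrary choice)


def make_anchor(text, start, end, context=CONTEXT):
    """Save selectors that do not depend on document structure."""
    return {
        "exact": text[start:end],
        "prefix": text[max(0, start - context):start],
        "suffix": text[end:end + context],
    }


def _similarity(a, b):
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def relocate(anchor, new_text, min_ratio=0.7):
    """Return a (start, end) guess in new_text, or None if nothing is close."""
    quote = anchor["exact"]
    if not quote:
        return None

    # Step 1: exact matches. If the quote occurs several times, prefer
    # the occurrence whose surroundings look most like the original's.
    hits = []
    i = new_text.find(quote)
    while i != -1:
        hits.append(i)
        i = new_text.find(quote, i + 1)
    if hits:
        def context_score(pos):
            before = new_text[max(0, pos - len(anchor["prefix"])):pos]
            after_start = pos + len(quote)
            after = new_text[after_start:after_start + len(anchor["suffix"])]
            return (_similarity(anchor["prefix"], before)
                    + _similarity(anchor["suffix"], after))
        best = max(hits, key=context_score)
        return best, best + len(quote)

    # Step 2: fuzzy fallback. Find the longest run the quote shares with
    # the new text, align a quote-sized window on it, and accept the
    # window only if it is similar enough to the original quote.
    matcher = SequenceMatcher(None, new_text, quote, autojunk=False)
    m = matcher.find_longest_match(0, len(new_text), 0, len(quote))
    if m.size == 0:
        return None
    start = max(0, m.a - m.b)
    end = min(len(new_text), start + len(quote))
    if _similarity(quote, new_text[start:end]) >= min_ratio:
        return start, end
    return None
```

With the data structures from the question, the anchor would be built once from the original book, e.g. make_anchor(full_book_text, 5000, 5044), and relocate would then be called against the full text of each revised edition. The full fuzzy-anchoring strategy has more parts than this sketch covers.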
This consists of: