search text-search sentence-similarity

How do I locate the same string of text across different revisions of the same text (an ebook)?


I have a string of text highlighted in an ebook. New, revised versions of this ebook come out every couple of years. I want to programmatically re-locate this highlight in each of these updated versions. How would I approach this problem? (Assume I have read access to the original ebook in which the highlight was made.)


Here is what the data structures look like. loc is just a character index into the entire text of the book laid out as a single string. toc is the table of contents.

// a single highlight
{
  "start_loc": 5000,
  "end_loc": 5044,
  "end_loc_of_book": 10000,
  "highlighted_text": "The quick brown fox jumps over the lazy dog.",
  "toc_path": ["Chapter 5: Animal Relationships", "Foxes and dogs"],
}

// an ebook
{
  "toc": [
    {
      "heading_title": "Chapter 1: All work and no play makes Jack a dull boy",
      "heading_start_loc": 0,
      "heading_end_loc": 2000,
      // each heading can have nested subheadings within
      // the range of its start_loc and end_loc
      "subheadings": [
        {
          "heading_title": "Jack is still a dull boy",
          "heading_start_loc": 300,
          "heading_end_loc": 500,
          // each heading can have nested subheadings within
          // the range of its start_loc and end_loc
          "subheadings": []
        },
        // ...
      ]
    },
    // ...
    {
      "heading_title": "Chapter 5: Animal Relationships",
      "heading_start_loc": 4000,
      "heading_end_loc": 6000,
      "subheadings": [
        {
          "heading_title": "Foxes and dogs",
          "heading_start_loc": 4500,
          "heading_end_loc": 5500,
          "subheadings": []
        },
        // ...
      ]
    },
    // ...
  ],
  "full_book_text": "Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore
et dolore magna aliqua. In fermentum et sollicitudin ac 
orci phasellus.

...

The quick brown fox jumps over the lazy dog.

...

Praesent semper feugiat nibh sed pulvinar proin. Augue 
eget arcu dictum varius duis at consectetur lorem donec.
Adipiscing elit duis tristique sollicitudin."
}


Solution

  • The solution to this problem is fuzzy anchoring, as described by Hypothesis (hypothes.is).

    In short, save several document-structure-independent selectors for each highlight, then use an approximate-matching strategy to make an educated guess about the highlight's location in the new document.

    The saved selectors consist of:

    1. an XPath selector pointing to the element in the original document
    2. the start and end offsets with respect to the full text of the original document
    3. the 32 chars prefixing the original highlight and the 32 chars suffixing the original highlight
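
The selectors above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Hypothesis's actual implementation: the function names (make_anchor, relocate), the min_score threshold of 0.6, and the use of the standard-library difflib module for approximate matching are all choices made for this example. The XPath selector is left out because the question's ebook is a single flat string. The sketch tries the stored offsets first, then an exact search disambiguated by the 32-char prefix and suffix, then a fuzzy fallback.

```python
from difflib import SequenceMatcher

CONTEXT = 32  # chars of prefix/suffix stored alongside the highlight


def make_anchor(text, start, end):
    """Capture offsets, the quoted text, and its surrounding context."""
    return {
        "start": start,
        "end": end,
        "exact": text[start:end],
        "prefix": text[max(0, start - CONTEXT):start],
        "suffix": text[end:end + CONTEXT],
    }


def _similarity(a, b):
    # Ratio in [0, 1]; autojunk is disabled so long texts are compared fully.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def relocate(anchor, new_text, min_score=0.6):
    """Return (start, end) of the best guess in new_text, or None."""
    exact = anchor["exact"]

    # 1. Cheapest check: the old offsets still hold the same text.
    if new_text[anchor["start"]:anchor["end"]] == exact:
        return anchor["start"], anchor["end"]

    # 2. The quote survives verbatim: collect every occurrence and pick
    #    the one whose surrounding context best matches the stored one.
    candidates = []
    pos = new_text.find(exact)
    while pos != -1:
        candidates.append(pos)
        pos = new_text.find(exact, pos + 1)
    if candidates:
        def score(p):
            before = new_text[max(0, p - CONTEXT):p]
            after = new_text[p + len(exact):p + len(exact) + CONTEXT]
            context = (_similarity(before, anchor["prefix"])
                       + _similarity(after, anchor["suffix"]))
            # Ties are broken by closeness to the original offset.
            return (context, -abs(p - anchor["start"]))
        best = max(candidates, key=score)
        return best, best + len(exact)

    # 3. Fuzzy fallback: align on the longest block shared by the quote
    #    and the new text, then score a quote-sized window around it.
    matcher = SequenceMatcher(None, exact, new_text, autojunk=False)
    m = matcher.find_longest_match(0, len(exact), 0, len(new_text))
    if m.size == 0:
        return None
    start = max(0, m.b - m.a)
    end = min(len(new_text), start + len(exact))
    if _similarity(new_text[start:end], exact) >= min_score:
        return start, end
    return None
```

With the highlight from the question, this would be called as anchor = make_anchor(old_book["full_book_text"], 5000, 5044) followed by relocate(anchor, new_book["full_book_text"]). The fuzzy step returns a window of the original quote's length, so a heavily edited sentence may come back slightly truncated or padded. Restricting the search to the range of the matching toc_path heading in the new edition would be a natural refinement.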