algorithm version-control

Storing revision changes of a message


What algorithms and processes are involved in storing revision changes the way Stack Overflow and Wikipedia do?

Is only one full copy of the message kept? If so, is it the latest copy, with only the changes needed to go back to the previous version(s) stored alongside it? (This would make displaying the main message faster.) Or are complete messages stored for every revision? If so, is the comparison between them computed each time the changes are displayed?

What algorithms are best for determining the exact changes in the message? How is this data stored in a database?

If anyone knows exactly what Wikipedia or Stack Overflow does, I'd love to know.


Solution

  • The longest common subsequence algorithm can be used to detect differences between versions, but it is limited. For example, it does not recognize moved text as a move; it sees unrelated removals and insertions instead.

    I suppose that websites normally store the latest copy in full and apply reverse diffs from there to reconstruct older versions. This is also the way CVS works, whereas Subversion uses forward diffs, which results in slower checkouts.

    To store this in a database, one could maintain a main table with the latest version of each article and a separate table with the reverse differences. The second table would have rows in the format (article_id, revision_id, differences).
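
A rough sketch of the scheme described above, in Python with an in-memory SQLite database. Everything named here is an assumption made for illustration: the `articles` and `reverse_diffs` tables, the helper functions, the word-level granularity, and the JSON encoding of the `differences` column. Nothing in it is claimed to be what Wikipedia or Stack Overflow actually does. The diff comes from the standard library's `difflib.SequenceMatcher`, which finds matching blocks with its own heuristic rather than a strict longest-common-subsequence computation.

```python
import difflib
import json
import sqlite3


def make_reverse_diff(new_text, old_text):
    """Encode how to rebuild old_text from new_text (hypothetical format)."""
    new_words, old_words = new_text.split(" "), old_text.split(" ")
    matcher = difflib.SequenceMatcher(a=new_words, b=old_words, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(["copy", i1, i2])              # reuse new_words[i1:i2]
        elif j2 > j1:
            ops.append(["insert", old_words[j1:j2]])  # words only the old text has
    return json.dumps(ops)


def apply_reverse_diff(new_text, diff_json):
    """Rebuild the older text from the newer text plus a stored reverse diff."""
    new_words = new_text.split(" ")
    out = []
    for op in json.loads(diff_json):
        if op[0] == "copy":
            out.extend(new_words[op[1]:op[2]])
        else:
            out.extend(op[1])
    return " ".join(out)


def open_db():
    """Create the two tables: latest versions, and reverse differences."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE articles (article_id INTEGER PRIMARY KEY, "
               "latest_revision INTEGER, body TEXT)")
    db.execute("CREATE TABLE reverse_diffs (article_id INTEGER, "
               "revision_id INTEGER, differences TEXT, "
               "PRIMARY KEY (article_id, revision_id))")
    return db


def save_revision(db, article_id, new_text):
    """Store new_text as the latest version; keep a reverse diff to the old one."""
    row = db.execute("SELECT latest_revision, body FROM articles "
                     "WHERE article_id = ?", (article_id,)).fetchone()
    if row is None:
        db.execute("INSERT INTO articles VALUES (?, 1, ?)", (article_id, new_text))
        return 1
    revision, old_text = row
    # The row for revision N holds the diff that turns revision N+1 back into N.
    db.execute("INSERT INTO reverse_diffs VALUES (?, ?, ?)",
               (article_id, revision, make_reverse_diff(new_text, old_text)))
    db.execute("UPDATE articles SET latest_revision = ?, body = ? "
               "WHERE article_id = ?", (revision + 1, new_text, article_id))
    return revision + 1


def load_revision(db, article_id, wanted):
    """Return the text of revision `wanted`, walking back from the latest copy."""
    revision, text = db.execute(
        "SELECT latest_revision, body FROM articles WHERE article_id = ?",
        (article_id,)).fetchone()
    while revision > wanted:
        revision -= 1
        (diff,) = db.execute(
            "SELECT differences FROM reverse_diffs "
            "WHERE article_id = ? AND revision_id = ?",
            (article_id, revision)).fetchone()
        text = apply_reverse_diff(text, diff)
    return text
```

With this layout, displaying the current message is a single read of `articles.body`, while `load_revision(db, article_id, n)` gets slower the further back `n` is, since it applies one reverse diff per step. That matches the trade-off mentioned above: the latest version is cheap, and old versions cost more to reconstruct.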