I have sometimes heard, especially in the context of information retrieval, search engines, crawlers, etc., that we can detect duplicate pages by hashing the content of a page. What kind of hash functions are able to hash an entire web page (which can easily be a couple of pages long), so that two copies have the same hash output value? What is the size of a typical hash output value?
Are such hash functions also able to put two similar web pages, differing only by slight typos and the like, into the same bucket?
Thanks,
Any hash function, given two inputs x and y such that x = y, will by definition return the same value for them. But if you want to do this kind of duplicate detection properly, you will need either:

- a cryptographic hash (e.g. MD5 or SHA-1), if you only care about detecting exact duplicates, or
- a similarity hash (e.g. SimHash or MinHash), if you also want to catch near-duplicates such as pages that differ only by a few typos.
Which one to use really depends on your needs; cryptographic hashes are useless for near-duplicate detection, since they are designed to map even near-identical inputs to completely different values.
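To make the exact-duplicate case concrete, here is a minimal Python sketch using SHA-256 from the standard library. The normalization step (lower-casing and collapsing whitespace) is just an illustrative assumption; real crawlers usually strip markup and boilerplate before hashing.

```python
import hashlib

def page_fingerprint(html: str) -> str:
    """Return a fixed-size fingerprint of a page's content."""
    # Simplifying assumption: normalize by lower-casing and collapsing whitespace.
    normalized = " ".join(html.lower().split())
    # SHA-256 digests are always 32 bytes (64 hex characters), no matter how
    # long the page is, so even a multi-page document yields one small value.
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Two identical copies map to the same fingerprint...
assert page_fingerprint("Hello  world") == page_fingerprint("hello world")

# ...but a single typo changes the digest completely, which is exactly why a
# cryptographic hash cannot group "similar" pages together.
print(page_fingerprint("Hello world"))
print(page_fingerprint("Hello wrold"))
```

This also answers the size question: typical output sizes are 16 bytes (MD5), 20 bytes (SHA-1), or 32 bytes (SHA-256), independent of input length.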
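For the near-duplicate case, a similarity hash such as SimHash produces fingerprints whose Hamming distance is small for similar documents. The sketch below is a deliberately simplified 64-bit SimHash over whitespace-separated tokens; production systems typically use word shingles, token weights, and more careful text extraction.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a simple 64-bit SimHash over whitespace tokens."""
    vector = [0] * bits
    for token in text.lower().split():
        # Derive a stable per-token hash (MD5 truncated to 64 bits; the exact
        # choice of token hash is arbitrary in this sketch).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1
    # The fingerprint keeps one bit per position: 1 where the weighted sum is positive.
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lasy dog"   # one typo
doc3 = "completely different text about search engine crawlers"

# Near-duplicates end up only a few bits apart; unrelated pages end up far apart.
print(hamming_distance(simhash(doc1), simhash(doc2)))
print(hamming_distance(simhash(doc1), simhash(doc3)))
```

In practice you would then bucket pages whose fingerprints differ by at most a few bits (say 3 out of 64) and treat those buckets as near-duplicate clusters.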