algorithm

I don't understand sentence quoted from `Programming Pearls` book regarding algo implementation


Question from Programming Pearls, 2nd edition:

Given a sequential file containing 4,300,000,000 32-bit integers, how can you find one that appears at least twice?

Solution provided in the book:

Binary search find an element that occurs at least twice by recursively searching the subinterval that contains more than half of the integers. My original solution did not guarantee that the number of integers is halved in each iteration, so the worst case run time of its log2 n passes was proportional to n·log n. Jim Saxe reduced that to linear time by observing that the search can avoid carrying too many duplicates.

When his search knows that a duplicate must be in a current range of m integers, it will only store m+1 integers on its current work tape;

If more integers would have gone on the tape, his program discards them. Although his method frequently ignores input variables, its strategy is conservative enough to ensure that it finds at least one duplicate.


Above is content from the book. I don't understand the sentences quoted. How exactly can it be implemented? I mean, how can he know that "a duplicate must be in the current range of m integers"?


Solution

  • I think it refers to the pigeonhole principle, that if the difference between the minimum and the maximum element in a set is less than the cardinality of the set, there must be a duplicate.

    And you can check this as you're building your subsets and stop building them as soon as you're certain a duplicate has to exist in that subset.