data-structures · time-complexity · heap

Time complexity for finding the k smallest elements in an array of size n


I came across three approaches to this problem.

  1. Sort the array and take the first k elements: O(n log n)
  2. Using a min-heap: heapify in O(n) time, then extract k elements in O(k log n); total: O(n + k log n)
  3. Using a max-heap of capacity k: scan all n elements, pushing each one and popping the largest whenever the heap exceeds k elements: O(n log k)

Approach 1 is obviously the worst. But how do approaches 2 and 3 compare? Does the comparison depend on k being ≪ n?
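
For reference, approach 1 from the list above is essentially a one-liner in JavaScript (a minimal sketch; the name leastKsorted is just for illustration, and the copy avoids mutating the input):

    // Approach 1 (sketch): sort a copy ascending and take the first k values
    const leastKsorted = (arr, k) => [...arr].sort((a, b) => a - b).slice(0, k);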


Solution

  • Using a min-heap, the complexity is indeed O(𝑛 + 𝑘log𝑛).

    Using a max-heap, the complexity can be further refined to O(𝑘 + (𝑛−𝑘)log𝑘), because the initial max-heap is built with O(𝑘) complexity, and the rest of the processing involves 𝑛−𝑘 push/pop operations on that heap.

    We can make some further analysis by fixing 𝑘 to a constant or making it a function of 𝑛. Here are some possibilities:

    | value of 𝑘 | min-heap algorithm complexity | max-heap algorithm complexity | comparison |
    | --- | --- | --- | --- |
    | O(1) | O(𝑛 + 𝑘log𝑛) => O(𝑛) | O(𝑘 + (𝑛−𝑘)log𝑘) => O(𝑛) | equal complexity |
    | O(log𝑛) | O(𝑛 + 𝑘log𝑛) => O(𝑛) | O(𝑘 + (𝑛−𝑘)log𝑘) => O(𝑛log(log𝑛)) | min-heap wins |
    | O(𝑛) | O(𝑛 + 𝑘log𝑛) => O(𝑛log𝑛) | O(𝑘 + (𝑛−𝑘)log𝑘) => O(𝑛) | max-heap wins |

    So we see a mixed result in terms of complexity. But this is very theoretical. For instance, log(log𝑛) is very small in practice. If 𝑛 is 10⁸⁰, which is the order of magnitude of the number of atoms in the observable universe, log₂(log₂𝑛) is close to ... 8!
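
    This is easy to verify directly (a quick sanity check; 10⁸⁰ is still well within double-precision range):

    // The double logarithm grows extremely slowly:
    console.log(Math.log2(Math.log2(1e80))); // ≈ 8.05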

    More importantly, time complexity says absolutely nothing about actual running times. Even if two complexities are equal, that tells us nothing about which algorithm will be faster in practice. It all depends on the actual number of atomic operations performed and how long each of them takes to execute.

    Benchmarking

    To really compare these algorithms in practice, it is necessary to implement both methods and benchmark them, with both using the same underlying heap data structure and algorithm.

    Below I provide a JavaScript implementation (note that JavaScript has no native heap implementation at the time of writing), which logs execution times for an array of about a million values and a varying 𝑘 that doubles in each batch of runs. It takes some time to run:

    // Generic heap implementation with only the methods needed:
    class Heap {
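        // Build the heap in O(n) by sifting every internal node down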
        constructor(values, descending=false) {
            this.descending = descending;
            this.arr = values;
            for (let i = this.arr.length>>1; i--; ) this.siftDown(i);
        }
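        // Sift the given key down from index i until the heap property is restored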
        siftDown(i=0, key=this.arr[i]) {
            if (i < this.arr.length) {
                while (true) {
                    let j = i*2+1;
                    if (j+1 < this.arr.length && (this.arr[j] < this.arr[j+1]) === this.descending) j++;
                    if (j >= this.arr.length || (key >= this.arr[j]) === this.descending) break;
                    this.arr[i] = this.arr[j];
                    i = j;
                }
                this.arr[i] = key;
            }
        }
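        // Remove and return the root; the last leaf takes its place and is sifted down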
        pop() {
            return this.exchange(this.arr.pop());
        }
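        // Equivalent to a push followed by a pop, but without growing the heap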
        pushpop(key) {
            return (key > this.arr[0]) === this.descending ? key : this.exchange(key);
        }
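        // Replace the root with the given key, restore the heap, and return the old root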
        exchange(key) {
            if (!this.arr.length) return key;
            let oldKey = this.arr[0];
            this.siftDown(0, key);
            return oldKey;
        }
    }
    
    // Named helpers to construct a heap
    const MinHeap = (arr) => new Heap(arr, false);
    const MaxHeap = (arr) => new Heap(arr, true);
    
    // The two algorithms to get the least k values from an array
    function leastKminHeap(arr, k) {
        const heap = MinHeap(arr);
        const result = [];
        while (k-- > 0) result.push(heap.pop());
        return result;
    }
    
    function leastKmaxHeap(arr, k) {
        const heap = MaxHeap(arr.slice(0, k));
        for (let i = k; i < arr.length; i++) heap.pushpop(arr[i]);
        return heap.arr;
    }
    
    // Helper: Fisher–Yates shuffle to randomize the input array
    function shuffle(arr) {
        for (let i = arr.length - 1; i > 0; i--) {
            const j = Math.floor(Math.random() * (i + 1));
            [arr[i], arr[j]] = [arr[j], arr[i]]
        }
        return arr;
    }
    
    // Measure time for running the algorithms for given n and k:
    function timeit(n, k) {
        let timeMinHeap = 0, 
            timeMaxHeap = 0;
        for (let repeat = 0; repeat < 3; repeat++) {
            let arr = shuffle([...Array(n).keys()]);
            let start = performance.now();
            leastKmaxHeap(arr, k);
            timeMaxHeap += performance.now() - start;
            start = performance.now();
            leastKminHeap(arr, k);
            timeMinHeap += performance.now() - start;
        }
        return [timeMinHeap, timeMaxHeap];
    }
    
    const pause = () => new Promise(r => setTimeout(r, 20));
    
    // Report time for running the algorithms for increasing k and fixed n
    (async () => {
        const n = 1024000;
        for (let k = 1000; k <= n; k *= 2) {
            const [timeMinHeap, timeMaxHeap] = timeit(n, k);
            console.log(`${k}/${n} => min-heap: ${Math.round(timeMinHeap)}, max-heap: ${Math.round(timeMaxHeap)}, ratio: ${(timeMinHeap / timeMaxHeap).toFixed(2)}`);
            await pause();
        }
    })();
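
    This code assumes a performance global for the timing, which is available in browsers and in recent versions of Node.js (16+) without an import.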

    The output you get from this will be system-dependent, but you should be able to notice a change in the ratios. On my machine the output looks somewhat like this:

    1000/1024000 => min-heap: 45, max-heap: 21, ratio: 2.18
    2000/1024000 => min-heap: 46, max-heap: 23, ratio: 1.97
    4000/1024000 => min-heap: 53, max-heap: 29, ratio: 1.85
    8000/1024000 => min-heap: 41, max-heap: 32, ratio: 1.29
    16000/1024000 => min-heap: 44, max-heap: 41, ratio: 1.08
    32000/1024000 => min-heap: 57, max-heap: 64, ratio: 0.90
    64000/1024000 => min-heap: 112, max-heap: 113, ratio: 0.99
    128000/1024000 => min-heap: 289, max-heap: 339, ratio: 0.85
    256000/1024000 => min-heap: 700, max-heap: 650, ratio: 1.08
    512000/1024000 => min-heap: 1036, max-heap: 598, ratio: 1.73
    1024000/1024000 => min-heap: 993, max-heap: 71, ratio: 13.94
    

    For the smaller values of 𝑘 in this test, the ratio favours the max-heap algorithm.

    It is also clear that the max-heap algorithm does much better when 𝑘 is equal to 𝑛. Here the difference in time complexity becomes noticeable in the execution times: both algorithms build a heap containing all values, but only the min-heap algorithm goes on to pop off every element. You could improve both algorithms to invert the job when 𝑘 is greater than half of 𝑛, as sketched below, but that would only affect the last line of this output.
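
    For instance, a hypothetical wrapper (not part of the benchmark above) could flip the work whenever 𝑘 exceeds 𝑛/2: build a max-heap of all values and pop off only the 𝑛−𝑘 largest, so that no more than half of the elements are ever extracted; the 𝑘 smallest then remain in the heap's backing array, though in no particular order:

    // Hypothetical variant: never extract more than half of the elements.
    // Returns the k smallest values (unordered when k > n/2).
    function leastKeitherHeap(arr, k) {
        if (k <= arr.length / 2) return leastKminHeap(arr, k);
        const heap = MaxHeap(arr);
        // Pop the n−k largest; the k smallest remain in heap.arr
        for (let i = arr.length - k; i > 0; i--) heap.pop();
        return heap.arr;
    }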

    Interestingly, the above output shows there is some range for 𝑘 where the max-heap algorithm performs worse than the min-heap algorithm. The size of this range (if any) will again depend on the actual implementation and configuration on which it is run.

    I hope this illustrates why the max-heap implementation will often have better running times, but also that at some point it could become worse.