pythonpython-3.xpython-2.7heapheapq

Python heapify() time complexity


def heapify(A):
    for root in xrange(len(A)//2-1, -1, -1):
        rootVal = A[root]
        child = 2*root+1
        while child < len(A):
            if child+1 < len(A) and A[child] > A[child+1]:
                child += 1
            if rootVal <= A[child]:
                break
            A[child], A[(child-1)//2] = A[(child-1)//2], A[child]
            child = child *2 + 1

This is a similar implementation of python heapq.heapify(). It is said in the doc this function runs in O(n). But it looks like for n/2 elements, it does log(n) operations. Why is it O(n)?


Solution

  • It requires more careful analysis, such as you'll find here. The basic insight is that only the root of the heap actually has depth log2(len(a)). Down at the nodes one above a leaf - where half the nodes live - a leaf is hit on the first inner-loop iteration.

    "Exact" derivation

    Waving hands some, when the algorithm is looking at a node at the root of a subtree with N elements, there are about N/2 elements in each subtree, and then it takes work proportional to log(N) to merge the root and those sub-heaps into a single heap. So the total time T(N) required is about

    T(N) = 2*T(N/2) + O(log(N))
    

    That's an uncommon recurrence. The Akra–Bazzi method can be used to deduce that it's O(N), though.

    I think more informative, and certainly more satifsying, is to derive an exact solution from scratch. Toward that end, I'll only talk about complete binary trees: as full as possible on every level. Then there 2**N - 1 elements in total, and all subtrees are also complete binary trees. This sidesteps mounds of pointless details about how to proceed when things aren't exactly balanced.

    When we're looking at a subtree with 2**k - 1 elements, its two subtrees have exactly 2**(k-1) - 1 elements each, and there are k levels. For example, for a tree with 7 elements, there's 1 element at the root, 2 elements on the second level, and 4 on the third. After the subtrees are heapified, the root has to moved into place, moving it down 0, 1, or 2 levels. This requires doing comparisons between levels 0 and 1, and possibly also between levels 1 and 2 (if the root needs to move down), but no more that that: the work required is proportional to k-1. In all, then,

    T(2**k - 1) = 2 * T(2**(k-1) - 1) + (k - 1)*C
    

    for some constant C bounding the worst case for comparing elements at a pair of adjacent levels.

    What about T(1)? That's free! A tree with only 1 element is a already a heap - there's nothing to do.

    T(1) = 0
    

    One level above those leaves, trees have 3 elements. It costs (no more than) C to move the smallest (for a min-heap; largest for a max-heap) to the top.

    T(3) = C
    

    One level above that trees have 7 elements. It costs T(3) to heapify each of the subtrees, and then no more than 2*C to move the root into place:

    T(7) = 2*C + 2*C = 4*C
    

    Continuing in the same way:

    T(15) = 2* 4*C + 3*C = 11*C
    T(31) = 2*11*C + 4*C = 26*C
    T(63) = 2*26*C + 5*C = 57*C
    ...
    T(2**k - 1) = (2**k - k - 1)*C
    

    where the last line is a guess at the general form. You can verify that "it works" for all the specific lines before it, and then it's straightforward to prove it by induction.

    So, where N = 2**k - 1,

    T(N) = (N - log2(N+1)) * C
    

    which shows that T(N) is bounded above by C*N, so is certainly O(N).