pythonpython-3.xheapheapq

Does heapq.heappush() compare on int AND string without been specified for?


I was checking out this solution to a question on leetcode.com

def topKFrequent(self, words, k):
        count = collections.Counter(words)
        heap = [(-freq, word) for word, freq in count.items()]
        heapq.heapify(heap)
        return [heapq.heappop(heap)[1] for _ in xrange(k)]

and when I provide it an array of strings like ["aa", "aaa", "a"] and 1 it correctly returns ["a"]. My question is did the heap also lexographically sort the tuples internally? Because according to me if there was no sorting, it would have simply returned ["aa"] (the order in which the heap was constructed since counts of all three are the same). Or have I misunderstood heapq?


Solution

  • heapq just compares values from the queue using using the "less than" operator [1] regardless of the type of the value. It is the type of the value that defines what the comparison will return. So, what makes the difference here is the tuple itself. As from the documentation:

    The comparison [of sequence objects] uses lexicographical ordering: first the first two items are compared, and if they differ this determines the outcome of the comparison; if they are equal, the next two items are compared, and so on, until either sequence is exhausted.

    Checking some examples:

    >>> (0, 'a') < (1, 'aa')
    True
    >>> (1, 'a') < (1, 'aa')
    True
    >>> (1, 'aa') < (1, 'a')
    False
    >>> (2, 'a') < (1, 'aa')
    False
    

    So you are right, the values are ordered lexicographically and the second value of the tuple is relevant. However, heapq does not have to do anything here to get this result, the mere tuple comparison does that.

    [1] One can check it in the code. Here is one of the lines where the comparison is made by heapq (in C):

    cmp = PyObject_RichCompareBool(newitem, parent, Py_LT);
    

    This PyObject_RichCompareBool() is, according to the documentation:

    the equivalent of the Python expression o1 op o2, where op is the operator corresponding to opid.