pythonlist

Find the most common element in a list


What is an efficient way to find the most common element in a Python list?

My list items may not be hashable so can't use a dictionary. Also in case of draws the item with the lowest index should be returned. Example:

>>> most_common(['duck', 'duck', 'goose'])
'duck'
>>> most_common(['goose', 'duck', 'duck', 'goose'])
'goose'

Solution

  • With so many solutions proposed, I'm amazed nobody's proposed what I'd consider an obvious one (for non-hashable but comparable elements) -- [itertools.groupby][1]. itertools offers fast, reusable functionality, and lets you delegate some tricky logic to well-tested standard library components. Consider for example:

    import itertools
    import operator
    
    def most_common(L):
      # get an iterable of (item, iterable) pairs
      SL = sorted((x, i) for i, x in enumerate(L))
      # print 'SL:', SL
      groups = itertools.groupby(SL, key=operator.itemgetter(0))
      # auxiliary function to get "quality" for an item
      def _auxfun(g):
        item, iterable = g
        count = 0
        min_index = len(L)
        for _, where in iterable:
          count += 1
          min_index = min(min_index, where)
        # print 'item %r, count %r, minind %r' % (item, count, min_index)
        return count, -min_index
      # pick the highest-count/earliest item
      return max(groups, key=_auxfun)[0]
    

    This could be written more concisely, of course, but I'm aiming for maximal clarity. The two print statements can be uncommented to better see the machinery in action; for example, with prints uncommented:

    print most_common(['goose', 'duck', 'duck', 'goose'])
    

    emits:

    SL: [('duck', 1), ('duck', 2), ('goose', 0), ('goose', 3)]
    item 'duck', count 2, minind 1
    item 'goose', count 2, minind 0
    goose
    

    As you see, SL is a list of pairs, each pair an item followed by the item's index in the original list (to implement the key condition that, if the "most common" items with the same highest count are > 1, the result must be the earliest-occurring one).

    groupby groups by the item only (via operator.itemgetter). The auxiliary function, called once per grouping during the max computation, receives and internally unpacks a group - a tuple with two items (item, iterable) where the iterable's items are also two-item tuples, (item, original index) [[the items of SL]].

    Then the auxiliary function uses a loop to determine both the count of entries in the group's iterable, and the minimum original index; it returns those as combined "quality key", with the min index sign-changed so the max operation will consider "better" those items that occurred earlier in the original list.

    This code could be much simpler if it worried a little less about big-O issues in time and space, e.g....:

    def most_common(L):
      groups = itertools.groupby(sorted(L))
      def _auxfun((item, iterable)):
        return len(list(iterable)), -L.index(item)
      return max(groups, key=_auxfun)[0]
    

    same basic idea, just expressed more simply and compactly... but, alas, an extra O(N) auxiliary space (to embody the groups' iterables to lists) and O(N squared) time (to get the L.index of every item). While premature optimization is the root of all evil in programming, deliberately picking an O(N squared) approach when an O(N log N) one is available just goes too much against the grain of scalability!-)

    Finally, for those who prefer "oneliners" to clarity and performance, a bonus 1-liner version with suitably mangled names:-).

    from itertools import groupby as g
    def most_common_oneliner(L):
      return max(g(sorted(L)), key=lambda(x, v):(len(list(v)),-L.index(x)))[0]