pythonsetordereddict

Python: How to sort items and remove duplicates in one pass using custom comparator?


I have a list of tuples, which i need to sort by their second element and remove duplicates.

Example input:

[
    ("1", "b"),
    ("2", "e"),
    ("2", "e"),
    ("3", "d"),
    ("3", "c"),
    ("4", "a"),
    ("5", "a"),
]

expected output:

[
    ("5", "a"),
    ("4", "a"),
    ("1", "b"),
    ("3", "c"),
    ("3", "d"),
    ("2", "e"),
]

The common answer to this problem is

myList = sorted( set(myList), key = lambda x: x[1] )

But this doesn't seem like the best way to do it, because first constructs a set from a list, then it constructs back a list from a set (with random different order) and then it applies sorting algorithm to the list.

In other languages that have better control over the containers and their implementations i would do this using ordered set, because that container will remove duplicates and sort the elements at the same time. I would like to write something like this.

myList = list( ordered_set( myList, key = lambda x: x[1] ) )

But in Python this seems like an overkill task. There is no ordered_set container in Python, or at least i haven't found one, and the OrderedDict added later in Python 3 does not seem to accept custom comparison lambdas.

Does anyone know an easy way to do this?


Solution

  • You can do this by sorting the list first, and then applying itertools.groupby. groupby groups the input iterable of rows according to the key, and outputs an iterable of tuples containing the group key value and the rows in that group.

    Also, don't forget that itemgetter from the operators module is the perfect way of expressing both a sorting and a grouping key in most cases:

    [k for k, _ in itertools.groupby(sorted(myList, key=itemgetter(1, 0)), key=itemgetter(0, 1))]
    

    Notice that we sort first by the right hand value (index 1) in the tuple, but use the left hand value (index 0) as a tie-breaker. This is necessary, otherwise the final sort order might include something like ("a", 3), ("b", 3), ("a", 3) and that duplicate ("a", 3) would then not be removed.

    If you only want to iterate through the result once then you can (and should) write this as a generator expression, which avoids the overhead of instantiating a second list:

    g = (k for k, _ in itertools.groupby(sorted(myList, key=itemgetter(1, 0)), key=itemgetter(0, 1)))
    for a, b in g:
        print(a, b)