pythonlistperformancecoding-efficiency

Difference between removing duplicates from a list using dict and set?


According to my research there are two easy ways to remove duplicates from a list:

a = list(dict.fromkeys(a))

and

a = list(set(a))

Is one of them more efficient than the other?


Solution

  • Definitely the second one is more efficient as sets are more or less created for that purpose and you skip the overhead related to creation of dict which is way heavier. Perfomance-wise it definitely depends on what the payload actually is.

    import timeit
    import random
    
    input_data = [random.choice(range(100)) for i in range(1000)]
    
    from_keys = timeit.timeit('list(dict.fromkeys(input_data))', number=10000, globals={'input_data': input_data})
    from_set = timeit.timeit('list(set(input_data))', number=10000, globals={'input_data': input_data})
    
    print(f"From keys performance: {from_keys:.3f}")
    print(f"From set performance: {from_set:.3f}")
    

    Prints:

    From keys performance: 0.230
    From set performance: 0.140
    

    It doesn't really mean it's almost twice as fast. The difference is barely visible. Try it for yourself with different random data.