python, dictionary, sorting, time-complexity

Why is using a dictionary slower than sorting a list to generate a frequency array?


I was solving this problem on Codeforces (2057B - Gorilla and the Exam), which requires the frequencies of the distinct numbers in a list. First I computed them with a dictionary, but I got a TLE (Time Limit Exceeded) verdict. After reading the editorial I reimplemented it by sorting the input list and comparing each element with the previous one to build the frequency array. This was faster than the dictionary approach, yet sorting is an O(n log n) operation, which should take more time than building a dictionary, which only requires O(n) time. Why is the dictionary version slower?

Dictionary approach (slower)

t = int(input())
for _ in range(t):
    n, k = map(int, input().split())
    arr = map(int, input().split())
    frequency_table = {}
    for num in arr:
        try:
            frequency_table[num] += 1
        except KeyError:
            frequency_table[num] = 1
    freqs = sorted(frequency_table.values(), reverse=True)
    while k > 0:
        if k >= freqs[-1]:
            k -= freqs.pop()
        else:
            break
    print(max(len(freqs), 1))
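As an aside, the manual try/except counting can be written more idiomatically with `collections.Counter` (the values below are a made-up example; note that `Counter` is a dict subclass, so with int keys it is just as exposed to the issue discussed in the answer below):

```python
from collections import Counter

# Counter builds the same frequency table as the try/except loop.
arr = [3, 1, 3, 2, 3, 1]          # example data standing in for the parsed input
frequency_table = Counter(arr)    # {3: 3, 1: 2, 2: 1}
freqs = sorted(frequency_table.values(), reverse=True)
print(freqs)  # [3, 2, 1]
```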

Sorting based approach (faster)

t = int(input())
for _ in range(t):
    n, k = map(int, input().split())
    arr = sorted(map(int, input().split()))
    freq = [1]
    for i in range(1, len(arr)):
        if arr[i] == arr[i-1]:
            freq[-1] += 1
        else:
            freq.append(1)
    freqs = sorted(freq, reverse=True)
    while k > 0:
        if k >= freqs[-1]:
            k -= freqs.pop()
        else:
            break
    print(max(len(freqs), 1))

Solution

  • Why is using a dictionary slower than sorting a list to generate a frequency array?

    Apparently because one input is specially crafted to attack CPython's dict implementation, so that building your dict doesn't take the expected linear time: it can degrade to quadratic time, and that is very likely what happened here. See Python's TimeComplexity wiki page (which lists O(n) amortized worst-case time per dict operation) and the tutorial "How to hack hash table in Python 3" on Codeforces itself.
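A quick way to see why int keys are attackable (CPython behavior): small ints hash to themselves, so bucket indices are fully predictable, and keys congruent modulo the table size all collide. The 8-slot table below is a hypothetical illustration, not CPython's actual table size:

```python
# In CPython, small ints hash to themselves — fully predictable.
print(hash(1), hash(42), hash(10**9))   # 1 42 1000000000

# Keys that are congruent modulo the table size share a bucket.
# With a hypothetical 8-slot table, these four keys all land in slot 0:
keys = [0, 8, 16, 24]
print([hash(k) % 8 for k in keys])      # [0, 0, 0, 0]
```

An adversary who knows the hash function can emit thousands of such keys, forcing every insertion to probe a long collision chain.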

    One way to defeat it is to simply add 1 to all the numbers, i.e., change

        for num in arr:
    

    to this:

        for num in arr:
            num += 1
    

    Then the solution no longer gets stopped and rejected at the 1 second time limit but gets accepted in around 0.18 seconds, despite doing more work, and without really changing the data the way randomizing or sorting would. The effect is simply that Python's collision resolution takes different paths, which is enough to thwart the brittle attack.
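In context, the counting loop with the shift applied looks like the sketch below (variable names follow the question's code; the hardcoded `arr` stands in for the parsed input):

```python
arr = [5, 5, 7]  # stands in for map(int, input().split())
frequency_table = {}
for num in arr:
    num += 1  # shift every key by 1, perturbing the crafted collision pattern
    try:
        frequency_table[num] += 1
    except KeyError:
        frequency_table[num] = 1
print(frequency_table)  # {6: 2, 8: 1}
```

The frequencies are unchanged; only the dict keys differ, which is all that matters for defeating the attack.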

    Another way, also accepted in around 0.18 seconds, is to not convert to ints at all and count the strings instead. I.e., change

        arr = map(int, input().split())
    

    to this:

        arr = input().split()
    

    That's also safer, as Python deliberately and unpredictably randomizes string hashes exactly to defeat such attacks:

    the __hash__() values of str and bytes objects are “salted” with an unpredictable random value. [...] This is intended to provide protection against a denial-of-service caused by carefully chosen inputs that exploit the worst case performance of a dict insertion, O(n²) complexity.