pythonarraysdata-structuresduplicateshashset

Why is a HashSet faster than an Array in this Duplicate Integers problem?


I am new to learning data structures and algorithms and am starting NeetCode 150 problems. The first problem I encountered was not difficult, I understand why it works, but I am curious to why the best solution involves the use of a HashSet instead of an array.

I'm sure it comes back to time complexity which is a topic I am struggling a bit to grasp, so I am hoping someone can give me a simple breakdown as to why one solution is better than the other, and if it would be really bad to use an array here instead of a HashSet (if I used it in an interview would they not like it).

This is the solution I came up with. It is almost identical other than the use of arrays and .append() instead of HashSet's and .add(). The solution was technically correct but based on the research I did the HashSet solution appears to better/faster, and I would like some help understanding why.

class Solution:
    def hasDuplicate(self, nums: List[int]) -> bool:
        newArray = []
        for i in nums:
            if i in newArray:
                return True
            newArray.append(i)    
        return False

Here is the actual best solution.

class Solution:
    def hasDuplicate(self, nums: List[int]) -> bool:
        hashset = set()

        for n in nums:
            if n in hashset:
                return True
            hashset.add(n)
        return False

What are the differences? Also what impact would adding nums.sort() have on the first approach? Would that alter the run time in any way? Sorry if these questions are painfully obvious I just need some help wrapping me head around the concepts.


Solution

  • Let examine the use of the in operator in the if statement.

    For an array, the in operation is an O(n) operation - the array elements are searched sequentially, and in the worst case (i.e., if the item being searched for is not in the array), it goes over all of it. For a hash set, the in operation is an O(1) operation - the item in question is hashed, and checked against the right bucket of the hash (caveat: this assumes the hashing algorithm is sound, which for the case of integers it should be).

    Regarding sorting - let's start with the easy part - the second approach, using a hash set, is an O(n) algorithm (n items, each having an O(1) operation to check if they're in a hash set, and if not add them). Sorting nums would be an O(nlog(n)) operation, so it definitely won't help here.

    With the first approach, where an array is used, it ultimately won't help either. First, let's understand the time complexity of this solution. nums has n items, for each of which an auxiliary array is searched. For the first item, an array with the size of zero is searched, for the second, an array with the size of 1, and so on. This ultimately has a log(n) complexity, and since it's done for n items, the total complexity is O(nlog(n)). So adding sorting definitely won't improve the solution by an order of magnitude.