I was attempting LeetCode question 128. Longest Consecutive Sequence:

Given an unsorted array of integers `nums`, return the length of the longest consecutive elements sequence. You must write an algorithm that runs in O(n) time.

Example 1:

Input: nums = [100,4,200,1,3,2]
Output: 4
Explanation: The longest consecutive elements sequence is [1, 2, 3, 4]. Therefore its length is 4.

Constraints:

0 <= nums.length <= 10^5
-10^9 <= nums[i] <= 10^9
My first attempt did sorting followed by counting. This has O(n log n) time complexity, but it surprisingly placed at the 93.93rd percentile for runtime (40 ms).
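For reference, a sort-then-count approach along those lines might look like this (my reconstruction of the idea, not the exact code I submitted):

```python
def longest_consecutive_sorted(nums):
    """Sort-then-count approach: O(n log n) because of the sort."""
    if not nums:
        return 0
    ordered = sorted(set(nums))  # dedupe so equal neighbours don't break a streak
    longest = current = 1
    for prev, cur in zip(ordered, ordered[1:]):
        if cur == prev + 1:
            current += 1           # streak continues
            longest = max(longest, current)
        else:
            current = 1            # streak broken, start over
    return longest

print(longest_consecutive_sorted([100, 4, 200, 1, 3, 2]))  # -> 4
```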
I then re-read the question and realised the answer must be in O(n) time complexity. So I wrote the following code:
```python
def longestConsecutive(self, nums: List[int]) -> int:
    s = set(nums)
    longest_streak = 0
    for num in nums:
        if (num - 1) not in s:
            current_streak = 1
            while (num + 1) in s:
                num += 1
                current_streak += 1
            longest_streak = max(longest_streak, current_streak)
    return longest_streak
```
(I know it's not great practice to reuse the `num` variable name in the nested loop, but that's beside the point of the question; I have also tested with a separate variable, with the same result as below.)
While this should theoretically be O(n) time complexity, and therefore faster than my first solution, it actually resulted in time limit exceeded on a couple of the test cases and my code was rejected.
I eventually submitted a passing solution after referring to the official solution:
```python
class Solution:
    def longestConsecutive(self, nums: List[int]) -> int:
        nums = set(nums)
        longest_streak = 0
        for num in nums:
            if (num - 1) not in nums:
                next_num = num + 1
                while next_num in nums:
                    next_num += 1
                longest_streak = max(longest_streak, next_num - num)
        return longest_streak
```
where I identified 2 key differences:

1. The outer loop iterates over the deduplicated set rather than the original list.
2. The streak length is computed at the end as `next_num - num`, instead of incrementing a counter inside the inner loop.
However, neither of these changes seems like it should have a significant impact on the runtime, certainly not enough to cross the line from time-limit-exceeded to a passing solution. To puzzle me even more, this O(n) solution still performed worse than my sorting solution, ranking only at the 75.73rd percentile (46 ms).
So my questions are:
Why does an O(n log n) algorithm perform faster than an O(n) one in practice?
Time complexity says nothing about actual running times for concrete input sizes. It only describes how running times evolve asymptotically as the input size grows large.
In general (unrelated to the actual code you presented), we can imagine an O(n) algorithm that needs 1000 + 10n milliseconds to complete, and an O(n log n) algorithm that needs 1 + n·log10(n) milliseconds to complete. And now it becomes clear how the O(n log n) algorithm will beat the O(n) one for lots of realistic input sizes.
See how many milliseconds they would need for concrete values of 𝑛:
| n | O(n) | O(n log n) |
|---:|---:|---:|
| 1 | 1 010 | 1 |
| 10 | 1 100 | 11 |
| 100 | 2 000 | 201 |
| 1000 | 11 000 | 3 001 |
| 10000 | 101 000 | 40 001 |
| 100000 | 1 001 000 | 500 001 |
| 1000000 | 10 001 000 | 6 000 001 |
| 10000000 | 100 001 000 | 70 000 001 |
| 100000000 | 1 000 001 000 | 800 000 001 |
| 1000000000 | 10 000 001 000 | 9 000 000 001 |
| 10000000000 | 100 000 001 000 | 100 000 000 001 |
| 100000000000 | 1 000 000 001 000 | 1 100 000 000 001 |
By definition, there is always an 𝑛 above which the better time complexity will win out, but as you can see, it can be a very high threshold for 𝑛: in the table above only the last row shows a win for the theoretically more efficient algorithm.
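The illustrative cost formulas behind the table (which are, of course, made-up numbers for the sake of the argument) can be tabulated with a few lines of Python:

```python
import math

def cost_linear(n):
    """Hypothetical O(n) algorithm: 1000 ms fixed overhead + 10 ms per element."""
    return 1000 + 10 * n

def cost_nlogn(n):
    """Hypothetical O(n log n) algorithm: tiny overhead, n * log10(n) ms of work."""
    return 1 + n * math.log10(n) if n > 0 else 1

for n in [1, 100, 10**6, 10**11]:
    print(n, cost_linear(n), round(cost_nlogn(n)))
```

Running this reproduces the rows of the table above, and shows the crossover only happens at the very last row.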
Why is my first O(n) algorithm so slow that it exceeded the time limit, while my second algorithm, with minimal changes, passed?
It is because of the outer loop. In the slow version it iterates over the input list, while in the faster version it iterates over the set. This means that if the input is large and has lots of duplicate values, the slow version repeats work that brings no benefit: every duplicate of a streak's smallest element re-walks the entire streak, so the worst case degrades towards quadratic time. Your initial version only needs that one replacement in order to finish the job within the given time limit:
Change:

```python
for num in nums:
```

to:

```python
for num in s:
```
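Putting it together, your first attempt with only that one line changed looks like this (rewritten here as a plain function rather than a method, for easy testing):

```python
def longestConsecutive(nums):
    s = set(nums)
    longest_streak = 0
    for num in s:  # iterate over the deduplicated set, not the raw list
        if (num - 1) not in s:  # only start walking from the start of a streak
            current_streak = 1
            while (num + 1) in s:
                num += 1
                current_streak += 1
            longest_streak = max(longest_streak, current_streak)
    return longest_streak

print(longestConsecutive([100, 4, 200, 1, 3, 2]))  # -> 4
```

Each element is now visited at most twice across the whole run (once by the outer loop, once by an inner walk), which is what makes the algorithm O(n).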