From what I understand, since count()
iterates over the list again for each element it is slower. Since, the Counter and dict construction both iterate over the list once
shouldn't the dict construction and counter time results be similar?
I used the https://stackoverflow.com/a/23909767/7125235 as code reference for getting the time values.
import timeit
if __name__ == "__main__":
code_block = """seen = {}
for i in l:
if seen.get(i):
seen[i] += 1
else:
seen[i] = 1
"""
setup = """import random
import string
from collections import Counter
n=1000
l=[random.choice(string.ascii_letters) for x in range(n)]
"""
t1 = timeit.Timer(
stmt="Counter(l)",
setup=setup,
)
t2 = timeit.Timer(
stmt="[[x,l.count(x)] for x in set(l)]",
setup=setup,
)
t3 = timeit.Timer(
stmt=code_block,
setup=setup,
)
print("Counter(): ", t1.repeat(repeat=3, number=10000))
print("count(): ", t2.repeat(repeat=3, number=10000))
print("seen{}: ", t3.repeat(repeat=3, number=10000))
Output:
Run1:
Counter(): [0.32974308, 0.319977907, 0.301750341]
count(): [6.424047524000001, 6.417152854, 6.450776530000001]
seen{}: [1.1089669810000018, 1.099655232, 1.116015376]
Run 2:
Counter(): [0.322483783, 0.32464020800000004, 0.33498838900000005]
count(): [6.3235339029999995, 6.48233445, 6.543396192000001]
seen{}: [1.1192663550000006, 1.1072084830000009, 1.1155270229999985]
TL;DR
Counter.__init__
is using a C loop (in CPython, at least) to count the iterable's elements, see https://github.com/python/cpython/blob/6b1ac809b9718a369aea67b99077cdd682be2238/Modules/_collectionsmodule.c#L2277.
Counter
is (mostly, see below) implemented in Python, meaning its code can be inspected, debugged and even modified very easily.
Counter.__init__
in CPython 3.8 (docstring not included):
def __init__(self, iterable=None, /, **kwds):
super(Counter, self).__init__()
self.update(iterable, **kwds)
and Counter.update
(irrelevant paths not included):
def update(self, iterable=None, /, **kwds):
if iterable is not None:
if isinstance(iterable, _collections_abc.Mapping):
...
else:
_count_elements(self, iterable)
if kwds:
...
and _count_elements
:
def _count_elements(mapping, iterable):
mapping_get = mapping.get
for elem in iterable:
mapping[elem] = mapping_get(elem, 0) + 1
However, there is a very important piece of code + comment right below the implementation of _count_elements
:
try: # Load C helper function if available
from _collections import _count_elements
except ImportError:
pass
In other words, Counter
uses a C loop to count the elements, which is inherently faster than the Python loop you are using.
We can do a small experiment. Comment out the code that imports the C function:
# try: # Load C helper function if available
# from _collections import _count_elements
# except ImportError:
# pass
Executing your code:
Counter(): [1.8369901, 1.8359803000000001, 1.940804]
seen{}: [1.2392313000000001, 1.2483893999999998, 1.3157528000000003]
Now Counter
is actually slower than a plain dict.