I have an ordered list of things to process that includes some duplicates and I only want to process the first occurrence. Presently, I'm doing it like this in Python v2.7:
seen = set()
for (value, fmt) in formats:
if fmt not in seen:
seen.add(fmt)
process(value, fmt)
Is there anyway to simultaneously insert a new element into seen
and detect whether or not it was already present? (This would avoid the repeated lookup of fmt
in the set
.)
seen = set()
for (value, fmt) in formats:
# myInsert() would return true if item was not already present.
if seen.myInsert(fmt):
process(value, fmt)
Or, alternatively, can I somehow just filter my formats
to exclude duplicate entries before looping?
unique_formats = removeDuplicates(formats, key=itemgetter(1))
for (value, fmt) in unique_formats:
process(value, fmt)
You could take the length of the set before and after the add()
. If it didn't change, the format was already in the set.
seen = set()
for (value, fmt) in formats:
l1 = len(seen)
seen.add(fmt)
if l1 != len(seen):
process(value, fmt)
Your question presumes that the in
test is a costly operation. This turns out not to be the case. Using len()
can take more time, although both are quite fast;
In [4]: seen = set(range(10000))
In [5]: %timeit 5995 in seen
10000000 loops, best of 3: 122 ns per loop
In [6]: %timeit len(seen)
1000000 loops, best of 3: 167 ns per loop
(measured with CPython 2.7.3 on a 2.5 GHz Core Quad Q9300)