I'm trying to write a generator function (or achieve the equivalent) which takes an iterable xs
in Python and counts the "runs". (This is a problem in Thinking Functionally with Haskell by Bird, which I want to translate into Python using Python's laziness features.) So
list(iter(count_runs(['a', 'a', 'b', 'c', 'a', 'd', 'd'])))
# => [(2, 'a'), (1, 'b'), (1, c'), (1, 'a'), (2, 'd')]
In Haskell it's
countRuns :: [a] -> [(Int, a)]
countRuns [] = []
countRuns x:xs = (1 + length us, x):countRuns vs
where us, vs = span (==x) xs
In Python, I'd like to write somethin like
from itertools import takewhile, dropwhile
def count_runs(xs):
# get first element x of xs, if it exists
us, vs = (takewhile(lambda y: y==x, xs),
dropwhile(lambda y: y==x, xs))
yield (1 + len(list(us)), x)
yield from count_runs(vs)
But the problem is that vs
is an iterator already, so I will run into trouble if I call takewhile
and dropwhile
on it in the next recursion. (When I call list(takewhile(..., xs))
in the next recursion, it will get rid of the first element of dropwhile(..., xs)
as well, because they're both looking at the same iterator.
How can I fix this issue, and what is the correct way to get the first element in the second line?
A significant difference between span
and takewhile
is that takewhile
consumes the first non-x
value in order to determine when to stop yielding values. As a result, you'll lose any singleton items in the input; in particular, takewhile
loses the first b
in producing the leading set of a
s. The iterator protocol has no way of peeking at the next element of an iterator, nor does it have a way to put back an element it consumes.
Instead, you'll need two independent iterators: one for takewhile
to produce the desired prefix, and another from which to drop that prefix for the recursive call.
def count_runs(xs):
try:
x = next(xs)
except StopIteration:
return
t1, t2 = tee(xs)
us = list(takewhile(lambda y: y == x, t1))
yield (1 + len(us), x)
yield from count_runs(dropwhile(lambda y: y == x, t2))
(Note that the itertools
documentation implements something similar to span
in its recipe section as a before_and_after
function. It doesn't use tee
, but I refer you to the actual implementation for details).
def before_and_after(xs):
...
def count_runs(xs):
try:
x = next(xs)
except StopIteration:
return
first, second = before_and_after(lambda y: y == x, xs)
yield (1 + len(list(first)), x)
yield from count_runs(second)
)
However, most of this work is already done for you by itertools.groupby
.
def count_runs(xs):
yield from ((len(list(v)), k) for k, v in groupby(xs))