pythonhaskellgeneratorcode-translationlazy-sequences

Count runs lazily in Python translate from Haskell


I'm trying to write a generator function (or achieve the equivalent) which takes an iterable xs in Python and counts the "runs". (This is a problem in Thinking Functionally with Haskell by Bird, which I want to translate into Python using Python's laziness features.) So

list(iter(count_runs(['a', 'a', 'b', 'c', 'a', 'd', 'd'])))
# => [(2, 'a'), (1, 'b'), (1, c'), (1, 'a'), (2, 'd')]

In Haskell it's

countRuns :: [a] -> [(Int, a)]
countRuns [] = []
countRuns x:xs = (1 + length us, x):countRuns vs
               where us, vs = span (==x) xs 

In Python, I'd like to write somethin like

from itertools import takewhile, dropwhile

def count_runs(xs):
  # get first element x of xs, if it exists
  us, vs = (takewhile(lambda y: y==x, xs), 
            dropwhile(lambda y: y==x, xs))
  yield (1 + len(list(us)), x)
  yield from count_runs(vs)

But the problem is that vs is an iterator already, so I will run into trouble if I call takewhile and dropwhile on it in the next recursion. (When I call list(takewhile(..., xs)) in the next recursion, it will get rid of the first element of dropwhile(..., xs) as well, because they're both looking at the same iterator.

How can I fix this issue, and what is the correct way to get the first element in the second line?


Solution

  • A significant difference between span and takewhile is that takewhile consumes the first non-x value in order to determine when to stop yielding values. As a result, you'll lose any singleton items in the input; in particular, takewhile loses the first b in producing the leading set of as. The iterator protocol has no way of peeking at the next element of an iterator, nor does it have a way to put back an element it consumes.

    Instead, you'll need two independent iterators: one for takewhile to produce the desired prefix, and another from which to drop that prefix for the recursive call.

    def count_runs(xs):
        try:
            x = next(xs)
        except StopIteration:
            return
    
        t1, t2 = tee(xs)
    
        us = list(takewhile(lambda y: y == x, t1))
        yield (1 + len(us), x)
        yield from count_runs(dropwhile(lambda y: y == x, t2))
    

    (Note that the itertools documentation implements something similar to span in its recipe section as a before_and_after function. It doesn't use tee, but I refer you to the actual implementation for details).

    def before_and_after(xs):
        ...
    
    def count_runs(xs):
        try:
            x = next(xs)
        except StopIteration:
            return
    
        first, second = before_and_after(lambda y: y == x, xs)
        yield (1 + len(list(first)), x)
        yield from count_runs(second)
    

    )

    However, most of this work is already done for you by itertools.groupby.

    def count_runs(xs):
        yield from ((len(list(v)), k) for k, v in groupby(xs))