pythonpandasnumpy

adding interval to multiple bin counts in python/pandas


Context: Given a piece of paper with text/drawings sporadically printed on it, determine how many unmarked strips of a given width could be cut from it (without doing anything clever like offsetting the cuts according to text position). Data comes in the form of a pandas DataFrame with x0 and x1 columns, which are the bounding box values in the relevant dimension.

My initial attempt to solve this was to use pd.cut() and DataFrame.groupby() to bin everything by an np.arange() and count the empty bins, but this only works if x1-x0 < strip_width:

def count_unmarked_strips(df, total_width, strip_width):
    bins = np.arange(0, total_width, strip_width)
    count_min = df.groupby(pd.cut(df['x0'], bins=bins))['x0'].count().to_numpy()
    count_max = df.groupby(pd.cut(df['x1'], bins=bins))['x1'].count().to_numpy()
    return ((count_min + count_max) == 0).sum()

This already seems like a bad way to do this and I cannot come up with a way to extend this to work beyond the above limitation. More generally, is there a good/reasonably performant way to compare two sets of intervals like this?

Update with alternate solution:

def count_unmarked_strips_alt(df, total_width, strip_width):
    bin_edges = np.arange(0, total_width, strip_width)
    bins = np.zeros(len(bin_edges)-1, dtype=bool)
    for i in range(len(bins)):
        bin_count = df.loc[
            ( # if writing starts in strip
                (df['x0'] >= bin_edges[i])
                & (df['x0'] <= bin_edges[i+1])
            )
            | ( # if writing ends in strip
                (df['x1'] >= bin_edges[i])
                & (df['x1'] <= bin_edges[i+1])
            )
            | ( # if writing crosses strip boundary
                (df['x0'] <= bin_edges[i])
                & (df['x1'] >= bin_edges[i])
            )
        ].shape[0]
        bins[i] = bin_count == 0
    return bins.sum()

This seems to work, but is much slower to run, which I would like to improve.

Example data:

total_width = 20.
strip_width = 2. # both functions give the same result
# strip_width = 1. # first function overcounts unmarked strips

df = pd.DataFrame()
df['x0'] = [0.1, 6.3, 2.5, 17.2, 11.8, 5.3]
df['x1'] = [0.7, 7.9, 5.4, 19.0, 12.2, 5.7]

so, e.g. we have a 20cm width of paper with 6 markings on of different sizes, bounded on the left hand side by x0 and on the right hand side by x1. If we were to cut the paper into equal strips with 2cm width, we would find 2 of them would be unmarked.


Solution

  • Rethinking this, I came up with a more performant solution. Iterating over each pair of x0 and x1 values and converting them from floating point positions to the corresponding indices of the bins array works and is much faster than calling df.loc() within a loop.

     def count_unmarked_strips(df, total_width, strip_width):
        n_total = int(total_width // strip_width)
        bins = np.ones(n_total, dtype=bool)
        for x_range in df[['x0', 'x1']].to_numpy():
            x0, x1 = (x_range // strip_width).astype(int)
            bins[x0:x1+1] = False
    
        return bins.sum()