Tags: python, numpy, information-theory

Python Contingency Table


I am generating many, many contingency tables as part of a project that I'm writing.

The workflow is to bin two data series and count how often each pair of bin indices co-occurs.

Initially, I wrote this as:

import numpy as np

def make_table(x, y, num_bins):
    # Count co-occurrences of the bin indices in x and y.
    ctable = np.zeros((num_bins, num_bins), dtype=int)
    for xn, yn in zip(x, y):
        ctable[xn, yn] += 1
    return ctable

This works fine, but it's so slow that it accounts for roughly 90% of the runtime of the entire project.

The fastest python-only optimization I've been able to come up with is this:

def make_table(x, y, num_bins):
    ctable = np.zeros(num_bins ** 2, dtype=int)
    # Map each (x, y) pair to a single flattened index num_bins * x + y.
    reindex = np.dot(np.stack((x, y)).transpose(),
                     np.array([num_bins, 1]))
    idx, count = np.unique(reindex, return_counts=True)
    for i, c in zip(idx, count):
        ctable[i] = c
    return ctable.reshape((num_bins, num_bins))

That's (somehow) a lot faster, but it's still pretty expensive for something that doesn't seem like it should be a bottleneck. Are there any efficient approaches I'm missing, or should I just give up and do this in Cython?
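For reference, the explicit loop can also be vectorized directly with `np.add.at` (a sketch; the function name is mine, and this is not necessarily faster than the reindexing trick):

```python
import numpy as np

def make_table_addat(x, y, num_bins):
    # np.add.at performs unbuffered in-place addition, so repeated
    # (x, y) index pairs are all counted, unlike plain fancy-indexed
    # assignment, which would only count each pair once.
    ctable = np.zeros((num_bins, num_bins), dtype=int)
    np.add.at(ctable, (x, y), 1)
    return ctable
```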

Also, here's a benchmarking function.

import time

import numpy as np

def timetable(func):
    size = 5000
    bins = 10
    repeat = 1000
    start = time.time()
    for i in range(repeat):
        x = np.random.randint(0, bins, size=size)
        y = np.random.randint(0, bins, size=size)
        func(x, y, bins)
    end = time.time()
    # Total elapsed time over all repeats, in seconds.
    print("Func {na}: {ti:.3f} s".format(na=func.__name__, ti=end - start))
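Before comparing timings, it can help to confirm that candidate implementations actually agree; a minimal sketch (the helper name `check_equivalent` is hypothetical):

```python
import numpy as np

def check_equivalent(func_a, func_b, bins=10, size=5000, seed=0):
    # Generate one shared random data set and compare the two
    # resulting tables element-wise; both funcs take (x, y, num_bins).
    rng = np.random.default_rng(seed)
    x = rng.integers(0, bins, size=size)
    y = rng.integers(0, bins, size=size)
    return np.array_equal(func_a(x, y, bins), func_b(x, y, bins))
```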

Solution

  • The clever trick for representing the elements of np.stack((x, y)) as integers can be made faster:

    In [92]: %timeit np.dot(np.stack((x, y)).transpose(), np.array([bins, 1]))
    109 µs ± 6.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [94]: %timeit bins*x + y
    12.1 µs ± 260 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Moreover, the last part of your second solution can be simplified somewhat, by simply considering

    np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))

    (note that this assumes every (x, y) combination occurs at least once in the data; otherwise np.unique returns fewer than bins ** 2 counts and the reshape fails)
    

    What is more, since we are dealing with equally spaced non-negative integers, np.bincount will outperform np.unique; passing minlength=bins ** 2 also makes this robust to bin combinations that never occur. With that, the above boils down to

    np.bincount(bins * x + y, minlength=bins ** 2).reshape((bins, bins))
    

    All in all, this yields a considerable speedup over what you are currently doing:

    In [78]: %timeit make_table(x, y, bins)  # Your first solution
    3.86 ms ± 159 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [79]: %timeit make_table2(x, y, bins)  # Your second solution
    443 µs ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [101]: %timeit np.unique(bins * x + y, return_counts=True)[1].reshape((bins, bins))
    307 µs ± 25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [118]: %timeit np.bincount(bins * x + y).reshape((10, 10))
    30.3 µs ± 3.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    You may also want to be aware of np.histogramdd, which takes care of both rounding and binning at once, although it is likely to be slower than flattening the indices and using np.bincount.
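Putting the pieces together, a self-contained sketch of the np.bincount approach (the function name is mine; `minlength` pads the count vector so the reshape works even when the largest flattened index never occurs in the data):

```python
import numpy as np

def make_table_bincount(x, y, num_bins):
    # Flatten each (x, y) pair to num_bins * x + y, count occurrences
    # with bincount, then fold the counts back into a 2-D table.
    # minlength guarantees num_bins ** 2 entries even if some bin
    # combinations are absent.
    flat = num_bins * np.asarray(x) + np.asarray(y)
    return np.bincount(flat, minlength=num_bins ** 2).reshape((num_bins, num_bins))
```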