python-hypothesis-property-based-testing

Can Hypothesis shrink a failing test case which was not found by Hypothesis?


Nicholas Chammas wrote on the hypothesis-users mailing list:

A user reported a bug in a function and provided a reproduction. I decided to write a test for this function to see if Hypothesis could find the same bug.

The test looks like this:

from hypothesis import given
from hypothesis.strategies import floats, lists, tuples

@given(data=lists(tuples(floats())))
def test_percentile(data):
    ...

I’ve run this test with up to 100,000 examples, but Hypothesis is not finding this bug. A very specific set of circumstances that I do not understand needs to line up precisely for this bug to reveal itself.

I am now trying to figure out something different: Given a known failing example, can Hypothesis help me shrink it to its simplest form?

The reproduction provided by the user is a list that’s 373 elements long. I managed to manually shrink it down to 45 elements.

Is there any way to get Hypothesis to shrink the known failing example even further?

What is the answer to that? Can Hypothesis shrink a failing test case it didn't find?

(I've copied the question here so the answer can be updated over time.)


Solution

  • Hypothesis can only shrink examples that it generated, but you can improve your odds of finding a known failure by...

    1. running for even longer (e.g. overnight) - computer time is cheaper than your time
    2. using target() to 'aim at' the known failure, which can make the search much more efficient (see the sketch just below)
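
    For instance, a minimal sketch combining both options, reusing the strategy from the question - the length-based metric here is only a guess, and in practice you would pick whatever metric captures what makes the known failure special:

        from hypothesis import given, settings, target
        from hypothesis import strategies as st

        @settings(max_examples=1_000_000, deadline=None)   # option 1: run much longer
        @given(data=st.lists(st.tuples(st.floats())))
        def test_percentile(data):
            # Option 2: reward examples that resemble the known failure.
            # The reproduction was a long list, so score by length; any finite
            # int/float observation works, and Hypothesis tries to maximise it.
            target(float(len(data)), label="list length")
            ...  # call the function under test here, as in the original test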

    This is a fundamental consequence of the way Hypothesis is designed: both generation and shrinking work on the same underlying representation, and because strategies can include arbitrary Python code it's impossible in general to "run them backwards" and turn your test case into IR.

    (It is possible to run generators backwards in restricted settings, and in principle this could be implemented for simple Hypothesis strategies - but I really dislike interfaces which only work in easy cases. We use a similar trick in our internals though.)
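
    As a purely illustrative example of why strategies can't be inverted in general, consider one whose .map() step is an arbitrary, effectively one-way Python function (the strategy below is made up for this post):

        # Hypothesis records the random choices it made *before* the map, so
        # given only the final value there is no general way to recover them.
        import hashlib

        from hypothesis import strategies as st

        opaque = st.integers(min_value=0).map(
            lambda n: hashlib.sha256(str(n).encode()).hexdigest()
        )
        # Shrinking mutates those recorded choices and re-runs the strategy
        # forwards; turning a hex digest back into n would mean inverting
        # sha256, which is exactly the "run it backwards" problem.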

    Nonetheless, "shrink user-reported failing cases" has been on my personal wishlist for literally years now - @example(<your value here>).shrink() would be a lovely interface (fail if we can't reach it; dump a patch if we shrink it). The easiest implementation would be to automatically derive some target() metrics as described above (a hand-rolled version is sketched below); a more complex and efficient approach would be to supplement that with an analytic solution in cases where we can. Support for symbolic execution might also be useful, although we're speculating about combining wishlist ideas now :-)
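
    In the meantime, a hand-rolled version of the "derive target() metrics" idea works today. The sketch below uses a hypothetical KNOWN_FAILURE standing in for the real 373-element reproduction, scores each generated example by a crude similarity to it, and asks Hypothesis to maximise that score:

        import math

        from hypothesis import given, target
        from hypothesis import strategies as st

        # Hypothetical stand-in for the real user-reported reproduction.
        KNOWN_FAILURE = [(0.0,), (1.5,), (1.5,), (math.inf,)]

        def similarity(example, reference):
            # Crude score: penalise length differences, reward matching positions.
            score = -abs(len(example) - len(reference))
            for got, want in zip(example, reference):
                if got == want:
                    score += 1
            return float(score)

        @given(data=st.lists(st.tuples(st.floats())))
        def test_percentile_targeted(data):
            target(similarity(data, KNOWN_FAILURE), label="similarity to known failure")
            ...  # call the function under test here

    There's still no guarantee of hitting the exact failure, but it biases the search toward that corner of the input space - which is why deriving such metrics automatically is the "easiest implementation" mentioned above.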