This is an idea I’m playing with at the moment. I haven’t yet determined if it’s a good idea. My current belief is that it’s one of those ideas in the category of “Clever idea, could work, but too slow to be usable in practice”.
The problem I am trying to solve: I have some predicate \(f\) that takes a binary string and returns true or false. I am trying to find a string \(x\) such that \(f(x)\) is true and \(x\) is of minimal length, and lexicographically minimal amongst the strings of minimal length (i.e. is minimal when comparing (len(x), x)). In practice we don’t need the true minimum, though; anything smallish will do.
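To make the ordering concrete, here is a minimal sketch in Python. It assumes the binary strings are represented as `bytes`; the name `shortlex_key` and the candidate list are made up for illustration.

```python
from typing import Tuple


def shortlex_key(x: bytes) -> Tuple[int, bytes]:
    """Sort key for the ordering above: shorter first, then lexicographic."""
    return (len(x), x)


# Hypothetical candidates, all assumed to satisfy the predicate.
candidates = [b"\x01\x00", b"\x02", b"\x00\x00\x00", b"\x01"]

# The shortest strings win; ties on length are broken lexicographically.
best = min(candidates, key=shortlex_key)  # b"\x01"
```

Note that `b"\x00\x00\x00"` is lexicographically smallest but loses here because length is compared first.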
The traditional approach is to use a greedy algorithm. For example, one of the classic approaches in this field is delta debugging, which deletes sub-intervals to attempt to find a minimal length example (specifically it guarantees it finds an example such that you can’t delete any single character from it, though it usually does better in practice). This works and works well, but has the typical problem of greedy algorithms that it gets stuck in local minima. For example, a problem Hypothesis has in its shrinking right now is that there are often cases where a successful shrink would require both deleting one part of the string and changing the value of another part at the same time.
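As an illustration of both the greedy approach and the way it gets stuck, here is a simplified interval-deletion pass in Python. It is in the spirit of delta debugging but is not the ddmin algorithm itself, and the `length_prefixed` predicate is a made-up example of a format where deleting and changing must happen together.

```python
from typing import Callable


def greedy_delete(x: bytes, f: Callable[[bytes], bool]) -> bytes:
    """Delete sub-intervals of x for as long as f stays true.

    Chunk sizes are halved down to 1, and each size is repeated until it
    makes no progress, so the result is 1-minimal: deleting any single
    character from it makes f false.
    """
    if not f(x):
        raise ValueError("starting example must satisfy the predicate")
    size = max(len(x) // 2, 1)
    while True:
        i = 0
        changed = False
        while i + size <= len(x):
            candidate = x[:i] + x[i + size:]
            if f(candidate):
                x = candidate  # keep the deletion and retry at the same index
                changed = True
            else:
                i += 1
        if changed:
            continue  # earlier positions may have become deletable
        if size == 1:
            return x
        size //= 2


def length_prefixed(x: bytes) -> bool:
    """Hypothetical predicate: the first byte must equal the payload length."""
    return len(x) >= 1 and x[0] == len(x) - 1


# Pure deletion works well when the predicate is forgiving...
simple = greedy_delete(b"xxaxx", lambda s: b"a" in s)  # b"a"

# ...but here every deletion breaks the length prefix, so nothing shrinks,
# even though b"\x00" is a much smaller string satisfying the predicate.
stuck = greedy_delete(b"\x03abc", length_prefixed)  # b"\x03abc"
```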
So there’s a question of how to get out of these local minima.
I’ve spotted one approach recently which I think might be interesting.
It starts from two key observations:
- Regardless of the structure of the predicate, the set of strings that satisfy the predicate and are dominated by some fixed string (i.e. are no larger than it in the ordering above) forms a regular language (because it is a finite language). This regular language may in practice be prohibitively complicated, but we can always take a guess that it’s simple and bail out early if it turns out not to be.
- Given a regular language represented as a deterministic finite automaton, it is extremely fast to get the minimal element of it – annotate each node with its distance to a terminal node (e.g. using a Floyd-Warshall variant) then walk the graph, at each point picking the next node as the one with the smallest value, breaking ties by picking lower bytes.
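The second observation can be sketched as follows. This is a minimal Python version under some assumptions of mine: the DFA is a dict mapping each state to a `{byte: next_state}` dict, and the distances are computed with a breadth-first search backwards from the accepting states rather than a Floyd-Warshall variant, since only distances to the nearest accepting state are needed.

```python
from collections import deque
from typing import Dict, Optional, Set


def minimal_element(
    transitions: Dict[int, Dict[int, int]], start: int, accepting: Set[int]
) -> Optional[bytes]:
    """Return the shortest, then lexicographically least, accepted string."""
    # Reverse the edges so we can search backwards from accepting states.
    reverse: Dict[int, Set[int]] = {}
    for src, edges in transitions.items():
        for dst in edges.values():
            reverse.setdefault(dst, set()).add(src)

    # dist[s] = length of the shortest path from s to any accepting state.
    dist = {s: 0 for s in accepting}
    queue = deque(accepting)
    while queue:
        state = queue.popleft()
        for prev in reverse.get(state, ()):
            if prev not in dist:
                dist[prev] = dist[state] + 1
                queue.append(prev)

    if start not in dist:
        return None  # the language is empty

    # Walk forward, always taking the lowest byte that gets one step closer.
    out = bytearray()
    state = start
    while dist[state] > 0:
        target = dist[state] - 1
        byte = min(b for b, t in transitions[state].items() if dist.get(t) == target)
        out.append(byte)
        state = transitions[state][byte]
    return bytes(out)


# Example: strings over {0, 1} that end in 1 followed by 0.
ends_in_10 = {0: {0: 0, 1: 1}, 1: {0: 2, 1: 1}, 2: {0: 0, 1: 1}}
smallest = minimal_element(ends_in_10, start=0, accepting={2})  # b"\x01\x00"
```

The walk only ever follows edges that reduce the distance by exactly one, so the result has minimal length, and picking the lowest such byte at each step makes it lexicographically least among strings of that length.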
So the idea is this: We use our existing shrinker to give us a corpus of positive and negative examples, then we attempt to find a regular language that corresponds to those examples. We do this using a light variant on L* search where we offer only examples from our corpus that we’ve already calculated. As an optimization, we can also use our greedy shrinker to produce a minimized counter-example here (L* search works better the smaller your counter-examples are, and doing a lexicographic minimization keeps the number of prefixes under control).
Once we’ve done that we take the minimal element of our inferred DFA. There are now three possibilities:
- The minimal element is also our current best example. Stop.
- The minimal element does not in fact satisfy the predicate. Update our L* search with a new counterexample.
- The minimal element does satisfy the predicate and is a strict improvement on our current best example. Start again from here.
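The control flow of those three cases might look something like the sketch below. Everything about the learner here is an assumption for illustration: the `Learner` interface is hypothetical, `max_rounds` is an arbitrary stand-in for a real stopping rule, and `ListLearner` is a toy that keeps a finite set of strings so the loop can actually run. It is not an L* implementation.

```python
from typing import Callable, Iterable, Optional, Protocol, Tuple


def shortlex_key(x: bytes) -> Tuple[int, bytes]:
    return (len(x), x)


class Learner(Protocol):
    """Hypothetical interface to whatever infers the language."""

    def minimal_element(self) -> Optional[bytes]: ...
    def add_counterexample(self, x: bytes, accepted: bool) -> None: ...


def shrink_with_learner(
    best: bytes,
    f: Callable[[bytes], bool],
    make_learner: Callable[[bytes], Learner],
    max_rounds: int = 100,
) -> bytes:
    learner = make_learner(best)
    for _ in range(max_rounds):
        guess = learner.minimal_element()
        if guess is None or guess == best:
            return best  # case 1: nothing smaller is predicted, stop
        if not f(guess):
            learner.add_counterexample(guess, False)  # case 2: refine the guess
        elif shortlex_key(guess) < shortlex_key(best):
            best = guess  # case 3: strict improvement, start again from here
            learner = make_learner(best)
        else:
            return best  # guess is valid but no smaller, so there is no progress
    return best


class ListLearner:
    """Toy stand-in, NOT L*: the 'language' is a finite set of candidates."""

    def __init__(self, best: bytes, pool: Iterable[bytes]) -> None:
        limit = shortlex_key(best)
        self.language = {x for x in pool if shortlex_key(x) <= limit} | {best}

    def minimal_element(self) -> Optional[bytes]:
        return min(self.language, key=shortlex_key, default=None)

    def add_counterexample(self, x: bytes, accepted: bool) -> None:
        if accepted:
            self.language.add(x)
        else:
            self.language.discard(x)


def length_prefixed(x: bytes) -> bool:
    """Made-up predicate: the first byte must equal the payload length."""
    return len(x) >= 1 and x[0] == len(x) - 1


pool = [b"", b"\x00", b"\x01", b"\x03abc"]
result = shrink_with_learner(
    b"\x03abc", length_prefixed, lambda best: ListLearner(best, pool)
)  # b"\x00"
```

In this run the empty string is proposed first and rejected (case 2), then `b"\x00"` is accepted as an improvement (case 3), and after a restart the learner’s minimal element matches the current best (case 1).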
You can also randomly walk the DFA and check if the result is interesting as another useful source of counter-examples, or indeed rerun the generator that got you the interesting example in the first place. Any way of randomly generating examples can be helpful here, although in the typical case of interest the set of true values for \(f\) is very sparse, so I expect the first form to be more useful because it has a high probability of generating something that is in our regular language but does not satisfy our predicate. I haven’t really experimented with this part yet.
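A random walk over the DFA could be as simple as the sketch below. The DFA representation (a dict from state to `{byte: next_state}`), the stopping probability, and the length cap are all assumptions of mine rather than anything I’ve tuned.

```python
import random
from typing import Dict, Optional, Set


def accepts(transitions: Dict[int, Dict[int, int]], start: int,
            accepting: Set[int], x: bytes) -> bool:
    """Run the DFA on x; a missing transition means rejection."""
    state = start
    for byte in x:
        if byte not in transitions.get(state, {}):
            return False
        state = transitions[state][byte]
    return state in accepting


def random_member(
    transitions: Dict[int, Dict[int, int]],
    start: int,
    accepting: Set[int],
    rng: random.Random,
    stop_probability: float = 0.3,
    max_length: int = 50,
) -> Optional[bytes]:
    """Walk the DFA at random; return an accepted string, or None on failure."""
    out = bytearray()
    state = start
    while True:
        # At an accepting state, flip a biased coin to decide whether to stop.
        if state in accepting and rng.random() < stop_probability:
            return bytes(out)
        edges = transitions.get(state, {})
        if not edges or len(out) == max_length:
            break  # dead end or length cap reached
        byte = rng.choice(sorted(edges))
        out.append(byte)
        state = edges[byte]
    return bytes(out) if state in accepting else None


# Example: strings over {0, 1} that end in 1 followed by 0.
ends_in_10 = {0: {0: 0, 1: 1}, 1: {0: 2, 1: 1}, 2: {0: 0, 1: 1}}
sample = random_member(ends_in_10, 0, {2}, random.Random(0))
```

Each string this returns is in the inferred language by construction, so running the real predicate on it either confirms the guess or produces a counter-example.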
At some point the number of states in the DFA will probably get prohibitively large. This can be used to terminate the shrink too. If we’ve made progress since we started the L* search we can restart the search from a new shrink, otherwise we stop there.
Does this work?
Empirically, based on my prototyping, the answer seems to be “Kinda”. My L* implementation appears to be quite slow in practice, but I’ve gotten a lot wrong in the course of writing it, so I’m not sure that this isn’t just a bug. Additionally, there are definitely algorithmic improvements you can make to L* search. I’ve seen one that apparently improves the number of tests you have to perform from \(n^2\) to \(n \log n\) (where \(n\) is the size of the final state machine), but I don’t understand it yet (mostly because I don’t quite get what a homing sequence is). There’s also a paper I’ve yet to read on how to infer a non-deterministic finite automaton instead, which I think may be more effective in practice (because converting NFAs to DFAs can cause an exponential blowup in the number of states, which I think may be happening here).
So at the moment the answer is “Ask again later”. But it’s an interesting idea that I think has some promise, so I thought I’d share my preliminary thoughts.