I’ve been doing some experimenting with how I might use the AFL approach to using coverage information for glass box testing in Hypothesis (apparently even when not working on it I still can’t stop experimenting). In particular I’m interested in the question of how to use this sort of information with purely generative approaches to data where you don’t have a starting pool of examples.
After some experimentation, the following seems to work pretty well with byte strings at least:
We have two operations, expand and shrink. Each takes a value and produces larger (for expand) or smaller (for shrink) versions of it. We also have some total ordering over examples where smaller is better.
We can run an example and get a set of labels out. Labels are:
- The same labels that AFL uses – i.e. hashed branches with bucketed counts of how many times they were taken.
- The type of any exception thrown by running the example
We maintain a mapping from each label to the best example which exhibits that label, and a set of previously seen examples (this should probably be some sort of bloom filter really).
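To make the labelling concrete, here is a rough pure-Python sketch of what a labels(example) function could look like. It is only illustrative: real AFL instruments basic-block transitions in compiled code and records them in a shared-memory bitmap, whereas this uses sys.settrace line arcs as the nearest cheap analogue; test_function is a placeholder for whatever is being fuzzed, and the 16-bit hash and the bucket thresholds are just AFL-flavoured choices of mine.

```python
import sys
from collections import Counter

def bucket(count):
    """Collapse a raw hit count into one of AFL's coarse buckets
    (1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+)."""
    if count <= 3:
        return count
    for lo, hi in ((4, 8), (8, 16), (16, 32), (32, 128)):
        if count < hi:
            return lo
    return 128

def labels(example):
    """Run the test on the example under a line tracer and return a set of
    labels: (hashed arc, bucketed hit count) pairs, plus the type of any
    exception the test raised."""
    arcs = Counter()
    prev = [None]

    def tracer(frame, event, arg):
        # Arcs here just chain consecutive traced lines, even across function
        # boundaries, which is much cruder than AFL's real branch coverage.
        if event == "line":
            here = (frame.f_code.co_filename, frame.f_lineno)
            arcs[(prev[0], here)] += 1
            prev[0] = here
        return tracer

    result = set()
    sys.settrace(tracer)
    try:
        test_function(example)  # placeholder for the code under test
    except Exception as e:
        result.add(("exception", type(e).__name__))
    finally:
        sys.settrace(None)
    for arc, count in arcs.items():
        result.add((hash(arc) % (1 << 16), bucket(count)))
    return result
```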
We define an operation consider which works as follows:
```python
seen = set()
best = {}

def consider(example):
    if example in seen:
        return False
    seen.add(example)
    changed = False
    for label in labels(example):
        if label not in best or simpler(example, best[label]):
            changed = True
            best[label] = example
    if changed:
        for smaller in shrink(example):
            if consider(smaller):
                break
    return changed
```
(In practice this is written non-recursively to work around Python’s recursion limit; the recursive version is just easier to read.)
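For what it’s worth, here is one way that non-recursive rewrite might go – just a sketch reusing the names from above (seen, best, labels, shrink, simpler), not the actual implementation, and update_best is simply a helper I’ve pulled out. It relies on the observation that in the recursive version the first shrink that improves something causes every enclosing loop to break, so the whole recursion collapses into walking a single chain of improving shrinks.

```python
def update_best(example):
    """Record example against every label it hits; return True if it became
    the new best example for at least one label."""
    changed = False
    for label in labels(example):
        if label not in best or simpler(example, best[label]):
            best[label] = example
            changed = True
    return changed

def consider(example):
    """Behaves like the recursive version above, but follows the chain of
    improving shrinks with a loop instead of self-calls, so a long run of
    successful shrinks can't hit Python's recursion limit."""
    if example in seen:
        return False
    seen.add(example)
    if not update_best(example):
        return False
    current = example
    while current is not None:
        next_current = None
        for smaller in shrink(current):
            if smaller in seen:
                continue
            seen.add(smaller)
            if update_best(smaller):
                # First improving shrink: descend into it and stop scanning
                # the rest, mirroring the break in the recursive version.
                next_current = smaller
                break
        current = next_current
    return True
```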
We first consider a trivial example (e.g. the empty string) to start us off, and then iterate the following loop indefinitely:
- Pick a label at random and choose the current best example for that label
- For each expansion of that example (including the example itself), consider it, until one of them returns True or we run out of expansions.
…and that’s it. The use of shrink inside consider gives us a constantly refreshed pool of small seeds to build from, and expanding those seeds is what lets us find new labels. Note that we deliberately don’t minimize the best example for each label – a lot of the time the example stored for a label will be quite large, because we discovered several labels at once and got distracted by one of them. This appears to be OK in practice, because over time those examples get minimized anyway as a byproduct of the constant generate and replace process.
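Put together, the outer loop could look something like this – again only a sketch, reusing consider, expand and best from above, with the iteration count and the uniform choice of label being arbitrary choices of mine:

```python
import random
from itertools import chain

def fuzz(n_iterations=100_000):
    """Seed with the trivial example, then repeatedly pick a label at random,
    take its current best example, and try expansions of it until one of
    them improves something or we run out of expansions."""
    consider(b"")  # the trivial starting example for byte strings
    for _ in range(n_iterations):
        if not best:
            break  # nothing to work from; shouldn't happen after seeding
        label = random.choice(list(best))
        example = best[label]
        for candidate in chain([example], expand(example)):
            if consider(candidate):
                break
```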
To be concrete, the following are the operations for binary strings (sketched in code after this list):
- The trivial starting element is the empty string
- Expansion first tries appending up to 1024 bytes of random nonsense to the end, then tries appending each of the 256 valid byte values one at a time
- Shrinking does a bunch of sliding deletes. In particular there is no bytewise shrinking, only deletions.
- A value is simpler if it has a smaller length or equal length but is lexicographically smaller.
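Here is roughly what those operations might look like in code. The simpler ordering is exactly the one described; the details of expand and shrink (how the random blob is generated, which deletion block sizes the sliding deletes try) are my guesses at a reasonable reading of the descriptions above:

```python
import os
import random

TRIVIAL = b""  # the trivial starting element

def expand(example):
    """First try tacking a blob of up to 1024 random bytes onto the end,
    then try appending each of the 256 possible byte values in turn."""
    yield example + os.urandom(random.randint(1, 1024))
    for value in range(256):
        yield example + bytes([value])

def shrink(example):
    """Sliding deletes only: for progressively smaller block sizes, try
    deleting each contiguous block of that size. No bytewise shrinking."""
    block = len(example)
    while block > 0:
        for start in range(len(example) - block + 1):
            yield example[:start] + example[start + block:]
        block //= 2

def simpler(x, y):
    """x is simpler than y if it is shorter, or the same length but
    lexicographically smaller."""
    return (len(x), x) < (len(y), y)
```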
And this seems to work pretty well. I don’t have any empirical quantification, but it’s found two bugs: one in chardet and one in binaryornot. I’ve tried applying it to cryptography, but it didn’t find anything. That’s not totally surprising, as cryptography has already been extensively fuzzed with python-AFL, and I don’t expect this approach to be much better than that – there’s less overhead, so it seems to be faster, but it’s otherwise a very similar approach which should produce similar results. Also, in this case especially, python-AFL has a distinct advantage in not having to synthesize complex message formats out of whole cloth.