I knew this reminded me of something, and it turns out it was one of his older posts, which is also worth a read: http://blog.regehr.org/archives/970 (Finding Undefined Behavior Bugs by Finding Dead Code)
The PDF it links to is a 404, but the Wayback Machine caught it in a few different revisions (the date of Regehr's post corresponds more closely with the earlier revision).
It's an interesting method. Does anyone know if these kinds of torture tests get collected into a common library to help any future compilers or if they just result in a paper and bug reports to current ones?
The generated tests themselves are hardly worth the bytes they occupy. The value is in the compact program that can generate as many of them as desired, very quickly, with options to focus on different aspects of compilation (with or without bit-fields, with or without deep pointer nesting, …).
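To make that concrete, here is a minimal sketch of the differential-testing loop such a generator enables. The csmith option name and the $CSMITH_HOME runtime layout are written from memory, so treat them as assumptions and check csmith --help; a real harness would also bound each run with a timeout, since random programs occasionally fail to terminate.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        char cmd[256];
        for (unsigned seed = 1; seed <= 1000; seed++) {
            /* Generate a fresh random C program for this seed. */
            snprintf(cmd, sizeof cmd, "csmith --seed %u > test.c", seed);
            if (system(cmd) != 0)
                continue;
            /* Compile the very same program at two optimization levels. */
            if (system("cc -O0 -I\"$CSMITH_HOME/runtime\" test.c -o t0") != 0 ||
                system("cc -O2 -I\"$CSMITH_HOME/runtime\" test.c -o t2") != 0)
                continue;
            /* Csmith-generated programs print a checksum of their global
               state; any divergence between the two builds is a candidate
               bug in one of the two compilations. */
            system("./t0 > out0.txt");
            system("./t2 > out2.txt");
            if (system("cmp -s out0.txt out2.txt") != 0)
                printf("seed %u: checksum mismatch, reduce and report\n", seed);
        }
        return 0;
    }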
We forced Xuejun Yang (who turned Randprog, the prototype that preceded Csmith, into Csmith) to fix more bugs than his PhD required, and more than you can usually expect to see fixed in a research prototype. (I am one of the developers of Frama-C; pitting Frama-C “against” Csmith was my hobby for a summer, and we found and reported as many bugs in Csmith as we found in Frama-C.) The sentence “over the last couple of years I’ve slacked off on reporting compiler bugs” near the post's conclusion is telling. You can expect the same story to unfold for EMI. Researchers are not rewarded for maintaining software ad vitam æternam, even when the software is useful. But hopefully they have released, or will soon release, the generator as open source, and then, if you find it too useful to give up, you can fix or work around its bugs as you find them.
As an example, I think there is still one bug in Csmith that we work around by ignoring generated programs that show the symptoms which usually indicate it (we still use Csmith to test Frama-C after finishing any major feature that could introduce the sort of bug it can detect).
I disagree that the tests are not obviously worth keeping. Having a large suite of regression tests is vital to stop bugs reappearing.
This type of research spawns other research and projects outside of academia by acting as a proof of concept, even if the original researchers stop reporting bugs.
For instance, Csmith (and others) inspired GoSmith, which has found a number of bugs in the Go compiler. I hope that someone will use the obviously successful strategy of EMI to improve it further.
> I disagree that the tests are not obviously worth keeping. Having a large suite of regression tests is vital to stop bugs reappearing.
What I left implicit is that a non-reduced Csmith test makes a terrible regression test. It may, for instance, spend several seconds incrementing a counter from 0 to 4,000,000,000 before switching to the entirely unrelated computation that once triggered a bug. The value of these randomly generated tests is in generating a new one the next time, not in saving them to run again and again. Running the same randomly generated tests again and again would find very few bugs and would be a criminal waste of electricity.
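To illustrate (this is made up, not actual Csmith output), a non-reduced test tends to have this shape:

    /* Illustrative only, not real Csmith output. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Several seconds of work with no relation to the bug... */
        volatile uint32_t counter = 0;
        while (counter < 4000000000u)
            counter++;
        /* ...followed, possibly hundreds of lines later, by the few
           statements that actually tripped the compiler. */
        printf("checksum = %u\n", (unsigned)counter);
        return 0;
    }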
The reduced program is worth keeping as a regression test, because it is typically a few lines long and these few lines contain a construct that once tripped the compiler. Sometimes the reduced version can be rewritten by a human to be even more concise and readable than the output of C-reduce. But as I said in another comment, one compiler's regression test does not obviously make a good test for another compiler.
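For contrast, here is a hypothetical example of the shape a reduced test takes; this is not a real reported bug, just the kind of few-line, one-construct program with a checkable answer that reduction produces:

    #include <stdio.h>

    /* A 3-bit unsigned bit-field: stored values wrap modulo 8. */
    struct S { unsigned f : 3; } s = { 7 };

    int main(void) {
        s.f += 2;                       /* 7 + 2 wraps to 1 */
        printf("%u\n", (unsigned)s.f);  /* must print 1 at every -O level */
        return 0;
    }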
Sounds like we agree then :) If we assume that no (Csmith-found) bug will be fixed without a reduced test case, then the choice of which one to use as a regression test is obvious.
It's one from the last batch, which Xuejun hoped to have fixed but which still comes up occasionally. If there is a window of opportunity, I will try to get more information and re-report it.
This gives very poor coverage. The test cases in one compiler's regression suite are usually reduced to the shortest program that triggers a previously found bug, and another compiler that shares no source code with the first will have no particular reason to contain bugs revealed by the same programs. Sure, you may find some this way, but it is not likely to be the best use of your time. Invent something like Csmith/EMI instead; that will be more efficient!
The C language needs a good test suite with known expected results and good coverage of the language's definition, but the way to build such a suite is not to concatenate all the regression tests of existing compilers. In fact, part of the reason we do not have such a reference test suite is that building one is not as easy as putting together regression tests from various origins.
I'm still interested in somehow creating such a test suite, Pascal -- one that is designed to test corner cases in the standard rather than corner cases in the optimizer. It would somehow be derived from various formalizations of the standard such as yours, Xavier's, and Chucky's.
Looks like there is significant overlap between metamorphic testing and property-based testing. Can anyone come up with examples that are clearly one but not the other?
Metamorphic testing means taking an existing test case and mutating it into a new test case that produces the same answer (or at least an answer that can be easily predicted).
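A minimal illustration, using a math library rather than a compiler as the system under test (in the EMI work the mutated artifact is the program itself, e.g. deleting statements that the profiled input never executes): for an arbitrary x we cannot easily say what sin(x) should return, but the relation sin(x) == sin(pi - x) lets a mutated input predict the original's answer.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        const double pi = 3.14159265358979323846;
        for (int i = 0; i < 100000; i++) {
            double x = (double)rand() / RAND_MAX * pi;  /* original input */
            double a = sin(x);
            double b = sin(pi - x);                 /* metamorphic follow-up */
            /* The relation is exact over the reals; allow a little
               floating-point slack. */
            if (fabs(a - b) > 1e-9) {
                printf("relation violated at x = %.17g\n", x);
                return 1;
            }
        }
        return 0;
    }

(Compile with -lm.) No oracle for sin itself is needed; the mutated test case carries its own expected answer.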
Property-based testing is, as far as I can tell, a meaningless term since all testing is property-based.