> Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Can this "really, really impressive" thing be a result of luck?
No we don't and yeah we would expect them to only report positive results (this is both marketing and investigation). That being said, they provide all the code et al for people to review.
I do agree that an out-of-distribution test would be super helpful, but given what we know about LLMs it would almost certainly fail, so I'm not too pushed about it.
Look, I'm pretty sceptical about AI boosting, but this is a much better attempt than the windsurf browser thing from a few months back, and it's interesting to know that one can get this to work.
I do note that the article doesn't talk much about the harnesses needed to make this work, which, assuming this approach is plausible, is exactly the kind of thing that will be needed to make demonstrations like this more useful.
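For concreteness, here's a rough sketch of the sort of harness loop I mean: run the agent, run the test suite, feed the failures back, repeat. None of the names here (`query_model`, the test command) come from the article; they're placeholders.

```python
import subprocess

def run_tests(test_cmd: list[str]) -> tuple[bool, str]:
    """Run the suite; return (all_passed, captured output)."""
    result = subprocess.run(test_cmd, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def harness_loop(query_model, test_cmd: list[str], max_rounds: int = 100) -> int:
    """Drive the agent until the suite passes or the budget runs out."""
    prompt = "Make the test suite pass."
    for round_no in range(max_rounds):
        query_model(prompt)              # agent edits files on disk
        ok, log = run_tests(test_cmd)
        if ok:
            return round_no
        # Feed only the failure log back; the test suite is the
        # external signal the model steers by between rounds.
        prompt = f"Tests still failing:\n{log[-4000:]}\nFix and retry."
    raise RuntimeError("budget exhausted")
```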
> No we don't and yeah we would expect them to only report positive results (this is both marketing and investigation).
This is a matter of methodology. If they train models on that task, or somehow score/select models on their progress on that task, then we have test set leakage [1].
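For illustration, a minimal version of one standard contamination check: flag test items whose n-grams already occur in the training corpus. The 13-token window and the 0.8 threshold are just assumptions for the sketch, not anything from their methodology.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Whitespace-tokenized n-grams of a document."""
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contaminated(test_item: str, train_ngrams: set, threshold: float = 0.8) -> bool:
    """True if most of the test item's n-grams already exist in training data."""
    grams = ngrams(test_item)
    if not grams:
        return False
    overlap = len(grams & train_ngrams) / len(grams)
    return overlap >= threshold
```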
> This is a matter of methodology. If they train models on that task, or somehow score/select models on their progress on that task, then we have test set leakage [1].
I am quite familiar with leakage, having been building statistical models for maybe 15+ years at this point.
However, that's not really relevant in this particular case: LLMs are trained on approximately the entire internet, so there is no held-out test set to leak, apart from the tasks they get asked to do in post-training.
I think it's impressive that this works at all, even if it's just predicting tokens (which is basically what they're trained to do), as it's a pointer towards potentially more useful tasks (convert this COBOL code base to Java, for instance).
I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.
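To spell out what the test set is doing here, a toy differential harness: run every case through a trusted reference compiler and the model-built candidate, then diff the observable behaviour. The binary names and flags below are placeholders, not anything from the demo.

```python
import glob
import subprocess

def run_case(compiler: str, src: str) -> str:
    """Compile one source file and return its runtime stdout."""
    subprocess.run([compiler, src, "-o", "a.out"], check=True)
    return subprocess.run(["./a.out"], capture_output=True, text=True).stdout

def differential_pass(reference: str, candidate: str, pattern: str = "tests/*.c") -> list[str]:
    """Return the test cases where candidate and reference disagree."""
    failures = []
    for src in sorted(glob.glob(pattern)):
        if run_case(reference, src) != run_case(candidate, src):
            failures.append(src)
    return failures
```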
> I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.
Take any language with a compiler and several thousand users and you have plenty of tests that approximate the spec from both inside and outside.
To my knowledge, the GHDL test suite is sufficient and general enough to develop a pretty capable clone against. As far as I know, GHDL is the only open source VHDL compiler, and it is written in Ada. And, again, the expertise to implement another one from scratch to train an LLM on is very, very scarce: VHDL, being a highly parallel variant of Ada, is quirky as hell.
So someone could test your hypothesis on VHDL: agent-code a VHDL compiler and simulator in Rust so that it passes the GHDL test suite. Would it take two weeks and $20,000 as with C? I don't know, but I really doubt it.
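If someone wanted to run that experiment, the scoring harness could look roughly like this. `myvhdl` is a made-up name for the hypothetical agent-built compiler, and GHDL's actual test suite is shell-script driven, so this only shows the shape of the comparison.

```python
import subprocess

def ghdl_run(src: str, top: str) -> str:
    """Analyze, elaborate, and simulate a design with GHDL."""
    subprocess.run(["ghdl", "-a", src], check=True)   # analyze
    subprocess.run(["ghdl", "-e", top], check=True)   # elaborate
    out = subprocess.run(["ghdl", "-r", top], capture_output=True, text=True)
    return out.stdout

def candidate_run(src: str, top: str) -> str:
    # Hypothetical CLI for the agent-built compiler/simulator.
    out = subprocess.run(["myvhdl", "run", src, "--top", top],
                         capture_output=True, text=True)
    return out.stdout

def score(cases) -> int:
    """cases: iterable of (source file, top entity). Count agreements."""
    return sum(ghdl_run(s, t) == candidate_run(s, t) for s, t in cases)
```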