This is where one can notice that LLMs are, after all, just stochastic parrots. If we don't have a reliable way to systematically test their outputs, I don't see many jobs being replaced by AI either.
This is flatly false for two reasons. First, not all LLMs are equal: the models and their capacities are quite different, by design. Second, a large number of standardized LLM benchmarks test for sequences of logic or other "reasoning" capacities. Repeating the "stochastic parrots" line is basically proof of not having looked at the battery of standardized tests that are common in LLM development.
Even if not all LLMs are equal, almost all of them are built on the same underlying architecture: the transformer. So the general idea is always the same: predict the next token. This becomes obvious when you ask an LLM to solve something that can't be found on the internet (even something simple).
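To make "predict the next token" concrete, here is a minimal sketch of what the generation loop looks like, assuming the Hugging Face transformers library and the small GPT-2 checkpoint (both just illustrative choices, not a claim about any particular production model):

```python
# Minimal sketch of autoregressive next-token prediction.
# Assumes: pip install torch transformers; GPT-2 is only an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Each step is just "pick the most likely next token given everything so far",
# then append it and repeat. That's the whole generation loop.
for _ in range(5):
    with torch.no_grad():
        logits = model(input_ids).logits        # shape: (1, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()            # greedy choice of the next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Real deployments add sampling, temperature, and so on, but the core loop is still one token at a time.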
And the testing does not always work. You can only be confident that the output is truly correct maybe 80% of the time, and that forces you to check everything. Of course, using LLMs makes you faster for some tasks, and the fact that they can do so much is super impressive, but that's it.
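For what it's worth, the only LLM outputs I trust without re-reading them are the ones that sit behind a deterministic check. A rough sketch of what I mean (the `ask_llm` function and the `solution` name are hypothetical placeholders, not any real API):

```python
# Hedged sketch of the "check everything" loop: treat LLM output as untrusted
# and only accept it when a deterministic check passes.
from typing import Callable

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns model-generated Python source."""
    raise NotImplementedError("wire this up to your model of choice")

def accept_if_verified(prompt: str, tests: Callable[[Callable], bool]) -> Callable:
    """Ask the model for code, run it, and keep it only if the tests pass."""
    source = ask_llm(prompt)
    namespace: dict = {}
    exec(source, namespace)                    # untrusted code: sandbox this in real use
    candidate = namespace.get("solution")
    if candidate is None or not tests(candidate):
        raise ValueError("LLM output failed verification; a human has to review it")
    return candidate

# Example check: the generated `solution` must at least sort correctly.
def sorting_tests(fn: Callable) -> bool:
    return fn([3, 1, 2]) == [1, 2, 3] and fn([]) == []
```

When no such check exists (most writing, analysis, judgment calls), you're back to reviewing everything by hand, which is exactly the problem.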