heikkilevanto's comments

Well, if you have a perfect evaluation function, you don't need to search. And if you can do a perfect search to the end, you don't need an evaluation function. Un(?)fortunately, neither of these extremes seems feasible for a game like chess (and even less so for go). So most software uses both search and evaluation, plus a whole lot of optimization and other tricks. With impressive results.
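The interplay can be sketched with a toy negamax that searches a couple of plies and then falls back on a heuristic evaluation at the frontier. The tree and the scores below are made-up values, not a real chess position:

```python
# Sketch of how engines blend the two: search a few plies, then fall back
# on a heuristic evaluation at the frontier.
TREE = {                     # node -> child nodes (moves)
    "root": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "b2"],
}
EVAL = {"root": 0, "a": 3, "b": 5,            # heuristic score of each
        "a1": 2, "a2": 7, "b1": 1, "b2": 4}   # node, for the side to move

def negamax(node, depth):
    children = TREE.get(node, [])
    if depth == 0 or not children:   # frontier: trust the evaluation
        return EVAL[node]
    # interior: trust the search, negating the opponent's best score
    return max(-negamax(child, depth - 1) for child in children)

print(negamax("root", 1), negamax("root", 2))  # -3 2
```

Note that searching one ply deeper changes the answer entirely, which is exactly the tension between the two extremes.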

I just bought Kampot peppers from https://www.unclespepper.com/ which is in Germany, the name notwithstanding. And yes, I paid with my Danish Visa card. No problems except that I had to adjust my ad blocker once.

If we consider the prompts and LLM inputs to be the new source code, I want to see some assurance we get the same results every time. A traditional compiler will produce a program that behaves the same way, given the same source and options. Some even go out of their way to guarantee they produce the same binary output, which is a good thing for security and package management. That is why we don't need to store the compiled binaries in the version control system.
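That determinism is easy to demonstrate with CPython's own bytecode compiler standing in for a traditional compiler. This is only a sketch: the hash is stable for the same source and interpreter version, and changing the interpreter version changes it, which incidentally mirrors the model-upgrade problem:

```python
# A deterministic compiler maps identical source to an identical
# artifact, which is why the artifact need not live in version control.
import hashlib
import marshal

source = "def add(a, b):\n    return a + b\n"

def build(src):
    code = compile(src, "<src>", "exec")     # "compile" the source
    # serialize the code object and hash it as the build artifact
    return hashlib.sha256(marshal.dumps(code)).hexdigest()

assert build(source) == build(source)        # same source, same artifact
assert build("x = 1\n") != build(source)     # different source, different one
```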

Until LLMs start to get there, we still need to save the source code they produce, and review and verify that it does what it says on the label, and doesn't do it in a totally stupid way. I think we have a long way to go!


> If we consider the prompts and LLM inputs to be the new source code, I want to see some assurance we get the same results every time.

There’s a related issue that gives me deep concern: if LLMs are the new programming languages we don’t even own the compilers. They can be taken from us at any time.

New models come out constantly and over time companies will phase out older ones. These newer models will be better, sure, but their outputs will be different. And who knows what edge cases we’ll run into when being forced to upgrade models?

(and that’s putting aside what an enormous step back it would be to rent a compiler rather than own one for free)


> New models come out constantly and over time companies will phase out older ones. These newer models will be better, sure, but their outputs will be different.

IIUC, same model with same seed and other parameters is not guaranteed to produce the same output.

If anyone is imagining a future where your "source" git repo is just a bunch of highly detailed prompt files and "compilation" just needs an extra LLM code generator, they are signing up for disappointment.


>IIUC, same model with same seed and other parameters is not guaranteed to produce the same output.

Models are so large that random bit flips make such guarantees impossible with current computing technology:

https://aclanthology.org/2025.emnlp-main.528.pdf


Presumably, open models will work almost, but not quite, as well and you can store those on your local drive and spin them up in rented GPUs.


Greedy decoding gives you that guarantee (determinism). But I think you'll find it to be unhelpful. The output will still be wrong the same % of the time (slightly more, in fact) in equally inexplicable ways. What you don't like is the black box unverifiable aspect, which is independent of determinism.
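A toy decoder pulls the two notions apart: greedy decoding (temperature 0) is reproducible, sampling is not, yet neither changes whether the underlying scores are any good. The scores here are invented, and real serving stacks have further sources of nondeterminism:

```python
# Toy next-token decoder over made-up scores for a single step.
import math
import random

SCORES = {"the": 2.1, "a": 2.0, "cat": 0.5}   # fake logits

def decode(scores, temperature, rng):
    if temperature == 0:                       # greedy: always the argmax
        return max(scores, key=scores.get)
    # otherwise sample from the softmax of temperature-scaled scores
    probs = [math.exp(s / temperature) for s in scores.values()]
    return rng.choices(list(scores), weights=probs, k=1)[0]

# greedy is reproducible regardless of the seed...
assert all(decode(SCORES, 0, random.Random(i)) == "the" for i in range(100))
```

...but a deterministic wrong answer is still a wrong answer.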


If you’re using a model from a provider (not one that you’re hosting locally), greedy decoding via temperature = 0 does not guarantee determinism. A temperature of 0 doesn’t result in the same responses every time, in part due to floating-point precision and in part due to the lack of batch invariance [1].

[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...


What people don’t like is that the input-output relation of LLMs is difficult, if not impossible, to reason about. While determinism isn’t the only factor here (you can have a fully deterministic system that is still unpredictable in practical terms), it is still a factor.


The question is: if we keep the same model, the same context, and the same LLM configuration (quantization etc.), does it produce the same output for the same prompt?

If the answer is no, then we cannot treat it as a high-level language. The whole purpose of a language is to provide useful, concise constructs that leave nothing unspecified (no undefined behavior).

If we can't guarantee that the behavior of the language will be the same, it is no better than handing someone some requirements and not checking what they are doing until the delivery date.


Mario Zechner has a very interesting article where he deals with this problem (https://mariozechner.at/posts/2025-06-02-prompts-are-code/#t...). He's exploring how structured, sequential prompts can achieve repeatable results from LLMs, which you still have to verify. I'm experimenting with the same, though I'm just getting started. The idea I sense here is that perhaps a much tighter process of guiding the LLM, with current models, can get you repeatable and reliable results. I wonder if this is the way things are headed.

> I want to see some assurance we get the same results every time

Genuine question, but why not set the temperature to 0? I do this for non-code related inference when I want the same response to a prompt each time.


A temperature of 0 doesn’t result in the same responses every time, in part due to floating-point precision and in part due to the lack of batch invariance [1].
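The floating-point part is easy to demonstrate: addition is not associative, so the same terms summed in a different order (as a different batch grouping on the accelerator can cause) round differently, and once a logit argmax flips, the rest of the generation diverges:

```python
# Floating-point addition is not associative on IEEE-754 doubles.
a = (1e16 + 1.0) - 1e16   # the 1.0 is absorbed by the big term -> 0.0
b = (1e16 - 1e16) + 1.0   # same numbers, different order -> 1.0
print(a, b, a == b)       # 0.0 1.0 False
```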

[1] https://thinkingmachines.ai/blog/defeating-nondeterminism-in...


Thank you for this, this was a really interesting read about batch invariance, something I didn't even know about.


This still doesn't help when you update your compiler to use a newer model


Anyone doing benchmarks with managed runtimes, or serverless, knows it isn't quite true.

Which is exactly one of the examples the AOT-only, no-GC crowd uses to argue that their approach is better.


Reproducible builds exist. AOT/JIT and GC are just not very relevant to this issue, not sure why you brought them up.


Because they are dynamic compilers!


But there is functional equivalence. While I don't want to downplay the importance of performance, we're talking about something categorically different when comparing LLMs to compilers.


Not when those LLMs are tied to agents, replacing what would be classical programming.

Using low-code platforms with AI-based automations, like most iPaaS products are now doing.

If the agent is able to retrieve the required data from a JSON file, fill in an email with the proper subject and body, and send it to another SaaS application, that is one less integration middleware that needed to be written.

From any practical business point of view, it is an application.


Even those are way more predictable than LLMs, given the same input. But more importantly, LLMs aren’t stateless across executions, which is a huge no-no.


> But more importantly, LLMs aren’t stateless across executions, which is a huge no-no.

They are, actually. A "fresh chat" with an LLM is non-deterministic but also stateless. Of course agentic workflows add memory, possibly RAG etc. but that memory is stored somewhere in plain English; you can just go and look at it. It may not be stateless but the state is fully known.


Using the managed runtime analogy, what you are saying is that, if I wanted to benchmark LLMs like I would do with runtimes, I would need to take the delta between versions, plus that between whatever memory they may have. I don’t see how that helps with reproducibility.

Perhaps more importantly, how would I quantify such “memory”? In other words, how could I verify that two memory inputs are the same, and how could I formalize the entirety of such inputs with the same outputs?


Are you certain to predict the JIT generated machine code given the JVM bytecode?

Without taking anything else into account that the JIT uses on its decision tree?


For a single execution, to a certain extent, yes.

But that’s not the point I’m trying to make here. JIT compilers are vastly more predictable than LLMs. I can take any two JVMs from any two vendors, and over several versions and years, I’m confident that they will produce the same outputs given the same inputs, to a certain degree, where the input is not only code but GC, libraries, etc.

I cannot do the same with two versions of the same LLM offering from a single vendor, that had been released one year apart.


Good luck mapping OpenJDK with Azul's cloud JIT, in generated machine code.


The output being the actual program output, not the byte code. No one is arguing that in the scope of LLMs.


Enough so that I've never had a runtime issue because the compiler did something odd once and was correct the next time. At least in C#. If Java is doing that, then stop using it...

If the compiler had issues like LLMs do, then half my builds would be broken while running from the same source.


> If we consider the prompts and LLM inputs to be the new source code, I want to see some assurance we get the same results every time.

Give a spec to a designer or developer. Do you get the same result every time?

I’m going to guess no. The results can vary wildly depending on the person.

The code generated by LLMs will still be deterministic. What is different is the tools the product team uses to create that product.

At a high level, does using LLMs to do all or most of the coding ultimately help the business?


This comparison holds up to me only in the long standing debate "LLMs as the new engineer", not "LLMs as a new programming language" (like here).

I think there are important distinctions there, predictability being one of them.


Even as an SSWE I do often wonder if I am but a high-level language.


A bit too mathematical for my taste. I learned my tuning theory by owning a harpsichord and learning to tune it. A harpsichord is more sensitive to the "rounding errors" in equal tuning, owing to its richer overtones, so equal temperament does not sound like quite as good a compromise as it does on a piano. And those historical temperaments are so much easier to tune by ear. Besides, that is what they used at the time of Bach, so it is historically correct for playing Baroque music.


I thought the article would be about the various meanings of operators like = == === .=. <== ==> <<== ==>> (==) => =~=


What is this, a Haskell for ants?


It has to be at least… three times bigger than this


My first association was the brainf..k (*.bf) programming language


This ended up being way more interesting


Several times, to be sure


Same here. Tried out Proton and Fastmail, and chose Fastmail. Been happy with it for a few months so far.


I wonder how the data in Danish MitId is managed and stored. The thing is used for everything here, from doing taxes to buying real estate to getting a library card.


I think there must be a continuous spectrum between the extremes you describe.


I had a similar experience many decades ago, taking a long overland trip and being out of touch of news for almost half a year. Coming back, I realized that the world had gone on perfectly well without me following all the daily drama. Most news seemed so irrelevant for a while after that trip.

Of course I fell back in to following the news, and the rest of the internet. Thank you for reminding me that it is not so important.


I did this except it was more like a month. When I got back I realised how much happier I was to be off of X and oblivious to the news. There is virtually zero utility to being "informed" of most things, and plenty of downsides.


I once stumbled upon the idea of calculating the signal-to-noise ratio of "news". Say you consume 30 pieces of "news" a day; you are at roughly 10k "news" pieces a year. How many of those influence a decision you make, like which job to take, whom to propose to, or where to move?

The author (Hans Rosling, by the way) showed with this little thought experiment how little signal for our personal lives and our important decisions lies in "news".
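The back-of-the-envelope arithmetic checks out; the count of decisions actually influenced is my own guess:

```python
# Rosling-style estimate: 30 items a day is about 11k a year.
items_per_year = 30 * 365            # 10950
decisions_influenced = 5             # assumed: job, move, vote, ...
snr = decisions_influenced / items_per_year
print(items_per_year, f"{snr:.2%}")  # 10950 0.05%
```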

I also worked in publishing for a while as my first job out of university. Ever since I left that industry I am so happy to be out of that drama generating machine.


The big one is voting, of course. It's not like following international politics will affect which firmness of mattress I buy.


I can easily read up on and research the actual behavior of any candidate, their program (a.k.a. the promises they won't keep), their party line (also a program; see above), and, in the last few years, how they actually voted on any issue.

I tend to prefer "just in time" and "up to date" research to "just in case" spamming my brain with noise.

But I also know how easy it is for me to fall back down the rabbit hole of the dopamine-inducing social or news streams. I actively had to purge social apps from my phone(s), and on my private phone I set up a launcher that only shows a few selected apps as text names. No icons, no nothing, and especially no badges or notifications for missed mails/messages.

This is currently the only way for me to battle these (at least to me) massively addictive systems.


I agree it's hard to not get sucked into the soap opera aspect of it all. Taking a break once in a while is a healthy strategy, but I think ultimately the sweet spot is to be informed about the important developments as they happen, something that shouldn't take more than 30 mins per day of watching news and political analysis while preparing food etc.

After all, there's more to the democratic process than just voting. There's a global conversation going on which needs informed and diverse participants. Plus there's a personal learning aspect to it, which goes much deeper when one tries to understand and anticipate trends as they happen only to have those inklings and reasonings being checked by the course of history. It's an ongoing lesson in the mysteries of human nature.


I have set up a few alerts and feeds I check semi-regularly. And I have a script sending me a summary of important world, regional, local news as a pushover notification in the morning. So I get two pushes - one is for the weather report, one for global events that might impact my life (I am still in the process of dialing in the sensitivity level of what the report should contain).

Because - on the other hand - I am very much interested in advances in different sciences and would love a better report on actual advances (and not just BS headlines about some new weight loss thing or currently alien speculations from interstellar objects)...

But I am getting there over time. So that I can increase the signal to noise ratio a bit.


But how often would your preferred candidate change depending on which, or how much, news you consumed? Most people I know are fairly set in their political opinions and already consume only news that confirms their biases.


It's highly unlikely following the news will make you better informed of politics. Pretty much everyone votes on vibes.


uninstalling twitter on my phone was one of the best decisions I made

