While somewhat off-topic, I had an interesting experience the other day that highlighted the utility of GitHub's Copilot. I decided to run Copilot on a piece of code that was working correctly, to see if it would invent non-existent issues. Surprisingly, it pinpointed an actual bug. Following this discovery, I asked Copilot to generate a unit test to better understand the issue. Upon running the test, the program crashed just as Copilot had predicted. I then refactored the problematic lines as Copilot suggested. This was my first time witnessing Copilot's effectiveness in such a scenario, and it was small but significant proof to me that language models can be invaluable coding tools, capable of identifying and helping to resolve real bugs. They may have limitations, but I believe any imperfections are merely temporary hurdles on the way to more robust coding assistants.
Copilot at its present capabilities is already so valuable that not having it in some environment gives me the "disabledness feeling" that I otherwise only get when vim bindings are not enabled. Absolute miracle technology! I'm sure in the not-too-distant future we'll have privacy-preserving, open source versions that are good enough that we don't have to shovel everything over to OpenAI.
That sounds like very basic code review - which I guess is useful in instances where one can't get a review from a human. If it has a low enough false-positive rate, it could be great as a background CI/CD bot that chimes in on the PR/changeset comments to say "You may have a bug here."
One nice thing about a machine doing the code review: no tedious passive-aggressive interactions, no subjective style feedback you feel compelled to act on, etc.
Identifying potential bugs within a unit is only part of a good code review; good code reviews also identify potential issues with broader system goals, readability, idiomaticness, elegance, and "taste" (e.g. pythonicity in Python), which require larger contexts than LLMs can currently muster.
So yes, the ability to identify a bug and provide a unit test to reproduce it is rather basic[1] compared to what a good code review can be.
1. An org I worked for had one such question for entry-level SWE interviews, in three parts: What's wrong with this code? Design test cases for it. Write the correct version (and check that the tests pass).
Sharing knowledge, improving code quality, readability, and comprehensibility, reviewing test efficacy and coverage, validating business requirements and functionality, highlighting missing features or edge cases, etc. AI can fulfill this role, but it does so in addition to other automated tools like linters and whatnot; it isn't yet a replacement for a human, only an addition.
The better your code is before submitting it for review, the smoother the review will go, though. So if it's safe and allowed, by all means have Copilot take a look at your code first. But don't trust that it catches everything.
Calling it 'very basic' actually exalts the concept of code reviews, because the ideal code review is more than just identifying bugs in the code under review.
If I were to call the Mercedes A-Class a 'very basic Mercedes', it implies my belief in the existence of superior versions of the make.
Try it on a million-line code base where it's not so cut and dried to even determine whether the code is running correctly, or what "correctly" means when it changes day to day.
"A tool is only useful if I can use it in every situation".
LLMs don't need to find every bug in your code - even if they found an additional 10% of genuine bugs compared to existing tools, that's still a pretty big improvement to code analysis.
In reality, I suspect the scope is much higher than 10%.
If it takes you longer to vet hallucinations than to just test your code better, is it an improvement? If you accept a bug fix for a hallucination that you got too lazy to check because you grew dependent on AI to do the analysis for you, and the bug "fix" itself causes other unforeseen issues or fails to recognize why an exception in this case might be worth preserving, is it really an improvement?
What if indeed. Most static analysis tools (disclaimer: anecdotal) have very few false positives these days. This may be much worse in C/C++ land though, I don't know.
But a human also has a much, much shorter attention span and tolerance for BS.
If you ask the LLM to analyze those 1,000,000 lines 1000 at a time, 1000 times, it'll do it, with the same diligence and attention to detail across all 1000 chunks.
Ask a human to do it and their patience will be tested. Their focus will waver, they’ll grow used to patterns and miss anomalies, and they’ll probably skip chunks that look fine at first glance.
Sure, the LLM won't find big-picture issues at that scale. But it'll find plenty of code smells and minor logic errors that deserve a second look.
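To make the chunking idea concrete, here is a rough sketch of what I mean (it uses the openai Python client; the helper names, the model choice, and the prompt wording are all made up for illustration, not a real tool):

    # Rough sketch only: feed a big file to a model ~1000 lines at a time and
    # collect whatever "possible issue" notes come back. Uses the openai Python
    # client; review_chunk/review_file and the prompt wording are hypothetical.
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    SYSTEM_PROMPT = ("You are a code reviewer. List possible bugs or code smells "
                     "with line numbers, or reply 'none' if the chunk looks fine.")

    def review_chunk(path: str, start: int, lines: list[str]) -> str:
        # Number the lines so the model can point at specific ones.
        numbered = "\n".join(f"{start + i}: {line}" for i, line in enumerate(lines))
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": f"File {path}:\n{numbered}"},
            ],
        )
        return resp.choices[0].message.content

    def review_file(path: Path, chunk_size: int = 1000) -> None:
        lines = path.read_text(errors="replace").splitlines()
        for start in range(0, len(lines), chunk_size):
            notes = review_chunk(str(path), start + 1, lines[start:start + chunk_size])
            if "none" not in notes.lower():
                print(f"--- {path}, lines {start + 1}+ ---\n{notes}\n")

The point isn't that this is good tooling, just that it never gets bored between chunk 1 and chunk 1000.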
Ok, why don't you run this experiment on a large public open source code base? We should be drowning in valuable bug reports right now but all I hear is hype.
While true, on the other hand an AI is a tool, and a tool can have a much larger context size and apply all of it at once. It also isn't limited by availability or time constraints: if you have only one developer who can do a review, tooling or an AI that catches 90% of what that developer would catch is still a win.
Yesterday I separated a 5000-line class into smaller domains. It didn't provide the end solution and it wasn't perfect, but it gave me a good plan for where to place what.
Once it is capable of processing larger context windows, it will become impossible to ignore.
That's rather an exception in my experience. For unit tests it starts hallucinating hard once you have functions imported from other files. This is probably the reason most unit tests in their marketing materials are for things like Fibonacci…
How did you prompt Copilot to identify issues? In my experience the best I can do is put in code comments describing what I want a snippet to do, and Copilot tries to write it. I haven't had good luck asking Copilot to rewrite existing code. The nearest I've gotten is:
// method2 is identical to method1 except it fixes the bugs
public void method2(){
These things are amazing when you first experience them, but I think in most cases the user fails to realise how common their particular bug is. But then you also need to realise there may be bugs in what has been suggested. We all know there are issues with Stack Overflow responses too.
Probably 85% of codebases are just rehashes of the same stuff. Copilot has seen it all, I guess.
If not malicious, then this shows that there are people out there who don't quite know how much to rely on LLMs or understand the limits of their capabilities. It's distressing.
I can also attest as a moderator that there is some set of people out there who use LLMs, knowingly use LLMs, and will lie to your face that they aren't and aggressively argue about it.
The only really new aspect of that is the LLM part. The set of people who will, truly bizarrely, lie about total irrelevancies to people on the Internet even when they are fooling absolutely no one has always been small but non-zero.
Please, even if they were caught by "the authorities", it would just be a fine of such low monetary value that it would be considered a cost of doing business rather than a punishment.
People don't get charged with criminal counts for something they did as an employee of a company.
> It's really good at predicting text we like, but that's all it does.
It's important to recognize that predicting text is not merely about guessing the next letter or word, but rather a complex set of probabilities grounded in language and context. When we look at language, we might see intricate relationships between letters, words, and ideas.
Starting with individual letters, like 't,' we can assign probabilities to their occurrence based on the language and alphabet we've studied. These probabilities enable us to anticipate the next character in a sequence, given the context and our familiarity with the language.
As we move to words, they naturally follow each other in a logical manner, contingent on the context. For instance, in a discussion about electronics, the likelihood of "effect" following "hall" is much higher than in a discourse about school buildings. These linguistic probabilities become even more pronounced when we construct sentences. One type of sentence tends to follow another, and the arrangement of words within them becomes predictable to some extent, again based on the context and training data.
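As a toy illustration of that kind of conditional probability - not how a real LLM works internally, just word-bigram counts over a made-up two-sentence corpus:

    # Toy illustration only: word-bigram counts standing in for "probability of
    # the next word given the previous one". Real LLMs condition on far more
    # context than a single preceding word.
    from collections import Counter, defaultdict

    corpus = ("the hall effect sensor measures the field "
              "the school hall was crowded after the assembly").split()

    following = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        following[prev][nxt] += 1

    def next_word_probs(prev: str) -> dict[str, float]:
        counts = following[prev]
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    print(next_word_probs("hall"))  # {'effect': 0.5, 'was': 0.5} in this tiny corpus

Scale the corpus up by many orders of magnitude and condition on whole contexts instead of one word, and "hall" followed by "effect" versus "was" stops being a coin flip and starts tracking the topic of the conversation.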
Nevertheless, it's not only about probabilities and prediction. Language models, such as Large Language Models (LLMs), possess a capacity that transcends mere prediction. They can grapple with 'thoughts'—an abstract concept that may not always be apparent but is undeniably a part of their functionality. These 'thoughts' can manifest as encoded 'ideas' or concepts associated with the language they've learned.
It may be true that LLMs predict the next "thought" based on the corpus they were trained on, but that's not to say they can generalize this behavior beyond the "ideas" they were trained on. I'm not claiming generalized intelligence exists, yet.
Much like how individual letters and words combine to create variables and method names in coding, the 'ideas' encoded within LLMs become the building blocks for complex language behavior. These ideas have varying weights and connections, and as a result, they can generate intricate responses. So, while the outcome may sometimes seem random, it's rooted in the very real complex interplay of ideas and their relationships, much like the way methods and variables in code are structured by the 'idea' they represent when laid out in a logical manner.
Language is a means to communicate thought, so it's not a huge surprise that words, used correctly, might convey an idea someone else can "process", and that likely includes LLMs. That we get so much useful content from LLMs is a good indication that they are dealing with "ideas" now, not just letters and words.
I realize that people are currently struggling with whether or not LLMs can "reason". For as many times as I've thought it was reasoning, I'm sure there are many times it wasn't reasoning well. But, did it ever "reason" at all, or was that simply an illusion, or happy coincidence based on probability?
The rub with the word "reasoning" is that it directly involves "being logical", and how we humans arrive at being logical is a bit of a mystery. It's logical to think a cat can't jump higher than a tree, but what if it was a very small tree? The ability to reason about cats' jumping abilities doesn't require understanding that trees come in different heights, rather that when we refer to "tree" we mean "something tall". So reasoning has "shortcuts" to arrive at an answer about a thing without weighing all of that thing's probabilities. For whatever reason, most humans won't argue with you about tree height at that point and will just reply "No, cats can't jump higher than a tree, but they can climb it." By adding the latter part, they are not arguing the point, but rather ensuring that someone can't pigeonhole their idea of the truth of the matter.
Maybe when LLMs get as squirrely as humans in their thinking we'll finally admit they really do "reason".
> It's important to recognize that predicting text is not merely about guessing the next letter or word, but rather a complex set of probabilities grounded in language and context. When we look at language, we might see intricate relationships between letters, words, and ideas.
> Maybe when LLMs get as squirrely as humans in their thinking we'll finally admit they really do "reason".
I know we can argue about the definitions of "intelligence", "reasoning", or even "sentience". But at the end of the day we get a list of tokens and a list of probabilities for each token. Yes, it is extremely good at predicting tokens that embed information, and it can produce in-depth concepts and what at least appears to be reasoning.
Regardless, probabilities of course leave room for output that is incorrect or undesirable.
The more subtle point is that this cannot be corrected via what appears to humans as “conversation” with the LLM. Because it is more plausible that a confident liar keeps telling tall tales, than it is that the same liar suddenly becomes a brilliant and honest genius.
A human on the internet loves to argue, to stand for and prove a point, simply because they can. Guess what the AIs were trained on? People talking on the internet.
Which is fundamentally different from how our brain chains together thoughts when not actively engaging in meta thinking how? Especially once chain of thought etc. is applied.
It seems very similar to the case of the lawyers who used an LLM as a case-law search engine. The LLM spat out bogus cases; then, when the judge asked them to produce the cases because the references led nowhere, they asked the LLM to produce the cases, which it "did".
Or similarly the case where a professor failed an entire class of students (resulting in their diplomas being denied) for cheating on their essays using AI, because he asked an LLM whether the essays were AI-generated and it said yes.
We don't know what we can do with it yet and we don't understand the limits of their capabilities. Ethan Mollick calls it the ragged frontier[0], and that may be as good a metaphor as any. Obviously a frontier has to be explored, but the nature of that is that most of the time you are on one side or the other of the frontier.
Oh yeah. Google has warnings like "Bard may display inaccurate or offensive information that doesn’t represent Google’s views" all over it; Permanently in the footer, on splash pages, etc.
> Does Bard clearly warn to never rely on it for facts? I know OpenAI says "ChatGPT may give you inaccurate information" at the start of each session.
I know I shouldn't be, but I'm surprised the disclosure is even needed. People clearly don't understand how LLMs work -
LLMs predict text. That's it; they're glorified autocomplete (that's really good). When their prediction is wrong we call it a "hallucination" for some reason. Humans do the same thing all the time. Of course it's not always correct!
Of course not. Most developers don't understand how LLMs work, even roughly.
> Humans do the same thing all the time. Of course it's not always correct!
The difference is that LLMs cannot acknowledge incompetence, are always confidently incorrect, and will never reach a stopping point; at best they'll start going in circles.
Everything out of an LLM is a confabulation, but you can constrain the output space of that confabulation with proper prompting. You could ask it (in the prompt) to put confidence intervals on each of its sentences; those will be confabulated as well, but it would give it some self-doubt, which for now it hasn't been programmed with. Probably costs a lot more in power to run it with doubt. :)
Edit: I played around with this. It looks like GPT-4 has a guard against this kind of request. It flat-out refused the two prompts I gave it asking for confidence intervals. Maybe that is a good thing.
> The difference is that LLMs can not acknowledge incompetence, are always confidently incorrect, and will never reach a stopping point, at best they'll start going circular.
There's a second wind to this story in the Mastodon replies. It sounds like the LLM appeared to be basing this output on a CVE that hadn't yet been made public, implying that it had access to text that wasn't public. I can't quite tell if that's an accurate interpretation of what I'm reading.
>> @bagder it’s all the weirder because they aren’t even trying to report a new vulnerability. Their complaint seems to be that detailed information about a “vulnerability” is public. But that’s how public disclosure works? And open source? Like are they going to start submitting blog posts of vulnerability analysis and ask curl maintainers to somehow get the posts taken down???
>> @derekheld they reported this before that vulnerability was made public though
>> @bagder oh as in saying the embargo was broken but with LLM hallucinations as the evidence?
Took me a while to figure out from the toot thread and comment history, but it appears that the curl 8.4.0 release notes (https://daniel.haxx.se/blog/2023/10/11/curl-8-4-0/) referred to the fact that it included a fix for an undisclosed CVE (CVE-2023-38545); the reporter ‘searched in Bard’ for information about that CVE and was given hallucinated details utterly unrelated to the actual curl issue.
The reporter is complaining that they thought this constituted a premature leak of a predisclosure CVE, and was reporting this as a security issue to curl via HackerOne.
No, it's not that Bard was trained on information that wasn't public. It's that the author of the report thought that the information about the upcoming CVE was public somewhere because Bard was reproducing it, because the author thinks Bard is a search engine. So they filed a report that the curl devs should take that information offline until the embargo is lifted.
> I responsibly disclosed the information as soon as I found it. I believe there is a better way to communicate to the researchers, and I hope that the curl staff can implement it for future submissions to maintain a better relationship with the researcher community. Thank you!
I was curious how many bogus security reports big open source projects have. If you go to https://hackerone.com/curl/hacktivity and scroll down to ones marked as "Not-applicable" you can find some additional examples. No other LLM hallucinations, but some pretty poorly-thought out "bugs".
Gaslighting and confabulation are very different things.
Gaslighting is deliberate lying with the intent of creating self-doubt in the targeted person. Confabulation is creating falsehoods without an intent to deceive.
When we're discussing naming, it might be a good idea not to throw more misleading names onto the bonfire.
Gaslighting is also usually associated with disorders involving strong delusional behavior. Delusions are maladaptive protective behavior (a false worldview that avoids actual ‘dangerous’ thoughts or information), and when challenged in a threatening way, particularly dangerous folks often gaslight the threat. It’s easy and natural for them to do, because they already have all the tools necessary to maintain the original delusion.
It’s the ‘my world view will be unchallenged or I will destroy yours’ reaction.
NPD being a very common example. Certainly not the only one though!
That's even better, then, to address the issue of laypeople misinterpreting a distinctive problem according to familiar, overloaded definitions of the word used to refer to it.
LLMs do not have beliefs, so "delusion" is no better than "hallucination". As statistical models of texts, LLMs do not deal in facts, beliefs, logic, etc., so anthropomorphizing them is counter-productive.
An LLM is doing the exact same thing when it generates "correct" text that it's doing when it generates "incorrect" text: repeatedly choosing the most likely next token based on a sequence of input and the weights it learned from training data. The meaning of the tokens is irrelevant to the process. This is why you cannot trust LLM output.
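To make that loop concrete, here is a minimal sketch using the Hugging Face transformers library and GPT-2 with plain greedy decoding (production chat models add sampling, instruction tuning, and so on, but the core step is the same):

    # Minimal greedy-decoding sketch with GPT-2 via Hugging Face transformers.
    # Each step: run the model, take the most likely next token, append, repeat.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    input_ids = tokenizer("The curl library is", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(20):
            logits = model(input_ids).logits       # (1, seq_len, vocab_size)
            next_id = torch.argmax(logits[0, -1])  # most likely next token; its meaning is irrelevant to the loop
            input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

    print(tokenizer.decode(input_ids[0]))

Nothing in that loop checks whether the continuation is true; "correct" and "incorrect" completions come out of exactly the same mechanism.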
I think the right word is "bullshit". LLMs are neither delusional nor hallucinating, since they have no beliefs or sensory input. They just generate loads of fertilizer, and a lot of people like to spread it around.
This is the correct answer. It's not a hallucination. Its goal is to create something that seems like the truth despite the fact that it has no idea whether it's actually being truthful. If a human were doing this we'd call them a bullshitter, or if they were good at it, maybe even a bullshit artist.
IMHO it's fine to have a certain jargon within the context of "things neural nets do". The term comes from the days of Deep Dream, when image classifiers were run in reverse and introduced the public to computer-generated images that were quite psychedelic in nature. It's seeing things that aren't there.
LLMs don’t hold beliefs. Believing otherwise is itself a delusion.
In addition, the headline posted here doesn’t even say hallucinated, so that is also an hallucination. It says hallucineted. As portmanteaux go, that ain’t bad. I rather like the sense of referring to LLMs as hallucinets.
The phrase "you must be trippin'!" is commonly used by some when they say something completely nonsensical. I can easily see where how/why hallucinating was chosen.
It's clearly meant to poke fun at the system. If you think people are going to NOT use words in jest while making fun of something, perhaps you could use a little less starch in your clothing.
So the reporter thinks they were able to get accurate info about private details of an embargoed CVE from Bard. If that were correct, they would have found a CVE in Bard, not in curl.
In this case the curl maintainers can tell the details are made up and don't correspond to any CVE.
I'm not sure why this is interesting. AI was asked to make a fake vulnerability and it did. That's the sort of thing these AIs are good at, not exactly new at this point.
You're leaving out the "...and then they reported it to the project" part, which meant that the project maintainers had to put in time and effort responding to a reported vulnerability.
As someone who has been on the maintainer side of a bug bounty program - they are a mountain of BS with 1% being diamonds. This report probably didn't make much of a difference.
For one thing, for the last week I've seen several articles along the lines of "curl is vulnerable and will be exposed soon!!". For it to turn out this way is certainly a plot twist.
That's not the way it turned out. The curl vuln everyone was fretting about was https://curl.se/docs/CVE-2023-38545.html, still very much a serious and real vulnerability.
I do reverse engineering work every now and then, and a year ago I'd have called myself a fool, but I have found multiple exploitable vulnerabilities simply by asking an LLM (Claude refuses less often than GPT-4; GPT-4 generally got better results when the request was properly phrased).
One interesting find: I wrote a GPT-4 integration for binaryninja, and funnily enough, when asking the LLM to rewrite a function into "its idiomatic equivalent, refactored and simplified without detail removal" and then asking it to find vulnerabilities, it cracked most of our joke hack-mes in a matter of minutes.
Interesting learning: nearly all LLMs can't really work properly with disassembled Rust binaries. I guess that's because the output doesn't resemble the Rust source code the way it does for C and C++.
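For what it's worth, the prompt chain is roughly the following sketch (openai Python client; the binaryninja plumbing that produces the decompiled pseudocode is omitted, and the prompt wording here is only approximate, from memory):

    # Rough sketch of the two-step prompt chain described above: first ask for an
    # idiomatic rewrite of the decompiled pseudocode, then ask for vulnerabilities
    # in the rewrite. Uses the openai Python client; extracting `pseudocode` from
    # binaryninja is left out, and the exact prompts are approximate.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def audit(pseudocode: str) -> str:
        rewritten = ask(
            "Rewrite this decompiled function into its idiomatic equivalent, "
            "refactored and simplified without detail removal:\n\n" + pseudocode
        )
        return ask(
            "Find potential vulnerabilities in this function and explain how "
            "they could be triggered:\n\n" + rewritten
        )

The rewrite step seems to matter: the model does noticeably better hunting bugs in its own cleaned-up version than in raw decompiler output.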
This is confusing - the reporter claims to have "crafted the exploit" using the info they got from Bard. So the hallucinated info was actionable enough to actually perform the/an exploit, even though the report was closed as bogus?
No, they weren't able to "craft the exploit". The text claims an integer overflow bug in curl_easy_setopt and provides a code snippet that supposedly fixes it. Except the snippet has a completely different function signature from the real curl_easy_setopt, and doesn't even compile. I doubt this person did any follow-through at all; they just copy/pasted the output from Bard directly into the bug report.