The people selling the proverbial shovels are turning a profit.
> The bigger AI companies like OpenAI and Anthropic are not losing money in their monetized APIs.
[CITATION NEEDED]
Inference is more expensive than claimed; it is used extensively as a 'slot machine', with users trained to keep re-generating until they get something useful; and it only gets more expensive as model quality has to go up.
And in practice, training the model is far less one-off than claimed. Current tools are not sufficient.
> Every place I’ve worked is absolutely reducing costs and doing new / more business as a result of their LLM use.
Unless you are working in SEO, Marketing, or spam, I don't believe you.
LLMs aren't reliable enough to replace actual human labour. While it's true many companies are fooling themselves into believing they're reducing costs, in practice other staff are picking up the slack. This is unsustainable unless your company has massively overhired.
Things like "AI generated software tests" are a farce. The consequences aren't immediate, but will show up long term.
I work closely with both companies quite a lot so you can either take my word for it or not - it doesn’t bother me either way to be honest. They are pricing inference to make money.
I don’t feel like you really have much experience using LLMs in business. One example of where they’re very powerful is summarization. We have a pretty complex customer support model for our fraud and other cases, with various disparate data sets: prior cases, related possible fraudsters identified via our fraud models, and so on. We built a copilot LLM multi-agent system that has access to various functions as sub-agents, each prompted and context-aware for summarizing its specified data set. The sub-agents can also render widgets on demand, or when their context implies one is relevant. This lets a lot of complex, high-cognitive-load information be distilled rapidly, and the investigators can interrogate the copilot on a case. As the copilot develops “answers” as a summary, it dynamically renders an appropriate contextual dashboard with the relevant visualization.
By structuring the application as a multi-agent model we can constrain the LLM to well-specified tasks, with fine-tunings and very specific contexts for each task. This almost entirely eliminates hallucination and forgetfulness, and even when the model does slip, the actual ground truth is visualized for the investigator.
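A minimal sketch of the pattern described above, assuming a design like the one in this comment: each sub-agent sees only its own narrow data set and a fixed task prompt, and the copilot just routes questions and collects scoped summaries for the human. Every name here (`summarize_llm`, the data-set fetchers) is a hypothetical stand-in, not the poster's actual system or any real API.

```python
# Sketch: copilot fanning out to context-scoped sub-agents.
# summarize_llm is a stub standing in for a fine-tuned model call.
from dataclasses import dataclass
from typing import Callable

def summarize_llm(task_prompt: str, context: str) -> str:
    """Stand-in for a fine-tuned summarization model call."""
    return f"[summary of {len(context.split())} items for task: {task_prompt}]"

@dataclass
class SubAgent:
    name: str
    task_prompt: str                      # fixed, task-specific instructions
    fetch_context: Callable[[str], str]   # retrieves ONLY this agent's data set

    def answer(self, case_id: str) -> str:
        # Context is scoped to a single data set, which constrains what
        # the model can talk about and shrinks the hallucination surface.
        return summarize_llm(self.task_prompt, self.fetch_context(case_id))

def copilot(case_id: str, agents: list[SubAgent]) -> dict[str, str]:
    # The copilot does no reasoning itself; it fans out to the scoped
    # agents and returns their summaries for the human investigator.
    return {a.name: a.answer(case_id) for a in agents}

agents = [
    SubAgent("prior_cases", "Summarize prior cases for this customer.",
             lambda cid: f"case history for {cid}"),
    SubAgent("linked_fraudsters", "Summarize linked suspects from fraud models.",
             lambda cid: f"graph links for {cid}"),
]
print(copilot("case-42", agents))
```

The constraint lives in the structure: each agent's prompt and context are fixed per task, so no single call ever sees the whole problem.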
Prior systems either dumped massive amounts of cognitive load in the investigator's face or took man-years of effort to create a specific workflow, and in an adversarial, dynamic space like fraud you need a much more dynamic approach to different types of new attacks.
We aren’t replacing anyone; that’s not our goal. In fact we grew our investigator footprint, because both our precision and recall have grown dramatically, making our losses much smaller. We hire more skilled investigators, and in greater numbers, to address more suspected cases faster and better.
Listen. When John Henry battled the steam drill he did win, but it killed him. Go to any modern bore site and you won’t see fewer people working on the tunnel but more: people who aren’t there for their strong back and ability to swing a pick, but because they’re highly trained experts. They’re just building more complex tunnels that don’t collapse and don’t lose dozens of workers per dig.
This form of automation is no different in my experience so far.
So, if all you can see is SEO and grift, it might be a lack of imagination and experience on your part, with some magical AI thinking sprinkled in. All your points about LLMs' failures are true, but they all have solutions; they don't require the "slot machine" you describe, and they don't imply it's all a scam. LLMs are a tool like any other, and they require handling in specific ways to be most effective. ChatGPT being a pretty unconstrained interface that leads to issues doesn't mean that's the only way to use the tech.
"Use of LLMs to generate software is dumb." Pro tip, though: LLMs are actually pretty remarkable at generating Cucumber tests, since Gherkin is a natural-language grammar that plays to their native strength better than producing programming-language grammars. This is useful if, say, you have business people writing effectiveness testing: they provide a specification of policy, and a well-prompted LLM generates pretty exhaustive Cucumber tests (which tend to be redundant and formulaic when asserting positive and negative cases exhaustively), which can then be revised by hand as needed. Since the tests are natural language, the business people tend to be pretty good at debugging them up front, and with a large set of Cucumber tests written by hand you'll see tons of errors anyway. The LLM-generated tests tend to be much, much higher quality than the human-written ones.
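The "redundant and formulaic" shape of exhaustive Gherkin is easy to see: each policy rule expands into near-identical positive and negative scenarios. This hypothetical generator (rule names and wording invented here, not from the comment) performs that expansion mechanically; the comment's claim is that an LLM does the same expansion directly from a natural-language policy document.

```python
# Expand one policy rule into the formulaic positive/negative
# Gherkin scenario pair that exhaustive Cucumber suites are full of.
def gherkin_for_rule(rule: str, condition: str, outcome: str) -> str:
    return "\n".join([
        f"Scenario: {rule} (positive case)",
        f"  Given a transaction where {condition}",
        "  When the policy is evaluated",
        f'  Then the outcome is "{outcome}"',
        "",
        f"Scenario: {rule} (negative case)",
        f"  Given a transaction where it is not the case that {condition}",
        "  When the policy is evaluated",
        f'  Then the outcome is not "{outcome}"',
    ])

print(gherkin_for_rule(
    "Large foreign transfer requires review",
    "the amount exceeds $10,000 and the destination is foreign",
    "manual review",
))
```

Because the output stays in Given/When/Then natural language, a business reviewer can sanity-check it without reading code, which is the debugging dynamic the comment describes.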
As for the finances, we will simply have to agree to disagree. To be convinced I would need detailed financial data that you cannot and should not share with random strangers.
But to say something useful, let me try to elaborate my general criticism here:
> Prior systems either dumped massive amounts of cognitive load in the investigator's face or took man-years of effort to create a specific workflow, and in an adversarial, dynamic space like fraud you need a much more dynamic approach to different types of new attacks.
This raises a question: why didn't a computer system to summarize this data already exist? Or rather, what stopped the prior systems from doing this work? (And I'll count conventional machine learning, classifiers and the like, as traditional computer systems here.)
And there are generally two options here:
1. Conventional computer systems absolutely could do this work, but they just haven't been built. (Say, because nobody signed off on the R&D but would sign off on AI hype R&D)
2. The LLM system is doing a task the conventional computer system cannot do.
Number one's problem is simple: it's just inefficient and wasteful. Number two is a red flag: there's very little overlap between the things a conventional computer system cannot do and the things you can trust an LLM to do reliably.
As you describe this system, selecting which data is relevant for fraud investigation is a very traditional classification task. Using normal machine learning for that is basically industry standard.
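To make "a very traditional classification task" concrete, here is a toy relevance scorer in the shape such systems usually take. The feature names and weights are invented for illustration; in practice the weights would come from a trained model (logistic regression or similar), not be hand-set.

```python
# Toy logistic relevance scorer: plain, auditable code of the kind
# "normal machine learning" pipelines reduce to at inference time.
import math

WEIGHTS = {"shared_device": 2.0, "shared_address": 1.5, "same_ip_range": 0.8}
BIAS = -2.0

def relevance(record: dict) -> float:
    """Logistic score: estimated probability a record is relevant to the case."""
    z = BIAS + sum(w for f, w in WEIGHTS.items() if record.get(f))
    return 1.0 / (1.0 + math.exp(-z))

records = [
    {"id": 1, "shared_device": True, "shared_address": True},
    {"id": 2, "same_ip_range": True},
]
relevant = [r["id"] for r in records if relevance(r) > 0.5]
print(relevant)  # record 1 scores above 0.5, record 2 below
```

The point is that every number in this pipeline is inspectable, which is exactly the property that makes it the industry-standard choice for relevance selection.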
So what's the LLM actually doing? Subtract the hard logic of normal software and the classification of machine learning, and the answer is generally: a complex, nuanced reasoning task.
But that's precisely what LLMs are not to be trusted for, because they are incapable of that kind of reasoning.
> Listen. When John Henry battled the steam drill he did win, but it killed him.
You're missing the point I was making with that remark. It's not about firing people or not.
It's that these systems are dangerous to evaluate from a high level. It's very easy to miss externalities that will tip the entire endeavour into a net negative. You need to investigate what exactly the AI systems are doing, at a specific, detailed level.
E.g.:
> This is useful if, say, you have business people writing effectiveness testing: they provide a specification of policy, and a well-prompted LLM generates pretty exhaustive Cucumber tests (which tend to be redundant and formulaic when asserting positive and negative cases exhaustively), which can then be revised by hand as needed.
"A specification that has been prompted into sufficient detail" is just a program. You're describing the most inefficient declarative programming stack on the planet.
Granted, the programming stack to actually declare business rules this way isn't very good, but using AI here is just an error-prone transpiler.
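The alternative being gestured at, declaring the business rules directly as code with no natural-language round-trip, can look something like this. A hypothetical example with invented rule names, not anyone's actual system:

```python
# Business rules as data plus predicates: first matching rule wins,
# and every rule is directly testable, with no generated Gherkin
# intermediate to drift out of sync with the policy.
from typing import Callable

Rule = tuple[str, Callable[[dict], bool], str]  # (name, condition, outcome)

RULES: list[Rule] = [
    ("large_foreign_transfer",
     lambda t: t["amount"] > 10_000 and t["foreign"],
     "manual_review"),
    ("default",
     lambda t: True,
     "approve"),
]

def evaluate(txn: dict) -> str:
    return next(outcome for _, cond, outcome in RULES if cond(txn))

print(evaluate({"amount": 25_000, "foreign": True}))  # manual_review
print(evaluate({"amount": 500, "foreign": False}))    # approve
```

Here the specification and the executable artifact are the same object, which is the "declarative stack" the transpiler framing points at.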
It's very easy to "looks good to me" these tests and declare the project a success, yet miss subtle errors in the generated tests. I remain skeptical about how well these tests will hold up over the longer term.
You misunderstand. We are one of the top shops for ML-based fraud detection. But when someone is accused of fraud, they get to appeal it. Then a human is in the loop, and the model scores and all inputs are investigated and compared against many other sets of data, policy, etc. LLMs facilitate this effort by making the investigator's job considerably easier in navigating the enormous amount of complex information. The LLM's role is not to make decisions but to assist in navigating and understanding a lot of high-cognitive-load information. We have been doing a lot for many years, using traditional techniques, to make this process easier, but LLMs unlocked a level of dynamism and responsive UX that has blown the lid off our ability to adjudicate appeals. This has significant economic gain for us, as offboarding legit customers for fraud causes a lot of losses over the long term.
The LLM isn’t used for reasoning at all; the human does all the reasoning. The LLM's task is summarization and semantic analysis of relevance, which LLMs are fantastic at, especially in a well-managed, fine-tuned environment with guard rails and context scoping. It’s a true copilot scenario: the LLM takes direction from the human and answers questions only. All decisions are investigator-driven. This is the right relationship. The LLM, coupled with IR tools, does information retrieval and summarization, and the human makes decisions and reasons.