Hacker News | prodigycorp's comments

Comments trashing this are rightly skeptical; they remember the benchmaxxing of Llama 4. This model was out in the woods as early as like a couple months ago but they didn't release it because it was at gemini 2.5 pro levels.

> 4. This model was out in the woods as early as like a couple months ago but they didn't release it because it was at gemini 2.5 pro levels.

Source? (Even if rumor)


NYTimes had a story about this (March 12):

> Meta’s new foundational A.I. model, which the company has been working on for months, has fallen short of the performance of leading A.I. models from rivals like Google, OpenAI and Anthropic on internal tests for reasoning, coding and writing, said the people, who were not authorized to speak publicly about confidential matters.

> The model, code-named Avocado, outperformed Meta’s previous A.I. model and did better than Google’s Gemini 2.5 model from March, two of the people said. But it has not performed as strongly as Gemini 3.0 from November, they said.

> They added that the leaders of Meta’s A.I. division had instead discussed temporarily licensing Gemini to power the company’s A.I. products, though no decisions have been reached.

https://www.nytimes.com/2026/03/12/technology/meta-avocado-a...

https://archive.is/uUV5h#selection-715.98-715.277


Ah yes, because the NYTimes is famously unbiased toward Meta in its reporting, while being hypocrites in their own right. They lost all credibility when they ran that huge series on Meta years ago about data harvesting while simultaneously rolling out tons of new data harvesting of their own subscribers to increase revenue.

It was from a techmeme ride home podcast where the host discussed "sources at the company said". I don't remember which day's episode it was.

The Llama 4 series was one of the earliest large MoEs to be made publicly available. People just ignored it because they were focused on running smaller, denser models at the time; we should know better these days.

Deepseek R1 was a publicly available MoE model that was getting a ton of attention before Llama 4. Llama 4 didn't get much attention because it wasn't good.

Also, Gemini 2.5 Pro launched a week before Llama 4.

It was Gemini 2.5 Pro that redeemed Google in the eyes of most people as a valid competitor to OpenAI instead of as a joke, so Meta dropping the ball with Llama 4 was extra bad.


the models were objectively horrible

They really weren't horrible. They were ~gpt4o, with the added benefit that you could run them on premise. Just "regular" models, non-"thinking". Inefficient architecture (ratio of active to total parameters), but otherwise "decent" models. They got trashed online by bots and Chinese shills (I was online that weekend when it happened; it was something to behold). Just because they were non-thinking when thinking was clearly the future doesn't make them horrible. Not SotA by any means, but still.

> They were ~gpt4o, with the added benefit that you could run them on premise.

No, they were bad models. They were benchmaxxed on LMArena and a few other benchmarks, but as soon as you tried them yourself they fell to pieces.

I have my own agentic benchmark[1] I use to compare models.

Llama-4-scout-17b-16e scores 14/25, while llama-4-maverick-17b-128e scores 12/25.

By comparison, gemma-4-E4B-it-GGUF:Q4_K_M scores 15/25 (that is a 4B-parameter model!), and even GPT-3.5 scores 13/25 (with some adjustment because it doesn't do tool calling).

Llama 4 was a bad model, unfortunately.

[1] https://sql-benchmark.nicklothian.com/#all-data


Wrote longer comment steel-manning this, posted it to a reply, then realized you might like to know they had a reasoning model on deck ready for release in the next 2-4 weeks.

Got shitcanned due to bad PR & Zuck God-King terraforming the org, so there'd be a year delay to next release.

Real tragi-comedy, and you have no idea how happy it makes me to see someone in the wild saying this. It sounds so bizarre to people given the conventional wisdom, but, it's what happened.


Nah, I remember how disgusted I felt trying Llama 4 Maverick and Scout. They were both DOA; they couldn't even beat much smaller local models.

Failing non-stop at tool calls on top of that.

I'll cosign what you said, simultaneously, yr interlocutor's point is also well-founded and it depresses me it's not better known and sounds so...off...due to conventional wisdom x God King Zuck's misunderstanding his own company and resulting overreaction.

They beat Gemini 2.5 Flash and Pro handily on my benchmark suite. (tl;dr: tool calling and agentic coding).

Llama 4 on Groq was ~GPT 4.1 on the benchmark at ~50% the cost.

They shouldn't have released it on a Saturday.

They should have spent a month with it in private prerelease, working with providers.[1]

The rushed launch and ensuing quality issues got rolled into the hypebeast narrative of "DeepSeek will take over the world".

I bet it was super fucking annoying to talk to due to LMArena maxxing.

[1] my understanding is longest heads up was single-digit days, if any. Most modellers have arrived at 2+ weeks now, there's a lot between spitting out logits and parsing and delivering a response.


Your comments seem to imply the engineers made a great product but Zuck intervened, so now it's shit.

I don't know how Zuck intervening could change float32s in a trained model, so I don't think I think that, but maybe I'm parsing your words incorrectly.

The way you put it, I understand it less. lol

So the answer is: no. lol. Remember Llama 4 Behemoth, and how we were supposed to get more great models from it?

Meta's benchmaxxing tendencies are well known. Llama 4 was mega-benchmaxxed, and nothing suggests to me that Meta's culture has changed.

Re: changes, there's been enormous turnover in AI organizations, and in theory this one was developed by a "new" org. Whether that means less or more benchmaxxing is anyone's guess.

More, I'd guess, since the new org needs to prove itself long enough for stock to vest. Fudging the benchmarks gives them a longer horizon before they're all fired anyway.

Wow, this song is horribly mastered.

It's not great in that way. The mastering -- if there is any -- is definitely kind of shit.

But that's a relatively easy thing for a human with the right combination of toolchain, ears, and experience to fix. It tends to be a slow process, but lots of actual mixdowns start off way worse than this before they get polished up by a skilled mastering engineer.

(Maybe in a year or three we'll have the mastering process automated into an uncanny mush of soullessness, as well.)


Audio mastering is already automated to the level of a mediocre human:

https://github.com/sergree/matchering

(I haven't actually tried this, I just watched the linked Benn Jordan video.)

IMO, the ideal would be for all music to be supplied unmastered so the listener's playback software can apply this process to their own taste. Mastering is necessary for listening with garbage playback equipment (e.g. phone speakers) or noisy listening environments (e.g. cars, parties), but it makes things sound worse in good conditions. The best sounding music CDs I own are classical CDs on Telarc that have liner notes bragging about the complete lack of mastering.


> Mastering is necessary for listening with garbage playback equipment (e.g. phone speakers) or noisy listening environments (e.g. cars, parties), but it makes things sound worse in good conditions.

Eh? I listened to it on quite good nearfield gear, in a decent room, and the AI track linked above still sounds like it needs a good bit of help from a responsible adult to bring it up on this rig. :)

Good mastering helps everywhere -- on all systems. For instance: The sound of Steely Dan is pretty good on playback with about anything, I think, and that sound took a ton of work.

And while classical music is not my first preference, I do love me a good Telarc recording. I strongly suspect that the signal path they use isn't necessarily quite as pure as they insist it is. Everything is a tone control, including a microphone -- and money is money. They're not going to reschedule an orchestra to fix an untoward blip at 3 kHz. They'll just fix it in post (hopefully, as minimally as possible) and send it.

But otherwise, I agree. The mastering process can be automated. Ultimately, it will be. And for sure, it will also be a customizable user preference.

Some of that work has already been in the bag for decades. Ford, for instance, has been using DSPs in their factory car audio systems to shape sounds in unconventional ways for over 30 years. This gives them a lot of knobs to turn, and to fix into constraints, to help shape a listener's chosen music to sound as good as it can on less-than-ideal built-down-to-price on-road audio systems.

Or at least: it sounds as good as it can to a consensus of engineers, or of a focus group.

But the knobs exist. And they don't have to be fixed or constrained: They can (and will) be automatically twisted to suit a listener's preferences.

I'll try to make time to check out your link in a day or two.


I went back and looked at what they wrote before, and they've written like this in the past. Tiresome game of accusation in the comments nowadays.

This is my first time seeing this site and it's actually nice. I've added it to my RSS feed.


If you pay closer attention you can see when they switched to frequent use of em dashes for parenthetical asides, around about January 2024.

I've been looking and, like I said... tiresome game! I could tell some differences, but going through the archives I ended up just enjoying the old posts lol

Quite so. The old posts (2023 and earlier) have a fresher writing style. Compare that to this latest one, which feels like an entry for Pseuds Corner.

Are you sure about that? Chain of thought does not need to be semantically useful to improve LLM performance. https://arxiv.org/abs/2404.15758

Still doesn't mean all tokens are useful; that's the point of benchmarks.

Care to share the benchmarks backing the claims in this repo?

If you're misusing LLMs to solve TC^0 problems, which is what the paper is about, then... you also don't need the slop lavine. You can just inject a bunch of filler tokens yourself.

The burden of proof is on the author to provide at least one type of eval for making that claim.

I notice that the number of people confidently talking about "burden of proof" and whose it allegedly is in the context of AI has gone up sharply.

Nobody has to prove anything. Evidence can give your claim credibility. If you don't provide any, an opposing claim without proof doesn't get any better.


Sorry, I don't see how engaging in this could lead to anything productive. There's already literature out there that gives credence to TeMPOraL's claim. And, after a certain point, gravity being the reason that things fall becomes so self-evident that every restatement doesn't require proof.

LLM quirks are not something all humans have been experiencing for thousands of years

> Nobody has to prove anything. Evidence can give your claim credibility

“I don’t need to provide proof to say things” is a trivial assertion that adds no value whatsoever to any discussion anyone has ever had.

If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.


It's not sad. He's a person, like you and me. devnullbrain's comment was snarky: he invoked model collapse, which has nothing to do with the topic of a wiki/KB; he wrote that Karpathy is not normal; and he seemed to imply that the idea was useless. I'd be pretty in my feels, and the fact that he wrote it and then deleted it seems like a +1 normal-guy thing.

Yeah. I know you didn’t see it, but it was truly a substance-free response. Glad to see he deleted it and I know I’ve been guilty of the same kind of knee-jerk response before.

I saw it. It sucked, I agree. But like you said, we all get one (or a few) of those.

Great work.

The Agent x Parent combo has become my favorite niche in LLM space. It's unlocked so much creativity at a time when we have the least disposable time.

