Prompt: https://t3.chat/share/ptaadpg5n8
Claude 4.5 Haiku (Reasoning High): 178.98 tokens/sec, 1691 tokens, Time-to-First: 0.69 sec
As a comparison, here is Grok 4 Fast, which is one of the worst offenders I have encountered: it does very well with the Pelican Bicycle prompt, yet not with other comparable requests: https://imgur.com/tXgAAkb
Prompt: https://t3.chat/share/dcm787gcd3
Grok 4 Fast (Reasoning High): 171.49 tokens/sec, 1291 tokens, Time-to-First: 4.5 sec
And GPT-5 for good measure: https://imgur.com/fhn76Pb
Prompt: https://t3.chat/share/ijf1ujpmur
GPT-5 (Reasoning High): 115.11 tokens/sec, 4598 tokens, Time-to-First: 4.5 sec
These are very subjective, naturally, but I personally find Haiku's spots on the mushroom rather impressive overall. In any case, the delta between publicly known benchmarks and modified scenarios evaluating the same basic concepts continues to be smallest with Anthropic models. Heck, sometimes I've seen their models outperform what public benchmarks indicated. Also, Time-to-First on Haiku seems to be another notable advantage.