My favorite benchmark is to analyze a very long audio file recording of a manage...

rfw300 · 2025-11-18T19:00:11 1763492411

My audio experiment was much less successful — I uploaded a 90-minute podcast episode and asked it to produce a labeled transcript. Gemini 3:

- Hallucinated at least three quotes (that I checked) resembling nothing said by any of the hosts

- Produced timestamps that were almost entirely wrong. Language quoted from the end of the episode, for instance, was timestamped 35 minutes into the episode, rather than 85 minutes.

- Almost all of what is transcribed is heavily paraphrased and abridged, in most cases without any indication.

Understandable that Gemini can't cope with such a long audio recording yet, but I would've hoped for a more graceful/less hallucinatory failure mode. And unfortunately, aligns with my impression of past Gemini models that they are impressively smart but fail in the most catastrophic ways.

Rudybega · 2025-11-18T20:38:03 1763498283

I wonder if you could get around this with a slightly more sophisticated harness. I suspect you're running into context length issues.

Something like

1.) Split audio into multiple smaller tracks. 2.) Perform first pass audio extraction 3.) Find unique speakers and other potentially helpful information (maybe just a short summary of where the conversation left off) 4.) Seed the next stage with that information (yay multimodality) and generate the audio transcript for it

Obviously it would be ideal if a model could handle the ultra long context conversations by default, but I'd be curious how much error is caused by a lack of general capability vs simple context pollution.

ant6n · 2025-11-18T20:01:43 1763496103

The worst when it fails to eat simple pdf documents and lies and gas lights in an attempt to cover it up. Why not just admit you can’t read the file?

nomel · 2025-11-18T22:43:58 1763505838

This is specifically why I don't use Gemini. The gaslighting is ridiculous.

satvikpendem · 2025-11-19T07:45:57 1763538357

Now try an actual speech model like ElevenLabs or Soniox, not something not made for it.

satvikpendem · 2025-11-18T17:43:41 1763487821

I'd do the transcript and the summary parts separately. Dedicated audio models from vendors like ElevenLabs or Soniox use speaker detection models to produce an accurate speaker based transcript while I'm not necessarily sure that Google's models do so, maybe they just hallucinate the speakers instead.

trvz · 2025-11-19T07:12:40 1763536360

Agreed. I don’t see the need for Gemini to be able to do this task, although it should be able to offload it to another model.

iagooar · 2025-11-18T16:14:45 1763482485

What prompt do you use for that?

gregsadetsky · 2025-11-18T16:54:22 1763484862

I just tried "analyze this audio file recording of a meeting and notes along with a transcript labeling all the speakers" (using the language from the parent's comment) and indeed Gemini 3 was significantly better than 2.5 Pro.

3 created a great "Executive Summary", identified the speakers' names, and then gave me a second by second transcript:

    [00:00] Greg: Hello.
    [00:01] X: You great?
    [00:02] Greg: Hi.
    [00:03] X: I'm X.
    [00:04] Y: I'm Y.
    ...

Super impressive!

HPsquared · 2025-11-18T17:05:24 1763485524

Does it deduce everyone's name?

gregsadetsky · 2025-11-18T17:09:19 1763485759

It does! I redacted them, but yes. This was a 3-person call.

punnerud · 2025-11-18T16:56:48 1763485008

I made a simple webpage to grab text from YouTube videos: https://summynews.com Great for this kind of testing? (want to expand to other sources in the long run)

renegade-otter · 2025-11-18T17:25:33 1763486733

It's not even THAT hard. I am working on a side project that gets a podcast episode and then labels the speakers. It works.

valtism · 2025-11-18T16:54:36 1763484876

Parakeet TDT v3 would be really good at that

kridsdale3 · 2025-11-18T23:30:52 1763508652

Yes, this is the best solution for that goal. Use the MacWhisper app + Parakeet 3.