
The actual token calculations with input videos for Gemini 3 Pro are... confusing.

https://ai.google.dev/gemini-api/docs/media-resolution


That's because for non-text inputs, it isn't actually tokens that are fed into the model. Text is tokenized, and each token maps to a specific embedding vector. For other media, they've trained encoders that analyze the input and produce a set of vectors in the same "format" as the token embeddings, but no actual token is ever involved.
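A minimal sketch of that pattern (the names, dimensions, and encoder here are invented for illustration; this is the generic vision-language setup, not Gemini's actual architecture):

    # Toy sketch: text goes through an embedding lookup, other media
    # through a learned encoder plus a projection into the same vector
    # space. The transformer only ever sees same-shaped vectors.
    import torch
    import torch.nn as nn

    D_MODEL = 4096                     # hypothetical model width
    VOCAB = 32_000

    token_embed = nn.Embedding(VOCAB, D_MODEL)       # text path
    vision_encoder = nn.Sequential(                  # stand-in for a ViT-style encoder
        nn.Flatten(),
        nn.Linear(3 * 224 * 224, 1024),
    )
    project = nn.Linear(1024, D_MODEL)               # map image features into "token space"

    text_ids = torch.randint(0, VOCAB, (1, 16))
    image = torch.randn(1, 3, 224, 224)

    text_vecs = token_embed(text_ids)                         # (1, 16, 4096)
    image_vecs = project(vision_encoder(image)).unsqueeze(1)  # (1, 1, 4096)

    # One sequence of vectors; the image "tokens" were never tokens.
    sequence = torch.cat([image_vecs, text_vecs], dim=1)      # (1, 17, 4096)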

Most companies have rules for how many tokens a piece of media should "cost", but those rules usually aren't exact.
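As an illustration of how approximate that accounting is, here's a back-of-envelope estimator using the tiling rule from the older Gemini docs (small images counted as 258 tokens, larger ones cut into 768x768 tiles at 258 tokens each); Gemini 3 Pro's media_resolution settings change these numbers, so treat this as a sketch, not the real billing logic:

    # Rough image token estimate using the rule documented for
    # earlier Gemini models. Illustrative only; Gemini 3 Pro's
    # media_resolution settings use different numbers.
    import math

    TOKENS_PER_TILE = 258   # per the older Gemini docs
    TILE = 768

    def estimate_image_tokens(width: int, height: int) -> int:
        if max(width, height) <= 384:
            return TOKENS_PER_TILE  # small images count as one tile
        tiles = math.ceil(width / TILE) * math.ceil(height / TILE)
        return tiles * TOKENS_PER_TILE

    print(estimate_image_tokens(1920, 1080))  # 3 x 2 tiles -> 1548 tokens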


Gemini 3 Pro is not Nano Banana Pro, and the image generation model that decodes the generated image tokens may not be as robust.

The thinking step of Nano Banana Pro can refine some intermediate steps (e.g. the errors in the homework correction and where they sit spatially in the image), but it isn't perfect and can hit some of the typical pitfalls. It's a lot better than Nano Banana base, though.


As a consumer, I typed this into "Gemini". The behind-the-scenes model selection just adds confusion.

If "AI" trust is the big barrier for widespread adoption to these products, Alphabet soup isn't the solution (pun intended).


Nano Banana generates images.

This article is about understanding images.

Your task is unrelated to the article.


It works fine for me. https://imgur.com/a/MKNufm1

Nothing new, it's just highlighting practical vision use cases.

Gemini 3 Pro has been playing Pokemon Crystal (which is significantly harder than Red) in a race against Gemini 2.5 Pro: https://www.twitch.tv/gemini_plays_pokemon

Gemini 3 Pro has been making steady progress (12/16 badges) while Gemini 2.5 Pro is stuck (3/16 badges) despite using double the turns and tokens.


I think what would be interesting is if it could play the game with vision-only inputs. That would represent a massive leap in multimodal understanding.

That's more of an issue with Nano Banana Pro than with Gemini 3 Pro.

What's the difference? I thought the vision AI component of Gemini 3 is called Nano Banana?

That’s about generating images, the other side is about understanding images.

i assumed nano banana was just a tool that gemini 3 used, though i don't know

Gemini 3 Pro's text encoder powers Nano Banana Pro, but Nano Banana Pro has its own image decoding model that turns the generated image tokens into an actual image, and that appears to be the more pertinent issue in this case.

You're being reductive to the point that you're saying "LLMs are an algorithm like autocomplete/search engines, therefore they're the same."

That's not how it works. They're different approaches to how they handle the same inputs.


i would totally agree that they’re different approaches

i wouldn’t conclude “therefore they’re the same”. they’re clearly not the same

if it’s a different approach to search and scripting, does that not mean it is a kind of search and scripting?


dang (the head moderator of Hacker News) has said multiple times that HN prefers human-only comments.

> once most of its users realise that it offers them no actual practical advantages over Pandas

What? Speed and better nested data support (arrays/JSON) alone are extremely useful to every data scientist.

My productivity skyrocketed after switching from pandas to polars.
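For example, list columns are first-class in Polars, so you can aggregate inside them without exploding or reaching for Python lambdas (a minimal sketch; the data and column names are invented):

    # Nested data as native columns, queried with Polars expressions.
    import polars as pl

    df = pl.DataFrame({
        "user": ["a", "b"],
        "scores": [[1, 2, 3], [4, 5]],   # a native list column
    })

    # Aggregate inside each row's list directly
    out = df.with_columns(
        pl.col("scores").list.mean().alias("mean_score"),
        pl.col("scores").list.len().alias("n_scores"),
    )
    print(out)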


What can you do more easily in pandas than in polars?

For non-coding tasks, Gemini at least allows for easier grounding with Google Search.
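For reference, a minimal sketch of enabling Google Search grounding with the google-genai Python SDK (assumes GEMINI_API_KEY is set in the environment; the model id is illustrative):

    # Grounded generation: the model can issue Google Search queries
    # and cite the results instead of relying only on trained weights.
    from google import genai
    from google.genai import types

    client = genai.Client()  # reads GEMINI_API_KEY from the environment

    resp = client.models.generate_content(
        model="gemini-3-pro-preview",   # illustrative model id
        contents="Who won the most recent F1 race?",
        config=types.GenerateContentConfig(
            tools=[types.Tool(google_search=types.GoogleSearch())],
        ),
    )
    print(resp.text)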
