Why would that be annoying? It’s much easier to understand, predict and truncate...

rcoveson · on May 10, 2023

Yeah, everybody agrees on what a character is, right? It's just {an ASCII byte|a UTF8 code unit|a UTF16 code unit|a Unicode code point|a Unicode grapheme}.
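
To make the disagreement concrete, here is a minimal Python sketch that counts one short string under three of those definitions. The helper name `char_counts` and the sample string are illustrative assumptions, not anything from the thread. Grapheme counting is left out because the Python standard library has no grapheme segmenter; under Unicode's default segmentation rules the sample would count as 2 graphemes.

```python
def char_counts(text: str) -> dict[str, int]:
    """Count one string under three different notions of 'character'."""
    return {
        # UTF-8 code units are single bytes.
        "utf8_code_units": len(text.encode("utf-8")),
        # UTF-16 code units are 2 bytes each; "-le" avoids a leading BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
    }


# 'e' + combining acute accent (U+0301) + grinning-face emoji (U+1F600)
sample = "e\u0301\U0001F600"
counts = char_counts(sample)
print(counts)
# {'utf8_code_units': 7, 'utf16_code_units': 4, 'code_points': 3}
```

The same input is 7, 4, 3, or 2 "characters" depending on which definition the billing page has in mind.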

sheepscreek · on May 10, 2023

And we think tokens solve that problem? Spoiler alert: they don’t.

https://www.reddit.com/r/OpenAI/comments/124v2oi/hindi_8_tim...

est31 · on May 11, 2023

They don't, but Google could have been more precise about which of the definitions listed by GP it means by "character".

ntonozzi · on May 10, 2023

I’m not saying it’s easy, but it’s much better than tokens IMO. I think bytes would be understandable too.

criddell · on May 10, 2023

Bytes are understandable but make no sense from a business point of view. If you submit the same simple ASCII query encoded as UTF-8 and as UTF-32, the latter will cost 4x as much.
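
A quick way to check the arithmetic, as a sketch only: the helper `encoded_size` and both sample queries are hypothetical, and "cost" is assumed to be proportional to byte count. The 4x ratio holds for ASCII text; for text whose code points already take 3 bytes in UTF-8 (Devanagari here), the gap is smaller.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


ascii_query = "hello world"
# "namaste" in Devanagari, written as escapes: 6 code points in the U+09xx range
hindi_query = "\u0928\u092e\u0938\u094d\u0924\u0947"

# "-le" variants are used so no byte-order mark is added to the count.
ascii_utf8 = encoded_size(ascii_query, "utf-8")       # 11 bytes
ascii_utf32 = encoded_size(ascii_query, "utf-32-le")  # 44 bytes
hindi_utf8 = encoded_size(hindi_query, "utf-8")       # 18 bytes
hindi_utf32 = encoded_size(hindi_query, "utf-32-le")  # 24 bytes

print(ascii_utf32 / ascii_utf8)  # 4.0
print(hindi_utf32 / hindi_utf8)  # roughly 1.33
```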

xyzzyz · on May 10, 2023

No API accepts input in UTF-32. Nobody uses this on the internet.

geysersam · on May 10, 2023

At least there are standards for characters. Nothing like that for tokens.