
Why would that be annoying? It’s much easier to understand, predict and truncate appropriately than having to explain all of these different tokenization schemes to devs.


Yeah, everybody agrees on what a character is, right? It's just {an ASCII byte|a UTF-8 code unit|a UTF-16 code unit|a Unicode code point|a Unicode grapheme}.
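
To make that concrete, here is a rough sketch of how three of those definitions disagree on the same string. The helper name `char_counts` is made up for this example, and grapheme counting is left out because it needs Unicode segmentation rules that this sketch does not implement.

```python
def char_counts(text: str) -> dict[str, int]:
    """Count the 'characters' in text under three different definitions."""
    return {
        # One UTF-8 code unit is one byte.
        "utf8_code_units": len(text.encode("utf-8")),
        # One UTF-16 code unit is two bytes; the -le codec emits no BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python's len() on a str counts Unicode code points.
        "code_points": len(text),
    }


print(char_counts("abc"))         # all three counts agree: 3
print(char_counts("e\u0301"))     # "e" + combining acute: 3, 2, 2
print(char_counts("\U0001F44D"))  # thumbs-up emoji: 4, 2, 1
```

A reader would see a single character in both of the last two strings, yet none of the three counts agree with each other across them.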


And we think tokens solve that problem? Spoiler alert: they don’t.

https://www.reddit.com/r/OpenAI/comments/124v2oi/hindi_8_tim...


They don't, but Google could have been more precise about which of the definitions listed by GP they mean by "character".


I’m not saying it’s easy, but it’s much better than tokens IMO. I think bytes would be understandable too.


Bytes are understandable but make no sense from a business point of view. If you submit the same simple ASCII query in UTF-8 and in UTF-32, the latter will cost 4x as much.
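
A quick sketch of the arithmetic behind that, assuming billing were purely per byte. The helper name `encoded_sizes` is invented for this example; the 4x figure holds for ASCII text, and the gap narrows for scripts that already take several bytes per code point in UTF-8.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-32 byte count) for the same text."""
    utf8_bytes = len(text.encode("utf-8"))
    # The -le codec is used so no 4-byte BOM is added to the count.
    utf32_bytes = len(text.encode("utf-32-le"))
    return utf8_bytes, utf32_bytes


# ASCII: 1 byte per character in UTF-8, 4 bytes in UTF-32.
print(encoded_sizes("hello"))  # (5, 20)

# Hindi "namaste", 6 code points: 3 bytes each in UTF-8, 4 bytes each in UTF-32.
print(encoded_sizes("\u0928\u092e\u0938\u094d\u0924\u0947"))  # (18, 24)
```

So a per-byte price would charge the same query differently depending only on the encoding the client happened to pick.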


No API accepts input in UTF-32. Nobody uses this on the internet.


At least there are standards for characters. Nothing like that for tokens.



