Hacker News

It's not wild. "Predict the next word" does not imply a bar on intelligence; a more intelligent prediction that incorporates more detail from the descriptions of the world that were in the training data will be a better prediction. People are drawing a straight line because the main advance to get to GPT-4 was throwing more compute at "predict the next word", and they conclude that adding another order of magnitude of compute might be all it takes to get to superhuman level. It's not "but what if we had a better algorithm", because the algorithm didn't change in the first place. Only the size of the model did.


> Predict the next word

Are there any papers testing how good humans are at predicting the next word?

I presume we humans fail badly:

1. as the variance in the input gets higher;

2. at regurgitating common texts (e.g. I couldn't complete a well-known poem from memory);

3. when the context gets more specific (the majority of people couldn't complete a JSON document).
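To make the task itself concrete, here is a minimal bigram-style next-word predictor in Python, trained on a toy corpus I made up (everything here is a hypothetical illustration, not how an LLM actually works):

```python
from collections import Counter, defaultdict

# Toy training corpus (hypothetical).
corpus = "the cat sat on the mat and the cat ran".split()

# Count which word follows each word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent continuation seen in training, or None."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

A person asked to "predict the next word" is implicitly doing something like this, except with vastly more context than one preceding word.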


The following blog post by an OpenAI employee suggests a comparison between patterns and transistors.

https://nonint.com/2023/06/10/the-it-in-ai-models-is-the-dat... The ultimate model, in the author's sense, would suss out all patterns, then patterns among those patterns, and so on, delivering both compute and compression efficiency.

To achieve compute and compression efficiency, an LLM has to cluster similar patterns together and deduplicate them. That in turn requires successive levels of pattern recognition (patterns among patterns among patterns, and so on), so that deduplication happens across the whole hierarchy as it is constructed. Full trees or hierarchies won't get deduplicated, but relevant regions of those trees will, which implies fusing them together in idea space. The root levels then hold the most abstract patterns. This representation also allows cross-pollination among different fields of study, further increasing effectiveness.
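The "deduplicate shared sub-patterns at every level" idea can be sketched with hash-consing, where identical nested structures are stored exactly once and larger patterns reuse the shared parts. The patterns below are hypothetical toy fragments of my own invention, not anything an actual model stores:

```python
# Table of every pattern ever interned, keyed by structural equality.
_interned = {}

def intern(pattern):
    """Recursively deduplicate a nested tuple pattern.

    Sub-patterns at any depth collapse to a single stored object,
    so "patterns among patterns" are each represented once.
    """
    if isinstance(pattern, tuple):
        pattern = tuple(intern(part) for part in pattern)
    # setdefault returns the previously stored copy if one exists.
    return _interned.setdefault(pattern, pattern)

a = intern(("np", ("det", "the"), ("n", "cat")))
b = intern(("np", ("det", "the"), ("n", "dog")))

# The shared ("det", "the") sub-pattern is literally the same object:
assert a[1] is b[1]
```

Here whole trees stay distinct while their common regions fuse, which is roughly the "relevant portions get deduplicated" point above.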

This reminds me of a point my electronics professor made about why making transistors smaller has many benefits and only a few disadvantages. Think of these patterns as transistors: the more deduplicated and closely packed they are, the more beneficial they are. Of course, this "packing together" happens in mathematical space.

Another thing that "patterns among patterns among patterns" reminds me of is homotopies. This video by PBS Infinite Series is brilliant. As far as I can see, LLMs do something like compressing homotopies, with "patterns" in place of "homotopies". https://www.youtube.com/watch?v=N7wNWQ4aTLQ


There are entire studies on this. I saw a lecture by an English professor who explained that the brain isn't fast enough to parse words in real time, so it runs multiple predictions of how the sentence will end in parallel, and at the end jettisons the wrong ones and keeps the correct one.

From this, we get comedy. A funny statement is one that ends in an unpredictable manner and surprises the listener's brain, because it doesn't already have that meaning calculated; hence it can take a while to "get the joke".
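The "parallel predictions, pruned at the end" idea can be sketched as a toy beam of sentence hypotheses. All the candidate endings and scores below are my own invented illustration, not from any study:

```python
# Hypothetical continuations the listener keeps "in flight", with prior
# probabilities; the funny ending is the one with the lowest prior.
candidates = [
    ("time flies like an arrow", 0.70),
    ("time flies like an eagle", 0.25),
    ("time flies like a banana", 0.05),  # the surprising, "funny" ending
]

def resolve(heard_sentence, beam):
    """Jettison every parallel prediction that doesn't match what was heard."""
    return [(s, p) for s, p in beam if s == heard_sentence]

# The joke lands on the ending with the lowest prior, which is roughly
# why "getting it" can take a moment longer.
print(resolve("time flies like a banana", candidates))
```

A surprise ending corresponds to the heard sentence matching only a low-probability member of the beam (or none at all).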



