
I contend that most human knowledge is not written down or if it is written down it’s not publicly available on the internet and so does not exist in these datasets.

There’s so much subtle knowledge like the way a mother learns to calm her child or the way a carpenter learns to work different kinds of wood which may be written down in part, but may also be learned through lived experience or transferred from human to human such that little of it gets written down and posted online.



That's where humans suck. The classic "you're not doing it right", followed by quickly showing how to do it without verbalizing any info about the learning process, pitfalls, failure modes, etc., as if just being shown had been enough for them to learn it themselves. Most people don't do that; there's not even a sign of reflection.

My worst case was with a guy who asked me to write an arbitrage betting bot. When I asked how to calculate the coefficients, he pointed at two values and said "look, there's <x>, there's <y>..." (thinks for a minute) "...then it's <z>!". When I asked how exactly he had calculated it, he simply repeated the same thing with different numbers.
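
For reference, the two-outcome arbitrage math he was presumably gesturing at looks roughly like the sketch below. This is the textbook formula, not whatever calculation he actually had in mind, and the odds values are made up:

    import random  # not needed here, just stdlib-only example

    # Minimal sketch of the usual two-outcome arbitrage check.
    # Illustrative only: the textbook formula, not the guy's actual method.
    def arbitrage(odds_a: float, odds_b: float, bankroll: float = 100.0):
        """Return (stake_a, stake_b, profit) if implied probabilities sum to < 1, else None."""
        margin = 1 / odds_a + 1 / odds_b          # total implied probability
        if margin >= 1:
            return None                           # no arbitrage opportunity
        stake_a = bankroll * (1 / odds_a) / margin
        stake_b = bankroll * (1 / odds_b) / margin
        profit = bankroll * (1 / margin - 1)      # same payout whichever side wins
        return stake_a, stake_b, profit

    # Two bookmakers quoting opposite sides of the same event (made-up numbers):
    print(arbitrage(2.10, 2.05))                  # stakes summing to 100 and a small guaranteed profit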


People often don't know how to verbalize these things in the first place. Some of these topics are very complex, but our intuition gets us halfway there.

Once upon a time I was good at a video game. Everyone realized that positioning is extremely important in this game.

I have good positioning in that game and was asked many times to make a guide about positioning. I never did, because I don't really know how. There is too much information that you need to convey to cover all the various situations.

I think you would first have to come up with a framework on positioning to be able to really teach this to someone else. Some kind of base truths/patterns that you can then use to convey the meaning. I believe the same thing applies to a lot of these processes that aren't verbalized.


Often, for this kind of problem, writing down a closed-form solution is simply intractable. However, it's often still possible to express a cost function for at least a big portion of what goes into a human-optimal solution. From there you can sample your space, do gradient descent, or whatever, to find some acceptable solution that has a more human-intuitive property.
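
As a toy illustration of that approach (a sketch only: the "positioning" cost terms, the 0.5 weight, and the use of random sampling in place of gradient descent are all invented for the example):

    import random

    # Toy sketch: encode a bit of positioning intuition as a cost function,
    # then sample the space to minimize it. The terms and weight are made up;
    # a real setup would tune these against what good players actually do.
    def cost(pos, threat, objective):
        """Lower is better: stay close to the objective while keeping distance from the threat."""
        def dist(a, b):
            return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
        return dist(pos, objective) - 0.5 * dist(pos, threat)

    def best_position(threat, objective, samples=10_000):
        candidates = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(samples)]
        return min(candidates, key=lambda p: cost(p, threat, objective))

    print(best_position(threat=(20, 20), objective=(70, 80)))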


It's not necessarily that it's intractable - just that a thing can be very hard to describe, under some circumstances.

Imagine someone learning English has written "The experiment reached it's conclusion" and you have to correct their grammar. Almost any English speaker can correct "it's" to "its", but unless they (and the person they're correcting) know a bunch of terms like 'noun' and 'pronoun' and 'possessive', they'll have a very hard time explaining why.


They may not even know why, and that may be okay -- they speak it somehow, right? In this case, the language is both a set of rules and a systematization of a pre-existing phenomenon. There are plenty of ephemeral, hard-to-explain concepts, but most humans just aren't used to explaining them even to themselves.

For example, I never learned English anywhere; I know it from .txt and .ng documents and a couple of dictionaries I had back in the DOS days. I'm an uneducated text-native, basically. But here's what I'd say to that newbie:

- Usually we use "...'s" for attribution, as in "human's cat = cat of a human". But "it" and other special words like "that", "there", etc. are an exception. We write "it's" as short for "it is", sometimes "it has". But we write "its", "theirs" for attribution, as in "its paw" = "paw of it" ~~ "cat's paw" = "paw of a cat". There's more to this, but you can ignore it for now.


> When I asked how exactly did he calculate it, he simply repeated with different numbers.

Now you know how an LLM feels during training!


Probably during inference, as well.


I wouldn't say this is where humans suck. On the contrary, this is how we find that human language is such a fantastic tool for serializing and deserializing human mental processes.

Language is so good that an artificial language tool, without any understanding of these mental processes, can appear semi-intelligent to us.

A few people unable to do this serialization doesn't mean much on the larger scale. Just that their ideas and mental processes will be forgotten.


I agree, for sure. However, as the storage of information evolves, it’s becoming more efficient over time.

From oral tradition to tablets to scrolls to books to mass-produced books to digital and now these LLMs, I think it’s still a good idea to preserve what we have the best we can. Not as a replacement, but as a hedge against a potential Library of Alexandria incident.

I could imagine a time in the near future where the models are domain-specific, and just like there are trusted encyclopedia publishers there are trusted model publishers that guarantee a certain level of accuracy.

It’s not like reading a book, but I for sure had an easier time learning golang by talking with ChatGPT than from a book.


> a hedge against a potential library of Alexandria incident

What would cause a Library of Alexandria incident wiping out all human knowledge elsewhere, that would also allow you to run a local LLM?


To run a local LLM you need the device it currently runs on and electricity. There are actually quite a lot of ways to generate electricity, but to name one, a diesel generator that can run on vegetable oil.

What you're really asking is: what could cause a modern Library of Alexandria incident? The fact is we keep the only copy of too many things on the servers of the major cloud providers. Those are intended to have their own internal redundancy, but that doesn't protect you against a targeted attack, or against a systemic failure where all the copies are under the same roof and you lose every redundant copy at once from a single mistake replicated in a monoculture.


More serious doomsday prepping would call for a heavy lead-lined Faraday cage to store the storage media in, in case of an EMP or major solar flare.

Or, more sci-fi, some hyper computer virus that ends up infecting all internet-connected devices.

Not too far-fetched: if we can conceive of an AI-enabled worm that mutates depending on the target, I could imagine a model of that sort being feasible within the next 5-10 years.


I think you underestimate the amount of information contained in books and the extent to which our society (as a whole) depends on them.


Society depends much more on social networks, mentorship, and tacit knowledge than on books. It's easy to test this: just run the thought experiment by a few people. If you could have only one, would you take an Ivy League degree without the education, or the education without the degree?

Venture capital in tech is a good example of this. The book knowledge is effectively globally distributed and almost free, yet success happens in a few geographically concentrated counties.


By "book", I mean written in any form: study papers, blogs, theses, books, etc. I don't understand your comparison.

Same for your example: there's no logical link between the effect and the consequences.


> I contend that most human knowledge is not written down

Yes - the available training data is essentially a combination of declarative knowledge (facts, including human-generated artifacts) and procedural knowledge (how to do things). What is missing is the learning process of taking a description of how to do something and trying to apply it yourself in a specific situation.

No amount of reading books, or reading other people's blogs on how they did something, can avoid the need for hands-on experience if you want to learn how to do it yourself.

It's not just a matter of information that might be missing or unclear in instructional material, such as how to cope with every type of failure and unexpected outcome; it's crucially about how to do this yourself - if you are to be the actor, then it's the predictive process in your own mind that matters.

Partly for this reason, and partly because current AIs (transformer-based LLMs) don't support online learning (try-and-fail skill acquisition), I think we're going to see two distinct phases of AI.

1) The current "GenAI" phase, where AI can only produce mash-ups of things it saw in its pre-training data, augmented by similar "book learning" provided in-context, which can be utilized via in-context learning. I'd characterize what this type of AI is useful for, and capable of, as "automation": applying that book (incl. anecdotal) knowledge to new situations where a mash-up is all you need.

2) The second phase is where we have something closer to AGI, even if still below human level, which is no longer just a pre-trained transformer, but also has online learning and is agentic - taking actions predicated on innate traits like curiosity and boredom, so that given the book knowledge it can (& will!) then learn to apply that by experimentation/practice and learning from its own mistakes.

There will no doubt be advances beyond this "phase two" as well, but it seems we're likely to be stuck at "phase one" for a while (even as models become much better at phase one capabilities), until architectures fundamentally advance beyond transformers to allow this type of on-the-job training and skill acquisition.


It's not even "human knowledge" that can't be written down - it seems all vertebrates understand causality, quantity (in the sense of intuitively understanding what numbers are), and object permanence. Good luck writing those concepts down in a way that GPT can use!

In general AI in 2024 is not even close to understanding these ideas, nor does any AI developer have a clue how to build an AI with this understanding. The best we can do is imitating object permanence for a small subset of perceptible objects, a limitation not found in dogs or spiders.


I'd contend that those are skills (gained through experience) rather than knowledge (gained through rote learning).


I think it’s worth expanding your definition of knowledge.


Yes, but it contains enough hints to help someone find their way on these types of tasks.


Wait till all the videos ever created are tokenized and ingested into a training dataset. Carpentry techniques are certainly there. The subtleties of parenting may be harder to derive from that, but maybe lots of little snippets of people’s lives will add up to a general understanding of parenting. There have certainly been bigger surprises in the field.


What about smells or tastes? Or feelings?

I can't help but feel we're at the "aliens watch people eat from space and recreate chemically identical food that has no taste" phase of AI development.


If the food is chemically identical, then the taste would be the same, since taste (and smell) is about chemistry. I do get what you're saying, though.



An interesting thought experiment, but there's a flaw in it, an implicit fallacy that's probably a straw man. On its own, the argument would likely stand that Mary gains new knowledge on actually being exposed to color.

However, there is a broader context: this is supporting an argument against physicalism, and in that light it falls apart. There are a couple of missing bits required to complete the experiment in this context. One is the understanding that knowledge comes in two varieties: direct (actual experience) and indirect (description by someone with the actual experience, using shared language). This understanding brings proper clarity to the original argument, as we are aware - I think - that language is used to create compressed representations of things; something like a perceptual hash function.

The other key bit, which I guess we've only considered and extensively explored after the argument was formulated, is that all information coming in via the senses goes to the brain as electrical signals. And we actually have experimental data showing that sensory information can be emulated using machines. Thus, the original argument, to be relevant to the context, should be completed by giving Mary access to a machine that she can program to emulate the electrical signals that represent color experience.

I posit that without access to that hypothetical machine, given the context of the experiment, it cannot be said that Mary has "learned everything there is to learn about color". And once she has comprehensively and correctly utilized said machine on herself, she will gain no new knowledge when she is exposed to the world of color. Therefore this experiment cannot be used as an argument against physicalism as originally intended.


Personally, I don't think these complicated thought experiments based on subjective experience enlighten us at all.


OK. I enjoyed the mental exercise though, thanks for that. Also, as someone who's formally studied philosophy, I'd say there is definitely value in thought experiments, particularly as I think we got to an objective level in this case, even though we started with the subjective. And determining universal (objective) rules is valuable, as they usually help guide us to truth and/or point to ideals to strive for.


> If the food is chemically identical…

If it were 99.9% chemically identical but they left out the salt and spices…


I'd say that, when it comes to chemistry, only 100% reproduction can be considered identical. Anything less is to be deemed similar to some degree.

And so without the correct amount of salt and/or spices, we're talking about food that's very similar, and not identical.


Their perception is very likely to be totally different.

* They might not perceive some substances at all, while others that we don't notice might make it unpalatable.

* Some substances might be perceived differently than us, or be indistinguishable from others.

* And some might require getting used to.

Note that all of the above phenomena also occur in humans because of genetics, cultural background, or experiences!


This may come off as pedantic, but "identical" is a very strong term when it comes to something like chemistry. The smallest chemical difference can manifest as a large physical difference. Consider that, genetically, humans are about 60% similar to the fruit fly, yet phenotypically the similarity could be considered under 1%.


Well, I have synesthetic smell/color senses, so I don’t even know what other humans experience, nor they me. But, I have described it in detail to many people and they seem to get the idea, and can even predict how certain smells will “look” to me. All that took was using words to describe things.


> All that took was using words to describe things.

All that took was words and a shared experience of smelling.


How rude, what do our bathing habits have to do with this? ;-)

But, fair point. The gist I was trying to get across is that I don't even know what a plant smells like to you, and you don't know what a plant smells like to me. Those aren't comparable with any objective data. We make guesses, and we try to get close with our descriptions, which are in words. That's the best we can do, and we do share our senses. Asking more from computers seems overly picky to me.


I think we can safely say that any taste, smell, sensation or emotion of any importance has been described 1000 times over in the text corpus of GPT. Even though it is fragmented, by sheer volume there is enough signal in the training set, otherwise it would not be able to generate coherent text. In this case I think the map (language) is asymptotically close to the territory (sensations & experience in general).


What makes you think they aren't already?



