
You'll have to show me a single case of an LLM trained on data that excludes exams which can still answer exam questions.

You'll note that ChatGPT's performance drops off a cliff on many exams released after 2021. It's a trick.



An LLM trained only on exam papers wouldn't even be an LLM.

You said above:

> I do not recall the lesson wherein I sampled from thousands of prior exam papers to generate my answer to the one I sat.

No. You instead sampled from thousands of prior conversations, textbooks, storybooks, movies, songs... to form how you write, how you think, how you understand.

Whatever you studied specifically for that exam wasn't even the tip of the iceberg of what was required to write your answers. And what you studied was itself predicated on the efforts of others, in a way that makes your weeks of cramming infinitesimally small against the scale of effort that goes into writing a single answer on that test.


You will find no such things in the body or the brain.

this dumb 'statisticalism' is false -- just a product of the impulse for engineers to misexplain reality on the basis of their latest engineering successes

animals grow representations of their environments which are not summaries of cases --- they're sensory motor adaptations which enable coordination of the body and mind in every sense


You're responding to a comment no one wrote; the point wasn't about how those representations are formed, it was that you integrate many, many environments to do anything useful. If someone were kept isolated enough from birth, then regardless of innate intelligence they wouldn't be able to take that exam even if we gave them a year and made it open book: Genie took half a decade to reach a two-year-old's language skills.

But there's also a certain irony in your dig at engineers misexplaining reality while trying to paint the links between "sensory motor adaptations" and statisticalism as a response to recent engineering successes...

https://pubmed.ncbi.nlm.nih.gov/16807063/


None of the exams GPT-4 was evaluated on were present in its training data.

But honestly, if you don't get it now, I can't hope to convince you.


It's always sampling from an existing distribution of relevant data points -- that's necessarily how it's working.

If you want to claim the sample set is only mildly similar to exam questions -- so be it, that may be true. Or if you want to claim that its sampling method is attentive to structural associations in its sample set, so that it's not lifting from "identical distributions" -- so be it.

So long as those "structural associations" are givens, and the data "givens", the process is just sampling from a domain of human effort without expending any of a similar kind.
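
To be concrete about what "sampling" means mechanically here: at each step the model turns scores over its vocabulary into a probability distribution and draws one token from it. A rough sketch in Python (the logits are made-up numbers for a toy vocabulary, not any particular model's internals):

    import numpy as np

    def sample_next_token(logits, temperature=1.0, rng=None):
        # Temperature-scaled softmax: turn raw scores into a probability
        # distribution over the vocabulary, then draw one token index.
        rng = rng or np.random.default_rng()
        z = np.asarray(logits, dtype=float) / temperature
        z -= z.max()  # subtract max for numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        return rng.choice(len(probs), p=probs)

    # Made-up scores over a toy 5-token vocabulary:
    print(sample_next_token([2.0, 1.0, 0.5, -1.0, -3.0], temperature=0.7))

Whatever that distribution encodes, the generation step itself is a draw from it.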

If there had been no internet, ChatGPT would be a dumb mute -- because it has no capacity to generate data; it does not develop actual conceptualisations of the world -- it samples from the data shadows created by people.

To produce useful data requires expending tremendous effort -- growing an animal to cope with the world. It is this which is being laundered, unpaid and unacknowledged, through LLMs.

Whilst star-trek-huffing loons claim this stuff is doing the opposite -- an ideological delusion which benefits all those whose bank accounts are increased by the lie that "ChatGPT wrote this".

If we were prepared to price the data commons which has been created over the last 20-30 years of the internet, by everyone, it's not hard to think training ChatGPT would cost a trillion.

How much labour went into creating that digital resource, and by how many, etc.?


> It's always sampling from an existing distribution of relevant data points -- that's necessarily how it's working.

I work in this field, and 'sampling from an existing distribution of relevant data points' is just wrong; you have no way to say that is 'necessarily how it's working' a priori in a world where implicit regularization exists.
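
To illustrate what I mean by implicit regularization (a standard toy example, nothing LLM-specific): plain gradient descent on an underdetermined least-squares problem, started from zero, converges to the minimum-norm interpolant even though nothing in the loss asks for a small norm -- the training procedure itself injects a preference the data alone doesn't contain:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 20))  # 5 equations, 20 unknowns: infinitely many exact solutions
    b = rng.normal(size=5)

    w = np.zeros(20)  # zero init keeps iterates in the row space of A
    for _ in range(20000):
        w -= 0.01 * A.T @ (A @ w - b)  # gradient step on 0.5 * ||A w - b||^2

    w_min_norm = np.linalg.pinv(A) @ b  # closed-form minimum-norm solution
    print(np.linalg.norm(w - w_min_norm))  # ~0: GD picked the min-norm solution

The point being: what a trained model ends up doing depends on the optimizer's biases, not just on the empirical distribution it was fit to.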

Not going to engage with the labor-theory-of-value bit, because I think it's not particularly relevant to the disagreement I raised and not one with a 'right' answer.


lol, i am not making an argument premised on the labour theory of value -- chatgpt is a proof against this theory: $20/mo for the labour of two generations

it is the prerogative of states to make redress when labour is severely underpriced due to the falsity of this theory of value -- and had they, chatgpt would be exposed for what it is

regularisation is such a horrifyingly revealing term

reality isn't a regularisation of its measures -- the meaning of words is not a regularisation of their structure

such obscene statistical terminology should be obviously disqualifying here

our knowledge of the world isn't a statistical regularisation of associations

that very framing exposes how deficient this line is

animals grow representations --- they do not regularise text token patterns

the latter is possible /only because/ of the former



