
One thing I always find interesting, but not discussed all that much in the things I’ve read, is: what happens in the spaces between the data? Obviously this is an incredibly high-dimensional space that is only sparsely populated by the entirety of the English language, all tokens, etc. If the space is truly structured well enough, then there is a huge amount of interesting, implicit, almost platonic meaning occurring in the spaces between the data - synthetic? Dialectic? Idk. Anyways, I think those areas are a space in which algorithmic intelligence will be able to develop its own notions of semantics and creativity in expression. Things that might typically be ineffable may find easy expression somewhere in embedding space. Heidegger’s thisness might be easily located somewhere in a latent representation… this is probably some linguistics 101 stuff but it’s still fascinating imo.


My intuition is that the voids in an embedding space are concepts which have essentially no meaning, so you will never find text that embeds into those spaces, and therefore they are not reachable.

For example, take a syntactically plausible yet meaningless concept such as "the temperature of sorrowful liquid car parkings"[1]. That has nothing near it in embedding space, I'd be prepared to guess. When you embed any corpus of text, this phrase is going to drop into a big hole in the semantic space, because while it has components which carry some sort of meaning along each of your semantic dimensions, there isn't anything similar to the actual concept - there isn't any actual meaning there for something else to be similar to.
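
If you wanted to poke at that intuition, a minimal sketch (assuming the sentence-transformers library and its public all-MiniLM-L6-v2 model, both just illustrative choices) would be to embed the phrase alongside ordinary sentences that share its vocabulary and look at the cosine similarities:

    # Sketch: does the nonsense phrase sit far from everything that
    # shares its words? (illustrative model choice, not from the thread)
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    probe = "the temperature of sorrowful liquid car parkings"
    corpus = [
        "the temperature of the coffee in my cup",
        "a sorrowful farewell at the train station",
        "an empty car parking lot at night",
        "liquid spilled across the asphalt",
    ]

    probe_vec = model.encode(probe, convert_to_tensor=True)
    corpus_vecs = model.encode(corpus, convert_to_tensor=True)

    # If the intuition holds, no score is particularly high even though
    # the probe shares vocabulary with every corpus sentence.
    for sentence, score in zip(corpus, util.cos_sim(probe_vec, corpus_vecs)[0]):
        print(f"{score:.3f}  {sentence}")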

You need the spaces because there are so many possible different facets we are trying to capture when we talk about meaning but only a subset of those facets are applicable to the meaning of any one concept. So the dimensions in the embedding space are not independent or really orthogonal, and semantic concepts end up clustered in bunches with big gaps between them.

That's my intuition about it. When I get some time it's definitely something I want to study more.

[1] Off the top of my head but you can come up with an infinite number of similar examples


> the temperature of sorrowful liquid car parkings

This is quite a beautiful, strange (estranging?) clause - at least in the sense that we (or I) constantly struggle to find meaning and patterns in what might simply be plain noise (apophenic beauty?). It’s a similar form of intrigue that I, and I think others, often experience when reading the outputs of LLMs operating in the high-temperature regime, though of course we are just talking about embedding/embedding inversion here.

On a human level though, it makes me wonder why you picked that phrase. Did you roll dice in front of a dictionary? Play madlibs? Were they the first words that came to your mind? Or perhaps you went through several iterations to come up with the perfectly meaningless combination? Or perhaps you simply spilled your hot chocolate on your favorite pair of pants or dress while getting out of the car this morning (or perhaps as a child) and the memory has stuck with you… who knows! Only you!

In any case, my original point was simply that these interstitial points in embedding spaces can become ways of referring to or communicating ideas that we simply do not have the words for, but which are nonetheless potentially useful in a communication between two entities that both have the ability to come to some roughly shared understanding of what is being referred to or expressed by that point in the embedding space. Natural languages of course invent new words all the time, and yet the points those new words map to in the embedding space always existed (eh, not a great example because the shape of the embedding space might change as new words/tokens are introduced to the lexicon, but I think the idea holds). Perhaps new words or phrases will come about to bring some point back into textual space; or perhaps that point will remain solely in the shared lexicon of the algorithmic systems using the latent space to communicate ideas. Again, who knows!

For instance, consider the midpoint of a segment connecting two ideas, or the centroid of any simplex in the embedding space… if we assume that there is some sort of well-defined semantic structure in the space, is it necessarily the case that the centroid must refer to something which equally represents all of the nodes, a kind of lowest-common semantic denominator? Obviously if the semantic structure only holds over local regions but breaks down globally this is not the case, but if all the points are within a region of relatively sound semantic structure, that seems plausible. We know what happens when you do a latent space traversal for a VAE which generates images, and it can be quite beautiful and strange (or boring and familiar by 2024, depending on your perspective), but some similarly weird process might be possible with embedding space traversals, if only we could somehow decode those interpolating points phenomenologically, if not linguistically.
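
Even without a real decoder you can gesture at the midpoint question with a toy sketch (again assuming sentence-transformers and an illustrative model; the "decoding" here is just nearest neighbour over a hand-picked candidate list, not genuine embedding inversion):

    # Toy sketch: where does the midpoint between two ideas land?
    # (illustrative model and candidate list; not real embedding inversion)
    import torch
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    a = model.encode("a thunderstorm over the ocean", convert_to_tensor=True)
    b = model.encode("a quiet library reading room", convert_to_tensor=True)
    midpoint = (a + b) / 2  # the interpolating point between the two ideas

    candidates = [
        "rain drumming on a window while you read",
        "a crowded stadium during a goal",
        "waves crashing against the rocks",
        "silence before an exam begins",
    ]
    cand_vecs = model.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(midpoint, cand_vecs)[0]
    best = int(torch.argmax(scores))
    print(candidates[best], float(scores[best]))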

> concepts which have essentially no meaning

This is a pretty strange idea to try to wrap your head around.


> it makes me wonder why you picked that phrase

It took me a few goes to refine the idea. I started with the word sorrowful and thought "ok what could not possibly be sorrowful?" -> a car parking space.

Ok then what attributes could a car parking not have -> being liquid

Then once I had the idea, I wanted some other physical attribute this nonexistent thing might have, and that got me to temperature.

I agree with your idea that it's quite interesting to think about properties of concepts we are currently unable to communicate at all in our language. For example, if my intuition is correct, even if you have two concepts which are completely meaningless you would be able to discern similarity/difference between them conceptually, which leads to your centroid idea. If we look at those centroids, some might land in semantically meaningful places ("Who knew? The average of tennis and squash is badminton!") whereas some might end up in this void space, and that might be quite fascinating.
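
The word-vector version of that centroid check is easy to try; here's a sketch assuming gensim and its downloadable glove-wiki-gigaword-50 vectors (illustrative choices, and no promise badminton actually wins):

    # Sketch: which word sits nearest the centroid of "tennis" and "squash"?
    # (gensim's most_similar averages the positive vectors first)
    import gensim.downloader as api

    wv = api.load("glove-wiki-gigaword-50")  # illustrative pretrained vectors
    print(wv.most_similar(positive=["tennis", "squash"], topn=5))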

I've always thought[1] that creativity is essentially about making connections between concepts that had previously been thought to be unconnected, and therefore it seems to me that some (not all) of these void spaces have the potential to be joined into the mainstream semantic space over time, as people find ways to link these concepts to things we already have some meaning for. That's very interesting to me.

[1] After reading "The Act of Creation" by Koestler


> It took me a few goes to refine the idea. I started with the word sorrowful and thought "ok what could not possibly be sorrowful?" -> a car parking space. Ok then what attributes could a car parking not have -> being liquid. Then once I had the idea, I wanted some other physical attribute this nonexistent thing might have, and that got me to temperature.

Darn. I was really pulling for the hot cocoa theory.

Also, you clearly don’t live in New York City if you can’t fathom the idea of a parking space being associated with sorrow!


I strongly believe there's nothing there other than gibberish. Piping /dev/random to a word selector will probably enumerate everything inside that set. There's a reason we can translate between every language on earth: it's the same earth and reality. So there's a common set of concepts that gives us the foundational rules of languages, which is the data that you're speaking about.
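
For what it's worth, the "/dev/random into a word selector" experiment is a small sketch (assuming the common /usr/share/dict/words file; any word list would do):

    # Sketch: random words glued together - the sort of phrase that
    # should land in the voids (assumes /usr/share/dict/words exists)
    import random

    with open("/usr/share/dict/words") as f:
        words = [w.strip() for w in f if w.strip().isalpha()]

    rng = random.SystemRandom()  # backed by the OS entropy source
    print(" ".join(rng.choice(words) for _ in range(5)))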


I think a concrete application of what you're wondering about is: what is the most useful word that doesn't exist?


This sums up what I wrote above (as well as in a longer reply to a reply) much more elegantly and clearly than I ever could. Thank you!

Edit: but I might exchange the word useful for something else… maybe not…


Now this is a fun idea. If you think of embeddings as a sort of quantization of latent space, what would happen if you “turned off” that quantization? It would obviously make no sense to us, as we can only understand the output of vectors that map to tokens in languages we speak, but you could imagine a language model writing something in a sort of platonic, infinitely precise language that another model with the same latent space could then interpret.
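
One concrete handle on "turning off the quantization": most transformer implementations let you bypass the token lookup and feed continuous vectors directly. A minimal sketch, assuming the Hugging Face transformers library and the public gpt2 checkpoint (illustrative choices, not a real protocol):

    # Sketch: feed a model a continuous vector that no token maps to,
    # via inputs_embeds (illustrative checkpoint and "message")
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    emb = model.get_input_embeddings()

    # The "message" is the midpoint of two token embeddings - a point
    # in between the discrete vocabulary.
    id_a = tok.encode(" ocean")[0]
    id_b = tok.encode(" library")[0]
    message = (emb.weight[id_a] + emb.weight[id_b]) / 2

    with torch.no_grad():
        out = model(inputs_embeds=message.view(1, 1, -1))
    print(tok.decode(out.logits[0, -1].argmax()))  # how the model "reads" it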


Ya, I'm having my return-to-Plato moment. It really feels like we are the dēmiurgós right now with AI systems. The nature of interpolation vs extrapolation and the exploration of latent spaces will answer a lot of philosophical questions that we didn't expect to be answered so quickly, and by computers of all things.


That reminds me of the crazy output you get when raising the temperature and letting the model deviate from regular language. E.g. https://news.ycombinator.com/item?id=38779818
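
For anyone who hasn't played with it: the temperature knob is just a divisor on the logits before the softmax, so T > 1 flattens the next-token distribution and lets the long tail of unlikely tokens through. A tiny sketch with made-up numbers:

    # Sketch: how temperature reshapes a next-token distribution
    # (the logit values are made up for illustration)
    import torch

    logits = torch.tensor([4.0, 2.0, 0.5, -1.0])
    for T in (0.7, 1.0, 2.0, 5.0):
        probs = torch.softmax(logits / T, dim=-1)
        print(T, [round(p, 3) for p in probs.tolist()])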


The space is an uncountable set, at the limit. Mostly it’s noise. See: curse of dimensionality.


If I’m not mistaken, the coordinates in any given latent space (in this context) are countable, since there is a finite number of dimensions and each coordinate is stored at finite precision. You could even restrict attention to the region enveloped by the already-explored coordinates (e.g. English words) to get a finite set which can be fully enumerated.



