Thanks! I do have a section on this in the article "Why genetic algorithms aren't state of the art":
"Physics simulation involves discontinuities (contacts, friction regimes), long rollouts, and chaotic dynamics where small parameter changes lead to large outcome differences. Even with simulator internals, differentiating through thousands of unstable timesteps would yield noisy, high-variance gradients. Evolution is simpler and more robust for this regime."
"The real tradeoff is sample-efficient but complex (RL) vs compute hungry but simple (GA). DQN extracts learning signal from every timestep and assigns credit to individual actions."
Simply put, it's when your output (unembedding) matrix is the same matrix as your input embedding.
You save vocab_dim * model_dim params (e.g. ~617M for GPT-3: 50257 * 12288).
But because of the residual stream, the direct path from input embedding to output logits is roughly a matmul of the embedding matrix with its own transpose, which is symmetric: the score for token B following token A equals the score for A following B. Real bigram statistics are asymmetric, so that path struggles to encode them.
Attention + MLP add nonlinearity that can break the symmetry, but tying still means less expressivity.
Which is why tied embeddings aren't SOTA, but are useful in smaller models, where the embedding matrix is a bigger share of total params (rough sketch below).
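A minimal PyTorch sketch of what tying looks like and why the direct path is symmetric; the toy sizes and variable names are mine, not the GPT-3 config:

    import torch
    import torch.nn as nn

    # Toy sizes so this runs instantly; GPT-3 would be
    # vocab_dim=50257, model_dim=12288 (~617M params saved by tying).
    vocab_dim, model_dim = 1000, 64

    embed = nn.Embedding(vocab_dim, model_dim)
    lm_head = nn.Linear(model_dim, vocab_dim, bias=False)
    lm_head.weight = embed.weight  # weight tying: unembedding == embedding

    # The direct path through the residual stream (skipping attention/MLP)
    # scores next-token j after current token i via E @ E.T.
    E = embed.weight.detach()
    direct_path = E @ E.T                  # (vocab_dim, vocab_dim)

    # E @ E.T is symmetric: score(i -> j) == score(j -> i). Real bigram
    # stats are asymmetric ("New" -> "York" is common, "York" -> "New"
    # isn't), so this path alone can't express them.
    assert torch.allclose(direct_path, direct_path.T, atol=1e-5)

The attention and MLP paths can of course produce asymmetric scores, which is the nonlinearity point above.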
There isn't a specific place, it's the general aesthetic. Maybe you do sound like an LLM :P I guess it's not unlikely to pick up some mannerisms from them when everyone is using them.
I don't really mind whether an LLM was used; it's more that the style sounds very samey with everything else. Whether it's an LLM or not isn't really the point, I guess.
"Physics simulation involves discontinuities (contacts, friction regimes), long rollouts, and chaotic dynamics where small parameter changes lead to large outcome differences. Even with simulator internals, differentiating through thousands of unstable timesteps would yield noisy, high-variance gradients. Evolution is simpler and more robust for this regime." "The real tradeoff is sample-efficient but complex (RL) vs compute hungry but simple (GA). DQN extracts learning signal from every timestep and assigns credit to individual actions."
DQN likely would have handled this much better.