I updated the results with just the Devstral part, but I ran the full suite for it and posted all the results files as well as a script to re-run the process.
The results are more spectacular...
The model scored much better on gsm8k but lost a bit in the other categories.
Fair point on the writing style, I used Claude extensively on this project, including drafting. The experiments and ideas are mine though.
On the prior art: you're right that layer duplication has been explored before. What I think is new here is the systematic sweep toolkit + validation on standard benchmarks (lm-eval BBH, GSM8K, MBPP) showing exactly which 3 layers matter for which model. The Devstral logical deduction result (0.22→0.76) was a surprise to me.
If there are ComfyUI nodes that do this for image models, I'd love links. The "cognitive modes" finding (different duplication patterns that lead to different capability profiles from the same weights) might be even more interesting for diffusion models.
I tried out the one I linked with sd1.5 today, moved the sliders around like a total noob and got pretty bad results, but I found no way to "replay" any of the layers like the one you linked, so thanks for the link. Must take a lot of trial and error haha. I'll check it out, assuming it works for the anima preview 2 too.
I explored that, again with Devstral, but running the same circuit 4 times led to lower scores on the tests.
I chatted with the model to see if it was still working, and it seemed coherent to me; I didn't notice anything off.
I need to automate testing like that, where you pick the local maximum and then iterate from there, picking layers to see if it's actually better, and then leave the thing running overnight.
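To make the "pick the best and iterate around it" idea concrete, here is a minimal hill-climbing sketch. It is not my actual toolkit: the `(start_layer, block_length)` encoding, the `neighbors` helper and the `score` callback are all hypothetical stand-ins, and `score` would be whatever fast test harness you trust.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical encoding of a duplicated block: (start_layer, block_length).
Config = Tuple[int, int]


def neighbors(cfg: Config, n_layers: int) -> Iterable[Config]:
    """Yield valid configs that differ by one step in start or length."""
    start, length = cfg
    for d_start, d_len in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        s, l = start + d_start, length + d_len
        if l >= 1 and s >= 0 and s + l <= n_layers:
            yield (s, l)


def hill_climb(
    start: Config,
    n_layers: int,
    score: Callable[[Config], float],
    max_steps: int = 100,
) -> Tuple[Config, float]:
    """Greedy local search: move to the best neighbor until nothing improves."""
    cache: Dict[Config, float] = {}

    def cached(cfg: Config) -> float:
        # Scoring a config is expensive (a full harness run), so never repeat it.
        if cfg not in cache:
            cache[cfg] = score(cfg)
        return cache[cfg]

    best = start
    for _ in range(max_steps):
        candidates = list(neighbors(best, n_layers))
        if not candidates:
            break
        cand = max(candidates, key=cached)
        if cached(cand) <= cached(best):
            break  # no neighbor is better: this is a local maximum
        best = cand
    return best, cached(best)
```

This only finds the nearest local maximum, so for an overnight run it would make sense to restart it from several different starting configs and keep the best result.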
Can Karpathy's autoresearch be used on this to explore what works and what does not? That is supposed to automate research like this from what I understand.
The other interesting point is that right now I'm copy-pasting the layers, but a patch in llama.cpp could make the same model behave better simply by following a different "flow", without needing more VRAM...
If this is validated enough, it could eventually lead to shipping some kind of "mix" architecture, with layers executed to fit some "vibe"?
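A toy sketch of what I mean by a different "flow": instead of copying weights, keep one set of layers and only change the order in which they are executed. This is plain Python with stand-in layers, not llama.cpp code; `build_order` and `run` are hypothetical names. One assumption worth flagging: in a real runtime each repeated pass would presumably still need its own KV cache entries, so only the weight memory is shared.

```python
from typing import Callable, List, Sequence

# Stand-in for a transformer block: any function from state to state.
Layer = Callable[[float], float]


def build_order(n_layers: int, start: int, end: int, repeats: int = 2) -> List[int]:
    """Layer indices to execute, with layers [start, end) run `repeats` times in a row."""
    if not (0 <= start < end <= n_layers) or repeats < 1:
        raise ValueError("invalid block or repeat count")
    before = list(range(start))
    block = list(range(start, end)) * repeats
    after = list(range(end, n_layers))
    return before + block + after


def run(layers: Sequence[Layer], order: Sequence[int], x: float) -> float:
    """Apply the layers in the given order; repeated indices reuse the same object."""
    for i in order:
        x = layers[i](x)
    return x
```

For example, `build_order(6, 2, 5)` runs layers 2, 3 and 4 twice while `layers` itself still holds only six objects.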
Devstral was the first one I tried, and I optimized for math/EQ, but that didn't result in a better model. Then I added the reason part, and that resulted in a "better" model.
I used Devstral with the vibe.cli and it looked sharp to me; it didn't fail. I also used the chat to "vibe" check it and it looked OK to me.
The other thing is that I picked a particular circuit and it was "good", but I don't know if it was a local maximum. I think I ran only about 10 sets of the "fast test harness" and picked the config with the highest score. Once I have that, I use that model and run it against lm-eval limited to only 50 tests, again for the sake of speed: I didn't want to wait a week to discover the config was bad.
I'm using the following configuration:
--tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects,mbpp
I also tried humaneval, but something in the harness is missing and it failed...
Note that I'm running 50 tests for each task, mostly because of time limitations, as it takes about two hours to validate the run for the base model and the modified one.
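Since 50 tests per task is a small sample, it helps to put a rough error bar on each score. The sketch below uses the plain binomial standard error and treats the two runs as independent, which is an assumption (both models see the same questions, so a paired test would be more accurate). The function names are mine, not part of lm-eval.

```python
import math


def accuracy_stderr(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p measured on n samples."""
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return math.sqrt(p * (1.0 - p) / n)


def diff_z(p_base: float, p_new: float, n: int) -> float:
    """Rough z-score for the gap between two accuracies on n samples each.

    Assumes independent runs; treat the result as a sanity check only.
    """
    se = math.sqrt(accuracy_stderr(p_base, n) ** 2 + accuracy_stderr(p_new, n) ** 2)
    if se == 0.0:
        raise ValueError("both accuracies are exactly 0 or 1; z-score undefined")
    return (p_new - p_base) / se
```

With n = 50 the standard error near 50% accuracy is about 0.07, so a difference of a few points between the base and modified model is within noise, while a jump like 0.22 to 0.76 (if it was measured at this sample size) is far outside it.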
I'll also try to publish the results of the small test harness from when I'm testing the multiple layer configurations. For reference, this is phi-4-Q6_K.gguf, still running. I'm now giving more weight to the reason factor, which comes from running a small subset of all the problems in the task config above.
Initially I tried going for the highest math/EQ score, but it resulted in models that were less capable overall, with the exception of math. And math, as in the original research, is basically how good the model was at giving you the answer to a really tough question, say the cube root of some really large number... but that didn't translate to the model being better at other tasks...
Apples and oranges. Intel's mobile Skylake processors don't support more than 16GB of low-power RAM; Apple didn't have anything to do with that limitation.
I upvoted your response as it's correct, but for anyone who doesn't want to believe it, here's a fully referenced post I made a while back with specific references to the Intel documentation:
Now I know why both the new Surface Book and the MacBook Pro have such paltry RAM configurations. Sad times when Intel itself seems to be falling behind.
Intel's mobile processors like the ones in the MacBook Pro support up to 64 gigs of RAM. Their ultra-low-power CPUs only support 16 gigs, but the line in question goes up to 64, even with DDR3L RAM.
To me it seems logical:
As we get older, more and more energy would be used on preservation (read it as fixing damage and less efficient processes as a result of age), therefore shrinking/eliminating everything not being used is necessary.
Edit: TL;DR: From the PDF conclusions:
it is found that working hours up to 25–30 hours per week have a positive impact on cognition for males depending on the measure and up to 22–27 hours for females. After that, working hours have a negative impact on cognitive functioning.
Since the paper only looked at people above 40, I don't see how you can make any such conclusion. There is nothing about younger people. They may do better - they may also be the same or worse. They were not even included.
Nice catch. There are two parts to supporting my theory; the first was included in my post:
As we get older, more and more energy would be used on preservation (read it as fixing damage and less efficient processes as a result of age), therefore shrinking/eliminating everything not being used is necessary.
The second is an entry on how our bodies are machines oriented to try to avoid wasting energy... or better said preserving it... for that part my canonical reference would be the Algernon argument:
That is just your opinion - and it isn't even clear that it means anything. I don't see any supporting evidence. I'm not saying you are wrong (wrong about what, anyway? It's so vague and empty), I'm saying it's just some "statement", nothing more. Even so, your list suffers from some severe selection bias: you chose exactly what supports your idea. What about the greater "wisdom" of older people? Less desire to succeed at all costs, i.e. possibly more relaxed and willing to look at the big picture? Those two are just "statements", "ideas", just like yours :)
> The second is an entry on how our bodies are machines oriented to try to avoid wasting energy
Without even going into details about that sentence, that is a statement without a point. What exactly do you want to use it for? To show what? How?
The 1st part "our bodies are machines" is as trite a statement as it gets, pardon me for pointing this out.
The 2nd part "oriented to try to avoid wasting energy" is just as bad if not worse - if the main focus of our bodies was just that, suicide and eternal sleep would be the best option to achieve that goal.
> As we get older more and more energy would be used on preservation...
> ...therefore shrinking/eliminating everything not being used it's necessary.
What is that even supposed to mean? Either part. Nor does it seem right (having taken medical courses such as physiology) - citation needed (after defining what you actually mean) for part 1; part 2 is completely unclear, I'm sorry to say. What shrinks? What is eliminated?
> And he provided argumentation why this might be the case.
You are either a troll - and a bad one - or a troll. Posting a random link to something isn't "evidence". Not to mention that he didn't say anything, he just wrote "words". Impressive you are impressed.
> No, it's not his "opinion", it is his argument. And he provided argumentation
> why this might be the case.
So if I argue it's no longer subjective? I think you have the wrong idea about subjective/objective.
>Posting a random link to something isn't "evidence". Not to mention that he didn't say anything, he just wrote "words". Impressive you are impressed.
I didn't say that the parent gave evidence. I said he gave an argument -- you know, premises and logical steps that can be followed (or refuted) to determine if something is true or not.
>So if I argue it's no longer subjective?
No, if someone puts forward an argument, it's by definition not subjective. An argument is something that can be evaluated.
Maybe you conflate arguments with opinions?
Of course an argument might be based on a subjective selection of premises, but that's beside the point. One can always refute the argument by pointing to issues in either its logic or its premises.