
I updated the results with just the Devstral part, but ran the full suite for it, and posted all the result files as well as a script to re-run the process.

The results are more spectacular...

The model scored way better on gsm8k, but lost a bit in the other categories.


Fair point on the writing style, I used Claude extensively on this project, including drafting. The experiments and ideas are mine though.

On the prior art: you're right that layer duplication has been explored before. What I think is new here is the systematic sweep toolkit + validation on standard benchmarks (lm-eval BBH, GSM8K, MBPP) showing exactly which 3 layers matter for which model. The Devstral logical deduction result (0.22→0.76) was a surprise to me.
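For anyone curious what the duplication looks like mechanically, here's a minimal sketch over a generic layer list (in practice this would operate on a transformer's decoder-layer ModuleList; the indices are illustrative, not the ones found for Devstral):

```python
import copy

def duplicate_layers(layers, start, end):
    """Return a new layer list where layers[start:end] appear twice in a row.

    With a real model you'd deep-copy entries of e.g. model.model.layers
    (an nn.ModuleList) the same way, so the duplicated block runs again
    immediately after its first pass.
    """
    block = [copy.deepcopy(layer) for layer in layers[start:end]]
    return list(layers[:end]) + block + list(layers[end:])
```

e.g. duplicate_layers(layers, 6, 9) replays layers 6-8 right after their first pass.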

If there are ComfyUI nodes that do this for image models, I'd love links; the "cognitive modes" finding (different duplication patterns that lead to different capability profiles from the same weights) might be even more interesting for diffusion models.


I only know of this one: https://github.com/shootthesound/comfyUI-Realtime-Lora. Haven't played with any layer manipulation though.

I was thinking more like this one: https://github.com/AdamNizol/ComfyUI-Anima-Enhancer/

"It adds the Anima Layer Replay Patcher, which can enhance fine detail and coherence by replaying selected internal blocks during denoising."


I tried out the one I linked with sd1.5 today, moved the sliders around like a total noob and got pretty bad results, but I found no way to "replay" any of the layers like the one you linked, so thanks for the link. Must take a lot of trial & error, haha. I'll check it out, assuming it works for the anima preview 2 too.

You can check the results for Devstral here; speed limits me, so these are the results for the first 50 tests of the command:

  # Run lm-evaluation-harness
  lm_eval --model local-chat-completions \
      --model_args model=test,base_url=http://localhost:8089/v1/chat/completions,num_concurrent=1,max_retries=3,tokenized_requests=False \
      --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects \
      --apply_chat_template --limit 50 \
      --output_path ./eval_results

I explored that, again with Devstral, but executing the same circuit 4 times led to lower scores on the tests.

I chatted with the model to see if it was still working, and it seemed coherent to me; I didn't notice anything off.

I need to automate testing like that: pick the local maximum, then iterate from it, picking layers to see if it's actually better, and leave the thing running overnight.
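A sketch of what that overnight automation could look like, assuming a hypothetical score_config callback that wraps the fast test harness and a neighbors function that proposes nearby layer-range configs:

```python
def greedy_layer_search(score_config, start_config, neighbors):
    """Greedy hill-climb: keep moving to a better-scoring neighbor
    until no neighbor beats the current configuration.

    `score_config` is a hypothetical stand-in for the fast test
    harness; `neighbors` proposes nearby layer-range configurations.
    """
    best = start_config
    best_score = score_config(best)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best):
            s = score_config(cand)
            if s > best_score:
                best, best_score = cand, s
                improved = True
    return best, best_score
```

This only finds a local maximum, of course; restarting it from several starting configs is the usual cheap way to hedge against that.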


Can Karpathy's autoresearch be used on this to explore what works and what does not? That is supposed to automate research like this from what I understand.

The other interesting point is that right now I'm copy-pasting the layers, but a patch in llama.cpp could make the same model behave better simply by following a different "flow", without needing more VRAM...

If this is validated enough, it could eventually lead to shipping some kind of "mix" architecture, with layers executed to fit some "vibe"?
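The patch idea above, in a Python sketch (not the actual llama.cpp change, just the concept): the forward pass revisits existing layers in a chosen order, so only the execution schedule changes and no weights are duplicated.

```python
def forward_with_schedule(hidden, layers, schedule):
    """Run existing layers in the order given by `schedule`.

    Indices may repeat: the same weights are revisited, so the "flow"
    changes but memory usage does not.
    """
    for idx in schedule:
        hidden = layers[idx](hidden)
    return hidden

# e.g. replay layers 6-8 once more mid-stack on a 40-layer model:
# schedule = list(range(0, 9)) + [6, 7, 8] + list(range(9, 40))
```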

Devstral was the first one I tried, optimizing for math/EQ, but that didn't result in any better model; then I added the reasoning part, and that resulted in a "better" model.

I used the Devstral with vibe.cli and it looked sharp to me, the thing didn't fail; I also used the chat to "vibe" check it and it looked OK to me.

The other thing is that I picked a particular circuit that was "good", but I don't know if it was a local maximum. I think I ran just about 10 sets of the "fast test harness" and picked the config that gave the highest score... once I had that, I took that model and ran it against lm_eval limited to only 50 tests... again for the sake of speed; I didn't want to wait a week to discover the config was bad.


I published the results for Devstral... in the results folder of the GitHub repo: https://github.com/alainnothere/llm-circuit-finder/tree/main...

I'm using the following configuration: --tasks gsm8k_cot,ifeval,mbpp,bbh_cot_fewshot_logical_deduction_five_objects. I also tried humaneval, but something in the harness is missing and it failed...

Note that I'm running 50 tests for each task, mostly because of time limitations, as it takes about two hours to validate the run for the base model and the modified one.

I'll also try to publish the results of the small test harness I use when testing the multiple layer configurations. For reference, this is phi-4-Q6_K.gguf, still running. I'm now giving more importance to the Reason factor, which comes from running a small subset of all the problems in the task config above.

Initially I tried the approach of maximizing math/EQ, but it resulted in models that were less capable overall, with the exception of math. Math, as in the original research, is basically how good the model is at giving you the answer to a really tough question, say the cube root of some really large number... but that didn't translate to the model being better at other tasks...

  Config  | Lyr | Math   | EQ    | Reas   | Math Δ  | EQ Δ  | Reas Δ  | Comb Δ
  --------|-----|--------|-------|--------|---------|-------|---------|-------
  BASE    |   0 | 0.7405 | 94.49 | 94.12% |     --- |   --- |     --- |    ---
  (6,9)   |   3 | 0.7806 | 95.70 | 94.12% | +0.0401 | +1.21 |  +0.00% |  +1.21
  (9,12)  |   3 | 0.7247 | 95.04 | 94.12% | -0.0158 | +0.55 |  +0.00% |  +0.55
  (12,15) |   3 | 0.7258 | 94.14 | 88.24% | -0.0147 | -0.35 |  -5.88% |  -6.23
  (15,18) |   3 | 0.7493 | 95.74 | 88.24% | +0.0088 | +1.25 |  -5.88% |  -4.63
  (18,21) |   3 | 0.7204 | 93.40 | 94.12% | -0.0201 | -1.09 |  +0.00% |  -1.09
  (21,24) |   3 | 0.7107 | 92.97 | 88.24% | -0.0298 | -1.52 |  -5.88% |  -7.41
  (24,27) |   3 | 0.6487 | 95.27 | 88.24% | -0.0918 | +0.78 |  -5.88% |  -5.10
  (27,30) |   3 | 0.7180 | 94.65 | 88.24% | -0.0225 | +0.16 |  -5.88% |  -5.73
  (30,33) |   3 | 0.7139 | 94.02 | 94.12% | -0.0266 | -0.47 |  +0.00% |  -0.47
  (33,36) |   3 | 0.7104 | 94.53 | 94.12% | -0.0301 | +0.04 |  +0.00% |  +0.04
  (36,39) |   3 | 0.7017 | 94.69 | 94.12% | -0.0388 | +0.20 |  +0.00% |  +0.20
  (6,10)  |   4 | 0.8125 | 96.37 | 88.24% | +0.0720 | +1.88 |  -5.88% |  -4.01
  (9,13)  |   4 | 0.7598 | 95.08 | 94.12% | +0.0193 | +0.59 |  +0.00% |  +0.59
  (12,16) |   4 | 0.7482 | 93.71 | 88.24% | +0.0076 | -0.78 |  -5.88% |  -6.66
  (15,19) |   4 | 0.7617 | 95.16 | 82.35% | +0.0212 | +0.66 | -11.76% | -11.10
  (18,22) |   4 | 0.6902 | 92.27 | 88.24% | -0.0504 | -2.23 |  -5.88% |  -8.11
  (21,25) |   4 | 0.7288 | 94.10 | 88.24% | -0.0117 | -0.39 |  -5.88% |  -6.27
  (24,28) |   4 | 0.6823 | 94.57 | 88.24% | -0.0583 | +0.08 |  -5.88% |  -5.80
  (27,31) |   4 | 0.7224 | 94.41 | 82.35% | -0.0181 | -0.08 | -11.76% | -11.84
  (30,34) |   4 | 0.7070 | 94.73 | 94.12% | -0.0335 | +0.23 |  +0.00% |  +0.23
  (33,37) |   4 | 0.7009 | 94.38 |100.00% | -0.0396 | -0.12 |  +5.88% |  +5.77
  (36,40) |   4 | 0.7057 | 94.84 | 88.24% | -0.0348 | +0.35 |  -5.88% |  -5.53
  (6,11)  |   5 | 0.8168 | 95.62 |100.00% | +0.0762 | +1.13 |  +5.88% |  +7.02
  (9,14)  |   5 | 0.7245 | 95.23 | 88.24% | -0.0160 | +0.74 |  -5.88% |  -5.14
  (12,17) |   5 | 0.7825 | 94.88 | 88.24% | +0.0420 | +0.39 |  -5.88% |  -5.49
  (15,20) |   5 | 0.7832 | 95.86 | 88.24% | +0.0427 | +1.37 |  -5.88% |  -4.52
  (18,23) |   5 | 0.7208 | 92.42 | 88.24% | -0.0197 | -2.07 |  -5.88% |  -7.95
  (21,26) |   5 | 0.7055 | 92.89 | 88.24% | -0.0350 | -1.60 |  -5.88% |  -7.48
  (24,29) |   5 | 0.5825 | 95.04 | 94.12% | -0.1580 | +0.55 |  +0.00% |  +0.55
  (27,32) |   5 | 0.7088 | 94.18 | 88.24% | -0.0317 | -0.31 |  -5.88% |  -6.19
  (30,35) |   5 | 0.6787 | 94.69 | 88.24% | -0.0618 | +0.20 |  -5.88% |  -5.69
  (33,38) |   5 | 0.6650 | 94.96 | 88.24% | -0.0755 | +0.47 |  -5.88% |  -5.41
  (6,12)  |   6 | 0.7692 | 95.39 | 94.12% | +0.0287 | +0.90 |  +0.00% |  +0.90
  (9,15)  |   6 | 0.7405 | 94.65 | 94.12% | -0.0000 | +0.16 |  +0.00% |  +0.16
  (12,18) |   6 | 0.7582 | 94.57 | 88.24% | +0.0177 | +0.08 |  -5.88% |  -5.80
  (15,21) |   6 | 0.7828 | 93.52 | 88.24% | +0.0423 | -0.98 |  -5.88% |  -6.86
  (18,24) |   6 | 0.7308 | 92.93 | 94.12% | -0.0097 | -1.56 |  +0.00% |  -1.56
  (21,27) |   6 | 0.6791 | 92.54 | 82.35% | -0.0615 | -1.95 | -11.76% | -13.72
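For reference, the delta columns in the table above can be recomputed from the BASE row; Comb Δ appears to match EQ Δ + Reason Δ (up to rounding), with Math Δ reported separately since Math is on a 0-1 scale:

```python
# Recompute the delta columns of the sweep table from the BASE row.
# Assumption (from inspecting the table): Comb = EQ delta + Reason delta.
BASE = {"math": 0.7405, "eq": 94.49, "reas": 94.12}

def deltas(row):
    d_math = row["math"] - BASE["math"]
    d_eq = row["eq"] - BASE["eq"]
    d_reas = row["reas"] - BASE["reas"]
    return d_math, d_eq, d_reas, d_eq + d_reas

# e.g. config (6,11), the best Comb in the table:
# deltas({"math": 0.8168, "eq": 95.62, "reas": 100.00})
```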

There is a performance improvement: as per [0][1], the memory speed went up from 5500 MT/s to 6400 MT/s.

[0] https://www.steamdeck.com/en/tech [1] https://www.steamdeck.com/en/tech/deck
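Back-of-the-envelope on that bump, assuming bandwidth scales linearly with transfer rate:

```python
# Relative memory-bandwidth gain from the 5500 -> 6400 MT/s change.
gain = (6400 - 5500) / 5500
print(f"{gain:.1%}")  # -> 16.4%
```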


Sennheiser PC31. I'm not sure if the speakers on those are shared with the PX headphones, which, by the way, sound very nice for music.

https://www.amazon.com/Sennheiser-31-II-Binaural-Headset-Mic...

and in case your laptop/desktop doesn't have a mic and headphone jack, you can use

https://www.amazon.com/Sabrent-External-Adapter-Windows-AU-M...

which works with Linux; not sure about Windows/Mac.

For the cellphone: Plantronics Voyager Legend, which is expensive but works very well every time.


HP Mac Mini. This is actually interesting: when Apple switched to a max of 16 GB of RAM on their PRO line, HP throws out something with PRO graphics and a Xeon.


Apples and oranges. Intel's mobile Skylake processors don't support more than 16GB of low-power RAM; Apple didn't have anything to do with that limitation.

This is a boxy desktop, not a mobile machine.


I up voted your response as it's correct, but for anyone who doesn't want to believe, here's a fully referenced post I made a while back with specific references in the Intel documentation:

https://news.ycombinator.com/item?id=12900834


Now I know why both the new Surface Book and the MacBook Pro have such paltry RAM configurations. Sad times when Intel itself seems to be falling behind.


Intel's mobile processors like the ones in the MacBook Pro support up to 64 gigs of RAM. Their ultra-low-power CPUs only support 16 gigs, but the line in question goes up to 64, even with DDR3L RAM.


Or Apple could have used pro RAM in the MacBook Pro.


Too bad this doesn't have a built in hires display and a keyboard so we could compare orange to oranges.


You're forgetting the cheap Chinese external power brick that you will be tethered to, also...


I've been looking around my Mac Mini and my friend's Mac Pro for hours now.

I couldn't find the hires display.


> ... when Apple switched to max 16 Gb of ram on their PRO line...

What PRO line do _you_ think he was talking about that is limited to 16GB?


To me it seems logical: as we get older, more and more energy would be used on preservation (read: fixing damage, and less efficient processes as a result of age), therefore shrinking/eliminating everything not being used is necessary.

Edit: TL;DR: From the PDF's conclusions:

it is found that working hours up to 25–30 hours per week have a positive impact on cognition for males depending on the measure and up to 22–27 hours for females. After that, working hours have a negative impact on cognitive functioning.


    > To me seems logical: As we get older...
Since the paper only looked at people above 40, I don't see how you can make any such conclusion. There is nothing about younger people. They may do better - they may also be the same or worse. They were not even included.


Nice catch. There are two parts to my attempt to support my theory; the first is included in my post:

As we get older, more and more energy would be used on preservation (read: fixing damage, and less efficient processes as a result of age), therefore shrinking/eliminating everything not being used is necessary.

The second is an entry on how our bodies are machines oriented toward avoiding wasting energy... or better said, preserving it... for that part my canonical reference would be the Algernon argument:

http://www.gwern.net/Drug%20heuristics


That is just your opinion - and it isn't even clear that it means anything. I don't see any supporting evidence. I'm not saying you are wrong (wrong with what, anyway? It's so vague and empty), I'm saying it's just some "statement", nothing more. Even so your list suffers from some severe selection bias: You chose exactly what supports your idea. What about greater "wisdom" of older people? Less desire to succeed at all cost, i.e. possibly more relaxed and willing to look at the big picture? Those two are just "statements", "ideas", so just like you :)

    > The second is an entry on how our bodies are machines oriented to try to avoid wasting energy
Without even going into details about that sentence, that is a statement without a point. What exactly do you want to use it for? To show what? How?

The 1st part "our bodies are machines" is as trite a statement as it gets, pardon me for pointing this out.

The 2nd part "oriented to try to avoid wasting energy" is just as bad if not worse - if the main focus of our bodies was just that, suicide and eternal sleep would be the best options to achieve that goal.

    > As we get older more and more energy would be used on preservation...

    > ...therefore shrinking/eliminating everything not being used it's necessary.
What is that even supposed to mean. Either part. Nor does it seem right (having taken medical courses such as physiology) - citation needed (after defining what you actually mean) for part 1, part 2 is completely unclear I'm sorry to say. What shrinks? What is eliminated?


>That is just your opinion - and it isn't even clear that it means anything. I don't see any supporting evidence.

No, it's not his "opinion", it is his argument. And he provided argumentation why this might be the case.

It surely is not verified or fact, but it's not merely some subjective opinion.


    > And he provided argumentation why this might be the case.
You are either a troll - and a bad one - or a troll. Posting a random link to something isn't "evidence". Not to mention that he didn't say anything, he just wrote "words". Impressive you are impressed.

    > No, it's not his "opinion", it is his argument. And he provided argumentation
    > why this might be the case.
So if I argue it's no longer subjective? I think you have the wrong idea about subjective/objective.

As somebody else (jstanley) responded to a comment in another thread (https://news.ycombinator.com/item?id=12364193):

    > ...for the argument to have any weight you need to show that it is true, not simply state it.


>Posting a random link to something isn't "evidence". Not to mention that he didn't say anything, he just wrote "words". Impressive you are impressed.

I didn't say that parent gave evidence. I say he gave an argument -- you know, premises and logical steps that can be followed (or refuted) to determine if something is true or not.

>So if I argue it's no longer subjective?

No, if someone puts forward an argument, it's by definition not subjective. An argument is something that can be evaluated.

Maybe you conflate arguments with opinions?

Of course an argument might be based on a subjective selection of premises, but that's beside the point. One can always refute the argument by pointing to issues in either its logic or its premises.

