Hacker News

This has 5 backend pipelines, which I would say is getting there speed-wise. I think AMD Zen has 10 execution pipelines to play with, although the exact structure is usually not commented upon so I don't know how long they are.

It's 32-bit, so it's not necessarily desktop-class, but for scale it should blow a microcontroller out of the sky.



> This has 5 backend pipelines, which I would say is getting there speed-wise.

It's pretty easy to slap down pipelines. What is far harder is keeping them all fed and running without excessive stalling and bubbles, while not killing your max frequency or blowing through your area and power budgets.


If I understand correctly, most designs effectively #define the number of frontend and backend pipelines. That means you are free to instantiate as many as you like, for as much performance as you like, but some parameters may scale very badly outside what the core was designed for. For example, I would imagine gate count and power draw climb to infeasible levels much beyond the quoted configurations.


It's a bit harder than that. You have to worry about forwarding results from one pipeline to another, bypassing the register file, if you want to avoid stalls waiting for results. The transistor cost of the bypass network grows as the square of the number of pipelines, so it can be pretty significant, potentially larger than the cost of the execution pipelines themselves.

Many modern designs aren't fully bypassed and involve clusters of pipelines to manage this. IBM's recent Power chips and Apple's ARM cores are particularly known to do this.
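To make the scaling concrete, here's a toy back-of-envelope model (my own sketch, not from any real design): with full bypass, every pipe's result can be forwarded to every source operand of every pipe, so path count grows as the square of the pipe count, while clustering restricts full bypass to within each cluster.

```python
# Toy cost model: count result-forwarding paths in a fully bypassed
# design vs. a clustered one. Stage counts and operand counts only
# scale the constant, so they're folded into src_operands here.

def full_bypass_paths(n_pipes: int, src_operands: int = 2) -> int:
    """Every pipe forwards to every source operand of every pipe: O(n^2)."""
    return n_pipes * n_pipes * src_operands

def clustered_bypass_paths(n_clusters: int, pipes_per_cluster: int,
                           src_operands: int = 2) -> int:
    """Full bypass only inside each cluster; cross-cluster results go
    through the register file (paying a latency penalty instead)."""
    return n_clusters * full_bypass_paths(pipes_per_cluster, src_operands)

# 8 pipes fully bypassed vs. two clusters of 4:
print(full_bypass_paths(8))          # 128 paths
print(clustered_bypass_paths(2, 4))  # 64 paths
```

Doubling the pipe count quadruples the fully-bypassed path count, which is why clustering shows up once designs get wide.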


Does optimal pipeline topology vary with the specific workload, or is it more dependent on the instruction set semantics and result visibility?

Is this something that could be automatically optimized via simulation?

Is it something that could be made dynamic and shared across a bunch of schedulers, so that cores could move between big/little during execution?


It's highly dependent on both workload and instruction set.

I'm sure it could be automatically optimized in theory, even without the solution being AI-complete, but I don't think we have any idea how to do it right now.

No, not unless you're reflashing an FPGA. You'd have better luck sharing subcores for threadlets I think.


I've been wondering if it could make sense to put a small ALU with each register (when they are not virtual/renamed but maybe then too?). This would allow instructions that use the destination register as a source to only use one read port, potentially allowing higher IPC. Has anyone looked into this and if so what did the analysis show?
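I don't know of a published analysis, but a first step would be measuring how often the pattern even occurs. A toy sketch (entirely my own, with a made-up trace) of counting how many instructions reuse their destination as a source, i.e. the ones that would save a read port under this scheme:

```python
# Estimate how often the destination register is also a source operand,
# since those are the instructions the per-register-ALU idea would let
# issue with only one register-file read port.

# Hypothetical trace of (dest, src1, src2) architectural register numbers.
trace = [
    (1, 1, 2),   # add r1, r1, r2  -> dest reused as source
    (3, 1, 2),   # add r3, r1, r2
    (2, 2, 2),   # add r2, r2, r2  -> dest reused as source
    (4, 5, 6),   # add r4, r5, r6
]

def dest_is_source(inst) -> bool:
    dest, src1, src2 = inst
    return dest == src1 or dest == src2

reusing = sum(1 for inst in trace if dest_is_source(inst))
print(f"{reusing}/{len(trace)} instructions could save a read port")
```

Running the same count over real compiled code would tell you whether the saved ports are worth the extra ALUs.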


It has two integer pipes. This core is similar in performance to the dual-issue in-order Cortex-M7 (also 2 DMIPS / MHz). I guess I view this core as a technology demonstration that you can have a practical (not too large, not too low a clock) out-of-order processor in an FPGA.

On the other hand, I think there is still space for a small in-order dual-issue FPGA RISC-V with 2 DMIPS / MHz performance.

Actually there is one:

https://opencores.org/projects/biriscv

1.9 DMIPS / MHz.
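DMIPS/MHz multiplies straight out against the clock; for scale (the 100 MHz figure is my assumption, not from the project page):

```python
# DMIPS/MHz is a per-clock figure, so absolute DMIPS scales linearly
# with frequency. 100 MHz is an assumed FPGA clock, not a quoted one.
dmips_per_mhz = 1.9
clock_mhz = 100
print(round(dmips_per_mhz * clock_mhz))  # ~190 DMIPS
```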


Do you really want pipelining on a microcontroller though?


Why not? I think the Cortex-M3 has a 3-stage pipeline; obviously I would not expect the $0.01 Chinese ones to have one.

It's worth saying here that a big OoO CPU is pipelined differently from a small/old RISC processor. Even in discussions about compiler optimizations, people still use terminology like "pipeline stall", but a modern CPU has a pipeline that handles fetching an instruction window, finding dependencies, doing register renaming, and dispatching for execution. That pipeline is not like the old IF->ID->EX->MEM->WB, and it won't stall the way the classic five-stage Pentium did. The execution pipes themselves have a more familiar structure.


Not sure about the Cortex-M3, but I can confirm that the Cortex-M4 has a pretty basic 3-stage in-order pipeline that gets flushed on branches. So unless there are caches between the core and memory (I think some STM32s have that), the CPU is still trivially deterministic.


Yes, the M4 parts from ST have the ART Accelerator (a flash cache), which makes them less deterministic (but much faster) for flash access. See https://www.st.com/en/microcontrollers-microprocessors/stm32...



