As far as I am aware, Intel has not yet announced chips or its plans for chips.
16 registers is not awful, but 32 certainly gives a lot more flexibility. Amd64 has gotten by with 16 for 20 years, and 32-bit Arm did for a long time, as did VAX and 68000 just a few years before Arm. But it's tight -- Arm32 and Amd64 on Windows only allow four function arguments to be passed in registers, and the SysV ABI for Amd64 allows six. VAX and 68000 ABIs sadly (and, really, unnecessarily) passed arguments on the stack, leaving a lot of performance on the table.
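To make the argument-register limits concrete, here is a small C sketch. The function name sum7 is made up for illustration, and the register assignments in the comment are what the respective ABIs specify for plain int arguments; a compiler that inlines the call will of course not pass anything anywhere.

```c
/* Seven int arguments: one more than the SysV Amd64 ABI passes in registers.
 *
 *   SysV Amd64:     a..f in rdi, rsi, rdx, rcx, r8, r9;  g on the stack
 *   Windows Amd64:  a..d in rcx, rdx, r8, r9;            e, f, g on the stack
 *   Arm32 (AAPCS):  a..d in r0, r1, r2, r3;              e, f, g on the stack
 *
 * sum7 is a hypothetical example function, not taken from any real code.
 */
int sum7(int a, int b, int c, int d, int e, int f, int g) {
    return a + b + c + d + e + f + g;
}
```

Compiling this with optimization and reading the assembly for a non-inlined call is probably the quickest way to see where each argument ends up on a given target.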
I'm amused that the Intel 4004 in 1971 had 16 GPRs plus an accumulator: double the 8086 and more than double the 8008/8080. Narrower, of course.
Have you tried spilling things to SSE registers instead of to the stack? I've never done anything that might benefit from that on x86, but my understanding is that by the time you get to ... maybe Sandy Bridge? ... it's all one big register pool and the moves are just register renaming, not physical movement.
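For anyone who wants to experiment, a minimal sketch of the idea in C with SSE2 intrinsics might look like the following. The function names are made up for illustration, and this only expresses the intent: the compiler is free to fold the round trip away or to pick its own spill strategy, so whether a GPR-to-XMM move is actually emitted (and what it costs) has to be checked in the generated assembly and measured on the target microarchitecture. The non-x86 branch is just a stack-slot stand-in so the file still builds elsewhere.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>  /* SSE2 intrinsics; SSE2 is baseline on Amd64 */

/* Park a 64-bit integer in the low lane of an XMM register
 * (typically a MOVQ xmm, r64). */
static inline __m128i park_in_xmm(uint64_t v) {
    return _mm_cvtsi64_si128((long long)v);
}

/* Bring the value back into a general-purpose register
 * (typically a MOVQ r64, xmm). */
static inline uint64_t fetch_from_xmm(__m128i x) {
    return (uint64_t)_mm_cvtsi128_si64(x);
}

/* "Spill" v to a vector register and reload it. */
uint64_t spill_roundtrip(uint64_t v) {
    __m128i parked = park_in_xmm(v);
    return fetch_from_xmm(parked);
}
#else
/* Fallback for other targets: an ordinary spill to a stack slot. */
uint64_t spill_roundtrip(uint64_t v) {
    volatile uint64_t slot = v;
    return slot;
}
#endif
```

A real test would keep many values parked at once in a register-starved loop and compare against the plain stack-spill version.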
If I recall correctly, I was warned (quite a long time ago) that moving data between the general-purpose registers and the vector registers is not that cheap, to the point that it may be better to just use the stack (which is effectively L1 cache).
Any leads on that topic? I suspect the answers are microarchitecture-dependent and hidden deep in those abominations, LLVM and GCC.
I don't have any. It's something I thought about trying for an update of rv8 (to assign almost all RISC-V registers to x86 registers), but I have never got around to it.