1 fps is the kind of frame rate you should be getting with 20000 collisions, not 256. Your algorithm is the bottleneck here, not the programming language.
You say "pixel-perfect". If you have enough spare memory, one simple algorithm would be: render an offscreen canvas of the whole arena, draw each sprite as a stencil in a different colour, and test against that. Linear time, and no need to segment anything. (You might need to use the high bits of the canvas, though: I don't know how anti-fingerprinting measures work, but I expect they replace the low bits of a canvas' data with noise.)
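To make the stencil idea concrete, here is a minimal CPU-side sketch. It is an assumption-laden illustration, not the canvas version: a plain `Uint16Array` stands in for the offscreen canvas, a numeric sprite id (1 to 65535) stands in for the colour, and the sprite shape `{ id, x, y, w, h, mask }` and the function name `findCollisions` are hypothetical. Because each cell remembers only the last sprite drawn there, a pixel shared by three sprites reports only the pairs it sees in draw order.

```javascript
// Hypothetical sprite shape: { id, x, y, w, h, mask }, where mask is a
// Uint8Array of w*h entries and a nonzero entry means an opaque pixel.
// Ids must be in 1..65535, because 0 marks an empty arena cell.
function findCollisions(sprites, arenaW, arenaH) {
  // One cell per arena pixel: 0 = empty, otherwise the id last drawn there.
  const stencil = new Uint16Array(arenaW * arenaH);
  const hits = new Set();
  for (const s of sprites) {
    for (let row = 0; row < s.h; row++) {
      const ay = s.y + row;
      if (ay < 0 || ay >= arenaH) continue; // clip to the arena
      for (let col = 0; col < s.w; col++) {
        const ax = s.x + col;
        if (ax < 0 || ax >= arenaW) continue;
        if (!s.mask[row * s.w + col]) continue; // transparent pixel
        const i = ay * arenaW + ax;
        const other = stencil[i];
        if (other !== 0 && other !== s.id) {
          // Store each pair once, smaller id first.
          hits.add(other < s.id ? `${other}:${s.id}` : `${s.id}:${other}`);
        }
        stencil[i] = s.id;
      }
    }
  }
  return hits;
}
```

The work is proportional to the total number of sprite pixels drawn, which is where the "linear time" comes from: no sprite is ever compared against another sprite directly.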
The last time this code ran was in 2012, and computers and JS engines were slower then.
Also, my Comfy language interpreter is easy to bog down with sprite signal processing. The only cure for that is improving interpreter performance. I need it to run well on tablets and phones, so there is lots of tuning to do in general.
I like your algorithm idea! Stencils are indeed a fast way to work. I think I'll still do better on a worker thread with bit swizzling on the CPU rather than reading back stencil video memory. I'll have to spike it.
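For comparison, one way the CPU-side bit approach could look is sketched below. This is a guess at the general technique, not the actual Comfy code: it assumes sprites at most 32 pixels wide, each row packed into one 32-bit word with bit 31 as the leftmost pixel, and the sprite shape `{ x, y, h, rows }` and the name `masksCollide` are hypothetical.

```javascript
// Hypothetical sprite shape: { x, y, h, rows }, where rows is a Uint32Array
// holding one word per pixel row and bit 31 is the sprite's leftmost pixel.
// Assumes sprites are at most 32 pixels wide.
function masksCollide(a, b) {
  // Order the pair so that a is the leftmost sprite.
  if (a.x > b.x) [a, b] = [b, a];
  const dx = b.x - a.x;
  // JS shift counts wrap modulo 32, so reject wide gaps explicitly.
  if (dx >= 32) return false;
  // Only rows where the two sprites overlap vertically can collide.
  const top = Math.max(a.y, b.y);
  const bottom = Math.min(a.y + a.h, b.y + b.h);
  for (let y = top; y < bottom; y++) {
    const ra = a.rows[y - a.y];
    const rb = b.rows[y - b.y] >>> dx; // move b's row into a's column frame
    if ((ra & rb) !== 0) return true;  // any shared opaque pixel
  }
  return false;
}
```

Each row pair costs one shift and one AND, and the function touches no DOM or canvas state, so it could run inside a worker. It still needs a broad-phase step to choose which pairs to test.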