I would guess the cause is contention on the synchronization primitives in the dispatcher. The queues look like they must be locked before they can be modified:
// Scheduling helpers. Sched must be locked.
static void gput(G*); // put/get on ghead/gtail
This is the comment on the lock:
in the uncontended case,
* as fast as spin locks (just a few user-level instructions),
* but on the contention path they sleep in the kernel.
* a zeroed Lock is unlocked (no need to initialize each lock).
A better approach for multicore would probably be work-stealing queues.
I would tender a guess that since Go is compiled to native code, it makes many more syscalls to manage its threads. Stackless probably does much more of that management inside the VM and doesn't need to make as many (or any) syscalls to do so.
This is just a guess from someone who knows nothing about either (Stackless does run in a VM, right?), so take it with many large grains of salt.