"On a TriMedia VLIW with a load latency of three cycles and a jump latency of four cycles, the interpreter achieves a peak performance of four cycles per instruction and a sustained performance of 6.27 cycles per instruction. Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter; these were found to be about five times more compact than native TriMedia code and a slowdown of about eight times, respectively."
They used pragmas, software pipelining of the interpreter loop (http://en.wikipedia.org/wiki/Software_pipelining), and compiler optimizations. Their results aren't that bad: they got CISC-like code density (about five times more compact than native TriMedia code) for a slowdown of about eight times.
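To make the pipelining idea concrete, here is a rough sketch in plain C of what a hand-pipelined interpreter loop can look like. This is not the paper's interpreter: the two-byte toy bytecode, the opcode names and `run_pipelined` are all made up for illustration. The only point is that the fetch of instruction i+1 is issued before instruction i executes, so a multi-cycle load latency can overlap with useful work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy bytecode (NOT the paper's instruction set).
   Every instruction is two bytes: opcode, operand. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2, OP_MUL = 3 };

#define STACK_SIZE 64

/* Runs a program that must end with OP_HALT and returns the top of the
   stack, or -1 on a malformed program. */
int32_t run_pipelined(const uint8_t *code)
{
    int32_t stack[STACK_SIZE];
    size_t sp = 0;
    size_t pc = 0;

    /* Prologue: fetch the first instruction before entering the loop. */
    uint8_t op = code[pc];
    uint8_t arg = code[pc + 1];

    while (op != OP_HALT) {
        /* Stage 1: start loading instruction i+1. This is safe because
           the current instruction is not OP_HALT, so another one follows. */
        uint8_t next_op = code[pc + 2];
        uint8_t next_arg = code[pc + 3];

        /* Stage 2: execute instruction i while those loads are in flight. */
        switch (op) {
        case OP_PUSH:
            if (sp >= STACK_SIZE) return -1;
            stack[sp++] = arg;
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            if (sp < 2) return -1;
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        default:
            return -1;
        }

        /* Advance the pipeline: the prefetched instruction becomes current. */
        pc += 2;
        op = next_op;
        arg = next_arg;
    }
    return sp > 0 ? stack[sp - 1] : 0;
}
```

Feeding it `PUSH 2, PUSH 3, ADD, PUSH 4, MUL, HALT` computes (2 + 3) * 4. On an ordinary out-of-order CPU the compiler and hardware do this overlap for you; on a VLIW the schedule is static, which is presumably why they needed pragmas and compiler help to get it.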
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.3...
"On a TriMedia VLIW with a load latency of three cycles and a jump latency of four cycles, the interpreter achieves a peak performance of four cycles per instruction and a sustained performance of 6.27 cycles per instruction. Experiments are described that demonstrate the compression quality of the system and the execution speed of the pipelined interpreter; these were found to be about five times more compact than native TriMedia code and a slowdown of about eight times, respectively."
They used pragmas. They used loop pipelining: http://en.wikipedia.org/wiki/Software_pipelining and compiler optimizations. Their results aren't that bad, they achieved CISC code density for 8 times slowdown.