Show HN: I wrote a WebAssembly Interpreter and Toolkit in C (github.com/fastvm)
105 points by 4984 on Jan 9, 2023 | 28 comments


I made Web49 because there are not many good tools for WebAssembly out there. WABT is close, but its interpreter is too slow and its tools are megabytes in size each. Wasm3 is a bit faster, but it only contains an interpreter, nothing else.

Tooling for WebAssembly is held mostly by the browser vendors. It is such a nice format to work with once one removes all the fluff. WebAssembly tooling should not take seconds to do what should take milliseconds, and it should be usable as a library, not just as a command-line program.

I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly. Web49 started as a way to compile WebAssembly to MiniVM, but soon pivoted into its own interpreter and tooling. I could not be happier with it in its current form, and I am excited to see what else it can do with more work.


> I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly.

I'd be very interested to read more about this. It looks like you are using "one big function" with computed goto (https://github.com/FastVM/Web49/blob/main/src/interp/interp....). My experience working on this problem led me to the same conclusion as Mike Pall, which is that compilers do not do well with this pattern (particularly when it comes to register allocation): http://lua-users.org/lists/lua-l/2011-02/msg00742.html

I'm curious how you worked around the problem of poor register allocation in the compiler. I've come to the conclusion that tail calls are the best solution to this problem: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...


> that compilers do not do well with this pattern

As compared to hand-written assembly or the tailcall technique you describe. But (for the benefit of onlookers) a threaded switch, especially using (switch-like) computed gotos, is still more performant than a traditional function dispatch table.

Has there been any movement in GCC wrt the tailcalls feature?

One of the limitations of computed gotos is the inability to derive the address of a label from outside the function. You always end up with some amount of superfluous conditional code for selecting the address inside the function, or indexing through a table. Several years ago, when exploring this space, I discovered a hack, albeit one that only works with GCC (IIRC), at least as of ~10 years ago. GCC supports inline function definitions, inline functions have visibility into goto labels (notwithstanding that you're not supposed to make use of them), and most surprisingly GCC also supports attaching __attribute__((constructor)) to inline function definitions. This means you can export a map of goto labels that can be used to initialize VM data structures, permitting (in theory) more efficient direct threading.

The tailcall technique is a much more sane and profitable approach, of course.


The goto labels can be exported much more directly using inline asm. Further, inline asm can now represent control flow, so you can define the labels in inline asm and the computed jump at the end of an opcode. That's pretty robust to compiler transforms. I just looked up an interpreter in that style:

  #define LABEL_START(TAG) ns_##TAG : __asm__(".p2align 3\n.Lstart_" #TAG ":" :::)
  #define LABEL_END(TAG) __asm__(".Lend_" #TAG ":\n")
  #define PROLOGUE(TAG) LABEL_START(TAG); ip++
  #define EPILOGUE(TAG) __asm__ goto("\tjmpq *%0\n" "\t.Lend_" #TAG ":\n" : : "r"(decode(ip)) : : ALL_LABELS())

Followed by opcodes implemented in this fashion:

  {
    PROLOGUE(add);
    {
      apply_opcode_ADD(&s->data_stack);
    }
    EPILOGUE(add);
  }
Because the labels are defined in assembly, not in C, accessing them from outside the function is straightforward. I wrote a whole load of these at some point; there's probably a version of those macros somewhere that compiles to jumps through a C computed goto as well.


> My experience working on this problem led me to the same conclusion as Mike Pall, which is that compilers do not do well with this pattern

Note that that message is from twelve years ago. A lot's changed since then, not just in compilers but in CPUs. Branch prediction is a lot better now.


Mike's primary complaint is bad register allocation. It is very important to keep the most important state consistently in registers. In my experience, compilers still struggle to do good register allocation in big and branchy functions.

Even perfect branch prediction cannot solve the problem of unnecessary spills.


Very true. I imagine that grouping instructions that use the same registers into their own functions would help with that (arithmetic expressions tend to generate sequences like this). Then you loop within this function while the next instruction is in the same group, and only return to the outer global instruction loop otherwise. If you design the bytecode carefully, you can probably do group checks with a simple bitmask.


Does providing a hint to the compiler using the register keyword address the issue sufficiently?


No, most compilers ignore the register keyword, see: https://stackoverflow.com/a/10675111


Nearly. You need register and to also pass them into (potentially no-op) inline asm. `register int v("eax")` iirc, but it's been years since I did this.

The 'register' is indeed largely ignored, but it has the additional somewhat documented meaning of 'when this variable goes into inline asm, it needs to be in that register'. In between asm blocks it can be elsewhere - stack or whatever - but it still gives the regalloc a really clear guide to work from.


It's `register int v asm("eax")`. However they are very easily elided, especially after higher optimization levels; compilers are very open about this [1].

[1] https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables....


I read a research paper showing that the branch prediction issues are a non-issue with modern predictors (e.g. ITTAGE). It is of course true that register spills happen, but it's not bad enough to make hand-written assembly worth it. Especially when you emulate AOT-compiled code (e.g. RISC-V and WASM), you will already be 3-10x faster than Lua. For my purposes of using this kind of emulator for scripting, it is already fine.

Throw instruction counting into the mix, and you can even be faster than LuaJIT, although I'm not sure how it manages to screw up the counting so badly. I wrote a little bit about it here: https://medium.com/@fwsgonzo/time-to-first-instruction-53a04...


Any ideas on why miniwasm performs better on all the benchmarks except "trap," on which it performs decidedly worse?


The benchmarks were run on macOS, and the trap actually executes a debug interrupt; macOS then checks whether the process is being debugged. Wasm3 just calls exit(1) and prints a message.

And as to why the rest are faster: I spent a lot of time optimizing the interpreter and learning the best way to write interpreters. It's mostly jump threading and mixed data.


I found that most Wasm interpreters are not particularly good at calls. Wizard is not as fast as wasm3 or wamr in raw speed, but is much faster on calls, particularly because it does not copy arguments (value stacks can be overlapped). But Wizard's primary motivation is to be memory efficient, so it interprets in-place. It also supports instrumentation.

Nice work!


Don't take this as anything other than speculation: I wonder if wasm3 is using musttail with opaque function calls in the instruction handlers. That will demolish performance, which is why I only use computed gotos in mine (when available). Even switch-case is faster than musttail when you have to leave the TCO jumps. Which is (as an example) why one should not measure performance by fibonacci number generation. :)


> I wonder if wasm3 is using musttail with opaque function calls in the instruction handlers. It will demolish performance, which is why I am only using computed gotos in mine (when available). Even switch-case is faster than musttail when you have to leave the tco-jumps.

This doesn't match with my experience. After working on this problem a lot, I came to the conclusion that musttail with opaque function calls is one of the best ways of getting good code out of the compiler: https://blog.reverberate.org/2021/04/21/musttail-efficient-i...


I meant having an opaque function call inside your instruction handler. My assembly looks like crap if something doesn't get inlined, and because I have no way of guaranteeing inlining, I simply cannot use TCO. It runs fibonacci faster, but anything that uses memory is way worse, because it pushes and pops a ton of registers in the instruction handler itself rather than in the slow-path opaque function.

An instruction handler here being a dispatch function. It handles a single instruction.

Reading your post, it says so under Limitations: opaque calls trash performance. I guess we agree; then again, I was just reading my assembly, so I had no reason to doubt myself.


Yes, our solution was to make all fallback functions into tail calls. It solves the problem, but it requires a lot of discipline and can be a bit awkward.

I recently saw this, which is a very interesting approach for using non-tail-call fallback functions without trashing the code: https://chromium-review.googlesource.com/c/v8/v8/+/4116584


That's very interesting! Thanks!


Can this share memory with the host? wasm3 doesn't allow this[1] and requires you to allocate VM memory and pass it to the host, but that has several downsides (some buffers come from external sources so this requires a memcpy; the VM memory location isn't stable so you can't store a pointer to it on the host; etc.).

I'm really interested in a fast interpreter-only Wasm VM that can allow the host to share some of its memory with the VM.

[1]: https://github.com/wasm3/wasm3/issues/114


I'd love to see a benchmark of this vs. libwasm https://github.com/SerenityOS/serenity/tree/master/Userland/...


How does it compare to Wasmtime? Do you have a list of supported wasm features?


I am afraid it doesn't work on Linux.


Worked flawlessly for me on Linux. The readme should mention 'emcc' and 'wasm3' as prerequisites for following the bench.py instructions, though.


I'm on macOS, and if I do this:

    > git clone ...
    > make -j
    > ./bin/wasm2wat ./test/core/address.wast
I get:

    >>> ./bin/wat2wasm test/core/address.wast 
    unexpected word: `` byte=256


Impressive!

Will work stop on MiniVM as a result of Web49?


Has this been fuzz tested? Fuzz testing is one of the best ways of discovering security vulnerabilities in C code, and tools like libFuzzer with ASan, or AFL, make it easier than ever.



