X86 assembler in Bash (2001)

s_tec · on June 12, 2014

It warms my heart to see such a brilliant hack. It's especially impressive that this doesn't use any external tools like sed.

If you scroll to the bottom, you can see how it works. Each x86 instruction is actually a shell function, and the input to the assembler is itself a shell script. Each line of the input script basically tells the assembler which bytes to output. There is some fun stuff for dealing with variable-length x86 jump instructions, but otherwise that's the basic idea.
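
To make the idea concrete, here is a minimal sketch of what "each instruction is a shell function" could look like. It is not taken from the linked assembler: `emit_byte`, `program`, and the choice of instructions are my own assumptions, and only fixed single-byte encodings are covered.

```shell
# Hypothetical helper (not from the original assembler): write one byte to stdout.
# The inner printf turns the value into three octal digits, and the outer printf
# interprets the resulting \ooo escape as a raw byte.
emit_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# One shell function per instruction. Only fixed single-byte encodings are shown.
nop()  { emit_byte 0x90; }   # NOP
ret()  { emit_byte 0xC3; }   # near RET
int3() { emit_byte 0xCC; }   # INT3 breakpoint

# The "assembly source" is itself shell code that calls those functions.
program() {
    nop
    nop
    ret
}
```

Running `program | od -An -tx1` should show the three emitted bytes (90 90 c3) in hex.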

The new Mill CPU architecture actually uses a similar idea in their assembler. Each of their instructions is a C++ function. To assemble Mill software, the first step is to run the "assembly language" through a C++ compiler. The resulting program emits the appropriate Mill machine code when it runs. This is an interesting approach, since it turns the assembler into a reusable piece of software.

Another side-benefit of this technique is that it gives the assembler a super-powerful macro language. In this case, the macro language is basically bash shell. So you want to emit 100 add instructions? Just put the instructions inside a bash `for` loop!
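
A rough sketch of that loop, under my own assumptions: `emit_byte`, `add_eax_imm8`, and `emit_adds` are hypothetical names, not functions from the linked assembler, and the encoding shown is the 32-bit `add eax, imm8` form.

```shell
# Hypothetical helper, as assumed here: write one byte to stdout via an octal escape.
emit_byte() {
    printf "\\$(printf '%03o' "$1")"
}

# ADD EAX, imm8 in 32-bit mode: opcode 0x83, ModRM 0xC0, then the immediate byte.
add_eax_imm8() {
    emit_byte 0x83
    emit_byte 0xC0
    emit_byte "$1"
}

# The shell loop is the macro language: 100 iterations emit 100 instructions.
emit_adds() {
    for i in $(seq 100); do
        add_eax_imm8 1
    done
}
```

Since each instruction here is three bytes, `emit_adds | wc -c` should report 300. This sketch leans on `seq` for brevity, which is an external tool.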

illumen · on June 12, 2014

I worked on a runtime assembler based on C++ that worked like this. The nice thing is that you could dynamically generate code from C++, which made hardware shader languages run at an acceptable speed on CPUs: rather than having branches in your code to support thousands of possible combinations of inputs, you would generate the exact code needed for that data and function.

Strangely, I still sometimes use some of these assembler techniques in JavaScript to great effect. Want to loop faster? Use something like Duff's device. Have lots of ifs/elses in a loop? Parse it out, and make specific code for the inputs at runtime. Not often, but sometimes it's been helpful to improve things.
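
As an illustration of "make specific code for the inputs at runtime", here is a small sketch written in shell to match the assembler under discussion rather than in C++ or JavaScript. The generator `make_filter` and its `invert` and `double` options are invented for this example. The point is only that the option checks happen once, at generation time, and the generated function contains no branches.

```shell
# Hypothetical generator: define a function NAME that applies only the requested
# steps, so the generated body contains no per-call option checks.
make_filter() {
    name=$1
    shift
    body='v=$1'
    for opt in "$@"; do
        case $opt in
            invert) body="$body; v=\$((255 - v))" ;;
            double) body="$body; v=\$((v * 2))" ;;
        esac
    done
    # Define the specialized function from the assembled source text.
    eval "$name() { $body; echo \"\$v\"; }"
}

make_filter invert_then_double invert double
make_filter double_only double
```

With these definitions, `invert_then_double 100` computes (255 - 100) * 2 and `double_only 100` computes 100 * 2, each through a straight-line function body.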

fit2rule · on June 12, 2014

Warms my heart too, but what interests me most about this story is that I recall assemblers being written this way as a sort of standard technique in the '70s and '80s. It seems to have been a lost art, now recovered (or at least it was in 2001). I remember some Tandem (or perhaps Wang?) machines used the shell to emit assembly instructions, and in the days of MIPS as a hardware manufacturer (RISCOS pizzabox) there were such assemblers, very rudimentary, for the boot console, so you could emit code to start the machines.

{Hmm, what is the term for this? It occurs so often that I'm sure there must be a name for it: old stuff becoming new and interesting again.}

mattgodbolt · on June 12, 2014

JavaScript emulators also use this internal code generator approach. For example, the jsbeeb 6502 emulator code is generated from the opcodes text and then `eval`ed to yield the actual code to run for each instruction.

See http://xania.org/201405/jsbeeb-getting-the-timings-right-CPU towards the end.
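
For a feel of the general shape, here is a toy sketch in shell (not jsbeeb's actual JavaScript). The table format and the `op_XX` naming are made up for illustration, and flag updates are left out. The opcode values are the standard 6502 ones for INX, DEX, INY and NOP.

```shell
# Hypothetical opcode table: hex opcode, mnemonic, and the shell code implementing it.
# Flag updates are deliberately left out to keep the sketch short.
table='E8 INX x=$(( (x + 1) & 255 ))
CA DEX x=$(( (x - 1) & 255 ))
C8 INY y=$(( (y + 1) & 255 ))
EA NOP :'

# Generate one function per table row and eval it into existence.
while read -r opcode mnemonic body; do
    eval "op_$opcode() { $body; }"
done <<EOF
$table
EOF

# Run a tiny "program" by dispatching on the opcode bytes.
x=0 y=0
for opcode in E8 E8 C8 CA; do
    "op_$opcode"
done
echo "x=$x y=$y"   # prints x=1 y=1
```

The per-instruction functions exist only after the table has been processed, which is the same trade the emulator makes: describe each opcode once as text, then let the language turn that text into runnable code.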

vidarh · on June 12, 2014

> and in the days of MIPS as a hardware manufacturer (RISCOS pizzabox)

I was very confused there for a moment - I'd never heard of a RISC OS pizza box in any context other than the Acorn / ARM machines. In case anyone else is similarly confused: this is RISC/os, the Unix for MIPS-based systems.

fit2rule · on June 12, 2014

Right, sorry for the confusion, and thanks for clearing that up.

mzs · on June 12, 2014

Maybe you mean the ROM monitor?

4ad · on June 12, 2014

Related, aaa, by Henry Spencer[1]: http://doc.cat-v.org/henry_spencer/amazing_awk_assembler/

   "aaa" (the Amazing Awk Assembler) is a primitive assembler written entirely
   in awk and sed.  It was done for fun, to establish whether it was possible.
   It is; it works.

Also, awf, also by Henry Spencer: http://doc.cat-v.org/henry_spencer/awf/

    This is awf, the Amazingly Workable Formatter -- a "nroff -man" or
    (subset) "nroff -ms" clone written entirely in (old) awk.

[1] http://en.wikipedia.org/wiki/Henry_Spencer

rwmj · on June 12, 2014

On a similar topic, FORTH assemblers are interesting. You write FORTH code like:

    : RDTSC
       RDTSC
       EAX PUSH
       EDX PUSH
    ;CODE

which compiles to a wrapper that runs the rdtsc instruction and pushes the result (two 32-bit words) onto the FORTH stack.

Which reminds me, I must finish this one: http://git.annexia.org/?p=jonesforth.git;a=blob;f=jonesforth...

dvdkhlng · on June 12, 2014

[Edit: yes the code above seems to be valid for the author's non-standard "jonesforth" implementation of Forth. ;CODE is defined in the ANS Forth standard to mean something very different.]

Sorry for the nit-picking, but the code you give is most likely wrong. In Forth, assembler code is enclosed in CODE...END-CODE, whereas ;CODE is used to attach machine-code run-time semantics to words created with CREATE [1].

  CODE RDTSC  ( -- d )
       RDTSC
       EAX PUSH
       EDX PUSH
  END-CODE

Here RDTSC and PUSH are not compiled but executed immediately to output the corresponding machine code to the current definition (which also uses the name RDTSC albeit in a different vocabulary).

You can generate machine code by invoking the Assembler's words from Forth words (a "word" is what other languages call a function), which can be used as a simple macro facility or as a facility to dynamically generate machine code, do automatic register allocation etc.

   ALSO ASSEMBLER
   : my-macro      RDTSC     EAX PUSH     EDX PUSH ;
   CODE RDTSC   my-macro  END-CODE

BTW, for those interested, here is the source of the x86 assembler, written in Forth, that ships with GNU Forth [2]. The amd64 version [3] even supports SSE. When writing assembler code in Gforth, the non-standard ABI-CODE facility [4] is preferable to CODE...END-CODE.

[1] http://www.forth200x.org/documents/forth13-1.pdf

[2] http://git.savannah.gnu.org/cgit/gforth.git/tree/arch/386/as...

[3] http://git.savannah.gnu.org/cgit/gforth.git/tree/arch/amd64/...

[4] http://www.complang.tuwien.ac.at/anton/euroforth/ef10/papers...