It's interesting to read these articles about the implementation of an actual CPU, compared to the very idealised perspective that a lot of CS (and even some hardware-oriented) textbooks have. There's a lot of complexity that those who are used to the latter would otherwise not have seen.
Also worth noting: like the Z80 and 8080/8085, but unlike the 6502, the designers of the 8086 opted for instructions that take more cycles and made up for it with a higher clock frequency, whereas the 6502 has shorter instructions and a lower clock frequency.
> and the hardware fills in the particular ALU operation from bits 5-3 of the machine instruction.
This is clearly seen when the instructions are represented in octal:
0u4 alu AL, imm8
0u5 alu AX, imm16
The ALU operations are respectively: ADD, OR, ADC, SBB, AND, SUB, XOR, CMP.
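To make the pattern concrete, here is a small decoding sketch (my own, in Python, not from the article): the octal "u" digit is exactly bits 5-3 of the opcode, and the low bit selects byte vs word.

    # Decoding sketch for the "ALU accumulator, immediate" group (octal 0u4 / 0u5).
    ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

    def decode_alu_imm(opcode):
        # Opcode must look like 0u4 or 0u5 in octal (bits 7-6 = 00, bits 2-1 = 10).
        if opcode & 0o307 not in (0o004, 0o005):
            return None
        op = ALU_OPS[(opcode >> 3) & 7]     # bits 5-3 pick the ALU operation
        reg = "AX" if opcode & 1 else "AL"  # low bit: byte (AL) vs word (AX)
        return f"{op} {reg}, imm"

    assert decode_alu_imm(0o004) == "ADD AL, imm"
    assert decode_alu_imm(0o065) == "XOR AX, imm"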
According to the datasheets, the 80186 can execute the same instruction in 3 cycles, and the 80286 can do it in 2.
If you are interested in this topic, the book "The Anatomy of a High-Performance Microprocessor: A Systems Perspective" by Bruce Shriver and Bennett Smith describes the architecture of the AMD K6 processor (late '90s) in detail, including the structure of the pipeline and microcode.
> instead of the 6502 that has shorter instructions and a lower clock frequency.
The 6502 is a simpler CPU: 8 bits wide, with fewer registers and fewer instructions. It uses a PLA instead of microcode, presumably because it could still get away with that.
> According to the datasheets, the 80186 can execute the same instruction in 3 cycles, and the 80286 can do it in 2.
I've disassembled the 80186 microcode; it's very similar to the 8086 but with some enhancements. Most importantly, multiply/divide and address calculations like [BX+SI+imm] are now handled in hardware rather than microcode. The "update flags" bit has another role when combined with a jump: it means the following µ-instruction will always be executed during the next cycle (like the branch delay slot on MIPS).
Sometimes a jump without this bit set seems to be needed, for no reason other than to delay execution. An ALU operation with an immediate operand on the '186 uses this code:
0 Q -> tmpb 0 L16 1
1 M -> tmpa 1 XI tmpa NX
2 SIGMA -> M 4 none RNI F
Here the first line can load either 8 or 16 bits into tmpb, with a conditional jump to the next line (causing a 1-cycle delay) if the operand is 16 bits. For 8-bit operands it should be three cycles total, and maybe also for 16-bit operands when pipelined?
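For readers not used to the notation, my reading of those three lines as straight-line pseudocode (Q = immediate bytes from the prefetch queue, M = the register named by the instruction, XI = ALU operation taken from the instruction, SIGMA = ALU result; these interpretations follow the article's conventions and are my assumption here):

    # Illustrative model only, not actual microcode semantics.
    def alu_reg_imm(regs, dest, imm, op):
        tmpb = imm             # 0: Q -> tmpb (extra cycle via the L16 jump if imm is 16-bit)
        tmpa = regs[dest]      # 1: M -> tmpa (XI: ALU operation comes from the instruction)
        regs[dest] = op(tmpa, tmpb) & 0xFFFF  # 2: SIGMA -> M, update flags (F), RNI
        return regs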
A more interesting example is BOUND, which compares a register with a (signed) lower and upper bound in memory, in a somewhat convoluted way:
0 R -> tmpa 6 R DD,P2
1 OPR -> tmpb 1 SUBT tmpa
2 F -> tmpc 0 UNC 9 F
3 SIGMA -> tmpa 1 SHL tmpa F
4 CR -> OPR 5 UNC INT_N
5 R -> tmpb 6 R DD,P0
6 OPR -> tmpa 1 SUBT tmpa
7 4 CF1 none
8 SIGMA -> tmpa 1 SHL tmpa F
9 SIGMA -> none 0 OF 12
10 tmpc -> F 0 CY 4
11 0 UNC 13
12 tmpc -> F 0 NCY 4
13 0 NF1 5
14 4 none RNI
Step by step, the following happens:
(0) move the register operand into tmpa, read lower bound from memory
(1) move lower bound into tmpb, set ALU to subtract tmpa-tmpb
(2) preserve old flags in tmpc; execute next line and then goto line 9
(3) move ALU output into tmpa, shift it left and update flags
(9) if overflow, goto line 12
-> (12) if carry clear, goto line 4 (-> "out of bounds" exception); restore flags
else
-> (10) if carry set, goto line 4; restore flags
(11) goto line 13
(13) if internal flag clear (should be true), goto line 5
(5) move register operand into tmpb, read upper bound from memory
(6) move upper bound into tmpa, subtract tmpa-tmpb
(7) toggle internal flag
(8) move ALU output into tmpa, shift left & update flags
(9-12) same as before
(13) if internal flag clear (should be false), goto line 5
(14) run next instruction
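Ignoring the flag gymnastics, the net architectural effect of that sequence is the usual BOUND check. A rough Python model of the behaviour (mine, for orientation only; it models the result, not the microcode):

    def bound(reg, lower, upper, width=16):
        # Raise "INT 5" if reg is outside [lower, upper]; all values are signed.
        def signed(x):
            x &= (1 << width) - 1
            return x - (1 << width) if x & (1 << (width - 1)) else x
        if not (signed(lower) <= signed(reg) <= signed(upper)):
            raise Exception("INT 5: BOUND range exceeded")

    bound(5, 0, 10)          # in range, no exception
    # bound(0xFFFF, 0, 10)   # -1 is below the lower bound -> would raise INT 5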
While the x86 instruction set can do both signed and unsigned comparisons, the microcode - for the 80186 at least - seems to be limited to testing the carry and overflow flags individually (sign bit xor overflow means a < b for signed integers). The internal flag is used as a 1-bit "loop counter". It is normally cleared at first, except when there is a REP prefix. Executing "REP BOUND" on the 80186 will only compare against the lower bound!
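The "sign xor overflow" rule mentioned there is the standard signed-comparison identity; in code form (illustrative):

    def signed_less_than(a, b, width=16):
        # a < b for signed integers, derived from the flags of a - b (SF xor OF).
        mask = (1 << width) - 1
        a, b = a & mask, b & mask
        diff = (a - b) & mask
        sf = diff >> (width - 1)                    # sign flag of the result
        of = ((a ^ b) & (a ^ diff)) >> (width - 1)  # overflow of the subtraction
        return bool(sf ^ of)

    assert signed_less_than(0xFFFF, 1)       # -1 < 1
    assert not signed_less_than(1, 0xFFFF)   # 1 < -1 is false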
Apparently keeping the microcode short was more important than efficiency, or they would have used a few more instructions instead of jumping backwards.
Starting with the '286, the REP prefix would be handled during decoding, either ignored or going to a completely different microcode address.
Also on the 80186 but no other processor, AAM with an immediate operand of zero does not cause an exception (because the check for a zero divisor is still done in microcode, and AAM doesn't do it since the only documented variant is the one dividing by 10 decimal). The result is AH=FF, AL=unchanged.
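For context, AAM divides AL by its immediate operand (normally 10), putting the quotient in AH and the remainder in AL. A tiny model with the quirk described above; the zero-divisor branch reflects this comment's claim about the 80186, not the documented behaviour:

    def aam_80186(al, divisor=10):
        if divisor == 0:
            return 0xFF, al                 # claimed 80186 quirk: AH=FF, AL unchanged, no exception
        return al // divisor, al % divisor  # normal case: AH = quotient, AL = remainder

    ah, al = aam_80186(0x2B)   # 43 decimal -> AH=4, AL=3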
Note that some of the comments in that post may be slightly wrong, and maybe I didn't add enough explanation for some things. Also ignore the stuff about getting the 80286 bits out, I was overly optimistic on that...
When you look at the die shots, do you get any idea of how the masks were drawn in those days? I assume computers of the day would not have been powerful enough for a CAD tool to design a silicon or metal layer of that complexity. Were they designed in smaller parts and then stitched together into a final design? -- perhaps evident by different parts having different design styles, or not quite using all the space between functional units. Were all parts done on a computer, or some pieces hand-drawn?
There's a detailed article "Recollections of Early Chip Development at Intel". For the first 10 years, Intel drew schematics and layout by hand. In 1974 they started digitizing on a Calma GDS I system. For the 8086, they spent two weeks manually matching the schematics against the drawn plots looking for errors and found 20 errors.
They cut out the drawings on sheets of Rubylith to create the masks. The earliest chips used a "Coordinatograph", a tool where you'd manually enter the coordinates from the drawing and it would move the cutter appropriately. Then they moved to a Xynetics plotter with a knife. The Rubylith sheets would be sent to the mask vendor, who would photographically reduce it down to mask size.
Intel had simulation tools, but they could only simulate 5-20 transistors at a time. So they could only simulate small, critical chunks like parts of the ALU.
In case you haven't seen it (although I'm sure you have), Now The Chips Are Down had a rare look inside an Intel fab in the 1970s: https://en.wikipedia.org/wiki/Now_the_Chips_Are_Down There are copies of the film on the internet.
> I assume computers of the day would not have been powerful enough for a CAD tool to design a silicon or metal layer of that complexity.
A mask can be stored as a bitmap image, at 1-bit depth, one image per layer. You only need the image large enough to resolve your smallest features. At 10,000 by 10,000 pixels, one layer only takes ~12 megabytes to store uncompressed. (It would be easily compressed, too.)
A late 70s VAX minicomputer would be able to store dozens of such mask images, and could compute the difference between two masks of that size in some tens of minutes, approximately. That'll tell you if any parts overlap.
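As a sanity check on those numbers (my own back-of-the-envelope code, nothing period-accurate):

    width = height = 10_000                       # pixels, 1 bit each
    print(width * height / 8 / 1e6, "MB/layer")   # -> 12.5 MB uncompressed

    def overlapping_rows(layer_a, layer_b):
        # Each layer is a list of per-row bitmasks (Python ints); return the
        # row indices where the two layers share any pixel.
        return [y for y, (a, b) in enumerate(zip(layer_a, layer_b)) if a & b]

    assert overlapping_rows([0b1100, 0b0011], [0b0010, 0b0010]) == [1]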
Using CAD in that way in chip design dates to at least the 1970s. From there, standard blocks (basically copy and paste) are an obvious improvement over scanning in cut-out designs or setting individual pixels. Actually synthesizing the masks automatically, from high-level hardware description languages, is something that developed gradually over the 1980s.
I can perhaps give some insight into this. One step in the mask making process was hand cutting the design into a masking film called Rubylith. Rubylith is made of a backing sheet coated in a UV resistant red film. The design was hand cut into this film on a large scale then used as a “master” to expose smaller copies.
Sort of a side question, but with all these projects excavating old dies (which are intensely interesting, don't get me wrong), I'm curious whether any of the engineers who worked on this stuff originally (such a short time ago) ever pop up to explain how and why it was done the way it was?
I've talked to some of the engineers to get background on the design of the 8086 chip. They also wrote some papers at the time explaining why things were done particular ways. The 8086 patents are also very detailed; many older patents actually provide a lot of useful information.
Bit off-topic, but what you do is inspiring; you are the kind of person who makes me think "damn, it's crazy what humans can do". We are lucky to have you.
Yes, the 8086 microcode does get some parallelism, but it's a pretty short Very Long Instruction Word :-) An 8086 micro-instruction is just 21 bits long, while VLIW words are much longer. I'm studying the IBM System/360 Model 50, and its micro-instructions are 90 bits long. A Model 50 micro-instruction is so complex, with 28 different fields, that it isn't represented by a line of code but by an 11-line block of code.
The 8086 is just starting to pipeline so it is starting small.
Recently I did some reading about the Transport Triggered Architecture and for the first time thought "I could make a special purpose microprocessor" and might actually try it with an FPGA. That got me thinking a lot about processor architecture, particularly the ability for one (real or micro) instruction to simultaneously control multiple subsystems.
What I see as problematic, though, is that the TTA stalls when it is waiting for memory. I think a high-performance system based on the TTA would need some kind of programmable memory access engine that would try to schedule fetches ahead of time programmatically... But I think then it is getting pretty hard.
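To make "one instruction controls several subsystems" concrete, here is a toy transport-triggered machine (port names and the single-bus layout are invented for this sketch): an instruction is just a set of moves, and writing to a function unit's trigger port starts the operation.

    class ToyTTA:
        # Entirely illustrative; real TTAs have multiple buses and sockets.
        def __init__(self):
            self.regs = {"r0": 0, "r1": 0, "r2": 0}
            self.alu_operand = 0
            self.alu_result = 0

        def read(self, port):
            return self.regs[port] if port in self.regs else self.alu_result

        def write(self, port, value):
            if port in self.regs:
                self.regs[port] = value
            elif port == "alu.operand":
                self.alu_operand = value
            elif port == "alu.add_trigger":          # trigger port: starts the add
                self.alu_result = self.alu_operand + value

        def step(self, moves):
            # One instruction = several moves (source port -> destination port).
            for src, dst in moves:
                self.write(dst, self.read(src))

    m = ToyTTA()
    m.regs.update(r0=2, r1=3)
    m.step([("r0", "alu.operand"), ("r1", "alu.add_trigger")])  # two moves, one instruction
    m.step([("alu.result", "r2")])
    assert m.regs["r2"] == 5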
Sorry, but you don't happen to be a famous bending robot by any chance?
On a more serious note, the pipeline layout could be a side effect of how Shima worked with logic and transistor layout simultaneously (he was not involved in the 8086, but a lot of the design came from his 8080).