It's interesting to read these articles about the implementation of an actual CPU, compared to the very idealised perspective that a lot of CS (and even some hardware-oriented) textbooks have. There's a lot of complexity that those who are used to the latter would otherwise not have seen.
Also worth noting: like the Z80 and 8080/8085, but unlike the 6502, the designers of the 8086 opted for instructions that take more cycles and made up for it with a higher clock frequency, whereas the 6502 has shorter instructions and a lower clock frequency.
> and the hardware fills in the particular ALU operation from bits 5-3 of the machine instruction.
This is clearly seen when the instructions are represented in octal:
0u4 alu AL, imm8
0u5 alu AX, imm16
The ALU operations are respectively: ADD, OR, ADC, SBB, AND, SUB, XOR, CMP.
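To make the pattern concrete, here is a small decoding sketch (my own, in Python, not from the article): the octal "u" digit is exactly bits 5-3 of the opcode, and the low bit selects byte vs word.

    # Decoding sketch for the "ALU accumulator, immediate" group (octal 0u4 / 0u5).
    ALU_OPS = ["ADD", "OR", "ADC", "SBB", "AND", "SUB", "XOR", "CMP"]

    def decode_alu_imm(opcode):
        # Opcode must look like 0u4 or 0u5 in octal (bits 7-6 = 00, bits 2-1 = 10).
        if opcode & 0o307 not in (0o004, 0o005):
            return None
        op = ALU_OPS[(opcode >> 3) & 7]     # bits 5-3 pick the ALU operation
        reg = "AX" if opcode & 1 else "AL"  # low bit: byte (AL) vs word (AX)
        return f"{op} {reg}, imm"

    assert decode_alu_imm(0o004) == "ADD AL, imm"
    assert decode_alu_imm(0o065) == "XOR AX, imm"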
According to the datasheets, the 80186 can execute the same instruction in 3 cycles, and the 80286 can do it in 2.
If you are interested in this topic, the book "The Anatomy of a High-Performance Microprocessor: A Systems Perspective" by Bruce Shriver and Bennett Smith describes the architecture of the AMD K6 processor (late '90s) in detail, including the structure of the pipeline and microcode.
> instead of the 6502 that has shorter instructions and a lower clock frequency.
The 6502 is a simpler CPU: 8 bits wide, with fewer registers and fewer instructions. It uses a PLA instead of microcode, presumably because it could still get away with that.
> According to the datasheets, the 80186 can execute the same instruction in 3 cycles, and the 80286 can do it in 2.
I've disassembled the 80186 microcode; it's very similar to the 8086 but with some enhancements. Most importantly, multiply/divide and address calculations like [BX+SI+imm] are now handled in hardware rather than microcode. The "update flags" bit has another role when combined with a jump: it means the following µ-instruction will always be executed during the next cycle (like the branch delay slot on MIPS).
Sometimes a jump without this bit set seems to be needed, for no reason other than to delay execution. An ALU operation with an immediate operand on the '186 uses this code:
0 Q -> tmpb 0 L16 1
1 M -> tmpa 1 XI tmpa NX
2 SIGMA -> M 4 none RNI F
Here the first line can load either 8 or 16 bits into tmpb, with a conditional jump to the next line (causing a 1-cycle delay) if the operand is 16 bits. For 8-bit operands it should be three cycles total, and maybe also for 16-bit operands when pipelined?
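For readers not used to the notation, my reading of those three lines as straight-line pseudocode (Q = immediate bytes from the prefetch queue, M = the register named by the instruction, XI = ALU operation taken from the instruction, SIGMA = ALU result; these interpretations follow the article's conventions and are my assumption here):

    # Illustrative model only, not actual microcode semantics.
    def alu_reg_imm(regs, dest, imm, op):
        tmpb = imm             # 0: Q -> tmpb (extra cycle via the L16 jump if imm is 16-bit)
        tmpa = regs[dest]      # 1: M -> tmpa (XI: ALU operation comes from the instruction)
        regs[dest] = op(tmpa, tmpb) & 0xFFFF  # 2: SIGMA -> M, update flags (F), RNI
        return regs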
A more interesting example is BOUND, which compares a register with a (signed) lower and upper bound in memory, in a somewhat convoluted way:
0 R -> tmpa 6 R DD,P2
1 OPR -> tmpb 1 SUBT tmpa
2 F -> tmpc 0 UNC 9 F
3 SIGMA -> tmpa 1 SHL tmpa F
4 CR -> OPR 5 UNC INT_N
5 R -> tmpb 6 R DD,P0
6 OPR -> tmpa 1 SUBT tmpa
7 4 CF1 none
8 SIGMA -> tmpa 1 SHL tmpa F
9 SIGMA -> none 0 OF 12
10 tmpc -> F 0 CY 4
11 0 UNC 13
12 tmpc -> F 0 NCY 4
13 0 NF1 5
14 4 none RNI
Step by step, the following happens:
(0) move the register operand into tmpa, read lower bound from memory
(1) move lower bound into tmpb, set ALU to subtract tmpa-tmpb
(2) preserve old flags in tmpc; execute next line and then goto line 9
(3) move ALU output into tmpa, shift it left and update flags
(9) if overflow, goto line 12
-> (12) if carry clear, goto line 4 (-> "out of bounds" exception); restore flags
else
-> (10) if carry set, goto line 4; restore flags
(11) goto line 13
(13) if internal flag clear (should be true), goto line 5
(5) move register operand into tmpb, read upper bound from memory
(6) move upper bound into tmpa, subtract tmpa-tmpb
(7) toggle internal flag
(8) move ALU output into tmpa, shift left & update flags
(9-12) same as before
(13) if internal flag clear (should be false), goto line 5
(14) run next instruction
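Ignoring the flag gymnastics, the net architectural effect of that sequence is the usual BOUND check. A rough Python model of the behaviour (mine, for orientation only; it models the result, not the microcode):

    def bound(reg, lower, upper, width=16):
        # Raise "INT 5" if reg is outside [lower, upper]; all values are signed.
        def signed(x):
            x &= (1 << width) - 1
            return x - (1 << width) if x & (1 << (width - 1)) else x
        if not (signed(lower) <= signed(reg) <= signed(upper)):
            raise Exception("INT 5: BOUND range exceeded")

    bound(5, 0, 10)          # in range, no exception
    # bound(0xFFFF, 0, 10)   # -1 is below the lower bound -> would raise INT 5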
While the x86 instruction set can do both signed and unsigned comparisons, the microcode - for the 80186 at least - seems to be limited to testing the carry and overflow flags individually (sign bit xor overflow means a < b for signed integers). The internal flag is used as a 1-bit "loop counter". It is normally cleared at first, except when there is a REP prefix. Executing "REP BOUND" on the 80186 will only compare against the lower bound!
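The "sign xor overflow" rule mentioned there is the standard signed-comparison identity; in code form (illustrative):

    def signed_less_than(a, b, width=16):
        # a < b for signed integers, derived from the flags of a - b (SF xor OF).
        mask = (1 << width) - 1
        a, b = a & mask, b & mask
        diff = (a - b) & mask
        sf = diff >> (width - 1)                    # sign flag of the result
        of = ((a ^ b) & (a ^ diff)) >> (width - 1)  # overflow of the subtraction
        return bool(sf ^ of)

    assert signed_less_than(0xFFFF, 1)       # -1 < 1
    assert not signed_less_than(1, 0xFFFF)   # 1 < -1 is false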
Apparently keeping the microcode short was more important than efficiency, or they would have used a few more instructions instead of jumping backwards.
Starting with the '286, the REP prefix would be handled during decoding, either ignored or going to a completely different microcode address.
Also on the 80186 but no other processor, AAM with an immediate operand of zero does not cause an exception (because the check for a zero divisor is still done in microcode, and AAM doesn't do it since the only documented variant is the one dividing by 10 decimal). The result is AH=FF, AL=unchanged.
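For context, AAM divides AL by its immediate operand (normally 10), putting the quotient in AH and the remainder in AL. A tiny model with the quirk described above; the zero-divisor branch reflects this comment's claim about the 80186, not the documented behaviour:

    def aam_80186(al, divisor=10):
        if divisor == 0:
            return 0xFF, al                 # claimed 80186 quirk: AH=FF, AL unchanged, no exception
        return al // divisor, al % divisor  # normal case: AH = quotient, AL = remainder

    ah, al = aam_80186(0x2B)   # 43 decimal -> AH=4, AL=3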
Note that some of the comments in that post may be slightly wrong, and maybe I didn't add enough explanation for some things. Also ignore the stuff about getting the 80286 bits out, I was overly optimistic on that...
When you look at the die shots, do you get any idea of how the masks were drawn in those days? I assume computers of the day would not have been powerful enough for a CAD tool to design a silicon or metal layer of that complexity. Were they designed in smaller parts and then stitched together into a final design? -- perhaps evident by different parts having different design styles, or not quite using all the space between functional units. Were all parts done on a computer, or some pieces hand-drawn?
There's a detailed article "Recollections of Early Chip Development at Intel". For the first 10 years, Intel drew schematics and layout by hand. In 1974 they started digitizing on a Calma GDS I system. For the 8086, they spent two weeks manually matching the schematics against the drawn plots looking for errors and found 20 errors.
They cut out the drawings on sheets of Rubylith to create the masks. The earliest chips used a "Coordinatograph", a tool where you'd manually enter the coordinates from the drawing and it would move the cutter appropriately. Then they moved to a Xynetics plotter with a knife. The Rubylith sheets would be sent to the mask vendor, who would photographically reduce it down to mask size.
Intel had simulation tools, but they could only simulate 5-20 transistors at a time. So they could only simulate small, critical chunks like parts of the ALU.
In case you haven't seen it (although I'm sure you have), Now The Chips Are Down had a rare look inside an Intel fab in the 1970s: https://en.wikipedia.org/wiki/Now_the_Chips_Are_Down There are copies of the film on the internet.
> I assume computers of the day would not have been powerful enough for a CAD tool to design a silicon or metal layer of that complexity.
A mask can be stored as a bitmap image, at 1-bit depth, one image per layer. You only need the image large enough to resolve your smallest features. At 10,000 by 10,000 pixels, one layer only takes ~12 megabytes to store uncompressed. (It would be easily compressed, too.)
A late 70s VAX minicomputer would be able to store dozens of such mask images, and could compute the difference between two masks of that size in some tens of minutes, approximately. That'll tell you if any parts overlap.
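As a sanity check on those numbers (my own back-of-the-envelope code, nothing period-accurate):

    width = height = 10_000                       # pixels, 1 bit each
    print(width * height / 8 / 1e6, "MB/layer")   # -> 12.5 MB uncompressed

    def overlapping_rows(layer_a, layer_b):
        # Each layer is a list of per-row bitmasks (Python ints); return the
        # row indices where the two layers share any pixel.
        return [y for y, (a, b) in enumerate(zip(layer_a, layer_b)) if a & b]

    assert overlapping_rows([0b1100, 0b0011], [0b0010, 0b0010]) == [1]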
Using CAD in that way in chip design dates to at least the 1970s. From there, standard blocks (basically copy and paste) are an obvious improvement over scanning in cut-out designs or setting individual pixels. Actually synthesizing the masks automatically, from high-level hardware description languages, is something that developed gradually over the 1980s.
I can perhaps give some insight into this. One step in the mask making process was hand cutting the design into a masking film called Rubylith. Rubylith is made of a backing sheet coated in a UV resistant red film. The design was hand cut into this film on a large scale then used as a “master” to expose smaller copies.
Sort of a side question, but with all these projects excavating old dies (which are intensely interesting, don't get me wrong), I'm curious whether any of the engineers who worked on this stuff originally (such a short time ago) ever pop up to explain how and why it was done the way it was?
I've talked to some of the engineers to get background on the design of the 8086 chip. They also wrote some papers at the time explaining why things were done particular ways. The 8086 patents are also very detailed; many older patents actually provide a lot of useful information.
Bit off-topic, but what you do is inspiring; you are the kind of person who makes me think "damn, it's crazy what humans can do". We are lucky to have you.
Yes, the 8086 microcode does get some parallelism, but it's a pretty short Very Long Instruction Word :-) An 8086 micro-instruction is just 21 bits long, while VLIW words are much longer. I'm studying the IBM System/360 Model 50, and its micro-instructions are 90 bits long. A Model 50 micro-instruction is so complex, with 28 different fields, that it isn't represented by a line of code but by an 11-line block of code.
The 8086 is just starting to pipeline so it is starting small.
Recently I did some reading about the Transport Triggered Architecture and for the first time thought "I could make a special purpose microprocessor" and might actually try it with an FPGA. That got me thinking a lot about processor architecture, particularly the ability for one (real or micro) instruction to simultaneously control multiple subsystems.
What I see as problematic, though, is that the TTA stalls when it is waiting for memory. I think a high-performance system based on the TTA would need some kind of programmable memory access engine that would try to schedule fetches ahead of time programmatically... But I think then it is getting pretty hard.
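To make "one instruction controls several subsystems" concrete, here is a toy transport-triggered machine (port names and the single-bus layout are invented for this sketch): an instruction is just a set of moves, and writing to a function unit's trigger port starts the operation.

    class ToyTTA:
        # Entirely illustrative; real TTAs have multiple buses and sockets.
        def __init__(self):
            self.regs = {"r0": 0, "r1": 0, "r2": 0}
            self.alu_operand = 0
            self.alu_result = 0

        def read(self, port):
            return self.regs[port] if port in self.regs else self.alu_result

        def write(self, port, value):
            if port in self.regs:
                self.regs[port] = value
            elif port == "alu.operand":
                self.alu_operand = value
            elif port == "alu.add_trigger":          # trigger port: starts the add
                self.alu_result = self.alu_operand + value

        def step(self, moves):
            # One instruction = several moves (source port -> destination port).
            for src, dst in moves:
                self.write(dst, self.read(src))

    m = ToyTTA()
    m.regs.update(r0=2, r1=3)
    m.step([("r0", "alu.operand"), ("r1", "alu.add_trigger")])  # two moves, one instruction
    m.step([("alu.result", "r2")])
    assert m.regs["r2"] == 5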
Sorry, but you don't happen to be a famous bending robot by any chance?
On a more serious note, the pipeline layout could be a side effect of how Shima worked with logic and transistor layout simultaneously (he was not involved in the 8086, but a lot of the design came from his 8080).