IMO the base+index encoding (in 16-bit x86) was a mistake; it should have simply allowed indirection through any single register (+ immediate offset), including SP.
• the address calculation would be simple enough to do hardwired even on the original 8086, instead of using slow microcode. The 8086 actually took a different number of clock cycles depending on the registers used for base and index, because there wasn't enough ROM space to duplicate the code for every combination without using jumps!
• you could use [SP+xxxx] to access local variables (frame pointer omission; see the sketch after this list)
• would be able to emulate the 8080's LDAX/STAX and XTHL in a single instruction
• with so few registers, you really want every one available for memory addressing. Especially AX, which is used implicitly by instructions like LODSW and MUL.
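To make the [SP+xxxx] point concrete, here's a rough sketch; the [sp+...] form is of course hypothetical and not valid on any real 8086:

    ; what you actually have to do today: burn BP as a frame pointer
    push bp
    mov  bp, sp
    mov  ax, [bp+4]    ; first stack argument (return address sits at [bp+2])
    pop  bp
    ret

    ; with the proposed [SP+disp] mode (hypothetical):
    mov  ax, [sp+2]    ; first stack argument, no frame pointer needed
    ret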
TBH I'm still flabbergasted that modern SoCs will "just put a CPU there" rather than a DMA, because we can actually afford it. 6502 cores are sprinkled around like fairy dust.
I remember someone who was acutely familiar with the design of my first computer (the Ohio Scientific Superboard II, an awesome 1970s 6502 machine! http://oldcomputers.net/osi-600.html ) saying that he spotted the same hardware being used, in its entirety, as a printer buffer a few years later. (A printer buffer was basically a device used to cache the output of a computer to a slow printer so the original machine didn't have to wait for the printer to finish. A glorified cable, in other words.)
I cut my computing teeth on a sibling of that machine, an Ohio Scientific C2-8P that my brother and I bought in 1979. I was 12, and reverse engineering the games that came on the cassette tapes with the machine is how I got started in software. The Arduino Uno (Atmel/Microchip ATmega328P) is a much more powerful computer lol.
The main reason we got networks was not PC-to-PC communication but sharing printers and hard disks. Communication came later. Do not underestimate the productivity gain of a printer buffer.
Well, I was pre-emptively explaining it, because I realised in mid-anecdote that printers don't need that kind of thing these days. Maybe you are right, and it is obvious to someone skilled in the art, as the patent applications say!
They do need them, it's just that the buffer is built in to the printer now because memory and integration are so cheap.
The giant Kyocera printer at work has a little screen where it shows you it not only loading up the print buffer, but purging the information from the disk afterward so there is no trace of PII remaining.
They have them, because, as you said, it's so cheap. But I think dannyobrien is correct; they probably don't need them, because modern OSes have print queues and the computer is no longer useless while printing.
Even if the printer is dumb and doesn’t interpret a page description language such as PostScript, I don’t think it’s feasible to get the timing right to drive modern printing hardware when the buffer is on some random other piece of hardware whose OS you don’t even control, even if the connection is by cable.
It’s nice to be able to run off a long document, pack up your laptop, and pick up the document on your way out the door to the (courtroom|client|boardroom|exam hall|etc.).
The 1982 Victor 9000 / Sirius 1, designed by Chuck Peddle of Commodore/MOS 6502 fame, had its floppy capacity extended from the IBM PC's standard 180KB all the way up to 600KB per side. A 1.2MB floppy in 1982, using the same magnetic medium! How? A software-controlled drive with a 6502 doing GCR, combined with Zoned Constant Linear Velocity and zone bit recording. There were still hard drives being manufactured without ZBR eight years later (Seagate ST-157A)!
There was also an IBM PC 8088 clone in ~1984 with "dedicated video DMA" implemented by an Intel MCS-48 microcontroller (8042/8048, the same chip as the keyboard controllers of the time); sadly I can't remember the name and Google fails me :( I do remember nobody used it for anything, and the only coding example I could find at the time was a single YT video where someone reverse engineered the manufacturer's demo and reused it for their own sprite engine.
Most FPGA projects need a processor anyway, usually adding a soft core, but a hard core is superior in area and performance -- though you cannot choose the type of core anymore, obviously. I'd argue this is a different scenario, because they didn't "sprinkle cores" just because they had the area to spare; the hard core actually reduced the area needed.
In college (more than a decade ago for me, but still in this millennium) our MOS 6502 hardware/assembly programming lab was called "Microcontroller Design". This was as opposed to the later lab class "Microcomputer Design", which used the more recent Motorola 68k. I suppose the chip that powered the original Macintoshes is slightly closer to today's idea of a microcomputer than the 6502 and its original "microcomputer revolution" Apple II/TI Home Computer/Commodore 64. One of the takeaways from those courses was that the "microcomputer" has moved on so much from what you can breadboard in a lab (the Motorola 68k was one of the last "microcomputer" chipsets that you could breadboard, and even that was still a hardware reliability challenge), but the MOS 6502 was cemented as a "forever" tool as a "microcontroller". You can still get ICs in breadboard form and build it next to classic TTL logic chips and test/prototype it at human-scale clock speeds (things you can see/read on an oscilloscope in real time). You can merge it into an SoC core and run it at modern processor clock speeds. There are so many wild uses for the MOS 6502 as a "microcontroller" on the periphery of computing. Between MOS' bankruptcy, the many licensees of the 6502, and the many people who cut their programming teeth in the "microcomputer revolution" of the Apple II et al, the 65xx family likely really is a vestigial "forever chip".
As a programmer, and I've heard this sentiment from others too, I also think the 6502 was the last "fun" assembly language to write directly without a higher level language, which adds to its legacy.
For a lot of things a 6502 or other 8bit cores are more than enough. Don't need to put a giant ARM or RISC-V into everything to drive a simple display.
For the core alone, sure, but don't you have to run them off static ram or something (that is, not slow DRAM) to get anywhere near those speeds in practice?
While it does exist, I think the answer is not really. Yes, sometimes you don't need more than 8 bits, but existing solutions from other vendors beat it with integrated RAM + flash + IO, etc.
I'm not talking about adding the 32-bit ModRM & SIB byte to the 8086, just the single-register subset with no special cases. Decoding would be exactly the same as for register operands. And there would still be microcode, just not for adding the base+index register - maybe for adding an offset if it is present.
Yes, the 8086 was simple, but all the parts needed to do this should be present: whether an instruction uses register or memory operands is entirely under the control of hardwired logic; the microcode is the same for both (the address calculation subroutine runs first, but is invoked automatically by the hardware).
I would say Intel chose to make addressing more complicated in order to add a feature that complicated register allocation for very little benefit. If you see something like [BX+SI] in disassembled code, in 99% of cases it's not an actual instruction, but data or misaligned code.
The base+index encoding is a significant code size reduction though, and as the element size is a restricted set of powers of two, the scaling is cheaper than a generic shift. So base+index uses less memory (less icache specifically), less instruction dispatch overhead, and is able to perform the needed math more efficiently. I assume in modern chips most of these costs aren't really an issue, but that wasn't the case back when the ISA was created.
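A rough illustration of the size difference (byte counts from memory, so treat them as approximate):

    ; base+index: one 3-byte instruction (opcode, ModRM, disp8)
    mov ax, [bx+si+8]

    ; without it: an extra register gets clobbered and the sequence is
    ; more than twice the size
    mov di, si
    add di, bx
    mov ax, [di+8]

    ; (and on the 386+, the scale in [ebx+esi*4] likewise replaces a
    ; separate shift instruction)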
Another big mistake was to make the paragraph size 4 bits. Should have been at least 8 -- that would have given the 8086 a 24-bit address space, which would have made things verrry interesting in the MSDOS / early Windows world.
(I know, they didn't have address pins on the 40-pin package. Still, an architecturally guaranteed 16MB of room to play in would have changed the face of the software industry in the 1980s).
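For anyone who hasn't done the segment arithmetic lately:

    physical = segment * 16  + offset
    max      = 0xFFFF  * 16  + 0xFFFF = 0x10FFEF    (a hair over 1 MB)

    physical = segment * 256 + offset               (8-bit paragraph)
    max      = 0xFFFF  * 256 + 0xFFFF = 0x100FEFF   (a hair over 16 MB)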
> I know, they didn't have address pins on the 40-pin package.
You are right that it doesn't have to be a limitation on the internal architecture: the 386SX internally had the 386's full 32-bit architecture, but only 24 address bits were carried out to the external pins. This, along with squeezing the 32-bit data lines through a 16-bit data bus by doubling up requests, made it able to be used on cheaper motherboards based on 286 designs.
I'm not sure that this would have made enough difference to be worth the cost of the extra silicon at the time of the 8086's design, though. While a flatter and roomier address space would have been convenient for developers, there would have been little or no benefit for end users. The 386SX was a success because it could run software intended for the original 386 (then renamed 386DX) on cheaper hardware (and had the same MMU allowing virtual memory, so it could run larger software than its 16MB physical RAM limit made possible, at a further performance cost). Also, the extra silicon requirements were not as expensive at the time the 386SX was being designed as when the 8086 was.
In fact there was a chip that did something similar for the 8086: the 8088. It had an 8-bit external data bus and broke up 16-bit requests to make them over it. Though unlike the 386SX, it maintained the same address bus as its relative.
> an architecturally guaranteed 16MB of room to play in would have changed the face of the software industry in the 1980s
I don't think it would have, because such amounts of RAM were prohibitively expensive into the late 80s. It would have changed the way software accessed the available RAM, made life easier for developers and improved compatibility by removing various kludges, but largely computers would have looked and behaved similarly, having largely similar specs.
VisiCorp was developing their VisiOn platform in the 1984-ish timeframe. They were bending heaven and earth to fit into 640K.
Telling their corporate customers, "Buy another 512K of RAM" would probably have saved their company and let them ship a number of world-class productivity products, and given Windows some interesting competition.
It's easy to second-guess this stuff; on the other hand, 512K of RAM was about $120 in 1985. We shipped the Atari ST with that much, when having that amount of memory on a Mac was pretty rare.
Given that the 8086 has SI and DI with a +imm8 mode, like the (IX+imm8) and (IY+imm8) instructions on the Z80, I do wonder if Z80 CP/M machines were the real target rather than the 8080.
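The shapes do line up almost one to one:

    ; Z80:   LD A, (IX+5)
    mov al, [si+5]      ; 8086: same idea, index register plus 8-bit displacement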
I did write an 8086 assembly based Z80 emulator (in the late 80's) that was surprisingly compact and had acceptable Z80 performance on a 12MHz AMD 80286. It passed CP/M system calls to Concurrent DOS.
I should see if I still have the code and run an emulator torture test to see how bad it was.
I did see some opportunities for additional code-golf. Also noticed that EX AF,AF' did not properly exchange the flags (so that will need to get fixed!).
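For the curious, one cheap way to do EX AF,AF' on the 8086 (just an illustrative sketch: it assumes the Z80's A lives in AL, the Z80 flags live in the 8086 FLAGS, and the shadow pair sits in a memory word called alt_af):

    lahf                  ; flags -> AH, so AX now holds A and F together
    xchg ax, [alt_af]     ; swap the live pair with the shadow A'/F'
    sahf                  ; the shadow flags become the live flags

    ; caveat: SAHF only restores SF/ZF/AF/PF/CF, so the Z80-specific flag
    ; bits (P/V and the undocumented ones) still need separate bookkeeping.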
When I first got into assembly language programming, I used MS-DOS DEBUG as my assembler, since it was included on basically every PC in those days.
I was extremely frustrated because my textbook's later chapters used MASM as the assembler, which at the time cost some ghastly amount of money (somewhere around $180 if I recall correctly, which was a truly staggering number of birthday and Christmas presents).
Anyway, my point is, in 2022 everyone takes for granted that an assembler for any system that matters can be downloaded for free from the Internet -- or if one doesn't exist yet, it will soon be written by some enthusiast and posted on GitLab.
But back in the 1980's and well into the 1990's, assemblers were expensive products that companies actually got away with charging a lot of money for. Presumably they made a significant amount of revenue from those sales.
So the use case that immediately springs to my mind is being able to prove a particular commercial binary was assembled by your assembler, in order to make a successful accusation of software piracy and extract a monetary settlement from its developer.
Back in the old Zortech days in the 1980s, a programmer sent us a macro assembler he had coded up. He wished us to include it with Zortech C++, and pay him a royalty.
My partner was impressed with it, and sent it along to me to evaluate.
My tests showed close compatibility. A bit too good. It didn't take me too long to figure out it was the Microsoft MASM program with the copyright notice patched out.
Dodged a bullet with that one.
The reason assemblers were expensive in those days is they were written in assembler, which is far more expensive than writing in C.
There were other reasons assemblers were expensive in those days, as you know:
① The number of customers was smaller than it is today, so to remain profitable the NRE cost had to be paid by a smaller number of sales.
② Legal proprietary PC software distribution was competing with user group meetings, Hamvention, and 1200-baud underground BBSes where people used names like "Warlord of Chaos", not pop-up PC repair shops, anonymous FTP sites, Eggdrop, Napster, BitTorrent, Debian, and GitHub. Faster and cheaper ways of sharing software both made it cheaper for enthusiasts to collaborate on writing free software and made illegal copies cheaper and more accessible. It's easier to stomach paying US$180 (say, US$500 today) for MASM when your alternative is to spend your weekend and US$60 making long-distance calls to BBSes that might or might not have a copy.
Yes, this is very much true. For the first three systems I owned I wrote my own assembler, just to save some money (which was incredibly tight in those days, as in: that was three months worth of food or so).
Time was worth relatively little in comparison and I figured I will at least know the instruction set inside-out.
Getting it to the point where it would assemble itself was pretty much the goal of the whole exercise. Hand assembly with Leventhal's book (which I still have :) ) to get it to work, usually one subroutine at a time. And then once it worked, things got much quicker; the bigger problem then was a proper editor.
Memory management was a tricky part and macro expansion only got done after the whole thing worked (that would have saved a lot of time but I couldn't see the wood for the trees at that point).
The nasty part was that instructions were variable length so you could get into these weird errors where your initial guess for a jump target was too small, which would then move everything else as well, which might lead to other jumps becoming too small. I don't have the code any more but I think I fixed that by making all the jumps long jumps initially. A bit wasteful but that way I could at least get it to work in one pass. Later on when I got better at assembly that sort of thing got fixed.
The end product was endless 'data' statements that were poked into memory at 48K and up, right over the basic interpreter, which we had figured out a way to page out and replace from software. This later led to an annotated version of the basic that came with that computer (a version of MS Basic) and subsequently a replacement interpreter with a lot of built in goodies (double speed tape interface, a whole raft of new statements and functions). Good times!
Shout out to my friend Henri G. who did at least half of that work, we had identical home computers (Dragon 32) and worked together a lot on this. He probably still has his in working order :)
> ... initial guess for a jump target was too small ...
Yeah, that one is annoying.
I am making a VM compiler right now, and I have the same problem. I would probably have to make some intermediate tree to solve it without making every jump a "far jump".
Sometimes you can get into a series of these where each subsequent increase in length results in a previous jump that worked before now being out of range. That can get quite tedious.
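For a concrete example of why it cascades, in 8086 terms (the same thing happens on any CPU with short branches): a conditional jump only gets a signed 8-bit displacement, so once its target drifts out of range the assembler has to rewrite it, and the rewrite itself grows the code:

    je  target          ; 2 bytes, only legal while target is within -128..+127

    ; ...has to become...

    jne skip            ; 2 bytes, inverted condition
    jmp target          ; 3 bytes, 16-bit displacement
    skip:

    ; that 3 bytes of growth can push some *other* short jump out of range,
    ; so you either iterate until nothing changes or start with the long
    ; form everywhere, as described above.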
When I was in high school, I hand-assembled my 6502 code. I’d write it out longhand on graph paper, then write down addresses in the leftmost column, opcodes in the middle & finally I’d type in all the hex from the Apple Monitor. It was slow, painstaking and error-prone, but it made for clear thinking about how the program worked before I sat at the computer.
There’s a story about someone watching Woz entering hex into the computer directly to write a program, and every now and then he’d pause for a little. They asked him why and he said he was calculating the offset for a forward branch, which would mean he’d have to think of all the code up until the branch destination and count the bytes. Then he could carry on and enter those bytes.
That's pretty amazing. I would do something similar but easier: invert the branch, do a jmp [word], and then continue as though nothing changed. Then, after you know the destination address, fill it into [word]. That way you don't end up running out of memory (your own, not the computer's).
It probably also really honed your ability to simulate a CPU in your head. That's invaluable when looking at code, to build up a mental model of what it really does (rather than what the comments say it does).
I did the same (after reverse-engineering the opcodes by filling memory with sequential bytes and listing the output) until I discovered that the Apple //e (and maybe others) included a built-in mini assembler: https://www.youtube.com/watch?v=yJp2TnKhnzY
The Mini Assembler was part of the Apple ][ ROM with Integer BASIC but not on the ][+ ROM with AppleSoft BASIC. I think I eventually got a floppy from someone at a math competition or something like that which had a loadable Integer BASIC image that could be loaded into the RAM of a 64K Apple ][+ or //e, but that was not long before I graduated high school and at a point where I was doing most of my programming on a VM/CMS system (and then later VAX/VMS).
The //e "Enhanced" model (which I had) had the mini assembler built-in. [1] The monitor even had a shortcut to start it -- "!".
(Had to Google to remember this bit of trivia again -- funnily enough turned up this [2] HN comment I replied to 9 years ago with nearly the exact same comment I made above.)
The company I worked for in the 1970s, Aph, developed a lot of embedded systems. Their secret weapon was that they developed macro assemblers for the various microprocessors in BLISS on a PDP-10. (I think Dan O'Dowd, yes, that Dan! developed them.) With a proper macro assembler, as good as the ones on mainframes, that ran on a mainframe, you could develop code much faster.
There was the a86/d86 suite for assembly and disassembly, which was shareware as long as it wasn't used for commercial purposes. These worked well enough.
They were shareware, but that's not the same as freeware. He did expect payment for the software if you actually used it beyond evaluation, commercially or not. As I recall he wanted $50 for a license.
As incentive, he'd send you enhanced versions of the software (with 386 opcode support iirc... or was it 32-bit support?), and a ringbound hardcopy of the manuals.
Before that it was a case of writing code on graph paper, looking up the opcodes in the manual, and assembling by hand. Needless to say that writing programs was slow and difficult.
I think I jumped straight from paper -> A86, without using debug.com. Not sure why.
The cool thing to do with it is add it to your favorite text editor as a command - highlight some of your text and disassemble it! Great party game - what is your name disassembled? Valid code, or are you a seg fault?
Yes, it’s difficult these days to imagine the software situation of the early 1980s: how little software there was, how expensive it was, and how much effort was spent on copy protection that sometimes prevented legitimate customers from using it.
I was just reading a book called “Almost Perfect”, a first-hand history of WordPerfect by one of the founders. He mentions that they bought an IBM PC when it came out and wanted to port their word processor (over from the Data General minis that had been their initial platform). But they had to wait five months for an assembler to become available.
Imagine buying a $5,000 computer and you can’t even get an assembler for it for any money.
(WordPerfect was written entirely in x86 assembly, even when it was a billion-dollar product.)
Writing an assembler for the 8086 doesn't take five months! Especially if you have a Data General mini to run it on, which already had an assembler and scripting with domacros.
In fact, if memory serves, RDOS had FORTRAN, ALGOL, and BASIC available, even if most of the system software was written in assembler.
Writing an assembler in BASIC on either the Nova or the IBM PC would have been considerably less hassle than writing it in assembly and hand-assembling it.
Presumably they didn't want to spend programmer time writing their own macro assembler when they knew others were working on a commercial product. The DG version of WordPerfect was shipping already, so they probably preferred to spend the time improving that and wait for the PC-DOS ecosystem to evolve a bit.
> $180 if I recall correctly, which was a truly staggering number of birthday and Christmas presents
I had the same problems, only I was programming a Laser 128 (an Apple II clone) that had a built-in System Monitor, which let you directly change memory addresses, but no assembler.
So I photocopied the Assembly to Machine Language chart in a book I got from the library, and did it manually - I learned quite a lot. I remember writing a layout and edge detection routine for a Tetris clone I tried to write (never finished it). Basic was too slow for those critical parts, so I called out to the routine, but did most of the code in Basic.
I remember being utterly wowed that Linux came with a free compiler - I could only dream of having software like that back then.
If you sell a commercial compiler or assembler, and some company starts selling a product produced with your compiler/assembler, and you know you never sold them a license, you've got yourself someone to sue. For a less successful/commercial product, it could also just be for the joy of finding it used in the wild.
Similar fingerprinting techniques were also used by assembly coders as proof of copyright. When reverse-engineering to a spec, it's not unusual for routines, even long routines, as well as lookup tables, etc. to end up being coded the same, or nearly so, as the original software. This isn't necessarily infringing of copyright -- for example there are only a few sensible ways to check if a character is uppercase in ASCII. It can be hard to tell reverse-engineered and re-implemented code, from disassembled and reshuffled code (which is copyright infringement). If the suspect routine incorporates something like an idiosyncratic sequence of NOPs that serve no particular purpose, coincidence is a lot harder to argue.
Can anyone find a version of A86 that actually encodes some information in its output, or is this just a legend? It produces different opcodes for some instructions than MASM, but so do DEBUG and various other assemblers. And the same instruction repeated multiple times always assembles to the same opcode as far as I have seen.
To prove some program was made with A86, there would have to be a more specific signature that wouldn't match any other assembler, like a copyright message encoded bit-by-bit in the opcode choice.
Tried it out now with different instructions. There's a pattern, though it's not quite a steganographic message, just a different encoding chosen depending on the second register operand.
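Concretely, most register-to-register operations have two legal encodings thanks to the direction bit in the opcode, so which one an assembler picks is a fingerprint in itself (byte values from memory, so double-check before relying on them):

    add bx, ax      ; 01 C3   "r/m <- reg" form
    add bx, ax      ; 03 D8   "reg <- r/m" form: same instruction, different bytes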
It was sold under a shareware model, and if you used it commercially you were supposed to buy the licence. I guess they randomly sampled other people's software to see if it looked like it was built with a86 or not, but honestly I'd never actually heard of them doing that and suspected that it was just enough of a threat to know it was possible that they hoped people would pay up.
I am looking into writing some basic compiler(s) but I really want to emphasize micro-optimizations, maybe even the stochastic ones -- don't ask why, it's just an obsession.
Question: is there Arduino-like hardware that allows FULL visibility into the efficiency of every piece of code run on it? On my first job, 20 years ago, there was a team working on embedded boards that maybe did something similar.
But again, I am looking for a piece of hardware -- and a software stack for it -- where you can say "here's this chunk of assembly / machine code, run it and give me detailed statistics on which instruction ran the most times, which one took the longest time", etc.?
Using 64-bit registers, or any of R8-R15, requires a prefix byte. The single-byte INC/DEC opcodes have been repurposed for this. And for performance reasons having to do with how it affects the flags, you don't want to use INC/DEC (now two bytes) at all.
So now the very common operation of incrementing a register takes at least three bytes, four if it needs the prefix. Legacy instructions have been made invalid - depending more on the whims of Intel/AMD than whether they are actually useful - but still clutter up the opcode space.
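In byte terms (encodings quoted from memory):

    inc eax            ; 40           1 byte in 32-bit code
                       ; in 64-bit code, 40-4F are REX prefixes, so instead:
    inc eax            ; FF C0        2 bytes
    add eax, 1         ; 83 C0 01     3 bytes, the flags-friendly form
    add rax, 1         ; 48 83 C0 01  4 bytes once the REX.W prefix is needed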
Obviously it has expanded, and not saying "let's completely replace the ISA in 64-bit modes" means that a bunch of cruft is going to keep consuming the opcode space. At the same time there is a lot of code that isn't using 64-bit values, so the compact encoding can still be used.
But while talking about ISA size and compactness, we need to compare to the real-world alternatives: the various ARM and RISC-V ISAs, etc., are all 4 bytes per instruction, with the occasional 2-byte variable-length extension*. Four bytes does seem to be what has been selected for.
My experience with code size is wrapped up in jitting JS so I could believe my experience isn't as reflective of real world code size as it could be :D
* fun story: Thumb-2 had an early bug/erratum where a 4-byte branch spanning a page boundary would jump to the wrong page.
I thought the trailing-off title was some sort of clickbait (which would be weird for Chen!); but actually the title is "Yes, the 8086 wanted to be mechanically translatable from the 8080, but why not add the ability to indirect through AX, CX and DX?", and I guess it just got asemantically chopped on HN's end. (To avoid my own clickbait, the one-sentence answer from the article, on which of course the rest of the article elaborates, is "Basically, because there was no room.")
There is a more-or-less comprehensive archive[1] up to 2019 (which should probably be scraped and hoarded). The article in question is there, comments included[2].
Side note: Michael Kaplan’s blog, which Microsoft took down in a remarkably shameful manner, has also been archived[3], while Eric Lippert reposted (most of?) his old articles on his personal WordPress instance[4].
It’s still there[1], although a number of the posts seem to be missing. Inexplicably, (most of) the comments are in place. (There are other Microsoft blogs that have disappeared without a trace, but not this one.)
I don't understand the reasoning behind this. Why do you need 5 bytes of unexecuted patch space before the program _and_ 2 bytes of patch space at the beginning of the program?
Wouldn't it be the same to have a single 5-byte effectless operation to patch a single long jump instead of needing space for two jumps?
Because you can’t atomically replace the NOPs. So there’s nothing to prevent you from inserting your patch while a thread is partway through consuming the NOPs, resulting in a portion of your patch being decoded out of order.
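The layout, as I understand it (a sketch, not literal compiler output):

    ; five bytes of padding emitted just before the function:
        nop
        nop
        nop
        nop
        nop
    func:
        mov edi, edi      ; 2-byte do-nothing at the entry point
        push ebp
        mov  ebp, esp
        ...

    ; To patch: write a 5-byte "jmp detour" into the padding (no thread ever
    ; executes those bytes, so that write doesn't need to be atomic), then
    ; atomically overwrite the 2-byte "mov edi, edi" with a 2-byte short jump
    ; back into the padding. No thread ever sees a half-written instruction.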
Modern x86 processors decode multiple instructions per clock. By “slots”, I’m assuming he means entries in the dispatcher or reservation stations. But NOPs don’t even make it to there. As I said, the decoder that encounters it will probably swallow it and emit nothing.
Besides, it sounds like premature optimization. This isn’t the 1980s; an extra clock cycle per function call is not going to make or break your program.
There is a very good chance this dates back to 16-bit Windows. Even Windows 98 supported the 486, which was not capable of independent execution (that’s the P5) or of separating decode from execution (that arrived with the Pentium Pro).
Bygone? It's a fairly regular occurrence where I search for some Win32 API and end up on Chen's blog. I get what you are saying about articles like this one though. I can't imagine much new work is happening with the 8086 (or 8088 or '186 or '286 or ...)
I mean even w.r.t. Windows stuff, quite a large part of what he talks about is variations on "here's an interesting old anecdote from 1995" or "You know that thing that's totally useless now and doesn't matter ? Here's why we absolutely needed it in 1989."