Intel/x86 seems to have missed pretty much every major trend but somehow manages to keep up.
At the time, Intel seemed to have pinned the 64-bit future on Itanium. Which, on paper, is a much better architecture. Of course instruction parallelism should be in the hands of the compiler -- it has so much richer information from the source-code so it can do a better job. The MMU design was very clever (self-mapped linear page tables, protection keys, TLB sharing, etc). Lots of cache, minimal core. Exposing register rotation and really interesting tracing tools. But AMD released x86-64, Intel had to copy it, and the death warrant on Itanium was signed, so we never really got to explore those cool features. Only now is x86 getting things like more levels of page tables and protection-key-style features.
Intel missed virtualisation not only in x86 (understandable given legacy) but on Itanium too, where arguably they should have thought about it during green-field development (it was closer, but still had a few instructions that didn't trap properly for a hypervisor to work). The first versions of VMware -- doing binary translation of x86 code on the fly -- were really quite amazing at the time. Virtualisation happened despite x86, not in any way because of it. And you can pretty much trace the entire "cloud" back to that. VMX eventually got bolted on.
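As a concrete (and entirely hypothetical, Linux-flavoured) illustration of the protection-key style feature that eventually arrived on x86: with memory protection keys you can flip a page's access rights per thread without touching the page tables, which is the sort of trick Itanium's MMU design was playing at years earlier.

    /* Minimal sketch, assuming a pkeys-capable x86 CPU, Linux 4.9+ and glibc 2.27+.
       Not an Itanium example -- just what the x86 descendant of the idea looks like. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        int pkey = pkey_alloc(0, 0);                 /* allocate a protection key */
        if (pkey < 0) { perror("pkey_alloc"); return 1; }

        /* Tag the page with the key; page-table permissions stay read/write. */
        if (pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey)) {
            perror("pkey_mprotect"); return 1;
        }

        buf[0] = 'x';                                /* allowed */

        /* Deny writes through this key via the thread-local PKRU register --
           no mprotect, no TLB shootdown. A write here would now SIGSEGV. */
        pkey_set(pkey, PKEY_DISABLE_WRITE);

        pkey_set(pkey, 0);                           /* re-enable before touching it */
        buf[1] = 'y';

        pkey_free(pkey);
        munmap(buf, len);
        return 0;
    }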
UEFI is ... interesting. It's better than a BIOS, but would anyone hold that up as a paragon of innovation?
I'm not so familiar with the SIMD bits of the chips, but I wouldn't be surprised if there are similar stories in that area of development.
Low power seems lost to ARM and friends. Now, with the "cloud", low power is as much a problem in the data centre as your phone. So it will be interesting to see how that plays out (and it's starting to).
x86's persistence is certainly remarkable, that's for sure
> Of course instruction parallelism should be in the hands of the compiler -- it has so much richer information from the source-code so it can do a better job.
This is not true. Yes, the compiler has a lot of useful information, but it does not have access to runtime information, which the CPU does.
A modern superscalar CPU with out-of-order execution can speculatively execute instructions hundreds of cycles before their results are actually needed.
Instructions after conditional branches can be executed speculatively using branch prediction and loads can be executed before stores (even though they might access the same address!). If the processor later detects that the load was dependent on the store, it can "undo" that operation, without any of these effects ever being visible to the running program.
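To make that concrete with a toy example (mine, nothing special): in the loop below the compiler usually can't prove that dst and src never overlap, so it has to schedule the load after the preceding store. An out-of-order core just issues the load early based on the actual runtime addresses and squashes/replays it in the rare case they really do collide.

    /* The compiler must assume the store to dst[i] might feed the load of
       src[i+1] (possible aliasing), so it schedules conservatively. The CPU
       speculates the load past the store and only replays on a real conflict. */
    void scale_copy(float *dst, const float *src, int n, float k) {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];    /* store, then the next iteration's load */
    }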
While you are right, the compiler should still definitely do most of the job.
There's simply no reason to keep track of dependencies, etc. at run time instead of doing this at compile time.
Do you have any examples of when this beats a C program compiled with `-march=native -O2`?
I've read about this promise of JIT optimization being able to generate optimal code on hot paths, but in practice I've never seen it work out that way. Good AOT compilers exist today and consistently generate optimized machine code.
Your C code is already executed by the CPU in a JIT optimized way because of the way the x86 architecture is implemented on the CPU. What more proof do you want?! :)
How so?
When the compiler generates code like
    add eax, eax
    add ebx, ebx
Of course it knows that they are independent.
They could then be represented in the ISA as a block of sorts.
Itanium failed because writing compilers which make VLIW efficient is probably insurmountable for human minds. Donald Knuth himself said that compilers which would make it perform even adequately were nearly impossible to write.
With it being borderline impossible to write good VLIW compilers, most of the instruction word was full of NOPs, which meant terrible code density, which meant an I$ full of NOPs. A recipe for hot, slow computers that are depressing to program for.
It also failed because, at least 6 years before Itanium was even released (2001), there were better, faster options available.
In 1995, the Pentium Pro featured out-of-order and speculative execution[1]. This has two main advantages over the VLIW architecture (Itanium in particular):
1) The processor can use runtime information (such as past history of branches) to speculatively execute hundreds of instructions before they are needed.
2) When a stall happens due to a cache miss, the CPU can in some cases continue to execute other instructions while waiting for the stall to end. This isn't possible with VLIW because the compiler can't predict cache hits and misses; that information is only available at runtime (see the sketch below).
These two advantages combined meant Itanium was much slower than the x86 processors available at the time.
[1]: Note that Intel didn't invent these ideas. Out-of-order execution was already known in the 1960s (the CDC 6600's scoreboard, Tomasulo's algorithm on the IBM System/360 Model 91), although back then it was applied to mainframes and supercomputers. I'm just using the Pentium Pro as an example of a microprocessor that took the same approach.
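A rough sketch of point 2 (my own toy example, not from any real compiler study): the pointer chase below misses cache unpredictably, but the array sum next to it doesn't depend on it, so an OoO core keeps issuing those adds while each miss is outstanding. A static VLIW schedule has to assume some fixed load latency and either stalls or pads with NOPs.

    struct node { struct node *next; long pad[7]; };   /* ~64 bytes: roughly one miss per hop */

    /* Walk a linked list (dependent, cache-missing loads) while summing an
       unrelated array (independent work). Out-of-order hardware overlaps the
       two; a compile-time schedule can't know which loads will miss. */
    long walk_and_sum(const struct node *head, const int *a, int n, long *sum_out) {
        long hops = 0, acc = 0;
        int i = 0;
        while (head) {
            head = head->next;                         /* may miss the cache */
            hops++;
            for (int k = 0; k < 8 && i < n; k++, i++)
                acc += a[i];                           /* fills the miss latency */
        }
        *sum_out = acc;
        return hops;
    }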
>"Itanium failed because writing compilers which make VLIW efficient is probably not surmountable for human minds. Donald Knuth himself said that compilers which would make it perform even sufficiently were nearly impossible to write."
Wow, might you have any links regarding Knuth commenting on VLIW optimized compilers? I would love to read more about this.
What was Intel's view on the viability of it? It seems like quite a shortcoming.
One could do it mainframe style where the I/O was offloaded to one core with the rest doing compute. Those hundreds of thousands of interrupts never hit your single-CPU task. You can get a boost in performance even in your use-case.
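A minimal Linux-flavoured sketch of that idea (mine, and it assumes the IRQ side has already been steered to core 0 by other means, e.g. /proc/irq/*/smp_affinity or isolcpus): pin the compute thread to core 1 so device interrupts rarely preempt it.

    /* Pin the calling (compute) thread to CPU 1, leaving CPU 0 to soak up
       interrupts. IRQ routing itself is configured outside this program. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 == calling thread */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

    int main(void) {
        if (pin_to_cpu(1) == 0) {
            /* ... run the interrupt-sensitive compute loop here ... */
        }
        return 0;
    }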
I had thought of using multiple SoCs for compartmentalization, instead of VMs. Rather like a hardware version of Qubes. I was very inspired by Tinfoil Chat.[0] Using optoisolators, one could have a secure device with strictly text-only communication with other components.
But it's over my head. I lack knowledge and equipment for measuring RF leakage among SoCs. So it goes. Maybe the Qubes team will do it :)
" I was very inspired by Tinfoil Chat.[0] Using optoisolators, one could have a secure device with strictly text-only communication with other components."
Our discussions about physical and diode security on Schneier's blog were a small part of the inspiration for Tinfoil Chat. Markus Ottela then showed up to discuss the design. He wisely incorporated each piece of feedback, like switching from OTP to a polycipher and adding covert-channel mitigation at the protocol level. I loved reading the regular updates he posted about the things he added. The design, not the implementation, is one of the only things I'll call NSA-proof for confidentiality and security, but not availability. By NSA-proof, I mean without TAO getting involved with really hands-on stuff.
One drawback was he didn't know systems programming, so the reference implementation is in Python. People who know statically-verifiable C, Ada/SPARK, or Rust need to implement it on a minimal TCB. I'd start with a cut-down OpenBSD first, just because the NSA will go for 0-days in the lower layers and they have fewer there. Make the implementation as portable across MIPS and ARM as possible so the hardware can be interchanged easily. A trimmed Linux next if drivers absolutely need it, with Poly2 Architecture-style deletion of unnecessary code. If money comes in, implement it on a secure CPU such as CHERI with CheriBSD.
That was my plan when I was talking to him. Lots of potential with TFC after it gets a proper review of protocol and implementation.
"People called the IO processor "the real OS" because they had never seen dedicated I/O processors before."
I didn't know that. That's funny, as it's exactly the confusion I'd expect: the old wisdom of I/O processors in non-server machines was lost for a few generations, and then a new generation sees them and wonders whether it's an extra core for apps, the OS, or whatever.
I think he was getting at how browsers nowadays put each tab in its own process. So on multi-core systems, your "browser" would end up running on multiple cores.
I still believe if AMD hadn't come up with x64 extensions, I would be typing this on an Itanium laptop now.
So it failed because Intel wasn't the only one making x86 processors; otherwise they would have been able to push it down our throats one way or another.
WRT Itanium, and all VLIW processors for that matter: you have to compile for the microarchitecture of the processor to fully take advantage of it. Which sinks the x86 binary portability that we've all come to know and love -- the portability behind the fundamental ease that ubiquity brings to working with x86 as an end user or as any software dev who doesn't really care about processor internals.
It sure would be interesting to see something else challenge x86, but since ARM/Power/RISC-V are all within the same architectural tradition I doubt they will provide it. SIMD as used by GPUs is useful only when you have very structured parallelism like arithmetic on large data sets.
I think if we're going to see a challenger worth noting they'll look like they're crazy with the hard uphill battle and novel tech they will have to produce.
If you're looking for innovation on vector operations (i.e. not packed SIMD or GPU-style SPMD), then RISC-V's vector extension (especially if you consider extensions to it for graphics or scientific compute or machine learning) should seem very promising. It's not technically groundbreaking (Cray did it first), but I think it has more potential than the endless treadmill of wider packed SIMD instructions (which dissipate an immense amount of power on intel's hardware), and it is amenable to more general tasks than SPMD is.
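To show why there's no "wider SIMD" treadmill in that model, here's a rough C sketch of vector-length-agnostic strip-mining in the Cray/RVV style. hw_set_vector_length is a made-up stand-in for what RVV's vsetvli instruction does in hardware; the point is that the same loop works unchanged whatever vector width a given chip has.

    #include <stddef.h>

    /* Hypothetical stand-in for RVV's vsetvli: report how many elements the
       hardware will handle this pass, capped by its own (unknown to us) width. */
    static size_t hw_set_vector_length(size_t remaining) {
        const size_t lanes = 8;               /* pretend this machine has 8 lanes */
        return remaining < lanes ? remaining : lanes;
    }

    /* SAXPY, strip-mined the vector-length-agnostic way: no recompile when a
       future machine has wider vectors, unlike packed SIMD where the width is
       baked into the instruction encoding (SSE vs AVX vs AVX-512). */
    void saxpy(size_t n, float a, const float *x, float *y) {
        while (n > 0) {
            size_t vl = hw_set_vector_length(n);
            for (size_t i = 0; i < vl; i++)   /* conceptually one vector op */
                y[i] += a * x[i];
            x += vl; y += vl; n -= vl;
        }
    }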
I wasn't aware that it was his thesis, thanks for pointing that out. There's also all the work on the Xhwacha RoCC (which I knew prof. Asanović was involved with). I think it all looks very promising and makes me eager to see what comes out of SiFive with Yunsup and Krste at or near the helm.
I assumed they just went for parity since they were on smaller budgets. You're saying they did something clever with a Cray reference. Could you elaborate on that?
> WRT Itanium and all VLIW processors for that mater, you have to compile for the microarchitecture of the processor to fully take advantage of it
True of course, but Itanium did think of this and had an "abstracted" VLIW where you worked with generic instruction bundles which the hardware was then more free to pull apart in different ways underneath. I do wonder, if we'd all taken a different path, whether we'd be in more of a JIT-compiled world. Who knows?
>"It sure would be interesting to see something else challenge x86, but since ARM/Power/RISC-V are all within the same architectural tradition"
Can you elaborate on what you mean by "architectural tradition"? I am not familiar with this term. What is the commonality of all those different architectures?
Actually I was: CISC and RISC are both von Neumann machines. The ISA has more to do with encoding these days anyway.
Basically, no matter how sophisticated the CPU of these machines becomes, it is still trying to maintain the appearance of executing instructions in sequence.
Current x86 processors can execute many instructions in parallel under the right conditions, leading to instructions per clock (IPC) much greater than one. However, typical IPC hovers around one and some change unless a lot of optimization effort is invested.
In-order execution of an instruction sequence (or the appearance of it to the user) is the bottleneck that has caused single-threaded CPU speed increases to all but disappear now that CPU frequency scaling has more or less halted.
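A concrete illustration of that (my own toy code, nothing authoritative): the first reduction below is one long dependency chain, so the core retires roughly one add per FP-add latency no matter how wide it is; the second splits the work into four independent chains that a superscalar core can keep in flight simultaneously. (The compiler can't legally do this reassociation for floats unless you allow something like -ffast-math.)

    #include <stddef.h>

    /* One dependency chain: each add waits on the previous sum, so IPC is
       pinned near 1 / (FP add latency) regardless of how wide the core is. */
    float sum_chain(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent accumulators: an out-of-order core can keep several
       adds in flight per cycle, pushing IPC well above one. */
    float sum_split(const float *a, size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }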
Sure, everything now reduces to microcode and micro-ops that are completely RISC, but in a historical discussion of CPU architecture, x86 is the poster child for CISC.
and look at the tables there. You will see that among the processors on the "CISC side", the Intel i486 is nearer to RISC than many other CISC architectures.
Mill isn't too crazy from what I've seen, they seem acutely aware of the pitfalls of poor cache utilization and other things which were the bane of VLIW (outside of DSP). The future holds the truth on just how much that awareness translates into performance, but at least they're thinking a bit. The thing which concerns me about the Mill is the software-managed binary incompatibility between every model. They say this is manageable but I have some doubts about it, especially about how it plays with JITs.
They say that JITs will have to link to their proprietary system library in order to work, which seems like it could be devastating for uptake.
What is fun is that x86-64 ended up having CPU vendor dependent things like SYSENTER vs SYSCALL (and even SYSCALL ended up having a bunch of vendor dependent things about it too). Win10 had to add INT 2E to the 64-bit kernel for 32-bit system calls.
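For anyone who hasn't poked at these: this is roughly what using the x86-64 SYSCALL instruction looks like from user space, shown as a raw Linux write(2) (my sketch; the Windows paths described above differ in numbering and dispatch, but the instruction-level mechanics are the same).

    /* Raw x86-64 Linux system call via the SYSCALL instruction.
       rax = syscall number, args in rdi/rsi/rdx; SYSCALL itself clobbers
       rcx and r11 (it stashes the return RIP and RFLAGS there for SYSRET). */
    static long raw_write(int fd, const void *buf, unsigned long len) {
        long ret;
        __asm__ volatile("syscall"
                         : "=a"(ret)
                         : "a"(1L /* __NR_write on x86-64 Linux */),
                           "D"((long)fd), "S"(buf), "d"(len)
                         : "rcx", "r11", "memory");
        return ret;
    }

    int main(void) {
        raw_write(1, "hello via SYSCALL\n", 18);
        return 0;
    }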
"so we never got to really explore those cool features"
The security stuff was explored pretty well by Secure64. I've always given them credit for at least trying to build bottom-up on it with above-average results in pentesting.
You make it sound like uranium nuclear reactors vs thorium nuclear reactors. Because of the arms race, uranium ones were developed first even though they are merely (as one HN commenter put it) a local optimum. And I remember I was one of the happy guys, rooting for AMD and enjoying cheaper processors.