Intel/x86 seems to have missed pretty much every major trend but somehow manages to keep up.
At the time, Intel seemed to have pinned the 64-bit future on Itanium. Which, on paper, is a much better architecture. Of course instruction parallelism should be in the hands of the compiler -- it has so much richer information from the source-code so it can do a better job. The MMU design was very clever (self-mapped linear page tables, protection keys, TLB sharing, etc). Lots of cache, minimal core. Exposing register rotation and really interesting tracing tools. But AMD released x86-64, Intel had to copy it, and the death warrant on Itanium was signed, so we never really got to explore those cool features. Only now is x86 getting things like more levels of page tables and protection-key-style features.
Intel missed virtualisation not only in x86 (understandable given legacy) but on Itanium too, where arguably they should have thought about it during green-field development (it was closer, but still had a few instructions that didn't trap properly for a hypervisor to work). The first versions of VMware -- doing binary translation of x86 code on the fly -- were really quite amazing at the time. Virtualisation happened despite x86, not in any way because of it. And you can pretty much trace the entire "cloud" back to that. VMX eventually got bolted on.
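As a concrete (and entirely hypothetical, Linux-flavoured) illustration of the protection-key style feature that eventually arrived on x86: with memory protection keys you can flip a page's access rights per thread without touching the page tables, which is the sort of trick Itanium's MMU design was playing at years earlier.

    /* Minimal sketch, assuming a pkeys-capable x86 CPU, Linux 4.9+ and glibc 2.27+.
       Not an Itanium example -- just what the x86 descendant of the idea looks like. */
    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t len = 4096;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        int pkey = pkey_alloc(0, 0);                 /* allocate a protection key */
        if (pkey < 0) { perror("pkey_alloc"); return 1; }

        /* Tag the page with the key; page-table permissions stay read/write. */
        if (pkey_mprotect(buf, len, PROT_READ | PROT_WRITE, pkey)) {
            perror("pkey_mprotect"); return 1;
        }

        buf[0] = 'x';                                /* allowed */

        /* Deny writes through this key via the thread-local PKRU register --
           no mprotect, no TLB shootdown. A write here would now SIGSEGV. */
        pkey_set(pkey, PKEY_DISABLE_WRITE);

        pkey_set(pkey, 0);                           /* re-enable before touching it */
        buf[1] = 'y';

        pkey_free(pkey);
        munmap(buf, len);
        return 0;
    }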
UEFI is ... interesting. It's better than a BIOS, but would anyone hold that up as a paragon of innovation?
I'm not so familiar with the SIMD bits of the chips, but I wouldn't be surprised if there are similar stories in that area of development.
Low power seems lost to ARM and friends. Now, with the "cloud", low power is as much a problem in the data centre as your phone. So it will be interesting to see how that plays out (and it's starting to).
x86's persistence is certainly remarkable, that's for sure
> Of course instruction parallelism should be in the hands of the compiler -- it has so much richer information from the source-code so it can do a better job.
This is not true. Yes, the compiler has a lot of useful information, but it does not have access to runtime information, which the CPU does.
A modern superscalar CPU with out-of-order execution can speculatively execute instructions hundreds of cycles before their results are actually needed.
Instructions after conditional branches can be executed speculatively using branch prediction and loads can be executed before stores (even though they might access the same address!). If the processor later detects that the load was dependent on the store, it can "undo" that operation, without any of these effects ever being visible to the running program.
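To make that concrete with a toy example (mine, nothing special): in the loop below the compiler usually can't prove that dst and src never overlap, so it has to schedule the load after the preceding store. An out-of-order core just issues the load early based on the actual runtime addresses and squashes/replays it in the rare case they really do collide.

    /* The compiler must assume the store to dst[i] might feed the load of
       src[i+1] (possible aliasing), so it schedules conservatively. The CPU
       speculates the load past the store and only replays on a real conflict. */
    void scale_copy(float *dst, const float *src, int n, float k) {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];    /* store, then the next iteration's load */
    }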
While you are right, the compiler should still definitely do most of the job.
There's simply no reason to keep track of dependencies, etc. at run time instead of doing this at compile time.
Do you have any examples of when this beats a C program compiled with `-march=native -O2`?
I've read about this promise of JIT optimization being able to generate optimal code on hot paths, but in practice I've never seen it work out that way. Good AOT compilers exist today and consistently generate optimized machine code.
Your C code is already executed by the CPU in a JIT optimized way because of the way the x86 architecture is implemented on the CPU. What more proof do you want?! :)
How so?
When the compiler generates code like
    add eax, eax
    add ebx, ebx
Of course it knows that they are independent.
They could then be represented in the ISA as a block of sorts.
Itanium failed because writing compilers which make VLIW efficient is probably insurmountable for human minds. Donald Knuth himself said that compilers which would make it perform even adequately were nearly impossible to write.
With it being borderline impossible to write good VLIW compilers, most of the instruction word was full of NOPs, which meant terrible code density, which meant an I$ full of NOPs. A recipe for hot, slow computers that are depressing to program for.
It also failed because, at least 6 years before Itanium was even released (2001), there were better, faster options available.
In 1995, the Pentium Pro featured out-of-order and speculative execution[1]. This has two main advantages over the VLIW architecture (Itanium in particular):
1) The processor can use runtime information (such as past history of branches) to speculatively execute hundreds of instructions before they are needed.
2) When a stall happens due to a cache miss, the CPU can in some cases continue to execute other instructions while waiting for the stall to end. This isn't possible with VLIW because the compiler can't predict cache hits and misses; that information is only available at runtime (see the sketch below).
These two advantages combined meant Itanium was much slower than the x86 processors available at the time.
[1]: Note that Intel didn't invent these ideas. Out-of-order execution was already known in the 1960s (the CDC 6600's scoreboard, Tomasulo's algorithm on the IBM System/360 Model 91), although back then it was applied to mainframes and supercomputers. I'm just using the Pentium Pro as an example of a microprocessor that took the same approach.
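A rough sketch of point 2 (my own toy example, not from any real compiler study): the pointer chase below misses cache unpredictably, but the array sum next to it doesn't depend on it, so an OoO core keeps issuing those adds while each miss is outstanding. A static VLIW schedule has to assume some fixed load latency and either stalls or pads with NOPs.

    struct node { struct node *next; long pad[7]; };   /* ~64 bytes: roughly one miss per hop */

    /* Walk a linked list (dependent, cache-missing loads) while summing an
       unrelated array (independent work). Out-of-order hardware overlaps the
       two; a compile-time schedule can't know which loads will miss. */
    long walk_and_sum(const struct node *head, const int *a, int n, long *sum_out) {
        long hops = 0, acc = 0;
        int i = 0;
        while (head) {
            head = head->next;                         /* may miss the cache */
            hops++;
            for (int k = 0; k < 8 && i < n; k++, i++)
                acc += a[i];                           /* fills the miss latency */
        }
        *sum_out = acc;
        return hops;
    }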
>"Itanium failed because writing compilers which make VLIW efficient is probably not surmountable for human minds. Donald Knuth himself said that compilers which would make it perform even sufficiently were nearly impossible to write."
Wow, might you have any links regarding Knuth commenting on VLIW optimized compilers? I would love to read more about this.
What was Intel's view on the viability of it? It seems like quite a shortcoming.
One could do it mainframe style where the I/O was offloaded to one core with the rest doing compute. Those hundreds of thousands of interrupts never hit your single-CPU task. You can get a boost in performance even in your use-case.
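A minimal Linux-flavoured sketch of that idea (mine, and it assumes the IRQ side has already been steered to core 0 by other means, e.g. /proc/irq/*/smp_affinity or isolcpus): pin the compute thread to core 1 so device interrupts rarely preempt it.

    /* Pin the calling (compute) thread to CPU 1, leaving CPU 0 to soak up
       interrupts. IRQ routing itself is configured outside this program. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 == calling thread */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }

    int main(void) {
        if (pin_to_cpu(1) == 0) {
            /* ... run the interrupt-sensitive compute loop here ... */
        }
        return 0;
    }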
I had thought of using multiple SoCs for compartmentalization, instead of VMs. Rather like a hardware version of Qubes. I was very inspired by Tinfoil Chat.[0] Using optoisolators, one could have a secure device with strictly text-only communication with other components.
But it's over my head. I lack knowledge and equipment for measuring RF leakage among SoCs. So it goes. Maybe the Qubes team will do it :)
" I was very inspired by Tinfoil Chat.[0] Using optoisolators, one could have a secure device with strictly text-only communication with other components."
Our discussions about physical and diode security on Schneier's blog were a small part of the inspiration for Tinfoil Chat. Markus Ottela then showed up to discuss the design. He wisely incorporated each piece of feedback, like switching from OTP to a polycipher and adding covert-channel mitigation at the protocol level. I loved reading the regular updates he posted about the things he added. The design, not the implementation, is one of the only things I'll call NSA-proof for confidentiality and security, but not availability. By NSA-proof, I mean without TAO getting involved with really hands-on stuff.
One drawback was he didn't know systems programming, so the reference implementation is in Python. People who know statically-verifiable C, Ada/SPARK, or Rust need to implement it on a minimal TCB. I'd start with a cut-down OpenBSD first, just because the NSA will go for 0-days in the lower layers and they have fewer there. Make the implementation as portable across MIPS and ARM as possible so the hardware can be interchanged easily. A trimmed Linux next if drivers absolutely need it, with Poly2 Architecture-style deletion of unnecessary code. If money comes in, implement it on a secure CPU such as CHERI with CheriBSD.
That was my plan when I was talking to him. Lots of potential with TFC after it gets a proper review of protocol and implementation.
"People called the IO processor "the real OS" because they had never seen dedicated I/O processors before."
I didn't know that. That's funny, as it's exactly the confusion I'd expect: the old wisdom of I/O processors in non-server machines was lost for a few generations, and then a new generation sees them and wonders whether it's an extra core for apps, the OS, or whatever.
I think he was getting at how browsers nowadays put each tab in its own process. So on multi-core systems, your "browser" would end up running on multiple cores.
I still believe if AMD hadn't come up with x64 extensions, I would be typing this on an Itanium laptop now.
So it failed because Intel wasn't the only one making x86 processors; otherwise they would have been able to push it down our throats one way or another.
WRT Itanium, and all VLIW processors for that matter: you have to compile for the microarchitecture of the processor to fully take advantage of it. Which sinks the x86 binary portability that we've all come to know and love -- the portability behind the fundamental ease that ubiquity brings to working with x86 as an end user or as any software dev who doesn't really care about processor internals.
It sure would be interesting to see something else challenge x86, but since ARM/Power/RISC-V are all within the same architectural tradition I doubt they will provide it. SIMD as used by GPUs is useful only when you have very structured parallelism like arithmetic on large data sets.
I think if we're going to see a challenger worth noting they'll look like they're crazy with the hard uphill battle and novel tech they will have to produce.
If you're looking for innovation on vector operations (i.e. not packed SIMD or GPU-style SPMD), then RISC-V's vector extension (especially if you consider extensions to it for graphics or scientific compute or machine learning) should seem very promising. It's not technically groundbreaking (Cray did it first), but I think it has more potential than the endless treadmill of wider packed SIMD instructions (which dissipate an immense amount of power on intel's hardware), and it is amenable to more general tasks than SPMD is.
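To show why there's no "wider SIMD" treadmill in that model, here's a rough C sketch of vector-length-agnostic strip-mining in the Cray/RVV style. hw_set_vector_length is a made-up stand-in for what RVV's vsetvli instruction does in hardware; the point is that the same loop works unchanged whatever vector width a given chip has.

    #include <stddef.h>

    /* Hypothetical stand-in for RVV's vsetvli: report how many elements the
       hardware will handle this pass, capped by its own (unknown to us) width. */
    static size_t hw_set_vector_length(size_t remaining) {
        const size_t lanes = 8;               /* pretend this machine has 8 lanes */
        return remaining < lanes ? remaining : lanes;
    }

    /* SAXPY, strip-mined the vector-length-agnostic way: no recompile when a
       future machine has wider vectors, unlike packed SIMD where the width is
       baked into the instruction encoding (SSE vs AVX vs AVX-512). */
    void saxpy(size_t n, float a, const float *x, float *y) {
        while (n > 0) {
            size_t vl = hw_set_vector_length(n);
            for (size_t i = 0; i < vl; i++)   /* conceptually one vector op */
                y[i] += a * x[i];
            x += vl; y += vl; n -= vl;
        }
    }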
I wasn't aware that it was his thesis, thanks for pointing that out. There's also all the work on the Xhwacha RoCC (which I knew prof. Asanović was involved with). I think it all looks very promising and makes me eager to see what comes out of SiFive with Yunsup and Krste at or near the helm.
I assumed they just went for parity since they were on smaller budgets. You're saying they did something clever with a Cray reference. Could you elaborate on that?
> WRT Itanium and all VLIW processors for that mater, you have to compile for the microarchitecture of the processor to fully take advantage of it
True of course, but Itanium did think of this and had an "abstracted" VLIW where you worked with generic instruction bundles which the hardware was then more free to pull apart in different ways underneath. I do wonder, if we'd all taken a different path, whether we'd be in more of a JIT-compiled world. Who knows?
>"It sure would be interesting to see something else challenge x86, but since ARM/Power/RISC-V are all within the same architectural tradition"
Can you elaborate on what you mean by "architectural tradition"? I am not familiar with this term. What is the commonality of all those different architectures?
Actually I was: CISC and RISC are both von Neumann machines. The ISA has more to do with encoding these days anyway.
Basically, no matter how sophisticated the CPU of these machines becomes, it is still trying to maintain the appearance of executing instructions in sequence.
Current x86 processors can execute many instructions in parallel under the right conditions, leading to instructions per clock (IPC) much greater than one. However, typical IPC hovers around one and some change unless a lot of optimization effort is invested.
In-order execution of an instruction sequence (or the appearance of it to the user) is the bottleneck that has caused single-threaded CPU speed increases to all but disappear now that CPU frequency scaling has more or less halted.
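A concrete illustration of that (my own toy code, nothing authoritative): the first reduction below is one long dependency chain, so the core retires roughly one add per FP-add latency no matter how wide it is; the second splits the work into four independent chains that a superscalar core can keep in flight simultaneously. (The compiler can't legally do this reassociation for floats unless you allow something like -ffast-math.)

    #include <stddef.h>

    /* One dependency chain: each add waits on the previous sum, so IPC is
       pinned near 1 / (FP add latency) regardless of how wide the core is. */
    float sum_chain(const float *a, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent accumulators: an out-of-order core can keep several
       adds in flight per cycle, pushing IPC well above one. */
    float sum_split(const float *a, size_t n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i]; s1 += a[i + 1]; s2 += a[i + 2]; s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }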
Sure, everything now reduces to microcode and micro-ops that are completely RISC, but in a historical discussion of CPU architecture, x86 is the poster child for CISC.
and look at the tables there. You will see that among the processors on the "CISC side", the Intel i486 is nearer to RISC than many other CISC architectures.
Mill isn't too crazy from what I've seen, they seem acutely aware of the pitfalls of poor cache utilization and other things which were the bane of VLIW (outside of DSP). The future holds the truth on just how much that awareness translates into performance, but at least they're thinking a bit. The thing which concerns me about the Mill is the software-managed binary incompatibility between every model. They say this is manageable but I have some doubts about it, especially about how it plays with JITs.
They say that JITs will have to link to their proprietary system library in order to work, which seems like it could be devastating for uptake.
What is fun is that x86-64 ended up having CPU vendor dependent things like SYSENTER vs SYSCALL (and even SYSCALL ended up having a bunch of vendor dependent things about it too). Win10 had to add INT 2E to the 64-bit kernel for 32-bit system calls.
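For anyone who hasn't poked at these: this is roughly what using the x86-64 SYSCALL instruction looks like from user space, shown as a raw Linux write(2) (my sketch; the Windows paths described above differ in numbering and dispatch, but the instruction-level mechanics are the same).

    /* Raw x86-64 Linux system call via the SYSCALL instruction.
       rax = syscall number, args in rdi/rsi/rdx; SYSCALL itself clobbers
       rcx and r11 (it stashes the return RIP and RFLAGS there for SYSRET). */
    static long raw_write(int fd, const void *buf, unsigned long len) {
        long ret;
        __asm__ volatile("syscall"
                         : "=a"(ret)
                         : "a"(1L /* __NR_write on x86-64 Linux */),
                           "D"((long)fd), "S"(buf), "d"(len)
                         : "rcx", "r11", "memory");
        return ret;
    }

    int main(void) {
        raw_write(1, "hello via SYSCALL\n", 18);
        return 0;
    }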
"so we never got to really explore those cool features"
The security stuff was explored pretty well by Secure64. I've always given them credit for at least trying to build bottom-up on it with above-average results in pentesting.
You make it sound like uranium nuclear reactors vs thorium nuclear reactors. Because of the arms race, uranium ones were developed first even though they are merely (as one HN commenter put it) a local optimum. And I remember I was one of the happy guys, rooting for AMD and enjoying cheaper processors.