
>Even if Microsoft's x86-64 emulator will be slower than Rosetta2

Is there anything stopping MS from copying Apple's approach?

I know Rosetta 1 was technology licenced from Transitive Corporation; is Rosetta 2 completely Apple in-house tech?



> Is there anything stopping MS from copying Apple's approach?

One thing I don't completely understand about Rosetta2 is when it performs static recompilation and when it performs dynamic recompilation. I'm under the impression that a lot of Rosetta2's magic is in static recompilation, but I don't know in which cases it statically recompiles and in which cases it dynamically recompiles. If anyone knows anything about this, I'd really like to understand it better.

If my hunch about Rosetta2 focusing largely on static recompilation is correct, Microsoft can't copy this approach easily. Static recompilation is very hard to make reliable. It's probably doable in Apple's use case (doable, definitely not easy), where their only aim was probably to make everything that runs on Intel-based Catalina or Big Sur run on Apple Silicon. However, Intel-based 64-bit Windows 10 can (attempt to) run all 32-bit Windows software made since Windows 95. That's a much wider and more diverse range of software, and the only approach that makes it reliable is to depend on the slower dynamic recompilation.
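To make the reliability problem concrete, here is a toy C example (Linux/x86-64 only, and assuming the OS still allows a writable+executable mapping) of code that exists only at run time. A static recompiler scanning the binary on disk can never see these instructions, and JITs and self-modifying code do essentially this, which is why some dynamic fallback is unavoidable:

------

  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      /* x86-64 machine code for: mov eax, 42 ; ret */
      unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

      void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (buf == MAP_FAILED) { perror("mmap"); return 1; }

      memcpy(buf, code, sizeof code);        /* "generate" the code */
      int (*fn)(void) = (int (*)(void))buf;  /* and jump into it */
      printf("%d\n", fn());                  /* prints 42 */
      return 0;
  }

------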

If my hunch about Rosetta2 is not correct, and it's achieving these performance numbers with pure dynamic recompilation, all I can say is hats off to Apple. Microsoft would do well to pay them a large amount of money to license their tech in that case (or to copy as much of their approach as they can, anyway).


It's probably doable because VMware did something very similar about 20 years ago, back before x86 had virtualisation hardware.

That was x86 on x86, but it took a similar level of sophistication then as cross-architecture "software virtualisation" does now to run all 32-bit Windows software, as well as all 16-bit software and other OSes such as Linux, BSD, MS-DOS, NetWare, etc.

Considering all the things floating around in Windows, other OSes and applications (self-modifying code, generated code, thunks, etc.), VMware did a great job of scanning all code for entry points, intercepting every path that couldn't be executed directly, and patching what needed to be patched.

I expect they can revive that approach with cross-architecture translation now, and I would be very surprised if VMware and Parallels are not working on exactly this for x86-on-ARM VMs right now.


I understand that Rosetta 2 uses static recompilation "where possible" and only falls back to dynamic when necessary.

This is purely speculative but I would assume that it statically recompiles the __TEXT segment of binaries or maybe whatever it can easily figure out is code by following all possible jumps, etc.

Then, whenever a page is mapped as executable, they could check whether it is part of a memory-mapped binary which has been statically recompiled, and otherwise recompile the code in the page and jump to the now-native version.

That way JITs should work without fully falling back to interpreting instructions. Things would get tricky when you map some memory as both writable and executable, but maybe they're pushing the limits on what the spec allows for instruction caches? I'd have to check my x86 reference manual. Or maybe this is another area where they have some hardware primitive to facilitate partial recompilation.
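Purely as a sketch of the dispatch speculated about above (none of these names correspond to anything real in Rosetta 2; they only spell out the two paths):

------

  #include <stdbool.h>
  #include <stddef.h>

  typedef void (*native_entry_t)(void);

  /* Hypothetical stand-ins for the two translation paths discussed above. */
  static native_entry_t lookup_aot_translation(const void *x86_addr) {
      (void)x86_addr;
      return NULL;        /* pretend nothing was translated ahead of time */
  }
  static native_entry_t translate_now(const void *x86_addr) {
      (void)x86_addr;
      return NULL;        /* a real translator would emit ARM code here */
  }

  /* Hypothetically called whenever execution reaches an x86 address. */
  native_entry_t resolve_entry(const void *x86_addr, bool from_translated_image) {
      if (from_translated_image) {
          native_entry_t e = lookup_aot_translation(x86_addr);
          if (e)
              return e;   /* fast path: use the static (AOT) result */
      }
      /* e.g. a page a JIT just made executable: translate it on the fly */
      return translate_now(x86_addr);
  }

------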


Static translation is, in fact, the simplest to accomplish, and it is time proven as well. This is what IBM have been doing with AS/400 for more than three decades. AS/400 runs on a virtual ISA, and applications are always compiled into the virtual instruction set. Pointers, for instance, have been 128-bit since day 1 (circa 1989). Upon the first startup, virtual instructions are compiled into real CPU instructions, and the recompiled application image is stashed away in a cache. Subsequent application restarts use the cached application image. IBM have been using POWER CPUs for the last few generations of AS/400, but they also used drastically different CPU designs in the past. This allows applications that were written in 1991 to run today, for example.
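The first-start flow described above is roughly the following (a sketch only; the function names are invented for illustration and are not an actual IBM i API):

------

  #include <stdio.h>
  #include <unistd.h>   /* access() */

  /* Invented stand-ins for the translate / execute steps described above. */
  static void translate_virtual_to_native(const char *src, const char *dst) {
      printf("first start: translating %s -> %s\n", src, dst);
  }
  static int execute_native_image(const char *path) {
      printf("running cached native image %s\n", path);
      return 0;
  }

  int run_program(const char *virtual_image) {
      char native_image[4096];
      snprintf(native_image, sizeof native_image, "%s.native", virtual_image);

      /* first start on this CPU generation: compile the virtual-ISA image
         into real instructions and stash the result away */
      if (access(native_image, F_OK) != 0)
          translate_virtual_to_native(virtual_image, native_image);

      /* every later start reuses the cached image */
      return execute_native_image(native_image);
  }

  int main(void) {
      return run_program("payroll.pgm");   /* hypothetical program name */
  }

------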

Dynamic binary translation (a predecessor of modern JIT) is a completely different story, though.

Now, a few years back, Apple introduced iOS app uploads in Bitcode. Bitcode is LLVM's own intermediate representation of the abstracted LLVM ISA. Apps available in Bitcode are statically compiled into the ISA of the actual CPU in one's iPhone when they are downloaded. Again, apps uploaded in Bitcode in 2018 will run fast and efficiently on an iPhone 18 as well. I could immediately draw parallels with AS/400 a few years back when Apple announced Bitcode availability.

I could not quickly check whether Bitcode is available for OS X apps today as well, but it would be hard to imagine why Apple would not want to extend the approach to them. This is where it can get really interesting: apps available through the OS X App Store in Bitcode could be developed and compiled on a device running the Intel ISA, but they could also be statically translated into the M1 (M2, M3, etc.) ISA at download time and, at least theoretically, they would not require a separate ARM / M1 app image (or a universal binary, for that matter). Unless, of course, the app leans on something x86 specific.

My hunch is that the real disruption Apple has inflicted upon the industry is not just a stupendously fast and power-efficient chip (it is very nice but secondary), but the demise of the importance of hardware ISAs in general: use an intermediate code representation for app binary images, and use whatever CPU / hardware ISA allows the product vision to be realised. It is ARM today but it can be anything else 10 years down the road. I wonder who can push the mainstream PC industry in a similar direction and successfully execute it, though.


Bitcode is an IR, but it is not architecture independent, and the code will have been compiled to a particular calling convention. Bitcode may allow Apple to apply different optimization passes for iPhone processors, but it’s not going to be hugely useful for Rosetta.
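A small C example of where the target leaks into the IR (the lowerings in the comments are from memory and may vary by clang version, so treat them as approximate):

------

  /* A struct passed by value is lowered according to the target ABI
     before the IR is even emitted. */
  struct pair { long a, b; };

  long sum(struct pair p) { return p.a + p.b; }

  /* Compiled with clang -S -emit-llvm, the IR signature already encodes
     the calling convention:
       - x86-64 (SysV): the struct is split into two integer registers,
         roughly  define i64 @sum(i64 %0, i64 %1)
       - AArch64: it is passed as an aggregate,
         roughly  define i64 @sum([2 x i64] %0)
     So the "same" C function yields different Bitcode per target, which is
     why Bitcode built against one ABI cannot simply be retargeted. */

------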


Bitcode is the bitstream file format used to encode the LLVM IR. LLVM IR itself is architecture independent (unless the code uses inline assembly, which is architecture specific): https://llvm.org/docs/LangRef.html

clang (and other LLVM-based language/compiler frontends) generates the LLVM IR, applies optimisation passes and runs the LLVM IR through an architecture-specific backend that transforms it into the underlying hardware ISA (or into a target ISA when cross-compiling). The original meaning of LLVM was «Low Level Virtual Machine». It is a widespread approach found in many cross-platform / portable compilers; GCC, for example, has RTL.

Here is an example of «Hello world» in LLVM IR; there is nothing architecture specific about it:

------

  @.str = private unnamed_addr constant [13 x i8] c"Hello World\0A\00", align 1

  ; Function Attrs: nounwind ssp uwtable
  define i32 @main() #0 {
    %1 = alloca i32, align 4
    store i32 0, i32* %1
    %2 = call i32 (i8*, ...)* @printf(i8* getelementptr inbounds ([13 x i8]* @.str, i32 0, i32 0))
    ret i32 0
  }

  declare i32 @printf(i8*, ...) #1

------
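For reference, C source along these lines is what produces IR of roughly that shape when compiled unoptimised with clang -S -emit-llvm (the #0 / #1 attribute groups in the listing are just compiler-generated metadata):

------

  #include <stdio.h>

  int main(void) {
      printf("Hello World\n");
      return 0;
  }

------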

Seamless translation of Bitcode into the Intel or ARM ISA actually works, too: https://www.highcaffeinecontent.com/blog/20190518-Translatin...

Lastly, Bitcode and Rosetta are mutually exclusive. If the app is available in Bitcode, it is run through an optimising hardware-ISA backend at download time and then always runs natively, whereas Rosetta runs existing x86 apps through static binary translation into ARM first (what Apple refer to as AOT - ahead of time) and likely uses JIT when the recompiled application is run.


Microsoft could easily afford to spend a hundred million dollars on making it work. Bringing up their own processor would be difficult to a similar degree, but now that Apple has done it, they should be able to do it too.

Interestingly, ARM to x86 is probably easier because ARM has a machine-readable specification (I don't think Intel or AMD have an equivalent).


Technically Rosetta 2 can run 32-bit Windows applications made since Windows 95 as well with CrossOver :)


M1 has instruction-level x86 emulation, so MS would need to convince Qualcomm and MediaTek it was in their interest to add that complexity, and embrace all the legal/technical pitfalls along the way.


It does not. The one piece I've seen that it does have is an optional ability for the higher-performance cores to retire stores with Total Store Order.


M1 has the ability to do this on all eight cores.


> M1 has instruction-level x86 emulation

Source?


It doesn't. But it does have a strong memory-ordering mode designed to match x86, which is a big help.

Without this mode, multi-threaded ARM code emulating x86 by translation needs to contain a very large number of barrier instructions.


so, it does? thanks for clarifying!


The emulation is not implemented in the processor, just the memory ordering.


Gotcha, it just emulates the memory ordering? At the instruction level? thank you!

edit: dunno how to feel about getting downvoted into oblivion for this thread, on one hand, it seems really cynical for people to read me as...masterminding some way to make someone smart look foolish?...when I'm legit asking qs - I don't have a CS degree and clearly don't understand some of the formal terms around CPUs, but I'm trying! :)


I think the downvotes are because of your language.

First, you wrote "M1 has instruction-level x86 emulation" as an authoritative sounding statement, as if it's true.

When you also talked about Qualcomm and MediaTek having to add lots of complexity and deal with legal pitfalls, that added to the idea that you really did mean a full x86 instruction decoder, because that's where the legal pitfalls are expected to be.

But it's false: the M1 doesn't do instruction-level x86 emulation, so that's a minor irritation. Comments don't usually get downvoted over a simple mistake, but they do if it looks like someone is saying something really misleading.

But then in response to being told it doesn't do that, you wrote "so, it does? thanks for clarifying!".

That language read like you were being snarky. The phrasing "so, it does?" in response to a correct "It doesn't." is typical English snark, a way of sounding dismissive and challenging at the same time.

That interpretation invited a downvote.

Once you have been perceived as snarky, following it with "thanks for clarifying!" is doomed to come across as more snark. The exclamation mark makes this worse.

If you are legit just curious, then I think what has happened is this: you did not understand that "instruction-level x86 emulation" is a very different thing from "memory ordering", so stating the former to be something the M1 does is misleading, and you did not realise you had written something that came across as snark.

I'll legit try to explain the difference in case you're interested.

A CPU doing x86 emulation at the instruction level would have the chip decode x86 machine instructions in hardware one by one and perform the operations they describe. The M1 doesn't have hardware to decode x86, though; neither does any other ARM CPU. It could be built, but nobody has, probably for legal rather than technical reasons.

For memory ordering, we probably wouldn't call it emulation. All multi-core CPUs have a memory-ordering model, but different architectures have different models. The model determines how memory accesses (reads and writes) by a program running on one core are observed by programs on other cores running in parallel and accessing the same locations. (On single-core CPUs it is irrelevant, because a CPU core always observes its own reads and writes in program order.)

It's probably not obvious how memory accesses can be observed out of order. It happens because, at some level, reads and writes inside a CPU are not instantaneous. They become messages back and forth with the memory, those messages take time to travel, and things get even more complicated when there are different kinds of caches all over the place as well. With multiple cores sending and receiving memory messages, each taking time to travel, the result is that cores can see memory operations in a different order from each other. Rather than explain further, I'll link to the Wikipedia article: https://en.wikipedia.org/wiki/Memory_ordering

ARM generally has a weak memory model, which means the order of instructions in a program on one core has little effect on the order of memory accesses seen by other cores. This is too weak for some things to work (like locks between parallel threads), so there are also barrier instructions which force all memory accesses before and/or after the barrier to be visible to programs on other cores in the order of instructions, as long as the programs on the other cores also use barriers.

On x86 you don't need barriers, because the ISA is defined (due to the history of CPUs) so that all memory accesses by each core are visible to all the others in the exact order of instructions run by each core. (There are some exceptions, but they don't matter here.) This is a great illusion, because in reality, to stay fast, there are memory access messages flying around inside the CPU and between the cores and caches, getting out of order in large queues. Maintaining the effect that everything is in order needs a bunch of extra transistors and logic on the chip, to keep track of which messages have to be queued up and waited for before others.

This is why, when a multi-threaded x86 program is translated into an ARM program (by the Rosetta 2 software), the ARM code needs a lot of barrier instructions all over the place, and these tend to be slow because the chip isn't optimised for lots of barriers.
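To make that concrete, here is the classic publish-a-flag pattern in C11 atomics. On x86 the two stores in the producer are plain MOVs and the strong ordering keeps them in order anyway; on ARM's weak model the release store has to become an STLR (or a store plus a barrier), and an x86-to-ARM translator that can't rely on a TSO mode must conservatively emit such barriers around nearly every memory access:

------

  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  atomic_int data  = 0;
  atomic_int ready = 0;

  static void *producer(void *arg) {
      (void)arg;
      atomic_store_explicit(&data, 42, memory_order_relaxed);
      /* On x86, TSO already guarantees other cores see the data store
         before this flag store; on ARM the release ordering is what
         provides that guarantee. */
      atomic_store_explicit(&ready, 1, memory_order_release);
      return NULL;
  }

  static void *consumer(void *arg) {
      (void)arg;
      while (!atomic_load_explicit(&ready, memory_order_acquire))
          ;   /* spin until the flag is published */
      /* With the release/acquire pair (or x86's ordering) this prints 42. */
      printf("%d\n", atomic_load_explicit(&data, memory_order_relaxed));
      return NULL;
  }

  int main(void) {
      pthread_t p, c;
      pthread_create(&c, NULL, consumer, NULL);
      pthread_create(&p, NULL, producer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }

------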

The M1 has a non-standard (for ARM) operating mode where the weak ordering switches to program order, like x86. For this to work it needs extra logic on the chip. It seems the M1 doesn't use this extra logic all the time, though, only when Rosetta 2 needs it, probably because turning it on slows things down a bit.

So when x86 programs are translated into ARM programs (by the Rosetta 2 software), the translated code doesn't need to include lots of barrier instructions. Maybe not any. The resulting code which runs on the M1 is ARM code (not x86 code), but it's special ARM code: if it runs in parallel with other ARM code accessing the same memory on another core, it only runs correctly when the M1 turns on the non-standard program-order memory mode, which slows the CPU a little.



