There's no need for another build as BOLT runs directly on a compiled binary, so it can be integrated into an existing build system. Operating directly on machine code allows BOLT to boost performance on top of AutoFDO/PGO and LTO. Processing the binary directly requires a profile in a format different from what a compiler expects, since the compiler operates on source code and needs to attribute the profile to high-level constructs.
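For the curious, the typical workflow looks roughly like this (a sketch; exact flag names vary between BOLT versions, and `./myapp` plus the workload are placeholders):

```shell
# Sample the binary under a representative workload with perf.
# -j collects LBR branch records, which BOLT prefers for accuracy.
perf record -e cycles:u -j any,u -o perf.data -- ./myapp

# Convert the perf profile into BOLT's profile format.
perf2bolt -p perf.data -o perf.fdata ./myapp

# Rewrite the binary with an optimized code layout.
llvm-bolt ./myapp -o ./myapp.bolt -data=perf.fdata \
    -reorder-blocks=ext-tsp -reorder-functions=hfsort
```

No recompilation happens anywhere in this pipeline; only the final link-time artifact is rewritten, which is why it composes with PGO/LTO builds.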
So what exactly would BOLT help with? I'm guessing mainly junky code or things a compiler cannot possibly optimize. Could you run BOLT on something that is known to be well optimized?
If BOLT can work directly on binaries, is there anything stopping it from being integrated as a kernel module into the OS, so that binaries are continually being profiled and optimized?
It seems to me that an optimized OS image could also be created.
That doesn't seem like it would be profitable. Profiling the running process, processing the profile, and then relinking the binary on a single host wouldn't pay off compared to profiling in the large, relinking the program once and redeploying it at scale. Peak optimization is expensive.
My phone runs approximately zero new binaries every day. I'd happily let my phone optimize itself while charging so that netflix, youtube, or other heavy CPU users use less battery.
I'm headed to work now but I'll try Opera 12 and the Oracle JVM later. As for the latter, I'm curious as to how much the running application will influence the final optimizations applied. Since you used it on hhvm, did you see anything interesting regarding this, or does it always optimize towards the same optimal layout?