Linux 5.8 Set to Optionally Flush the L1d Cache on Context Switch (phoronix.com)
164 points by blopeur on May 24, 2020 | 148 comments


I'm not that old, but I've been closely following these trends since the late 1990s, and it seems to me like we are descending into madness now. We are wiping out years of single-core performance gains--which have been hard to come by over the last decade to begin with--through all of these mitigations. It seems to me like maybe the mental model is broken. If untrusted code is running on the same core/package/what have you, your security has already been breached.


I think we just stretched the 'unix' model in the wrong direction -- attempts to solve every problem, at every layer, all of the time. Trying to be everything to everyone never works out very well.

It would have been better to preserve users (UIDs) as the 'security' boundary for data. Leaving processes (PIDs) to offer the containment for safety (virtual memory etc.), and specifically not security of data.

Then this specific case would only need the cache flushed on context switching between UIDs, but not all processes.

Attempts to retro-fit security of data between processes are resulting in as much a negative as a positive -- e.g. performance loss, or usability issues like regular users on Linux now needing root to explicitly enable gdb'ing their own processes.

Of course, the new thing is all this untrusted code we're running; understood.

I would suggest preserving UIDs as the security boundary for data, then directing these issues through a mechanism that makes it as easy for an unprivileged user to 'fork' a UID for a purpose as it is to fork() a process. This gives a clear role to UIDs/PIDs and a sandboxing capability that makes use of all the existing implementation and boundaries.


From an end-user perspective, what really needs to happen is a new layer between UIDs and PIDs, corresponding to applications or services. It shouldn't matter whether under the hood an application uses multiple threads or multiple processes, as long as they're all contained to the same sandbox. I agree that fitting this model to current standard Unix capabilities probably means each application running under a separate UID—which is already common for services running on a server, but not for desktop/GUI applications.


> I agree that fitting this model to current standard Unix capabilities probably means each application running under a separate UID

Separate UIDs quickly get confusing too.

I think such separation would be better done via cgroups and namespaces instead.


No, I believe that misses my original point.

cgroups and namespaces are privileged mechanisms that only 'root' can use, and they share UIDs and other scopes across them (UID namespaces are complex and a risk in themselves). They require re-implementing all the necessary policy.

Separate UIDs need be no more confusing than PIDs. I agree we really don't want /etc/passwd with a gazillion entries; we should only be managing them as much as we "manage" temporary PIDs. But imagine a command to view the "UID" tree like the PID one.

I agree it might not be how we'd design it if we designed from scratch. But then we may have chosen cgroups that integrated into the process tree and several other design decisions would change.

The name "user" ID distorts the conversation because it's unintuitive, but I am suggesting that UIDs already embody much of the policy that is needed.


This is kind of what Android does with SELinux. Every app you install gets assigned a new UID on install. It seems to work quite well. Maybe we need this everywhere?


We do, potentially with some modifications to make it easier for applications to interact with each other in novel ways once the user grants permission.

Mobile operating systems are far ahead of desktop operating systems when it comes to making the "application" a first-class entity. macOS comes closest since most ordinary applications can be largely self-contained within the app bundle and the sandbox is steadily getting richer protection/isolation mechanisms. Linux applications have fairly well-controlled install and uninstall processes thanks to distro package managers, but post-install behavior cannot be managed with application granularity in any standard way (though there are some projects seeking to accomplish this, if they can first succeed in replacing existing distro package managers). And Windows is still largely allowing applications to spray files all over the disk and run whatever in the background.

The challenge for the desktop is that we don't really want to switch from a multi-user paradigm (with all applications in the same security domain) to a single-user paradigm with per-application security domains. We need a multi-user OS with per-application security domains, and mobile operating systems aren't quite there. (I've heard that Android can be multi-user, but I've never encountered that functionality in the wild.)


> what really needs to happen is a new layer between UIDs and PIDs, corresponding to applications or services

like cgroups?


Cgroups are more about resource allocation than access control, but they may still be of some use under the hood (as they are for eg. docker). They certainly aren't the abstraction the end-user needs in order to enforce separation between application sandboxes.


> It would have been better to preserve users (UIDs) as the 'security' boundary for data. Leaving processes (PIDs) to offer the containment for safety (virtual memory etc.), and specifically not security of data.

Maybe I'm misunderstanding, but to me what you suggest sounds like it would allow malicious JS running in my browser to snoop data from my secure password manager (running as a separate process), because they both run under the same UID?

Is that correct? Since most systems are predominantly single-user systems, I honestly think the PID-isolation model makes more sense, as I'm not trying to defend against other users trying to spy on me on my own laptop.


No, you've missed part of the post -- specifically the last sentence. That's the one to quote me on.

Your browser has the role of bringing in untrusted code, and running it. The browser code would 'fork' a UID to run the untrusted code (and only that), and then we make good use of all the existing UID-based policy in the kernel.


> The browser code would 'fork' a UID to run the untrusted code (and only that), and then we make good use of all the existing UID-based policy in the kernel.

What would be the privilege set of that new UID though?

And it would be a poor UX if that separate UID had absolutely zero access to my (human) UID-secured files because then it wouldn't be able to access my browser cache and history - and I'd rather Chrome didn't decide to require each browser process to have its own non-shared cache. It's bad enough Chrome is now using 3.5GB for 4 windows (16 tabs total) on my desktop.


Try to not think in terms of whole applications. Chrome already has threads/processes dedicated to executing untrusted JavaScript code. What seems to be lacking is providing developers with an easy and non-root mechanism to run these as a different UID. To isolate the whole app should be possible, yes -- but it's also a sledgehammer.

Yes, for the reason you describe, inter-operability with your own UID is actually good. My previous job taught me just how much this can be a feature not a bug; we made extensive use of applications, plugins and various forms of IPC to allow desktop applications to inter-operate in powerful ways.

Other mechanisms already exist. When a process (or thread) is forked, various resources can be passed over the boundary: file descriptors, shared memory, etc. For example, where untrusted code needs access to a file, the mechanism to do that is already there in a nice "opt-in" manner.
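
Roughly, the fd-inheritance part looks like this (a minimal sketch; the unprivileged "fork a UID" call proposed above is hypothetical, so it only appears as a comment):

    /* Minimal sketch: hand one file to a child via fd inheritance across
     * fork(). The unprivileged "fork a UID" call proposed above doesn't
     * exist; a real sandbox would drop to a throwaway UID where marked. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/etc/hostname", O_RDONLY);  /* opened by the trusted parent */
        if (fd < 0) { perror("open"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {
            /* Child: would switch to the throwaway UID here, keeping only
             * the descriptors it was explicitly given. */
            char buf[256];
            ssize_t n = read(fd, buf, sizeof buf - 1);  /* inherited fd still works */
            if (n > 0) { buf[n] = '\0'; printf("child read: %s", buf); }
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        close(fd);
        return 0;
    }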


You keep pushing away namespaces but (as I'm sure you know) you can make a thread join a namespace. I do this with network, mount and user namespaces, to get some separation and be able to put strict firewall, mount rules in place.

Since you can put uid and gid in firewall rules it makes for interesting belts and suspenders component separation.

Combine that with ZeroMQ, SPARK, and lots of generated code for the inter-thread comms, and you can build modular designs.

Just don't look at your ip link / ifconfig :-)
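
For reference, joining an existing namespace from a thread is a single setns() call -- a rough sketch, with the namespace path being just an example:

    /* Rough sketch: make the calling thread join an existing network
     * namespace via setns(2). The path is an example (e.g. created with
     * `ip netns add isolated`); joining generally needs CAP_SYS_ADMIN in
     * the namespace's owning user namespace. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    static int join_net_ns(const char *ns_path) {
        int fd = open(ns_path, O_RDONLY);
        if (fd < 0) { perror("open ns"); return -1; }
        if (setns(fd, CLONE_NEWNET) < 0) {  /* this thread now uses that netns */
            perror("setns");
            close(fd);
            return -1;
        }
        close(fd);
        return 0;
    }

    int main(void) {
        return join_net_ns("/run/netns/isolated") ? 1 : 0;
    }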


Am I wrong to assume that cloud vendors are the most worried about these sorts of exploits? Having exploits that would allow customers running on the same hardware to access each other's data seems like it would be disastrous for them. So much so that they're probably OK with paying the performance penalty (or, more accurately, passing the extra costs onto their customers!).


If cloud vendors are working on this, they must see it as a real threat; I don't think they would willingly put this much work into reducing their CPU resources.


Why not, it's the customer that ends up paying because there's less performance per machine, and so they may need more of them. If all cloud providers use this fix then there's really nothing you can do. Some may provide true 'bare-metal' machines but I'd imagine they're pricey and a pain to maintain / monitor.


In that scenario, competition would come from different architectures. If you have one that can partition caches according to security contexts or encrypt memory at the CPU/L1 boundary at a lower cost, then you have a competitive advantage.

Graviton 2, by ditching SMT, has one such mitigation.


To me, it sounds like it should help ARM64 gain some ground (efficient physically separate machines instead of VMs); not for HPC, but for more traditional workloads.


Do you care to elaborate? You can already do this on Linux with any architecture (x86 or anything else) using cpusets.
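
For example, confining a task to a couple of cores is just a matter of writing a few cpuset files (a hedged sketch assuming the cgroup v1 cpuset controller is mounted at /sys/fs/cgroup/cpuset; file names differ under cgroup v2, and this needs root):

    /* Carve two cores into a dedicated cpuset and move the current task
     * into it. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *val) {
        FILE *f = fopen(path, "w");
        if (!f) { perror(path); exit(1); }
        fputs(val, f);
        fclose(f);
    }

    int main(void) {
        mkdir("/sys/fs/cgroup/cpuset/tenant_a", 0755);                    /* new cpuset */
        write_file("/sys/fs/cgroup/cpuset/tenant_a/cpuset.cpus", "2-3");  /* its CPUs */
        write_file("/sys/fs/cgroup/cpuset/tenant_a/cpuset.mems", "0");    /* its NUMA node */

        char pid[32];
        snprintf(pid, sizeof pid, "%d", (int)getpid());
        write_file("/sys/fs/cgroup/cpuset/tenant_a/tasks", pid);          /* confine this task */
        printf("now restricted to CPUs 2-3\n");
        return 0;
    }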


I imagine that it is likely that sharing most hardware will lead to side channel vulnerabilities, and the per-core cache is not special. Smaller, cheaper SoCs could allow for sharing less of that hardware.

The problem of course being that CPUs that rival performance of a decently-specced x86 VM are going to be pricy, mooting the point.


For high-core-count (server) CPUs, all levels of cache are local to a core (or group of cores). For example, in Intel's current platform each core has a 1.x MB L3 slice attached to it. In AMD's Zen 2 design, each CCX (group of four cores) has 16 MiB of L3 attached to it.

Building on these architectural features, Intel has had CAT (Cache Allocation Technology), which essentially turns LLC slices into private caches for certain cores. That's intended for performance, but is now also relevant for security.


> it seems to me like we are descending into madness now

All of the speculative execution mitigations can be turned off with a single flag. I much prefer having the option to turn them on/off rather than either not having them at all or not having the ability to disable them.

So from my point of view, we're exactly where one would hope we would be in the face of these hardware flaws.


It may not be fully untrusted, but it's good to have barriers as well.

Also, this problem is fundamental and not limited to a single core, package or machine. Any time you are making opportunistic optimizations (of which caching is one), and the opportunities available depend on what happened outside of a privilege boundary, you can have this problem.

A similar thing happens at a higher level of abstraction with, for example, btree operation timings in a database.


I wonder if we can make some of these mitigations conditional. When I'm doing key handling or online banking I want max protection. When gaming I want max performance. And I don't want to reboot to switch up kernel parameters.


Some of the mitigations can be opted into on a per-process level via prctl
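
For example, the already-merged Speculative Store Bypass control is driven through prctl(2); a minimal sketch is below. The L1d-flush patch discussed in the article proposes a similar per-task knob, but its final interface may differ, so only the existing control is shown:

    /* Opt this task (and its future children) into the Speculative Store
     * Bypass mitigation. Available since Linux 4.17; see prctl(2). */
    #include <stdio.h>
    #include <sys/prctl.h>
    #include <linux/prctl.h>   /* PR_SPEC_* constants */

    int main(void) {
        if (prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                  PR_SPEC_DISABLE, 0, 0) != 0) {
            perror("PR_SET_SPECULATION_CTRL");
            return 1;
        }
        int state = prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS, 0, 0, 0);
        printf("speculation ctrl flags: 0x%x\n", state);  /* bit flags, see prctl(2) */
        return 0;
    }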


I've done hardly any work on kernels before; would it be possible to make such a major change on the fly?


I have not looked at the implementation, but I’m sure you could come up with a scheme that checks a flag on context switch and chooses whether to clear the cache, which would mean you could ensure it takes effect at the very next switch. In practice there may be complexity constraints that prevent this.


The current patch set is implemented as a prctl that you can turn on and off at will.


> If untrusted code is running on the same core/package/what have you, your security has already been breached.

This goes against decades of OS design, possibly all serious OS design once you pass over program loaders like MS-DOS and various ROM BASIC iterations. More to the point, there's enough advantages to being able to run untrusted code that a performance trade-off is worth it: If it comes down to being able to run your business at a penalty and having to close up shop, it doesn't take much to figure out what Amazon is going to do with AWS.


Does Xen have an option like this? In this arena, advice for the "paranoid" has a habit of becoming conventional wisdom. [1]

[1] https://www.theregister.co.uk/2018/06/20/openbsd_disables_in...


Transparent caches are always a mistake - there are only two hard problems in computer science, after all. The problem isn't untrusted code (because untrusted data is equivalent, and virtually everyone processes untrusted data at some point), the problem is hardware that's too clever by half, because it tries to guess what you meant instead of just doing what you told it.


I think I would come to exactly the opposite conclusion. Performant caching is so hard to do right that you'd be a fool to roll your own rather than trust the vendor-provided solution, which has had far more testing than anything you could hope to do yourself.


That is not the opposite to Imm's point. Transparent caching is when the client and the server don't know it's happening.


You run untrusted code every day you browse the web.


And the reason it's mostly still possible for companies and people to stay safe is that we have many layers of protection around that. Being able to run untrusted code is required in a world where the combined code you run on a daily basis contains millions or billions of lines coming from a wide variety of sources.

IMHO in the cloud a sane mechanism to protect users would be to guarantee exclusive access to CPUs. The primary risk in the cloud is sharing CPUs with other cloud users. Cache flushing sounds like a sane thing to do in non-performance-critical applications where CPU sharing happens. There's a performance penalty, but if you cared about that you'd be using a different instance type anyway, where you are guaranteed to have dedicated CPUs.

For bare metal/desktop use, flushing caches makes less sense.


One of the recent times I pointed this out, somebody retorted that browsers got rid of JS access to high-resolution timers, making a timing attack infeasible. Is this true?


Not true; at most it increases the number of samples required. But there are other ways to simulate timers with adequate resolution.


There is a point where you re-key faster than your key leaks through that specific timing attack. Or re-randomize memory layout, etc.

Or where the payoff (due to limits/quotas) implies a cap on the effort that's still worth it. It's like Backups: you do as much as you want to pay for, considering the amount of risk you'll be left with.


Please post an example showing that this works. I'll run the example with OS-mitigations disabled to confirm.


This is a browser checker that does not use the high precision timers: https://jsfiddle.net/lukelol/43015xpv/1/


It shows "Browser not exploitable." even after rebooting with mitigations=off.


Congratulations!

Now pray that there is not another exploit that can be used. :)

Like many things, security is an onion: there are layers to it, and removing some of those layers can be fine because the outer layers may protect you, but ultimately you just increase risk.


Why did you even post if you were going to resort to "but can you prove that there isn't an exploit?" anyway? I could have saved myself two reboots if you had started with that.


I assume good faith: you asked for an exploit that works without precision timers to test, and I provided one.

Don’t get mad about the fact it didn’t work, I’m glad it didn’t work, but you can’t walk away with the knowledge that something like that will _never_ work.

I’m not sure what you were trying to prove, but you can’t tell people to prove you wrong without any knowledge of what browser you’re running with, or what version or what you’re expecting to see.

A “fully working” exploit for the most modern browser is probably possible frankly, but it’s not something that anyone is looking at with seriousness because everyone has mitigations enabled anyway. It’s the very definition of high work, low reward.


>I assume good faith, you asked for an exploit that works without precision timers to test, I provided one.

No you didn't. You posted some code that doesn't do anything.

>I’m not sure what you were trying to prove, but you can’t tell people to prove you wrong

I didn't ask anyone to prove me wrong. I asked georgyo to prove his claim that an exploit is possible. You chimed in and posted some nonsense code which obviously does nothing and never did anything even before Meltdown and Spectre were mitigated anywhere, because even the original native PoCs were much more complicated.

>A “fully working” exploit for the most modern browser is probably possible frankly

Stop talking out of your ass.


You broke the site guidelines badly here. If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and sticking to the rules when posting here, we'd be grateful.


(Such as a loop.)


Not without shared memory.


There are apparently ways to get around even this requirement.


Are you able to substantiate that with a link or an explanation?


I just realized that my comment made it appear as if you can get a high-resolution timer without SharedArrayBuffer; I don't know of anyone who has done that yet. But people have made exploits that don't rely on it: https://alephsecurity.com/2018/06/26/spectre-browser-query-c...


And I think that is madness. The web should have never gone down that route. People should push back.


So all webpages should be simple forms, not even validation?

I remember those days, and actually shipped "dynamic" webpages before JavaScript/AJAX/DOM. I'll take cache flushing.

Imagine if every like/upvote button required a screen refresh... imagine Google maps...


You are indeed very good at imagining a world where the only choices are the web exactly as it was in the 90s, and exactly as it is today. That's a rare display of creative ability, but also a false dichotomy.


Illuminate it for us, then. What is the middle ground that permits SOME computation in the client, yet is so restricted that it cannot take advantage of timing or cache attacks to steal information from other processes running on the same machine?

The difficulty with both timing and cache attacks is that a "sandbox" approach is not possible... at least not without special hardware and OS level support like the ability to tell the OS to flush caches on context switches for certain processes.


Before JavaScript got fast and feature rich we had Java Applets. We also had Flash for a longer time. They are not fundamentally different from JavaScript: if our CPU runs code from the net there could be exploits.

So maybe the thin client model? All the code runs on the server and the UI is streamed to the client? But the lag would be higher and the cost for the web server would probably have prevented any internet boom.


> What is the middle ground that permits SOME computation in the client

You already went down the wrong road there. Can you imagine how to add functionality declaratively?

The middle ground is extending user agents to support features that improve the browsing experience and enable new functionality, without turning it into an arbitrary application delivery framework. And yes, this means rejecting and scoping out some things that people do in browsers today.

Just like in the past, you as a developer would get to choose whether to write a website (ubiquitous, accessible, runs on grandma's old potato, relatively easy and cheap to maintain; these are the things that made the web popular and businesses around the world decided that it's ok to to make their product a website even if that meant they couldn't have all the features you could with a desktop application), a native application (more effort, more complex, more expensive, more invasive, more friction, more concerning w.r.t. security), or both.

To illuminate you, let's pick some examples from the false-dichotomy post...

Consider form validation. In fact, this is already done (and could always be extended to support more cases). HTML5 has built-in form validation that works without javascript. And of course it's still perfectly backwards compatible with old browsers that don't do validation; they might send you invalid fields, but you will have to validate them server side anyway because you can't trust the client.

https://developer.mozilla.org/en-US/docs/Learn/Forms/Form_va...

Static maps have already been done. Not a super smooth experience, but you could always improve that by speccing a zoomable & pannable tiled image element that'll send requests to the specified URL when you pan outside of the loaded area or zoom in. Add a set of loadable elements that are embedded into this image element and you get something that starts to resemble SVG. No JS required.

Information about points of interest could already be shown with the hover selector, but there's no reason we couldn't spec an element whose visibility can be toggled with a click, no js needed.

Upvote/downvote buttons just need an attribute that tells the browser to post the request but stay on the current page. (This also degrades trivially with browsers that don't support the attribute) You could even toggle the visibility of the arrows after posting; similar CSS selectors for checked inputs already exist.

In general, there's no reason we can't have post or get requests that display the response in a new element without reloading the entire page. Semantically, not very different from target="_blank" or whatever you use to load something in a new tab / windows, except this time you want the target to be an element.

(At this point I'd also like to note that frames exist and yes they suck but hilariously a lot of the new web does exactly the kind of stateful non-linkable things that framesets were derided for; only worse, because you actually could right-click a frame and link directly to it, but you can't right-click and link the arbitrary DOM that was cooked by your client-side javascript)

Going with the tiled image element theme, there's no reason we can't have more elements that instruct the browser how to load more data on demand. These same elements could let the user agent decide whether to paginate or scroll infinitely, or how many items to display per page.

(There's no reason we can't load images progressively and on demand.. progressive JPEG exists already, but for some reason devs still insist on giving me a blur and nothing more will load unless I enable scripts)

The way we currently do things really sucks for the user (because they have very little control over how the script behaves; the user agent is degraded to a mere dumb client with little meaningful configurability) and it sucks for developers who would rather just focus on the content and let the browser provide whatever UX fits the user & their platform best.

Web devs are in a hurry to paper over the deficiencies of browsers but in doing so (and not fixing browsers), we end up with something worse and every goddamn website becomes a complex application. We're stuck in a worst-of-both-worlds state, where the browser runs applications that lack the power of desktop applications, yet are invasive, heavyweight -- don't run on grandmas old potato, complex & expensive to develop and maintain (every website shipping complex UI logic that should be part of the browser instead), increasingly less accessible and less reliable, less linkable & crawlable, less secure.. it's all I never wanted.


I'm not sure whether you care, but that page you linked to about form validation contains JS for some of the more advanced features someone might want to implement. Even then I'd say that it's wrong to think of input as "invalid" before I've started editing the field (unless of course I tried to submit it already), which is what this standard apparently does. I'm already annoyed with forms that mark a field as invalid because it's required the moment I focus on it; that would be much more annoying if several fields further down were also marked as invalid.

Your reaction will probably be that I'm missing the point, that if we went with a JS-less world there would be solutions for this. But I strongly suspect that the solution would be to not use HTML and instead use some other technology that was capable of general computation on the client.


I mean my premise is that we could have (and imho should have) categorically rejected "general computation on the client." In that scenario, the solution could look something like HTML5 form validation.

Nitpicking the details of how it is currently implemented is indeed beside the point. Ideally, the spec is made loose enough to give user agents & users the freedom to configure the behavior to their liking (and if someone can make the case for a particular behavior must be followed in some situations, then an optional attribute is added to "force" that behavior).

In general, I'm very tired of the status quo, which is that every site developer is responsible for providing good UX and people nag at them, when their preferences could be accommodated for by the browser itself. As long as the behavior stems from javascript, there's very little a browser can do to accommodate user preferences without breaking the web at large. You know, maybe I don't like form validation the way you'd implement it in JS.

People are so invested in the status quo that some of them even get angry when you e.g. suggest that they could use the browser's reader mode (instead of nagging at the site's author) to make a site readable for themselves. Bikeshedding about colours and fonts on front page HN postings happens all the time... of course, reader mode is a hack that fails very often, so disagreeing with that suggestion is somewhat justified. But really, we could've built the web around the user agent instead of vice versa, and then your web browser would be your reader mode by default. You could blame your browser vendor or yourself first of all if the colors and fonts (or input form validator behavior before you've entered anything) don't please you.


> If untrusted code is running on the same core/package/what have you, your security has already been breached.

This used to only be true on x86--IBM and DEC took security seriously and bitched about this incessantly.

Nobody cared. x86 was cheap.

Eventually everybody just threw up their hands and went to superscalar, deeply predicting, out-of-order microprocessor architectures because that got you better benchmarketing.

x86 was always insecure. It's just that nobody cared until The Cloud(tm). Malicious client-device Javascript just made it all worse.


Benchmarketing made me smile. Can't be forgotten now...


So, back to ASCII BBS browsing? Because any JS, weird Unicode, or media parsing is a potential occasion for untrusted code execution.


We can at least use ISO 8859-1...


Yes, please!


1. It's optional. So you don't have to do it

2. It's for the paranoid. So you don't have to do it

3. It's on context switches. These aren't that common, and reloading it will pull entire cache lines in from the L2 cache which is pretty quick anyway.

But this I agree with:

> If untrusted code is running on the same core/package/what have you, your security has already been breached.

The biggest untrusted sod to be running is usually the browser. Of course if you're worried about that, how about turning off JS thereby blocking the biggest attack vector in it? (and loads of bloody irritating behaviour, as a blissful bonus).

Edit: why the world-is-going-to-shit attitude I keep seeing everywhere? The worst possible interpretation is put upon everything, instead of evaluating the risk/reward rationally then choosing appropriate actions.


To answer the bonus: entitlement (to growth).

Same as:
- Volkswagen (dirty cheating cars)
- Samsung (exploding phones)
- Boeing (falling planes)

Eventually tech companies will join them. The signs are already there, the dirty growth tricks at Google and Amazon. It'll take more time.

The issue is that gains in performance (...environmental performance, form factor, battery density, etc.) are not linear anymore - the investments needed for further gains become increasingly expensive and time-consuming, and all the quasi-duopolists and too-big-to-fail national infrastructure companies are not able to grow more slowly - stock market dynamics would punish them, execs wouldn't get their entitled payday, politicians would lose jobs and tax income.

And so corners are cut (Boeing, Samsung, Intel) or performance tests are cheated (all of the above), and slowly the infrastructure of dependent industries (cloud, transportation) is built on shakier and shakier ground.

So why? Market concentration, entitlement, too big to fail dynamics, endless growth doctrine.


> It's on context switches. These aren't that common

A context switch probably happens thousands of times per second on modern systems. What do you mean by "not that common"?


Easily that much and more, but compare that to a CPU with, say, a 3 GHz clock, i.e. 3 billion ticks per second - it's not much.

Plus the other overheads of switching are already there - it's not cheap. I don't expect the overhead of reloading from L2 cache to add much (to repeat, I'm not an expert though).


> It's optional.

Until it's proven not to be so, through another PoC.

And then OP's point stands: a lot of Intel's performance gains since the 90s have come through out-of-order execution and branch prediction.

If those improvements are deemed incompatible with being able to securely run JS in your browser, I would argue Intel is having a very fundamental problem now.

Hopefully AMD does better, but I don't think they are entirely immune to this category of security-issues either.


No, it's optional in the sense of you choosing to enable it.

I also suspect the performance hit will be minimal, however I'm not an expert.

> If those improvements are deemed incompatible with being able to securely run JS in your browser, I would argue Intel is having a very fundamental problem now.

Well, it is, yet people are overwhelmingly willing to directly expose to the open internet a Turing-complete language controlled by some third party they know little or nothing about. The problem there has nothing to do with hardware. It's people.

(agreed about AMD)


I am hoping control of these mitigations is given to the user in terms of easy-to-control knobs. I don't have the same concerns as your bigco cloud provider and would like to run my home machines without these mitigations in place. (I do my browsing etc. on a laptop and have a workstation that I use mostly/purely for work.) In some benchmarks the 10700K has proven to be slower than the 9900K, and it is being speculated that mitigations in hardware are to blame. Well ... that is unfortunate.


Do any of the common browsers have an option to pin javascript to a dedicated CPU core?


I believe you can write a program to do so: http://man7.org/linux/man-pages/man2/sched_setaffinity.2.htm...

Just manually figure out all (or just the chosen) thread IDs spawned by the Chrome process.
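
A rough sketch of the pinning step itself (the TIDs would still have to be dug out of /proc/<pid>/task/ by hand, as described):

    /* Pin a given thread ID to a single CPU with sched_setaffinity(2).
     * Needs permission over the target task (same user or CAP_SYS_NICE). */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <tid> <cpu>\n", argv[0]);
            return 1;
        }
        pid_t tid = (pid_t)atoi(argv[1]);
        int cpu = atoi(argv[2]);

        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);                     /* allow only this one CPU */
        if (sched_setaffinity(tid, sizeof set, &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned tid %d to cpu %d\n", (int)tid, cpu);
        return 0;
    }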


Sadly that wouldn't help much since lots of important stuff happens on the same thread(s) that run JS. You also have the issue of javascript from website 1 trying to steal data from website 2, in that case pinning them both to your JS core wouldn't actually protect you from the attack.

Among other things there are now mechanisms for doing page rasterization (PaintWorklet) and sound synthesis (AudioWorklet) in JS, so the overhead involved in making all your JS share a single core becomes more dramatic. There's also lots of stuff out there that uses Shared Workers and Workers to do background computation and that won't be background anymore if you pin them all to your JS core.


Sounds like that will require a 3-year effort to rewrite the javascript engine in that case.


We? What "we?"

I think it's much better to lay the blame at the feet of the companies who made these decisions, rather than generically blaming everybody.

Theo de Raadt called out Intel in particular when they started taking all kinds of crazy shortcuts like this, all the way back in 2007.


That's the fault of cpu manufacturers. Specifically Intel. Nobody likes it, but the mitigations are necessary. We live in a software world that relies upon the Internet.

Intel should be held accountable. A fine every quarter they continue to put national security at risk.


I disagree completely with this comment. Processors are expected to execute hostile code without leaking memory between contexts. Even in the early days of computing, multi-terminal access to a computer was common.

We routinely rely on virtualization and containers for isolation. Even in this very browser you're using, you are supposed to expect untrusted hostile code to run in other tabs. Imagine going to a website and their JS is reading memory from your password manager -- would you then have a similar reaction?

Why do you need single-core performance so badly? I can't even think of an example where single-core performance has been an issue for me, and I routinely run tasks on decade-old CPUs. And even if it were an issue, you are essentially saying security should be an afterthought; sorry, but your reasoning is very dangerous. Would you get in a car whose engineer thinks, "if people are ramming your car on the freeway, you have bigger problems; let's focus on making it lightweight, fast and fuel efficient"?

And to be frank with you, people that have been doing admin/engineering work since the '90s with that attitude are a bigger security threat to most orgs (and themselves) than any hacker (with the exception of the few orgs/people that receive targeted attacks frequently). The days of treating security as a perimeter issue have been long gone for about a decade now. Whether it is network, system or software security, the entry points and perimeters have been rendered meaningless (look up zero trust; I think it applies here too).


> Why do you need single core performance so badly

Because on the cloud, performance is directly correlated to your monthly bills.


So is security; cloud cryptominer bots come to mind if you don't care about data security. I think you should take up billing issues with your provider: if the security mitigation slows down your processing, then I hope your bill drops proportionally as well.


> This flushing does address CVE-2020-0550 for snoop-assisted L1 data sampling but the main emphasis seems to be on the "yet to be discovered vulnerabilities."

I am unsure whether this is just idle speculation (heh) that there may be issues in this area, or whether there are issues that have been disclosed to vendors but not the public yet.


Yes, this is probably another coordinated disclosure where we see evidence in Linux kernel commits first. I noticed that Apple hasn't released the security credits page for iOS 13.5 yet; is that a normal delay or are we waiting on the disclosure of another processor bug (that presumably also affects ARM)?


Maybe, the notes will probably be published when macOS 10.15.5 drops (this week? But who knows if the new 0day in the unc0ver jailbreak will shake up things)


They're both out now.


Just as predicted :)


Interestingly, the security content for Xcode 11.5 is out already.


It looks like that's just updating Git to a new version to address an already-public bug. I guess they figured that there's no advantage to waiting to release that information.


It's really interesting. I remember that the Linux KAISER (now KPTI) patch appeared before the Meltdown disclosure.

https://lwn.net/Articles/738975/


Probably along the lines of OpenBSD disabling hyperthreading before specific issues were found. https://www.theregister.co.uk/2018/06/20/openbsd_disables_in...

It may seem over-cautious, but for most a default position of over-caution is the right one, and those that know how to play on the edge can compile their own kernel.

Given Linux is in so many devices and so many black-box, left-alone systems, it's not a bad default position to take -- plan for the worst, expect the best.


The performance implications of this would be huge, and I hope it remains opt-in for a long time.


I don't see how the implications will be huge.

This is L1d cache which is just 48kB for Ice Lake. We are also talking about context switches which are not happening very frequently. Applications that are generating load don't context switch all the time because they are busy doing work.

Then, when you context switch it is likely the context to which you are switching would like to use that cache for something. By the time we switch to your original thread it is very likely L1d has already been filled with something else.

I am pretty sure you would not notice anything except for very special, rare situations.


If the 48 kB of cache has 64-byte lines, then it has 768 lines. If a line takes 5.3 ns to fetch from the L2 cache [1], then that's ~4 microseconds to fetch all of them. It's not as if the processor will stop and do that after a context switch (and it can overlap the loads with other work etc), but that's roughly the order of magnitude of the cost of an L1 cache flush

[1] https://stackoverflow.com/a/4087331/116639


Nope.

The cost would be right if the cache was usable after context switch. Since it is likely stale, the new context will be pulling new data into cache as if nothing really happened.


Well, the question is how many of those lines are already stale because of the work done by the other processes during the context switch.


Throughput isn't the inverse of latency; the throughput of L1 <-> L2 is 1 line per cycle. If IA32_FLUSH_CMD exists, probably a better order-of-magnitude estimate is ~200ns for writing back dirty lines to L2 during the switch.


Uh what? Context switches happen all the time and it's not the applications that decide that, it's the kernel. It will preempt processes at its own discretion and the more that are running the more context switches will happen. So as more processes are running (or the same processes start doing more work) and/or interrupts increase, the more performance will be affected by the extra work having to be done each context switch. As a desktop user you might not notice but if you just invested into some new server iron and it suddenly performs 10% worse, I wouldn't take that lightly.


High-performance applications are typically run with anticipation of how the kernel does context switching and are generally designed to accommodate this; on the flip side, the kernel's job is to spend as much time executing as it can, and it will try not to switch when it can avoid it.


Even on loaded servers, modern systems tend to be running with at least one idle core virtually always. True context switches are very rare -- CPU-bound processes tend to keep their processor for seconds at a time. Obviously there are software architectures that are exceptions, but all the big ones tend not to switch much. (Which isn't surprising, as cache flush or no, switching has always been a slow process that software has tried to avoid.)


If context switches are not happening, then what are your processes doing? Shuffling memory around? Every time you do I/O, a context switch happens (disk, network, ...). If your processes are not hitting disk or network, what are they doing? Calculating something but keeping the results to themselves?


Some of this is a terminology problem: properly a "context switch" refers to the kernel switching control between two user processes on the same CPU. If all you're doing is taking an interrupt in the kernel and returning to whatever was interrupted, that's about half the work of a "context switch" on most architectures (though still expensive, obviously!).

But FWIW: most HPC computing is, in fact, "shuffling memory around", yeah. Very few architectures are actually interrupt bound, and the ones that are work very hard to address that (because hardware interrupt parallelism is an even harder nut to crack than context switch overhead).


A context switch not only happens when switching between processes, it also happens when your process does a syscall (so basically whenever it wants to do anything I/O related).


Syscalls are called a mode switch.

Edit: I wonder why the downvotes. Switches into and out of the kernel have never been called context switches; those happen between threads. I know no one who calls them a 'context' switch, as the context, i.e. the registers that belong to the thread/CPU core, remains the same.


This is what GP meant by a "terminology problem", but syscalls are much simpler than a real context switch. They certainly won't flush the L1d cache as a result of this patch.


Also, keep in mind that there is technology such as io_uring, which was recently (over the last year) added to Linux.

It provides a command-queue/response-queue dual-ringbuffer interface to the kernel, mostly providing benefits in terms of less per-IO-op overhead and offering non-blocking buffered disk IO.

It can work in a zero-syscall steady state after program startup for applications such as (for example) web servers.
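
A minimal sketch of that queue-pair model using liburing (assumes liburing is installed and the program is linked with -luring; error handling trimmed):

    /* Describe one read in the submission queue, submit it with a single
     * syscall, then reap the completion from the completion queue. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <liburing.h>

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);          /* SQ/CQ pair with 8 entries */

        int fd = open("/etc/hostname", O_RDONLY);
        char buf[256];

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);  /* queue the command */
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);            /* a busy server would poll instead */
        printf("read returned %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }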


> Then, when you context switch it is likely the context to which you are switching would like to use that cache for something. By the time we switch to your original thread it is very likely L1d has already been filled with something else.

The other thread may have been doing work with memory on a GPU. The other thread may already have a hot cache at another layer. It's definitely not an edge case, or else the L1d cache would not have been designed to maintain state between context switches in the first place. There are going to be consequences to this.


I mean it depends on when you need to do it, right? If there are vulnerabilities that lead to private kernel data leaking to userspace through the L1D, you’re talking about needing to wipe out your data cache on every system call, which might need to happen millions of times per second.

Also, context switches can be very frequent in some designs. For example, in microkernel systems you often have ping-ponging, with processes communicating with servers via RPC. Wiping out your whole L1D every time that happens could be pretty unpleasant.


Do interrupt handlers cause context switches?

Presumably more of a problem if all cores are busy, which is more likely if there are few cores. Also dependent on the number of interrupts (e.g. high network traffic of small packets etc). Presumably not a problem if there is an idle core that can run the interrupt code.


If you have many interrupts due to network packet load, you're doing something wrong. Interrupt-based handling is slower and less efficient (than polling) after some throughput that's iirc about a few hundred Mbit/s/core.


I think you are confusing Linux’s epoll with the hardware network interface. Some hardware offloads a lot of processing to dedicated network card processors, other network hardware might just use hardware interrupts for the driver module.

Either way, I am sure there are plenty of devices that can cause a lot of interrupts (USB?), not just network IO. Presumably there is a way to monitor the count of interrupts per second in Linux?


You can sample /proc/interrupts to see.
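
A crude sketch of doing that programmatically -- it sums the per-CPU counters on every IRQ line, twice, one second apart:

    /* Rough interrupts-per-second estimate from /proc/interrupts. The
     * parsing is deliberately naive: sum the numeric columns after each
     * IRQ label and stop at the textual description. */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static unsigned long long total_interrupts(void) {
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) { perror("/proc/interrupts"); exit(1); }
        unsigned long long total = 0;
        char line[4096];
        if (!fgets(line, sizeof line, f)) { fclose(f); return 0; }  /* CPU header row */
        while (fgets(line, sizeof line, f)) {
            char *p = line;
            while (*p && *p != ':') p++;           /* skip the IRQ label */
            if (*p) p++;
            while (*p) {                           /* sum the per-CPU counters */
                if (isdigit((unsigned char)*p))
                    total += strtoull(p, &p, 10);
                else if (isspace((unsigned char)*p))
                    p++;
                else
                    break;                         /* reached the description text */
            }
        }
        fclose(f);
        return total;
    }

    int main(void) {
        unsigned long long before = total_interrupts();
        sleep(1);
        unsigned long long after = total_interrupts();
        printf("%llu interrupts/sec (all CPUs)\n", after - before);
        return 0;
    }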


No, I didn't confuse those. I am aware that hardware offloading is still quite luxurious, but even then it's bad to spam interrupts.


> We are also talking about context switches which are not happening very frequently

That depends on how your software is written. If, for example, you're running a web server that uses a thread-per-connection, you'll be context switching all over. Hi Apache!


Well, you are right on that one. I may have stressed this one too much, I guess.

The real reason flushing L1d is not going to be noticed is that even without flushing the cache is unusable after context switch. It is highly unlikely the next thread that gets ownership of the core will require exactly the data present in L1d.

On a busy web server the two most frequent reasons to switch context will be:

1. The thread is waiting on I/O so it yields the rest of its time share back.

2. The thread has finished processing request.

Now, if you imagine a thread that just did a bit of I/O returning its time so that OS is switching context to another thread... it is very unlikely any of the data in L1d has any meaning or worth for the other thread. Anything that the next thread will do will require fresh data at least from L3.

So L1d is practically worthless and blanking it isn't going to do anything noticeable.

(I have intentionally omitted all the interrupts happening in the meantime and OS also using the cache which is the proverbial nail in the coffin when it comes to usability of L1d after context switch)


What do you mean by "not happening very frequently?" The default timeslice is something on the order of 100ms, isn't it? And that's if the process doesn't yield. Clearing L1d every 100ms (at worst) seems pretty frequent to me.


The cost of clearing cache on context switch has to be put in context (hey, pun intended:)

100ms is a huge amount of time and 48kB is a tiny, tiny part of what processor does during 100ms. Gigabytes of data can be transferred during that time, 48kB isn't really much.

As I have pointed out, that cache has very little value across a context switch anyway. The cost is removing data from the cache that would be usable after we have returned to the original context. But it is very likely the data in the cache is already for a completely different context and hence completely unusable.

Say you have apps A and B and OS.

You are running A, which has 48kB of data in L1d. It switches to the OS, which causes some of L1d to be evicted and puts its own data there. Then it switches to B, which is likely another process; this very likely causes the entire L1d to be evicted, unless B is an extremely small process. Then we come back to the OS and again to A. By the time you are at A, there is no data left from the original L1d state.

Cleaning L1d upfront on context switch is likely not hurting anything.


To further back this up with some math - L2 cache hits (what you'll hit on an L1D cache miss caused by clearing the L1D cache) are still in the mid/low single-digit nanosecond range[1]. Say flushing the L1D causes another 1000 L1D cache misses[2] - maybe we got really lucky and the next thread was hashing all the exact same data at the exact same time or something equally unlikely? That'd still put us in the mid/low single-digit microsecond range. On par with DDR4-1600 (12.8GB/s)'s 3.75us to read 48KB [3][4]. Let's more than double that and say it takes 10 microseconds = 0.01 milliseconds = 0.01% of 100 milliseconds.

Any noticeable perf overhead is going to come from the act of cache flushing taking some super slow path for some reason, or from much more frequent context switching than 100ms timeslices.

[1] https://stackoverflow.com/a/4087331

[2] 1000x 32-128B cachelines = 32-128KB, definitely in the ballpark to completely refill a 48KB L1D cache.

[3] https://en.wikipedia.org/wiki/DDR4_SDRAM#Modules

[4] https://www.wolframalpha.com/input/?i=48KB+%2F+12800+MB%2Fs


That sounds way too high. It's CONFIG_HZ, right? According to this[1], it can be 1ms to 10ms, with the default being 4ms.

[1]: https://github.com/torvalds/linux/blob/master/kernel/Kconfig...


On Windows it's around 1ms-15ms; I doubt on Linux it's very different. For reference 60fps gives 16.6ms.


> The default timeslice is something on the order of 100ms, isn't it?

Consider that a single core on a modern CPU running at 2 GHz can execute over 20k instructions in those 100ms.


20 Million instructions in 100ms. More if IPC is >1.


And that's why I shouldn't do math at night...

Anyway, 100ms is quite a lot in the life of a modern CPU.


Even 1ms is a lot. I have some experience with algorithmic trading. The application took messages off the network, processed them and responded to market within 5 microseconds. That's 1/200th of 1ms. This measured on a special type of switch (https://en.wikipedia.org/wiki/Cut-through_switching ).

Lots of stuff happens during those 5us. The message is read from the network device (directly by the application, no Linux or syscalls anywhere during those 5us). Then it is parsed, deduplicated (multiple multicast channels carry redundant copies of the messages), uncompressed (the payload is compressed with zlib), the uncompressed payload is parsed, interpreted (multiple types of messages). Business logic is executed to update state of the market in memory then to generate signals to listening algorithms. The algorithm is run to figure out whether it wants to execute an order. The order is verified against decision tree (for example to check whether it does not exceed available budget). The market order packet is created and sent over TCP.

Now imagine, all that stuff happens in 1/200th of 1ms. In comparison, transferring 48kB from L2 or L3 to L1 is pretty damn insignificant.


200 Million.


Yep, that’s what I meant! reminder to double check for typos when you’re correcting someone :)


Check your decimal point, you might want to add some zeroes.


L1D isn't _that_ bad I think. Most cache usage modeling I've seen assumes that L1 is completely stale across context switches anyway. The separate address spaces for user and kernel were probably a way worse hit on platforms that don't have ASIDs.

L2 would be absolute crazy town though.


Ideally one would desire an indefinitely large memory capacity … We are therefore forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible.

-- Burks, Goldstine, and von Neumann, "Preliminary discussion of the logical design of an electronic computing instrument," 1946.


How often do you not overwrite all of L1 cache in a new context anyway?


If you're doing some quick system call and that system call does not use much RAM, then you'll have your cache almost ready after its return. Now if that's in some tight loop, that might influence performance. Probably by single-digit percentages, but it's still degradation.


Last week Microsoft also rereleased the Intel microcode updates package [1][2]. I kinda expect to see a new CPU flaw in the next few days.

[1]https://support.microsoft.com/en-us/help/4497165/kb4497165-i...

[2]https://www.windowslatest.com/2020/05/21/windows-10-kb449716...


So with more cores and associated L1 caches, context switching would potentially matter less, I would have thought -- small, but maybe measurable.

Interestingly enough: https://www.theregister.co.uk/2020/05/24/linus_torvalds_adop...


Maybe it's time to redo the ring model now that multi-core CPUs are the norm.

Dedicating one or two CPU cores to the OS and locking them out of user space of any form would certainly be something worth exploring.


In general, dedicating some cores to userspace and some to kernel would mean sending a lot more data up to L3 for core-to-core communication, and I'm not sure that would be any better than flushing L1d. The exception would be with SMT, but then you're locked into a 1:1 ratio for user and kernel virtual cores, and still have to worry about side channel vulnerabilities.


That is true, which does put a whole new perspective on this -- will they flush L2/L3 next?

It would be nice, though, to have a properly isolated core or two for the OS; after all, that is exactly what is done for enclave-based security and management systems. Though not all of those have a great track record.


Or more rings. The problem is that we have this idea of running trusted and untrusted code and data in the same context as far as the CPU is concerned.

Add more fine grained memory partitions to let the memory hierarchy in on what you're doing.

Make Rings 1 & 2 Great Again


So.... when will we get part of our money back due to performance losses? (I'm talking about a case like Volkswagen.)


VW broke the law by deliberately defeating regulations which applied to them. Is there a regulation that applies to Intel and AMD which requires them to allocate die space and performance to improving security?

I'm not even sure they could have delivered equal performance at the high end if they had included these mitigations earlier. Whether to produce processors which are safer or faster depends on what customers prefer, even now. Not everyone needs ultimate security or wants to pay for it (in money or performance). So unless the law says less-secure processors must never be sold, this situation was and still is inevitable.


Intel knew their security was broken, and they were still selling those affected CPUs.

Products were recalled for smaller issues than that.


Not CPUs -- they're rarely recalled. The Xeon in my computer has 50 pages of errata. It's common knowledge that CPUs are terribly buggy.


If a car lock fails, those cars get recalled and the locks fixed.

https://www.recallmasters.com/mercedes-recalls-vehicles-defe...

Also, intel did replace CPUs with issues before:

https://en.wikipedia.org/wiki/Pentium_FDIV_bug

> On December 20, 1994, Intel offered to replace all flawed Pentium processors on the basis of request, in response to mounting public pressure.[5] Although it turned out that only a small fraction of Pentium owners bothered to get their chips replaced, the financial impact on the company was significant.[citation needed] On January 17, 1995, Intel announced "a pre-tax charge of $475 million against earnings, ostensibly the total cost associated with replacement of the flawed processors."[1] Some of the defective chips were later turned into key rings by Intel.[6]


You seem to be trying to litigate this on Hacker News, which does not have the authority to issue you a refund. If you feel strongly about this, sue Intel. Most of us will not be joining you because we either rent CPUs from Amazon or are out a tiny amount of money for our personal rigs. Yup, this is "how they get you" and someone should be a check and balance on a defective product. But most of us have no appetite to spend years or decades litigating this. We accept that our $200 wafer of silicon with 13nm features that can execute billions of mathematical operations is "good enough". Sometimes there are bugs. But we don't know how to make these things ourselves, so we deal with them and aren't really looking for a pound of flesh from Intel because the billions of instructions their CPUs can execute per second is a slightly lower number of billions.


> You seem to be trying to litigate this on Hacker News, which does not have the authority to issue you a refund. If you feel strongly about this, sue Intel. Most of us will not be joining you because we either rent CPUs from Amazon or are out a tiny amount of money for our personal rigs. Yup, this is "how they get you" and someone should be a check and balance on a defective product. But most of us have no appetite to spend years or decades litigating this.

Part of the initial stages of a class action is verifying that there does indeed exist a class.

There's value in him talking about his grievance publicly and not just rolling over when he gets screwed by a corporation, just because it can beat him one-on-one in a legal brawl.

> We accept that our $200 wafer of silicon with 13nm features that can execute billions of mathematical operations is "good enough". Sometimes there are bugs. But we don't know how to make these things ourselves, so we deal with them and aren't really looking for a pound of flesh from Intel because the billions of instructions their CPUs can execute per second is a slightly lower number of billions.

I also can't build a modern car. Or insulin. Or a million other devices in my life. The bar isn't "you can only complain if you can make it better yourself"; it's "you can complain if what was sold to you didn't meet its advertised specifications".


>part of our money back

A replacement CPU would make more sense. It should support the same operating systems and motherboards. I wonder if Intel would be asked by a lot of people to replace their CPUs.


If history is any indication: Intel allocated a $475M pre-tax charge against earnings when it did the Pentium FDIV replacements -- https://en.wikipedia.org/wiki/Pentium_FDIV_bug . But I assume most people are more likely to raise a ruckus over potentially invalid computations than over subtly leaky ones.


Does a replacement CPU with similar performance even exist?


[flagged]


Please consider the idea that the coronavirus threat may seem practically small precisely because of the ongoing mitigation efforts.


Thanks Intel.



