This is the L1d cache, which is just 48kB on Ice Lake. We are also talking about context switches, which are not happening very frequently. Applications that generate load don't context switch all the time, because they are busy doing work.
Then, when you context switch, it is likely that the context you are switching to will want to use that cache for something. By the time we switch back to your original thread, it is very likely L1d has already been filled with something else.
I am pretty sure you would not notice anything except for very special, rare situations.
If the 48 kB cache has 64-byte lines, then it has 768 lines. If a line takes 5.3 ns to fetch from the L2 cache [1], then that's ~4 microseconds to fetch all of them. It's not as if the processor will stop and do that after a context switch (and it can overlap the loads with other work, etc.), but that's roughly the order of magnitude of the cost of an L1 cache flush.
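To sanity-check that arithmetic (same figures as above, nothing measured here):

```python
# Back-of-envelope check of the refill estimate: 48 kB cache, 64-byte lines,
# ~5.3 ns per line from L2 (figures from the comment above, not measured).
cache_bytes = 48 * 1024
line_bytes = 64
lines = cache_bytes // line_bytes           # 768 lines
refill_ns = lines * 5.3                     # serial refill, no overlap
print(lines, round(refill_ns / 1000, 2))    # 768 lines, ~4.07 us
```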
The cost estimate would be right if the cache were still usable after a context switch. Since it is likely stale, the new context will be pulling new data into the cache as if nothing really happened.
Throughput isn't the inverse of latency; the throughput of L1 <-> L2 is 1 line per cycle. If IA32_FLUSH_CMD exists, probably a better order-of-magnitude estimate is ~200ns for writing back dirty lines to L2 during the switch.
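If throughput is the limit rather than per-line latency, the same 768 lines give a figure close to that ~200 ns. A quick sketch (the ~4 GHz clock is my assumption, not from the comment):

```python
# If L1<->L2 moves one 64-byte line per cycle, flushing/refilling the cache is
# bounded by throughput, not latency. Assuming a ~4 GHz core clock:
lines = 48 * 1024 // 64    # 768 lines
clock_ghz = 4.0            # assumed clock speed
cycles = lines             # 1 line per cycle
ns = cycles / clock_ghz    # 192 ns -- same order as the ~200 ns estimate
print(ns)
```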
Uh what? Context switches happen all the time, and it's not the applications that decide that, it's the kernel. It will preempt processes at its own discretion, and the more of them that are running, the more context switches will happen. So as more processes run (or the same processes start doing more work) and/or interrupts increase, performance will be affected more by the extra work that has to be done on each context switch.
As a desktop user you might not notice but if you just invested into some new server iron and it suddenly performs 10% worse, I wouldn't take that lightly.
High-performance applications are typically run with an understanding of how the kernel does context switching and are generally designed to accommodate it; on the flip side, the kernel's job is to spend as much time executing as it can, and it will try not to switch when it can avoid it.
Even on loaded servers, modern systems tend to be running with at least one idle core virtually always. True context switches are very rare -- CPU-bound processes tend to keep their processor for seconds at a time. Obviously there are software architectures that are exceptions, but all the big ones tend not to switch much. (Which isn't surprising: cache flush or no, switching has always been a slow process that software has tried to avoid.)
If context switches are not happening, then what are your processes doing then? Shuffling memory around? Every time you do I/O a context switch happens (disk, network, ...). If your processes are not hitting disk or network, what are they doing then? Calculating something but keeping the results for itself?
Some of this is a terminology problem: properly a "context switch" refers to the kernel switching control between two user processes on the same CPU. If all you're doing is taking an interrupt in the kernel and returning to whatever was interrupted, that's about half the work of a "context switch" on most architectures (though still expensive, obviously!).
But FWIW: most HPC computing is, in fact, "shuffling memory around", yeah. Very few architectures are actually interrupt bound, and the ones that are work very hard to address that (because hardware interrupt parallelism is an even harder nut to crack than context switch overhead).
A context switch not only happens when switching between processes, it also happens when your process does a syscall (so basically whenever it wants to do anything I/O related).
Edit: I wonder why the downvotes. Switches into and out of the kernel have never been called context switches the way switches between threads are. I know no one who calls them a 'context' switch, as the context, i.e. the registers that point to the thread/CPU core, remains the same.
This is what GP meant by a "terminology problem", but syscalls are much simpler than a real context switch. They certainly won't flush the L1d cache as a result of this patch.
Also, keep in mind that there is technology such as io_uring, which was recently (over the last year) added to Linux.
It provides a command-queue/response-queue dual-ringbuffer interface to the kernel, mostly providing benefits in terms of less per-IO-op overhead and offering non-blocking buffered disk IO.
It can work in a zero-syscall steady state after program startup for applications such as (for example) web servers.
> Then, when you context switch it is likely the context to which you are switching would like to use that cache for something. By the time we switch to your original thread it is very likely L1d has already been filled with something else.
The other thread may have been doing work with memory on a GPU. The other thread may already have a hot cache at another layer. It's definitely not an edge case, or else the L1d cache would not have been designed to maintain state between context switches in the first place. There are going to be consequences to this.
I mean it depends on when you need to do it, right? If there are vulnerabilities that lead to private kernel data leaking to userspace through the L1D, you’re talking about needing to wipe out your data cache on every system call, which might need to happen millions of times per second.
Also, context switches can be very frequent in some designs. For example, in microkernel systems you often have ping-ponging, with processes communicating with servers via RPC. Wiping out your whole L1D every time that happens could be pretty unpleasant.
Presumably more of a problem if all cores are busy, which is more likely if there are few cores. Also dependent on the number of interrupts (e.g. high network traffic of small packets etc). Presumably not a problem if there is an idle core that can run the interrupt code.
If you have many interrupts due to network packet load, you're doing something wrong. Interrupt-based handling is slower and less efficient (than polling) after some throughput that's iirc about a few hundred Mbit/s/core.
I think you are confusing Linux’s epoll with the hardware network interface. Some hardware offloads a lot of processing to dedicated network card processors, other network hardware might just use hardware interrupts for the driver module.
Either way, I am sure there are plenty of devices that can cause a lot of interrupts (USB?), not just network IO. Presumably there is a way to monitor the count of interrupts per second in Linux?
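On the last question: Linux exposes per-IRQ counters in /proc/interrupts, and sampling that file twice and diffing gives interrupts/second. A rough parsing sketch (run against a hard-coded, simplified sample so it works anywhere; on a real system you'd read the file, and the real layout has more columns):

```python
# Sum per-CPU interrupt counts per IRQ line from /proc/interrupts-style text.
# SAMPLE is a simplified stand-in; real output has one column per CPU plus
# chip/edge/device name columns at the end.
SAMPLE = """\
           CPU0       CPU1
  0:         44          0   IO-APIC    2-edge      timer
 24:      10000      12000   PCI-MSI    524288-edge eth0
"""

def irq_totals(text):
    totals = {}
    for line in text.splitlines()[1:]:      # skip the CPU header row
        parts = line.split()
        if not parts or not parts[0].endswith(":"):
            continue
        name = parts[0].rstrip(":")
        counts = []
        for tok in parts[1:]:
            if tok.isdigit():
                counts.append(int(tok))
            else:
                break                        # rest is chip/edge/device name
        totals[name] = sum(counts)
    return totals

print(irq_totals(SAMPLE))   # {'0': 44, '24': 22000}
```

Sampling `irq_totals(open("/proc/interrupts").read())` twice, a second apart, and subtracting gives the per-IRQ rate.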
> We are also talking about context switches which are not happening very frequently
That depends on how your software is written. If, for example, you're running a web server that uses a thread-per-connection, you'll be context switching all over. Hi Apache!
Well, you are right on that one. I may have stressed this one too much, I guess.
The real reason flushing L1d is not going to be noticed is that even without flushing the cache is unusable after context switch. It is highly unlikely the next thread that gets ownership of the core will require exactly the data present in L1d.
On a busy web server the two most frequent reasons to switch context will be:
1. The thread is waiting on I/O so it yields the rest of its time share back.
2. The thread has finished processing a request.
Now, if you imagine a thread that just did a bit of I/O yielding its time so that the OS switches context to another thread... it is very unlikely that any of the data in L1d has any meaning or worth to the other thread. Anything the next thread does will require fresh data, at least from L3.
So L1d is practically worthless and blanking it isn't going to do anything noticeable.
(I have intentionally omitted all the interrupts happening in the meantime, and the OS also using the cache, which is the proverbial nail in the coffin when it comes to the usability of L1d after a context switch.)
What do you mean by "not happening very frequently?" The default timeslice is something on the order of 100ms, isn't it? And that's if the process doesn't yield. Clearing L1d every 100ms (at worst) seems pretty frequent to me.
The cost of clearing cache on context switch has to be put in context (hey, pun intended:)
100ms is a huge amount of time, and 48kB is a tiny, tiny part of what the processor does during those 100ms. Gigabytes of data can be transferred in that time; 48kB isn't really much.
As I have pointed out, that cache has very little value across a context switch anyway. The cost is evicting data from the cache that would have been usable after we return to the original context. But it is already very likely that the data in the cache belongs to a completely different context and is hence completely unusable.
Say you have apps A and B and OS.
You are running A, which has 48kB of data in L1d. It switches to the OS, which causes some of L1d to be evicted and puts its own data there. Then it switches to B, which is likely another process; this very likely causes the entire L1d to be evicted, unless B is an extremely small process. Then we come back to the OS and again to A. By the time you are back at A, there is no data left from the original L1d state.
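The A -> OS -> B -> OS -> A story can be illustrated with a toy LRU cache model (a fully associative sketch; real L1 caches are set-associative, which makes eviction even easier to trigger, so this is the optimistic case):

```python
from collections import OrderedDict

# Toy fully-associative LRU "L1d" with 768 lines (48 kB / 64 B per line).
LINES = 768

def touch(cache, addr):
    cache.pop(addr, None)
    cache[addr] = True
    if len(cache) > LINES:
        cache.popitem(last=False)   # evict the least recently used line

cache = OrderedDict()
a_lines = [("A", i) for i in range(LINES)]
for addr in a_lines:                # A fills the whole cache
    touch(cache, addr)
for i in range(100):                # the OS touches a little of its own data
    touch(cache, ("OS", i))
for i in range(LINES):              # B brings in its own working set
    touch(cache, ("B", i))

survivors = sum(1 for addr in a_lines if addr in cache)
print(survivors)   # 0 -- none of A's lines are left when A runs again
```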
Cleaning L1d upfront on context switch is likely not hurting anything.
To further back this up with some math - L2 cache hits (what you'll hit on an L1D cache miss caused by clearing the L1D cache) are still in the mid/low single-digit nanosecond range[1]. Say flushing the L1D causes another 1000 L1D misses that are served from L2[2] - maybe we got really lucky and the next thread was hashing all the exact same data at the exact same time, or something equally unlikely? That'd still put us in the mid/low single-digit microseconds range. On par with DDR4-1600 (12.8GB/s)'s 3.75us to read 48KB [3][4]. Let's more than double that and say it takes 10 microseconds = 0.01 milliseconds = 0.01% of 100 milliseconds.
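Running the same numbers explicitly (all figures are the ones quoted above):

```python
# 1000 extra cache misses served from L2, at ~5 ns each:
l2_hit_ns = 5
misses = 1000
refill_us = misses * l2_hit_ns / 1000
print(refill_us)                    # 5.0 us

# DDR4-1600 at 12.8 GB/s reading 48 KB (the 3.75 us figure above treats
# 48 kB as 48,000 bytes; with 49,152 bytes it's 3.84 us -- same ballpark):
dram_us = 48 * 1024 / 12.8e9 * 1e6
print(round(dram_us, 2))            # 3.84 us

# A padded 10 us cost as a fraction of a 100 ms timeslice:
overhead_pct = 10e-6 / 100e-3 * 100
print(overhead_pct)                 # ~0.01 percent
```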
Any noticeable perf overhead is going to come from the act of cache flushing taking some super-slow path for some reason, or from much more frequent context switching than 100ms timeslices.
Even 1ms is a lot. I have some experience with algorithmic trading. The application took messages off the network, processed them and responded to market within 5 microseconds. That's 1/200th of 1ms. This measured on a special type of switch (https://en.wikipedia.org/wiki/Cut-through_switching ).
Lots of stuff happens during those 5us. The message is read from the network device (directly by the application, no Linux or syscalls anywhere during those 5us). Then it is parsed, deduplicated (multiple multicast channels carry redundant copies of the messages), uncompressed (the payload is compressed with zlib), the uncompressed payload is parsed, interpreted (multiple types of messages). Business logic is executed to update state of the market in memory then to generate signals to listening algorithms. The algorithm is run to figure out whether it wants to execute an order. The order is verified against decision tree (for example to check whether it does not exceed available budget). The market order packet is created and sent over TCP.
Now imagine, all that stuff happens in 1/200th of 1ms. In comparison, transferring 48kB from L2 or L3 to L1 is pretty damn insignificant.
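As a rough sense of scale for that claim, assuming the 1-line-per-cycle L2->L1 throughput mentioned elsewhere in the thread and a ~4 GHz clock (both assumed figures, not from this post):

```python
# Time to pull 48 kB (768 lines) back into L1 at 1 line/cycle, 4 GHz,
# compared against the 5 us hot-path budget described above.
lines = 48 * 1024 // 64
refill_us = lines / 4.0e9 * 1e6     # 768 cycles at 4 GHz
budget_us = 5.0
print(round(refill_us, 3), round(refill_us / budget_us * 100, 1))
# -> 0.192 us, about 3.8% of the 5 us budget
```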