This is the L1d cache, which is just 48kB on Ice Lake. We are also talking about context switches, which are not happening very frequently. Applications that generate load don't context switch all the time, because they are busy doing work.
Then, when you context switch, it is likely that the context you are switching to will want to use that cache for something. By the time we switch back to your original thread, it is very likely L1d has already been filled with something else.
I am pretty sure you would not notice anything except for very special, rare situations.
If the 48 kB cache has 64-byte lines, then it has 768 lines. If a line takes 5.3 ns to fetch from the L2 cache [1], then that's ~4 microseconds to fetch all of them. It's not as if the processor will stop and do that after a context switch (and it can overlap the loads with other work, etc.), but that's roughly the order of magnitude of the cost of an L1 cache flush.
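To sanity-check that arithmetic (same figures as above, nothing measured here):

```python
# Back-of-envelope check of the refill estimate: 48 kB cache, 64-byte lines,
# ~5.3 ns per line from L2 (figures from the comment above, not measured).
cache_bytes = 48 * 1024
line_bytes = 64
lines = cache_bytes // line_bytes           # 768 lines
refill_ns = lines * 5.3                     # serial refill, no overlap
print(lines, round(refill_ns / 1000, 2))    # 768 lines, ~4.07 us
```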
The cost estimate would be right if the cache were still usable after a context switch. Since it is likely stale, the new context will be pulling new data into the cache as if nothing really happened.
Throughput isn't the inverse of latency; the throughput of L1 <-> L2 is 1 line per cycle. If IA32_FLUSH_CMD exists, probably a better order-of-magnitude estimate is ~200ns for writing back dirty lines to L2 during the switch.
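If throughput is the limit rather than per-line latency, the same 768 lines give a figure close to that ~200 ns. A quick sketch (the ~4 GHz clock is my assumption, not from the comment):

```python
# If L1<->L2 moves one 64-byte line per cycle, flushing/refilling the cache is
# bounded by throughput, not latency. Assuming a ~4 GHz core clock:
lines = 48 * 1024 // 64    # 768 lines
clock_ghz = 4.0            # assumed clock speed
cycles = lines             # 1 line per cycle
ns = cycles / clock_ghz    # 192 ns -- same order as the ~200 ns estimate
print(ns)
```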
Uh what? Context switches happen all the time, and it's not the applications that decide that, it's the kernel. It will preempt processes at its own discretion, and the more of them that are running, the more context switches will happen. So as more processes run (or the same processes start doing more work) and/or interrupts increase, performance will be affected more by the extra work that has to be done on each context switch.
As a desktop user you might not notice but if you just invested into some new server iron and it suddenly performs 10% worse, I wouldn't take that lightly.
High-performance applications are typically run with an understanding of how the kernel does context switching and are generally designed to accommodate it; on the flip side, the kernel's job is to spend as much time executing as it can, and it will try not to switch when it can avoid it.
Even on loaded servers, modern systems tend to be running with at least one idle core virtually always. True context switches are very rare -- CPU-bound processes tend to keep their processor for seconds at a time. Obviously there are software architectures that are exceptions, but all the big ones tend not to switch much. (Which isn't surprising: cache flush or no, switching has always been a slow process that software has tried to avoid.)
If context switches are not happening, then what are your processes doing then? Shuffling memory around? Every time you do I/O a context switch happens (disk, network, ...). If your processes are not hitting disk or network, what are they doing then? Calculating something but keeping the results for itself?
Some of this is a terminology problem: properly a "context switch" refers to the kernel switching control between two user processes on the same CPU. If all you're doing is taking an interrupt in the kernel and returning to whatever was interrupted, that's about half the work of a "context switch" on most architectures (though still expensive, obviously!).
But FWIW: most HPC computing is, in fact, "shuffling memory around", yeah. Very few architectures are actually interrupt bound, and the ones that are work very hard to address that (because hardware interrupt parallelism is an even harder nut to crack than context switch overhead).
A context switch not only happens when switching between processes, it also happens when your process does a syscall (so basically whenever it wants to do anything I/O related).
Edit: I wonder why the downvotes. Switches into and out of the kernel have never been called context switches the way switches between threads are. I know no one who calls them a 'context' switch, as the context, i.e. the registers that point to the thread/CPU core, remains the same.
This is what GP meant by a "terminology problem", but syscalls are much simpler than a real context switch. They certainly won't flush the L1d cache as a result of this patch.
Also, keep in mind that there is technology such as io_uring, which was recently (over the last year) added to Linux.
It provides a command-queue/response-queue dual-ringbuffer interface to the kernel, mostly providing benefits in terms of less per-IO-op overhead and offering non-blocking buffered disk IO.
It can work in a zero-syscall steady state after program startup for applications such as (for example) web servers.
> Then, when you context switch it is likely the context to which you are switching would like to use that cache for something. By the time we switch to your original thread it is very likely L1d has already been filled with something else.
The other thread may have been doing work with memory on a GPU. The other thread may already have a hot cache at another layer. It's definitely not an edge case, or else the L1d cache would not have been designed to maintain state between context switches in the first place. There are going to be consequences to this.
I mean it depends on when you need to do it, right? If there are vulnerabilities that lead to private kernel data leaking to userspace through the L1D, you’re talking about needing to wipe out your data cache on every system call, which might need to happen millions of times per second.
Also, context switches can be very frequent in some designs. For example, in microkernel systems you often have ping-ponging, with processes communicating with servers via RPC. Wiping out your whole L1D every time that happens could be pretty unpleasant.
Presumably more of a problem if all cores are busy, which is more likely if there are few cores. Also dependent on the number of interrupts (e.g. high network traffic of small packets etc). Presumably not a problem if there is an idle core that can run the interrupt code.
If you have many interrupts due to network packet load, you're doing something wrong. Interrupt-based handling is slower and less efficient (than polling) after some throughput that's iirc about a few hundred Mbit/s/core.
I think you are confusing Linux’s epoll with the hardware network interface. Some hardware offloads a lot of processing to dedicated network card processors, other network hardware might just use hardware interrupts for the driver module.
Either way, I am sure there are plenty of devices that can cause a lot of interrupts (USB?), not just network IO. Presumably there is a way to monitor the count of interrupts per second in Linux?
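On the last question: Linux exposes per-IRQ counters in /proc/interrupts, and sampling that file twice and diffing gives interrupts/second. A rough parsing sketch (run against a hard-coded, simplified sample so it works anywhere; on a real system you'd read the file, and the real layout has more columns):

```python
# Sum per-CPU interrupt counts per IRQ line from /proc/interrupts-style text.
# SAMPLE is a simplified stand-in; real output has one column per CPU plus
# chip/edge/device name columns at the end.
SAMPLE = """\
           CPU0       CPU1
  0:         44          0   IO-APIC    2-edge      timer
 24:      10000      12000   PCI-MSI    524288-edge eth0
"""

def irq_totals(text):
    totals = {}
    for line in text.splitlines()[1:]:      # skip the CPU header row
        parts = line.split()
        if not parts or not parts[0].endswith(":"):
            continue
        name = parts[0].rstrip(":")
        counts = []
        for tok in parts[1:]:
            if tok.isdigit():
                counts.append(int(tok))
            else:
                break                        # rest is chip/edge/device name
        totals[name] = sum(counts)
    return totals

print(irq_totals(SAMPLE))   # {'0': 44, '24': 22000}
```

Sampling `irq_totals(open("/proc/interrupts").read())` twice, a second apart, and subtracting gives the per-IRQ rate.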
> We are also talking about context switches which are not happening very frequently
That depends on how your software is written. If, for example, you're running a web server that uses a thread-per-connection, you'll be context switching all over. Hi Apache!
Well, you are right on that one. I may have stressed this one too much, I guess.
The real reason flushing L1d is not going to be noticed is that even without flushing the cache is unusable after context switch. It is highly unlikely the next thread that gets ownership of the core will require exactly the data present in L1d.
On a busy web server the two most frequent reasons to switch context will be:
1. The thread is waiting on I/O so it yields the rest of its time share back.
2. The thread has finished processing a request.
Now, if you imagine a thread that just did a bit of I/O yielding its time so that the OS switches context to another thread... it is very unlikely that any of the data in L1d has any meaning or worth to the other thread. Anything the next thread does will require fresh data, at least from L3.
So L1d is practically worthless and blanking it isn't going to do anything noticeable.
(I have intentionally omitted all the interrupts happening in the meantime, and the OS also using the cache, which is the proverbial nail in the coffin when it comes to the usability of L1d after a context switch.)
What do you mean by "not happening very frequently?" The default timeslice is something on the order of 100ms, isn't it? And that's if the process doesn't yield. Clearing L1d every 100ms (at worst) seems pretty frequent to me.
The cost of clearing cache on context switch has to be put in context (hey, pun intended:)
100ms is a huge amount of time, and 48kB is a tiny, tiny part of what the processor does during those 100ms. Gigabytes of data can be transferred in that time; 48kB isn't really much.
As I have pointed out, that cache has very little value across a context switch anyway. The cost is evicting data from the cache that would have been usable after we return to the original context. But it is already very likely that the data in the cache belongs to a completely different context and is hence completely unusable.
Say you have apps A and B and OS.
You are running A, which has 48kB of data in L1d. It switches to the OS, which causes some of L1d to be evicted and puts its own data there. Then it switches to B, which is likely another process; this very likely causes the entire L1d to be evicted, unless B is an extremely small process. Then we come back to the OS and again to A. By the time you are back at A, there is no data left from the original L1d state.
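The A -> OS -> B -> OS -> A story can be illustrated with a toy LRU cache model (a fully associative sketch; real L1 caches are set-associative, which makes eviction even easier to trigger, so this is the optimistic case):

```python
from collections import OrderedDict

# Toy fully-associative LRU "L1d" with 768 lines (48 kB / 64 B per line).
LINES = 768

def touch(cache, addr):
    cache.pop(addr, None)
    cache[addr] = True
    if len(cache) > LINES:
        cache.popitem(last=False)   # evict the least recently used line

cache = OrderedDict()
a_lines = [("A", i) for i in range(LINES)]
for addr in a_lines:                # A fills the whole cache
    touch(cache, addr)
for i in range(100):                # the OS touches a little of its own data
    touch(cache, ("OS", i))
for i in range(LINES):              # B brings in its own working set
    touch(cache, ("B", i))

survivors = sum(1 for addr in a_lines if addr in cache)
print(survivors)   # 0 -- none of A's lines are left when A runs again
```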
Cleaning L1d upfront on context switch is likely not hurting anything.
To further back this up with some math - L2 cache hits (what you'll hit on an L1D cache miss caused by clearing the L1D cache) are still in the mid/low single-digit nanosecond range[1]. Say flushing the L1D causes another 1000 L1D misses that are served from L2[2] - maybe we got really lucky and the next thread was hashing all the exact same data at the exact same time, or something equally unlikely? That'd still put us in the mid/low single-digit microseconds range. On par with DDR4-1600 (12.8GB/s)'s 3.75us to read 48KB [3][4]. Let's more than double that and say it takes 10 microseconds = 0.01 milliseconds = 0.01% of 100 milliseconds.
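Running the same numbers explicitly (all figures are the ones quoted above):

```python
# 1000 extra cache misses served from L2, at ~5 ns each:
l2_hit_ns = 5
misses = 1000
refill_us = misses * l2_hit_ns / 1000
print(refill_us)                    # 5.0 us

# DDR4-1600 at 12.8 GB/s reading 48 KB (the 3.75 us figure above treats
# 48 kB as 48,000 bytes; with 49,152 bytes it's 3.84 us -- same ballpark):
dram_us = 48 * 1024 / 12.8e9 * 1e6
print(round(dram_us, 2))            # 3.84 us

# A padded 10 us cost as a fraction of a 100 ms timeslice:
overhead_pct = 10e-6 / 100e-3 * 100
print(overhead_pct)                 # ~0.01 percent
```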
Any noticeable perf overhead is going to come from the act of cache flushing taking some super-slow path for some reason, or from much more frequent context switching than 100ms timeslices.
Even 1ms is a lot. I have some experience with algorithmic trading. The application took messages off the network, processed them and responded to market within 5 microseconds. That's 1/200th of 1ms. This measured on a special type of switch (https://en.wikipedia.org/wiki/Cut-through_switching ).
Lots of stuff happens during those 5us. The message is read from the network device (directly by the application, no Linux or syscalls anywhere during those 5us). Then it is parsed, deduplicated (multiple multicast channels carry redundant copies of the messages), uncompressed (the payload is compressed with zlib), the uncompressed payload is parsed, interpreted (multiple types of messages). Business logic is executed to update state of the market in memory then to generate signals to listening algorithms. The algorithm is run to figure out whether it wants to execute an order. The order is verified against decision tree (for example to check whether it does not exceed available budget). The market order packet is created and sent over TCP.
Now imagine, all that stuff happens in 1/200th of 1ms. In comparison, transferring 48kB from L2 or L3 to L1 is pretty damn insignificant.
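As a rough sense of scale for that claim, assuming the 1-line-per-cycle L2->L1 throughput mentioned elsewhere in the thread and a ~4 GHz clock (both assumed figures, not from this post):

```python
# Time to pull 48 kB (768 lines) back into L1 at 1 line/cycle, 4 GHz,
# compared against the 5 us hot-path budget described above.
lines = 48 * 1024 // 64
refill_us = lines / 4.0e9 * 1e6     # 768 cycles at 4 GHz
budget_us = 5.0
print(round(refill_us, 3), round(refill_us / budget_us * 100, 1))
# -> 0.192 us, about 3.8% of the 5 us budget
```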