Two new ways to read a file quickly (lwn.net)
181 points by chmaynard on March 6, 2020 | 89 comments


I'm not a kernel hacker, but the Linux kernel now has an in-kernel VM (BPF), doesn't it? As I understand, it started with letting you load programs into the kernel to filter network packets, but now it allows several other things.

So, readfile() combines 3 syscalls into 1 by manually implementing a new system call. Would another approach be to make these same calls (openat(), read(), and close()) and some others available to BPF programs?
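For concreteness, here's roughly what the proposed call collapses (the readfile() signature below is paraphrased from the patch discussion, so treat it as approximate):

    #include <fcntl.h>
    #include <unistd.h>

    void example(void)
    {
        char buf[256];
        ssize_t n;

        /* Today: three syscalls. */
        int fd = openat(AT_FDCWD, "/proc/self/stat", O_RDONLY);
        n = read(fd, buf, sizeof(buf));
        close(fd);

        /* Proposed: one syscall. No glibc wrapper exists, so a real test
         * would go through syscall(2); signature paraphrased from the patch. */
        n = readfile(AT_FDCWD, "/proc/self/stat", buf, sizeof(buf), 0);
        (void)n;
    }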

The idea is to provide a more general mechanism, rather than solving this one case. Obviously, the kernel change would need to be feasible, and the result would need to be safe and fast.

Then a user mode program could effectively just write its own readfile(), or some other variation in case that exact sequence of calls isn't what it needs.

(An even more out-there idea is to have the kernel auto-detect situations like this, examine the native code, then after verifying that it's simple and harmless, just move a whole chunk of code or tight loop into the kernel. In effect, it would dynamically generate a new syscall.)


I remember a paper decades ago now about a way to reduce RPC call round-tripping by sending something akin to promises back and forth, so you could pass the result of one remote call to another one without transferring it across the network. Kinda think something similar might be useful here.

What typically ends up happening in “services” (or any distributed system) is that you start creating convenience APIs to speed things up that are mostly just combinations of other calls, and you stay with the ones that seem like they might be the most useful, so you don’t slam into combinatorics face first.

Given the amount of talk I’ve heard about system call overhead lately, it sounds like the magnitude of the delays is creeping up again (especially post-Meltdown).


I don't know if this is what you were thinking of, but capnproto[0] does exactly that with a promise-like interface.

Actually this may be the paper you were referencing, from the bottom of the capnproto page:

> Cap’n Proto’s RPC protocol is based heavily on CapTP[1], the distributed capability protocol used by the E programming language[2]. Lots of useful material for understanding capabilities can be found at those links.

[0] https://capnproto.org/rpc.html

[1] http://www.erights.org/elib/distrib/captp/index.html

[2] http://www.erights.org/index.html


Very interesting, but if memory serves, Microsoft had this idea maybe in ‘96. Though patent expiry-wise, the timing of cap’n proto might make a bit of sense.


E is from '97 and built on top of ideas implemented before that with extensions to Java; not sure of the date, but could easily predate '96.


Java was still pretty brand spanking new in ‘96. And RMI wasn’t inflicted on us until ‘97. But half the stuff we use was invented in the 70’s and explored in the 80’s anyway.

The classes in college that I have used the most by far are the series that covered logic and set theory, and the one on distributed computing. The latter does take some of the fun out of people rediscovering things though.

IIRC, Berkeley had an OS with swarm computing and process migration around ‘87. It wasn’t until 20 years later that the speed of disk vs networking had a similar imbalance again. And of course the mainframe guys must constantly think the rest of us are idiots.


Oh I know, I started programming in '92 or so.

The idea of creating a program and sending it across an abstraction boundary for execution to avoid the cost of continually navigating the abstraction boundary for each step in the program is reasonably trivial. It was an approach I used in my first job for optimizing access to data in serialized object heaps - rather than deserialize the whole session object heap for each web request (costly), I'd navigate the object graph in the binary blob, but this required following offsets, arrays etc. The idea of a 'data path' program for fetching data occurred almost immediately.

In the end a different approach using very short-lived objects which were little more than wrappers around offsets into the heap combined with type info turned out to be more usable, and with generational GC, was plenty fast too; the abstraction stack was thin, the main cost avoided was materializing a whole heap.


Everyone is doomed to gradually rediscover QNX while never reinventing it.


The extent to which computer programmers literally don't read anything about computing history or alternative systems is quite funny. I can't wait until, sometime in the next few years, someone announces they're "modernizing the cloud operating system" and "unlocking the synergies between microservices" or whatever by... reinventing MsgSend, which we all knew about 30 years ago at this point.

To save you some time and test my prediction powers: they'll get $50 million in funding and almost flop until they find a way to stick Linux inside of a virtual machine (so it can act like a glorified driver for you, for device compatibility, because it turns out burning cash on trying to replicate 10 million lines of drivers isn't smart). It will grow Docker compatibility at some point. 90% of the development will come from one group of ~6 people. Then they'll get bought out and the project will die overnight. The end.


I wish some day that I could be that nonsense person and Dunning-Kruger my way to mad money by using my powers of selling the past as the future.


Batching and pipelining calls is a quite common design in RPC, especially databases. MongoDB, for example, has a very easy to use aggregate([op1, op2, ...]) function which you use to build a pipeline of generic operations that all data flows through on the server side, e.g. the input to op2 is the output of op1, and so on.

It doesn't allow any client logic between op1 and op2 if that's what you meant though?


> It doesn't allow any client logic between op1 and op2 if that's what you meant though?

Yeah, that seems to be the trick, doesn’t it? Is the next step beyond pipelining (a) providing a way to run limited code on the server end, or (b) reformulating the API so that these situations don’t occur as readily?

Take a classic I/O pattern. Give me 5k of data from X, write all 5k out to Y. Except that’s not what actually happens. It tells you what it wrote and you have to do a conditional branch and arithmetic. I still recall feeling shocked and a little scandalized when I learned that these APIs worked like this. First, why didn’t you just do what I asked, and second, why am I the one held responsible for all of the bookkeeping when you are the one selectively following instructions?
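The bookkeeping in question is the classic write-all loop, roughly:

    #include <errno.h>
    #include <unistd.h>

    /* write() may stop short, so the caller has to loop and track how much
     * has actually gone out (sketch of the usual pattern). */
    ssize_t write_all(int fd, const char *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            ssize_t n = write(fd, buf + done, len - done);
            if (n < 0) {
                if (errno == EINTR)
                    continue;        /* interrupted: retry the same chunk */
                return -1;
            }
            done += (size_t)n;
        }
        return (ssize_t)done;
    }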

I think one of the features of completion ports was a slightly better, “go do this and don’t bother coming back until you have”.


In a client server application I designed as part of a previous job we did exactly this.

An RPC call was a list of functions and arguments, and there was a special tag that could refer back to a previous return value in the same set of calls.

All calls in the same request were executed in the same transaction. The main goal was to avoid using distributed transactions, but it also helped reduce latency over slow connections.
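Purely to illustrate the shape of the idea (hypothetical types, not the actual format we used):

    /* Hypothetical illustration: one request carrying several calls, where
     * an argument is either a literal or a reference back to the return
     * value of an earlier call in the same batch. */
    enum arg_kind { ARG_LITERAL, ARG_RESULT_OF };

    struct rpc_arg {
        enum arg_kind kind;
        long value;       /* literal argument, if kind == ARG_LITERAL        */
        int  call_idx;    /* earlier call's index, if kind == ARG_RESULT_OF  */
    };

    struct rpc_call {
        const char     *func;      /* server-side function to invoke */
        struct rpc_arg  args[4];
        int             nargs;
    };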


I think my comment got lost because I can’t find it, but I said something like: maybe we need a formal macroassembler for IPC/RPC calls instead of running “code” directly in the kernel.


It looks like inventing GraphQL for the kernel:

On the web: instead of combining the results of several RESTful API calls on the client, send a GraphQL request and combine the REST calls on the server.

On a host: instead of combining results of several syscalls in user-space, send a program to an in-kernel interpreter that would combine the syscalls kernel-side aka define a syscall such as readfile() on the fly.


Yeah, I had that thought while writing up my first follow-up, but the composition here is between multiple, small endpoints, whereas GraphQL is subsetting one large endpoint.

Maybe you could model kernel calls as a single large subsystem that allows multiple questions, but GraphQL seems a lot less imperative than syscalls, so I’m having trouble picturing how that would translate.


> I remember a paper decades ago now about a way to reduce RPC call round-tripping by sending something akin to promises back and forth, so you could pass the result of one remote call to another one without transferring it across the network. Kinda think something similar might be useful here.

Those shortcuts are generally about the ability to batch independent concurrent operations though e.g. instead of

    call(foo)
    call(bar)
    call(baz)
you can

    multicall(foo, bar, baz)
which goes from 3 sequential round-trips to 1, without increasing "instantaneous" server load as much as performing concurrent calls (promise-style) would.

Here this is more of a pipeline desire and the issue there is it requires some error handling and adaptation between the calls, as the various functions are not always compatible (you often can't feed the output of one as the input of the next).


In theory, the JVM could allow users to upload ‘arbitrary’ code, where the configuration of the JVM (custom class loader) enforces various restrictions as to what that ‘arbitrary’ means.

I have never seen that implemented, though, possibly because of the experience with the increase in attack surface Java applets gave, or because it is hard to restrict memory and CPU usage of uploaded code.


I heard a presentation on Jini when it was young, and this was something they very much wanted to do, but I believe it got watered down to mostly sending code from the server to the client and not so much the other way around.

This is reminding me of a research hardware project from the same era where they took NUMA design about as far as you could go; they embedded simple (vector?) computing directly onto the RAM chips, so the code goes to the data instead of the other way around. One of their demos was splitting the screen up into cells and calculating pixels in parallel directly on the memory chips.


eBPF isn't safe to load without root privilege. This means no containers and no user processes. From what I've seen so far, there is fear that letting Turing-complete programs into the kernel would make some future CPU/hardware vulnerabilities more dangerous. While the verifier makes sure the program returns, it cannot make sure it isn't nefarious.

ref: https://lwn.net/Articles/808048/


eBPF isn't fundamentally unsafe (and, in particular, it's not Turing-complete, precisely because it's guaranteed that all eBPF programs terminate). It's just very ambitious and it exposes too much attack surface for people to feel comfortable making it available to unprivileged users.

However, we do have a similar system exposed to unprivileged users - classic BPF. You could imagine writing classic BPF programs to chain multiple system calls with just one context switch.


> it's not Turing-complete

On their own, but there is more than 1 way to insert BPF programs.

If there is a key lesson from ROP-style exploits, it is that you can assemble surprising constructs out of thin air. Linux 5.6 adds a way for eBPF programs to write vtables. That one alone is an exploit waiting to happen. You just have to connect the calls just right and poof, eBPF programs calling each other: Turing completeness achieved.


> On their own, but there is more than 1 way to insert BPF programs.

And the recursion, even tail recursion, in BPF is limited for that very reason.

All eBPF programs have a statically verifiable worst case execution time.


My point is that this limitation can be bypassed if you are smart enough. The Linux kernel is more complex than any single BPF program; it has side effects.

For example, let's say that I am an attacker who wishes to execute a loop to brute-force a key using a new side channel from kernel space. eBPF programs are now available to userspace with "insert new shiny here". I add one as a performance tracepoint, then one in the packet filter, then one modifying a vtable somewhere. Now, I manage to make them oscillate and call each other while never leaving kernel space. I now "deadlock" the kernel with my small BPF programs until I manage to extract what I am after, then let the "deadlock" go away.

It takes only one bug to make this happen. Each BPF program individually returns, but only to be called again. If you let users put these in the kernel, this will happen eventually. This kind of trick has been used over and over with other technologies.


One can easily imagine a restricted subset of eBPF that is not Turing complete and could be safe for unprivileged users to use.


That’s what BPF is already. It’s not Turing complete. Unbounded loops and backward branching are not allowed.

The problem isn’t Turing completeness. It’s that there are too many micro-architectural side channels and unknowns.


They currently go out of their way to remove microarchitectural side channels. Yes, there are almost certainly others, but it's pretty clear that the end goal is to allow non-root creation of BPF programs.


Is it really possible to remove the side channels? Even if you formally prove that it does not access memory outside of authorized regions, that doesn't prevent it from speculatively accessing unauthorized memory.

We've seen Spectre exploits through sandboxed Javascript. Unless you separated the eBPF VM address space from the kernel address space, I suspect you would have the exact same problems that browsers have had. But if you did put them in separate address spaces, then why bother with eBPF? You could run a normal userspace program.


eBPF right now issues fence instructions before computed-address loads, in order to manually shut down speculation. They also do a cute trick of computing the address with a mask so that if a speculative access does happen, the speculatively computed offset is still in the valid range.
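Conceptually the mask trick is just (a sketch of the idea, not the actual verifier code):

    /* Clamp a possibly mispredicted index so that even a speculative load
     * stays within the array's backing storage; index_mask is sized to the
     * array. Sketch of the idea only, not actual kernel code. */
    static inline unsigned int bounded_index(unsigned int index,
                                             unsigned int index_mask)
    {
        return index & index_mask;
    }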


That's a very good point, and thank you for correcting me.


One possibility is integrating io_uring with bpf: https://lwn.net/Articles/810414/


BPF is not run inside of a virtual machine (the same way that Java is). It is not safe to load untrusted code into it.


> verifying that it's simple and harmless

I look forward to your PhD thesis disproving Rice's theorem


Funny thing about Rice's theorem. It says that semantic properties (such as halting) are undecidable in general. That says nothing about specific programs. It's easy to prove that certain programs halt; it's easy to prove that others never halt. In fact, there are whole classes of programs for which it is trivial to prove halting (e.g. bf programs that do not contain [ or ]), or non-halting (e.g. bf programs that contain a single + and the remaining instructions are [ or ]).

One need not waste their time proving an impossible generality along the path to proving a specific possibility.


Yes, but if you enforce that the program be decidable, as part of a dynamic kernel API, then you are no longer offering a Turing-complete language. Are there so many use cases that would benefit from this nonetheless, that it wouldn't be better to just hard-code them?


That's not how I (generously) read GP's suggestion. It could be worthwhile to implement heuristics that can be used to prove certain classes safe; and those could be lifted to a "new syscall". If the heuristics give a non-result, gracefully fall back to not pulling userspace crap into the kernel.

Not saying I'd be interested in doing this work, but there's a vast middle ground between "the halting problem is undecidable" and "we can never know if while(1); will halt"


Exactly. That's what I meant. I tried to hint at that by saying you'd verify that it's simple and harmless, meaning you solve the simple cases and leave the others as-is. But then I also described it as a more general mechanism, and I guess people ran with that and kinda understandably interpreted it to mean a fully general mechanism.

About the halting problem, I'm not necessarily convinced that prevents this from being done. At least not in theory. (This is getting impractical, but it's interesting to think about.)

Let's say I have an infinite loop in user space. I then move the infinite loop to kernel space. What specific harm have I caused? I already had an infinite loop; now I still have one.

Though there might be some consequences other than burning CPU cycles. I think "kill -9" may not take effect until system calls return. Also, there's such a thing as scheduler priorities and CPU quotas, and those might be impacted.

Of course it would always be possible to just exclude anything with a loop from what you try to move into the kernel. Or it might also be possible to handle infinite loops with some kind of complicated thing involving a timeout and restarting system calls.


BPF isn't Turing complete, in very interesting ways that allow it to still be fairly general (but obviously not completely general).


No computer is actually Turing-complete, since no computer has infinite memory.

Math is different from reality.


The snark was perhaps a touch excessive, but this is indeed an excellent reason why the radical suggestion proffered shall remain radical.

(isn't this basically just a kernel module?)


It's very easy to misinterpret what a theorem means. And doing this can derail a conversation for decades.

I remember two such cases.

The theorem that two hidden layer neural networks can approximate any function - Common interpretation: two layers are enough.

And a proof (I don't recall the details) that was interpreted as imaging being fundamentally limited by wavelength. Then came super-resolution, which worked around it by not relying on some of the implicit assumptions of the common interpretation.

So I think it's justified to take issue.


If the problem is slow switching between user space and kernel space, why not have a syscall that allows you to batch process multiple syscalls?

It might work like this: user space allocates a buffer, fills it with syscall numbers and arguments, then passes it to the kernel. The kernel executes each syscall in series, placing the return values in the array originally containing arguments. This syscall would not return to user space until all syscalls in the buffer are processed.
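One entry in that buffer might look something like this (names made up, just to illustrate the idea):

    /* Hypothetical layout for one batched syscall: the kernel would walk an
     * array of these, execute each in order, and write back the result. */
    struct syscall_entry {
        long nr;          /* syscall number, e.g. __NR_openat        */
        long args[6];     /* up to six arguments, as with syscall(2) */
        long result;      /* filled in by the kernel on completion   */
    };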

Does this idea have any potential?


This is basically the idea of io_uring [0], although that mechanism is specific to certain I/O operations rather than being a generic queuing system for syscalls.

[0] https://lwn.net/Articles/776703/


Before io_uring there were many failed proposals for a general purpose, asynchronous interface to wrap blocking syscalls. One of these, "syslets", was literally a mechanism for batching arbitrary syscalls: https://lwn.net/Articles/221887/

> Syslets are small, simple, lightweight programs (consisting of system-calls, 'atoms') that the kernel can execute autonomously (and, not the least, asynchronously), without having to exit back into user-space. ....

> Syslets consist of 'syslet atoms', where each atom represents a single system-call. These atoms can be chained to each other: serially, in branches or in loops.


The user api: https://people.redhat.com/mingo/syslet-patches/patches/async...

User space memory mapping (naturally left to the user) seems to be the error prone bit here.

search for "sys_umem" here: https://people.redhat.com/mingo/syslet-patches/patches/async...

I understand that this is low level stuff for experienced developers, but this approach basically asks the programmer to play linker. Entirely doable but I wonder what it looks like at scale (LOC).


Thanks for sharing. It's encouraging that something at least resembling what I came up with was considered.


Any thoughts on why they failed?


I don't think the kernel complexity and maintenance burden was deemed worth the effort. But now 1) Spectre has made syscalls more expensive and 2) Linux has destroyed all the competition, so there's a lot more pressure to add features for what would effectively only ever be used by niche use cases--you don't add io_uring to benefit the Node.js crowd, you add it for Amazon and Netflix and other consumers that have no other avenues for optimization.

To be fair, io_uring seems to be a pretty sane design. But, for example, for many years anything with a shared ring buffer between userspace and the kernel was suspect. Now there are several such interfaces. Overall tolerance has grown, especially on the part of Linus. The kernel is much larger, and Linus seems to have become much more deferential to subsystem maintainers as understanding and opining on all of the code has slipped beyond any one person's grasp.


Long-term, io_uring is almost certainly going to provide more operations than just I/O.


Batched virtual memory updates (mmap/munmap/madvise) could be a boon for allocators and garbage collectors since the cost of TLB shootdowns could be amortized.

Similarly if all the syscalls done between fork and exec could be squeezed into a single uring chain without any intervening userspace code then the kernel could skip some costly virtual memory operations, bringing us closer to posix_spawn behavior.


> Batched virtual memory updates (mmap/munmap/madvise) could be a boon for allocators and garbage collectors since the cost of TLB shootdowns could be amortized.

And ideally also the cost of mmap_sem (the lock for an address space's memory map), which can be one of the most contended locks in a workload.

> Similarly if all the syscalls done between fork and exec could be squeezed into a single uring chain without any intervening userspace code then the kernel could skip some costly virtual memory operations, bringing us closer to posix_spawn behavior.

I'd love to see a way of doing this with io_uring.


The problem is that the calls are related. The first gives you the file handle, the second reads it, the third closes it. You need the handle in advance to queue the second and third.

This idea reminds me of pipelined Redis commands. It does work to boost total throughput, but it’s significantly more complicated to use in practice.


Imagine a method like this:

    1. Enqueue A = openat(...)
    2. Enqueue B = read(A, buf, sz)
    3. Enqueue close(A)
    4. Commit, return B to userspace
A and B are placeholders, and A is never known to userspace, it is simply chained within the enqueued commands.
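With liburing this might look roughly like the following, assuming something like the "specific fd" mechanism from the article so the fd number is known before submission (a plain IOSQE_IO_LINK chain cannot pass openat's result on to the linked read):

    #include <fcntl.h>
    #include <liburing.h>

    /* Sketch only: assumes the fd number can be pinned ahead of time (the
     * "specific fd" idea from the article), which stock io_uring at the
     * time of writing does not provide. */
    int queue_readfile(struct io_uring *ring, const char *path,
                       char *buf, unsigned len, int reserved_fd)
    {
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_openat(sqe, AT_FDCWD, path, O_RDONLY, 0);
        sqe->flags |= IOSQE_IO_LINK;      /* run next SQE only if this succeeds */

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_read(sqe, reserved_fd, buf, len, 0);
        sqe->flags |= IOSQE_IO_LINK;

        sqe = io_uring_get_sqe(ring);
        io_uring_prep_close(sqe, reserved_fd);

        return io_uring_submit(ring);     /* one syscall submits all three */
    }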


What if openat() or read() fails? More generally, what if there's a chain of syscalls and you _want_ one of them to fail?


With the "specific fd" approach, if openat fails, then read and close will harmlessly fail with EBADF ("Bad file descriptor"). Once you process the failure of openat, you'll just ignore the failure of read and close.


What will happen, though, if the fd is already used by some other area of userspace? Like, what if some lib you link to also happens to hardcode the use of the same fd? Sounds like an unfunny problem to debug.

In the case of uring I suppose it is possible to link the calls to fail at the first failure.


> What will happen though, if the fd is already used by some other area of userspace?

That's what the min_fd patch is for, and potentially other systems for reserving blocks of fds. Much like memory, you'll ask the kernel for a block of it and then do your own allocations out of that.


Ah, it will be possible to allocate multiple blocks of file descriptors? Ok, then things start to make sense.


This seems exactly the io_uring approach with magic fd described in the article.


> You need the handle in advance to queue the second and third.

Think about how pipelined commands work in a command shell if you want to understand how to construct a program that can define a sequence of operations on objects before they're defined.


> You need the handle in advance to queue the second and third

whoops. I probably should have thought of that.


Are there any mainstream databases etc adopting io_uring yet?

I first heard about io_uring here on HN and a rust DB lib called Sled was used as an example.

But when do we get fast MySQL etc?

I recall a paper from 2012 or so that implemented a PoC syscall buffer so a bunch of syscalls could be scheduled and executed sequentially with just one syscall. They claimed 40%+ performance improvements in various programs, including MySQL. This was of course before meltdown and other cache mitigations were a thing.

So programs, including MySQL and Postgres (hey, I know something about various DB storage engines), would benefit massively from scheduled sequential syscalls (easy to adopt, big win) and true async io (io_uring, harder to adopt, even bigger win).


At that point, wouldn't it be simpler to run mysql or postgresql on their own bare metal VM with a specialized uni-kernel that only runs the db process very efficiently (without notion of kernel space and user space and without context switching) ?


I am working on optional io_uring support for PostgreSQL. Hope to get that stable for PG 14 (PG 13 enters feature freeze soon).


RocksDB supports it for reading multiple chunks from a file.


This article says `readfile` would return the number of bytes read. It seems like it would be better for it to return the number of bytes in the file, but only write at most `bufsize` bytes into the buffer. That way, if the buffer isn't big enough, you can allocate a correctly-sized buffer without needing to `stat` the file. This is how `snprintf` works.
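i.e. the calling pattern would be (hypothetical; the proposed readfile() doesn't behave this way):

    /* Hypothetical: assumes readfile() returned the file's full size rather
     * than the number of bytes copied, as suggested above. */
    ssize_t n = readfile(AT_FDCWD, path, buf, bufsize, 0);
    if (n > (ssize_t)bufsize) {
        char *bigger = realloc(buf, (size_t)n);   /* grow once, no fstat() */
        if (bigger != NULL) {
            buf = bigger;
            n = readfile(AT_FDCWD, path, buf, (size_t)n, 0);
        }
    }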


I think some "files" that are actually devices or pipes and stuff, for them the number of bytes in the file doesn't make sense, while bytes read does make sense. That may be why.


That wouldn't work well because you'd need to re-open and seek (or re-read) before continuing.

Ideally readfile would take and return a filefd (default -1); if present, skip the open and just read (closing on EOF). This also fixes the EINTR/SA_RESTART issue loeg[0] raised. From a C perspective, it does have a problem of needing to return multiple values (both a fd and a bytecount), though.

0: https://news.ycombinator.com/item?id=22509329


O_SPECIFIED seems like a strange solution to me.

Sure, I see why the kernel people wouldn't mind having people specify their own fd numbers, but doing the accounting necessary for that (to avoid conflicts) in userspace doesn't seem nice. Especially when there will be libraries that also may make use of the functionality.

Is there even any way to see which file descriptors are open without doing a syscall?


It looks like it’s intended to be used with batched creation of file descriptors that are known to be allocated but not yet used (to cut down on total syscalls?) https://lore.kernel.org/io-uring/20200211223235.GA25104@loca...

At least I think I’m reading this correctly.


"On its face, readfile() adds nothing new; an application can obtain the same result with calls to openat(), read(), and close(). But it reduces the number of system calls required from three to one"

A little surprised they didn't also roll in an optional lseek() offset in case you wanted a specific area of the file. Then it would roll 4 syscalls into 1. The somewhat similar sendfile() accepts an offset.


If you look at the thread, readfile() would be pretty specifically for extremely short files e.g. procfs and sysfs.

It would not be intended as a convenience API for slurping an entire file but as a performance API for very short reads (straight open(); read(); close() where the syscall overhead is enormous).

The implementation(s) by Greg Kroah-Hartman don't even bother looping until read returns empty or the buffer is full.


Sure, but it would be simple to add, and perhaps then useful for lots of small reads from bigger files. Or things like /proc/pid/mem.


A lot of comments in this topic seem to forget that syscalls don't always succeed. Interrupts are the biggest issue. If an interrupt occurs while you read something, you get a partial read. If you open something, you tend to get EINTR (very system-dependent). You can't really queue those, and those shouldn't be the OS's job to deal with.


> the cost of reading a lot of little files would be too high

The `readfile()` system call will reduce that by some factor, though probably not as high as a factor of one third (it replaces three system calls with two, assuming it turns out to be useful). But getting all the data in one call will be even faster.

Incidentally, a typical read-all-of-this-file function in user-land allocates a buffer of the right size after doing an `open()` and then `fstat()` to find the file's size. Whereas a system call to read a file in one go really can't do that unless it establishes a memory mapping (which is... expensive). So I rather question the utility of this thing. Also, since one might not know what size buffer to give it, it might as well have an option to return an open FD. Or the app would have to `stat()` then `readfile()`, thus replacing three syscalls (`open()`, `fstat()`, `close()`) with two -- bummer. Or the caller can `readfile()` repeatedly if the buffer given was not large enough (oof).
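For reference, that user-land helper usually looks something like this (sketch):

    #include <fcntl.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Typical user-land "read the whole file" helper: open, fstat to size
     * the buffer, read, close. Four syscalls, but no guessing at sizes. */
    static char *slurp(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return NULL;

        struct stat st;
        char *buf = NULL;
        ssize_t n = -1;

        if (fstat(fd, &st) == 0 && (buf = malloc(st.st_size + 1)) != NULL)
            n = read(fd, buf, st.st_size);
        close(fd);

        if (n < 0) {
            free(buf);
            return NULL;
        }
        buf[n] = '\0';
        *len_out = (size_t)n;
        return buf;
    }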

No, I'm pretty sure this `readfile()` can't possibly make reading lots of tiny files fast enough to not want a better solution for getting mounted filesystem metadata out of the kernel.


One of the cases posited is `ps` or `top`, which (on Linux) read /proc/[0-9]+/stat and /proc/[0-9]+/cmdline, and in these cases it knows the max it will want to read (which is pretty small, maybe 512 bytes at most). So this can make `ps` or `top` 3x faster / more efficient.


There must be other use cases besides ps and top to justify new syscalls, right? ps and top are not exactly slow enough to even make these optimizations visible.


This sort of thing leads me to think Linux is well down the path to irrelevance. Io_uring, for all its baroque elaboration, saves millions or billions of system calls. Bloating the kernel to save two of three system calls on filesystem operations seems nothing short of foolish.

I would welcome being proved wrong.


The idea of making file reads stateless is interesting. I wonder if it's atomic too? I think that might be very useful to have.


It does sort of raise the question of how interrupts are handled. Short reads / EINTR and suddenly this mechanism is less efficient than open+read+close. Depending on how SA_RESTART is implemented, it may not help. Do consumers have to mask signals around readfile(), and if so, are we saving any system calls?


TLDR: the speed-up happens because common combinations of syscalls (open, read, close) are combined into one single syscall (read file contents).

It violates the unix philosophy of "one call for one specific thing", of course, but massively speeds up performance in practical applications.

In fact, you could get a LOT of speedup at Linux/Microsoft/... (the OS level) by figuring out highly common combinations of syscalls typically all called in a row with mostly the same arguments, and combining them into one single syscall.


Once upon a time Linux solved the "syscalls are slow" problem by making syscalls fast, rather than introducing convoluted APIs as a workaround, like other operating systems did.

Perhaps one day we can get back to that model, or maybe now that Linux has gone corporate those days are gone. Even when you can make something faster with a simple, straight-forward solution, there's always that one corporate sponsor who doesn't benefit. Before Linux would just tell them sod-off.


that’s (at best) a gross oversimplification of the process, idealizing the past while ignoring how expectations for the kernel have increased (i.e. security & privacy), all colored by a layer of bad faith that, to be frank, seems to originate with some non-specific grievances.


I'm simply channeling Linus Torvalds. Here's a sample of a rant/boast from 2000 where Linus defends the "heavy-weight" threading approach against claims that a proper user-space threading architecture could provide better performance by requiring fewer syscalls and less memory:

> Yes, you could do it differently. In fact, other OS's _do_ do it differently. But take a look at our process timings (ignoring wild claims by disgruntled SCO employees that can probably be filed under "L" for "Lies, Outright"). Think about WHY our system call latency beats everybody else on the planet. Think about WHY Linux is fast. It's because it's designed right.

Source: https://lkml.org/lkml/2000/8/25/80

The same architectural approach was taken with fork: Linux optimized the heck out of fork to the point where fork on Linux was faster than thread creation on Windows. On multiple occasions Linus has argued (and proven), that optimizing the simple but "heavy-weight" approach can reap dividends at least as great as more complex, more flexible architectures.

Of course, that was then and this is now. And many things have changed, including the success of Linux. I'm just suggesting, or perhaps implying, that there's more eagerness to entertain the replacement of traditional syscall semantics and other subsystems with more complex frameworks than there once used to be. In the context of Spectre, arguably it would have been easier for a younger Linus (and younger Linux) to refuse to rearchitect syscalls (with the concomitant additional kernel and semantic complexity), to tell everyone to sit tight and wait for AMD and Intel to come up with mitigating hardware fixes, and in the intervening years just take the performance hit.


I'm not a kernel hacker either, and even though a readfile system call might speed some things up, one of the reasons it's desired is really wallpapering over a deeper, much harder issue: that it's a terrible hack to write the 'ps' command, and the more general issue of returning structured data from a unix-like kernel. Also, it seems like a bad design not to know how big to make the buffer, which means if you want to reliably get the whole file you either have to put it in a loop reallocating the buffer, or do a stat system call beforehand and hope you win the race.

To be fair, this has been an issue since the first unix kernels, and even though /proc and sysctl are an improvement over grepping kernel memory from user space (the incredibly hackish way old ps worked), in my opinion it's still a big mess in various ways.

Just in case you aren't aware, linux ps/top/etc. has to open possibly hundreds of fake files, parsing from text a bunch of stuff which may or may not be there or valid or the same data type depending on your kernel configuration or version. Just because you could write 'ps' with awk, doesn't make it good. I wouldn't object at all to the /proc interface, as a way of enabling simple tools, if there was also a decent function call interface.

Modern BSDs generally use sysctl, which is better in theory (you don't have the overhead of uselessly translating numbers back and forth to text and looking for space characters), but it still has the problem that it's very dependent on subtle changes in the C data structures, which can easily happen between versions and architectures. It also has the terrible drawback of not knowing how big the buffer should be, and therefore having to sit in an allocation-fail loop, probably exactly when you don't want it to happen: when your system is overloaded with way too many processes. I really don't see why one couldn't pass a memory allocator function, which could be called in the same manner as a signal handler.
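For reference, that dance looks roughly like this (from memory, so check sysctl(3) on your BSD of choice for the exact mib):

    #include <errno.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/sysctl.h>

    /* The allocation-retry loop described above (sketch from memory; the
     * mib and exact semantics vary a bit between the BSDs and macOS). */
    static void *proc_snapshot(size_t *len_out)
    {
        int mib[4] = { CTL_KERN, KERN_PROC, KERN_PROC_ALL, 0 };
        void *buf = NULL;
        size_t len = 0;

        for (;;) {
            if (sysctl(mib, 4, NULL, &len, NULL, 0) < 0)
                break;                        /* ask how much space is needed */
            void *bigger = realloc(buf, len);
            if (bigger == NULL)
                break;
            buf = bigger;
            if (sysctl(mib, 4, buf, &len, NULL, 0) == 0) {
                *len_out = len;
                return buf;                   /* consistent snapshot */
            }
            if (errno != ENOMEM)
                break;                        /* real error */
            /* ENOMEM: the process table grew between calls; go around again. */
        }
        free(buf);
        return NULL;
    }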

I don't know how Windows works internally, but to get process information you ask for a snapshot of the system, from which you can return a set of handles to processes, which you can get information from. In other words, it allocates the appropriate things for you and has a well-defined set of information which you can query, all with a relatively simple function call interface from C. I wrote a 'ps' that works on Linux/BSD/macOS/Windows, and even though I'm not at all a fan of Windows, the Windows kernel does it better than all the unix kernels.

It's actually not that hard to make an acceptable C interface to, say, process information, if you ignore the problems of data structure variance. But ignoring it is really the root of the issues, so it seems fairly pointless. sysctl sort of tries to have a metadata system, but it doesn't really address the data type problem well, and can end up using C structs from the kernel with the same variance problems. For example, task_struct in Linux or struct proc on macOS is kind of a big mess, but very important. You wouldn't want to restrict it from changing in any way, but you do want to get some of that data to users.

Making a good C metadata interface is complex, and although it's been done many times, it's quite tricky to do well, and you would really want a kernel interface to be done very well. I like to imagine that there could be something less bloated than GTK's GObject, but more featureful than sysctl. Unfortunately, when I look at the code involved I start feeling like C isn't really a good language for writing an operating system. But I still think that a well designed structured metadata system would be overall less overhead than /proc and could even achieve better speed and reliability.


Hmm, how is one actually supposed to use O_SPECIFICFD in real life, though? I mean, picking some specific FD number can work in trivial, single-threaded programs maybe, but how is that supposed to work in typical generic userspace code that has many threads and many libraries/subsystems all running in the same address space (and fd space), all wanting to take possession of some fd? Are apps supposed to block off fd ranges ahead of time by dup()ing /dev/null a couple of times, or how is that supposed to work? Not getting this...


Maybe the new readfile() could be used to implement php's readfile().


PHP's readfile is all about "streaming" GB-sized files to stdout, not so much returning a whole file in a single buffer (but file_get_contents is, and might indeed benefit from the new readfile syscall).


Thanks, that was what I meant.



