I find it surprising that a filesystem feature is so tightly coupled to a specific programming language. Why is it not an API that lets you begin/commit a "channel program" from any language, then execute zfs commands within that session? Surely speed of execution isn't that big of a constraint.
Think of it like a GPU compute shader. Like a shader, you have to submit "the whole thing" to the kernel at once, so that it has the program available all at once in its own context, and can guarantee it won't get blocked waiting for your app to submit "the rest of it."
To enable the client program to submit "the whole thing", you really need to represent that sequence of operations as some kind of instruction stream loaded into a buffer — either a text-based one (source code in some language), or bytecode in some ISA.
(Why not capture a sequence of API operations that do nothing when called, but instead build such an instruction stream; where "committing" them actually runs the instruction stream? Y'know, like a Redis pipeline? Because a "pipeline" like that can't branch based on the result of executing part of it, unless branching/looping/etc are also implemented as operations in this hypothetical fluent-API-DSL. At which point, you're just programming with the semantics of another language, with extra steps.)
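To make that parenthetical concrete, here's a minimal Python sketch of the hypothetical fluent pipeline API (every name here is invented): operations do nothing when called except append to an instruction stream, and the moment you need to branch on intermediate results, branching itself has to become an operation *in the stream*.

```python
class Pipeline:
    """Toy Redis-pipeline-style builder; calls are recorded, not executed."""

    def __init__(self):
        self.ops = []

    def destroy_dataset(self, name):
        self.ops.append(("destroy", name))
        return self

    def if_prop_equals(self, dataset, prop, value, then_ops):
        # Branching can't happen client-side (the results don't exist yet),
        # so it must be encoded as an instruction, i.e. you are now
        # designing a programming language with extra steps.
        self.ops.append(("branch", dataset, prop, value, list(then_ops)))
        return self

    def commit(self, executor):
        # "Committing" submits the whole instruction stream at once.
        return executor(self.ops)
```

A client would chain calls freely; only `commit` would hand the accumulated stream to the (hypothetical) kernel-side executor.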
And if a static analysis pass is going to occur to check the code for runtime complexity, then you're already transforming the input into a different verified output with an intermediate AST-like structure; so at that point you may as well accept a text-based format and expose a "compile" system call. Just as GPU drivers do for compute shaders.
I'm guessing that there's probably an internal ZFS API allowing you to instead pass in some kind of bytecode (LuaJIT bytecode?) and just have it directly static-analyzed and plopped into a kernel handle, rather than parsed, then static-analyzed, then codegen'ed. They'd definitely need something like that for testing.
This is a tricky problem, but it’s something all good transactional databases solve.
For example, the way this is handled in foundationdb is that you create a transaction, then within that transaction issue a series of read and write calls. The read calls are executed immediately, and they see the database as if other concurrent changes weren’t happening. The write calls get batched up and at the end of the transaction block they are submitted together - with information about which keys were read. On submission, the transaction block can fail with E_RETRY - which means a conflict happened, the transaction was entirely aborted and the userland program should run the transaction again from the start.
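The pattern described above can be modeled in a few dozen lines of plain Python. This is a toy illustration, not the real `fdb` bindings: reads execute immediately and record what they observed, writes are buffered, and commit either applies the whole batch or raises a retryable conflict error.

```python
class RetryError(Exception):
    """Analogue of FoundationDB's retryable conflict error."""

class Store:
    def __init__(self):
        self.data = {}
        self.version = {}   # per-key version counter, for conflict detection

class Tx:
    def __init__(self, store):
        self.store = store
        self.read_versions = {}   # key -> version observed at read time
        self.writes = {}          # buffered writes, applied only on commit

    def read(self, key):
        if key in self.writes:          # read-your-own-writes
            return self.writes[key]
        self.read_versions[key] = self.store.version.get(key, 0)
        return self.store.data.get(key)

    def write(self, key, value):
        self.writes[key] = value        # batched; nothing hits the store yet

    def commit(self):
        # Conflict check: did any key we read change since we read it?
        for key, seen in self.read_versions.items():
            if self.store.version.get(key, 0) != seen:
                raise RetryError(key)   # caller reruns the tx from the start
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.version[key] = self.store.version.get(key, 0) + 1
```

The client-side contract is the retry loop: on `RetryError`, discard the transaction and run the whole block again from the top.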
I’d absolutely adore an API like this for filesystem operations. There are all sorts of half-baked, dirty workarounds for the lack of atomicity in POSIX filesystems. E.g., that whole “write a file then rename it” thing, which almost nobody implements correctly because it’s error-prone, and most correct solutions on top of POSIX have terrible performance.
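For reference, here's roughly what a correct version of that "write a file then rename it" dance looks like on POSIX (a sketch; error handling is minimal, and the directory-fsync step is the part most implementations skip):

```python
import os
import tempfile

def atomic_replace(path, data: bytes):
    """Durably replace a file's contents via write-to-temp + rename.

    The commonly botched parts: fsync the temp file *before* renaming,
    and fsync the containing directory *after*, so the new directory
    entry itself survives a crash.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)  # temp file on the same fs
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())     # force file data to disk
        os.replace(tmp, path)        # atomic rename on POSIX filesystems
        dirfd = os.open(dirname, os.O_RDONLY)
        try:
            os.fsync(dirfd)          # persist the directory entry
        finally:
            os.close(dirfd)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

That two-fsync requirement is also where the "terrible performance" comes from: every replacement pays for at least two round trips to stable storage.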
I don't think you understand the point of the constraints ZFS has put in place here. They're not there for transactional isolation per se. They're there because these ZFS maintenance operations run as synchronous system calls that take per-filesystem global mutex locks. Blocking within one, with those locks acquired, and thunking back down to userspace, would mean that 1. a CPU core, and 2. the filesystem itself, would both be tied up indefinitely until that callback thunk returns. If it ever does!
Picture a database server where once you begin a transaction, the whole DB server goes single-threaded and doesn't serve any other clients until the TX completes. Fundamentally, at least for its more arcane global operations, that's exactly what a filesystem is. Usually this doesn't matter, because even these arcane filesystem operations take on the order of microseconds to complete (just with lots of non-locked pre/post execution overhead that inflates the total syscall time.) But a userspace program can do arbitrarily much stuff in a callback. It can even just get into an infinite loop, and never get back to the kernel. This is why there isn't such a thing as a system call with userspace callbacks! (Or any API equivalent to one—e.g. one with tx_begin + tx_commit calls, where tx_begin acquires a global kernel lock before returning to userspace, under the expectation that userspace calls tx_commit to release the kernel lock.)
On the other hand, submitting a "callback program" as a whole to the kernel, allows the kernel to statically analyze the behavior of the "callback program" in advance; and only if it is determined to be "simple" (e.g. if it predictably halts after a short time as a static property), will the kernel issue the program a capability for actually running that "callback program."
This is how compute shaders work. This is how eBPF works. And this is (apparently) how this ZFS Lua thing works. They all do it for similar reasons: to ensure that the program is a complete, soft-realtime practical unit of work to be executing in some bounded-time context.
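As a toy illustration of that kind of static admission check (this is not the actual eBPF verifier, just the flavor of one of its rules): if a submitted program's jumps only ever go forward, every execution is bounded by the program's length, so the kernel can know it halts before ever running it.

```python
def verify_bounded(program):
    """Toy halting check in the spirit of eBPF's verifier (not the real thing).

    `program` is a list of (opcode, arg) tuples, where "jmp" args are
    absolute instruction indices. With no backward edges, any run executes
    at most len(program) instructions, so the program provably terminates.
    """
    for pc, (op, arg) in enumerate(program):
        if op == "jmp" and arg <= pc:
            return False  # backward jump: potential unbounded loop, reject
    return True
```

Only programs that pass a check like this would be admitted into the bounded-time execution context; everything else is rejected at load time rather than killed at runtime.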
You do get atomicity from this, but it's not MVCC atomicity. It's more like Redis's atomicity: during the execution of a ZFS Lua program, all other clients making system calls against the filesystem will wait on acquiring the filesystem mutex, and so will block until the program completes. There's never a case where other activity might "interleave" with the Lua program's execution, and cause it to retry. There is no other activity. The only case in which the Lua program can actually fail (and so get rolled back), is if it dies/is killed in the middle of execution, due to e.g. a hardware power cut. That's the kind of atomicity that ZFS is concerned about—guaranteeing that changes that make it into the filesystem journal are complete, valid change units. In this case, one Lua program is one change unit in the journal.
> Blocking within one, with those locks acquired, and thunking back down to userspace, would mean that 1. a CPU core, and 2. the filesystem itself, would both be tied up indefinitely until that callback thunk returns. If it ever does!
In this theoretical design, you could just block all other administrative modifications. I don't think you need to tie up an entire CPU core, and I'm fairly sure that these zfs operations don't block regular reads and writes.
I think you had it right in your initial comment. There's no good way to express branching with an implementation that incrementally submits operations to be committed as a batch. You'd have to take an admin lock on an entire zpool.
EDIT: talked to a zfs dev, said this would take the txg sync lock.
While there are ways to deschedule both userspace and kernel threads, there is no mechanism to deschedule a userspace thread while it's executing in the middle of kernel mode because of a blocking syscall.
Think of it like trying to deschedule a userspace thread in the middle of it having jumped to kernelspace to handle an interrupt. It just wouldn't work; that's not a pre-emptible state, not one that can be cleanly represented during a context switch with a PUSHA, not one where pre-emption would leave the kernel in a known state, etc.
So the CPU core is tied up because the original thread can't be descheduled, and instead would still be "stuck" in the middle of the system call, doing a busy-wait on the result of the callback. To make the callback actually happen in this hypothetical design, the execution of the callback would need to be scheduled onto another CPU core, using some system-global callback-scheduler like Apple's libdispatch.
Note that this is also why, in Linux, processes stuck in the D state are unkillable. They're stuck "inside" a blocking system call, and so cannot be descheduled, even by the process manager trying to hard-kill them (which, in the end, requires the system call to at least return to the kernel so that the kernel resources involved can reach a known postcondition state.)
And this is why innovations like io_uring make so much sense in Linux — they allow a userspace process to 1. make a long-running blocking syscall, while also 2. spawning a worker subprocess to communicate asynchronously with the logic inside the running syscall, by queuing messages back and forth through the kernel rings. (Picture, say, sendfile(2) messaging your worker to let you observe the progress of the operation, and/or to signal it on a channel to gracefully cancel the operation-in-progress.)
I'm not following what you're saying. Why do we need a callback?
In this imaginary design, the syscalls you make would look something like:
- BeginChannelTx -> return ChannelTxID
- ReadZFSProperties(ChannelTxID, params) -> return data
- DestroySomeDatasets(ChannelTxID, params) -> ok
- CommitChannelTx(ChannelTxID)
Notably, DestroySomeDatasets doesn't actually do any work. It merely records which datasets you want to destroy. There are no callbacks as far as I can see: there's no kernel thread waiting on a user thread to do something. This way also lets you express branching.
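A toy model of that imaginary syscall sequence (all names invented) shows why branching works here: reads execute immediately against the locked state, so userspace can make decisions on them, while destructive operations are merely recorded until commit.

```python
class Channel:
    """Toy stand-in for the imaginary BeginChannelTx/CommitChannelTx design."""

    def __init__(self, datasets):
        self._datasets = datasets   # stands in for kernel state, now "locked"
        self._pending = []          # deferred destructive operations

    def read_property(self, dataset, prop):
        # "Reads" execute immediately, so userspace can branch on them...
        return self._datasets[dataset][prop]

    def destroy(self, dataset):
        # ...while "writes" are only recorded, doing no work until commit.
        self._pending.append(dataset)

    def commit(self):
        for dataset in self._pending:
            del self._datasets[dataset]

datasets = {"tank/a": {"origin": "tank/base"}, "tank/b": {"origin": "-"}}
tx = Channel(datasets)
# Branching happens in ordinary userspace code: destroy only the clones.
for name in list(datasets):
    if tx.read_property(name, "origin") == "tank/base":
        tx.destroy(name)
tx.commit()
```

The cost, as noted below, is that the lock has to be held across every one of those round trips between userspace and the kernel.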
The drawback of this approach is you need a lock on all mutating administrative commands when you call BeginChannelTX. I talked to a ZFS dev, and he said that with ZFS' design, that's actually the txg sync lock. This means that while reads will proceed, writes will only proceed for a short period of time, and nothing will make it to disk. The overhead of making all these syscalls was also judged to be problematic.
Given that the whole point is to perform script-like operations as an atomic unit in the kernel, it seems to me you'd be reinventing a lot of script-like stuff by exposing a "bare" API for generating a "program" to be run.
For example, consider what the API must support to make a channel program that lists all the child datasets of a given dataset, and takes a snapshot of each child dataset that has a certain ZFS property.
The looping cannot be done by the user program, as that would introduce edge cases when datasets are created or destroyed between listing and generating the snapshot commands.
As such, picking Lua as the API isn't a bad choice, I think.
I’m not familiar with the feature at all but I wonder whether your suggested alternative API would be powerful enough to atomically do things like:
1. Find the latest three snapshots, and then
2. delete those three snapshots.
With a generic API, that’d take two calls, with a possible TOCTOU in between so it wouldn’t be atomic.
But if you send that same program to the ZFS subsystem, and have it evaluate and execute the script, the subsystem would be able to guarantee atomicity, right?
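Here's the TOCTOU in miniature, as a deterministic Python toy (invented names; in this model snapshot names sort by creation order): another actor creates a snapshot between the "list" call and the "delete" calls, so the stale decision deletes the wrong set.

```python
def latest_three(snapshots):
    # In this toy model, snapshot names sort by creation order.
    return sorted(snapshots)[-3:]

snapshots = {"snap1", "snap2", "snap3", "snap4"}

# Call 1: list. The client sees snap2, snap3, snap4 as the latest three.
doomed = latest_three(snapshots)

# The TOCTOU window: another process creates a snapshot in between.
snapshots.add("snap5")

# Call 2: delete what we decided on earlier.
for name in doomed:
    snapshots.discard(name)

# snap5, now genuinely one of the latest three, survived, and snap2 was
# deleted instead. A channel program evaluating both steps atomically
# inside the kernel could never observe this interleaving.
```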
it seems pretty trivial to just have some kind of "begin/end atomic section" flag or function rather than jamming in a whole special-purpose scripting environment.
But what is the API supposed to do when it comes across a `begin` flag? Stop all filesystem activity in all programs until yours sends the `end` flag? What if that flag never arrives, e.g. due to a bug in your client code?
Delay all commits to disk until a fixed timeout for command submission is reached; if the timeout is hit, the call returns as failed. The same problem exists with this special-purpose scripting environment, just in other forms.
I have a special place in my heart for these sort of "weird" features, which seem really exotic compared to the standard POSIX APIs we're all used to. I just enjoy thinking about alternatives and new ways of doing and architecting things. Even if I can't think of a practical use for such a feature, I always appreciate the novelty. It makes me wonder about the future. Are the typical POSIX-style APIs still going to be dominant 50 or 100 years from now, or will new APIs and ideas start to replace them?
It seems like CPU peripherals are gaining more and more intelligence. I was reading about PCIe's potential successor CXL and the amount of features that allow peripherals to do their own thing with a minimum of CPU intervention seems to be increasing.
So at some point I bet you'll be basically uploading BPF like code to peripherals directly in order to accomplish really fast I/O. Don't know how that's going to look at all from a post-POSIX API perspective.
> Channel Programs may only be run with root privileges
Compare to eBPF, which not only uses a JIT compiler in the kernel, but on popular Linux distributions like Fedora and Ubuntu permits unprivileged users to run eBPF programs.
I would never assume a lack of exploitable bugs in the fork of the Lua interpreter used in FreeBSD, but the fact is that eBPF has seen a steady, if infrequent, stream of actual exploits over the years and in particular unprivileged user exploits.
If you have to have root already this doesn’t buy you much though. Perhaps a hard-to-detect way for disgruntled bofhs to do bad things, that’s about it.
> kernel operations are faster than userland operations
Call me a pedant but this grinds my gears. The speed at which my CPU carries out instructions is not determined by whether it's in a kernel or user context. Avoiding switches between those contexts certainly means less work, though.
How do you feel when people say C is faster than bash? The speed your CPU carries out instructions is the same. A bash statement is usually more instructions than a C statement though.
For starters, it doesn’t run on the channel subsystem, but uses the CPU for it. Also, channel programs (the IBM ones) can do a lot of stuff - IBM’s ISAM uses self-modifying channel programs. These ZFS channel programs can’t be that flexible, because they run within the kernel, and being too flexible would be a security risk.
> For starters, it doesn’t run on the channel subsystem, but uses the CPU for it.
Low/mid range mainframes don't really make sense anymore, but channel programs did run on the main CPU on quite a few low/mid range mainframe systems.
> Also, channel programs (the IBM ones) can do a lot of stuff - IBM’s ISAM uses self-modifying channel programs. These channels can’t be that flexible because they run within the kernel and being too flexible would be a security risk.
There's no reason this couldn't do that either, as long as it's run in a preemptible context (which I'm pretty sure is the case, from looking at the example program). Schemes like eBPF mainly control code flow in pursuit of the goal of running in non-preemptible contexts like interrupts.