So if I do fopen/fwrite/fsync/fclose, that is not enough? That is crazy, I think 90% of apps don't fsync the parent directory. Also, how many levels of parents do you need to fsync?
That should be only for creating files, and maybe updating their metadata (not sure about that one).
The confusion stems from people thinking that files and directories are more different than they are. Both are inodes, and both are basically containers for data. File inodes are containers for actual data, while directory inodes are containers for other inodes.
All inodes need to be fsynced when you write to them. For files this is obviously when you write data to them. For directories, this is any time you change the inodes they contain, since you’re effectively writing data to them.
You only need to sync the direct parent because the containers aren’t transitive; the grandparent directory only stores a reference to the parent directory, not to the files within the parent directory. It’s basically a graph.
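To make that concrete: creating a file really does "write data" to the parent directory, and you can observe it on the directory's own inode. A minimal sketch (Linux/POSIX; the demo paths are made up):

```c
/* Creating a file modifies the parent directory's inode: its mtime changes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat before, after;

    mkdir("/tmp/demo_dir", 0755);              /* ignore EEXIST for the demo */
    unlink("/tmp/demo_dir/newfile");           /* clean up any previous run */

    if (stat("/tmp/demo_dir", &before) != 0) { perror("stat"); return 1; }

    int fd = open("/tmp/demo_dir/newfile", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (stat("/tmp/demo_dir", &after) != 0) { perror("stat"); return 1; }

    /* The directory inode was written to by the create. */
    printf("dir mtime changed: %s\n",
           before.st_mtime != after.st_mtime ? "yes" : "no (same second)");
    return 0;
}
```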
> rename() doesn't cause a fsync of the parent directory?
It does not.
At best it will schedule a journal commit asynchronously (I recall the ext4 maintainer complaining on LKML about having to add this "workaround for buggy user code"). If you want to actually receive an I/O error when the rename fails to make it to disk, make sure to call fsync() yourself.
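For what it's worth, this is roughly the dance careful code ends up doing for an atomic replace. A hedged sketch, assuming Linux-ish semantics and that the temp path and the final path live in the same directory (all names here are made up):

```c
/* Write to a temp file, fsync it, rename over the target, then fsync the
 * directory so the new name (and the rename) are actually durable. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *dirpath, const char *tmppath, const char *dstpath,
                 const void *buf, size_t len) {
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  /* temp file's data + inode */
    close(fd);

    if (rename(tmppath, dstpath) != 0) return -1;  /* atomic swap of the name */

    /* The rename itself is only in the page cache at this point; fsync the
     * directory to make it durable and to surface any I/O error. */
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```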
I’m not well versed enough on the subject to know if there’s important nuance here that I’m missing… but that doesn’t sound like a contradiction to me. That’s effectively a map with O(n) access, right? I think I would more generally refer to that as a “collection”, but it certainly contains inodes as you describe it.
IMO it's better to think of a directory as "containing" only the named references to inodes. Other directories on the same filesystem may contain other named references to the same inodes ("hard links").
The inodes themselves are more like free-floating anonymous objects independent of any particular directory, and might not have a named reference at all (O_TMPFILE; or the named reference was deleted but a file descriptor is still open; or orphaned inodes due to filesystem corruption or someone forgetting to fsync the directory after creating a file, as masklinn pointed out - e2fsck will link these into /lost+found). This is also why chmod appears to affect other hard links in different directories: because it actually modifies the inode (file permissions are recorded in the inode itself), not the named reference.
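A small demo of that last point (made-up paths): two directory entries, one inode, so a chmod through either name is visible via the other:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    unlink("/tmp/a");
    unlink("/tmp/b");                        /* clean up any previous run */

    int fd = open("/tmp/a", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (link("/tmp/a", "/tmp/b") != 0) { perror("link"); return 1; }

    chmod("/tmp/a", 0444);                   /* modifies the inode, not the name */

    struct stat sa, sb;
    stat("/tmp/a", &sa);
    stat("/tmp/b", &sb);
    printf("same inode: %s, mode seen via /tmp/b: %o\n",
           sa.st_ino == sb.st_ino ? "yes" : "no",
           (unsigned)(sb.st_mode & 07777));  /* reflects the chmod done via /tmp/a */
    return 0;
}
```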
This definitely seems like a more meaningful distinction to me. It’s also closer to the mental model I had coming into the discussion, FWIW.
And I think it makes the collection/container terminology distinction sharper too. Depending on context, I think it’s usually reasonable (if imprecise) to describe a bucket of references or pointers to things as a collection of those things. But I don’t think it makes as much sense to call it a container of those things, except in a really abstract sense.
do you mean it's basically a tree? Because if it were just a graph, you could still have edges from the grandparent to the grandchild in addition to the one from the grandparent to its child
This ... depends. In normal POSIX land, hard links break the tree structure, for one, so you get a DAG but not a tree. I think some file systems do enforce tree structure, though - hard links are not supported everywhere.
It used to be possible ages ago to hard link to directories, which meant that you could have actual cycles and a recursive tree-walking algorithm would never terminate. (As far as I know you can still do this by editing the disk manually, although I think fsck will make a fuss if it detects this.)
You can still, with the right syscalls and drivers, do something like hard links on NTFS (I think they're technically called mount points but it's not the same thing as POSIX ones). I'm not sure if you can still make directory cycles, and you're probably a bad person if you do.
Crazy is the right term. File system APIs in general have too many sharp edges and need a ground-up rethink.
Consider S3-like protocols: these recognise that 99% of the time applications just want “create file with given contents” or “read back what they’ve previously written.”
The edge cases should be off the beaten path, not in your way tripping you up when you want the simple scenario.
Aren't the edge cases features? An abstraction (or different API, sure) is in order to prevent footguns. However, this abstraction should not force fsyncs for example, due to the performance impact mentioned. It puts the choice of guaranteed writes vs performance to the developer.
A better abstraction, designed from the ground up, wouldn't force you to sprinkle fsyncs around just to get correct behavior.
For example, write groups or barriers (like memory barriers) would be wonderful. Or a transaction API, or I/O completion ports like on Windows.
In a database (and any other software designed for resiliency), you want the file contents to transition cleanly from state A to B to C, with no chance of ending up in some intermediate state in the case of power loss. And you want to be notified when the data has been durably written. It's unnecessarily difficult to write code that does that on top of POSIX in an efficient way. Most code that interacts with files is either slow, wrong or both. All because the API is bad.
> It puts the choice of guaranteed writes vs performance to the developer.
Yes, and it's a completely false choice. The entire point of this thread is that fsync is an incredibly difficult API to use in a way that gets you the guarantees you need ("don't lose the writes to this file"). And that the consistency guarantees of specific filesystems, VFS, POSIX, and their interactions are not easy to understand even for the experienced -- and it can be catastrophic to get wrong.
It isn't actually a choice between "Speed vs correctness". That's a nice fairy tale where people get to pretend they know what they're up against and everyone has full information. Most programmers aren't going to have the attention to get this right, even good ones. So then it's just "99.9% chance you fucked it up and it's wrong" and then your users are recovering data from backups.
The filesystem is very easy to use for simple things by comparison to a DB, and it's more accessible from the shell. But you're right, the filesystem is very difficult to use in a power failure safe way. SQLite3 has great power failure recovery testing, so the advice to use SQLite3 for any but the simplest things is pretty good.
It'd be very nice to get some sort of async filesystem write barrier API. Something like `int fbarrier(int fd)` such that all writes anywhere in the filesystem will be sync'ed later when you `fsync()` that fd.
It would also be very nice to have an async `sync()`/`fsync()`. That may sound oxymoronic, but it's not. An async `sync()`/`fsync()` would schedule and even start the sync, then provide a completion notice so the application can do other work while the sync happens in the background. One can do sync operations in worker threads and then report completion, but it'd be nice to have this be a first-class operation. Really, every system call that does "I/O" or is (or can be) slow should have been designed to be async-capable.
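On Linux, io_uring already gets you most of the way there: you can queue an fsync and reap its completion later instead of blocking. A rough sketch using liburing (the filename is made up; link with -luring):

```c
/* Submit an fsync asynchronously, keep working, then collect the completion. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) != 0) return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);      /* 0 = full fsync, not just data */
    io_uring_submit(&ring);

    /* ... the application can do other work here while the sync runs ... */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("fsync completed, res = %d\n", cqe->res);  /* 0 on success */
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```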
> So if I do fopen/fwrite/fsync/fclose, that is not enough?
That is my understanding.
> Also, how many levels of parents do you need to fsync?
Only one, at least if you didn't create the parent directory (if you did then you might have to fsync its parent, recursively). The fsync on the parent directory ensures the dir entry for your new file is flushed to disk.
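Concretely, the whole sequence for durably creating a new file looks roughly like this (a sketch with hypothetical paths, assuming the parent directory already existed):

```c
/* open/write/fsync/close the file, then fsync its parent directory so the
 * directory entry for the new name is also on disk. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int create_durably(void) {
    const char *msg = "hello\n";

    int fd = open("/some/dir/newfile", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return -1;
    if (write(fd, msg, strlen(msg)) < 0) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  /* the file's data + inode */
    close(fd);

    /* Now the directory entry: without this, the name itself may be lost. */
    int dirfd = open("/some/dir", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```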
I've never heard about this in my years of programming. I just tried to read through the Win32 documentation, as I've done several times over the years, and it mentions a lot of edge cases but not this that I could see.
Is this some Linux/Unix specific thing? Am I blind?
As far as I know, it's specific to the combination of certain POSIX-ish OSes and file systems, like Linux/ext3. I have no clue what BSD does here, or whether ReiserFS is different.
Windows/NTFS is a different world, there are still edge cases that can go wrong but I don't think this particular one is a problem because FAT/NTFS is not inode-based.
I imagine if you looked at the SQLite source code you'd see different edge-case-handling code for different OSes.
The thing about Windows is that because the file open operation (`CreateFile*()`) by default prevents renames of the file, Windows apps have come to not depend so much on file renaming, which makes one of the biggest filesystem power failure hazards less of an issue on Windows. But not being able to rename files over others as easily as in POSIX really sucks. And this doesn't completely absolve Windows app devs of having to think about power failure recovery! "POSIX semantics" is often short-hand for shortcomings of POSIX that happen to also be there in the WIN32 APIs, such as the lack of a filesystem write barrier, file stat-like system calls that mix different kinds of metadata (which sucks for distributed filesystem protocols), and so on. And yes, you can open files in Windows such that rename-over is allowed, so you still have this problem.
Ya, NTFS file naming and long paths are a wreck. Unzipping files from Mac/Linux is an easy way to end up with missing data. Applications quite often break on long file paths, especially Microsoft's own stuff like PowerShell.
More specifically, these things are for trying to improve the behavior on unclean system shutdown (e.g. power loss), which is inherently chaotic; unless all parts (most critically the disk and its controller) are well behaved, you don't have any real guarantees anyway.
Windows also doesn't guarantee that data is written to disk by the time WriteFile/CloseHandle returns; the Windows version of fsync is FlushFileBuffers.
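Roughly the Windows equivalent of the open/write/fsync sequence, sketched with a made-up path: WriteFile succeeding only means the data is in the system cache, and FlushFileBuffers is the call that asks for it to reach the disk.

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE h = CreateFileW(L"C:\\temp\\data.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    const char msg[] = "hello";
    DWORD written = 0;
    if (!WriteFile(h, msg, sizeof msg - 1, &written, NULL)) {
        CloseHandle(h);
        return 1;
    }

    /* Equivalent in spirit to fsync(fd). */
    if (!FlushFileBuffers(h)) {
        printf("flush failed: %lu\n", GetLastError());
    }
    CloseHandle(h);
    return 0;
}
```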
One of the advantages of using a database for files is that it's relatively more likely that these platform-dependent considerations were indeed considered, and you don't have to moonlight as a DBMS engineer while writing your application.
One reason I like using SQLite so much in my personal projects: sometimes I just use it to store various bits of config data as row entries in the most generic table ever. It works well with small projects, probably not so much with large-scale ones.
Great, another case of a simple thing everyone knows is simple but turns out to be horrifyingly complicated and heavily system-dependent, but only on occasion, so you can get 80% of the way through your career before encountering the gaps in your knowledge.
I guess I'll add it to the list.
Of course on the other hand, I was already thinking that I should just use SQLite for all my file handling needs. This little nugget makes me think that this was the correct intuition. [Cue horrifying revelations w.r.t. SQLite here.]
w.r.t SQLite, the only horrifying revelation I’ve had is that it allows NULLs in composite primary keys, which I’ve seen lead to some nasty bugs in practice.
You don't even need to bother with fsync unless you are developing a database or using the file system like a database.
There's a reason Apple made fsync useless and introduced F_FULLFSYNC. They know developers have an incentive to overestimate the importance of their own files, to the detriment of system responsiveness and power draw.
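For reference, on macOS the stronger "really get it to stable storage" request is a fcntl rather than a different sync call. A minimal sketch (the fd is assumed to be an already-open, written-to file), following the common pattern of falling back to plain fsync where the operation isn't supported:

```c
#include <fcntl.h>
#include <unistd.h>

int full_sync(int fd) {
    /* Ask the drive to flush its own cache too, not just the OS buffers. */
    if (fcntl(fd, F_FULLFSYNC) == -1) {
        /* Some filesystems don't support it; fall back to plain fsync. */
        return fsync(fd);
    }
    return 0;
}
```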
The problem is, Apple's fsync doesn't even introduce any useful semantics. All it says is "write this eventually".
What a lot of people really need is just a "provide ordering, don't leave me with inconsistent data" operation. If you lose power "after" a write but before the fsync, for any non-networked application that's no different than losing power before the write, as long as the filesystem doesn't introduce chaos.
That seems significantly harder to implement, though, given that dirty page cache entries don't carry any ordering information as far as I know, and so retroactively figuring out which writes not to reorder with others is anything but trivial.