So if I do fopen/fwrite/fsync/fclose, that is not enough? That is crazy, I think 90% of apps don't fsync the parent directory. Also, how many levels of parents do you need to fsync?
That should be only for creating files, and maybe updating their metadata (not sure about that one).
The confusion stems from people thinking that files and directories are more different than they are. Both are inodes, and both are basically containers for data. File inodes are containers for actual data, while directory inodes are containers for other inodes.
All inodes need to be fsynced when you write to them. For files this is obviously when you write data to them. For directories, this is any time you change the inodes they contain, since you’re effectively writing data to them.
You only need to sync the direct parent because the containers aren’t transitive; the grandparent directory only stores a reference to the parent directory, not to the files within the parent directory. It’s basically a graph.
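To make that concrete: creating a file really does "write data" to the parent directory, and you can observe it on the directory's own inode. A minimal sketch (Linux/POSIX; the demo paths are made up):

```c
/* Creating a file modifies the parent directory's inode: its mtime changes. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    struct stat before, after;

    mkdir("/tmp/demo_dir", 0755);              /* ignore EEXIST for the demo */
    unlink("/tmp/demo_dir/newfile");           /* clean up any previous run */

    if (stat("/tmp/demo_dir", &before) != 0) { perror("stat"); return 1; }

    int fd = open("/tmp/demo_dir/newfile", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (stat("/tmp/demo_dir", &after) != 0) { perror("stat"); return 1; }

    /* The directory inode was written to by the create. */
    printf("dir mtime changed: %s\n",
           before.st_mtime != after.st_mtime ? "yes" : "no (same second)");
    return 0;
}
```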
> rename() doesn't cause a fsync of the parent directory?
It does not.
At best it will schedule a journal commit asynchronously (I recall the ext4 maintainer complaining on LKML about having to add this "workaround for buggy user code"). If you want to actually receive an I/O error when the rename fails to make it to disk, make sure to call fsync() yourself.
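For what it's worth, this is roughly the dance careful code ends up doing for an atomic replace. A hedged sketch, assuming Linux-ish semantics and that the temp path and the final path live in the same directory (all names here are made up):

```c
/* Write to a temp file, fsync it, rename over the target, then fsync the
 * directory so the new name (and the rename) are actually durable. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int replace_file(const char *dirpath, const char *tmppath, const char *dstpath,
                 const void *buf, size_t len) {
    int fd = open(tmppath, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  /* temp file's data + inode */
    close(fd);

    if (rename(tmppath, dstpath) != 0) return -1;  /* atomic swap of the name */

    /* The rename itself is only in the page cache at this point; fsync the
     * directory to make it durable and to surface any I/O error. */
    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```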
I’m not well versed enough on the subject to know if there’s important nuance here that I’m missing… but that doesn’t sound like a contradiction to me. That’s effectively a map with O(n) access, right? I think I would more generally refer to that as a “collection”, but it certainly contains inodes as you describe it.
IMO it's better to think of a directory as "containing" only the named references to inodes. Other directories on the same filesystem may contain other named references to the same inodes ("hard links").
The inodes themselves are more like free-floating anonymous objects independent of any particular directory, and might not have a named reference at all (O_TMPFILE; or the named reference was deleted but a file descriptor is still open; or orphaned inodes due to filesystem corruption or someone forgetting to fsync the directory after creating a file, as masklinn pointed out - e2fsck will link these into /lost+found). This is also why chmod appears to affect other hard links in different directories: because it actually modifies the inode (file permissions are recorded in the inode itself), not the named reference.
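A small demo of that last point (made-up paths): two directory entries, one inode, so a chmod through either name is visible via the other:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    unlink("/tmp/a");
    unlink("/tmp/b");                        /* clean up any previous run */

    int fd = open("/tmp/a", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) { perror("open"); return 1; }
    close(fd);

    if (link("/tmp/a", "/tmp/b") != 0) { perror("link"); return 1; }

    chmod("/tmp/a", 0444);                   /* modifies the inode, not the name */

    struct stat sa, sb;
    stat("/tmp/a", &sa);
    stat("/tmp/b", &sb);
    printf("same inode: %s, mode seen via /tmp/b: %o\n",
           sa.st_ino == sb.st_ino ? "yes" : "no",
           (unsigned)(sb.st_mode & 07777));  /* reflects the chmod done via /tmp/a */
    return 0;
}
```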
This definitely seems like a more meaningful distinction to me. It’s also closer to the mental model I had coming into the discussion, FWIW.
And I think it makes the collection/container terminology distinction sharper too. Depending on context, I think it’s usually reasonable (if imprecise) to describe a bucket of references or pointers to things as a collection of those things. But I don’t think it makes as much sense to call it a container of those things, except in a really abstract sense.
do you mean it's basically a tree? Because if it were just a graph, you could still have edges from the grandparent to the grandchild in addition to the one from the grandparent to its child
This ... depends. In normal POSIX land, hard links break the tree structure, for one, so you get a DAG but not a tree. I think some file systems do enforce tree structure, though - hard links are not supported everywhere.
It used to be possible ages ago to hard link to directories, which meant that you could have actual cycles and a recursive tree-walking algorithm would never terminate. (As far as I know you can still do this by editing the disk manually, although I think fsck will make a fuss if it detects this.)
You can still, with the right syscalls and drivers, do something like hard links on NTFS (I think they're technically called mount points but it's not the same thing as POSIX ones). I'm not sure if you can still make directory cycles, and you're probably a bad person if you do.
Crazy is the right term. File system APIs in general have too many sharp edges and need a ground-up rethink.
Consider S3-like protocols: these recognise that 99% of the time applications just want “create file with given contents” or “read back what they’ve previously written.”
The edge cases should be off the beaten path, not in your way tripping you up when you want the simple scenario.
Aren't the edge cases features? An abstraction (or different API, sure) is in order to prevent footguns. However, this abstraction should not force fsyncs for example, due to the performance impact mentioned. It puts the choice of guaranteed writes vs performance to the developer.
A better abstraction, designed from the ground up, wouldn't force you to sprinkle fsyncs around just to get correct behavior.
For example, write groups or barriers (like memory barriers) would be wonderful. Or a transaction API, or I/O completion ports like on Windows.
In a database (and any other software designed for resiliency), you want the file contents to transition cleanly from state A to B to C, with no chance of ending up in some intermediate state in the case of power loss. And you want to be notified when the data has been durably written. It's unnecessarily difficult to write code that does that on top of POSIX in an efficient way. Most code that interacts with files is either slow, wrong or both. All because the API is bad.
> It puts the choice of guaranteed writes vs performance to the developer.
Yes, and it's a completely false choice. The entire point of this thread is that fsync is an incredibly difficult API to use in a way that gets you the guarantees you need ("don't lose the writes to this file"). And that the consistency guarantees of specific filesystems, VFS, POSIX, and their interactions are not easy to understand even for the experienced -- and it can be catastrophic to get wrong.
It isn't actually a choice between "Speed vs correctness". That's a nice fairy tale where people get to pretend they know what they're up against and everyone has full information. Most programmers aren't going to have the attention to get this right, even good ones. So then it's just "99.9% chance you fucked it up and it's wrong" and then your users are recovering data from backups.
The filesystem is very easy to use for simple things by comparison to a DB, and it's more accessible from the shell. But you're right, the filesystem is very difficult to use in a power failure safe way. SQLite3 has great power failure recovery testing, so the advice to use SQLite3 for any but the simplest things is pretty good.
It'd be very nice to get some sort of async filesystem write barrier API. Something like `int fbarrier(int fd)` such that all writes anywhere in the filesystem will be sync'ed later when you `fsync()` that fd.
It would also be very nice to have an async `sync()`/`fsync()`. That may sound oxymoronic, but it's not. An async `sync()`/`fsync()` would schedule and even start the sync, then provide a completion notice so the application can do other work while the sync happens in the background. One can do sync operations in worker threads and then report completion, but it'd be nice to have this be a first-class operation. Really, every system call that does "I/O" or is (or can be) slow should have been designed to be async-capable.
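On Linux, io_uring already gets you most of the way there: you can queue an fsync and reap its completion later instead of blocking. A rough sketch using liburing (the filename is made up; link with -luring):

```c
/* Submit an fsync asynchronously, keep working, then collect the completion. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int main(void) {
    int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) != 0) return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);      /* 0 = full fsync, not just data */
    io_uring_submit(&ring);

    /* ... the application can do other work here while the sync runs ... */

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) == 0) {
        printf("fsync completed, res = %d\n", cqe->res);  /* 0 on success */
        io_uring_cqe_seen(&ring, cqe);
    }
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```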
> So if I do fopen/fwrite/fsync/fclose, that is not enough?
That is my understanding.
> Also, how many levels of parents do you need to fsync?
Only one, at least if you didn't create the parent directory (if you did then you might have to fsync its parent, recursively). The fsync on the parent directory ensures the dir entry for your new file is flushed to disk.
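Concretely, the whole sequence for durably creating a new file looks roughly like this (a sketch with hypothetical paths, assuming the parent directory already existed):

```c
/* open/write/fsync/close the file, then fsync its parent directory so the
 * directory entry for the new name is also on disk. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int create_durably(void) {
    const char *msg = "hello\n";

    int fd = open("/some/dir/newfile", O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return -1;
    if (write(fd, msg, strlen(msg)) < 0) { close(fd); return -1; }
    if (fsync(fd) != 0) { close(fd); return -1; }  /* the file's data + inode */
    close(fd);

    /* Now the directory entry: without this, the name itself may be lost. */
    int dirfd = open("/some/dir", O_RDONLY | O_DIRECTORY);
    if (dirfd < 0) return -1;
    int rc = fsync(dirfd);
    close(dirfd);
    return rc;
}
```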
I've never heard about this in my years of programming. I just tried to read through the Win32 documentation, as I've done several times over the years, and it mentions a lot of edge cases but not this that I could see.
Is this some Linux/Unix specific thing? Am I blind?
As far as I know, it's specific to the combination of certain POSIX-ish OSes and file systems, like Linux/ext3. I have no clue what BSD does here, or whether ReiserFS is different.
Windows/NTFS is a different world, there are still edge cases that can go wrong but I don't think this particular one is a problem because FAT/NTFS is not inode-based.
I imagine if you looked at the SQLite source code you'd see different edge-case-handling code for different OSes.
The thing about Windows is that because the file open operation (`CreateFile*()`) by default prevents renames of the file, Windows apps have come to not depend so much on file renaming, which makes one of the biggest filesystem power failure hazards less of an issue on Windows. But not being able to rename files over others as easily as in POSIX really sucks. And this doesn't completely absolve Windows app devs of having to think about power failure recovery! "POSIX semantics" is often short-hand for shortcomings of POSIX that happen to also be there in the WIN32 APIs, such as the lack of a filesystem write barrier, file stat-like system calls that mix different kinds of metadata (which sucks for distributed filesystem protocols), and so on. And yes, you can open files in Windows such that rename-over is allowed, so you still have this problem.
Ya, NTFS file naming and long paths are a wreck. Unzipping files from Mac/Linux is an easy way to end up with missing data. Applications quite often break on long file paths, especially Microsoft's own stuff like PowerShell.
More specifically, these things are for trying to improve the behavior on unclean system shutdown (e.g. power loss), which is inherently chaotic; unless all parts (most critically the disk and its controller) are well behaved, you don't have any real guarantees anyway.
Windows also doesn't guarantee that data is written to disk by the time WriteFile/CloseHandle returns; the Windows version of fsync is FlushFileBuffers.
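Roughly the Windows equivalent of the open/write/fsync sequence, sketched with a made-up path: WriteFile succeeding only means the data is in the system cache, and FlushFileBuffers is the call that asks for it to reach the disk.

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    HANDLE h = CreateFileW(L"C:\\temp\\data.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    const char msg[] = "hello";
    DWORD written = 0;
    if (!WriteFile(h, msg, sizeof msg - 1, &written, NULL)) {
        CloseHandle(h);
        return 1;
    }

    /* Equivalent in spirit to fsync(fd). */
    if (!FlushFileBuffers(h)) {
        printf("flush failed: %lu\n", GetLastError());
    }
    CloseHandle(h);
    return 0;
}
```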
One of the advantages of using a database for files is that it's relatively more likely that these platform-dependent considerations were indeed considered, and you don't have to moonlight as a DBMS engineer while writing your application.
One reason I like using SQLite so much in my personal projects: sometimes I just use it to store various bits of config data as row entries in the most generic table ever. It works well with small projects, probably not so much with large-scale ones.
Great, another case of a simple thing everyone knows is simple but turns out to be horrifyingly complicated and heavily system-dependent, but only on occasion, so you can get 80% of the way through your career before encountering the gaps in your knowledge.
I guess I'll add it to the list.
Of course on the other hand, I was already thinking that I should just use SQLite for all my file handling needs. This little nugget makes me think that this was the correct intuition. [Cue horrifying revelations w.r.t. SQLite here.]
w.r.t SQLite, the only horrifying revelation I’ve had is that it allows NULLs in composite primary keys, which I’ve seen lead to some nasty bugs in practice.
You don't even need to bother with fsync unless you are developing a database or using the file system like a database.
There's a reason Apple made fsync useless and introduced F_FULLFSYNC. They know developers have an incentive to overestimate the importance of their own files, to the detriment of system responsiveness and power draw.
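For reference, on macOS the stronger "really get it to stable storage" request is a fcntl rather than a different sync call. A minimal sketch (the fd is assumed to be an already-open, written-to file), following the common pattern of falling back to plain fsync where the operation isn't supported:

```c
#include <fcntl.h>
#include <unistd.h>

int full_sync(int fd) {
    /* Ask the drive to flush its own cache too, not just the OS buffers. */
    if (fcntl(fd, F_FULLFSYNC) == -1) {
        /* Some filesystems don't support it; fall back to plain fsync. */
        return fsync(fd);
    }
    return 0;
}
```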
The problem is, Apple's fsync doesn't even introduce any useful semantics. All it says is "write this eventually".
What a lot of people really need is just a "provide ordering, don't leave me with inconsistent data" operation. If you lose power "after" a write but before the fsync, for any non-networked application that's no different than losing power before the write, as long as the filesystem doesn't introduce chaos.
That seems significantly harder to implement, though, given that dirty page cache entries don't carry any ordering information as far as I know, and so retroactively figuring out which writes not to reorder with others is anything but trivial.