
For what it's worth, `man fdatasync` on Linux says (emphasis mine):

  fdatasync() is similar to fsync(), but does not flush modified metadata *unless that
  metadata is needed* in order to allow a subsequent data retrieval to be correctly handled.
  For example, changes to st_atime or st_mtime (respectively, time of last access and time
  of last modification; see inode(7)) do not require flushing because they are not
  necessary for a subsequent data read to be handled correctly. On the other hand, a change to
  the file size (*st_size, as made by say ftruncate(2)*), *would require a metadata flush*.
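
To make the quoted rule concrete, here is a minimal sketch of an append followed by fdatasync(). The helper name `append_and_sync` is hypothetical and not taken from any real code base. Under the Linux wording above, the size change caused by the append is metadata that fdatasync() has to flush, while the mtime update is not.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* ask for the fdatasync() declaration */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append len bytes to path, then fdatasync().
** Returns 0 on success, -1 on error (errno describes the failure). */
int append_and_sync(const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted, retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* The append grew st_size. Per the Linux man page text, that is
    ** metadata needed to read the new data back, so fdatasync() must
    ** flush it. The st_mtime update may stay unflushed. */
    if (fdatasync(fd) != 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

Whether the new size actually survives a crash still depends on the OS honouring that wording, which is the point under discussion.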


It does say that for Linux now, but it didn't use to, and it doesn't on other OSes.

FreeBSD says the opposite at https://www.freebsd.org/cgi/man.cgi?query=fdatasync&sektion=...:

  The fdatasync() system call causes all modified data of fd to be moved to
  a permanent storage device.  Unlike fsync(), the system call does not
  guarantee that file attributes or metadata necessary to access the file
  are committed to the permanent storage.

I.e. "does not guarantee that ... metadata necessary to access the file are committed".

Thus, "given SQLite is written to be so portable and cautious and has been around a long time", I'm surprised it relies on fdatasync().

I'm not sure if we should believe the FreeBSD documentation, as current OpenBSD, NetBSD and Solaris have statements that amount to much the same as current Linux. However, if the documentation is to be believed, SQLite's durability using fdatasync() is broken on FreeBSD after a file append.

My point really is that, if I recall correctly, what FreeBSD's man page says isn't an accident; it's indicative of what used to be the understood guarantee provided by most if not all implementations. Actual guarantees were not exactly specified, and I think there may have been a vague transition in the meaning of fdatasync() from "commits the data blocks" to "commits data blocks and the metadata necessary to access them" over time, because the latter is obviously much more useful.

But that transition over time is why it seems a risk to assume it in highly portable software.
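
One way a cautious portable program could sidestep the ambiguity is to call fsync() whenever the file size has changed since the last sync, and fdatasync() only for in-place overwrites. The following is a minimal sketch under that assumption. `sync_fd_cautiously`, `last_synced_size` and `demo_cautious_sync` are hypothetical names, and this is not what SQLite does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* ask for fdatasync() and pwrite() */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper. *last_synced_size holds the file size at the
** previous successful sync. Returns 0 on success, -1 on error. */
int sync_fd_cautiously(int fd, off_t *last_synced_size) {
    struct stat st;
    if (fstat(fd, &st) != 0) return -1;

    int rc;
    if (st.st_size != *last_synced_size) {
        rc = fsync(fd);      /* size changed: do not rely on fdatasync() */
    } else {
        rc = fdatasync(fd);  /* same size: data-only flush */
    }
    if (rc != 0) return -1;

    *last_synced_size = st.st_size;
    return 0;
}

/* Usage sketch: an append (size changes, fsync() path) followed by an
** in-place overwrite (size unchanged, fdatasync() path). */
int demo_cautious_sync(const char *path) {
    off_t synced = 0;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;
    int ok = write(fd, "abcd", 4) == 4
          && sync_fd_cautiously(fd, &synced) == 0
          && pwrite(fd, "ABCD", 4, 0) == 4
          && sync_fd_cautiously(fd, &synced) == 0
          && synced == 4;
    close(fd);
    return ok ? 0 : -1;
}
```

File size is only one piece of access-critical metadata. Writes that allocate new blocks inside a sparse file leave the size unchanged, so this check would not catch them.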

That said, back in the Linux 2.6 days even fsync() wasn't durable on consumer HDDs, though it was on SCSI HDDs. Then for a while O_DSYNC (which is like fdatasync() after every write) was unreliable for data retrieval (see the Persona link in GP post), but O_SYNC (which is like fsync() after every write) worked.
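
For reference, the two open(2) flags mentioned above are used like this. A minimal sketch; `open_synchronous` and its `full_sync` parameter are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* ask for the O_DSYNC definition */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open path for appending with synchronous writes.
** full_sync != 0 -> O_SYNC  (each write acts like write + fsync)
** full_sync == 0 -> O_DSYNC (each write acts like write + fdatasync)
** Returns the file descriptor, or -1 on error. */
int open_synchronous(const char *path, int full_sync) {
    int flags = O_WRONLY | O_CREAT | O_APPEND;
    flags |= full_sync ? O_SYNC : O_DSYNC;
    return open(path, flags, 0644);
}
```

With either flag, write() is meant to return only after the data has reached stable storage, so the same question about which metadata is included applies to O_DSYNC as to fdatasync().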


From SQLite's os_unix.c:

    /*
    ** We do not trust systems to provide a working fdatasync().  Some do.
    ** Others do no.  To be safe, we will stick with the (slightly slower)
    ** fsync(). If you know that your system does support fdatasync() correctly,
    ** then simply compile with -Dfdatasync=fdatasync or -DHAVE_FDATASYNC
    */
    #if !defined(fdatasync) && !HAVE_FDATASYNC
    # define fdatasync fsync
    #endif


From SQLite's autoconf/configure.ac:

    AC_CHECK_FUNCS([fdatasync usleep fullfsync localtime_r gmtime_r])

In other words, it automatically uses fdatasync if that's available on any target OS.

SQLite's os_unix.c actually confirms my suspicions about fdatasync:

    /* fdatasync() on HFS+ doesn't yet flush the file size if it changed correctly

Given its portability, and the folklore around fdatasync not flushing the file size on other OSes at some point (and the current FreeBSD man page), I'm still surprised the code as written assumes the size issue only exists on OS X.



