Spent a while reading up on how to reliably write files to disk, and I'm dropping the best resources I found here so it doesn't all just evaporate. The information is kind of scattered, so I needed to read all of this plus a bunch of mailing list posts to really get my head round it, but this list is plenty to be getting on with.
Writing data consistently using buffered IO (I didn't really consider direct IO at all):
- Excellent paper from 2014 on crash consistency and differences in filesystems, with common applications tested and analysed too: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
- The slides associated with the above paper also contain a useful toy example: https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_pillai.pdf (presentation: https://www.youtube.com/watch?v=SVYegdh2CbE)
- POSIX and the real world (from the O_PONIES ext4 discussions) gives good background understanding. This is from 2009 - ext4 has changed behaviour since then (as seen in the paper above): https://lwn.net/Articles/351422/
- SQLite documentation on how they do atomic commit (this page covers the rollback journal rather than WAL mode): https://www.sqlite.org/atomiccommit.html
- Decent short summary talk of this stuff: https://www.deconstructconf.com/2019/dan-luu-files
- This talk is worth it too: https://www.youtube.com/watch?v=LMe7hf2G1po
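For reference, the pattern most of these resources keep coming back to for replacing a file is: write a temp file, fsync it, rename it over the target, then fsync the parent directory. Here's a minimal sketch of that in Python, assuming a POSIX system (opening a directory to fsync it won't work on Windows). The `atomic_write` name and the `.tmp` suffix are just made up for illustration, and how much of this is strictly needed depends on the filesystem, as the paper above shows.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash should leave the old or the new contents."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical naming scheme, same directory as the target

    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write may write fewer bytes than asked for, so loop until done.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Flush the file contents to disk before the rename makes them visible.
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically swap the new file into place (rename within one filesystem).
    os.replace(tmp_path, path)

    # fsync the parent directory so the rename itself is persisted.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Usage is just `atomic_write("config.json", b"{}")`. Note this sketch lets any fsync error propagate as an exception and doesn't try to clean up the temp file.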
Closely related but a bit different is the stuff around fsync errors. PostgreSQL realised after 20 years that they had been handling this wrong (it got named "fsyncgate"). Summary: errors from fsync cannot be relied upon at all and the behaviour is essentially unspecified: different filesystems and kernels do different things (page cache may or may not be cleared, marked clean, etc). PostgreSQL eventually went with intentionally crashing and relying on their write-ahead log to recover from these failures:
- Good collection and summary of stuff: https://wiki.postgresql.org/wiki/Fsync_Errors
- LWN coverage is good: https://lwn.net/Articles/752063/
- Presentation about the issue: http://bofh.nikhef.nl/events/FOSDEM/2019/K.1.105/postgresql_fsync.webm
- Research paper called "Can Applications Recover from fsync Failures?". My interest in this stuff began to fizzle out a wee bit before I finished this (maybe this week...), but it looks good!: https://www.usenix.org/system/files/atc20-rebello.pdf