It was for work so I can't share it, but it's read-only and assumes an immutable...

simonw · 2025-02-06T23:30:49 1738884649

Wow, that thing is really clever.

Suggestion: since having multiple writers will corrupt the database, it's worth investigating if the recent (November 2024) S3 conditional writes features might allow you to prevent that from accidentally happening: https://simonwillison.net/2024/Nov/26/s3-conditional-writes/

electroly · 2025-02-07T15:41:42 1738942902

I think that works, but I think they'll need to use it to implement an external lock rather than coordinating writes to a single object as those comments suggest. They use an object per page, not an object per database, so they have to synchronize updates to multiple objects instead of just one. But a global writer lock is what SQLite does locally, anyway. It's a good idea.

I see some challenges, though. Can we implement a reader-writer lock using conditional writes? I think we need one so that readers don't have to take the writer lock to ensure they don't read some pages from before an update happening concurrently and some pages from after. If it's just a regular lock, oops--now we only allow a single reader at a time. I wonder if SQLite's file format makes it safe to swap the pages out like this and let the readers see a skewed list of pages and I'm just worrying too much.

A different superpower we could exploit is S3 object versions. If you require a versioned bucket, then readers can keep accessing the old version even in the face of a concurrent update. You just need a mechanism for readers to know what version to use.

ncruces · 2025-02-08T16:48:34 1739033314

The problem with a block/page based solution (block per page is a bad idea, IMO, because S3/GCS are cost/performance optimized for around megabyte objects) is consistency: ensuring a reader sees a consistent set of blocks, i.e. all of the same database version.

Under concurrency, but also under a crash.

Similarly, the big issue with locks (exclusive or reader/writer) is not so much safety, but liveness.

It seems a bit silly to need to have (e.g.) rollback journals on top of object storage like S3.

Similarly, the problem with versioning is that for both S3/GCS you don't get to ask for the same version of different objects: there's no such thing.

So you'd need some kind of manifest, but then you don't need versioning, you can use object per version.

Ideally, I'd rather use WAL mode with some kind of (1) optimized append for the log, and (2) check pointing that doesn't need to read back the entire database. I can do (1) easily on GCS, but not (2); (2) is probably easier in S3, not sure about (1).

ncruces · 2025-02-08T13:35:26 1739021726

The read only bit is definitely a relatively simple afair.

My Go driver supports it, through a combination of a VFS that can take any io.ReaderAt, and something that implements a io.ReaderAt with http Range requests (possibly with some caching).

https://github.com/ncruces/go-sqlite3/blob/main/vfs/readervf...

This is compatible with S3, but also GCS, and really any thing that supports http Range for GET.

I've wanted to implement something writeable with concurrency. Not that it'll be great, but because people tend to do worse anyway (like run SQLite on top of something like GCS FUSE, and we can definitely do better than that), so why not?

I've also worked with GCS concurrency features and locks, so I'd like to try and do better. But it's hard to be compatible with multiple cloud providers.

So thanks for that!

2shortplanks · 2025-02-07T07:23:29 1738913009

I’m really curious about the number of requests you end up making. I’d imagine that this would get expensive fast in a web app - AWS charge $0.40 per million read requests, so if you’re seeing ten request to your app server a second and that ends up making a hundred requests to S3, that’s going to come in at at about $34 a day!

electroly · 2025-02-07T15:18:54 1738941534

We don't use it for anything interactive or user-facing, it's too slow for that. It's for some large+cold data that is used in batch analysis jobs that need to query out a small subset of the data using relatively complicated indexed queries. We probably don't get more than a few thousand queries per day through the system which is no big deal and basically free, but migrating this cold data to S3 allowed me to save a few terabytes in our expensive SQL Server cluster which is a big deal.