This is a bad idea. Every use case they mention has a simpler, more performant option that’s easier to maintain. It reads like someone asked an LLM if they can use git as a database, then decided to post that to the web because being controversial gets clicks.
That's exactly what they're doing; it's just driving engagement for their sales pitch:
> While Git makes an interesting database alternative for specific use cases, your production applications deserve better. Upsun provides managed PostgreSQL, MySQL, and other database services
This should have the opposite effect: I would not trust my database service to anyone who thinks this is a solution worth considering for the use cases they identified.
Ex-insider here: Git actually is used internally as a database of sorts, and it makes sense for many other reasons not cited. You're guaranteed to be able to access the "database" in a decade or two if you need to restore stuff, without having to worry about binary compatibility. For more complex data types you can even stash an SQLite database in a git note. But this is very much about a very specific use case, and the last paragraph could have been written less salesy, more as a "don't try this at home, don't run with scissors" kind of message.
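(If anyone wants to try the git-note trick: roughly, you write the SQLite file into the object store as a blob and attach that blob as a note. A minimal Python sketch, assuming you're inside an existing repo; "parts.sqlite" is a made-up file name and this is just the general idea, not how it's done internally.)

```python
import subprocess

def run(*args):
    """Run a git command and return its stdout, stripped."""
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout.strip()

# Write the SQLite file into the object store as a blob, then attach that
# blob as a note on HEAD. "parts.sqlite" is an invented file name.
blob = run("git", "hash-object", "-w", "parts.sqlite")
run("git", "notes", "add", "-f", "-C", blob, "HEAD")

# Reading it back later (binary content, so no text=True here):
data = subprocess.run(["git", "notes", "show", "HEAD"],
                      check=True, capture_output=True).stdout
with open("restored.sqlite", "wb") as f:
    f.write(data)
```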
So this is kind of the opposite of the "use Postgres for everything" vibe.
Consider Git as an actual database if you were already considering SQLite or simple text files, or consider wrapping those with Git, because it will give you quite a bit of extra oomph with very little effort.
I personally used git as a database on a project where we had just 200 records, with infrequent changes and some bulk updates, that needed, for weird reasons, to be synchronized every ~12 months, and yes, some of the nodes were air-gapped. As far as I know, more than 15 years later this is still running with basically zero maintenance (other than git being upgraded with OS updates).
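For the curious, air-gapped sync with git is mostly just `git bundle` plus removable media. This isn't necessarily how that project did it, just a rough sketch; the paths and branch name are invented:

```python
import subprocess

def git(*args, cwd):
    """Thin wrapper around the git CLI."""
    subprocess.run(["git", *args], cwd=cwd, check=True)

# Connected side: pack everything reachable from main into one file
# that can travel on removable media.
git("bundle", "create", "/mnt/usb/records.bundle", "main", cwd="/srv/records-repo")

# Air-gapped side: verify the bundle, then fetch from it as if it were a remote.
git("bundle", "verify", "/mnt/usb/records.bundle", cwd="/opt/records-repo")
git("fetch", "/mnt/usb/records.bundle", "main:refs/remotes/sneakernet/main",
    cwd="/opt/records-repo")
git("merge", "refs/remotes/sneakernet/main", cwd="/opt/records-repo")
```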
Why is it a bad idea? They only mention one use case, but I have been in a similar situation where git was already being used and, like them, I simply needed to connect into the same functionality with my service. Sure, I could have come up with all kinds of complex solutions to synchronize git with other databases... Or I could just use git. It works fine. Why wouldn't it? That's the type of workload git is designed for.
The article lists the following "strengths", which imply the use cases. Let's discuss!
1. Built-in audit trails. Git's a hash tree, right? So you can't tamper with it? Wrong. You can force push and delete commits, and by default git garbage-collects dangling commits and old branches (see the sketch after this list). To get a real cryptographic audit trail, you'd need to publish the hash tree publicly and non-repudiably so your auditor can tell whether you tampered with the logs.
2. Atomic transactions. Many databases support these, with a lot of granular control over the details. Postgres is one of them. (The ones that don't may have good reasons.) Git does not have much granular control; merges are not guaranteed to keep any data intact.
3. Distributed architecture. Good point. But wait, that's not really true. How do we decide what the real "main" is? It's whatever main is on your upstream, which in almost all cases is a centralized server. And if it's not, git provides no automated tooling for resolving conflicts between multiple sources of truth (no solution to the Byzantine generals problem, for example). So this is a red herring.
4. Content addressing. Does this mean a commit hash? Like, say, an _index_ but without the ability to index over any other data?
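Here's the sketch for point 1: a throwaway demo (temp repo, made-up ledger file) of how easily committed history disappears once the reflog is expired and gc runs:

```python
import subprocess, tempfile, os

repo = tempfile.mkdtemp()

def git(*args):
    return subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True).stdout.strip()

git("init", "-q")
git("config", "user.email", "auditor@example.com")
git("config", "user.name", "Auditor")

# A "tamper-proof" record.
with open(os.path.join(repo, "ledger.txt"), "w") as f:
    f.write("paid supplier 100\n")
git("add", "ledger.txt")
git("commit", "-q", "-m", "original entry")
original = git("rev-parse", "HEAD")

# Rewrite it after the fact.
with open(os.path.join(repo, "ledger.txt"), "w") as f:
    f.write("paid supplier 10\n")
git("add", "ledger.txt")
git("commit", "-q", "--amend", "-m", "original entry")

# Drop every trace of the old commit.
git("reflog", "expire", "--expire=now", "--all")
git("gc", "--prune=now", "--quiet")

# The original commit is now gone from the object store.
gone = subprocess.run(["git", "cat-file", "-e", original],
                      cwd=repo, capture_output=True)
print("original commit still exists?", gone.returncode == 0)   # False
```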
In my career, I have seen git used twice as a "database" in a sensible way: for tracking source code changes (that is, changes to structured text files). This is precisely the use case it was designed for. And in both cases it was actually underutilized: no real merge capability, just file versioning, so they could have gone simpler.
But git is not even good for most "collaborative text editing", since merge handling is not automatic. You wouldn't use it for a Google Docs-style platform, for example.
> for tracking source code changes (that is, changes to structured text files).
Which is exactly what they claim to use it for.
> so they could have gone simpler.
Maybe, but they also claim that integrating with existing git use is a desirable, maybe even necessary, property. So you can only go simpler up to the point where you have to actually interface with git, and then any downsides of git come rearing their ugly head anyway. So what have you gained by not just using git from top to bottom?
I will grant that there is one use case where git is better than postgres, which is tracking source code changes. But to use git as a database instead of postgres is ... misleading? Clickbait? Trolling?
> But to use git as a database instead of postgres is ... misleading? Clickbait? Trolling?
Contextual.
You are right that the article is a bit hard to follow where it goes down roads that it also says you shouldn't actually do, but that is there just to explain that git is a database, which isn't as obvious as it should be to some people. There are no doubt better ways to get that idea across, but someone in the business of providing managed Postgres hosting isn't apt to have writing long-form text as a core competency. Nor should one need to have it as their core competency. Writing is for everyone — even those who aren't particularly good at it.
For a small project, this would be fine. Maybe a slight step up from just storing data in a single local file. For a more serious small project, SQLite is the way to go, as git's 'atomic' commits are still going to be relatively slow.
I assume you'd struggle to get a few hundred commits per second even on good hardware?
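Easy enough to get a rough number yourself. A crude sketch (temp repo, one add + commit per record); the absolute figure will swing with disk, fsync settings, and the process-spawn overhead of shelling out, so treat it as a ballpark rather than a benchmark:

```python
import subprocess, tempfile, time, os

repo = tempfile.mkdtemp()

def git(*args):
    subprocess.run(["git", *args], cwd=repo, check=True, capture_output=True)

git("init", "-q")
git("config", "user.email", "bench@example.com")
git("config", "user.name", "bench")

N = 500
start = time.time()
for i in range(N):
    # One record update = one add + one commit, the naive way.
    with open(os.path.join(repo, "record.txt"), "w") as f:
        f.write(f"value {i}\n")
    git("add", "record.txt")
    git("commit", "-q", "-m", f"update {i}")
elapsed = time.time() - start
print(f"{N / elapsed:.0f} commits/sec")
```

SQLite committing many rows in a single transaction plays in a different league, which is the parent's point.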
If this sounds interesting, you should possibly check out Dolt: https://github.com/dolthub/dolt “Git for Data! (...) a SQL database that you can fork, clone, branch, merge, push and pull just like a Git repository.”
With respect to querying Git repos, I was pleasantly surprised with how usable git cat-file --batch was as a programmatic way to walk around the Git graph in http://canonical.org/~kragen/sw/dev3/threepowcommit.py, which reads about 8000 commits per second on my laptop—not fast, but not as slow as you'd expect.
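For anyone who hasn't played with it: the --batch protocol is just "object name in, `<sha> <type> <size>` header plus that many bytes out" over a single long-lived process. A minimal sketch (run inside any repo), walking first parents from HEAD down to the root:

```python
import subprocess

# One long-lived cat-file process; each request is one line on its stdin.
proc = subprocess.Popen(["git", "cat-file", "--batch"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def read_object(name: str):
    """Return (sha, type, body_bytes) for any object name git understands."""
    proc.stdin.write(name.encode() + b"\n")
    proc.stdin.flush()
    header = proc.stdout.readline().split()        # b"<sha> <type> <size>"
    if header[-1] == b"missing":
        raise KeyError(name)
    sha, objtype, size = header[0].decode(), header[1].decode(), int(header[2])
    body = proc.stdout.read(size)
    proc.stdout.read(1)                            # trailing newline after the body
    return sha, objtype, body

# Walk first parents from HEAD to the root commit.
sha, _, body = read_object("HEAD")
while True:
    header_lines = body.split(b"\n\n", 1)[0].splitlines()
    parents = [l.split()[1].decode() for l in header_lines if l.startswith(b"parent ")]
    if not parents:
        break
    sha, _, body = read_object(parents[0])
print("root commit:", sha)
```

Keeping one process open and streaming requests is what makes this tolerably fast; spawning a fresh `git show` per object would be far slower.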
Don't do this. I did this many years ago for a small internal "parts" database for our small EE team, since we needed an audit history of changes.
It was just awkward to use. Diffs were weird and hard to wrap a UI around, search wasn't great, it was hard to make ids sequential (the EE team hated uuids), etc., and conflict resolution usually still had to be done in a code editor. Even the killer app, the audit trail, didn't work quite the way the EE team wanted. Code to work around the disparities was probably half the codebase.
I ended up migrating to a normal database after a few months and everything was a lot better.
And actually my bigger takeaway was to be very very careful when adopting higher-level tools. Things that may seem to match requirements on the surface could easily have the same kind of mismatches encountered above.
A recent example of this was using Temporal for state management of multi-page user-driven workflows. In happy path use cases it made the code incredibly simple. But when CS had to get involved for a user, or when there was a bug and we had to patch a bunch of invalid user states in the backend after fixing it, there's often a need for some kind of direct access to that state, to update it in ways that hadn't necessarily been planned for. In Temporal, much of the state is an implicit "function pointer", which is what makes the code so nice in the happy path, but that you can't touch if you need to. Yeah you could rewrite the Temporal logic in terms of a state machine and event loop so that the function pointer always ends up at the same spot, but then you're essentially writing a state machine anyway, except without a good way to do bulk updates, and just paying extra latency, cost, and system overhead.
So it ended up that a database was still the best option. A little more boilerplate in happy path cases, but a lot more flexible in cases where unexpected things happened.
In the current system, rights, owners and debts of any land parcel in Germany are simply recorded in PDF files, with an ID for each parcel. So, when adding a "record", the govt employees literally just open the PDF file in a PDF editor and draw lines in the PDF, then save it again. Some PDF files are simply scanned pages of typewriter text (often even pre-WW2), so the lines are just added on top. It's a state-of-the-art "digital" workflow for our wonderful, modern country.
Anyway, so I wrote an entire system to digitize it properly (using libtesseract + a custom verifying tool, one PDF file => one JSON file) and track changes to parcels using git. The system was also able to generate Änderungsmitteilungen (change notices) using "git diff" and automatically send them via e-mail (or invoke a webhook to notify banks or other legal actors that the parcel had changed; currently this is done using paper and letters).
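(Not the actual code, obviously, but the diff-to-notice part is roughly this shape; the per-parcel file layout, addresses, and SMTP setup here are invented for illustration.)

```python
import subprocess, smtplib
from email.message import EmailMessage

def parcel_diff(repo, parcel_id, old_rev, new_rev):
    """Textual diff of a single parcel's JSON file between two commits."""
    # Hypothetical layout: one JSON file per parcel, named by parcel ID.
    return subprocess.run(
        ["git", "diff", f"{old_rev}..{new_rev}", "--", f"parcels/{parcel_id}.json"],
        cwd=repo, check=True, capture_output=True, text=True).stdout

def send_change_notice(diff_text, parcel_id, recipient):
    """E-mail the diff as a change notice (Änderungsmitteilung)."""
    msg = EmailMessage()
    msg["Subject"] = f"Änderungsmitteilung: Flurstück {parcel_id}"
    msg["From"] = "grundbuch@example.org"          # invented address
    msg["To"] = recipient
    msg.set_content(diff_text)
    with smtplib.SMTP("localhost") as smtp:        # assumes a local MTA
        smtp.send_message(msg)

# send_change_notice(parcel_diff("/srv/grundbuch", "12345", "HEAD~1", "HEAD"),
#                    "12345", "bank@example.com")
```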
It was a really cool system (desktop digitizing tool + web server for data management), but the German government didn't care, although some called it "quite impressive". So it's now just archived on GitHub. Their current problem with "digitization" is that every state uses different formats and extra fields, and they are still (after 20+ years) debating what the "standardized" database schema should be (I tried to solve that with an open, extensible JSON schema, but nah [insert guy flying out of window meme]). I'm a one-man show, not a multi-billion dollar company, so I didn't have the "power" to change much.
Instead, their "digital Grundbuch" (dabag) project has been a "work in progress" for 20+ years (https://www.grundbuch.eu/nachrichten/), because 16 states cannot standardize on a unified DB schema. So it's back to PDF files. Why change a working system, I guess. Germans, this is what your taxes are spent on. Oh well, the project was still very cool.
GitHub wouldn't care for the most part, because they have solid rate limits in place that would foil your plan to use this for a production app immediately.
Imagine the possibilities of using GitHub actions as a way to trigger events based on new data being saved. This could be used to sync data across multiple git databases.
This is built into git as "git hooks"; GitHub not required. But don't do it. Postgres has a system to trigger commands based on SQL events (e.g. on commit, when updating a row in a given table, etc.). Much more powerful and maintainable, since it is designed for this.
The fact is that basically every data structure can be abused as a database (Python dicts, flat files, a wave in mercury traveling down a long tube). Don't reinvent the wheel; learn to use a power tool like Postgres.
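For comparison, the Postgres version of "do something when new data lands" is a trigger plus NOTIFY, with any listener process picking it up. A rough psycopg2 sketch; the table, column, and channel names are made up:

```python
import select
import psycopg2

conn = psycopg2.connect("dbname=app")    # invented connection string
conn.autocommit = True
cur = conn.cursor()

# Fire a notification with the row's id whenever a record is inserted or updated.
cur.execute("""
    CREATE OR REPLACE FUNCTION notify_records_changed() RETURNS trigger AS $$
    BEGIN
        PERFORM pg_notify('records_changed', NEW.id::text);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS records_changed ON records;
    CREATE TRIGGER records_changed
        AFTER INSERT OR UPDATE ON records          -- invented table
        FOR EACH ROW EXECUTE FUNCTION notify_records_changed();
""")

# Any process can listen for those notifications.
cur.execute("LISTEN records_changed;")
while True:
    if select.select([conn], [], [], 60) == ([], [], []):
        continue                                   # timeout, keep waiting
    conn.poll()
    while conn.notifies:
        note = conn.notifies.pop(0)
        print("record changed:", note.payload)     # e.g. kick off a sync job here
```

No polling and no webhook plumbing; notifications are only delivered once the writing transaction commits.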
I never got around to part 2, but I made a git-auto-commit program (using Watchman), with one of my use cases being capturing the history of my logseq markdown notes.
I never did the next part, but yeah, I was hoping to write some agents running remotely to actively follow the repo and then rewrite history with better commit names. It was pre-AI-boom, but auto-summarizers and taggers running as their own agents were a fantasy ask, something the architecture felt like it would have been conducive to if/when that disaggregated work became possible.
Git as event sourcing is kind of the ground floor. Being able to decouple the processing is the loftier hope for centralizing around git, imo.
There are a lot of people yelling pretty loudly that "git would never work", "the performance wouldn't be enough", "write your own app-specific version control if you need it", but being in the world's best ecosystem with the world's best tools for version-controlled things feels like a no-brainer. There are so many fun distributed computing tasks folks get up to, so many ways of doing sync, but we developers already know a really good standard option, and the world of tools around it is enviable beyond imagining.
Performance might also have some big upsides, by virtue of processing being so seamlessly decentralizable here too. I do think that the giant "huge tables of all intermixed customer data" monoliths / dataliths that most apps use are not a good fit and won't scale nicely. But if you can give up some coherency and take advantage of the scale-out ability, it lets any machine or process loosely couple its way into the transaction processing or analysis pipelines, and it becomes more interesting and compelling. I've worked at a place where each customer had their own data tables, and Bluesky/atproto's one-SQLite-per-user Personal Data Stores also rather mirror this architecture of having lots of data stores, with separate systems doing any necessary aggregation in more typical big-table architectures. Scaling out a lot of gits feels like the place where git would win.
Wandering away from possible core architectural upsides briefly: the biggest challenge I see is figuring out how to save state nicely into files. To really take advantage of diffs and merges, we need to walk away from treating git like a blob store, and imo likely walk away from JSON stores too. Finer-grained updates, with the file system exposing the structure of the data, feel quasi-mandatory if you want git to really help you store data: you have to come down to its level and leverage the filesystem to file your data.
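Concretely, "coming down to git's level" might look like one directory per record and one small file per field, so independent edits touch different files and diff/merge cleanly instead of fighting over one big JSON blob. A toy sketch; the layout is invented:

```python
from pathlib import Path

def write_record(root: Path, record_id: str, fields: dict) -> None:
    """Store a record as a directory of one-value-per-file entries, so that
    edits to different fields touch different files and git can diff and
    merge them independently."""
    record_dir = root / "records" / record_id
    record_dir.mkdir(parents=True, exist_ok=True)
    for name, value in fields.items():
        (record_dir / name).write_text(str(value) + "\n")

def read_record(root: Path, record_id: str) -> dict:
    record_dir = root / "records" / record_id
    return {p.name: p.read_text().rstrip("\n") for p in record_dir.iterdir()}

# write_record(Path("."), "cust-42", {"name": "Acme", "tier": "gold", "seats": 12})
# then the usual: git add records/cust-42 && git commit -m "update cust-42"
```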
"Your Scientists Were So Preoccupied With Whether Or Not They Could, They Didn’t Stop To Think If They Should" (quote by Jeff Goldblum's character Ian Malcolm in the original Jurassic Park film). /s
More seriously, I agree that making a use case fit seems like too much of a stretch... Yes, it's cool that git can be used in this fashion, as a neat experiment! But even for non-serious needs, I'm not sure I would ever do this. Still, very clever to even think of and do this; kudos!