There's an option in the Postgres configuration named "track_commit_timestamp" that does this automatically. It's required to be enabled when using LWW as the conflict resolution model. If there's a tie, the node with the highest node number wins.
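For reference, a minimal sketch of turning that on and reading the timestamps back (the accounts table is just a placeholder):

    -- Enable commit timestamp tracking; this requires a server restart.
    ALTER SYSTEM SET track_commit_timestamp = on;

    -- Afterwards, the commit time of a row's last writer can be read from its
    -- xmin, which is the value an LWW resolver compares across nodes.
    SELECT pg_xact_commit_timestamp(xmin) AS committed_at, *
    FROM accounts
    WHERE id = 42;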
I don't recall which customer you may have been, but the standard solution to that specific DDL issue with BDR is to use Stream Triggers to enable row versioning. One of the 2ndQuadrant customers used it extensively for multi-region cross-version app schema migrations that could last for months.
Essentially what that boils down to is that you create stream triggers that intercept the logical stream and reshape incoming rows to match each node's column layout for its schema version. During the transition, the triggers were deployed to specific nodes while the modifications were rolled out. Once everything was on the new version, the triggers were all dropped until the next migration.
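As an analogy only (BDR's actual stream trigger API differs, and the table and column names here are made up), the per-version transform looks something like an ordinary row trigger:

    -- Old-version clients still write "full_name"; reshape it into the new
    -- split columns so both schema versions can coexist during the migration.
    CREATE OR REPLACE FUNCTION fixup_v1_rows() RETURNS trigger AS $$
    BEGIN
      IF NEW.first_name IS NULL AND NEW.full_name IS NOT NULL THEN
        NEW.first_name := split_part(NEW.full_name, ' ', 1);
        NEW.last_name  := split_part(NEW.full_name, ' ', 2);
      END IF;
      RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER fixup_v1_rows BEFORE INSERT OR UPDATE ON customers
      FOR EACH ROW EXECUTE FUNCTION fixup_v1_rows();

The real thing hooks into the apply side of replication rather than local DML, but the shape of the transform is the same.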
Spock doesn't have anything like that _yet_, but as you observed, being unable to use DDL replication significantly increases complexity, and tmux is a poor substitute.
Version 3.6 is definitely after we introduced that functionality. It was, alas, rarely deployed though. Usually clients needed a consultant to put together a configuration for them, and most didn't bother.
You can't even run pgbench unaltered on CockroachDB, as simple table structures and indexes are fundamentally different there. It is in no way a compatible product, and never has been.
In Postgres, updates contain the entire row, including all column values. Since the Spock extension follows the "Last Write Wins" model by default, one row version will win, while the other is essentially discarded. This is assuming the update happened on each node _before_ the new value was synchronized over, or essentially simultaneously.
You can address this partially for certain columns using a CRDT such as the Delta Apply functionality.
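A rough sketch of the semantics, using a made-up accounts table (the exact column-level options for enabling delta apply are in the Spock docs):

    -- Node A and node B both start from balance = 100 and commit these
    -- concurrently, before either change has replicated to the other side.
    UPDATE accounts SET balance = balance - 30 WHERE id = 1;  -- on node A
    UPDATE accounts SET balance = balance + 50 WHERE id = 1;  -- on node B

    -- Plain LWW keeps one whole-row version: balance ends up as 70 or 150,
    -- and the losing change is silently discarded.
    -- With delta apply on the balance column, each node applies the remote
    -- change as a difference: 100 - 30 + 50 = 120 on every node.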
I'm not against using UUIDs, but it has to be done in a balanced manner: normal numeric IDs for all surrogates, and then a UUID as an external lookup attribute. You expose that externally for API calls, tokens, etc., so you don't have to worry about leaking sequential values; since it's used for lookups, it should be indexed. Otherwise, it's purely a lookup that keys you into the JOIN chain for everything else.
That said, now that UUIDv7 is available, switching to that should be a priority.
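A minimal sketch of that split, with a hypothetical users table:

    -- The internal surrogate stays a plain bigint; the UUID is only the
    -- external handle, so it gets its own unique index.
    CREATE TABLE users (
        id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        public_id uuid   NOT NULL DEFAULT gen_random_uuid() UNIQUE,
        email     text   NOT NULL
    );

    -- API traffic resolves the external handle once, then every JOIN runs on
    -- the bigint key. Swap in a UUIDv7 generator for public_id where one is
    -- available to keep that index insert-friendly.
    SELECT id, email FROM users WHERE public_id = $1;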
You might want to consider a different name, as there's already a pgcat project which appears to be a logical replication extension to augment native logical replication.
LWW / LUW systems are a fine first approximation, but generally do not fulfill the strictest guarantees and desirable outcomes for a database without accounting for a lot of edge cases.
A good conflict management system will have that only as a default. Better ones will also provide mechanisms for overriding LWW in specific scenarios on a per table or even row basis depending on the in/out values from both nodes. The trick is building these conflict management rules so they're independently deterministic, or you'll end up with node divergence.
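As a purely hypothetical sketch of what I mean by "independently deterministic" (this is not an actual BDR or Spock API): the rule may only look at the two incoming tuples plus stable tie-breakers, never at local clocks or local state.

    -- Hypothetical per-table resolver: given both row versions and both node
    -- ids, every node must compute the same winner or they will diverge.
    CREATE FUNCTION pick_winner(local_row  jsonb, remote_row jsonb,
                                local_node int,   remote_node int)
    RETURNS jsonb LANGUAGE sql AS $$
      SELECT CASE
        WHEN (local_row->>'updated_at')::timestamptz >
             (remote_row->>'updated_at')::timestamptz THEN local_row
        WHEN (local_row->>'updated_at')::timestamptz <
             (remote_row->>'updated_at')::timestamptz THEN remote_row
        -- Exact tie: fall back to the higher node number, as with plain LWW.
        WHEN local_node > remote_node THEN local_row
        ELSE remote_row
      END;
    $$;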
Normally we just do what you said, and strongly recommend using sticky sessions, geographically disparate segments, etc., to prevent the problem before it happens. It's far easier to avoid a conflict than to correct it after the fact.
While a consensus model "fixes" this, it drastically cuts your throughput to a function of the latency of the most remote node in the quorum, essentially defeating the purpose of having a local Master node. Ideally you would reserve this for specific transactions that are sensitive in that they require absolute consistency regardless of incurred latency (See PACELC). And even this requires a distributed deadlock detector unless more than just the transactions are managed through the consensus layer.
BDR for example provides both of these models for the above reasons. Multi-Master is not an easy problem to solve, and requires a multi-faceted approach to even come close to something usable in a generalized scenario.
I don't agree with the "fine" in the "fine first approximation". I believe it is indeed a bad solution.
LWW/LUW basically leads to data loss, something unexpected for Postgres users, which potentially undermines its heritage of caring about data to the extreme.
From a practical perspective, if you run inventory or sales of physical goods, a counter over multi-master with said conflict resolution may lead to double bookings, lost bookings, or negative inventory stock. None of these outcomes are fine to me.
I'd prefer not to have conflict resolution and instead have the replication system explicitly check that the data sets are disjoint. In that case, these multi-master systems would be very useful (for a particular set of applications).
That consensus is "slow" is a problem also well solved in the real world. Sure, it's not easy, but when you partition and parallelize, you overcome many of the problems. Look at Spanner, AzureDB or the Postgres-compatible APIs of YugabyteDB or CockroachDB. They all have implemented multi-master with consensus.
Right tool for the right job. Users in Tokyo won't be editing accounts for users in Australia. There's often a natural partition there where you can safely leverage the data locality on an eventual consistency basis.
Conflict-free Replicated Data Types (CRDTs) and column-level conflict resolution also largely address the issue with data merges and coarse counter systems. I'll just ignore the general design flaw there, considering these should be an insert-only ledger rather than counters, because bad or naive implementations abound. I get that the application layer shouldn't necessarily have to coddle the database, given that absolute ACID would withstand non-ideal usage patterns.
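For what it's worth, a sketch of the ledger shape I mean, with made-up names; since every node only appends its own rows, there's nothing for LWW to discard:

    CREATE TABLE stock_movements (
        id          uuid        NOT NULL DEFAULT gen_random_uuid() PRIMARY KEY,
        item_id     bigint      NOT NULL,
        delta       integer     NOT NULL,  -- +n received, -n sold
        recorded_at timestamptz NOT NULL DEFAULT now()
    );

    -- The on-hand quantity is derived, not stored in a contended counter row.
    SELECT item_id, sum(delta) AS on_hand
    FROM stock_movements
    GROUP BY item_id;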
I haven't played with the others, but I set up an 8-node CockroachDB cluster and managed to get pgbench to run on it. I was clearly doing something wrong, because I wasn't impressed by the throughput even with very high client counts. From what I could see, I was being throttled by some mix of RTT and locking contention within the default 3-node consensus group. Even supposing that isn't a problem with non-overlapping shard locality, you still hit a hard performance ceiling when multiple clients interact with the same pool fragment.
> Right tool for the right job. Users in Tokyo won't be editing accounts for users in Australia. There's often a natural partition [...]
Exactly. That's why I'd say tools that support multi-master should provide and enhance (enforce) support for this use case, and avoid (not support) the cases where data durability and consistency are lost.
> Conflict-free Replicated Data Types (CRDTs) [...]
Yes, those are another solution to the problem... but I don't see them provided by BDR or Pgcat.
> I'll just ignore the general design flaw there, considering these should be an insert-only ledger system rather than counters [...]
It's not that I advocate for that system, but many users may choose this implementation. Saying "hey, you need to do all this and implement these best practices (and pass the Jepsen test) just to be sure you don't screw up data consistency and durability"... it's putting a heavy burden on the user. Databases, if anything, are about abstracting users away from complex data issues and providing solid primitives to work on them. Again: Postgres is known as a very well trusted data store with given ACID properties. Conflict resolution breaks away from this.
> I haven't played with the others, but I set up an 8-node CockroachDB cluster and managed to get pgbench to run on it. I was clearly doing something wrong, because I wasn't impressed by the throughput even with very high client counts [...]
I haven't benchmarked it myself, but I've heard stories of low performance. Actually, I always believed that the internal key-value store that they use would never scale to represent table workloads. But give the other solutions a shot. Spanner, for one, is known to process millions of transactions per second, with global (planet-scale) replication (of course, with notable lag, but total consistency). That's pretty cool ;)
> Actually, I always believed that the internal key-value store that they use would never scale to represent table workloads.
Care to elaborate here? As someone working at that layer of the system, I can say our RocksDB usage is but a blip in any execution trace (as it should be; given it's a distributed system, any network overhead would dominate single-node key-value perf). That aside, plenty of RDBMS systems are designed such that they sit atop internal key-value stores. See MySQL+InnoDB[0], or MySQL+RocksDB[1] as used at Facebook.
Don't get me wrong, both RocksDB and the work done by CCDB is pretty cool.
Yet I still believe that layering a row model as the V of a K-V store introduces, by definition, inefficiencies when accessing column data the way row stores do, as compared to pure row storage. It's not that it can't work, but I believe it can never be as efficient as a more row-oriented storage engine (say, like Postgres).
I have no idea what you're saying. What's "row-oriented storage" if not storing all the column values of a row in sequence, in contrast to storing all the values of a column across the table in sequence (aka a "column store")? What does the fact that it's exposed behind a KV interface have to do with anything? What's "more" about Postgres' "row-orientedness" compared to MySQL?
In case you didn't know, a row [idA, valA, valB, valC] is not stored as [idA: [valA, valB, valC]]. It's more like [id/colA: valA, id/colB: valB, id/colC: valC] (modulo caveats around what we call column families[0], where you can have it be more like the first form if you want). My explanation here is pretty bad, but [1][2] go into more detail.
I know it well. I have read most CCDB posts and the architecture documentation. Pretty good job.
There are several ways to map a row to K-V stores, and different databases have chosen different approaches, I'm not referring specifically to CCDB's.
Whether you do [idA: [valA, valB, valC]] or [id/colA: valA, id/colB: valB, id/colC: valC], what I'm saying is that I believe it is less efficient than [idA, valA, valB, valC], which also supports the case of compound keys more clearly (aka [idA, idB, idC, valA, ....]). Those two flat layouts are how Postgres stores rows.
I had a very protracted DPDR experience that started around when I was 12 and slowly dissipated over the course of a decade. I woke up one morning, and it was like I was slapped out of my body. It's like I was piloting a Me(ch) suit.
I wasn't seeing through my eyes. They were transmitting, with perceivable lag and some kind of acknowledged overlay, their sensory data. It's like being embedded in _extremely advanced_ and nearly seamless VR, but also hyper aware of the "nearly" part. It's like the Uncanny Valley effect, but directing it toward your own sensory system.
It doesn't feel quite real anymore, and you don't know why. So that's probably how I'd convey it. "You know the Uncanny Valley? Imagine everything you experienced felt that way."
That's an excellent point. By opening up this can of worms, MySQL fans essentially get free rein to take shots at Postgres for flaws it still has for one reason or another. Perceived or genuine.
On the other hand, sometimes a good flamewar really gets the juices flowing and sheds valuable light on painful truths. Even Postgres can learn from being the butt of a few jokes.