Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Is (2) motivated by using aggregates which do not commute or by trying to do distributed modifications on a single value?

One common technique if you have commutative aggregates is to have each writer just write to their own spot and then do a range query on read to re-join. If your aggregates don’t commute this, of course, doesn’t work and you’re stuck in “single threaded” land. I do remember reading a paper that avoided this, but i can’t remember the implementation / trade offs (If I can find the paper I’ll post it here)



What do you mean by commutative aggregates? My understanding is that an aggregate is state, whereas commutativity is a property between two operations on said state. Do you mean something like CRDTs?


I think what's meant is that the event stream can be reordered and give the same result.

This isn't necessarily an abuse of terminology - I believe if we look at an operator that takes sequences of events and appends them, what we're talking about is in fact commutativity of that operator (in terms of the impact on the system).


You're right, you can line the two concepts up to some extent. (I mentioned CRDTs because they're one way to do that in a particularly nice way.) But given the context of the question, it seemed like teasing out the difference would be helpful.


I'm not familiar with CRDTs, though I'll look into them -- thanks for the pointer. To clarify my original idea/question, the following example:

Yeah sure, imagine you're trying to compute count of how many times you've seen something. This is a SUM aggregate.

I assume the reason the author said "Ensuring aggregates are essentially single-threaded entities is a must" was because they are mutating a single key, if two different processes try to change the same value, you get inconsistent state.

An example of this would be:

Current K/V Pair: "A" -> 2

Modifier 1: Reads "A", gets 2

Modifier 2: Reads "A", gets 2

Modifier 1: Writes "A" -> 3

Modifier 2: Writes "A" -> 3

This means our sum is now inconsistent, as we would expect the value to be 4 in a single-threaded system.

However, because we are doing a sum, we can use commutativity to remove this conflict. Instead of each writer trying to write to a single key, you might make a compound key (Id, <Unique Writer Identifier>)

Current Value Pairs: [("A", "Modifier 1") -> 1, ("A", "Modifier 2") -> 1]

Modifier 1: Reads ("A", "Modifier 1"), gets 1

Modifier 2: Reads ("A", "Modifier 2"), gets 1

Modifier 1: Writes ("A", "Modifier 1") -> 2

Modifier 2: Writes ("A", "Modifier 2") -> 2

Then, a reader could just ask for all the keys with the prefix "A" (This is the range query). So a reader gets back (2, 2), which then can now merge into 4 as SUM is a commutative operation. Because the reader is taking care of the final aggregation step, there's no concurrency conflict so you can get away with having any number of writers as long as the read and read-side computation is cheap enough.

Some aggregates are commutative only in certain forms, like AVERAGE. To explain further, if one writer says the AVERAGE is 5 and another says it is 7, I can't combine those to say that the global average is 6.

However, if each writer stores both SUM and COUNT, I can use (COUNT1 + COUNT2)/(SUM1 + SUM2). This is because division doesn't commute, so I have to delay the non-commutative operation until the reader if I want to be able to merge two data sources.

Edited: Formatting


I see. I think you may be using a different definition of "aggregate" than the article and OP are referring to. Your "aggregate" is that from database systems like SQL and Excel, while the OP's "aggregate" is a different concept taken from the lexicon of Domain-Driven Design.

In DDD, an "aggregate" is a collection of state that can only be changed atomically, as a whole, by the outside world. It helps to think concretely of an Actor, i.e. some entity that owns a collection of state, such that the only way to mutate that state is to send a message to the actor. No matter how many messages come in from the outside world, only one is being handled at a time by the entity, and there are no concerns of concurrent access. Necessarily though, all messages are handled in a linear order.

From Martin Fowler's article on aggregates [0]:

> Aggregates are the basic element of transfer of data storage - you request to load or save whole aggregates. Transactions should not cross aggregate boundaries.

I think you'll be interested in CRDTs. They're like DDD aggregates where mutation operations are essentially always commutative, so there's a bit more alignment between the senses of "aggregate" being confused here.

[0] https://www.martinfowler.com/bliki/DDD_Aggregate.html




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: