Combining event sourcing and stateful systems (stitcher.io)
130 points by brendt_gd on April 21, 2020 | 40 comments


I think to truly have an event-driven architecture you need to go a step or two further and be data-driven.

In other words, the appropriate way to describe your system is not as (subscribable) relationships between a set of components that reflect your presumptive view of a division of responsibilities. (That is just the non-event-driven way of doing things with the arrows reversed.)

Instead, you track external input types, put them into a particular stream of events, transform those events to database updates or more events, etc. Your entire system is this graph of event streams and transformations.
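To make that concrete, here's a minimal Python sketch. The event names and the transformation are made up, just to show the shape: external inputs become events, and the rest of the system is pure transformations between streams.

    from dataclasses import dataclass
    from typing import Iterable, Iterator

    @dataclass(frozen=True)
    class OrderPlaced:          # an external input type, captured as an event
        order_id: str
        sku: str
        quantity: int

    @dataclass(frozen=True)
    class StockReserved:        # a derived event, produced by a transformation
        sku: str
        quantity: int

    def reserve_stock(orders: Iterable[OrderPlaced]) -> Iterator[StockReserved]:
        # One node in the graph: a pure transformation from one stream to another.
        for order in orders:
            yield StockReserved(sku=order.sku, quantity=order.quantity)

    # The "system" is just streams wired through transformations; database writes
    # and web responses are further sinks at the edges of the same graph.
    for event in reserve_stock([OrderPlaced("o-1", "sku-42", 2)]):
        print(event)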

These streams may cut across what you thought were the different responsibilities, and you will have either saved yourself headaches or removed a fatal flaw in your design.

If you're thinking about doing work in this area, don't just reverse the arrows in your component design!


I'm really interested to understand your comment better.

Can you give an example for "presumptive view of a division of responsibilities" and generally the whole comment? Something like "bad way" vs "good way"? Thanks!


I'm currently reading this book, and it's clarified a lot for me with regard to structuring events https://www.manning.com/books/the-tao-of-microservices


It's abstract, but I'll try to get something down.

First, look at what happens to the system from the outside, say a web request that leads to a web response. In between, information is gathered from other areas (databases, program logic) and combined with the request data. There are also possibly other effects generated (writes to database state, messages to other users, etc.).

Now take all of those “effects”--the web response, but also the database updates, logs, messages, etc.--and look at each of them as a tree (going left to right, with the root, the result, on the right) where different kinds of information were combined and transformations were performed in order to get the result.

We’re being conceptual here, so imagine we’re not simplifying or squashing things together--the tree can be big and complicated. Also temporarily ignore any ideas you may have that there’s a difference between information coming from the “user” area versus the “admin” area versus the “domain object #1” area. In this world, those stores of information only exist to the extent they enable the flow that produces our results.

Now notice that there are many different requests and many different effects and responses. Thankfully, some number of the inputs are shared and reusable. Further, entire spans of nodes are in common (an event type) or entire subtrees are in common (a subsystem). These are your data streams and your modules. You didn’t add them in because you felt like there had to be a “user service” or an “object #1 service”--those commonalities factored out (to the extent they did) of the requirements of the data flows.

Often, there isn’t an “object #1” at all--that was a presumption used to put stakes down so you had somewhere to start. And our systems that are made up of things like “object #1 service” and “object #2 service” very frequently end up with problems of the form: “we can’t do that because object #1s don’t know about [aspect of object #2s]! Everyone knows that! We need a whole new sub-system!”. In the data-driven world the question is always the same: what data do you need to combine in order to get your result?

This isn’t to say all modules we usually come up with will turn out to be false ones (especially since a lot of the time we’re basing our architectures on past experience). For instance, that there is some kind of “user” management system is probably made inevitable by the common paths user-related data take to enter the system.

Now for the reverse argument: imagine you have a system that was done with the sort of modeling where there is an “object #1 service” that must get info from the “user service” and work with the “object #2 service” through the “object set mediator service”. You’re tracing through all the code that goes into formulating a response to requests, from start to finish, but someone has played a trick on you: they’ve put one of those censoring black bars over deployment artifacts, package names, and class names. The punchline is that your architecture inevitably is one of the trees described above--it’s just a question of how badly things are distorted because someone presumed the system comes from the behavior of “object #1”s and “object #2”s and not the other way around.


It is the same as arguing whether lambda calculus is better than pi-calculus or a Turing machine.

These are all isomorphic structures. None of them can do more than the others.

For example - you’re speaking of dependencies, etc - but any language based on statements can be reduced to a dependency graph defined by its single-assignment form.

Event sourcing is not a panacea.


I'm working with an event sourced system and we made some mistakes in the design process, so some areas that didn't need event sourcing do have it.

The biggest downside has been the UI: events are not real time and these objects are just CRUD stuff, so the user wants to see that you saved what they have just written. You might not have this information yet, so you need to mitigate it, for example by updating the UI through sockets (a lot of additional work).

On the upside, we are gaining a lot more insight into which business processes bring value and are meaningful, versus what I call "just configuration".

We figured out quite a few rules of thumb over time that are helpful though.

One thing I noticed over time is that on average there is no need for separate "created" and "updated" events; usually there is one meaningful business event that encompasses both (not always the case), e.g. "product listed", or something along those lines. This not only saves lines of code, but code reacting to this event has a reduced interaction surface (fewer bugs, less coupling), as well as being more expressive.
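As a rough sketch (the names here are made up), the single business event ends up looking something like:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ProductListed:        # one meaningful business event instead of
        product_id: str         # separate generic "created"/"updated" events
        title: str
        price_cents: int

    def apply(state: dict, event: ProductListed) -> dict:
        # Downstream code couples to a single, expressive event type.
        return {**state, event.product_id: {"title": event.title,
                                            "price_cents": event.price_cents}}

    state = apply({}, ProductListed("p-1", "Blue Widget", 1999))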

If you're interested, we chat a lot about event sourcing in the Eventide Slack channel: https://eventide-project.org/#community-section


This is super important and I cannot stress it enough! If your events contain words like "create", "update", "delete", "associate", or "disassociate", then you're building a weak domain model that won't benefit from the added complexity of deriving state from the source of events.

Your events should use the same words your customer would actually use to describe their business process. For example, a system to manage intake of patients in an ER would have events such as Patient Arrived, Patient Screen Completed, Patient Admitted, etc.

If you don't have such a vocabulary then you're not capturing interesting events, so don't store them. You probably want something that is event-driven instead, or perhaps simply to log actions to an audit table.


Indeed! It became pretty evident at some point: we had 4 models that had just "created", "updated" and "deleted" events and felt weak, and then we arrived at one model that had only a "configured" event, which needed all the information from the previous 4.

That's when we figured out that the other 4 were pointless, just "UI gimmicks to make the life of the user easier"; the only relevant event was "Configured", with all the related information in there.

The event modeling is the key part. It also helps in gaining a much deeper understanding of the business problem the software is trying to solve.


While this is correct for a LOT of applications, if you ever end up in a situation where multiple automated systems and manual users may be concurrently updating and associating/disassociating records, and last-write-wins won't cut it (due to e.g. some fields being co-dependent), you're going to need to merge their change streams in a custom way. "UpdatedRecordData(Type, ID, Value, Actor, Context, Date)" may not be a domain-specific vocabulary, but it's kind of necessary if you're building a system with a data model that can be extended by clients. Event-Sourced Salesforce will be a thing!
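As a sketch of the kind of thing I mean (the field names follow the event above; the merge rule is just an illustration, not a prescription):

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass(frozen=True)
    class UpdatedRecordData:
        type: str        # which client-defined record type was touched
        id: str
        value: dict      # field -> new value
        actor: str       # e.g. "human" or the name of an automated system
        context: str
        date: datetime

    def merge(events: list) -> dict:
        # Custom merge instead of last-write-wins: fold in timestamp order,
        # but let a human actor's edit to a field win over an automated one.
        record, sources = {}, {}
        for e in sorted(events, key=lambda e: e.date):
            for field, new_value in e.value.items():
                if sources.get(field) == "human" and e.actor != "human":
                    continue                 # keep the manual edit
                record[field] = new_value
                sources[field] = e.actor
        return record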


You should take a look at Microsoft's Durable Functions which pairs event sourcing + (optional) actor model + serverless. It's some pretty neat tech.

I tried doing something similar to this several years ago, and here's a few issues I ran into:

1. Pub/sub in Event Sourcing is a bad idea. It's really hard to get right. (what to do if sub happens after pub due to scaling issues/infrastructure, etc?) Instead it's better to push commands deliberately to a process manager that handles the inter-domain communication and orchestration.

2. Concurrency. Ensuring aggregates are essentially single-threaded entities is a must. Having the same aggregate id running in multiple places can cause some really fun bugs. This usually requires a distributed lock of some sort.

3. Error handling. I ended up never sending a command to a domain directly, instead I sent it to a process manager that could handle all the potential failure cases.
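A rough sketch of the shape of (3), with made-up domain names: commands are pushed deliberately to a process manager that owns the orchestration and the failure handling, rather than being sent straight to a domain.

    class OrderService:                      # stand-in for one domain
        def handle(self, command: dict) -> str:
            return "order-1"

        def cancel(self, order_id: str) -> None:
            pass

    class BillingService:                    # stand-in for another domain
        def charge(self, order_id: str, total: int) -> None:
            pass

    class PlaceOrderProcessManager:
        # The process manager handles inter-domain communication and all of
        # the potential failure cases in one place.
        def __init__(self, orders: OrderService, billing: BillingService):
            self.orders, self.billing = orders, billing

        def handle(self, command: dict) -> dict:
            try:
                order_id = self.orders.handle(command)
            except Exception as err:
                return {"status": "rejected", "reason": str(err)}
            try:
                self.billing.charge(order_id, command["total"])
            except Exception:
                # Compensate rather than leave the domains inconsistent.
                self.orders.cancel(order_id)
                return {"status": "failed", "compensated": True}
            return {"status": "accepted", "order_id": order_id}

    result = PlaceOrderProcessManager(OrderService(), BillingService()).handle({"total": 1999})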


For anyone interested in event sourcing with the actor model I've built an open source Elixir library called Commanded (https://github.com/commanded/commanded) which takes advantage of Erlang's BEAM VM to host aggregate processes. There's also an event store implemented in Elixir which uses Postgres for storage (https://github.com/commanded/eventstore).

The actor model provides the guarantee that requests to a single instance are processed serially, while requests to different instances can be processed concurrently. Distributed Erlang allows these instances to be scaled out amongst a cluster of nodes with transparent routing of commands to the instance, regardless of which connected node it is running on.

In Elixir and Erlang, the OTP platform provides the building blocks to host an aggregate instance as a process (as a `GenServer`). Following the "functional core, imperative shell" style I model the domain code as pure functions with the host process taking care of any IO, such as reading and appending the aggregate's events.


> Pub/sub in Event Sourcing is a bad idea

I find this point surprising. I would say the exact opposite. I would say that pub/sub and event sourcing are two sides of the same coin: events.

> what to do if sub happens after pub

That should only ever be a problem with a non-durable transport that doesn't have serialized writes per topic. Which, admittedly, can be pretty common. But it's not so much an event sourcing or pub/sub issue as much as a choice of message transport issue.

> Concurrency. Ensuring aggregates are essentially single-threaded entities is a must. Having the same aggregate id running in multiple places can cause some really fun bugs. This usually requires a distributed lock of some sort.

Or it requires partitioning the queues and using an optimistic lock when writing (just to be on the safe side).
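A minimal sketch of the optimistic-lock part, using an in-memory stand-in for the store: appends carry the stream version the writer last saw, and stale writes are rejected.

    class ConcurrencyError(Exception):
        pass

    class InMemoryEventStore:
        def __init__(self):
            self.streams = {}        # stream_id -> list of events

        def append(self, stream_id: str, expected_version: int, events: list) -> int:
            stream = self.streams.setdefault(stream_id, [])
            if len(stream) != expected_version:
                # Another writer got there first; the caller reloads and retries.
                raise ConcurrencyError(
                    "expected v%d, stream is at v%d" % (expected_version, len(stream)))
            stream.extend(events)
            return len(stream)

    store = InMemoryEventStore()
    store.append("account-1", expected_version=0, events=["Opened"])
    try:
        store.append("account-1", expected_version=0, events=["Deposited"])  # stale writer
    except ConcurrencyError as err:
        print(err)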


> I find this point surprising. I would say the exact opposite. I would say that pub/sub and event sourcing are two sides of the same coin: events.

I meant in the context of getting it right. I didn't experiment with all the pub/sub systems at the time, but most I experimented with would lose data in a catastrophic event and cause inconsistencies. This was several years ago though.


> most I experimented with would lose data in a catastrophic event and cause inconsistencies

Fair enough. Those are probably message buses or message queues that are ephemeral transports. Since event sourcing is predicated upon permanent storage of events, there's no way to lose events that have already been committed (unless someone actually physically deletes the events).


On [1] - You're correct that Pub/Sub is difficult to get right, but it can confer a bunch of benefits which make it worthwhile to struggle with. RabbitMQ + Replay-ability (the details of which will differ based on your design) + good data model design is usually a safe bet here.


Is (2) motivated by using aggregates which do not commute or by trying to do distributed modifications on a single value?

One common technique, if you have commutative aggregates, is to have each writer just write to their own spot and then do a range query on read to re-join. If your aggregates don’t commute this, of course, doesn’t work and you’re stuck in “single threaded” land. I do remember reading a paper that avoided this, but I can’t remember the implementation / trade-offs (if I can find the paper I’ll post it here).


What do you mean by commutative aggregates? My understanding is that an aggregate is state, whereas commutativity is a property between two operations on said state. Do you mean something like CRDTs?


I think what's meant is that the event stream can be reordered and give the same result.

This isn't necessarily an abuse of terminology - I believe if we look at an operator that takes sequences of events and appends them, what we're talking about is in fact commutativity of that operator (in terms of the impact on the system).


You're right, you can line the two concepts up to some extent. (I mentioned CRDTs because they're one way to do that in a particularly nice way.) But given the context of the question, it seemed like teasing out the difference would be helpful.


I'm not familiar with CRDTs, though I'll look into them -- thanks for the pointer. To clarify my original idea/question, the following example:

Yeah sure, imagine you're trying to compute a count of how many times you've seen something. This is a SUM aggregate.

I assume the reason the author said "Ensuring aggregates are essentially single-threaded entities is a must" was because they are mutating a single key: if two different processes try to change the same value, you get inconsistent state.

An example of this would be:

Current K/V Pair: "A" -> 2

Modifier 1: Reads "A", gets 2

Modifier 2: Reads "A", gets 2

Modifier 1: Writes "A" -> 3

Modifier 2: Writes "A" -> 3

This means our sum is now inconsistent, as we would expect the value to be 4 in a single-threaded system.

However, because we are doing a sum, we can use commutativity to remove this conflict. Instead of each writer trying to write to a single key, you might make a compound key (Id, <Unique Writer Identifier>)

Current Value Pairs: [("A", "Modifier 1") -> 1, ("A", "Modifier 2") -> 1]

Modifier 1: Reads ("A", "Modifier 1"), gets 1

Modifier 2: Reads ("A", "Modifier 2"), gets 1

Modifier 1: Writes ("A", "Modifier 1") -> 2

Modifier 2: Writes ("A", "Modifier 2") -> 2

Then, a reader could just ask for all the keys with the prefix "A" (this is the range query). So the reader gets back (2, 2), which it can now merge into 4, since SUM is a commutative operation. Because the reader is taking care of the final aggregation step, there's no concurrency conflict, so you can get away with having any number of writers as long as the read and the read-side computation are cheap enough.

Some aggregates are commutative only in certain forms, like AVERAGE. To explain further, if one writer says the AVERAGE is 5 and another says it is 7, I can't combine those to say that the global average is 6.

However, if each writer stores both SUM and COUNT, I can use (SUM1 + SUM2)/(COUNT1 + COUNT2). This is because division doesn't commute, so I have to delay the non-commutative operation until the read side if I want to be able to merge two data sources.
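In code, the compound-key version of the SUM example looks roughly like this (names made up):

    store = {}   # (key, writer) -> partial count; each writer owns its own cell

    def increment(key, writer):
        # No two writers touch the same cell, so there's no read-modify-write conflict.
        store[(key, writer)] = store.get((key, writer), 0) + 1

    def read_sum(key):
        # The "range query": gather every cell whose first component is `key`,
        # then apply the commutative merge (SUM) on the read side.
        return sum(v for (k, _), v in store.items() if k == key)

    increment("A", "modifier-1")
    increment("A", "modifier-2")
    print(read_sum("A"))   # 2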

Edited: Formatting


I see. I think you may be using a different definition of "aggregate" than the article and OP are referring to. Your "aggregate" is that from database systems like SQL and Excel, while the OP's "aggregate" is a different concept taken from the lexicon of Domain-Driven Design.

In DDD, an "aggregate" is a collection of state that can only be changed atomically, as a whole, by the outside world. It helps to think concretely of an Actor, i.e. some entity that owns a collection of state, such that the only way to mutate that state is to send a message to the actor. No matter how many messages come in from the outside world, only one is being handled at a time by the entity, and there are no concerns of concurrent access. Necessarily though, all messages are handled in a linear order.

From Martin Fowler's article on aggregates [0]:

> Aggregates are the basic element of transfer of data storage - you request to load or save whole aggregates. Transactions should not cross aggregate boundaries.

I think you'll be interested in CRDTs. They're like DDD aggregates where mutation operations are essentially always commutative, so there's a bit more alignment between the senses of "aggregate" being confused here.

[0] https://www.martinfowler.com/bliki/DDD_Aggregate.html
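A toy illustration of that "one message at a time" property, using a mailbox and a single worker thread (not production code, and the names are made up):

    import queue
    import threading

    class PatientAggregate:
        def __init__(self, patient_id: str):
            self.patient_id = patient_id
            self.state = {"admitted": False}
            self._mailbox = queue.Queue()
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, command: dict) -> None:
            self._mailbox.put(command)         # the only way in from the outside

        def _run(self) -> None:
            while True:
                command = self._mailbox.get()  # commands are handled one at a time
                if command["type"] == "admit":
                    self.state["admitted"] = True

    patient = PatientAggregate("patient-1")
    patient.send({"type": "admit"})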


I did a formal model of an event sourced system used in production and it was quite illuminating. It turns out that concurrency is something one should take into account when designing these systems. Versioning often refers to two things:

1. The event data itself; when business cases change or understanding grows we wish to add, remove, or change the type of different fields in an event record.

2. The current state of a projected model

The latter is what requires some form of coordination; otherwise you can end up with events being written in an incorrect order, producing the wrong state.

It is a good idea though to avoid event sourcing all of your models. Microsoft wrote about their experiences implementing an event-sourced application and how they reached that conclusion [0]. In my experience it's because of temporal properties: event sourced systems are inherently eventually consistent. When you have domain models that depend on one another, you will need to be quite certain that A eventually leads to B, which eventually leads to C, and that if a failure happens along the way, nothing is lost or irrecoverable.

[0] https://docs.microsoft.com/en-us/previous-versions/msp-n-p/j...


We encounter a similar problem at my current job (mixing systems that we want to keep stateful with systems that we want to make “real-time”/stream-based).

I think you’ve covered most of the problems you’ll encounter. One thing that sticks out to me is downtime: how will your order subscriber handle a product publisher that’s down or otherwise delayed? Then the events will potentially be out of order; is that a problem for you?

On another note, we follow the same bounded context principles, but we implemented it with Kafka + Confluent, since that infrastructure and those libraries were already available. Teams make their data accessible via a mix of “raw” Kafka topics and very refined gRPC services. Your subscriber is implemented as a cron job that reads from N streams and “reduces” them to 1 stream.

FWIW, we also store a transaction log in each of our databases, so we can generate a stream of object states relatively easily later on. This has helped a lot with converting old tables into streams, and vice versa.

The only thing that’s a persistent issue is schema changes. My only recommendation there is to never make them... In all seriousness, keep your data models small, and whenever you want to experiment with a schema change, add the new data as a FK’d table with its own transaction log, rather than a schema mutation to your core table. It’s never worth the headache if you take data integrity seriously.


How long did it take you to come up with this approach? How many meetings etc? As a lone developer I always wonder this stuff.


It took several hours of individual research (watching talks, reading blog posts), plus several multi-hour pair-programming sessions spread over four weeks, to come up with a solution we liked.

We informed our client that this was a new area for us and that we didn't have hands-on experience with it, but that we believed it would be beneficial to spend time exploring it, as it would be an elegant solution to several of their business problems. They agreed and we kept them in the loop with weekly meetings.

We're now in the phase of actually implementing real-life processes; the project will probably be in active development for another year or two.


How do you approach estimating effort for this sort of thing? I find it awkward enough in Scrum to guess up front how many days of effort research will take and commit to delivering a plan or design by the end. If you have clients and aren't strictly bound by someone else's framework, they still want some rough idea how long research will take. Especially if the client is footing the bill. If the research is un-billed time, then the estimate is critical to you. What do you do?


If you are perceived as just a cog delivering software patches, you've already lost. If you're presenting business proposals and designs, you've already provided research, analysis and business cases. Often this is a lot of extra unrewarded work. Though if a vendor has proven themselves already, they're in a position to ask for billable time for more, and different, types of roles. This upfront work is an investment in your own business and can be used to attract other clients, but sure, most of it goes down the toilet unless a good outlet is found for all that creative work. A bigger consultancy can have many people cooperating on such work in order to add that extra value for hiring companies.


Everything starts with trust, of course. We've proven ourselves in several large projects and within our open source community before. The client knows that.

There's never any estimate of "so many hours will be spent in total on this research"; we just communicate honestly with the client along the way, and they give us their trust.

Whether we can keep that trust is up to us.


Get Reaction Commerce behind an Airframe static site up quickly.

Have an http server between the two, and log the requests.

Users don't update products or orders directly. The critical data that needs to be logged runs on a private network. Something with a lot of validation updates that.

Don't delete services when creating new features. Keep them running.

Avoid a central source of truth. Instead log discrepancies. Then your procedure is "Don't do this" along with "do this".

A lot of microservices are built starting as low as you can go.

I prefer to install proven working software and close it to modification. Have a bunch of them running together instead of making something new.


    Events + state = state machine
I was working (briefly) at a startup once and we were having a meeting and the CTO sketched out his idea for our internal architecture and I looked at it and thought, "That's ethernet." I quit that job.

(Technically, I was let go. Friday I went to head of HR and said, "I think I'm gonna quit." Monday morning I was laid off. * shrug *)


I don't get what this has to do with ethernet.


They're both reinventing the wheel...


I would be interested in knowing how the reactors handle side-effects which should never be replayed. Is there some well-established pattern for doing this?


Reactors can keep their own state, including their current position in the event stream. When a replay is initiated, the reactor ignores events older than its current "head."


What happens when the server that holds the thing that holds that state is restarted?


That's an important question to ask!

Can you assume the reactor has durable, local storage with atomic transactions?

The answer is to model your design and use a sufficient level of rigour in validating that your system meets your requirements.

Maybe you could use a database server that has the right properties to ensure your reactor could survive a restart.

What if you want to add multiple reactors so that you can process an event stream with a high volume of events, faster?


What happens when the write of the current position fails?

It's the same problem as presuming that ACK messages in message brokers/queues are guaranteed to not fail.

Since the message transport and other durable resources are rarely able to be enlisted in the same atomic transaction, and since distributed transactions would largely be an antipattern, it would seem that a reactor that records its current position can't be presumed to be an infallible way of ensuring that messages aren't processed more than once.

In the end, it always comes back to ensuring that handlers are idempotent, having something that can be used as a stable and consistent idempotence key, and accepting that the idempotence logic is the responsibility of the handler coders rather than something we can count on generalized infrastructure for.
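As a sketch, the handler-level dedupe looks something like this (in practice the processed-keys set would be a durable table, ideally updated in the same transaction as the effect):

    processed_keys = set()    # in practice: a durable table, not an in-memory set
    balance = 0

    def handle_deposit(event: dict) -> None:
        global balance
        key = event["event_id"]      # a stable, consistent idempotence key
        if key in processed_keys:
            return                   # already applied; a replay is a no-op
        balance += event["amount"]
        processed_keys.add(key)      # ideally committed atomically with the effect

    handle_deposit({"event_id": "evt-1", "amount": 100})
    handle_deposit({"event_id": "evt-1", "amount": 100})   # redelivered after a failure
    print(balance)                   # 100, not 200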


How were the diagrams made?


With https://excalidraw.com/

Believe me: a whole new world will open once you've discovered it.


Pretty cool! I like the idea of edit a file -> save in your repo -> edit -> save. Export as needed.



