
I agree with the author completely. I worked on a fairly large system using event sourcing, and it was a never-ending nightmare. Maybe with better tooling someday it will be usable, but not now.

Events are pretty much a database commit log. This is extremely space inefficient to keep around, and not nearly as useful as you might think.

Re-runs need to happen pretty often as you change how events are handled. Even in our local CI environment, it eventually took DAYS to re-run the events. It was clear that the system would never survive production use for this reason alone.

Decentralizing your data storage is a bad idea. We ended up not only with a stupidly huge event log, but also multiple copies of data floating around in each service. Not fun to deal with changes to common objects like User. Sometimes you would have to update the "projection" in 5-10 different projects.

In practice ES amounted to building our own (terrible) database on top of a commit log that's actually a message queue. Worst technology I've worked with in years; eventually the whole project collapsed under its own weight.

Some of these problems are fixable in theory. Perhaps a framework to manage projection updates, something to prune old events by taking snapshots, or a DB-migration-style tool to "fix up" mistakes that are otherwise immortalized in the event stream (rough sketch of that last idea below). But right now, seriously, stay away :)
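To make that last "fixup" idea concrete, here's roughly what I have in mind: an upcaster that rewrites legacy events as they are read, so the mistake stays in the stored stream but no handler ever sees it. Just a sketch with made-up names, not code from any real framework:

    # Hypothetical upcaster: rewrite v1 events on read instead of
    # rewriting history in the store.
    def upcast_v1_to_v2(event):
        event = dict(event)                       # never touch the stored copy
        event["email"] = event.pop("mail", None)  # field was misnamed in v1
        event["schema"] = 2
        return event

    UPCASTERS = {1: upcast_v1_to_v2}              # schema version -> upcaster

    def upcast(event):
        # Keep applying upcasters until the event is at the latest schema.
        while event.get("schema", 1) in UPCASTERS:
            event = UPCASTERS[event.get("schema", 1)](event)
        return event

Every projection then has to call upcast() before applying an event, which is exactly the kind of plumbing you end up hand-rolling without a framework.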



I agree that it's hard, but it's doable and pays off if you know what you're doing. I worked on three successful implementations for the finance sector, and we could replay a few million messages per second. Have a look at how we achieved that at LMAX: https://martinfowler.com/articles/lmax.html

Sorry to say it, but clearly you must have been doing something wrong or employing event sourcing where it does not belong.


> LMAX's in-memory structures are persistent across input events, so if there is an error it's important to not leave that memory in an inconsistent state. However there's no automated rollback facility. As a consequence the LMAX team puts a lot of attention into ensuring the input events are fully valid before doing any mutation of the in-memory persistent state. They have found that testing is a key tool in flushing out these kinds of problems before going into production.

I'm sorry, but this is saying "catch your bugs before they reach production", which just isn't feasible in non-critical software development (i.e., most software development). The important part that is left out here is: what happens when one such error slips in? How do you deal with it after the fact?

That being said, your system is impressive and I loved being able to read about it. Please keep up the good work, and especially keep sharing your findings! :)


Not quite! I don't think that's what they're getting at.

The idea is this: say you have a record A with fields f1, f2, f3. When an event comes in, you run a function F with steps s1, s2, s3, each of which may modify a field of record A.

Here's the issue: if s3 fails (due to "invalid input"), the modifications to A from s1 and s2 are incorrect and A is now corrupt.

There are a bunch of ways to handle this but the one described here is to avoid touching data that persists between requests until you're at a stage where nothing can fail anymore.
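A rough sketch of that separation (toy example of mine, not LMAX's actual code):

    # Validate first, mutate last: everything that can fail runs before
    # the long-lived record is touched, so a bad event can't leave it
    # half-updated.
    class InvalidEvent(Exception):
        pass

    class Record:
        def __init__(self, f1=0, f2=0, f3=0):
            self.f1, self.f2, self.f3 = f1, f2, f3

    def handle_event(record, event):
        # Phase 1: validation only; `record` is still untouched here.
        try:
            new_f1 = int(event["f1"])
            new_f2 = int(event["f2"])
            new_f3 = new_f1 + new_f2       # stand-in for step s3
        except (KeyError, TypeError, ValueError) as err:
            raise InvalidEvent(event) from err

        # Phase 2: mutation only; nothing below this line is allowed to fail.
        record.f1 = new_f1
        record.f2 = new_f2
        record.f3 = new_f3

The whole trick is keeping phase 2 so trivial that it (hopefully) can't fail.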


> until you're at a stage where nothing can fail anymore.

... and then there's a NullPointerException because you forgot to check something that could indeed fail (i.e.: you have a bug).

In other words: they advise you to not have bugs in that part of the codebase, which was precisely my objection.


Absolutely. But doing things in this style protects you from large classes of especially hard-to-reproduce bugs. Nothing's perfect, but it helps a lot!

I'd never heard it articulated before but I personally discovered this style over the years as well.


I honestly can't imagine bug-free software, even in critical software development. Luckily for me, while I've worked on important apps, if there's an issue there is time to trace and repair the data...not like a system that handles 6 million orders per second.


Hmm. You are both right?

If LMAX fits your problem, and you're fine with the distribution aspects, and you can adopt a homogeneous architecture etc, then it works well.

But the way Event Sourcing is normally sold and implemented is that you emit events from some components, written in some mix of languages and half of them probably already legacy, into some 'event bus' thing that you adopt and that is probably built internally in some other paradigm, and then you consume those events in several different programs, all implemented in other languages by other teams at other times. And you probably do all this across the Atlantic!

I live in a gazillion-events-per-second world, perhaps not as hectic as finance, but sadly with bigger events and sadly globally distributed, and sadly running on Amazon (which will be orders of magnitude slower overall than dedicated hardware once you're at this scale because of, aha, AWS's complete lack of 'mechanical sympathy' ;) ). It sucks. Oh, I wish I had an LMAX-shaped problem.


That's EDA (Event-Driven Architecture), not ES.


That article is a masterwork, thank you. It seems like having a "hub and spokes" architecture is key to getting event sourcing right. LMAX's Business Logic Processor is the only thing sourcing outputs, so you don't hit dependency hell. Also, your business process is literally a process, rather than living on the ledger.

It reminds me of well-designed Paxos control systems like Borg, where the journal serves as the ledger and all business logic lives in one scheduler thread.


Most of the time what people actually want is the audit log abilities of event sourcing. Seeing how data was in the past and what (or who) made changes to it. There are dozens of ways to accomplish this at different layers but unfortunately they go all in on event sourcing instead.
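For example, a plain history table written in the same transaction as the update already gets you most of that. Sketch below, with made-up table and column names:

    # Audit trail without event sourcing: the current-state table stays
    # the source of truth, and a history row records who changed what.
    import datetime
    import sqlite3

    def update_user_email(conn, user_id, new_email, changed_by):
        with conn:  # sqlite3 commits (or rolls back) the whole block
            row = conn.execute(
                "SELECT email FROM users WHERE id = ?", (user_id,)
            ).fetchone()
            conn.execute(
                "UPDATE users SET email = ? WHERE id = ?",
                (new_email, user_id),
            )
            conn.execute(
                "INSERT INTO user_history"
                " (user_id, field, old_value, new_value, changed_by, changed_at)"
                " VALUES (?, ?, ?, ?, ?, ?)",
                (user_id, "email", row[0] if row else None, new_email,
                 changed_by, datetime.datetime.utcnow().isoformat()),
            )

Triggers or change-data-capture at the database layer can get you the same thing without touching application code.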


>Re-runs need to happen pretty often as you change how events are handled.

Aren't you supposed to take Snapshots every now and then to solve that problem?


Migrating snapshots just moves the pain of changes: instead of letting the new projection logic replay all past events, you write code to migrate snapshots and hope that it will give the same result as the replay.

In those cases, I would rather do the replay, but do it offline before the release.


You are, but if you change how events are handled, or create a new event handler, you probably need to replay all events.
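One mitigation is to key each projection's snapshot by a handler version, so a logic change invalidates the snapshot and forces the full replay only for that projection. Sketch with made-up store interfaces:

    # Hypothetical stores: event_store.read_from(n) yields (position, event)
    # pairs; snapshot_store.latest(version=...) returns the newest snapshot
    # taken with that handler version, or None.
    HANDLER_VERSION = 3   # bump whenever apply_event's logic changes

    def load_projection(event_store, snapshot_store, apply_event):
        snap = snapshot_store.latest(version=HANDLER_VERSION)
        state = snap.state if snap else {}
        position = snap.position if snap else 0

        for position, event in event_store.read_from(position + 1):
            state = apply_event(state, event)

        snapshot_store.save(HANDLER_VERSION, position, state)
        return state

It doesn't make the replay cheaper, it just stops you from paying for it on every restart.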


You make a convincing case that it's tricky to implement.

There's also a (somewhat obvious) case to be made that it isn't for every use case.

Doesn't mean it can't be useful sometimes, or maybe, god forbid, often.


The replays are done on the processor, right? Isn't it supposed to be fast?



