Yeah, being able to do structured queries quickly is the key. I don't think Elasticsearch is actually that great at that; it really feels designed for full-text search and suffers in terms of speed when dealing with the semi-structured nature of logs. (Overall, the cardinality of log messages is not as high as you'd think, but ES doesn't know this.)
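To be concrete about what I mean by a structured query (the field names and log shape here are made up purely for illustration), this is the kind of filter you want to be fast over newline-delimited JSON logs:

```go
// query_logs.go: a toy "structured query" over newline-delimited JSON logs.
// Field names (level, service, msg) are assumptions for illustration only.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

type logLine struct {
	Level   string `json:"level"`
	Service string `json:"service"`
	Msg     string `json:"msg"`
}

func main() {
	f, err := os.Open("app.log") // hypothetical file of one JSON object per line
	if err != nil {
		panic(err)
	}
	defer f.Close()

	count := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		var l logLine
		if err := json.Unmarshal(sc.Bytes(), &l); err != nil {
			continue // skip lines that aren't valid JSON
		}
		// the structured part: roughly "count(*) WHERE level = 'error' AND service = 'api'"
		if l.Level == "error" && l.Service == "api" {
			count++
		}
	}
	fmt.Println("matching lines:", count)
}
```

Fields like level and service have a handful of distinct values, which is exactly why a store that knows about the low cardinality can make that filter cheap.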
I also agree that operating your own log search is kind of a pain. Often the node sizes are much larger than what you're using for your actual application. I wrote a bunch of Go apps that use 30MB of RAM each, then you read an article about making Elasticsearch work and find that you suddenly need 5 nodes with 64GB of RAM each... when you can fit a week of log data in just one of those nodes' RAM. It's hard to get excited about paying for that when it's your own node, and it's even harder when someone else offers it as a service (because that RAM ain't free).
That is why I like some sort of MapReduce setup; you can allocate a small amount of RAM and a large amount of disk on each of your nodes, and these queries can use excess capacity you have lying around. When your data gets big enough to need 320GB of RAM... you can still use the same code. You just buy more RAM.
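As a rough sketch of what I mean (file paths and the match predicate are placeholders), each node streams its own files off disk with a small, constant amount of memory, and the "reduce" step is just summing the partial results:

```go
// scan.go: a minimal map/reduce-style scan over local log files.
// Each "map" goroutine streams one file off disk; memory use stays flat
// no matter how big the files are. Paths and the query string are
// placeholders for illustration.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"sync"
)

// mapPhase streams a single file and counts lines matching the query.
func mapPhase(path, query string, out chan<- int, wg *sync.WaitGroup) {
	defer wg.Done()
	f, err := os.Open(path)
	if err != nil {
		out <- 0
		return
	}
	defer f.Close()

	n := 0
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if strings.Contains(sc.Text(), query) {
			n++
		}
	}
	out <- n
}

func main() {
	files := []string{"node1/app.log", "node2/app.log"} // hypothetical shards on local disk
	out := make(chan int, len(files))

	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go mapPhase(f, "timeout", out, &wg)
	}
	wg.Wait()
	close(out)

	// reduce: sum the partial counts from each mapper.
	total := 0
	for n := range out {
		total += n
	}
	fmt.Println("total matches:", total)
}
```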
Basically, the design of Elasticsearch confuses me. It's designed to be a globally-consistent replicated datastore with Lucene indexing. How either of those helps with logs confuses me. You write your logfile to 3 nodes... if one of them blows up, now you only have two copies. A batch job can restore the replication factor at its leisure, if you really care. But in all honesty, you were just going to delete the data in a week anyway. (If you're retaining logs for some sort of compliance reason, then you probably do want real consistency. But you can trim the logs down to the data you need in real time, and write them to a real database in an indexable/searchable format.)
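For the compliance case, here's a minimal sketch of that trim-and-store step (the field names are assumptions, and persistRow is a stub standing in for a real insert into an indexed table):

```go
// trim.go: trim logs in real time down to only the fields worth retaining,
// then hand them off to something indexable. Field names and the retained
// struct are illustrative assumptions, not anyone's actual schema.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// rawLine is whatever the application emits; most of it gets dropped.
type rawLine struct {
	TS     time.Time `json:"ts"`
	Level  string    `json:"level"`
	UserID string    `json:"user_id"`
	Action string    `json:"action"`
	Msg    string    `json:"msg"` // high-volume free text we don't retain
}

// auditRow is the slimmed-down record worth keeping long term.
type auditRow struct {
	TS     time.Time
	UserID string
	Action string
}

// persistRow stands in for a real database insert, e.g. into a table
// indexed on (user_id, ts).
func persistRow(r auditRow) {
	fmt.Printf("INSERT audit_log (ts, user_id, action) VALUES (%q, %q, %q)\n",
		r.TS.Format(time.RFC3339), r.UserID, r.Action)
}

func main() {
	sc := bufio.NewScanner(os.Stdin) // e.g. tail -f app.log | ./trim
	for sc.Scan() {
		var raw rawLine
		if err := json.Unmarshal(sc.Bytes(), &raw); err != nil {
			continue // skip malformed lines
		}
		if raw.Action == "" {
			continue // keep only lines that represent an auditable action
		}
		persistRow(auditRow{TS: raw.TS, UserID: raw.UserID, Action: raw.Action})
	}
}
```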
It is indeed good reading, but as you say, old. Since then, we have invested tremendously in the [data replication][0] and [cluster coordination][1] subsystems, to the point that we have closed the issues that Kyle had opened. We have fixed all known issues related to divergence and lost updates of documents, and now have [formal models of our core algorithms][2]. The remaining issue along these lines is that of [dirty reads][3], which we [document][4] and have plans to address (sorry, no timeline, it's a matter of prioritization). Please do check out our [resiliency status page][5] if you're interested in following these topics more closely.
Thanks for all of the feedback in this entire thread.
I'm always happier when I see a follow-up analysis by the Jepsen team, as third-party verification that the issues have been fixed and no major new ones introduced. Any chance Elastic is going to contract out to them for a follow-up?