
It really depends. Capabilities like time-series-specific compression, automatic rollups, complex aggregations and/or ranking, stable storage in S3, clustering, and replication vary a lot, and I think that's why we see so many TSDBs out there. I maintain a list of TSDBs[0], which started as an evaluation of what was already available for my previous employer to use. We didn't find one that fit our exact use case, so we ended up building our own on top of MySQL.

[0] https://misfra.me/2016/04/09/tsdb-list/



All the features you mentioned are already part of distributed column-oriented databases. S3 storage is orthogonal and unrelated to time-series data. It's usually not part of the shared-nothing, local-storage architecture these databases use, but you can definitely mount S3 storage in a variety of ways.

Many options on your list are not TSDBs, like Aerospike, Elasticsearch, Cassandra, Kudu, GridGain/Ignite. EventQL and Riak are obsolete. Apache Apex is a stream processing framework. Many of the others are just extensions to Prometheus's built-in mini-storage or offer time-series indexing on top of existing databases.


> All the features you mentioned are already part of distributed column-oriented databases.

I disagree, but even if that was the case, not all of them perform well. For example, we could've used Cassandra for our use case at my previous employer but the lack of push-down aggregations (at the time, not sure if they're supported now) would've been terrible for our top-K aggregate queries.
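
To make the push-down point concrete, here is a rough Python sketch of a top-K query where each node sums its own rows before anything goes over the network. It is only an illustration, not how Cassandra or any other database mentioned here works internally; `partial_sums`, `top_k`, and the sample rows are made up for the example.

```python
from collections import Counter
import heapq


def partial_sums(rows):
    """Runs on each storage node: collapse local rows to one sum per key."""
    sums = Counter()
    for key, value in rows:
        sums[key] += value
    return sums


def top_k(nodes, k):
    """Coordinator: merge the per-node partial sums, then rank the totals."""
    merged = Counter()
    for rows in nodes:
        # Only one (key, sum) pair per key leaves each node, not every row.
        merged.update(partial_sums(rows))
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])


# Hypothetical data: two nodes holding (key, value) rows.
nodes = [
    [("a", 5), ("b", 1), ("a", 2)],
    [("b", 9), ("c", 4)],
]
print(top_k(nodes, 2))
```

Without push-down, the coordinator would have to pull every raw row from every node and do the summing itself, which is where the cost comes from. Note that each node has to send its full per-key sums, not just its local top K, or the merged ranking can be wrong.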


What features do you disagree about?

Cassandra is not a distributed relational column-oriented database, so yes, it will be bad at OLAP queries.

Cassandra is a "wide-column" or "column-family" database, which is unfortunately confusing industry jargon but better referred to as an advanced/nested key-value store. It comes from the original Dynamo whitepaper, along with similar systems like HBase, BigTable, DynamoDB, Azure Table Storage, etc. They can sometimes handle time-series queries with good data modeling because of fast prefix scans but the lack of a real query language makes them a bad choice for analytics scenarios.
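
A rough sketch of what "good data modeling plus prefix scans" looks like, using a sorted Python list as a stand-in for a sorted key-value store. The `metric|host|timestamp` key layout and the `prefix_scan` helper are assumptions for illustration, not the API of any system named above.

```python
import bisect


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted list of keys."""
    # Binary search to the first key >= prefix, then read sequentially.
    i = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        matches.append(sorted_keys[i])
        i += 1
    return matches


# Hypothetical keys: series identity first, zero-padded timestamp last,
# so one series is stored contiguously and in time order.
keys = sorted([
    "cpu|host1|0001",
    "cpu|host1|0002",
    "cpu|host2|0001",
    "mem|host1|0001",
])
print(prefix_scan(keys, "cpu|host1|"))
```

This gets you "all points for one series" cheaply, but anything like a sum across hosts still has to be done by the client, which is the analytics gap described above.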


I understand. Can you give an example of a “modern distributed relational column-oriented database”?

Two capabilities that are important in my work are roll-ups (reducing resolution of data) and fast bulk deletes of old data.
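
For anyone unfamiliar with the term, a roll-up in this sense can be sketched in a few lines of Python. This is a minimal illustration that averages points into fixed-width time buckets; the `rollup` name, the choice of mean as the aggregate, and the sample points are assumptions for the example.

```python
from collections import defaultdict


def rollup(points, bucket_seconds):
    """Reduce resolution: average (timestamp, value) points per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]


# Hypothetical 30-second samples rolled up to 60-second resolution.
points = [(0, 1.0), (30, 3.0), (60, 10.0), (90, 20.0)]
print(rollup(points, 60))
```

Real systems usually keep several aggregates per bucket (min, max, sum, count) so that coarser roll-ups can be derived from finer ones.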


Clickhouse, MemSQL, Redshift, MapD, Kinetica, etc.

If you just want rollups and don't care about every row, then look at Druid (or imply.io for a startup making it easier).

All these systems can delete old data very quickly, as they just delete entire compressed partition files.
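
The reason partition-level deletes are cheap can be sketched like this in Python, with a dict of per-day partitions standing in for partition files on disk. The daily partitioning, the `drop_old_partitions` name, and the retention window are assumptions for illustration, not the behavior of any specific database listed above.

```python
from datetime import date, timedelta


def drop_old_partitions(partitions, today, keep_days):
    """Delete whole per-day partitions older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = [day for day in partitions if day < cutoff]
    for day in expired:
        # One operation per partition, regardless of how many rows it holds.
        del partitions[day]
    return expired


# Hypothetical table: one partition per day, each holding that day's rows.
partitions = {
    date(2016, 4, 1): [("cpu", 0.5), ("cpu", 0.7)],
    date(2016, 4, 8): [("cpu", 0.6)],
    date(2016, 4, 9): [("cpu", 0.4)],
}
print(drop_old_partitions(partitions, today=date(2016, 4, 9), keep_days=7))
```

The work is proportional to the number of expired partitions, not the number of rows, which is very different from running a row-by-row delete with a timestamp filter.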


Fwiw, more recent versions of Druid have a no-rollup mode that does ingestion row-for-row. It ended up being useful for cases where you _do_ care about every row, maybe because you want to retrieve individual rows or maybe because you don't want to define your rollups at ingestion time. And in that mode, Druid behaves like the other DBs you mention.

(I am a Druid committer.)


Some of those we’ve looked at before and decided not to go with because of unknown observability, high operational requirements, or cost. But yeah, no real problems with data models or queries.

I think Druid comes closest to the ideal system for the requirements I’ve had to deal with, but I haven’t used it yet.

Thanks, by the way! This helps a lot.



