Scaling Relational SQL Databases (2020) (stribny.name)
128 points by feross on June 16, 2021 | 29 comments


(DevRel at Timescale)

The resurfacing of this article today is particularly interesting as I'm working on a new blog post and video about managing all aspects of large-scale data, specifically geared towards time-series, but certainly applicable otherwise. Fast ingest, compressing data, partitioning, pre-computed aggregates, etc., which (unknowingly) shadows a lot of the sound advice in this article. Nice work Petr!

Timescale recognizes that time-series data is often a particular challenge in many of these areas, which is why we're primarily focused on bringing time-series superpowers to PostgreSQL. Sometimes we're fortunate enough to include enhancements like SkipScan[1] that work on any ordered index, regardless of whether it holds time-series data.

TimescaleDB addresses many of these scaling issues by providing functionality to auto-partition data, natively compress it into columnar form (which often improves historical queries and saves upwards of 97% of storage), create continuous aggregates to pre-compute data for reporting and downsampling, set up easy data tiering and data retention, and do straightforward horizontal scaling for time-series data.
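
For anyone who wants to see what that looks like in practice, here's a rough sketch of wiring those pieces up from Python with psycopg2 (the table, columns, and policy intervals are just placeholders, adjust to your schema):

    import psycopg2

    conn = psycopg2.connect("dbname=metrics_db")  # placeholder connection string
    conn.autocommit = True  # continuous aggregates can't be created inside a transaction
    cur = conn.cursor()

    # Turn a plain table into a hypertable, auto-partitioned by time
    cur.execute("SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE);")

    # Enable native columnar compression and compress chunks older than 7 days
    cur.execute("""
        ALTER TABLE metrics SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'device_id'
        );
    """)
    cur.execute("SELECT add_compression_policy('metrics', INTERVAL '7 days');")

    # Pre-compute hourly aggregates for reporting and downsampling
    cur.execute("""
        CREATE MATERIALIZED VIEW metrics_hourly
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 hour', time) AS bucket,
               device_id,
               avg(value) AS avg_value
        FROM metrics
        GROUP BY bucket, device_id;
    """)

    # Drop raw data older than 90 days
    cur.execute("SELECT add_retention_policy('metrics', INTERVAL '90 days');")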

It's great to see confirmation of many of the features within a related context.

Thanks again for the article and getting it in front of folks today.

[1]: https://blog.timescale.com/blog/how-we-made-distinct-queries...


This is a great piece. I've (been on teams that) explored every one of these points and I cannot think of others.

The only big thing (which the article does call out) is that this is about scaling _traditional_ SQL databases. Traditional SQL databases don't have builtin horizontal scaling.

The article is a little hazy about "specialized extensions":

> When we need to store and work with geospatial, time series and other specialized data, we can sometimes use a database extension like TimescaleDb or PostGIS to make the data processing and storage more efficient.

The point I'd call out is that specifically TimescaleDB allows you to horizontally scale PostgreSQL which is a pretty radical improvement on vanilla Postgres.

https://blog.timescale.com/blog/timescaledb-2-0-a-multi-node...


> The only big thing (which the article does call out) is that this is about scaling _traditional_ SQL databases. Traditional SQL databases don't have builtin horizontal scaling.

FWIW, the use of the word "traditional" here may need an asterisk, as some OG relational SQL databases like Teradata (born 1979, still used for ultra-high-scale applications) were designed for and depend on horizontal scaling.


Do you mean that TimescaleDB 2.0 can be used instead of Citus to scale general-purpose (not just time-series) workloads?

For example, multi-tenant workloads where you shard by org_id, etc.?

I actually thought about it, but wasn't aware of v2.0.


Horizontal scaling within TimescaleDB is still geared around time-series data and (distributed) hypertables.

https://docs.timescale.com/timescaledb/latest/how-to-guides/...

https://docs.timescale.com/timescaledb/latest/how-to-guides/...
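
Roughly, the multi-node setup looks like this (a minimal sketch; the node hostnames and table are made up, and it assumes TimescaleDB 2.x multi-node is installed on every node):

    import psycopg2

    # Run on the access node: register the data nodes, then create a
    # distributed hypertable partitioned by time and hashed by device_id.
    conn = psycopg2.connect("dbname=metrics_db")  # placeholder DSN
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("SELECT add_data_node('dn1', host => 'dn1.example.com');")
    cur.execute("SELECT add_data_node('dn2', host => 'dn2.example.com');")

    cur.execute("SELECT create_distributed_hypertable('metrics', 'time', 'device_id');")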


Our use case was an OLAP workload, and IIRC its transaction support isn't great for horizontally scaling OLTP workloads. I wanted to call it out since the author mentioned time-series data but didn't mention this aspect of Timescale. (Though to be fair, this feature of Timescale may have come out after the post was written.)


Thanks for the information. I wasn't aware of such a feature at the time of writing the post.


I have only kind of recently started to look into using the Actor model as another way to scale apps.

This talk outlines a bunch of great examples of how you can scale to crazy levels using this pattern with almost no database optimisations at all https://youtu.be/7OVU9Mqqzgs

I think one of the problems is that until very recently, good implementations of the pattern were tied up in language-specific runtimes like Orleans in the .NET world and Akka in Java, for example.

I’ve been playing around with a new language-agnostic runtime that Microsoft just released for distributed apps (https://dapr.io/) and it’s really opening my eyes to these approaches, which I guess don’t get talked about a lot lately, since the prevailing idea has been to make everything stateless as the only way to scale.

If anyone has any good real world experience with Actors I would love to hear about it, what worked well, what you wish you knew from the start etc.


Actor frameworks remove some of the ceremony around async message passing, but the central issue of scalability is "canonical state" and how well commands and queries against that state scale.

It's a bit like tech support in tiers. You have one CEO, but you don't get the CEO on every tech support call. You have a bunch of layers trying to satisfy you until you get there.

The problem is those layers may also provide very poor UX.


> the central issue of scalability is "canonical state"

Exactly. I've written a reasonable amount of Erlang, and yeah, from a purely technical perspective you'd be much better off handling concurrent requests with something actor-based than with the async-du-jour dumpster fire (Python is particularly bad). But the real difficulty ultimately is managing canonical persistent state, and actors won't magically solve that for you.

Having said this, empirically speaking, almost none of the companies who believe they have scalability problems actually do, and a relational DB combined with a not-super-inefficient architecture is all that's needed.


The problem with Python is that async doesn't confer any performance benefit because of the GIL: you cannot actually run async functions in parallel. Each function makes forward progress independently, but in serial.

One problem I have is that the performance of computation is tied to the performance of the storage, hence the idea of bringing computation to the storage for performance. If you have to communicate to get to storage, then your solution will be slow.


Async is not really about parallelism. Async in Node.js works the same way. The point is for IO to happen concurrently. To scale to multiple cores in Python or Node you run multiple processes.

If you're not sharing mutable state between them, that scales great. If you are, you really shouldn't be using those languages, they're not the right tool for the job.
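
To make that concrete, a toy sketch: the three fake "queries" below overlap their IO waits on a single thread, so the whole thing takes about one second instead of three; to use more cores you'd run several of these processes.

    import asyncio
    import time

    async def fake_query(name: str, seconds: float) -> str:
        # Stand-in for an IO-bound call (DB query, HTTP request, ...)
        await asyncio.sleep(seconds)
        return f"{name} done"

    async def main() -> None:
        start = time.perf_counter()
        # All three waits overlap, so this takes ~1s, not ~3s,
        # even though nothing runs in parallel on multiple cores.
        results = await asyncio.gather(
            fake_query("q1", 1.0),
            fake_query("q2", 1.0),
            fake_query("q3", 1.0),
        )
        print(results, f"{time.perf_counter() - start:.2f}s")

    asyncio.run(main())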


I take advantage of I/O being parallel in Python in my mazzle continuous integration pipeline tool. I'm not sharing mutable state.

I spin up a graph of Python threads and each joins the others in the graph. This way we can run parts of the graph in parallel. See the parts of the graph that look like this:

dependency -> {parallel1; parallel2; parallel3} -> postparallel

parallel1, parallel2 and parallel3 can each run in a separate Python thread because the IO is parallel.

postparallel joins parallel1, parallel2, parallel3 and waits for them all to complete.

Where parallel1-3 are things like Ansible, Packer (slow), AMI builds, Chef runs, etc.

https://github.com/samsquire/mazzle-starter/blob/master/arch...
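
A stripped-down sketch of that shape with plain stdlib threads (the step names are just placeholders for the real build steps):

    import threading

    def run_step(name):
        # Placeholder for an IO-bound step (Ansible, Packer, AMI build, ...)
        print(f"running {name}")

    def run_parallel_section():
        dependency = threading.Thread(target=run_step, args=("dependency",))
        dependency.start()
        dependency.join()

        # parallel1-3 each get their own thread; the GIL doesn't matter much
        # because these steps spend their time waiting on subprocesses/network
        parallel = [
            threading.Thread(target=run_step, args=(f"parallel{i}",))
            for i in (1, 2, 3)
        ]
        for t in parallel:
            t.start()

        # postparallel waits for all of them before running
        for t in parallel:
            t.join()
        run_step("postparallel")

    run_parallel_section()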


I wish I was smart enough to get what you were saying here but I got a bit lost on the analogy. Could you ELI5 that for me? It sounded interesting


When you scale, your basic solution is "I take this job done by one person/machine, and I split it to multiple people/machines".

Sometimes you can't do that. Either because you're "outsourcing" part of the solution to someone beyond your control, or because the data relationships of your system require a "synchronizer", someone who works with this data in a serial, specific way.

For example we have some government agency. Long queues, lots of waiting. One office for registration. What do we do? We add more registration offices, so people's forms can be processed in parallel.

But every person needs to get a unique ticket number let's say. Now they still need to go through a single office to get this number, because the independent offices can't guarantee global uniqueness as they don't synchronize between each other. You need a "synchronizer" whose only purpose is to provide unique numbers.

And this works, you have 100 offices and 1 "number" office. You scaled 100x over having one office for everything. But when you get to 200 offices, suddenly the queue in front of the "number" office starts to grow. You can't scale this office, because there's a "data relationship" in there that you can't split up and scale.

You wanna identify and avoid such bottlenecks, and when you can't avoid them you want those bottlenecked systems to do as little as possible (so they can run as fast as possible) and to be hit as rarely as possible (by having someone in front take care of common cases: for example, you can have a front desk before the "number" office that checks you have all the needed documents before you get sent to get a number; now the "number" office has less to do).

Sorry that's very abstract, but without specific examples it's hard to explain anything.


No need for the apology, I appreciate you taking the time to even attempt explaining it :)

Let me point out where I am falling over specifically because it might be more helpful.

My understanding of Actors is that because they are essentially operating as in-memory objects with their own state, then with the exception of ad-hoc queries on my data (where SQL shines, for example), I am doing the majority of my operations on various resources which handle persistence in an eventually consistent fashion, and therefore aren't really obliged to immediately write to a database with each individual change in state, thus freeing up the bottlenecks.

This is all a pretty new concept to me so I feel like I might have made some bad assumptions in there for example, I am just not 100% sure where yet :)


The concept of the one office that gives out numbers is similar to auto-incrementing integer IDs in a DB. To get around one single thing having to generate all the numbers, you can have every office generate UUIDs, and toss in a timestamp if you need ordering. I’ve heard people are playing with UUID v6, too, but I’m not sold on it yet.
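
As a sketch, something like this instead of a central counter:

    import time
    import uuid

    def new_ticket():
        # Any office/worker can mint an ID without talking to a central
        # "number office"; the timestamp gives rough ordering if needed.
        return {"id": str(uuid.uuid4()), "issued_at": time.time_ns()}

    print(new_ticket())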


It always comes down to your data that must be persistent and consistent. This data must be stored somewhere. The real problem is always storage. Message passing is all fine and good, but the real problem is your data: how scalable is your data layer in terms of size and operations per second?


In other words: you can't add many workers to a linear job and expect the task to be done in SizeOfJob/Workers time.

Or: if you have many car lanes that merge into one, you will get contention.

This will manifest in many ways.

But you "can" if along ALL the pipeline you have a way to coordinate the task!. However, if componentA is paradigmA and componentB is paradigmB you will get trouble.

---

In practical terms: wanna kill your database? Open N connections launched by thousands of actors/async tasks/threads. You need to instead use a pool and coordinate which jobs are fast and which are slow to complete.

AND

You need to architect the code to exploit both what this paradigm offers and what makes the DB happy (typically: don't do N+1 SQL queries, think in batches, group operations, etc.).

This is the part where you don't get much help, no matter what you use.
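
A minimal sketch of both points with psycopg2 (the pool bounds, DSN and table are made up):

    from psycopg2.pool import ThreadedConnectionPool
    from psycopg2.extras import execute_values

    # One bounded pool shared by all workers, instead of a connection per actor/task
    pool = ThreadedConnectionPool(minconn=2, maxconn=20, dsn="dbname=app_db")

    def insert_events(rows):
        conn = pool.getconn()
        try:
            with conn.cursor() as cur:
                # One batched statement instead of N single-row INSERTs
                execute_values(
                    cur,
                    "INSERT INTO events (user_id, kind, payload) VALUES %s",
                    rows,
                )
            conn.commit()
        finally:
            pool.putconn(conn)

    insert_events([(1, "click", "{}"), (2, "view", "{}")])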


Thank you I've been looking for something just like this. I wanted a framework to build F# apps with the actor model.


Horizontal scaling is a delicate art. IMHO it must be achieved by keeping in mind all four: end users, the business use case, the application server, and the database.

Traditional RDBMSs are close to the best possible solution as far as vertical scaling is concerned, and also in terms of the number of features they provide, both SQL and non-SQL. Distributed SQL databases are good at horizontal scaling and query engines, and they are getting better.

The divide-and-rule method always gave me nice control over deployment and architecture complexity. Before adopting Distributed SQL, I would rather treat a traditional RDBMS (or should I say data platform) like PostgreSQL and the application server as reusable components at the high-level architecture. More like Lego blocks, but few in number.

I would rather implement sharding at the business use case level through application logic (Node.js / JVM / whatever), by having a few major individual services (not microservices for each function) instead of a monolith, and if that's not enough, then further shard at the user level in the application server layer rather than the database layer. For data aggregation and analytics we can have different computation servers (either PG itself through the use of fdw, or apps written in a general-purpose language) querying read replicas deployed specifically for this purpose.
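
As a rough sketch of that routing layer (the DSNs and hash scheme are placeholders; a real setup also needs a plan for resharding):

    import hashlib
    import psycopg2

    # One DSN per shard; in practice this would come from config/service discovery
    SHARD_DSNS = [
        "dbname=app_shard_0",
        "dbname=app_shard_1",
        "dbname=app_shard_2",
    ]

    def shard_for_org(org_id: str) -> str:
        # Stable hash so the same org always lands on the same shard
        digest = hashlib.sha1(org_id.encode()).hexdigest()
        return SHARD_DSNS[int(digest, 16) % len(SHARD_DSNS)]

    def fetch_projects(org_id: str):
        with psycopg2.connect(shard_for_org(org_id)) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT id, name FROM projects WHERE org_id = %s", (org_id,)
                )
                return cur.fetchall()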


Is denormalization (e.g. to star/snowflake schemas) still common or advisable? I know the conventional wisdom is that it can improve performance, but how much scale does it actually get you in practice (a factor of 2? an order of magnitude?), and can it still be the right call for public cloud based workloads where it's trivial to scale vertically and relatively easy to scale horizontally?


It very much depends. You have to try it and measure. Sometimes it even makes things slower.

In general it's one tool in the toolbox and never the first one reached for.


How about keeping transactions short? I thought that would be at the same level of importance as others.


Most of this article tracks quite well with my experiences! It's nice to see stuff like this written down.

> If the application’s code is in our control we can cache data in a memory store like Redis or Memcached to avoid querying the database.

On this point - what I've seen over and over again is caches being used as a band-aid for overall throughput limitations. Then the cache goes down (sometimes partially in the case of a clustered cache), the database gets slammed, and there's an outage soon after.

So caches help as an improvement on latency to some degree, but using them to avoid scaling your capacity can get you in a pickle depending on your usage patterns / hit distribution. You usually need some amount of capacity planning and consideration there.
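
For reference, the usual cache-aside shape looks like this (the Redis client and key scheme are placeholders); the point is that the load_from_db path still has to survive a 0% hit rate, e.g. right after the cache restarts:

    import json
    import redis

    r = redis.Redis()  # assumes a local Redis; tune for your setup

    def get_user(user_id, load_from_db, ttl_seconds=300):
        key = f"user:{user_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        # Cache miss: the database still needs enough capacity for this path,
        # e.g. after a cache restart when every request misses at once.
        user = load_from_db(user_id)
        r.setex(key, ttl_seconds, json.dumps(user))
        return user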


>>When it comes to scaling, we might need to think about: fast INSERTs and UPDATES for write-heavy workloads

If you want to scale a database, all your write operations should be immutable, and therefore you should rarely need UPDATE statements.


Immutability makes it easier to scale writes in a relational DB, but comes at the expense of more complexity and difficulty scaling reads. It's also problematic with regulation like GDPR / "right to be forgotten".

Immutability is definitely not a requirement for scaling a database. In my direct experience, the vast majority of large tech companies use mutable relational databases, with a write workload mixing INSERTs, UPDATEs, and DELETEs.


Are these in the order of difficulty to implement/manage/complexity? A list like this would help for what to try next if my interviewer is not convinced. They never seem convinced though. :(


I would say no - depending on your problems/business, you might (for instance) have extra human time but no budget for vertical scaling - congratulations, you've chosen query plan and type upgrades and optimization!

Also, I would add one: measure and check your assumptions - that database feature you think is helping could be the key to your destruction.



