
I am deeply interested in parallelism and would enjoy talking to you about it. I think the parallelism provided by Haskell is really interesting, and so is its software transactional memory.

I think parallelisation can be generalised, but we need to study it further to understand it better. There must be a set of primitives that parallelise very well but haven't been discovered yet. Rather than trying to parallelise something that was invented to be sequential, we should parallelise something that is inherently parallelisable.

Today I'm working on parallelising a totally ordered list view over X sorted lists that are maintained by separate threads or independent servers. My idea is to use an N-way merge to retrieve up to N sorted items, as in the sketch below. Servicing requests is extremely fast since no synchronization is needed within a thread or server, but if you need total order, you need the N-way merge. I'm thinking you'd run one cluster for ordered views and one cluster for within-a-thread views.
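
A minimal single-process sketch of that merged view in Python, assuming each shard is just a local sorted list standing in for a remote thread or server:

    import heapq

    # Hypothetical per-thread / per-server sorted runs; in the real system these
    # would be fetched from independent shards rather than held in one process.
    shard_lists = [
        [1, 4, 9, 12],
        [2, 3, 15],
        [5, 6, 7, 20],
    ]

    def merged_view(shards, n):
        """Return the first n items of the total order over all shards."""
        merged = heapq.merge(*shards)  # lazy N-way merge over the sorted runs
        return [x for _, x in zip(range(n), merged)]

    print(merged_view(shard_lists, 5))  # [1, 2, 3, 4, 5]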

The other problem I'm working on is how to parallelise integer updates. What if you have 8,000,000,000 accounts (which DOES fit on a single machine) but more operations overall than a single machine can process? You could shard by account, but I think there's an approach that shards the integer itself across servers. You store a portion of the integer on each machine, hopefully enough to handle its share of operations. Then when you need to check whether the account balance > the deduction amount, I just need to learn how to do a strongly consistent read cheaply. Essentially we load balance the integer in the account.

For example, Fibonacci is often used as an example of something to parallelise, but it's never implemented in a parallelisable manner; it just does the same calculation in different threads. If we knew a formula for Fibonacci that didn't depend on previous iterations, we could parallelise it. But I'm not a mathematician, so I'm not sure how to find that formula.



Sorting is a standard distributed algorithm: it is basically quicksort, but you move some data between servers between the steps. The aim is to create ordered buckets of data that you then send to servers. The first step is to find good pivots, so sort the data on each server and select data points at the median and various percentiles, then fan out those pivot candidates to every server. Since all servers now have the same pivots, they can create the same buckets and send the data they hold for each bucket to the corresponding server. The buckets won't be perfectly balanced, so you typically want a few times more buckets than servers to ensure you don't overload one of them; that is the fast way. Alternatively you can do another fan-out step, computing the number of items in each bucket across the servers, and split large ones into smaller buckets using the same technique as before. This always works but requires a bit more communication between the servers.
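
Here is a toy single-process sketch of that pivot-and-bucket step (sample sort) in Python; the `servers` lists stand in for per-server shards, and the fan-out of samples and routing of buckets would be RPCs in a real cluster:

    import random

    def choose_pivots(server_samples, num_buckets):
        """Fan in the pivot candidates from every server and pick global pivots."""
        candidates = sorted(x for samples in server_samples for x in samples)
        step = len(candidates) / num_buckets
        return [candidates[int(i * step)] for i in range(1, num_buckets)]

    def bucket_of(x, pivots):
        """Every server computes the same bucket index from the shared pivots."""
        lo, hi = 0, len(pivots)
        while lo < hi:  # binary search over the pivots
            mid = (lo + hi) // 2
            if x < pivots[mid]:
                hi = mid
            else:
                lo = mid + 1
        return lo

    # Toy stand-in for 3 "servers", each holding an unsorted shard of the data.
    servers = [[random.randrange(1000) for _ in range(200)] for _ in range(3)]
    samples = [sorted(random.sample(s, 10)) for s in servers]  # per-server percentile samples
    pivots = choose_pivots(samples, num_buckets=6)             # a few more buckets than servers

    # Each server routes its items into buckets; in the real system bucket b would be
    # sent to server b % num_servers, here we just collect and sort them locally.
    buckets = [[] for _ in range(6)]
    for shard in servers:
        for x in shard:
            buckets[bucket_of(x, pivots)].append(x)

    ordered = [x for b in buckets for x in sorted(b)]
    assert ordered == sorted(x for shard in servers for x in shard)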

As for distributing heavy computations, Google has done that to compute your search results since almost the beginning. Fan out a request to a huge cluster of servers that each hold a different part of the data in RAM, then they filter and return the parts of the data that are relevant to you. It is totally possible to do this and return a response to the user in a fraction of a second. It typically takes a few milliseconds to send data between servers in a data center, so you can easily fan out a request to a few hundred servers, let them compute, and return something 15 milliseconds later. It costs a bit to run that, though; Google can afford it since search is worth so much per request, so you have to figure out whether your use case is worth that much compute. But it doesn't cost that much more than running the same work on a single server; it depends a bit on how well your servers are collocated.
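
A rough sketch of that scatter-gather pattern, with an in-process thread pool and toy in-memory shards standing in for RPCs to servers that hold the data in RAM:

    from concurrent.futures import ThreadPoolExecutor

    # Toy stand-in: each "shard" is a slice of the data that one server would hold in RAM.
    SHARDS = [list(range(i, 1_000_000, 100)) for i in range(100)]

    def query_shard(shard_id, predicate):
        """Pretend RPC: the remote server filters its shard and returns only the relevant rows."""
        return [x for x in SHARDS[shard_id] if predicate(x)]

    def fan_out(predicate):
        """Scatter the request to every shard, then gather and merge the partial results."""
        with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
            futures = [pool.submit(query_shard, i, predicate) for i in range(len(SHARDS))]
            return sorted(x for f in futures for x in f.result())

    print(fan_out(lambda x: 499_990 <= x < 500_010))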

Btw, frameworks for distributed computing like Apache Spark are way too slow for this; you have to hand-roll it. But it really isn't that hard to do; there are programming competitions involving these things, and a correct solution takes like 30 minutes to write if you know what to do and you have done all the boring infrastructure work to ensure you can contact the servers and access the data. That is why most programmers never work with complex algorithms: the boring and tedious work on all the other parts takes up most of the time and manpower.


Assuming that local access is cheaper than distribution, wouldn't a merge sort be better?


I assumed the data is spread out across a set of databases. Reading even just a few terabytes of data into a single server is slow and expensive, and then you'd still need to spend a few days of CPU time to sort it locally. For somewhat larger datasets you won't be able to handle it on most computers at all. I think most large companies have some datasets of about a petabyte, so I don't think assuming that amount of data is unreasonable. I worked with data of about an exabyte; not many have that much data, but it starts to get hard to do it locally at just one millionth of that amount. But maybe I'm overestimating how common that is.

Another question is how much time it takes. Let's say we have small data, a few gigabytes or so. Sorting that would still take about 10 seconds to a minute, which is annoyingly slow for users. If that data is distributed over many computers that can each read a subset, then the distributed sort will go much, much quicker, so you can do a full sort over the entire dataset within a normal 100 ms request SLO. But this assumes the user needs to sort using arbitrary comparisons; otherwise it is of course much better to just pre-sort the dataset and serve it that way.

And even if you have all the data locally, if it doesn't fit in memory, a distributed approach would still help. The network within a server cluster is typically faster than local persistent storage, so distributed you can read the data once and send it to the different workers, instead of having to read and write it over and over as you would if you sorted it locally. In other words, making one computer read X data from disk takes about the same time as getting that data to a distributed set of computers, since you can send data faster than you can read it, and that is the worst case for the distributed approach.

Note: this assumes we can't precompute the sort, meaning the user asks for some ordering on the fly and we have to provide it. Pre-sorting data once is easy, so I'm not sure why we would discuss that case. Also, networking tends to cost a lot more when you run on other people's computers like AWS, Azure, or Google Cloud, so it might not be as viable there. Don't they charge like 10x the underlying cost of the networking? But when I worked at Google we got to run distributed jobs juggling a petabyte of data without anyone asking questions, so it can't be that expensive internally. As long as you didn't write an infinite loop in your distributed program, nobody cared; even when a guy did that, his job ran for a couple of weeks before people noticed our compute quota had run out, so I'm pretty sure it can't be expensive. But that amount of networking would probably cost a fortune in GCloud.


> If we knew a formula for Fibonacci that didn't depend on previous iterations, we could parallelise it. But I'm not a mathematician, so I'm not sure how to find that formula.

You can find it on Wikipedia! The insight that a closed-form solution might exist - "knowing what to Google" - is really all the mathematical insight you need.
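
For reference, the closed form is Binet's formula: F(n) = (phi^n - psi^n) / sqrt(5), with phi = (1 + sqrt(5)) / 2 and psi = (1 - sqrt(5)) / 2. A quick Python sketch (floating point, so only exact up to roughly n = 70):

    from math import sqrt

    def fib(n):
        # Binet's closed form: each F(n) depends only on n, so the values
        # can be computed independently (and hence in parallel).
        phi = (1 + sqrt(5)) / 2
        psi = (1 - sqrt(5)) / 2
        return round((phi ** n - psi ** n) / sqrt(5))

    print([fib(n) for n in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]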

But maybe Fibonacci is a stand-in for something else in your explanation. I don't follow the definition of the integer-updates problem.


Thank you for your kind words and that critical insight - "closed-form solution".

Sorry for my poor explanation.

When you shard data, you can shard by record identifier and keep each record's data on a particular server, OR you can shard the data itself and map-reduce it.

For example, a file system locally shards a file's data into blocks (pointed to by its inode). The concatenation of all the blocks = the data of the file.

I am trying to split an integer into shards, where the sum of the parts is the value.

If the account balance is defined as the SUM of all servers' amounts, then each server stores a proportion of the balance for that user.

If there are 8 servers and a user has £8000 in their bank balance, then you try to store £1000 on each server, so any server can serve requests without coordination. If the user wants to spend £25, that's fine, just deduct from the local data. If the user wants to spend £1001, then you need a globally consistent read, since that would drop the balance below 0 for the local replica, and you need to check the other replicas to see if the user has enough.
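
A toy single-process sketch of that fast-path / slow-path split, where a list of local integers stands in for the per-server slices and the "global read" is just summing them (in a real system it would be a coordinated cross-server transaction):

    # Per-server slices of the balance: £8000 spread over 8 servers.
    shards = [1000] * 8

    def try_spend(server_id, amount):
        # Fast path: the local slice covers the deduction, no coordination needed.
        if shards[server_id] >= amount:
            shards[server_id] -= amount
            return True
        # Slow path: globally consistent read across all shards, then rebalance.
        total = sum(shards)
        if total < amount:
            return False              # insufficient funds overall
        remaining = total - amount
        for i in range(len(shards)):  # redistribute the remaining balance evenly
            shards[i] = remaining // len(shards)
        shards[0] += remaining % len(shards)
        return True

    print(try_spend(3, 25))    # True, served locally
    print(try_spend(3, 1001))  # True, but needed a global read and a rebalance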

I think it is related to the ID-generation problem. If you have 1000 servers and you need to generate monotonically increasing ID numbers without clashes between servers, how do you do it without synchronization? Do you split the integer range into batches and give each server a batch to assign ID numbers out of?
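
That batch idea might look roughly like this sketch (illustrative only; the block size and ownership scheme are made up): server k owns every 1000th block of the integer range and hands IDs out with no cross-server coordination.

    import itertools

    BATCH = 1_000_000  # hypothetical block size; each grab covers this many IDs

    class IdAllocator:
        """Server k hands out IDs from blocks k, k + num_servers, k + 2*num_servers, ..."""
        def __init__(self, server_index, num_servers):
            self.blocks = itertools.count(server_index, num_servers)
            self.remaining = iter(())

        def next_id(self):
            try:
                return next(self.remaining)
            except StopIteration:
                block = next(self.blocks)  # grab the next block this server owns
                self.remaining = iter(range(block * BATCH, (block + 1) * BATCH))
                return next(self.remaining)

    a = IdAllocator(0, 1000)
    b = IdAllocator(1, 1000)
    print(a.next_id(), b.next_id())  # 0 and 1000000: disjoint ranges, no coordination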

Writing this out has led me to think that you could run two clusters: one cluster serves sharded data, and the other consumes the stream of updates from those servers and keeps a globally consistent value.

Probabilistically, most transactions will be below the server's local proportion and will be served locally without coordination. Coordination is only needed rarely, for large purchases.



