Not to disagree with your analysis of the performance implications, but I don't ...

jerf · on Feb 5, 2020

"The article says that the data is basically "per-user","

Given that this is a table of who is "online", I don't think that's per-user in the sense that you are inferring. I infer that it's not a whole bunch of little local data that doesn't interact, it's a big global table of who is online and not online, constantly being heavily read from and written to in real time. Consider it from the perspective of Bob's Erlang process: he wants to go offline and notify all of his currently-online friends that he is going offline. Bob's Erlang process doesn't have that data. Bob's Erlang process is going to get it from the Big Table of Who's Online. That table is the problem; it can't be stored in Bob's Erlang process.

I was at least imagining that the table could be partitioned into pieces pretty trivially (first X bits of the hash), but with Erlang's design, that implies an IPC just to ask some server process to give me the PID of the chunk I need to talk to, which itself is going to bottleneck. (In practice we'd probably cheat and use a NIF to do that, but that amounts to an admission that Erlang can't do this, so....)
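For illustration, here is a minimal Go sketch of the "first X bits of the hash" partitioning, written for a shared-memory runtime where any goroutine can compute the shard index and touch the shard directly, with no lookup process in between. It is not Discord's design and not anything from the article; `PresenceTable`, `shardBits`, and the choice of FNV-1a are all assumptions made for the example.

```go
package main

import (
	"hash/fnv"
	"sync"
)

// shardBits is a hypothetical choice: the first 4 bits of the hash
// select one of 16 shards.
const shardBits = 4

// shard holds one slice of the global "who is online" table.
type shard struct {
	mu     sync.RWMutex
	online map[string]bool
}

// PresenceTable is the big global table, split into independently locked shards.
type PresenceTable struct {
	shards [1 << shardBits]shard
}

func NewPresenceTable() *PresenceTable {
	t := &PresenceTable{}
	for i := range t.shards {
		t.shards[i].online = make(map[string]bool)
	}
	return t
}

// shardIndex keeps only the top shardBits bits of a 32-bit FNV-1a hash.
func shardIndex(userID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return h.Sum32() >> (32 - shardBits)
}

// SetOnline records a user as online, or removes them when online is false.
func (t *PresenceTable) SetOnline(userID string, online bool) {
	s := &t.shards[shardIndex(userID)]
	s.mu.Lock()
	defer s.mu.Unlock()
	if online {
		s.online[userID] = true
	} else {
		delete(s.online, userID)
	}
}

// IsOnline reads a single entry under the shard's read lock.
func (t *PresenceTable) IsOnline(userID string) bool {
	s := &t.shards[shardIndex(userID)]
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.online[userID]
}

// OnlineFriends is what Bob's handler would call before going offline:
// it filters a friend list down to the friends currently online.
func (t *PresenceTable) OnlineFriends(friends []string) []string {
	var out []string
	for _, f := range friends {
		if t.IsOnline(f) {
			out = append(out, f)
		}
	}
	return out
}
```

The point of the sketch is only that finding the right shard is a pure function call here, whereas in the Erlang design described above it would itself be a message to some server process.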

At smaller scales you could try to live update Bob's local information as it changes, but this breaks down in all sorts of ways at scales far smaller than Discord, scales much closer to "a single mid-sized company".

"Another could be storing the data in mnesia, BEAM's internal mutable in-memory DB."

I have used mnesia for loads literally a ten-thousandth the size of this, if that (I could probably tack two more zeros on there), and it breaks down. It is an absolutely ludicrous idea that mnesia could handle what Discord is doing here. Last I knew the official Erlang community consensus was basically that mnesia really shouldn't be used for anything serious; my experience backed that up.

I think a non-trivial part of the reason why Erlang hasn't taken off is that its community still seems to exist in 2003, where it's a really incredible unique language that solves huge problems that nobody else does. In 2003, it rather had a point. But a lot of things have learned from Erlang, and incorporated its lessons into newer designs, and moved on.

See my other comment for what other runtimes have Erlang's advantages, but I'd invite you just to consider what we seem to basically agree on here; Erlang would be wildly slower and require a lot more hardware than Rust, the Rust code probably wasn't that hard to write, ... and the Rust code is way more likely to be correct than the Erlang code, too. I mean, what more "catching up to Elixir's inherent concurrency advantages" in this context than "did a job Elixir couldn't possibly do" do you want?

hopia · on Feb 5, 2020

Yeah the scale is what makes this problem a problem here. I've done exactly that "online" stuff per user process and it works fine on a small scale, even when it needs to be globally inferred. But I suspect it'd quickly become the bottleneck when scaling.

I had no idea mnesia was that fragile though, what gives? What kind of issues did you encounter with it? What do you use now to solve those issues with Erlang/Elixir?

Sure, we all know Erlang doesn't shine in computationally intensive workloads. Obviously, Rust was the right call here. But stateful distributed soft real-time concurrency, can you really say with a straight face that Rust comes with all the same features as BEAM out-of-the-box? Or any other modern platform for that matter. I've yet to see Erlang/Elixir beaten in that particular niche.

jerf · on Feb 5, 2020

"I had no idea mnesia was that fragile though, what gives? What kind of issues did you encounter with it? What do you use now to solve those issues with Erlang/Elixir?"

I had ~10,000 devices in the field with unique identifiers creating long-term, persistent connections to a central cluster. An mnesia table stored basically $PERSISTENT_ID -> PID they are connected to. It needed to be updated when they connected and disconnected, which let me emphasize was a relatively rare occurrence; the ideal system would be connected for days at a time, not connecting & disconnecting dozens of times a minute. At most, reconnection flurries might occasionally occur where they'd all try to connect over the course of a few minutes (they had backoff code built in) if the cluster was down for some reason.
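As a rough picture of what that table does, here is a minimal in-memory Go sketch of a persistent-ID-to-connection registry. It is not the original mnesia schema and not the author's Go replacement; `Registry`, `ConnID` (standing in for an Erlang PID), and the rule that a stale disconnect must not remove a newer connection are all assumptions made for the example.

```go
package main

import "sync"

// ConnID is a hypothetical stand-in for an Erlang PID: an opaque handle
// to whatever owns the device's connection.
type ConnID uint64

// Registry maps a device's persistent ID to its current connection.
type Registry struct {
	mu    sync.RWMutex
	conns map[string]ConnID
}

func NewRegistry() *Registry {
	return &Registry{conns: make(map[string]ConnID)}
}

// Connect records (or replaces) the connection for a device.
func (r *Registry) Connect(deviceID string, c ConnID) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[deviceID] = c
}

// Disconnect removes the entry only if it still points at c, so a late
// disconnect from an old connection cannot erase a newer reconnect.
// It reports whether an entry was removed.
func (r *Registry) Disconnect(deviceID string, c ConnID) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if cur, ok := r.conns[deviceID]; ok && cur == c {
		delete(r.conns, deviceID)
		return true
	}
	return false
}

// Lookup returns the connection currently registered for a device, if any.
func (r *Registry) Lookup(deviceID string) (ConnID, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	c, ok := r.conns[deviceID]
	return c, ok
}
```

This version lives on one machine only; the clustered, replicated part is exactly what mnesia was being asked to provide in the setup described above.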

Mnesia fell over. A lot. All I could find online as an explanation was basically "yeah, don't do that with mnesia". Bizarrely, it wasn't the connection flurries that did it, either... it was the normal "maybe a few dozen events per second" that tended to do it. Erlang itself was usually fine. (Although for machines right next to each other in a rack, I did lose the clustering more often than I'd like, and have to hit the REPL to re-associate the nodes. Much less often than mnesia corrupted itself, though.)

"can you really say with a straight face that Rust comes with all the same features as BEAM out-of-the-box?"

Well, that's another way of looking at what I was trying to say. That's the wrong question. Rust doesn't need "all the same features as BEAM". Rust needs "the features necessary to do the work". While the Erlang community is looking for a language that has "all the same features as BEAM" and smugly congratulating themselves that no other language seems to have cracked that yet, a number of languages are passing them by by implementing different features. Many of those languages, as I said, are informed by Erlang. Many of these new languages are choosing their "not exactly like Erlang" features in knowledge, not ignorance, as I think the Erlang community thinks.

Besides, Erlang builds in a lot of things that can be libraries in other languages. I built the replacement in Go, mostly because it was hard to get people who wanted to work in Erlang, but despite the rage on HN anytime Go comes up, getting people who are willing to work in Go was trivial even 5 years ago. (Hiring someone who knows Go already is still a bit of a challenge, but crosstraining someone into it is easy.) For the port, I wrote https://github.com/thejerf/reign . You will look at it and go "But Erlang has this and that and the other thing with its clustering, and your thing doesn't have those things!" And my response is twofold: First, that some of those things are supported in Go code in other ways than what you are expecting, and that reign was not intended to be "Erlang in Go" but "a library for helping port Erlang programs into Go without rearchitecting", and second... the resulting cluster has been more reliable and more performant (we actually cut the cluster from 4 to 2, because now even a single machine can handle the entire load), and all the "features" reign is missing, well, maybe they aren't so important out of the context of Erlang. I suppose in my own way this is another sort of story like Discord's; on the metrics I care about, my home-grown clustering library worked better for me than Erlang's clustering code.

(In fact, Go's even got the edge on Erlang for GC for my use case, which is one of the ways in which the new system is more performant. Now, it happens that my system is architected on sending around messages that may frequently be several megabytes in size, and Erlang was really designed for sending around lots of messages in the kilobyte range. Even as I was using it, Erlang got a lot better with handling that, but it still was never as good or fast as Go, and Go's only gotten better since then, too. I was able to do things in Go for performance to re-use my buffers that are impossible in Erlang.)
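The comment does not say how the buffers were re-used. One common way to do it in Go is a `sync.Pool` of `bytes.Buffer` values, sketched below as an assumption rather than a description of the actual system; `sendFramed` and the `"MSG "` prefix are hypothetical names invented for the example.

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool hands out reusable buffers so that large messages do not
// force a fresh multi-megabyte allocation each time.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// sendFramed is a hypothetical helper: it builds a framed message in a
// pooled buffer, passes the bytes to send, and returns the buffer to the
// pool afterwards. The send callback must not keep the slice after it
// returns, because the underlying memory will be reused.
func sendFramed(payload []byte, send func([]byte) error) error {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // drop old contents but keep the allocated capacity
	defer bufPool.Put(buf)

	buf.WriteString("MSG ") // hypothetical framing prefix
	buf.Write(payload)
	return send(buf.Bytes())
}
```

Mutating and recycling a buffer in place like this relies on shared mutable memory, which is the kind of thing the comment describes as impossible in Erlang.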

So, I mean, while I do deeply respect Erlang for its pioneering position, and I am particularly grateful for the many years I spent with it back when it was the only option of its sort (if I had to write the project in question in C++ or something, I just wouldn't have; do not think I "hate" Erlang or something, I am very grateful for it), if I am a bit less starry-eyed about it than some it's because I see it as... just code. It's just code. Erlang gets no special access to CPU instructions or special Erlang-only hardware that allows it to do things no other language can. It's just code. Code that can be and has been written in other languages, in other environments.

I like Erlang in a lot of ways, and respect its place in history. But its community is insular, maybe even a bit sick, and I don't really expect that to change, because once an individual realizes it, they tend to just leave, leaving behind only the True Believers, who still believe that Erlang is the unique and special snowflake... that it was... 15 years ago.

hopia · on Feb 5, 2020

Thanks for the comprehensive reply!

I guess I better experiment more with mnesia before really using it for anything serious. Or find alternatives. We had Redis before but that experience turned out just awful so we got rid of it.

As for the community, I think Elixir is where it's at nowadays. There is, unsurprisingly, a very strong focus on webby stuff with Elixir, and a lot of the things you would build with it are just easy. Like a multi-machine chat server.

If I started to build a new distributed chat server today, Elixir would still be the easiest way to go, despite eventually likely not being the most performant solution out there. Discord likewise seems happy with their choice for this particular use case, only supplementing it with the likes of Rust for specific problems in their domain.

I mean you yourself built a lot of Erlang's/BEAM's logic from scratch in Go just to be able to use it there. I'm expecting I'd end up in a similar alley with Rust/Haskell/take your pick if I were attacking the problems where Elixir has all the facilities already set up and battle-tested.