Most startups I've worked at literally have a script to deploy their whole setup to a new region on demand. Then you just need latency-based routing on top of it to make sure people are served from the region closest to them. Really not expensive. You can do this for under $200/month in added overhead, and the bandwidth + database costs are going to be roughly the same as they normally are because you're splitting your load between regions. Now if you naively duplicate your current infrastructure wholesale, yes, it would be expensive, because you'd be massively overpaying on the DB.
In theory the only additional cost should be the latency-based routing itself, which is $50/month. Other than that, you'll probably save money if you choose the right regions.
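For what it's worth, the routing piece really is small. A minimal sketch with boto3 and Route 53, assuming one ALB per region (the hosted zone IDs and hostnames below are placeholders, not anything real):

```python
# Hypothetical example: publish one latency-based alias record per region so
# DNS answers with whichever regional load balancer is closest to the caller.
import boto3

route53 = boto3.client("route53")

def upsert_latency_record(zone_id, name, region, alb_dns, alb_zone_id):
    """UPSERT a latency-based alias A record pointing at a regional ALB."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": name,
                    "Type": "A",
                    "SetIdentifier": f"app-{region}",  # one record set per region
                    "Region": region,                   # enables latency-based routing
                    "AliasTarget": {
                        "HostedZoneId": alb_zone_id,    # the ALB's canonical hosted zone
                        "DNSName": alb_dns,
                        "EvaluateTargetHealth": True,   # fail away from an unhealthy region
                    },
                },
            }]
        },
    )

# Placeholder IDs and hostnames; repeat once per region you deploy to.
upsert_latency_record("ZHOSTEDZONE", "app.example.com", "us-east-1",
                      "my-alb-east.us-east-1.elb.amazonaws.com", "ZALBEAST")
upsert_latency_record("ZHOSTEDZONE", "app.example.com", "us-west-2",
                      "my-alb-west.us-west-2.elb.amazonaws.com", "ZALBWEST")
```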
Are the same instance sizes available in all regions?
Are there enough instances of the sizes you need?
Do you have reserved instances in the other region?
Are your increased quotas applied to all regions?
What region are your S3 assets in? Are you going to migrate those as well?
Is it acceptable for all user sessions to be terminated?
Have you load tested the other region?
How often are you going to test the region failover? Yearly? Quarterly? With every code change?
What RTO and RPO have you agreed on with executives and board members?
And all of that is without even thinking about cache warming, database migration/mirroring/replication, or Solr indexing (are you going to migrate the index or rebuild it? Do you know how long it takes to rebuild your Solr index?).
The startups you worked at probably had different needs than Roblox. I was the tech lead on a Rails app that was embedded in TurboTax and QuickBooks and rendered on every TT screen transition, and reading your comment in that context shows a lot of inexperience with large production systems.
A lot of this can also be mitigated by going all in on API Gateway + Lambda, like we have at Arist. We only need to worry about DB scaling and a few considerations with S3 (which are themselves mitigated by using CloudFront).
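To make the shape concrete, the handlers involved are tiny and stateless (this is a generic sketch, not Arist's code); all the state lives in the database and S3/CloudFront, which is why the compute layer replicates so easily:

```python
# Generic sketch of a stateless API Gateway (HTTP API) + Lambda handler.
# Nothing here is pinned to a region, so the same function can be deployed
# everywhere and fronted by latency-based DNS.
import json
import os

def handler(event, context):
    # API Gateway's proxy integration delivers the HTTP request as `event`.
    path = event.get("rawPath", "/")
    return {
        "statusCode": 200,
        "headers": {"content-type": "application/json"},
        "body": json.dumps({
            "path": path,
            "served_from": os.environ.get("AWS_REGION", "unknown"),
        }),
    }
```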
Are you implying that Roblox should move their entire system to API Gateway + Lambda to solve their availability problems?
Seriously though, what is your RTO and RPO? We are talking about systems that make the news when they are down, systems where minutes of downtime are millions of dollars. I encourage you to set up some time with your CTO at Arist and talk through these questions.
1. When a company of Roblox's size is still in single-region mode by the time they've gone public, that is quite a red flag. As you and others have mentioned, game servers have some unique requirements not shared by traditional web apps (everyone knows this), but Roblox's constraints seem to be self-imposed and ridiculous considering their size. It is quite obvious they have very fragile and highly manual infrastructure, which is dangerous after series A, never mind after going public! At this point their entire infrastructure should be completely templated and scripted, to the point where if all their cloud accounts were deleted they could be up and running within an hour or two (there's a toy sketch of what I mean just after point 5 below). Having 18,000 servers or 5 servers doesn't make much of a difference -- you're either confident you can replicate your infrastructure because you've put in the work to make it completely reproducible and automated, or you haven't. Orgs that have taken these steps have no problem deploying additional regions because they have tackled all of those problems (DB read replicas, latency-based routing, consistency, etc.) and the solutions are baked into their infrastructure scripts and templates. The fact that there exists a publicly traded company in the tech space that hasn't done this shocks me a bit, and rightly so.
2. I mentioned API Gateway and Lambda because OP asked whether it is difficult in general to go multi-region (not specifically about Roblox). Most startups, and most companies in general, do not have the same technical requirements around managing game state that Roblox has (they're web-app based), so a set of load balancers + latency-based routing, or API Gateway + Lambda + latency-based routing, is a good approach for most companies, especially now with a la carte solutions like Ruby on Jets, the Serverless Framework, etc. that will do all the work for you.
3. That said, I do think we are on the verge of seeing a really strong, viable serverless-style option for game servers in the next few years, and when that happens costs are going to go way, way down, because the execution context will live for the life of the game and that's it. No need to over-provision. The only real technical limitations are the hard 15-minute execution time limit and mapping users to the correct running instance of the Lambda. I have a side project where I'm working on the first issue; I've already resolved the second by having the Lambda initiate the connections to the clients directly, to ensure they are all communicating with the same instance of the Lambda. The first problem I plan to solve by pre-emptively spinning up a new Lambda when time is about to run out and pre-negotiating all clients with the new Lambda before shifting control over to it (rough sketch at the end of this comment). It's not done yet, but I believe I can solve it with zero noticeable lag or stuttering during the switch-over, so from a technical perspective, yes, I think serverless can be a panacea if you put in the effort to fully utilize it. If you're at the point where you're spinning up tens of thousands of servers that are doing something ephemeral that only needs to exist for 5-30 minutes, you're at the point where it's time to put in that effort.
4. I am in fact the CTO at Arist. You shouldn't assume people don't know what they're talking about just because they find the status quo of devops at [insert large gaming company here] a little bit antiquated. In particular, I think you're fighting a losing battle if you have to even think about what instance type is cheapest for X workload in Y year. That sounds like work that I'd rather engineer around with a solution that can handle any scale and do so as cheaply as possible even if I stop watching it for 6 months. You may say it's crazy, but an approach like this will completely eat your lunch if someone ever gets it working properly and suddenly can manage a Roblox-sized workload of game states without a devops team. Why settle for anything less?
5. Regarding the systems I work with -- we send ~50 million messages a day (at specific times per day, mostly all at once) and handle ~20 million user responses a day on behalf of more than 15% of the current roster of Fortune 500 companies. In that case, going 100% Lambda works great and scales well, for obvious reasons. This is nowhere near the scale Roblox deals with, but they also have a completely different problem (managing game state) than we do (ensuring arbitrarily large or small numbers of messages go out at exactly the right time based on tens of thousands of complex messaging schedules and course cadences).
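Going back to point 1, here's a toy sketch of what "completely templated and scripted" means to me, using CDK in Python. Obviously this isn't Roblox's actual stack; the account ID, regions, and resources are placeholders. The point is that a whole region is a parameterized stack, so adding a region is one more entry in a list:

```python
# Toy illustration of fully templated infrastructure: one parameterized stack,
# instantiated once per region. Account ID, regions, and resources are made up.
from aws_cdk import App, Environment, Stack
from aws_cdk import aws_ec2 as ec2
from constructs import Construct

class RegionStack(Stack):
    def __init__(self, scope: Construct, stack_id: str, **kwargs) -> None:
        super().__init__(scope, stack_id, **kwargs)
        # Everything a region needs gets declared here: networking, clusters,
        # databases/read replicas, queues, DNS records, alarms, and so on.
        ec2.Vpc(self, "Vpc", max_azs=3)

app = App()
for region in ["us-east-1", "us-west-2", "eu-west-1"]:
    RegionStack(app, f"backend-{region}",
                env=Environment(account="123456789012", region=region))
app.synth()
```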
Anyway, I'm quite aware devops at scale is hard -- I just find it puzzling when small orgs have it perfectly figured out (plenty of gaming startups with multi-region support) but a company on the NYSE is still treating us-east-1 or us-east-2 as if it were the only region in existence. Bad look.
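And for point 3, this is the rough shape of the handoff I'm describing. It's a simplified sketch with invented names, timings, and interfaces, not working code from the side project:

```python
# Simplified sketch of the "successor Lambda" handoff. All names, timings, and
# the client/Lambda interfaces below are invented for illustration.
import time

EXECUTION_LIMIT_S = 15 * 60   # hard Lambda execution limit
HANDOFF_MARGIN_S = 60         # begin handing off this long before the limit

def run_session(state, clients, spawn_successor, started_at=None):
    """Tick the game loop; near the time limit, hand everything to a fresh Lambda.

    Hypothetical interfaces: `state` exposes tick(clients), snapshot(), and
    finished; each client exposes prenegotiate(endpoint) and switch_to(endpoint);
    spawn_successor(snapshot) starts a new Lambda and returns an object with an
    `endpoint` attribute and an activate() method.
    """
    started_at = started_at or time.monotonic()
    while not state.finished:
        state.tick(clients)
        if time.monotonic() - started_at > EXECUTION_LIMIT_S - HANDOFF_MARGIN_S:
            successor = spawn_successor(state.snapshot())  # warm replacement
            for c in clients:
                c.prenegotiate(successor.endpoint)         # connect before the switch
            successor.activate()                           # new instance starts ticking
            for c in clients:
                c.switch_to(successor.endpoint)            # flip over, ideally lag-free
            return                                         # old instance exits cleanly
```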
Also, it still sounds like you don't understand how large systems like Roblox/Twitter/Apple/Facebook/etc. are designed, deployed, and maintained (which is fine; most people don't), but saying they should just move to Lambda shows inexperience with these systems. If it is "puzzling" to you, maybe there is something you are missing in your understanding of how these systems work.
Correctly handling failure edge cases in an active-active multi-region distributed database requires work. SaaS DBs do a lot of the heavy lifting, but they are still highly configurable and you need to understand the impact of the config you use. Not to mention your scale-up runbooks need to be established so a stampede from a failure in one region doesn't take the other region down. You also need to avoid cross-region traffic even though you might have stateful services that aren't replicated across regions. That might mean changes in config or business logic across all your services.
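As a trivial illustration of that last point, even something as mundane as service discovery has to become region-aware. The endpoint names here are placeholders:

```python
# Illustrative only: resolve dependencies within the caller's own region, and
# make the unreplicated, single-region exceptions explicit rather than accidental.
import os

REGIONAL_ENDPOINTS = {
    "us-east-1": {"db": "db.us-east-1.internal", "cache": "cache.us-east-1.internal"},
    "us-west-2": {"db": "db.us-west-2.internal", "cache": "cache.us-west-2.internal"},
}
SINGLE_REGION = {"billing": "billing.us-east-1.internal"}  # stateful, not replicated

def endpoint(service: str) -> str:
    region = os.environ.get("AWS_REGION", "us-east-1")
    local = REGIONAL_ENDPOINTS.get(region, {})
    if service in local:
        return local[service]       # stay in-region: no cross-region latency or egress
    return SINGLE_REGION[service]   # a deliberate, documented cross-region call
```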
It is absolutely not as simple as spinning up a cluster on AWS at Roblox's scale.
Roblox is not a startup, and has a significantly sized footprint (18,000 servers isn't something that's just available, even within clouds; they're not magically scalable places, and capacity tends to land just ahead of demand). It's not even remotely a simple case of "run a script and whee, we have redundancy." There are lots of things to consider.
18k servers is also not cheap, at all. They suggest at least some of their clusters are running on 64 cores, some on 128. I'm guessing they probably have a fair spread of cores.
Just to give a sense of cost, AWS's calculator estimates 18,000 32-core instances would set you back $9m per month. That's just the EC2 cost, and assumes a lower core count is used by other components in the platform. 64 cores would bump that to $18m. Per month. Doing nothing but sitting there waiting, ready. And that's before considering network bandwidth costs, load balancers, etc.
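A back-of-envelope version of that estimate (the per-instance rate is derived from the quoted totals, not taken from a price sheet):

```python
# Sanity-check the quoted figures: what do $9m-$18m/month imply per instance?
instances = 18_000
hours_per_month = 730

for monthly_total, cores in ((9_000_000, 32), (18_000_000, 64)):
    per_instance_month = monthly_total / instances
    per_instance_hour = per_instance_month / hours_per_month
    print(f"{cores}-core estimate: ~${per_instance_month:,.0f}/month "
          f"(~${per_instance_hour:.2f}/hour) per instance, EC2 only")
# 32-core estimate: ~$500/month (~$0.68/hour) per instance, EC2 only
# 64-core estimate: ~$1,000/month (~$1.37/hour) per instance, EC2 only
```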
When you're talking about infrastructure on that scale, you have to contact the cloud providers in advance and work with them on capacity requirements, or you'll barely have started provisioning before you find capacity isn't available. (You'll want to do that at this scale anyway because you'll get discounts, but it's still going to be very expensive.)
This was in reply to OP who said deploying to a new region is insanely complicated. In general it is not. For Roblox, if they are manually doing stuff in EC2, it could be quite complicated.
So Roblox need a button to press to (re)deploy 18,000 servers and 170,000 containers? They already have multiple core data centres, as well as many edge locations.
You will note the problem was with the software provided and supported by Hashicorp.
> It's also a lot more expensive. Probably an order of magnitude more expensive than the cost of a 1-day outage
Not sure I agree. Yes, network costs are higher, but your overall costs may not be, depending on how you architect. Independent copies of your services in each AZ? Sure, you'll have multiples of your current costs. Deploying your clusters spanning AZs? Not that much more - though you'll pay for cross-AZ traffic.
The usual way this works (and I assume this is the case for Roblox) is not by constructing buildings, but by renting space in someone else's datacentre.
Pretty much every city worldwide has at least one place providing power, cooling, racks and (optionally) network. You rent space for one or more servers, or you rent racks, or parts of a floor, or whole floors. You buy your own servers, and either install them yourself, or pay the datacentre staff to install them.
Yes. If you are running in two zones in the hope that you will stay up if one goes down, you need to be handling less than 50% load in each zone. If you can scale up fast enough for your use case, great. But when a zone goes down and everyone is trying to launch in the zone still up, there may not be instances available for you at that time. Our site did something like a billion in revenue on a single day, so for us it was worth the cost, but it is not easy (or at least it wasn't at the time).
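The headroom math generalizes beyond two zones (and the same logic applies to regions): if you want to survive losing one of N zones that share the load evenly, each zone can't run above (N-1)/N of its own capacity.

```python
# Capacity headroom needed to absorb the loss of one zone out of N.
def max_safe_utilization(zones: int) -> float:
    return (zones - 1) / zones

for n in (2, 3, 4):
    print(f"{n} zones: keep each below {max_safe_utilization(n):.0%} utilization")
# 2 zones: keep each below 50% utilization
# 3 zones: keep each below 67% utilization
# 4 zones: keep each below 75% utilization
```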
How expensive? Remember that the Roblox Corporation does about a billion dollars in revenue per year and takes about 50% of all revenue developers generate on their platform.
Right, outages get more expensive the larger you grow. What also needs to be thought of is not just the loss of revenue for the time your service is down, but also its effect on user trust and usability. Customers will gladly leave you for a more reliable competitor once they get fed up.
There are definitely cost and other considerations you have to think about when going multi-AZ.
Cross-AZ network traffic has charges associated with it. Inter-AZ network latency is higher than intra-AZ latency. And there are other limitations as well, such as EBS volumes being attachable only to an instance in the same AZ as the volume.
That said, AWS does recommend using multiple Availability Zones to improve overall availability and reduce Mean Time to Recovery (MTTR).
(I work for AWS. Opinions are my own and not necessarily those of my employer.)
This is very true; the costs and performance impacts can be significant if your architecture isn't designed to account for it. And sometimes even if it is.
In addition, unless you can cleanly survive an AZ going down, which can take a bunch more work in some cases, being multi-AZ can actually reduce your availability by adding more things that can fail.
AZs are a powerful tool, but they are not a no-brainer for applications at scale that are not designed for them. It is literally spreading your workload across multiple nearby data centers, with a bit (or a lot) more tooling and services to help than if you were doing it in your own data centers.
> Data Transfer within the same AWS Region
> Data transferred "in" to and "out" from Amazon EC2, Amazon RDS, Amazon Redshift, Amazon DynamoDB Accelerator (DAX), and Amazon ElastiCache instances, Elastic Network Interfaces or VPC Peering connections across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction.
Wrong. Depending on the use case, AWS can be very cheap.
> splitting amongst AZ's is of no additional cost.
Wrong.
"
across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction. Effectively, cross-AZ data transfer in AWS costs 2¢ per gigabyte and each gigabyte transferred counts as 2GB on the bill: once for sending and once for receiving."
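At that rate the cost scales linearly with how chatty your cross-AZ traffic is; a quick illustration with made-up volumes:

```python
# Effective cross-AZ cost: $0.01/GB out + $0.01/GB in = $0.02 per GB transferred.
RATE_PER_GB_EACH_WAY = 0.01

def monthly_cross_az_cost(gb_per_month: float) -> float:
    return gb_per_month * RATE_PER_GB_EACH_WAY * 2  # billed on both sides

for tb in (1, 10, 100):
    print(f"{tb:>3} TB/month cross-AZ -> ${monthly_cross_az_cost(tb * 1000):,.0f}")
#   1 TB/month cross-AZ -> $20
#  10 TB/month cross-AZ -> $200
# 100 TB/month cross-AZ -> $2,000
```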
It's also a lot more expensive. Probably an order of magnitude more expensive than the cost of a 1-day outage.