
There's probably a lesson in here about managing incidents: very few updates or acknowledgments that there's even a major problem.

It will be interesting to find out what caused an incident big enough to kill their entire platform for so long (if they release details, or they come out through the grapevine).



The status page acknowledges there is an incident, that it's pretty bad — everything is marked as down, and it's in red! What more could you want? (/s)

Compare to Azure: most incidents never see the status page. Hell, getting support to acknowledge that an incident exists even once your internal investigation has reached certainty on "oh, yeah, it was them" is hard. There was an AAD outage earlier this year (?; IDK — I've lost track of the passage of time in the pandemic…) and the status page was down, and even once you managed to get the status page (IIRC you could hit the IP directly if you magically knew it, which the Twitterverse did) … most services were still green, even if completely offline as far as one could tell by issuing queries to the service…

And I'm comparing a kid's game with a "major" cloud PaaS…

I'm definitely suffering from Stockholm syndrome.


> Compare to Azure: most incidents never see the status page.

That sounds like a conscious decision on their part. Everyone always talks about disclosure being the best policy, but at the same time there are plenty who believe that not informing anyone about an outage or even a breach is the correct thing to do, since then they'll probably get into less trouble themselves, or at least create the illusion of not having as many outages as "those other guys".

After all, informing everyone about an outage that will noticeably affect only some users probably has a larger impact on the company's reputation than having it be dragged through the mud in smaller communities for its dishonesty. Then again, with many of the larger services, it's not like you have much of a choice about using them or not - you just get a corporate policy handed down to you and that's that.

Thus, sweeping problems under the rug and pretending that they don't exist is a dishonest, yet valid way of handling outages, breaches and so on. Personally, I'd avoid any service that does that, though it's not like that's being done just out of incompetence.


> That sounds like a conscious decision on their part.

It is; I've been told by their support that they don't want to cause alarm.

I think it's a bad way to run a PaaS, though. If I'm looking at your status page, it is because I suspect an outage and am trying to confirm it. I'm very willing to give some leeway to fix problems (an SLA — and Azure could do better here too — exists to establish what that allowable leeway is); I just need to know "is it me, or not?", and it's nicer to just get the answer when it's not me. As it is, I have to jump through a support hoop to get at "I think you are having an outage", and even then it's typically multiple cycles before support queries engineering (and that's another problem: support doesn't even know that there's an outstanding issue…) and gets to the bottom of it.

It needs to be easy for a customer, experiencing an issue with a service, to drive resolution of that problem. I can forgive small service outages, but it's this lack of any ability to get resolution or closure or some "yeah, we had a failure, here's what we're doing to prevent it going forward" that is the real problem.

Sadly, there is only so much choice I have in the matter of which cloud provider we're using…


> Compare to Azure

I'd honestly extend that to all major cloud providers - AWS has small hiccups here and there that are never recorded, and Google Workspace describes "a problem for a subset of users" when it's clearly worldwide.


A kids' game, yes, but one worth $50B.


Seems to me that a routing / networking issue would have been resolved by now, and an application bug would have been rolled back.

If I were to speculate, I would say that it must have something to do with databases / storage. Something must have gone wrong and broken some database, and it's difficult to restore.


They had an incident in the past where their database returned nil for all get operations; that caused some chaos at the time, so it's not unprecedented: https://devforum.roblox.com/t/update-datastores-incident-dat...
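To see why "every read returns nil" causes chaos rather than a clean failure, here's a minimal sketch in Python (hypothetical store and player names, not Roblox's actual DataStore API): game code commonly treats a nil read as "new player, no save yet", so a backend that silently returns nil can overwrite real progress with defaults instead of surfacing an error.

```python
# Minimal sketch (hypothetical API): why a backend that returns None for
# every read is worse than one that raises an error.

DEFAULT_SAVE = {"level": 1, "coins": 0}

class BrokenStore:
    """Simulates the failure mode: every read 'succeeds' but yields None."""
    def __init__(self):
        self._data = {"player_123": {"level": 42, "coins": 9001}}

    def get(self, key):
        return None  # outage: reads return nil instead of raising an error

    def set(self, key, value):
        self._data[key] = value

def load_player(store, player_id):
    save = store.get(player_id)
    if save is None:
        # Typical game logic: nil means "new player", so start from defaults.
        save = dict(DEFAULT_SAVE)
    return save

store = BrokenStore()
save = load_player(store, "player_123")
store.set("player_123", save)      # real progress overwritten with defaults
print(store._data["player_123"])   # {'level': 1, 'coins': 0}
```

Because the read "succeeds", there's nothing for the game to catch; the damage only shows up later, once players notice their progress is gone.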


I agree that the DB is the prime suspect here. They said the problem was identified over 24h ago; I can't imagine it being anything other than a data issue that takes that long to resolve.



