Redundant Against What? (brooker.co.za)
32 points by r4um on April 15, 2021 | 17 comments


I have designed/coded two redundant systems before.

The first was a super complex redundant router built by 100+ engineers. All TCP state, BGP state, and complete configurations were replicated to the backup CPU. If the primary failed (power off, etc.), the standby would take over. When the failed unit came back up, it became the new standby. The demo worked fine, but the company sold just one system, for $200K. The company itself did sell for four hundred million, though, and that paid off my mortgage.

The second was relatively simple, done by me in 3 weeks including testing. It was a system that converted 48 channels of digital MPEG streams to analog NTSC signals for analog cable plants such as Comcast's: a Xilinx PowerPC running Linux. All configuration was kept in sync, with automatic switchover on power off, network disconnect, etc. The system could detect all failure conditions within 50 milliseconds. Very cool to demo: blink and you would miss the failover event. It also supported in-service SW upgrades: update the backup to the new SW, sync the config over, force-switch the primary, and then update the new backup to the new SW, all with ZERO downtime.
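
For anyone curious what the core of that kind of failover looks like, here is a minimal standby-side sketch (Python, hypothetical names and ports; it assumes a UDP heartbeat from the primary and a 50 ms detection window, not the actual firmware described above):

    # Standby watches for heartbeats; if none arrive within the detection
    # window, it promotes itself. All names and ports are hypothetical.
    import socket
    import time

    HEARTBEAT_PORT = 5005      # hypothetical heartbeat port
    DETECT_WINDOW = 0.050      # 50 ms before declaring the primary dead

    def run_standby(become_primary):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", HEARTBEAT_PORT))
        sock.settimeout(0.005)             # poll in 5 ms slices
        last_seen = time.monotonic()
        while True:
            try:
                sock.recv(64)              # any datagram counts as a heartbeat
                last_seen = time.monotonic()
            except socket.timeout:
                pass
            if time.monotonic() - last_seen > DETECT_WINDOW:
                become_primary()           # e.g. take the synced config live
                return

    run_standby(lambda: print("standby promoting itself to primary"))

The real product presumably also replicated state and drove the hardware switchover, but a detect-then-promote loop like this is what the 50 ms figure comes down to.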

The product was very successful: only one firmware bug found after two years in the field, and $100 million+ sold to various cable companies. But the VCs air-dropped in a "professional CEO" to "scale" the company from 16 people to 300+, did two more rounds, managed to burn the company to the ground, and sold it for pennies.

Redundant system design is not that hard and can be kept simple. It is best to remember KISS: "Keep It Simple, Stupid".


I was called in to troubleshoot a system once: a small 2-node cluster with a SAN, with no single point of failure in the fabric.

They had both of their 1500VA UPSes plugged into a single 15-amp circuit.


I once found an electric kettle plugged into a UPS.


That's redundancy against thirsty developers.


Yes, it is indeed a problem when all replicas run the same software.

One approach that's been suggested is to use multiple independent implementations of the program. Or, if that's too onerous, N-Version Programming (https://en.wikipedia.org/wiki/N-version_programming).
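
As a rough illustration of the N-version idea (a toy sketch only; a real deployment would use independently developed programs, not three functions written by the same author):

    # Toy N-version voting harness. The three "versions" below are
    # hypothetical stand-ins for independently developed implementations.
    from collections import Counter

    def version_a(x): return x * x
    def version_b(x): return x ** 2
    def version_c(x): return sum(x for _ in range(x))   # silently wrong for x < 0

    def n_version_run(x, versions=(version_a, version_b, version_c)):
        results = [v(x) for v in versions]
        answer, votes = Counter(results).most_common(1)[0]
        if votes < 2:
            raise RuntimeError("no majority agreement: %r" % (results,))
        return answer

    print(n_version_run(5))   # 25 -- a single faulty version is outvoted

The classic criticism of N-version programming is that independently developed versions tend to fail on correlated inputs, so the redundancy you actually get is less than the voting scheme suggests.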


I like the word antifragile instead of redundancy. I don't operate a datacenter, but in my own personal computing space I try to build antifragile systems that can bounce back from failure rather rapidly. It is astonishing, for example, just how many times I have had to roll back a Windows 10 install because it likes to randomly break for whatever reason. It could be anything: a faulty update that corrupts the whole OS, the hard drive filling up really quickly when capturing gaming video, or games doing really heavy writes to the SSD and wearing it out within a year of use.

Currently I use virtual machines to mitigate this, though you need a beefy setup for it. If a virtual machine fails, I have a 'template' virtual disk image from which I can start afresh. One caveat to VMs is that you can't really do gaming, since you're emulating an OS, so when I can I use a bare-metal setup for gaming. Basic gaming works for games that use few resources (like Minecraft), but forget about playing Crysis or other monsters in a VM!
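
The "start afresh from a template" step is easy to automate; a minimal sketch (paths and filenames are hypothetical, and it assumes the template is a plain disk-image file that can simply be copied back into place):

    # Swap a broken VM disk out for a pristine copy of the template image.
    # All paths are hypothetical.
    import shutil
    from pathlib import Path

    TEMPLATE = Path("/vm/templates/win10-clean.qcow2")   # known-good image
    ACTIVE = Path("/vm/machines/win10-daily.qcow2")      # disk the VM boots from

    def restore_from_template():
        if ACTIVE.exists():
            # keep the broken disk around for forensics
            ACTIVE.replace(ACTIVE.with_name(ACTIVE.name + ".broken"))
        shutil.copy2(TEMPLATE, ACTIVE)    # fresh copy; reboot the VM afterwards

    restore_from_template()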


Antifragile, at least by the definition of Taleb, who I believe coined the term, is not "bouncing back quickly" but actually improving after the event.

I don't see how a computing system could be anti-fragile. A team building one, certainly, learning from each mistake and improving the system.

But a system by itself, barring some miraculous learning AI, I don't see how.


An antifragile computing system is a deep learning algorithm: it gets better from fluctuations in the data, because they force it to generalise rather than end up in a local minimum.


I realize I'm nitpicking, but that's not what antifragile means at all. I think the word for what you are describing is resilient.

Antifragile systems are those which don't just continue to function in the face of unknown inputs, but actually perform better after shocks to the system. A classic example is the California redwood forests that thrive in the face of regular forest fires.


(not the person you're replying to) I appreciate your nitpicking, because I was unaware of that distinction - so, today I learned something!


I'm actually surprised by this; I was under the impression that the CPU only takes a ~5-10% hit and that you can pass the GPU directly to a VM. I've never actually done it, but I expected VMs to be fine for gaming these days. Where does it fall apart?


VMs are fine for basic games like Minecraft (which isn't resource intensive). It falls apart when you play anything that is. Even if you have, say, 32GB of RAM and a good CPU and graphics card, the fact that you have to emulate anything at all makes it noticeably laggier. You can laud the fact that the GPU gets passed to the VM, but since we are emulating, you will notice it.


VMs aren't emulating anything. Services like Stadia and GeForce Now explicitly rely on virtualizing gaming machines to be able to scale them. Nvidia even has a technology for slicing up large graphics cards across multiple client VMs.


Xbox One/Series X/S actually runs games within a VM as well:

https://wccftech.com/xbox-one-architecture-explained-runs-wi...


A service that is basically a one-way video conference call with a remote system isn't exactly a benchmark for high-performance, low-latency gaming.


Why not robust?


'Robust' assumes the system doesn't need to bounce back from failure, since it's strong enough. But most systems these days are not so strong that they can deal with anything. We have to resort to hacks and mitigation strategies like snapshotting (in the case of VMs), or fault-tolerant filesystems like ZFS, or RAID if you operate a datacenter.



