Is the uptime really technically true? Sure, Voyager has been operating for 40+ years, but all embedded systems must have watchdog timers. And given how hostile the space environment is, I'd be surprised if the main system hasn't been reset by a watchdog timer a couple of times to recover from fault conditions, so the actual uptime must be much less than that.
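For context, the pattern I have in mind looks roughly like this: a minimal C++ sketch assuming avr-libc's <avr/wdt.h>, with hypothetical helper functions. If the main loop ever hangs, the counter is never reset and the hardware reboots the chip:

    #include <avr/wdt.h>   // avr-libc hardware watchdog interface (AVR targets)

    // Hypothetical application work; if either call ever hangs,
    // the watchdog counter is never reset and the chip reboots.
    static void read_sensors(void)   { /* ... */ }
    static void send_telemetry(void) { /* ... */ }

    int main(void) {
        wdt_enable(WDTO_2S);      // arm the watchdog with a ~2 second timeout

        for (;;) {
            read_sensors();
            send_telemetry();
            wdt_reset();          // "kick the dog": still alive, restart the countdown
        }
    }

A reboot like that is invisible unless somebody counts it, which is exactly what I'm wondering about for Voyager.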
"Practically all of Voyager's redundancy is gone now, either because something broke along the way or it was turned off to conserve power. Of the 11 original instruments on Voyager 1, only five remain"
It suddenly started returning corrupted frames. So engineers had it slowly transmit a core dump, and found a single flipped bit. To fix the problem, they "reset" the computer, which I assume means they rebooted it.
> Most spacecraft have more than one Command Loss Timer Reset for subsystem level safety reasons, with the Voyager craft using at least 7 of these timers.
Isn't this splitting hairs? The real achievement, it seems to me, is continuous operation within spec. The watchdog timers (whatever they are) are part of the design that enabled this. I don't think it lessens the achievement at all.
> I don't think it lessens the achievement at all.
I was not implying that "the fact that the system has been rebooted lessens the achievement"; I said no such thing. I was just wondering about the details of the system and the technical accuracy of the statement. Isn't that the point of posting on HN?
Running a probe for 15,364 days without even a single bit flip or poweroff would be an extraordinary miracle that exceeded all reasonable expectations, not simply the greatest accomplishment.
Please don't assume that every technical statement or question implies undervaluation, criticism or attack, regardless of how common those are in tech.
Sure, but it's not conventionally what is called "uptime", an uninterrupted period in a normal and responsive state. Voyager has had faults and been rebooted. In any other situation that I'm aware of, you'd have to say that uptime started over or that uptime is <100%.
I wonder what the 'safest' uptime possible is today for a computer connected to the internet? E.g., what's the oldest Linux kernel that has no known remote attacks (not just remote exploits but DoS weaknesses too)?
To make it more difficult, what would be the safest uptime for a box that allowed remote logins? SSH flaws don't count, since you can always upgrade that on the fly, but kernel-level privilege escalation weaknesses would count as critical.
Why would you limit yourself to Linux? It has a very large attack surface, so new vulnerabilities are found all the time. It has the benefit of a large number of eyes on it, so it gets patched quickly (and can be live-patched for the most part), but it's not a good candidate for super-long uptime.
I would imagine something small and stripped down, serving a particular purpose, would fare better. And OpenBSD prides itself on security for general-purpose computing, but it still gets regular security fixes.
Don't look into BSD, you gotta look at mainframes. Some of the mainframes at banks have been running since they bought the very first one in the 90s or even the 80s. Since stuff like VMS allowed simple clustering, you could just add modern machines, transfer everything over, and shut down the old hardware without having to shut down the system itself. These are probably the only machines with a chance to reach 30+ years of uptime.
I was hinting at that, which is why we should define the uptime of a system rather than of a machine: with distributed systems, the uptime of the system isn't dependent on the uptime of a single "machine", and a mainframe is a distributed system even if it's in a single rack.
The question is then where you define the boundaries of a system and its uptime. At least from my recollection, for mainframes they defined uptime based on the execution of batch jobs and the availability of services, not the OS/hardware; if that crashed, it often involved Big Blue coming to investigate WTF happened and how it happened, since System Z machines are designed with so much redundancy that you can swap RAM modules without interrupting the workload.
Today with RAIM (RAID for Memory) IBM System Z machines even support an entire memory channel dying without interruption.
But the topic is safe uptime for a computer connected to the internet. I have no doubt a mainframe has a chance to reach 30+ years of lifetime, but I also have no doubt it'd be done for if attacked.
Linux was just an example, in fact I'd guess that there's probably a lot of BSD variants out there with ridiculous uptimes.
I think the 'allows user access' bit is the thing that would limit safe uptimes the most, since once you're logged in to a box, the kernel surface area for attacks is much larger.
It depends on your definition of a computer.
A bare-metal embedded system could be fine without any software updates, ever. Some of them are in-place sensors for weather/temperature/whatever.
Think of an Arduino with an Ethernet shield that measures air temperature, encrypts it with AES and sends it over a plain TCP connection.
There is no port open for listening, and there is nothing to hack into.
The only possible vulnerability is OTA update, but in some cases you might purposefully avoid that. For example, in the rail industry, the only way to update their devices is to physically change the hardware, and that is done intentionally.
The hardware might be sleeping for 99% of its service life, and is typically over-engineered. They could run into the next century if corrosion doesn't get them.
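Roughly what I have in mind, as an Arduino-style C++ sketch. The pin, server address, port and send interval are made up for illustration, and the AES step is left as a comment rather than inventing a library API:

    #include <SPI.h>
    #include <Ethernet.h>   // standard Arduino Ethernet shield library

    byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
    IPAddress server(192, 168, 1, 10);   // hypothetical collector on the local network
    EthernetClient client;

    void setup() {
      Ethernet.begin(mac);               // DHCP; note that no listening socket is ever opened
    }

    void loop() {
      int raw = analogRead(A0);          // hypothetical temperature sensor on pin A0
      byte payload[2] = { byte(raw >> 8), byte(raw & 0xFF) };

      // ...encrypt payload with AES here before sending (library omitted)...

      if (client.connect(server, 9000)) {  // outbound-only TCP connection
        client.write(payload, sizeof(payload));
        client.stop();
      }
      delay(60000UL);                    // one reading per minute, idle the rest of the time
    }

Everything is outbound; the device never accepts a connection.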
If it's doing TCP then there's a whole network stack to attack. I bet there's a big range of embedded systems that can be crashed because they have a simplistic cut-down, non-hardened TCP/IP/Ethernet implementation that can be abused. You've got a chance of breaking it through sending malformed packets to cause a panic, or just exhausting its memory - which might cause it to reboot (maybe lots of fragmented packets?)
The lack of listening ports might shrink the attack surface, but a malicious endpoint that it connects to might be able to confuse it. (or perhaps some evil MITM attacker)
I bet that these devices are local only. Working in IT, I'd guess the security measures for these devices are that they sit on their own isolated VLAN. Routing is likely controlled by layer 3 switches with dedicated ACLs that only allow certain systems to reach that VLAN. If it's a rail company, the network itself is likely protected by a Cisco ASA or an ASA-style dedicated network security appliance. Networks have been hardened pretty well. Most often the exploits that IT worries about are application-based and user-based (social engineering).
It's probably the best way to secure them - but I think that defeats the spirit of the 'connected to the internet' part of this uptime security challenge :)
Malicious endpoint is certainly an interesting scenario, let me work through a few possibilities:
1. Wiznet produces chips where the entire network stack is implemented in hardware; there is no code. Malformed packets will never touch my code.
http://shop.wiznet.eu/chips/w5500.html
However, this is a minority of devices.
2. These devices typically have no dynamic memory allocation at all; they work with a circular buffer of fixed size (see the sketch below). Normally they can't run out of memory.
3. You might be able to clog up their CPU; however, they typically use an RTOS, and those operating systems place hard limits on the amount of CPU time different components of the system can take. At best, you will prevent the device from sending out anything over the network while you are attacking it.
4. Of course, if there are mistakes in the TCP/IP stack, you might cause issues.
Most of them are using the lwIP stack; I am not qualified to comment on its security. https://savannah.nongnu.org/projects/lwip/
I am not claiming they don't, but often these devices only run thousands of lines of code, not millions. The security challenge is much, much smaller than securing the Linux kernel.
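To illustrate point 2, here's a minimal C++ sketch of the fixed-size circular buffer idea (the size and names are made up). Everything lives in a static array, so there is no heap to exhaust; when the buffer is full, new bytes are simply dropped:

    #include <cstdint>
    #include <cstddef>

    // Fixed-size ring buffer: all storage is a static array decided at
    // compile time, so the device can never run out of memory at runtime.
    static const std::size_t RX_BUF_SIZE = 256;

    static std::uint8_t rx_buf[RX_BUF_SIZE];
    static std::size_t  rx_head  = 0;   // next write position
    static std::size_t  rx_tail  = 0;   // next read position
    static std::size_t  rx_count = 0;   // bytes currently stored

    // Called from the network driver; returns false (byte dropped) when full.
    bool rx_push(std::uint8_t b) {
        if (rx_count == RX_BUF_SIZE) return false;   // full: drop, don't allocate
        rx_buf[rx_head] = b;
        rx_head = (rx_head + 1) % RX_BUF_SIZE;
        ++rx_count;
        return true;
    }

    // Called from the application loop; returns false when empty.
    bool rx_pop(std::uint8_t *out) {
        if (rx_count == 0) return false;
        *out = rx_buf[rx_tail];
        rx_tail = (rx_tail + 1) % RX_BUF_SIZE;
        --rx_count;
        return true;
    }

In a real driver the push/pop would need to be interrupt-safe, but the point stands: the memory footprint is fixed at compile time.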
No real reason, just that they potentially give you unlimited uptime. I know that's a good thing, just not for this particular silly challenge :)
Although, in-place upgrades haven't been around all that long in the grand scheme of things. Perhaps there's a box out there that predates in-place upgrades and is still running securely?
As someone who works in finance, I can tell you it's going to be less than a year. Although uncommon, those things sometimes reboot as part of maintenance. For example, we upgraded our COBOL compiler and we all got an email that for like half an hour after midnight on a weekend the mainframe would be down for maintenance.
My experience is literally that 1/3 of my logins result in "our system is down right now". Probably because I often want to check my account in the evening on a weekend.
One of the main things I remember about these old CCSs is the low-level hardware redundancy:
> The Viking CCS had two of everything: power supplies, processors, buffers, inputs, and outputs. Each element of the CCS was cross strapped which allowed for “single fault tolerance” redundancy so that if one part of one CCS failed, it could make use of the remaining operational one in the other. [1]
Modern systems like that of the Curiosity rover also use hardware redundancy (triple redundancy, even), but I believe this happens at a much higher level, i.e. the whole computer.
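For what it's worth, the classic low-level version of this is a majority voter over three redundant copies of a value. A toy C++ sketch of the textbook bitwise trick (not how Viking or Curiosity actually implement it, just the idea):

    #include <cstdint>
    #include <cstdio>

    // Triple modular redundancy: for each bit, keep whatever value at least
    // two of the three copies agree on, which masks a fault in any one copy.
    static std::uint32_t tmr_vote(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
        return (a & b) | (a & c) | (b & c);
    }

    int main() {
        std::uint32_t stored = 0xCAFEBABE;
        std::uint32_t copy_a = stored;
        std::uint32_t copy_b = stored ^ (1u << 7);   // simulate a bit flip in one copy
        std::uint32_t copy_c = stored;

        std::printf("voted value: 0x%08X\n",
                    (unsigned) tmr_vote(copy_a, copy_b, copy_c));  // prints 0xCAFEBABE
        return 0;
    }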
In that subreddit there are usually a lot of reports from network devices. Achieving a long uptime on a server where people do things all day is difficult.
Being proud of not having patched a network device for 15 years is a great way to get in contact with security and HR at my company. I can't imagine thinking it's a great thing to go and brag about on the internet lol, different worlds.
Great video!
My only nitpick was when Aaron said the camera resolution (800x800) was 640 MEGAPIXELS. I can understand why he misspoke. Everybody today uses megapixels as the measure of resolution, but back in the mid 1970s digital cameras did not yet exist, and the resolution of the onboard image orthicon tube was actually just 640 KILOPIXELS (800 x 800 = 640,000 pixels).
Does anyone have any good resources on how to design systems like these? I find the idea of computers that have to work for decades, be autonomous and self-repairing, really exciting.
Seriously, at this point I have multiple browsers open, multiple tabs, multiple programs split over multiple desktops with my workflows.
But more seriously, there's an encrypted drive that has a lot of data on it that I've forgotten the password for, and I can't remember the system I used to encrypt it / set it up, so figuring out how to change it is always pushed off as future LandR's problem. A restart and I'm screwed!
I'd basically just give up and go live in the woods.
I know that feeling. I've got an old Mac mini in a remote server room, with limited on-site access. It's been up for 2.5 years, going through various Ubuntu releases and upgrades. I'm afraid to reboot it because it initially had a strange boot setup, and there have been enough changes now that I'm not sure it'll come back up. So I keep delaying the inevitable, and hope it'll last until I need to get a new machine :)
There are companies in that situation too. They live in perpetual fear of power failure or hardware crashes. It's exactly the sort of thing we're on the lookout for during technical due diligence. Anything that you are afraid of rebooting is a risk that needs mitigation while the system is still up and running.
I strongly dislike that his words are always distorted by taking this small segment of what he said out of context.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail." - Donald Knuth
To be honest, I don't think taking the sentence out of context distorts much. This quote (which I see in full length for the first time) pretty much says how I always understood the shorter version.
It's not "optimization is the root of all evil". The key is "premature optimization". Maybe people gloss over that part, but it is right there.
Yes, Knuth goes into more detail on what he considers premature optimization in the context of programming computers. However the short sentence applies much more broadly in my experience.
For example, "premature optimization" of BOM costs in a hardware project can cost you dearly down the road when it turns out that leaving in some extra flexibility in the design would be mighty useful.
Also, of course there are always exceptions to a platitude. I don't think we need to couch every single statement we ever make with "...but there are exceptions, of course!" which is basically what Knuth goes on to belabor.
For further context, the quote was justifying using a goto statement to shave 12% of the execution time off a function. Knuth brought it up specifically to acknowledge that he was aware of the principle and to stave off arguments. I more often see it used to push back against any changes made for speed.
It brings to mind a quote from Ralph Waldo Emerson that is often abused in a similar manner: "A foolish consistency is the hobgoblin of little minds." It is interesting to observe the results of someone treating "foolish" as a filler word that can be glossed over. For our purposes here, substitute "foolish" with "premature". Leave that word out, and more context is needed. But that's why Knuth didn't leave that word out. With one simple adjective, the statement stands as is.