Is the uptime really technically true? Sure, Voyager has been operating for 40+ years, but all embedded systems must have watchdog timers. And given how hostile the space environment is, I'd be surprised if the main system hasn't been reset by a watchdog timer a couple of times to recover from fault conditions, so the actual uptime must be much less than that.
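For context, the pattern I have in mind looks roughly like this: a minimal C++ sketch assuming avr-libc's <avr/wdt.h>, with hypothetical helper functions. If the main loop ever hangs, the counter is never reset and the hardware reboots the chip:

    #include <avr/wdt.h>   // avr-libc hardware watchdog interface (AVR targets)

    // Hypothetical application work; if either call ever hangs,
    // the watchdog counter is never reset and the chip reboots.
    static void read_sensors(void)   { /* ... */ }
    static void send_telemetry(void) { /* ... */ }

    int main(void) {
        wdt_enable(WDTO_2S);      // arm the watchdog with a ~2 second timeout

        for (;;) {
            read_sensors();
            send_telemetry();
            wdt_reset();          // "kick the dog": still alive, restart the countdown
        }
    }

A reboot like that is invisible unless somebody counts it, which is exactly what I'm wondering about for Voyager.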
"Practically all of Voyager's redundancy is gone now, either because something broke along the way or it was turned off to conserve power. Of the 11 original instruments on Voyager 1, only five remain"
It suddenly started returning corrupted frames. So engineers had it slowly transmit a core dump, and found a single flipped bit. To fix the problem, they "reset" the computer, which I assume means they rebooted it.
> Most spacecraft have more than one Command Loss Timer Reset for subsystem level safety reasons, with the Voyager craft using at least 7 of these timers.
Isn't this splitting hairs? The real achievement, it seems to me, is continuous operation within spec. The watchdog timers (whatever they are) are part of the design that enabled this. I don't think it lessens the achievement at all.
> I don't think it lessens the achievement at all.
I was not implying that "the fact that the system has been rebooted lessens the achievement"; I said no such thing. I was just wondering about the details of the system and the technical accuracy of the statement. Isn't that the point of posting on HN?
Running a probe for 15,364 days without even a single bit flip or poweroff would be an extraordinary miracle that exceeded all reasonable expectations, not simply the greatest accomplishment.
Please don't assume that every technical statement or question implies undervaluation, criticism or attack, regardless of how common those are in tech.
Sure, but it's not conventionally what is called "uptime", an uninterrupted period in a normal and responsive state. Voyager has had faults and been rebooted. In any other situation that I'm aware of, you'd have to say that uptime started over or that uptime is <100%.
I wonder what the 'safest' uptime possible is today for a computer connected to the internet? E.g., what's the oldest Linux kernel that has no known remote attacks (not just remote exploits but DoS weaknesses too)?
To make it more difficult, what would be the safest uptime for a box that allowed remote logins? SSH flaws don't count, since you can always upgrade that on the fly, but kernel-level privilege escalation weaknesses would count as critical.
Why would you limit yourself to Linux? It has a very large attack surface, so new vulnerabilities are found all the time. It has the benefit of a large number of eyes on it, so it gets patched quickly (and can be live-patched for the most part), but it's not a good candidate for super-long uptime.
I would imagine something small and stripped down, serving a particular purpose, would fare better. And OpenBSD prides itself on security for general-purpose computing, but it still gets regular security fixes.
Don't look into BSD, you gotta look at mainframes. Some of the mainframes at banks have been running since they bought the very first one in the 90s or even the 80s. Since stuff like VMS allowed simple clustering, you could just add modern machines, transfer everything over, and shut down the old hardware without having to shut down the system itself. These are probably the only machines with a chance to reach 30+ years of uptime.
I was hinting at that, which is why we should define the uptime of a system rather than of a machine: with distributed systems, the uptime of the system isn't dependent on the uptime of a single "machine", and a mainframe is a distributed system even if it's in a single rack.
The question is then where you define the boundaries of a system and its uptime. At least from my recollection, for mainframes they defined uptime based on the execution of batch jobs and the availability of services, not the OS/hardware; if that crashed, it often involved Big Blue coming to investigate WTF happened and how it happened, since System Z machines are designed with so much redundancy that you can swap RAM modules without interrupting the workload.
Today with RAIM (RAID for Memory) IBM System Z machines even support an entire memory channel dying without interruption.
But the topic is safe uptime for a computer connected to the internet. I have no doubt a mainframe has a chance to reach 30+ years of lifetime, but I also have no doubt it'd be done for if attacked.
Linux was just an example, in fact I'd guess that there's probably a lot of BSD variants out there with ridiculous uptimes.
I think the 'allows user access' bit is the thing that would limit safe uptimes the most, since once you're logged in to a box, the kernel surface area for attacks is much larger.
It depends on your definition of a computer.
A bare-metal embedded system could be fine without any software updates, ever. Some of them are in-place sensors for weather/temperature/whatever.
Think of an Arduino with an Ethernet shield that measures air temperature, encrypts it with AES and sends it over a plain TCP connection.
There is no port open for listening, and there is nothing to hack into.
The only possible vulnerability is OTA update, but in some cases you might purposefully avoid that. For example, in the rail industry, the only way to update their devices is to physically change the hardware, and that is done intentionally.
The hardware might be sleeping for 99% of its service life, and is typically over-engineered. They could run into the next century if corrosion doesn't get them.
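Roughly what I have in mind, as an Arduino-style C++ sketch. The pin, server address, port and send interval are made up for illustration, and the AES step is left as a comment rather than inventing a library API:

    #include <SPI.h>
    #include <Ethernet.h>   // standard Arduino Ethernet shield library

    byte mac[] = { 0xDE, 0xAD, 0xBE, 0xEF, 0xFE, 0xED };
    IPAddress server(192, 168, 1, 10);   // hypothetical collector on the local network
    EthernetClient client;

    void setup() {
      Ethernet.begin(mac);               // DHCP; note that no listening socket is ever opened
    }

    void loop() {
      int raw = analogRead(A0);          // hypothetical temperature sensor on pin A0
      byte payload[2] = { byte(raw >> 8), byte(raw & 0xFF) };

      // ...encrypt payload with AES here before sending (library omitted)...

      if (client.connect(server, 9000)) {  // outbound-only TCP connection
        client.write(payload, sizeof(payload));
        client.stop();
      }
      delay(60000UL);                    // one reading per minute, idle the rest of the time
    }

Everything is outbound; the device never accepts a connection.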
If it's doing TCP then there's a whole network stack to attack. I bet there's a big range of embedded systems that can be crashed because they have a simplistic cut-down, non-hardened TCP/IP/Ethernet implementation that can be abused. You've got a chance of breaking it through sending malformed packets to cause a panic, or just exhausting its memory - which might cause it to reboot (maybe lots of fragmented packets?)
The lack of listening ports might shrink the attack surface, but a malicious endpoint that it connects to might be able to confuse it. (or perhaps some evil MITM attacker)
I bet that these devices are local only. Working in IT, I'd guess the security measures for these devices are that they sit on their own isolated VLAN. Routing is likely controlled by layer 3 switches with dedicated ACLs that only allow certain systems to reach that VLAN. If it's a rail company, the network itself is likely protected by a Cisco ASA or an ASA-style dedicated network security appliance. Networks have been hardened pretty well. Most often the exploits that IT worries about are application-based and user-based (social engineering).
It's probably the best way to secure them - but I think that defeats the spirit of the 'connected to the internet' part of this uptime security challenge :)
Malicious endpoint is certainly an interesting scenario, let me work through a few possibilities:
1. Wiznet produces chips where the entire network stack is implemented in hardware; there is no code. Malformed packets will never touch my code.
http://shop.wiznet.eu/chips/w5500.html
However, this is a minority of devices.
2. These devices typically have no dynamic memory allocation at all; they work with a circular buffer of fixed size (see the sketch below). Normally they can't run out of memory.
3. You might be able to clog up their CPU; however, they typically use an RTOS, and those operating systems place hard limits on the amount of CPU time different components of the system can take. At best, you will prevent the device from sending out anything over the network while you are attacking it.
4. Of course, if there are mistakes in the TCP/IP stack, you might cause issues.
Most of them are using the lwIP stack; I am not qualified to comment on its security. https://savannah.nongnu.org/projects/lwip/
I am not claiming they don't, but often these devices only run thousands of lines of code, not millions. The security challenge is much, much smaller than securing the Linux kernel.
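To illustrate point 2, here's a minimal C++ sketch of the fixed-size circular buffer idea (the size and names are made up). Everything lives in a static array, so there is no heap to exhaust; when the buffer is full, new bytes are simply dropped:

    #include <cstdint>
    #include <cstddef>

    // Fixed-size ring buffer: all storage is a static array decided at
    // compile time, so the device can never run out of memory at runtime.
    static const std::size_t RX_BUF_SIZE = 256;

    static std::uint8_t rx_buf[RX_BUF_SIZE];
    static std::size_t  rx_head  = 0;   // next write position
    static std::size_t  rx_tail  = 0;   // next read position
    static std::size_t  rx_count = 0;   // bytes currently stored

    // Called from the network driver; returns false (byte dropped) when full.
    bool rx_push(std::uint8_t b) {
        if (rx_count == RX_BUF_SIZE) return false;   // full: drop, don't allocate
        rx_buf[rx_head] = b;
        rx_head = (rx_head + 1) % RX_BUF_SIZE;
        ++rx_count;
        return true;
    }

    // Called from the application loop; returns false when empty.
    bool rx_pop(std::uint8_t *out) {
        if (rx_count == 0) return false;
        *out = rx_buf[rx_tail];
        rx_tail = (rx_tail + 1) % RX_BUF_SIZE;
        --rx_count;
        return true;
    }

In a real driver the push/pop would need to be interrupt-safe, but the point stands: the memory footprint is fixed at compile time.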
No real reason, just that they potentially give you unlimited uptime. I know that's a good thing, just not for this particular silly challenge :)
Although, in-place upgrades haven't been around all that long in the grand scheme of things. Perhaps there's a box out there that predates in-place upgrades and is still running securely?
As someone who works in finance, I can tell you it's going to be less than a year. Although uncommon, those things sometimes reboot as part of maintenance. For example, we upgraded our COBOL compiler and we all got an email that for like half an hour after midnight on a weekend the mainframe would be down for maintenance.
My experience is literally that 1/3 of my logins result in "our system is down right now". Probably because I often want to check my account in the evening on a weekend.
One of the main things I remember about these old CCSs is the low-level hardware redundancy:
> The Viking CCS had two of everything: power supplies, processors, buffers, inputs, and outputs. Each element of the CCS was cross strapped which allowed for “single fault tolerance” redundancy so that if one part of one CCS failed, it could make use of the remaining operational one in the other. [1]
Modern systems like that of the Curiosity rover also use hardware redundancy (triple redundancy, even), but I believe this happens at a much higher level, i.e. the whole computer.
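For what it's worth, the classic low-level version of this is a majority voter over three redundant copies of a value. A toy C++ sketch of the textbook bitwise trick (not how Viking or Curiosity actually implement it, just the idea):

    #include <cstdint>
    #include <cstdio>

    // Triple modular redundancy: for each bit, keep whatever value at least
    // two of the three copies agree on, which masks a fault in any one copy.
    static std::uint32_t tmr_vote(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
        return (a & b) | (a & c) | (b & c);
    }

    int main() {
        std::uint32_t stored = 0xCAFEBABE;
        std::uint32_t copy_a = stored;
        std::uint32_t copy_b = stored ^ (1u << 7);   // simulate a bit flip in one copy
        std::uint32_t copy_c = stored;

        std::printf("voted value: 0x%08X\n",
                    (unsigned) tmr_vote(copy_a, copy_b, copy_c));  // prints 0xCAFEBABE
        return 0;
    }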
In that subreddit there are usually a lot of reports from network devices. Achieving a long uptime on a server where people do things all day is difficult.
Being proud of not having patched a network device for 15 years is a great way to get in contact with security and HR at my company. I can't imagine thinking it's a great thing to go and brag about on the internet lol, different worlds.
Great video!
My only nitpick was when Aaron said the camera resolution (800x800) was 640 MEGAPIXELS. I can understand why he misspoke. Everybody today uses megapixels as the measure of resolution, but back in the mid 1970s digital cameras did not yet exist, and the resolution of the onboard image orthicon tube was actually just 640 KILOPIXELS (800 x 800 = 640,000 pixels).
Does anyone have any good resources on how to design systems like these? I find the idea of computers that have to work for decades, be autonomous and self-repairing, really exciting.
Seriously, at this point I have multiple browsers open, multiple tabs, multiple programs split over multiple desktops with my workflows.
But more seriously, there's an encrypted drive that has a lot of data on it that I've forgotten the password for, and I can't remember the system I used to encrypt it / set it up, so figuring out how to change it is always pushed off as future LandR's problem. A restart and I'm screwed!
I'd basically just give up and go live in the woods.
I know that feeling. I've got an old Mac mini in a remote server room, with limited on-site access. It's been up for 2.5 years, going through various Ubuntu releases and upgrades. I'm afraid to reboot it because it initially had a strange boot setup, and there have been enough changes now that I'm not sure it'll come back up. So I keep delaying the inevitable, and hope it'll last until I need to get a new machine :)
There are companies in that situation too. They live in perpetual fear of power failure or hardware crashes. It's exactly the sort of thing we're on the lookout for during technical due diligence. Anything that you are afraid of rebooting is a risk that needs mitigation while the system is still up and running.
I strongly dislike that his words are always distorted by taking this small segment of what he said out of context.
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail." - Donald Knuth
To be honest, I don't think taking the sentence out of context distorts much. This quote (which I see in full length for the first time) pretty much says how I always understood the shorter version.
It's not "optimization is the root of all evil". The key is "premature optimization". Maybe people gloss over that part, but it is right there.
Yes, Knuth goes into more detail on what he considers premature optimization in the context of programming computers. However the short sentence applies much more broadly in my experience.
For example, "premature optimization" of BOM costs in a hardware project can cost you dearly down the road when it turns out that leaving in some extra flexibility in the design would be mighty useful.
Also, of course there are always exceptions to a platitude. I don't think we need to couch every single statement we ever make with "...but there are exceptions, of course!" which is basically what Knuth goes on to belabor.
For further context, the quote was justifying using a goto statement to shave 12% of the execution time off a function. Knuth brought it up specifically to acknowledge that he was aware of the principle and to stave off arguments. I more often see it used to push back against any changes made for speed.
It brings to mind a quote from Ralph Waldo Emerson that is often abused in a similar manner: "A foolish consistency is the hobgoblin of little minds." It is interesting to observe the results of someone treating "foolish" as a filler word that can be glossed over. For our purposes here, substitute "foolish" with "premature". Leave that word out, and more context is needed. But that's why Knuth didn't leave that word out. With one simple adjective, the statement stands as is.