Quite a bit of our infrastructure runs for very long periods (long in IT terms) - years between faults or reboots.

Sure, lots of domestic modems & routers have to be reset often, but it's common to find infrastructure routers that are only reset when they have to be moved! Says a lot about the software in domestic equipment.

I've worked with a mainframe that ran for a decade, and was only rebooted when a machine room fire required a full powerdown "right now"! But people are impressed when their Windows stays up for a few days %#%$#@#$!!

And remember Voyager, been ticking along since 1977.

But all these impressive systems are built to be impressive. Probably error-correcting RAM, etc. It shows we can build reliable hardware. And if a substantial fraction of users wanted it, things like ECC RAM etc would be only fractionally more expensive than our error prone alternatives.

Of course,



No disagreement in general, but to your point of infrastructure routers (assuming you refer to ISP and Internet backbone infrastructure):

Having worked in ISP security, IMHO a years-long uptime of such critical components is nothing to be proud of (anymore). Quite the contrary: those are complex components, so if you care about security you have to patch them regularly, including the occasionally required reboot. Just look at the list of security advisories from the relevant vendors (Cisco, Juniper, Nokia/Alcatel-Lucent, etc.) and you can find scary vulnerabilities! Granted, "rebooting" a core router is more nuanced than rebooting a regular PC (you can e.g. reboot one management engine of a pair, or just a line card, etc.), so it does not always mean that all traffic stops because of it.

Oh, and btw: your network design should be able to cope with such necessary reboots; otherwise you have a single point of failure.

Regards


> And remember Voyager, been ticking along since 1977.

This is running a single unthreaded process.

My Mac, which isn't doing anything fancy, has over 500 processes running on it. In fact, I just checked to see if anything bad was going on, and I recognize everything I look at - almost 100 processes from Chrome alone, for example.

How sure am I that all of these processes are running correct code? Chrome is running "101.0.4951.54 (Official Build) (x86_64)", which gives a hint of the disposable nature of that binary.
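If you want to repeat that kind of check from a script instead of Activity Monitor, a minimal sketch along these lines works; it assumes the third-party psutil package is installed, and the counts will obviously differ from machine to machine:

    # Count running processes and group them by name.
    # psutil is a third-party package (pip install psutil), not stdlib.
    from collections import Counter

    import psutil

    names = Counter(p.info["name"] for p in psutil.process_iter(["name"]))
    print(sum(names.values()), "processes total")
    for name, count in names.most_common(10):
        print(f"{count:4d}  {name}")

On a machine with Chrome open, its helper/renderer processes usually dominate the top of that list.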


How much of that hardware can run a random executable from the web? And does so pretty often?


This is what I meant by that parenthetical about not being general purpose.

A computer used for a single task is a bit like a 4WD truck that only stays inside the city limits. It could do those things, sure, but it never does, so it hasn't really proven anything.


> A computer used for a single task is a bit like a 4WD truck that only stays inside the city limits. It could do those things, sure, but it never does, so it hasn't really proven anything.

Not really. Very application-specific hardware+software being used for its exact purpose is very different from very general-purpose hardware+software being used for all sorts of things.

It's more like using an F1 car to race vs taking your average sporty car to a race track.

Sure, that F1 car will race better, but at the end of the day you can't drive home in it or move kids/groceries around in it.


That's a fair point, but it kind of seems like a straw man compared to the Voyager example.


My argument was aimed at the network/IT equipment, but Voyager is even more special since it's a very application-specific system.

Generalization itself is hard, and it gets MUCH harder when you have to care about backwards compatibility and random executables that can alter system state, because previous versions allowed that behavior and it needs to be supported for the common cases going forward.


All DDR5 DRAM ICs have on-die ECC. This is new for DDR5.


And the highest data rates to date, with the signal integrity requirements that accompany them. Got a piece of dust pushed down by your CPU cooler straddling two DIMM pins? Get ready for your machine to shred your data. And that's just a common, simple scenario. I'd be surprised if real-world error rates in nominal scenarios aren't higher than with DDR4.


You may be right; however, you could have said the same thing for DDR4 vs DDR3, with its crazy new high data rates!


And you'd again still likely be right. Ten years ago consumer electronics marketing never included signal-integrity material like eye diagrams, but now pretty much every Nvidia announcement with a new memory standard does. We're pushing ever closer to channel bandwidth limits, and corners that could be cut in the past can no longer be cut. ECC is more important than ever.


That's not the type of ECC the parent was talking about. On-die ECC is there because the densities and clock rates are so high for DDR5 that it needs ECC just to function properly, but like most standards the minimal implementation is quite watered down. It doesn't correct the full range of bit flips that a server with ECC RAM does.


Disagree. The parent was discussing the need to reboot after a system has been on for a large number of hours. The failure mode, assuming it's related to the DRAM, would be an accumulation of bit flips in the DRAM. Every memory has some FIT/megabit rate. The on-die ECC added in the DDR5 spec will be highly effective at addressing this failure mode.

Channel ECC is the ECC type most directly relevant to high clock rates and signal integrity. I agree with you that channel ECC becomes a practical requirement to meet the interface transaction rates of DDR5. It is also true that channel ECC is not mandatory in DDR5 and is not implemented by mainstream CPU platforms (just as with previous DDR generations).
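For anyone who hasn't looked at how single-bit correction works at all, here's a toy Hamming(7,4) sketch in Python. This is purely illustrative, not the actual code DDR5 on-die ECC uses (real implementations operate on much wider words), but it shows why a single accumulated bit flip can be corrected outright, whereas a bare parity bit could only detect it:

    # Toy Hamming(7,4): 4 data bits protected by 3 parity bits.
    # Codeword positions 1..7; parity at 1, 2, 4; data at 3, 5, 6, 7.

    def encode(d):                       # d = list of 4 data bits
        c = [0] * 8                      # index 0 unused for clarity
        c[3], c[5], c[6], c[7] = d
        c[1] = c[3] ^ c[5] ^ c[7]
        c[2] = c[3] ^ c[6] ^ c[7]
        c[4] = c[5] ^ c[6] ^ c[7]
        return c[1:]

    def correct(cw):                     # cw = list of 7 code bits
        c = [0] + list(cw)
        s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
        s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
        s3 = c[4] ^ c[5] ^ c[6] ^ c[7]
        pos = s1 + 2 * s2 + 4 * s3       # 0 = clean, else position of the flipped bit
        if pos:
            c[pos] ^= 1                  # flip it back
        return c[1:], pos

    data = [1, 0, 1, 1]
    stored = encode(data)
    stored[4] ^= 1                       # simulate one bit flip while the word sits in DRAM
    fixed, pos = correct(stored)
    print(fixed == encode(data), pos)    # True 5

Real ECC DIMMs use wider SECDED codes built on the same idea, plus an extra parity bit so double-bit errors are at least detected.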


If the on-die ECC reduces the error rate but the lack of standard channel ECC increases the error rate, because of the much more demanding signals, then it's not at all clear that the overall error rate will be lower.

In fact it could very well be higher depending on how the physical module is designed.

I imagine some portion of bit-flip-induced reboots are due to the actual DRAM chips, but some portion will also be due to everything else that can flip bits, both on the memory module itself and in the interconnect.

I haven't seen anything yet to say that DRAM chip bit flips will be in the majority.


It's never been clear to me whether the ECC is necessary for DDR5 to operate, or just a nice feature. Do you or any other readers happen to know the answer?


> And if a substantial fraction of users wanted it, things like ECC RAM etc would be only fractionally more expensive than our error prone alternatives.

I started in the PC industry in 1988.

Back then, they did. All IBM PC kit used 9-bit RAM, with a parity bit.

It was discarded during 1990s cost-cutting.
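For contrast with proper ECC, a quick illustrative sketch (Python, not how the hardware actually did it) of what that ninth bit bought you: parity can tell the machine a byte went bad, but not which bit, so the typical outcome was a halt with a parity error rather than a silent correction:

    # One even-parity bit per 8-bit byte: detects any single bit flip,
    # but cannot locate it, so it cannot correct it.

    def parity_bit(byte):
        return bin(byte).count("1") % 2

    stored = 0b10110100
    p = parity_bit(stored)               # written to the 9th bit

    corrupted = stored ^ 0b00001000      # one bit flips in memory
    print(parity_bit(corrupted) != p)    # True: error detected, location unknown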



