I was looking through semiengineering for more sources, and several of them address this. Aging (I should probably say "aging" rather than electromigration, since I'm not referring to just one effect here) is such a problem below 10nm that even just idling the chip wears it noticeably, and of course that leads to uneven wear across the cores too. It's not just electromigration anymore.
The reason you don't notice this is that the chip is engineered so that you don't. Boost clocks come down over time and the applied voltage goes up over time (dynamically controlled by measuring the degradation of canary cells). Unless the chip fails catastrophically, you probably won't notice the slowdown, even in scenarios that would have resulted in chip failure 20 years ago. The chips are simply designed to tolerate that, because they have to be, even during normal operation (!).
The lifespan of a 5nm chip is not "infinite if treated properly" anymore. It is actually finite even in terms of idle power-on hours, let alone load hours. It's a large number of power-on hours, deliberately engineered to be large and to tolerate the damage gracefully, but the mental model that "power-on hours don't hurt the chip" is fundamentally not correct anymore. Miners running lots of hours on a 7nm GPU is not "just fine".
Also, once you get beyond the "damage point", especially in analog circuits, you have simply changed the characteristics of the circuit. If an amplifier's aged bias circuit causes some other part of the circuit to be hit with a higher gain, that can keep damaging it even if you stop further damaging the bias circuit. Memory and PCIe interfaces count as analog circuits here.
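To make the "voltage goes up, boost comes down" part concrete, here's a toy sketch of that kind of compensation loop. This is not how any real vendor's firmware works; the names (`ChipState`, `adapt`), the step sizes, the voltage ceiling and the margin are all made up for illustration. The only idea taken from above is that a canary measurement drives the voltage up first and the boost clock down once there is no voltage headroom left.

```python
from dataclasses import dataclass


@dataclass
class ChipState:
    voltage_v: float   # supply voltage currently applied
    boost_mhz: float   # highest boost clock currently allowed


def adapt(state, canary_delay_ratio, v_max=1.30, v_step=0.005,
          f_step_mhz=25.0, margin=1.02):
    """One step of a hypothetical aging-compensation loop.

    canary_delay_ratio is the canary path delay measured now divided by
    its delay when the chip was new (1.0 means no measurable aging).
    All default values are illustrative assumptions, not real figures.
    """
    if canary_delay_ratio <= margin:
        # Still inside the guard band: nothing to compensate.
        return state
    if state.voltage_v + v_step <= v_max:
        # Headroom left: raise the voltage to win back the lost speed.
        return ChipState(state.voltage_v + v_step, state.boost_mhz)
    # No voltage headroom left: give up some boost clock instead.
    return ChipState(state.voltage_v, state.boost_mhz - f_step_mhz)


fresh = ChipState(voltage_v=1.20, boost_mhz=5000.0)
print(adapt(fresh, canary_delay_ratio=1.00))  # unchanged
print(adapt(fresh, canary_delay_ratio=1.05))  # voltage nudged up
worn = ChipState(voltage_v=1.30, boost_mhz=5000.0)
print(adapt(worn, canary_delay_ratio=1.05))   # boost clock reduced
```

From the user's side each step is invisible, which is the point: the chip keeps working, just at slightly more voltage or slightly less boost than when it was new.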
https://semiengineering.com/design-for-reliability-2/
> Digital and analog will be affected differently, as will devices subject to frequent change — and in some cases, infrequent change. “Any place where there’s a lot of activity will be more sensitive to device aging,” says Art Schaldenbrand, senior product manager at Cadence. “For devices, you can look at the clock tree and look at what is happening. Digital designs are sensitive to delay changes. The other place where this becomes a challenge is within analog designs. An example would be in a bias tree. With the bias transistors moving and aging, it can potentially accelerate the aging of other devices in the bias network. There’s always going to be some different elements in the design, and you have to look at them a little bit differently to be able to analyze the reliability.”
[ ...]
> But you have to be careful to consider all of the important areas. “There is a phenomenon called non-conductive stress,” says Cadence’s Schaldenbrand. “Consider a device such as a watch dog or monitor. It will be sitting idle, potentially for years, and you want it to spring into action if there’s some sort of condition that occurs. Even those circuits, that you think are you’re just sitting there doing nothing, are being stressed. They can age and potentially fail due to the aging that occurs while they’re sitting idle.”
https://semiengineering.com/aging-not-always-a-bad-thing/
> This impacts the gate because of the natural behavior of the transistors, Elhak explained. “In the transistor you have a gate, which has an electric field that is supposed to control the current that is flowing between the drain and the source but there are random events. This electric field causes some of those carriers, instead of flowing between the gate and the source, to go and get injected into the gate. As more carriers get injected over time, the electrical properties of the gate start to differ because it’s not supposed to have those carriers in it. That changes the properties of the whole device because now the gate is supposed to control that electric field it is now made of a different material.”
> The second mechanism that causes aging is called the bias temperature instability (BTI), which happens when there is a constant bias on the device meaning there is current flowing. Here, instead of being driven by electric field, it is driven here by bias and temperature. Also, charges start to get trapped into the gate and as this happens, the properties of the gate change and again it impacts the threshold voltage and the carrier mobility in that channel. “If you change the threshold voltage and if you change the mobility, then you have a different transistor,” he asserted.
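For a rough feel of what a drifting threshold voltage looks like against power-on hours, here's a toy calculation. BTI drift is commonly modelled as a power law in stress time, ΔVth ≈ A·t^n with n well below 1. The function name and the values of `a_mv` and `n` below are placeholders I picked for illustration; they are not from the articles and not from any real process.

```python
def vth_shift_mv(hours, a_mv=5.0, n=0.2):
    """Toy power-law model of threshold-voltage drift: dVth = A * t**n.

    hours: accumulated stress (power-on) time in hours.
    a_mv, n: illustrative fitting constants, NOT real process data.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return a_mv * hours ** n


for years in (1, 5, 10):
    hours = years * 8760  # 24/7 operation
    print(f"{years:>2} years on: ~{vth_shift_mv(hours):.1f} mV shift (toy numbers)")
```

The shape matters more than the numbers: the drift never stops accumulating while the part is powered, but it grows sub-linearly, so the early hours cost the most per hour.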
https://semiengineering.com/adding-aging-to-variability/
https://semiengineering.com/uneven-circuit-aging-becoming-a-...
https://semiengineering.com/transistor-aging-intensifies-10n...
https://semiengineering.com/chip-aging-becomes-design-proble...
https://semiengineering.com/minimizing-chip-aging-effects/
https://semiengineering.com/dealing-with-device-aging-at-adv...
https://semiengineering.com/24142954-2/
https://semiengineering.com/chip-aging-accelerates/