
Could you dumb it down? Are you saying that because all the SSDs were from the same manufacturer and installed at the same time their chance of collectively wearing out simultaneously was high?


That's true for any type of disk. If you install a disk array using disks from the same lot from the same manufacturer, it's extremely likely that you'll get disk failures at more or less the exact same time. You always want to use mixed lot numbers or disks with significantly different manufacturing dates in your arrays. Some people buy spares from multiple manufacturers; others buy from multiple vendors, since it's unlikely you'll get the same lot that way.

RAID 1 naturally spreads writes across all disks. RAID 5 is designed to spread writes evenly across all disks. Sure, certain applications will do uneven writing, but in general it will be fairly even and this is by design. RAID 3 and RAID 4 were abandoned partially because they used dedicated parity drives, and the result was that the parity drives had much higher write loads and so they failed all the time. This meant that the arrays were more often rebuilding or running with degraded protection.
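
As a rough illustration of that design difference, here's a toy sketch of where parity lands stripe by stripe; the four-disk width and the left-symmetric-style rotation are illustrative choices, not a claim about any particular controller:

    # Toy illustration: which disk holds parity for each stripe in a 4-disk set.
    # RAID 4 pins parity to one disk (which then eats every parity write);
    # RAID 5 rotates it, spreading those writes across all members.
    N_DISKS = 4   # hypothetical array width

    def raid4_parity_disk(stripe: int) -> int:
        return N_DISKS - 1                       # dedicated parity disk

    def raid5_parity_disk(stripe: int) -> int:
        return (N_DISKS - 1 - stripe) % N_DISKS  # "left-symmetric"-style rotation

    for stripe in range(8):
        print(f"stripe {stripe}: RAID4 parity on disk {raid4_parity_disk(stripe)}, "
              f"RAID5 parity on disk {raid5_parity_disk(stripe)}")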


> You always want to use mixed lot numbers or disks with significantly different manufacturing dates in your arrays.

For the few of us left who know how to do this (or even that it's beneficial to do in the first place), it's becoming less practical to do at scale.

It's hard enough finding someone who will hire me and use any of These Things I Know about operating hardware, instead of just my "automation" skills, as if all problems can be solved in software.


> as if all problems can be solved in software.

Isn't this one true to some extent though? We already have scrubbing to detect/fix bitrot. Why not make it (slightly) purposefully unbalanced, so you can avoid the simultaneous failure? I expect your time is worth more money than making one drive fail a few weeks early.


> Isn't this one true to some extent though?

People who write software for a living tend to think so, but the "some extent" is the real issue. For many classes of problems, merely spending more money on hardware (or procedures/process) is objectively better, but one has to know the alternative exists and be willing to make the comparison. Some problems just have physical limitations.

> Why not make it (slightly) purposefully unbalanced, so you can avoid the simultaneous failure?

I think the short answer is because drive failure is non-deterministic and relatively unlikely in the general case.

> I expect your time is worth more money than making one drive fail a few weeks early.

My (ops) time may or may not be worth more than the software engineering time required to make that drive fail early. (NB that the main cost of the hardware solution isn't necessarily time but is often reduced availability and therefore higher cost of suitable parts).

Your proposal also still ignores the fact that some of these drive failures may simply not be manipulable by software. For example, any failure that's correlated with power-on (or spun-up) time, rather than usage, such as bearing failures, could still happen simultaneously (and affect hot spares, a nightmarish situation).

The tried-and-true engineering solution, which happens to be hardware/process based, actually works, and can be shown to address nearly all known drive failures [1], and has a measurable cost. The same can't be said for a software-only attempt to replace it.

[1] Firmware bugs, which do things like returning all-zeros on reads of just-written blocks, being a notable exception.


> I think the short answer is because drive failure is non-deterministic and relatively unlikely in the general case.

If it's deterministic enough to be worth sourcing drives from different batches, why wouldn't it be enough to add a small amount of writes on purpose?

> Your proposal also still ignores the fact that some of these drive failures may simply not be manipulable by software. For example, any failure that's correlated with power-on (or spun-up) time, rather than usage, such as bearing failures, could still happen simultaneously (and affect hot spares, a nightmarish situation).

Power cut / spinup / other conditions can be replicated from the OS level as well. I didn't list them, as opposed to ignoring them. It does sound like a good idea to do those as well, considering it could save you from losing all the drives after a power loss / system crash.


> If it's deterministic enough to be worth sourcing drives from different batches,

I suspect you're using a mistaken premise.

It's worth sourcing from different batches because failures are not deterministic. Instead, we merely have probabilities based on past experiences (usually from vast data generously provided by operators of spindles at huge scale).

> why wouldn't it be enough to add a small amount of writes on purpose

Well, it's not enough, because it might only protect against simultaneity of certain failures. It also doesn't actually reduce the potential impact of the failures, merely buying more reaction time. By distributing a single batch of drives across many arrays, even a simultaneous failure is just increased replacement maintenance cost (if that's even the strategy, rather than enough hot spares and abandon-in-place), without the looming data loss. With the software staggering of write amplification, each failure could be the start of a cascade, in which case replacement takes on a time-critical aspect. This replacement emergency ends up being an operational (not software) solution, as well.
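
As a rough sketch of what that distribution can look like (the inventory, lot names, and array count below are made up; real tooling would pull lot numbers from the actual drives):

    # Sketch: deal drives out so no single array concentrates one lot,
    # assuming we know each drive's lot number.
    from collections import defaultdict

    drives = [                      # (serial, lot) - hypothetical inventory
        ("S1", "LOT-A"), ("S2", "LOT-A"), ("S3", "LOT-A"),
        ("S4", "LOT-B"), ("S5", "LOT-B"), ("S6", "LOT-B"),
        ("S7", "LOT-C"), ("S8", "LOT-C"), ("S9", "LOT-C"),
    ]
    N_ARRAYS = 3

    by_lot = defaultdict(list)
    for serial, lot in drives:
        by_lot[lot].append(serial)

    # Round-robin each lot's drives across the arrays, so any one array
    # holds as few drives from a given lot as the counts allow.
    arrays = defaultdict(list)
    for lot_drives in by_lot.values():
        for i, serial in enumerate(lot_drives):
            arrays[i % N_ARRAYS].append(serial)

    for idx in sorted(arrays):
        print("array", idx, "->", arrays[idx])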

My worry would be that the software scheme provides a false sense of security.

Additionally, you may want to quantify what "small amount" is, considering you're suggesting such an algorithm would allow for failures multiple weeks apart. 3 weeks is 2% of 3 years. For an array of 12 drives, does that mean the 12th drive would need 22% fewer writes than the 1st drive?
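
For what it's worth, the back-of-envelope behind those percentages, assuming wear-out scales roughly linearly with writes and a nominal ~3-year life (both assumptions, not measurements):

    # Staggering 12 drives by ~3 weeks each over a ~3-year nominal life,
    # assuming wear-out scales linearly with writes.
    life_weeks = 3 * 52                  # ~156 weeks
    per_step = 3 / life_weeks            # ~1.9% fewer writes per drive per step
    drives = 12
    spread = (drives - 1) * per_step     # 11 steps between the 1st and 12th drive
    print(f"per-step reduction: {per_step:.1%}")              # ~1.9%
    print(f"12th drive vs 1st: ~{spread:.0%} fewer writes")   # ~21% (2% x 11 rounds to 22%)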

Of course, beyond any performance hit, write amplification for SSDs has other deleterious effects (as per the article). A software solution would have to account for yet another corner case... or just stop trying to re-invent in software what already has a pretty comprehensive solution in operations.

> Power cut / spinup / other conditions can be replicated from the OS level as well.

Not necessarily, although I suspect that's true nearly always on modern equipment. However, that's not what I meant. What I meant was failures that occur more frequently merely with the time the drive has spent powered on (or powered on and spinning). Even if that could be simulated relativistically somehow, that wouldn't be a software solution, either.

Also, adding a "chaos monkey" of the kind that powers down a drive in a running array would both introduce a performance hit that I expect a majority of environments would find unacceptable (more than would find write amplification acceptable) and would introduce additional wear and tear on mechanical drives. The latter may be worth it, but I'd be hard pressed to quantify it. It would be different if limited to hot spares, but that's also of limited utility.

You'd also have to be extremely careful in implementation, as a bug here could make a previously viable array into a data-lost array. If such a technique reveals a drive failure, I'd want it to stop immediately so as to be able to replace it with a different one from a different batch and have enough replacements on hand, in case all the rest suffer the same fate.

> I didn't list them, as opposed to ignoring them.

Unfortunately, it's impossible to tell the difference in discussions on this topic, because, as I mentioned, so few people have first hand knowledge (or have done the research). Even before "the cloud", there was more mythology than hard data (including about temperature, until Google published data debunking that).


If you are willing to move to Geneva, I believe that CERN or any of the LHC experiments could use your skills.

It may be worth visiting their careers page.


Relocation isn't something I'm open to, at this point. (Being in the SF Bay Area, I'm not yet worried that this limits me excessively).

I suppose it's also worth noting that I'm sceptical that any organization that large wouldn't have an equally narrow interest in my skillset.

My goal is to be able to apply as close to the full breadth of what I know and can do as possible, rather than something like specifically avoiding automation or specifically exercising my storage knowledge. For that, startups and other small companies seem best, though, oddly, not lately.


Most definitely yes.

But in addition to this, standard RAID5 does not periodically read the data, so it's actually rather common for issues to only arise on a resilver.

This is why proper maintenance in ZFS is to run a scrub (which basically re-reads and verifies everything in the pool) once a week.


Once a week sounds extreme unless we're talking about a smallish SSD-only pool. You may wear out your (spinning) disks with scrubs more than you do with real workloads. Also, depending on the pool size, it may take days for a scrub to complete.
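
For a rough sense of scale (pool size and per-disk throughput here are purely illustrative):

    # Back-of-envelope scrub duration: a scrub has to read everything allocated,
    # so the best case is bounded by aggregate sequential read throughput.
    allocated_tb = 100        # hypothetical amount of data in the pool
    disks = 12                # hypothetical spindle count
    mb_per_s_each = 150       # optimistic per-disk sequential read rate

    hours = (allocated_tb * 1_000_000) / (disks * mb_per_s_each) / 3600
    print(f"~{hours:.0f} hours best case")  # ~15h here; fragmentation and competing
                                            # workloads can stretch a big pool into days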


> You may wear out your (spinning) disks with scrubs more than you do with real workloads.

This seems like an extraordinary claim requiring extraordinary evidence, especially since the notion of wear out is only applicable to SSDs (as shorthand for write endurance).

I certainly believe that mechanical disks, with all those moving parts, can have their failure rates increased by increased use, but it's not safe to assume even something as high as a linearly proportional relationship, considering which parts move when.


That's also proper RAID5/6 maintenance. My main/recent familiarity is with the LSI hardware RAID implementation, where they call it a "patrol read". I believe mdraid has checkarray.

I'm not sure if you meant to imply that ZFS is different from standard RAID in this regard, but it doesn't seem as though it is.


It's called scrubbing.


Did you mean to reply to my above question? If so, I'm unclear as to what you're trying to get across.

Is ZFS scrubbing different than the other RAIDs' (scheduled or schedulable) reads of the entire array, other than nomenclature?


We use ZFS; how can I check whether we are doing a ZFS scrub every week?


'zpool status' will tell you the last time a scrub was run, or, if one is currently running, information about its progress. Then check your crons and see that they make sense and match what you see with zpool status.
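
If you want something less manual than eyeballing it, a small sketch along these lines can work; note the "scan:" line is human-readable output whose exact wording varies across ZFS versions (and this assumes a single pool), so treat the parsing as an assumption to verify against your own system:

    # Sketch: warn if the last completed scrub is older than N days, by
    # parsing the human-readable "scan:" line from `zpool status`.
    import re
    import subprocess
    from datetime import datetime, timedelta

    MAX_AGE = timedelta(days=7)

    out = subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout
    m = re.search(r"scrub repaired .* on (\w{3} \w{3}\s+\d+ \d+:\d+:\d+ \d{4})", out)
    if not m:
        print("no completed scrub found")
    else:
        last = datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
        age = datetime.now() - last
        print("last scrub:", last, "-", "OK" if age <= MAX_AGE else "OVERDUE")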


Well, it is not every week.

> scan: scrub repaired 0 in 0h36m with 0 errors on Sun Aug 12 06:00:50 2018

I don't see any cron jobs though...


Given the speed of that, I'd bet you don't have a huge pool (or if you do, that's a really nice speed). I'd bet someone's doing it manually. That's what I do for my own systems (about once a month) since they don't see heavy use (personal and parent's file servers).


There are two nasty things going on with RAID5: a) You can tolerate only one drive failure, no more. Once you've tolerated that one, the next failure takes the array with it.

AND b) a RAID rebuild causes MASSIVE stress on the remaining drives. Seriously massive stress, beyond what distributed systems or RAID 1 / 6 based systems do.

This occurs because RAID5 not only rewrites parities, it also has to re-read data from all drives while writing parities across all drives. That's a lot of random access, and spinning drives, especially larger ones (see the coincidence?), dislike that. That tends to cause similar-aged drives to die, and then your RAID is gone.

I'd suppose this is less hard on SSDs than on HDDs. But there are still a lot of rewrites going on, and SSDs don't like that either.


a) RAID5 can fault precisely one drive without data loss. Two overlapping errors (e.g. two errors in one stripe) and you're up shit creek without a paddle. This is the generic definition for RAID5, not an implementation detail.

b) What makes you think RAID6 doesn't also incur this? The only difference is that RAID6 also includes a Q parity block in each stripe, so the only saving is that, on stripes where you don't need to read the parity, you save 1/(N+P-1) of the IOs per drive.

RAID6 is still going to need to recompute ~2/(N+P) parities (one P, and one Q) for rebuilding a drive over (N+P) stripes; and reconstruct the data for the rest (depending on how P and Q are implemented, they could interweave which they use for reconstructions, but AIUI it's generally more expensive to recompute from Q than P, and R than P or Q, in many instances of this math).
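
To make the single-parity part concrete, here's a toy sketch of the XOR reconstruction a RAID5 rebuild does per stripe (block contents are made up, and the Reed-Solomon math behind Q is not shown):

    # Toy single-parity stripe: P is the XOR of the data blocks, and a lost
    # block is recovered by XOR-ing everything that survived. (RAID6's Q
    # parity is Reed-Solomon over a Galois field and costlier to compute.)
    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]   # made-up data blocks in one stripe
    p = xor_blocks(data)                 # the P parity block for the stripe

    lost = 1                             # pretend the disk holding "BBBB" died
    survivors = [d for i, d in enumerate(data) if i != lost] + [p]
    assert xor_blocks(survivors) == data[lost]
    print("reconstructed:", xor_blocks(survivors))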

c) Many RAID systems can rebuild starting from the starts of the respective disks and streaming along (or, in recent ZFS's case, coalescing the IOs to be in sequential order groups and issuing them), though certainly not all of them.

The logic "usually" goes that RAID5/RAID6 rebuilds are dangerous because they involve reading all the bits, so to speak, so if you don't have an equivalent of scheduled patrol reads to be sure bits at rest that haven't been read by users haven't gone south, you'll first discover this...during a rebuild, and with RAID5, you're SOL.



