ZFS is mysteriously eating my CPU (brendangregg.com)
332 points by mfrw on Sept 6, 2021 | 102 comments


Actual issue here: https://github.com/openzfs/zfs/issues/6531

It seems the whole thing took place in 2017 and was fixed then (first by not calling the reclaim if used ARC memory was zero, then the root cause was also fixed in 0.7 by using a PRNG instead of the default CSPRNG).


Where do you see the actual issue was fixed in that ticket?


You’re right, the hack/workaround to not reclaim ARC if it’s below a certain size was written but not merged [0]. But the PRNG change (the actual CPU usage fix) is real and in 0.7+ as far as I can tell. Ideally both would have been committed.

[0]: https://github.com/openzfs/zfs/pull/6544


Might be a good idea to go to the PR and mention that. The stale bot has done its awful work… again.


Ahhh, sweet sweet reuse [1]. If code is poetry, most code is noetry; software reuse should be renamed software refuse. Refuse it straight to the /bin.

No code.

[1]: https://ocw.mit.edu/courses/aeronautics-and-astronautics/16-...


> stale bot closed this on 23 Nov 2020

as is tradition


Well that was a wild ride - I personally lost it at "We aren't using ZFS." :) I'm surprised that the kernel module was even loaded, though; if they weren't using it, and the only team using ZFS was a minor experimental group, why was it even installed, let alone loaded, let alone burning CPU on a cryptographically randomized noop?


> I'm surprised that the kernel module was even loaded, though

On Ubuntu, some kernel packages also install the ZFS module package. I think (but am not 100% sure) that if they're on that OS with those packages installed, the zfs module will be loaded on startup.
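
If you want to check whether it's actually loaded and stop it from coming back, something like this should do it on Ubuntu (a rough sketch; I'm assuming the stock zfs module/package names, and a blacklist alone won't stop a service that modprobes it explicitly):

  # is the module loaded, and is anything using it?
  lsmod | grep zfs

  # unload it for the current boot (only works if no pools are imported)
  sudo modprobe -r zfs

  # stop it from being auto-loaded on future boots
  echo "blacklist zfs" | sudo tee /etc/modprobe.d/blacklist-zfs.conf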


Netflix leans towards FreeBSD.

Is ZFS removable there without recompiling?

https://papers.freebsd.org/2019/fosdem/looney-netflix_and_fr...


Disclaimer: despite using it at home, my BSD-fu is atrocious so I could well be wrong, and I’m not in front of a computer to try. However, I believe that ZFS is its own kernel driver and so, assuming it’s not being used then:

  # kldunload zfs.ko 
should do the trick. You would of course have to change the install setup to use UFS rather than a zroot.

I suspect it’s probably loaded by default regardless of OS drive file system, though.
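
Roughly (same caveat that I'm not at a machine to verify): check what's loaded first, and if you want it gone across reboots the loader.conf knob is the usual place:

  # see whether the module is loaded
  kldstat | grep zfs

  # unload it (will fail if any pool is imported)
  kldunload zfs

  # and in /boot/loader.conf, make sure this line isn't present:
  # zfs_load="YES"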


For OpenConnect, not necessarily everywhere else.


From the top of the article (added?) "I summarized this case study at Kernel Recipes in 2017."

The bottom of the article notes that his book Systems Performance 2nd Edition is 45% off until 9/13.

Edit: looks like the comments about "didn't I see this a few years ago?" have been deleted.


Not added, but maybe I'll add something extra to make it more clear. I have a backlog of interesting case studies I'm trying to find time to share.


Not sure if you're comfortable sharing it, but if so, how many copies have you been able to sell?

Also, any idea how something like safari online impacts those numbers?

Thanks again for the book, it's an amazing read that I recommend to anyone that is interested in the area.


Systems Performance 1st edition sold over 10k copies for the English edition, and that was a book that was half Solaris at a time when Solaris was falling in market share. 2nd edition is more Linux focused, and shows how eBPF fits into the toolset, so I'd assume it'll sell more than 10k. (The publisher has been changing their authors' portal and I haven't been able to log in for months now, so I don't know the current numbers for the 2nd Edition.) Various companies make it required or recommended reading for new engineers, which helps a lot (thanks!).

How much pirated PDFs hurt these books I'm not sure - maybe a little, maybe a lot. With my BPF book, a "rough cut" was published on Safari and then that did the rounds as the PDF (and still does) even though it was unfinished and buggy. What annoyed me the most about it was that people were reading a broken version and may think it's the final version.

Thanks for getting and recommending it!


Where is it 45% off? I somehow can't find that.


InformIT; the link is at the bottom of the page.


Love the look of all the tools Netflix has at their disposal


A quick-to-use remote profiler does wonders for an org. By quick to use I mean it takes me 10 seconds to kick off a remote profiling scrape. Someone entrepreneurial, please make a purchasable service out of that; I'll be feeling lost once I quit my corp (not Netflix). This stuff is 100x more useful than remote debugging, which ultimately almost nobody uses in practice.
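
For anyone wanting the poor man's version of that, a sketch with stock tools (assumes perf on the target and Brendan's FlameGraph scripts locally; the host name is made up):

  # sample all CPUs on the target for 30s at 99Hz, pull the stacks back
  ssh prod-host 'sudo perf record -F 99 -a -g -- sleep 30 && sudo perf script' > out.stacks

  # fold and render locally (scripts from the FlameGraph repo)
  ./stackcollapse-perf.pl out.stacks | ./flamegraph.pl > profile.svg

Not the 10-second in-house experience, but it gets you a flame graph from a remote box with nothing fancy installed.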


Shameless plug: We just launched Prodfiler last week -- it's continuous in-production profiling based on eBPF, handles C/C++/Go/Rust/JVM/Python/Perl/Ruby (with NodeJS and .net in the pipeline), and it does not require symbols on the machine, nor recompiles with framepointer, nor special JVM flags.

https://prodfiler.com

:-)


Datadog has continuous profiling that is pretty useful. I'm pretty sure they follow Brendan religiously.. I hope they release something like flamescope next.


And he wrote "Normally I'd SSH into a machine", but you shouldn't even be able to do something like that if you've got infrastructure-as-code: the machines are ephemeral, they probably don't have any of the tools you need to do an analysis, and they shouldn't even have SSH or open ports unless some tool is interacting with them; in practice, they should be disposable on a whim whenever there's an update.

Had a colleague who asked for SSH access to production machines to debug an issue. The ops team asked what he wanted to do; the guy just wanted to look at which env vars were set. Ops team told him how to do his job - log the configuration - instead of giving him access, because they had a mandate to ensure five or seven nines of uptime. Can't risk it.

I had a lot more respect for the ops people there than I had for my fellow SWE's.


Most issues we hit are solved by the monitoring GUIs (Atlas, FlameCommander, PerfDash, S3 log collection, etc.), but some aren't, and there's a ton more information available from the CLI tools. Imagine losing your five nines because no one can SSH on and try a few ad hoc tools.

But yes, CLI can be dangerous, even for observability. E.g., people running strace(1) on production apps and causing outages due to the strace overhead (I wrote a prior post about that). You need to understand the risks and overheads of all tools. It's why I have a "pull no punches" policy when writing eBPF tool man pages [0]: If the overhead can be bad, it should say so clearly.

[0] https://github.com/iovisor/bcc/blob/master/CONTRIBUTING-SCRI...
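
For anyone who hasn't been burned by it yet, here's a quick toy demo of the strace overhead (nothing production-specific, just a syscall-heavy workload):

  # baseline: ~500k tiny syscalls
  time dd if=/dev/zero of=/dev/null bs=1 count=500000

  # same workload under strace: expect it to be orders of magnitude slower
  time strace -o /dev/null dd if=/dev/zero of=/dev/null bs=1 count=500000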


> log the configuration

That's a change, inherently much more risky than the "cat /.../env" the colleague wanted to do.

Also the change might well cause the problem to go away, and now you know nothing instead.

The principle is good, but it sounds like it has taken on a life of its own. The ops guy could also have executed the cat command on the spot with the right privileges. Sure, it's gatekeeping, but so is the four-eyes principle, and a little of that gatekeeping can be necessary to keep those nines rolling.
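
For what it's worth, on Linux you don't need a code change at all to see a running process's environment, just read access to procfs (the PID here is made up):

  # env vars of a running process, NUL-separated in procfs
  sudo cat /proc/1234/environ | tr '\0' '\n'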


> Ops team told him how to do his job - log the configuration - instead of giving him access

"do your job" is a really rude way to refer to debugging by logging instead of debugging interactively.


I suspect this usage is an awkward phrasing of “told him how to accomplish the result he wanted”; even though in common usage it implies a lack of overall competence, here I think it just implied a lack of a specific recipe/technique.


Doesn’t seem so, as GP specifically said this caused them to lose respect for their colleagues.


That’s a possible interpretation of their comment, but not the only interpretation and certainly not something they specifically said.

(Another interpretation is that the author previously held Ops in higher regard than SWE and that this event did not change that.)


I imagine that if overzealous rule enforcement were the cause of the respect, the OP would've always had more respect for Ops :D


Sounds weird; logging env vars would be even more prone to leaking credentials than just letting an employee SSH into the machine and look at them.

Also, as others noted here: if one guy SSHing into one machine can screw up your seven-nines uptime, then you never had seven nines of uptime (your company is probably just gambling on that uptime metric until some component fails).


You had a mandate for 5-7 nines uptime and you couldn't tolerate a problem on a single node?


yup, that sounds strange. if the environment is that advanced, then there shouldn't be a problem of taking the node in question out of the service pool and let the devs loose on it (assuming there's no info that they aren't supposed to see). when they're done, just recycle it.


Where did they say it was a single node issue?


I think they mean that even if he messes up a single node, it should still be fine.


I see, that makes more sense, thanks for the clarification.

I’d still rather not have to debug arbitrary mutations to the env or file system in a production container though.

It seems to me that shelling into a prod container/VM is discouraged not because you might cause it to fail, but because you might produce undefined behavior while claiming it is healthy (more like Byzantine faults).

For example if you unset a single env var by mistake, then 1/N of your requests will potentially fail. Debugging this kind of issue is a nightmare.

Not to mention that developers can often run arbitrary SQL from a prod shell when the app is backed by a DB.


> For example if you unset a single env var by mistake, then 1/N of your requests will potentially fail.

Or far worse, not fail.


I disagree.

There are always bugs that will happen only on the prod machines. Sure, they are rare, but they exist.

> just wanted to look at which env vars were set. Ops team told him how to do his job - log the configuration

Well, that's not risk-free either. It needs a code change, and you risk exposing secrets or logging more than you should. (Though I agree with the general idea that you shouldn't be SSHing in if you can do it another way.)


Side note: Why do we have to talk about these things in such highly charged terms in our profession? It sounds like all that happened here is that one person made a reasonable request, and another person gave a reasonable explanation of why the request couldn't be granted.


Because people establish protective, totalitarian control over their corporate turf (because budgeting), and it becomes full of dogma (the gods said thou shalt not SSH and we said amen).

Dogma results in religious wars.


I know at least one place where they do allow SSH as a last resort, but the VMs that get SSH'd into are flagged to be recycled automatically.

So you can log in and debug, but once you're done, the VM is replaced with a clean one.


If messing up a single instance is going to take your site down, you are nowhere near seven 9's.


This is the sort of fanboyism that I usually avoid when working. It is perfectly reasonable to SSH into a node for debugging. In fact I do it very often in a heavily automated infrastructure as code environment.


Not to mention that if you're using Ansible to manage machines, it depends on SSH to do its job.


Exactly. Any time this topic comes up, the people arguing for no SSH are the ones who have never operated a production environment successfully, and they are busy chasing ghosts.


Memory reclaim is where all the kernel bugs are.


Reminds me of how, after years of massive systemwide latency spikes, I finally discovered that XFS was to blame. It was blocking on I/O writes in the reclaim path... so even random processes that wanted to allocate some RAM but didn't do any disk I/O ended up blocked. This was on a machine with tons of free RAM (the reclaim was for clean cache).


Is this an argument for using Rust?


No amount of avoiding memory safety bugs will save you from writing fundamentally incorrect code. Rust stops you from making dumb mistakes, and (as a high-level language; this part isn't unique to Rust in any way) lets you create abstractions to stop other people from making less dumb mistakes, but it will do absolutely nothing if your model of the process is wrong.

The only way to really be sure of that is extensive testing at multiple levels, and ideally some form of verification like a TLA+ model.


No. Memory reclaim != memory leaks != memory safety. This is about filesystem caches.


For the love of 3$&$>=*=$, not rust again!


[flagged]


Even good software has bugs; ZFS is amazing most of the time, but that's hardly a guarantee of zero issues ever - but having bugs occasionally also doesn't make it not an amazing tool most of the time.


> but having bugs occasionally also doesn't make it not an amazing tool most of the time.

For most software, yes. For a filesystem, not so true.


Every major filesystem has had bugs. Also, notice that this bug never endangered data, it was just wasting CPU.


Dan Luu would like to have a word with you about how magnificently atrocious filesystems are.

Files are hard.


From the github issue linked elsewhere in the thread:

> KiB Mem : 16431044 total, 215356 free, 15894868 used, 320820 buff/cache

That's... not a lot of headroom. I'm impressed their OS is pared down to the point they can run in this state with this being the only pathology.


Looks like they run a known [Java] workload constantly. The Java heap is allocated at startup (and is used for dynamic allocations). I don't see why they shouldn't be doing this. They also have a ridiculous amount of instrumentation and monitoring, so if it were a problem it would have been identified long ago.


This was a constant source of problems with any Java app I've ever used. Why can't the heap be dynamically sized in cooperation with the OS the JVM is running on?

I imagine it's some early backwards-compatibility JVM restriction, but it is so annoying.


Java can be tuned to release freed memory to the OS:

https://www.geekyhacker.com/2019/01/04/jvm-does-not-release-...
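
The usual knobs look roughly like this (GC- and JDK-version-dependent; the ratios and jar name are just illustrative, not recommendations):

  # cap the heap, but let the GC uncommit memory when the heap is mostly free
  java -Xms512m -Xmx4g \
       -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 \
       -jar service.jar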


Free memory is wasted memory.

Applications that cache data have to do it on the heap.

Yes, there should be a way for the system to coordinate usage of cache/free memory and coordinate with a VM’s GC. Until we have that, big apps will exhibit the aforementioned memory footprint.


I am not talking about the Netflix case from the OP, but rather the general case where you are likely to have multiple services on a single host. Just imagine if a couple of them were Java-based: you'd have to figure out the maximum needed for each of those, sum it up, and ensure the server has that much memory even if they are unlikely to all need it at the same time. If we are talking about waste, that's wasteful to me.

Even in a more modern setting with containers, you'd have similar issues.


I didn't mean to imply that it was a problem or that they shouldn't be doing it - more that I'm envious of the operational chops to pull it off.


Skip the percentages for a second: >300MB of cache and >200MB free is a lot of memory! That's enough memory to run an entire server without breaking a sweat. Rather than calling proper functioning impressive, I'd be quite upset if getting that low caused problems.


If they were using ZFS, they probably wouldn't have golfed their memory allocations and had spare (don't the Go folks call this 'ballast'?) memory, so they wouldn't have triggered the routine?

The bug was that they weren't using ZFS. Not that ZFS had a bug.


> If they were using ZFS, they probably wouldn't have golfed their memory allocations and had spare

You're describing a problem with ZFS (being wasteful with memory), and then saying that if they used ZFS their workaround for that problem would have prevented this problem.

Do I interpret your post correctly?

If so, "the bug was that they weren't using ZFS" is some pretty wild spin.


Now ZFS is eating CPUs and causing a chip shortage


Nice job!


I usually find that people running ZFS run into way more problems than people using ext4. (Edit: I'm surprised to see how many people run into issues with Btrfs.)

It might be good for some specific purposes, but I'd reckon the majority of people using it don't fall into those. Using it "just because" is probably more trouble than it's worth.

> ZFS really wasn't in use, ever! But at the same time, it was eating over 30% of CPU capacity! Whaaat??

Interesting bug, sounds like the best option is to not even have the module loaded then. (Not blaming the developers here, necessarily, but yeah, it's a bug)


Usually in my experience whenever someone has a problem with zfs it’s solvable in one way or another and the devs are very responsive on irc and GitHub.

On the other hand, a lot of the problems caused by Btrfs are unrecoverable and a mess. Ext4 is stable but lacks the feature set that makes customers go to ZFS instead (and that is fine; software should do one thing well and right).


Btrfs gave me the most trouble by far. And most people say the same in my experience. Replicating half of ZFS's features with ext4 is a whole lot of trouble too.


We have a machine that has a large ZFS storage pool, and one smaller btrfs volume on a pair of SSDs. If I delete a large directory hierarchy on the btrfs volume and then run sync, it locks up the entire machine for 5-10 minutes. WTF. Nothing like that ever happens with ZFS or ext4.


I've had a lot to learn with zfs, but it's never just flat out lost my data, unlike btrfs (in about 2017)


Consider yourself lucky. There are data-corruption bugs all over the now-diverged ZFS codebase(s). Check the Solaris 11 SRU notes from the last 24 months for many examples.


Raid5?


Not sure if you are suggesting RAID5 as a relevant problem or solution, but it's arguably the former! :)

Regardless, ZFS has configurations equivalent to most "traditional" RAID options anyway (raidz1 is their RAID5) but with the added benefits of all the other great stuff ZFS gives you, like checksums, compression, deduplication, snapshots, etc.
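
For reference, a minimal sketch of the RAID5-ish layout (pool and device names are made up):

  # three-disk raidz1: one disk's worth of parity, like RAID5
  zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc

  # the extras come along almost for free
  zfs set compression=lz4 tank
  zfs snapshot tank@before-upgrade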


Probably it was a guess that you were using a RAID5 mode of Btrfs, and that it was the reason for the failure.

Not using RAID5 or RAID6 modes on Btrfs is the first piece of advice people considering Btrfs get.


You are now even warned by mkfs.btrfs itself when you attempt to do it. [0]

[0]: https://www.phoronix.com/scan.php?page=news_item&px=Btrfs-Wa...


Right, I failed to parse the comment as asking whether that's the Btrfs issue I was referring to. Yes, I am familiar with Btrfs' RAID blunder. Somewhat ironic, though, since RAID5 is rarely the best idea regardless of the system implementing it.


Indeed, the RAID5 implementation has been notorious for some time (and may still be); if you Google it you'll soon find the stories.


For a homelab use-case, I backed out of creating a ZFS mirrored storage pool for network storage. It wasn't that the steps were too complicated or that ZFS let me down as such.

Just that RAID in general and ZFS put my files in a black box, and I didn't really need a significant boost in network reads. I settled for plain old ext4 and periodically rsyncing files to another disk as a mirror (I don't mind some interim data loss + writes are fast). I use SFTP or sshfs for accessing the drive.
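
In practice it's just something like this on a timer (paths made up; --delete is what makes it a mirror, which also means deletions and corruption propagate on the next run):

  rsync -a --delete /mnt/live/ /mnt/backup/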


Any RAID solution that does block/bit-level striping will be a "blackbox" as you describe it. That's not a bad thing (nor is your approach); there are pros and cons to both.


Yup, and as far as "blackbox" solutions go, if something horrible should happen to your system, you want something that works like ZFS as opposed to, say, some high-performance but proprietary hardware RAID.

As someone who has a couple times needed to rip 100TB of ZFS disks out of something, put them into a different machine with a different architecture or OS in some random order, and access everything without having lost anything, it's hard for me to overstate how great it is that such a thing is even possible.


Hopefully, the pool was exported before you moved the disks. I came very close to losing data (but ultimately didn't) when I yanked a pool out of an OpenIndiana system and tried to import it on Ubuntu without exporting it first. ZFS really didn't like finding the disks on devices with radically different names than where it had left off.

It ended up re-importing OK when I brought it back to the OpenIndiana box. Whew! I then did an export, and Ubuntu was then able to import the pool.


I make my pools with the serial number /dev/disk/by-id/ names, like ata-WDC_WD30EFRX-68EUZN0_WD-XXXXXXXXXXX.

That is not going to change between different machines as far as I know, I might be wrong about this. I made some bad choices (in hindsight) when making this pool for my uses, but it's still going strong many years later. [1]

This current pool (which is now old, 50k+ power-on hours on these disks) has survived a motherboard dying randomly and one or two failures of the SSDs with the OS on them.

Always just installed an OS on a new SSD and the pool has been picked up just fine.

[1] I went with RAID-Z2 for 6x3TB WD Reds where I probably should have made mirrored pools or something like it to gain more space? It's been a while since I looked at it. Can't really expand this pool or add more storage without replacing each disk one by one with something bigger.

I could make another pool with new disks but I'd lose another 20% to parity.
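
For anyone copying this approach, it looks roughly like the following (the serials below are placeholders, not real drives):

  # find the stable by-id names
  ls -l /dev/disk/by-id/ | grep ata-

  # create the pool against those names so it survives moving between machines
  zpool create -o ashift=12 tank mirror \
      /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-AAAAAAAA \
      /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-BBBBBBBB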


I use those names as well, at least now that I entirely use ZoL. That wasn't an option back in the day with OpenSolaris and Illumos. You'd be using names like c0t0d0 instead, and woe to you if it found the wrong disks in the wrong places (it would just refuse to import claiming the pool was faulted).

Caveat: things may have changed since then - it was at least six years ago.


I have only used ZoL myself, on Debian and Ubuntu. I really sorta need to buy new disks for this, but they are so expensive right now. I should probably just backup all the stuff I really care about and offload most of this to some sort of cloud storage/rented space somewhere. I don't know that these drives will fail (WD Red 3TB), but the 50k+ power on hours are starting to worry me a bit.

They have not seen a huge amount of reads/writes though, if we don't count the weekly scrub and weekly usage by just me.


When you import without exporting, all you risk losing is the last few seconds of data transactions. If a motherboard dying was enough to lose all your data people wouldn't really recommend ZFS.

So hopefully the pool was exported, but the worst case scenario isn't very bad.


How exactly did you do that? I'm learning ZFS still and I'm planning to move some of the disks in my first NAS to a better machine.

Which ZFS commands?


`zpool export` on the machine you're taking the array out of and `zpool import <pool name>` on the receiving end will do the job in most cases.
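
Spelled out (the pool name is a placeholder):

  # on the old machine, before pulling the disks
  zpool export tank

  # on the new machine: with no arguments this lists importable pools
  zpool import
  zpool import tank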


Nothing special is necessary. All you need to do is import the pool as normal.

There’s a potential problem, in that you can’t import a pool on an OS version that lacks the feature flags that are enabled on the pool. The way to solve that is to choose a common subset when creating the pool.
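
One way to do that at creation time is to start with features disabled and opt in to only the ones every OS you care about supports, e.g. (the feature and device names here are just examples):

  # -d disables all feature flags; enable a conservative subset explicitly
  zpool create -d \
      -o feature@lz4_compress=enabled \
      -o feature@async_destroy=enabled \
      tank mirror /dev/sda /dev/sdb

I believe newer OpenZFS also has a compatibility= pool property that does the same thing from a named preset.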


Unless they fixed it, another edge case is a change in processor endianness. I started working on a patch a long time ago, but it necessitated changes to code that had remained unchanged since the original release... not a blamelog I'm totally comfortable with.


What?

CPU endianness switches work fine, IME - I have pools I originated on SPARC Solaris that work fine on x86. I just got a patch landed for an edge case in one recent feature not interoperating properly between endiannesses, but other than that it's worked fine for me.

What kind of issue did you see?


> CPU endianness switches work fine

That's great news; hopefully that rare potential problem got addressed in the last year or so.

Steps to recreate: compile linux (maybe FreeBSD as well, that is a little more fuzzy) for LE. Boot new kernel and setup a new pool. Recompile for BE. Boot new kernel and enjoy bootloop...

lol, actually things are starting to come more into focus... page size played a role in the whole thing. As I said, edge case :) No potential for data loss though.


I'm guessing this was probably PPC, given the rarity of the other options?

Sounds like a fun experiment to run down. Maybe I'll go look when I'm done with the current nightmare I'm tinkering with.


Yup, Raptor Blackbird. I'd send along my notes if I didn't think they'd be more misleading than helpful. Left that one in a mess that'd be difficult to untangle, but I remember the trail basically ending in a file that was practically unchanged since the original source got released. Because of the code vintage I really wanted to be sure that I didn't introduce performance regressions, so that led to me turning my attention to live kernel debugging over serial, which led to uncovering an even more ancient tty bug (seriously, we're talking unix fountainhead old).

She swallowed the spider to catch the fly... I'm presently writing something to crack 8051 firmware xnor'd by a 64 byte key, I don't remember how - but the previously described ZFS edge case somehow got me to this point.


I am...familiar with falling down rabbit holes.

I cut a patch to fix sparc64 building with the new zstd feature. I then discovered the new zstd feature was broken for endian portability because lol bitfields.


I hadn’t heard of that one. Does send | receive still work?


Nope, the system would instantly panic - presumably in order to protect data integrity... imagine the damage possible otherwise. It was actually a fairly ponderous debugging effort, as I remember it, because it was occurring on an already shaky POWER9 setup I was using to port other stuff. ZFS worked great for any vdev setup on any OS compiled with the same endianness, but plugging in so much as a thumb drive previously provisioned from a different endian host... insta-console-vomit. Definitely an uncommon problem though :)


> I settled for plain old ext4 and periodically rsyncing files to another disk as a mirror

This isn't a mirror, not in the sense that most would understand: no performance boost, no protection from bit rot, and it's unnecessarily wasteful by every potential metric. If you wanted to intentionally design a system to propagate errors and render backups useless, you would start with this kind of setup. It makes sense to avoid all the RAID-related problems associated with hardware solutions... but I'm drawing a blank on rational reasons to do what you've described. Super weird boot manager + physical space constraints?


Two standard external USB disks of different ages, connected to a powered USB hub attached to a ROC-Raspberry Pi. The older one is the "live" one that serves all data requests. The newer disk is usually attached to the hub, but it's also disconnected during summer.

If the "live" disk fails, the other disk will eventually replace it.

> render backups useless

Like I mentioned, data loss isn't a concern. This is mainly hosting code repositories and media files.

What's my use-case? My laptop, that has a paltry 256GB SSD, keeps running out of disk space, and I often find myself plonking files that aren't super critical on the network drive.


> but it's also disconnected during summer.

I knew there was a reason I was getting a "can't send mail more than 500 miles" vibe... Are we talking small-form-factor spinning rust, or thumb drives? Because if you thought ZFS was a black box, look into flash wear-leveling algorithms; I'm hard pressed to think of any storage media more prone to spontaneous bitflips. Unless you are checksumming before and after every movement of data, your files are going to silently get corrupted in ways you won't notice until something breaks.

Anyway, I've had way more instances of bitflips than drive failures - and I've had to deal with that kind of data corruption escaping detection and making it into backup archives. That is why I obnoxiously promote ZFS to the degree I do... it totally eliminates that risk automatically. No joke, this thread prompted me to run a manual pool scrub a month in advance for an 8TB archive pool. Since it is archival, it gets few reads - which means it gets few automatic checksum verifications outside of the scheduled bi-monthly scrub. Well the scrub is only half way through, but it already shows that at some point in the last month a 128KB block on a single 3.5" disk got silently corrupted. Data corruption nipped in the bud, thanks to ZFS <insert whatever Sun's jingle was>.
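
(The manual version of that, for anyone following along at home; the pool name is a placeholder:)

  # kick off a scrub, then watch for checksum errors / repaired bytes
  zpool scrub tank
  zpool status -v tank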


Again, you're focusing on data consistency, whereas I'm not really concerned about a bunch of cloned throwaway OSS git repositories or npm modules or MP3 files being corrupt over time.

These are old WD Passport spinning hard-disks that were never designed or optimized to run as NAS drives. I had them lying around doing nothing. This is a salvage operation for the old drive.


> Again, you're focusing on data consistency...

All that talk of "homelab use-case", "mirrored storage", "RAID", "rsync"... obviously what is under discussion is how ZFS is a poor fit for the tmpfs tier garbage data use-case, dunno how I missed it.



