In addition to my other comments about parallel IO and unbuffered IO, be aware that WS2022 has (had?) a rather slow NVMe driver. It has been improved in WS2025.
Robocopy has options for unbuffered IO (/J) and parallel operations (/MT:N) which could make it go much faster.
Performing parallel copies is probably the big win with less than 10 Gb/s of network bandwidth. This will allow SMB multichannel to use multiple connections, hiding some of the slowness you can get with a single TCP connection.
When doing more than 1-2 GB/s of IO the page cache can start to slow IO down. That’s when unbuffered (direct) IO starts to show a lot of benefit.
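For reference, a combined invocation might look something like this (the source and destination paths and the thread count are placeholders to tune for your setup):

    robocopy D:\data \\fileserver\share /E /MT:32 /J /R:1 /W:1

    rem Confirm that SMB multichannel is actually being used on the client
    powershell Get-SmbMultichannelConnection

/E copies subdirectories, /MT:32 runs 32 copy threads, /J requests unbuffered IO, and /R:1 /W:1 keep retry stalls short.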
The strange thing is, I did have /MT:32 on (added in a comment at the bottom of the page because I had to go to bed). I like to stick with defaults but I'm not that inept. /J probably shouldn't matter for my use case because 125 MB/s just isn't that much in the grand scheme of things.
A workload that uses only a fraction of such a system can be corralled onto a single socket, or a portion thereof, and use local memory through the use of cgroups.
Most likely other workloads will also run on this machine. They can be similarly bound to meet their needs.
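A minimal sketch of that with cgroup v2 on Linux (the cgroup name, the CPU range 0-15, and NUMA node 0 are placeholders; substitute the socket's actual CPUs and its local memory node):

    # Enable the cpuset controller for child cgroups
    echo +cpuset | sudo tee /sys/fs/cgroup/cgroup.subtree_control

    # Create a cgroup pinned to one socket's CPUs and its local memory node
    sudo mkdir /sys/fs/cgroup/dbwork
    echo 0-15 | sudo tee /sys/fs/cgroup/dbwork/cpuset.cpus
    echo 0 | sudo tee /sys/fs/cgroup/dbwork/cpuset.mems

    # Move the current shell (and anything it launches) into that cgroup
    echo $$ | sudo tee /sys/fs/cgroup/dbwork/cgroup.procs

For a single process tree, numactl --cpunodebind=0 --membind=0 gets you much the same effect.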
That’s not the kind of software I had in mind. I mean single large logical systems—databases being likely the largest and most common—that can’t meaningfully be distributed & are still growing in size and workload scale.
- Consumer drives like the Samsung 980 Pro and WD Black SN850 use TLC as SLC when about 30+% of the drive is erased. In that state you can burst-write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
- Write amplification factor (WAF) is not discussed. Random small writes and partial block deletions will trigger garbage collection, which ends up rewriting data to reclaim freed space in a NAND block.
- A drive with a lot of erased blocks can endure more TBW than one that has all user blocks with data. This is because garbage collection can be more efficient. Again, enable TRIM on your fs.
- Overprovisioning can be used to increase a drive’s TBW. If, before you write to your 0.3 DWPD 1024 GB drive, you partition it so that you use only 960 GB, you now have a 1 DWPD drive.
- Per the NVMe spec, there are indicators of drive health in the SMART log page.
- Almost all current datacenter or enterprise drives also support an OCP SMART log page. This allows you to observe things like the write amplification factor (WAF), rereads due to ECC errors, etc. (example commands for reading both log pages are sketched below).
You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.
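For reference, here is roughly how to read the log pages mentioned above on Linux; /dev/nvme0 is a placeholder, and the OCP subcommand needs a reasonably recent nvme-cli build and a drive that actually implements the OCP log:

    # Spec-defined SMART/health log: percentage used, media errors, available spare, etc.
    sudo nvme smart-log /dev/nvme0

    # OCP extended SMART log: WAF, read-recovery counters, and so on (if supported)
    sudo nvme ocp smart-add-log /dev/nvme0

    # The same health data via smartmontools
    sudo smartctl -a /dev/nvme0

    # Trim free space on all mounted filesystems that support discard
    sudo fstrim -av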
> - Consumer drives like the Samsung 980 Pro and WD Black SN850 use TLC as SLC when about 30+% of the drive is erased. In that state you can burst-write a bit less than 10% of the drive capacity at 5 GB/s. After that, it slows remarkably. If the filesystem doesn’t automatically trim free space, the drive will eventually be stuck in slow mode all the time.
This is true, but despite all of the controversy about this feature, it’s hard to encounter in practical consumer use patterns.
With the 980 Pro 1TB you can write 113GB before it slows down. (Source https://www.techpowerup.com/review/samsung-980-pro-1-tb-ssd/... ) So you need to be able to source that much data from another high speed SSD and then fill nearly 1/8th of the drive to encounter the slowdown. Even when it slows down you’re still writing at 1.5GB/sec. Also remember that the drive is factory overprovisioned so there is always some amount of space left to handle some of this burst writing.
For as much as this fact gets brought up, I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations, but even in slow mode you’re filling the entire drive capacity in under 10 minutes.
> You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.
This has always been the case, which is why even a decade ago the “pro” drives came in odd sizes like 120 GB instead of 128 GB.
Products like that still exist today and the problem tends to show up as drives age and that pool shrinks.
DWPD and TBW ratings, as used on modern consumer drives, are just different ways of communicating that contract.
FWIW, if you do a drive-wide discard and then partition only 90% of the drive, you can dramatically reduce the garbage collection slowdown on consumer drives.
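A rough sketch of that on Linux, assuming a blank /dev/nvme0n1 you are willing to wipe (device name is a placeholder, and this destroys all data on it):

    # Discard every LBA so the controller knows the entire drive is free
    sudo blkdiscard /dev/nvme0n1

    # Partition only the first 90% and leave the rest unallocated as extra overprovisioning
    sudo parted -s /dev/nvme0n1 mklabel gpt mkpart primary ext4 0% 90%
    sudo mkfs.ext4 /dev/nvme0n1p1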
In the world of ML and containers you can hit that if you, say, have fstrim scheduled once a week to avoid the cost of online discards.
I would rather have visibility into the size of the reserve space through SMART, but I doubt that will happen.
> You’re also missing an important factor: Many drives now reserve some space that cannot be used by the consumer so they have extra space to work with. This is called factory overprovisioning.
I think it is safe to say that all drives have this. Refer to the available spare field in the SMART log page (likely via smartctl -a) to see the percentage of factory overprovisioned blocks that are still available.
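For example (device name is a placeholder), both tools expose that field as a normalized percentage:

    sudo smartctl -a /dev/nvme0 | grep -i spare
    sudo nvme smart-log /dev/nvme0 | grep -i spare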
I hypothesize that as this OP space dwindles writes get slower because they are more likely to get bogged down behind garbage collection.
> I doubt most consumers ever encounter this condition. Someone who is copying very large video files from one drive to another might encounter it on certain operations
I agree. I agree so much that I question the assertion that drive slowness is a major factor in machines feeling slow. My slow laptop is about 5 years old. Firefox spikes to 100+% CPU for several seconds on most page loads. The drive is idle during that time. I place the vast majority of the blame on software bloat.
That said, I am aware of credible assertions that drive wear has contributed to measurable regression in VM boot time for a certain class of servers I’ve worked on.
PCIe 5.0 SSDs have been available for 6+ months now, so in theory you could back up your SSD at 15 GB/s, but:
> you’re still writing at 1.5GB/sec.
Except for a few seconds at the start, the whole process runs as if you had PCIe 2.0 (15+ years ago). Even with such fast SSDs there is no chance of a quick backup/restore, and during the restore you hit the same slowdown a second time.
It's crazy that back in the days of slow PCIe 1.0 drives used fast SLC, yet now with PCIe 5.0, when you really need fast SLC, you get slow TLC, very slow QLC, or, even worse, PLC on the way.
At the deaf orphanage for AI children. It's a bit niche, but you should see what they can do with LEDs. Their artwork fetches a great price at the LLM Arena.
I’ve had many very kind people help me throughout my life. In most cases there was no clear immediate reward for them. Maybe there was immediate return - joy in sharing one’s craft or the satisfaction of passing on good will they received sometime long ago.
Every meaningful project I’ve worked on has benefited more from inclusion than exclusion. The person I help may or may not become a significant contributor to my project, but many times they become the person that can help me with something I’m learning. And so what if I never run across that person again? Maybe they will remember the kindness they received and pass it along.
I have severe hearing loss in my right ear and no-to-mild hearing loss in the left. AirPods Pro 2 make it so that I feel like I can hear in stereo while streaming without resorting to setting the balance 90% right and jacking up the volume. In that respect I love them. However, they are designed only for moderate loss, so they will not amplify the right ear sufficiently to hear well in that ear unless the left ear is uncomfortably loud.
For me, I need a real hearing aid to hear a person that is at my right shoulder.
If both ears are about the same, I think the hearing aid volume (separate slider from general volume) could be adjusted to get past the “designed for moderate loss” limitation.
They like it enough that they bought this business from Samsung, who previously developed and supported it through their subsidiary, Joyent. I worked for Joyent for a few years but left before the transition to mnx.
That's good to hear. It sounds really cool, but also as though you need some potentially hard to come by skills to make it work (e.g. someone who used to work at Sun might find it much easier!)
I don’t think that having worked at Sun gives you much of a leg up on Triton (cloud platform). Running Triton does require specialized knowledge, but there are decent docs, IRC, and commercial support available.
Triton uses SmartOS as the operating system on compute nodes. Familiarity with Solaris/illumos is helpful at that layer. If you are using it to run Linux VMs, the amount of Solaris wizardry needed should be minimal.
A drive that supports secure instant erase should be encrypting all data. When the erase function is invoked (“nvme format -s 2”, “hdparm --security-erase”), the key is thrown away and replaced with a new one. Similar implementations exist for NVMe, SATA, and SAS drives, regardless of whether they are HDD or SSD.
This puts a fair amount of trust in the drive’s ability to really delete the old key.
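As a concrete sketch (device names are placeholders, and both commands destroy all data on the target):

    # NVMe: format with Secure Erase Setting 2, i.e. cryptographic erase (discard the key)
    sudo nvme format /dev/nvme0n1 -s 2

    # SATA: ATA Security Erase via hdparm; a temporary password has to be set first
    sudo hdparm --user-master u --security-set-pass p /dev/sdX
    sudo hdparm --user-master u --security-erase p /dev/sdX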
I think it is Solidigm that has started to argue that, with a 128 TB QLC drive, constant writes at the maximum write rate will hit the drive’s endurance limit at about 4.6 years. The perf/TB of these drives is better than that of HDDs. The cost per TB, when you factor in server count, switches, power, etc., is argued to favor huge QLC drives too.