Feature request: Add a "library" option to give mkdwarfs a list of files that should be loaded into the dedup mechanism first, but not stored, allowing the image to be even smaller if the contents of the file can be retrieved from that library instead. Bonus points if you can specify a dwarfs image as a library and have it sensibly use the files contained in it.
Then you have the basis for a deduplicating incremental backup system. Currently, I have a system I wrote that takes a single file and a list of library files and produces a compressed, deduplicated file that can re-create that single file using the library. That's great if you use tar to create that single file, but a little unwieldy when it comes to decompressing and restoring everything. The bonus of making it a proper mountable filesystem instead is that retrieving single files becomes a doddle.
My use case is that I have students, and I have given them coursework, which involves them logging in to a Linux machine and hacking away. I want to store regular snapshots of their work so that I can keep a backup for their sake but also so I can see a progression of development to try to work out if they are cheating (yes, I have had to deal with this), but I don't want to store 100 copies of the same fairly large files.
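Conceptually, the library idea boils down to something like this Python sketch (purely illustrative: the fixed 64 KiB blocks, SHA-256 hashing, pickle container and file names are arbitrary choices of mine, not how mkdwarfs or any existing tool actually works):

    import hashlib
    import pickle

    BLOCK = 64 * 1024  # illustrative fixed block size

    def blocks(path):
        with open(path, "rb") as f:
            while chunk := f.read(BLOCK):
                yield chunk

    def build_library_index(library_paths):
        # Map block hash -> (library file, offset) for every block in the library.
        index = {}
        for path in library_paths:
            for i, chunk in enumerate(blocks(path)):
                index.setdefault(hashlib.sha256(chunk).digest(), (path, i * BLOCK))
        return index

    def encode(target, index):
        # Store only blocks that cannot be found in the library.
        out = []
        for chunk in blocks(target):
            h = hashlib.sha256(chunk).digest()
            if h in index:
                out.append(("ref", *index[h], len(chunk)))
            else:
                out.append(("raw", chunk))
        return out

    def decode(entries):
        # Rebuild the target file from library references and literal blocks.
        data = bytearray()
        for entry in entries:
            if entry[0] == "ref":
                _, path, offset, length = entry
                with open(path, "rb") as f:
                    f.seek(offset)
                    data += f.read(length)
            else:
                data += entry[1]
        return bytes(data)

    if __name__ == "__main__":
        # Hypothetical paths, just for illustration.
        idx = build_library_index(["library/base.tar"])
        recipe = encode("snapshots/week03.tar", idx)
        with open("week03.dedup", "wb") as f:
            pickle.dump(recipe, f)

A real implementation would want content-defined chunking and compression of the raw blocks on top of this, but the point is that only data not already present in the library ends up in the output file.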
Borg and restic/rustic are alternatives that do this multi-layered backup thing. They can be mounted as a filesystem as well (the rustic reimplementation of restic doesn't support this yet, however).
It has a great system for snapshotting files while only storing data that has changed. I use it in an environment where I can't use something like ZFS to snapshot data because I don't have the ability to make decisions about what filesystem we're using. It's been amazing, love it so much!
I always thought a deduplicating squashfs might be really cool for a read-only MAME system. Since most of the ROM files seem to be variants of each other, instead of having zip archives for each game, just put all the raw files in a directory and create a deduplicated squashfs-type filesystem with all of them.
It's not an idle use case by the way. mhx wrote and maintains a library that provides important backwards compatibility for native (typically C based) extensions for Perl across decades of language releases.
I would like to archive every web page I ever visit. Something like this sounds very useful but it would need to be updated periodically.
I’m thinking a hybrid approach could make sense here: historical data is thrown into a big archive DwarFS, while new data from, e.g., the current day is kept in a plain normal folder. Every now and then the two would be merged together.
Skimming through the docs, I can’t tell whether it's possible to create a new DwarFS image by taking an existing one and adding just a few more files, or whether I would need to create a completely new one, which also means I would temporarily need enough space for twice the archive size.
To me it seems such an fs should be immutable-first rather than read-only, i.e. it should let you create (by copying from another fs) files which just can't be changed ever after (though they can be moved). And that is exactly what I need, as I have a lot of big and relatively redundant files which aren't meant to change (e.g. video files, picture originals, distro and backup disk images, etc.).
It's meant as a squashfs replacement, i.e. an image used as the rootfs on a live-boot system, with overlayfs on top providing a writable tmpfs, for example. But I imagine this being useful for many embedded applications in general where you need to save space.
Append-only files might be enough, to be fair. And still very useful.
Then you can dedup per block, even with a sliding window. The nicest part is that you can do it fully asynchronously, so real-time appends are not slowed down.
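A toy content-defined chunker shows why appends play nicely with this: chunk boundaries depend only on a small sliding window of local content, so data appended later re-uses the same boundaries and hashes. (The window size, polynomial and boundary mask below are arbitrary; this is a sketch, not what borg, restic or DwarFS actually do.)

    import hashlib
    import os

    WINDOW = 48            # bytes in the rolling window
    MIN_CHUNK = 2 * 1024   # avoid pathologically small chunks
    MASK = (1 << 13) - 1   # ~8 KiB average chunk size
    PRIME = 31
    MOD = 1 << 32

    def chunk_boundaries(data):
        # Yield (start, end) offsets using a naive polynomial rolling hash.
        start, h = 0, 0
        window = bytearray()
        top = pow(PRIME, WINDOW - 1, MOD)
        for i, byte in enumerate(data):
            if len(window) == WINDOW:
                out = window.pop(0)
                h = (h - out * top) % MOD
            window.append(byte)
            h = (h * PRIME + byte) % MOD
            if i + 1 - start >= MIN_CHUNK and (h & MASK) == MASK:
                yield start, i + 1
                start = i + 1
        if start < len(data):
            yield start, len(data)

    def dedup(data, store=None):
        # Keep one copy of each unique chunk; return the recipe to rebuild `data`.
        store = {} if store is None else store
        recipe = []
        for s, e in chunk_boundaries(data):
            chunk = data[s:e]
            key = hashlib.sha256(chunk).hexdigest()
            store.setdefault(key, chunk)
            recipe.append(key)
        return recipe, store

    if __name__ == "__main__":
        base = os.urandom(1 << 20)                 # 1 MiB of synthetic data
        recipe1, store = dedup(base)
        recipe2, store = dedup(base + b"appended log line\n", store)
        # Nearly all chunks of the appended version are already in the store.
        print(len(recipe1), len(recipe2), len(store))

Because the boundary decision is purely local, chunking the appended data can happen in the background without touching anything that was already hashed.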
> How do you put the files in it if it's read-only?
The same way we [used to] use specialized software tools to assemble an ISO 9660 image with our data, and then burn that image to a [single-session] write-once optical disk like CD-R.
It's read-only after you create the image. And then you can use overlayfs on top of it as others mentioned -- useful for playing with live Linux distributions or embedded systems where you never want to change the root bootable image except when you are actually updating the firmware.
...Or, in the author's case, when you have hundreds of directories that have only small differences compared to each other.
As other posters in the bigger thread mentioned, this is also very useful for big arcade game collections.
I've got a (relatively old) snapshot of the English Wikipedia that I'm using for testing. The snapshot is around 200 GiB in 14,000,000 files and compresses down to an 11 GiB DwarFS image.
I guess it's without the pictures then?
Because if I compare with the ZIM file format (which is optimized for this use case), at https://kiwix.org/en/what-is-the-size-of-wikipedia/ I read:
"As of October 2022, the Full English Wikipedia (ca. 6.5 million articles), with images will use up 91GB of storage space (German and French, the second-largest: 36 GB). (...) If you can do without the images (what we call the nopic version), then you are down to 46 GB."
>As of May 2015, the current version of the English Wikipedia article / template / redirect text was about 51 GB uncompressed in XML format.
Compressed data at the same time was 11.5 GB. And that's data from 9 years ago, and just English Wikipedia.
For comparison, I collect leaked password dumps and they (combined, after deduplication) go into hundreds of GBs too. And that's for just username:password lines, not even text.
It's ever so slightly smaller than a .tar.xz of the same data. The main difference being that you don't have to fully extract it in order to access the data.
IPFS is slow as molasses and their solution was to introduce paid pinning services. Complete failure.
And then they announced they will willingly and proactively delete any hash that any legislative agency tells them to, and it was dead in the next minute.
Most distributed systems simply don't live up to their theory. Having a hub is always easier, and consumers hopping from hub to hub over time is just the most efficient mechanism.
Agreed. The idea behind IPFS is nice but we need to combine it with all the good lessons from the torrent software and stuff like NNCP in order to have something truly distributed, resilient, fault-tolerant and automatically replicated.
I'm curious to know, but I believe a major portion of this idea comes into play when the data is massively redundant. A typical container (say, a vhdx file) normally wouldn't apply.
No, you didn't misunderstand. However, I do believe there would be quite a bit of redundant data in the layers of a Docker image, especially when the Dockerfile hasn't been optimized and was just haphazardly constructed.
FWIW, I've updated the comparison using the latest versions of EROFS and DwarFS. There definitely have been improvements to EROFS in the meantime. DwarFS is still orders of magnitude faster in creating the file system and can achieve significantly better compression, but throughput is very similar unless you start optimizing the EROFS image for better compression.
Thanks!
One question though: why do you say "as it's pretty obvious that the domains in which EROFS and DwarFS are being used have extremely little overlap. DwarFS will likely never be able to run on embedded devices"? (I mean: as far as I know, EROFS was mostly used in smartphones, which are actually powerful devices!)
That's definitely a fair point! DwarFS actually works perfectly fine on 64-bit ARM and would likely work fine on a smartphone as well. Still (and I might be completely wrong about this), I think the primary goal of EROFS is to consume as few resources as possible when the file system is accessed and to be able to run on much less capable hardware, including 32-bit systems. DwarFS primarily cares about maximizing compression as long as it doesn't negatively impact performance (access times and throughput). This involves a certain amount of caching / pre-fetching and assumes there's plenty of memory available. This can be configured to a certain extent, but my pessimistic assumption is that DwarFS would use significantly more memory than EROFS. Might be worth actually backing this up with numbers! :)
I do miss ZipMagic, which let us use a zip file like a folder in Windows Explorer, including read/write operations. It has been discontinued and its name is now reused by another tool, so I can only find its description on Amazon [1].
Very cool idea. I wonder how it would work to first turn every file you want to store into a stream of symbol files using a fountain code like RaptorQ, and then try to compress all of the symbol files together using something like Zstandard with a big dictionary. It would probably be too slow for many file system applications, but I would think it would be pretty good at exploiting redundancy across files and getting good compression ratios.
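The "big dictionary" half of that is easy to experiment with today using the python-zstandard bindings; a rough sketch (the fountain-code step is omitted, the directory name is hypothetical, and the dictionary size and compression level are just guesses):

    import os
    import zstandard as zstd

    def compress_with_shared_dictionary(paths, dict_size=1 << 20):
        # Train one dictionary on all files, then compress each file against it,
        # so redundancy across files helps even though each file is a separate frame.
        samples = [open(p, "rb").read() for p in paths]
        dictionary = zstd.train_dictionary(dict_size, samples)
        compressor = zstd.ZstdCompressor(level=19, dict_data=dictionary)
        blobs = {p: compressor.compress(s) for p, s in zip(paths, samples)}
        return dictionary, blobs

    def restore(blob, dictionary):
        return zstd.ZstdDecompressor(dict_data=dictionary).decompress(blob)

    if __name__ == "__main__":
        root = "symbol_files"                      # hypothetical directory of inputs
        files = [os.path.join(root, n) for n in os.listdir(root)]
        dictionary, blobs = compress_with_shared_dictionary(files)
        total_in = sum(os.path.getsize(p) for p in files)
        total_out = sum(len(b) for b in blobs.values()) + len(dictionary.as_bytes())
        print(f"{total_in} bytes -> {total_out} bytes (dictionary included)")

Whether fountain-coded symbol streams would still expose enough shared structure for the dictionary to latch onto is the open question, though.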
This sounds like something that could be useful for immutable distros like Vanilla and others, but I wonder how practical it is if it’s not part of the kernel.
This looks really cool! Though it's a bit limited since it is a FUSE module and not a kernel driver, and it's unlikely to become a kernel module since it is written in C++ with large dependencies :-\
Would it be possible to take the core design changes here and apply them to squashfs, and maybe propose a next major version of the squashfs internal format to make all these things possible?
I've heard that SMB-fs is "better" than FUSE. More cross-platform, can also be implemented in user-space, and less likely to get jammed up/deadlocked by slow network calls inside the kernel.
I feel like 9p is a better competitor in this space, and I've actually heard of people using it (e.g. the Windows Subsystem for Linux mounts host filesystems this way and so does Crostini, the ChromeOS equivalent).
Can someone explain what this is and when I would use it, in layman’s terms? I understand it’s a FUSE filesystem but don't quite understand what it does.
However, since it's a FUSE-only file system, it's difficult to see how it would be used in embedded system firmware, so it could perhaps see use as a distribution mechanism. Similar to tar or zip files, but possibly with (much) better performance for random access, should you need only a smaller portion of the whole archive.
The author describes a need to keep multiple similar copies of sets of unchanging files on their computer, and made this to reduce the space needed for them while retaining access through the file system. So that is also a use case.
Can you give an example of "self-congratulatory tone"?
Is it things like "this is still 10 times larger than dwarfs"? Because a statement of fact cannot be insolent.
I actually only went through a significant portion of the readme (through the CromFS part) because of your comment, and I just don't see it. I see a person who wrote actually useful software that is multi-platform and who gives the positives and negatives of the software they wrote compared to the alternatives available today.
In every test DwarFS compared favorably, and on tests where one aspect was marginal, the DwarFS code was better in other regards: power at the wall, extract/read times, etc.
How would it be better presented by a solo developer?
That's fair, and I completely missed that. I remember HAProxy used to have similar verbiage on their main page, back when they were the only software load balancer able to sustainably handle 10 Gbit of throughput. I gave the current intro.txt a quick scan and there's still some...
> HAProxy offers a fairly complete set of load balancing features, most of which
> are unfortunately not available in a number of other load balancing products
It seems like a minor nit that could be "fixed", especially since the "it gets better" is in reference to "not only smaller, but also much faster", and it's not an insignificant performance increase; it's 100 times faster. Basically everyone (mostly) uses squashfs, and this absolutely trounces it, according to the author.
Thanks for the feedback! I was probably a little too excited when writing the documentation. Reading it again with some distance, I can certainly see why this might be a bit off-putting. I'll keep this in mind for the future!
Yes, now let's put all public government data on one of these so that we can all inspect the public institutions which we use to rule over ourselves and achieve the next level of real digital democracy: full institutional public transparency.
...as if... now, let me spit out all this purple democratic kool aid
DwarFS: A fast high compression read-only file system - https://news.ycombinator.com/item?id=32216275 - July 2022 (64 comments)
DwarFS: A fast high compression read-only file system - https://news.ycombinator.com/item?id=25246050 - Nov 2020 (111 comments)