Real-world large filesystems are distributed across many thousands of hosts and multiple datacenters, not mounted as a Linux filesystem on a single host. Because whole racks and whole datacenters fail, not just disk drives.
The fact that real-world storage systems are distributed on the network bolsters the case for supporting 128-bit and even larger types.
Creating unified namespaces is really useful and a _great_ simplifier. We don't do it as often as we should because of limitations in various layers of modern software stacks, especially in the OS layers.
Unfortunately, AFAIU ZFS only supports 64-bit inodes. A large inode space, like 128-bit or even 256-bit, would be ideal for distributed systems.
Larger spaces for unique values are useful for more than just enumerating objects. IPv6 uses 128 bits not because anybody ever expected 2^128-1 devices attached to the network, but because a larger namespace is easier to segment. Routing tables are smaller with IPv6 because it's easier to create subnets with contiguous addressing, without practical constraints on the size of the subnet. Similarly, it's easier to create subnets of subnets (think Kubernetes clusters) with a very simple segmenting scheme and minimal centralized planning and control.
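To make that concrete, here's a minimal sketch using Python's stdlib ipaddress module, with a made-up documentation prefix, of how a roomy address space lets each layer carve out its own subnets with no coordination beyond "take the next index":

```python
import ipaddress

# A site gets a /48 (hypothetical documentation prefix), hands each cluster
# a /56, and each cluster hands each tenant a /64, all independently.
site = ipaddress.ip_network("2001:db8:1234::/48")

# Each cluster independently owns one /56 out of the site's 256.
clusters = list(site.subnets(new_prefix=56))

# Each cluster independently carves /64s for its tenants out of its own /56.
cluster0_tenants = list(clusters[0].subnets(new_prefix=64))

print(clusters[0])           # 2001:db8:1234::/56
print(cluster0_tenants[3])   # 2001:db8:1234:3::/64
```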
Similarly, content-addressable storage requires types much larger than 128 bits (e.g. 160 bits for Plan 9 Fossil using SHA-1). Not because you ever expect more than 2^128-1 objects, but because generating unique identifiers in a non-centralized manner is much easier. This is why almost everybody, knowingly or unknowingly, only generates version 4 UUIDs (usually improperly, because they randomly generate all 128 bits rather than preserving the 6 version and variant bits the standard requires).
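As a small illustration of both points, this sketch (Python stdlib, hypothetical object contents) derives a content address by hashing and contrasts a correctly structured v4 UUID with the all-random-bits mistake:

```python
import hashlib, os, uuid

# Content addressing: the identifier is just the hash of the bytes,
# so anyone can mint identifiers with no central allocator.
block = b"some object contents"
score = hashlib.sha1(block).hexdigest()   # 160-bit address, Fossil/Venti style

# Correct v4 UUID: the stdlib sets the 4 version bits and 2 variant bits.
good = uuid.uuid4()

# The common mistake: 128 fully random bits, which usually isn't a valid UUID;
# its version/variant fields are whatever the random bytes happened to be.
bad = uuid.UUID(bytes=os.urandom(16))

print(score)
print(good.version, good.variant)   # -> 4 specified in RFC 4122
print(bad.version, bad.variant)
```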
ZFS failed not by supporting a 128-bit type for describing sizes, but by only supporting a 64-bit type for inodes. And probably they did this because 1) changing the size of an inode would have been much more painful for the Solaris kernel and userland given Solaris' strong backward compatibility guarantees, and 2) because they were focusing on the future of attached storage through the lens of contemporary technologies like SCSI, not on distributed systems more generally.
Unified namespaces on many-petabyte filesystems are perfectly commonplace:
HDFS, QFS, ... even the old GFS.
You wouldn't make them Linux/FUSE mountpoints though; that's just an unneeded abstraction. Command line tools don't work with files that are 100TB each.
> Command line tools don't work with files that are 100TB each.
No, but they do work with small files, which presumably most would be if the number of objects visible in the namespace system were pushing 2^64.
100TB files are often databases in their own right, with many internal objects. But because we can't easily create a giant unified namespace that crosses these architectural boundaries, we can't abstract away those architectural boundaries like we should be doing and would be doing if it were easier to do so.
Just to be more specific, imagine inodes were 1024 bits. An inode could become a handle that not only describes a unique object but also encodes how to reach that object, which means every read/write operation would carry enough data to forward the operation through the stack of layers. Systems like FUSE can't scale well because of how they manage state, and one of the obvious ways to fix that is to embed state in the object identifier.
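Purely as a hypothetical illustration (the field layout below is invented, not any real filesystem's), a fat handle could be packed so that every layer can decode just the routing fields it needs and forward without consulting a broker:

```python
import struct

# 4 x 64-bit fields (cluster, shard, replica hint, object id) plus 96 reserved
# bytes = 128 bytes = 1024 bits. Layout is made up for illustration.
HANDLE_FMT = ">QQQQ96x"

def pack_handle(cluster: int, shard: int, replica_hint: int, object_id: int) -> bytes:
    return struct.pack(HANDLE_FMT, cluster, shard, replica_hint, object_id)

def route(handle: bytes) -> tuple[int, int, int, int]:
    # Any intermediate layer can decode the routing fields directly from the
    # identifier; no state table or metadata lookup is required.
    return struct.unpack(HANDLE_FMT, handle)

h = pack_handle(cluster=7, shard=42, replica_hint=1, object_id=2**63)
print(len(h) * 8, route(h))   # 1024 (7, 42, 1, 9223372036854775808)
```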
A real-world example is pointers on IBM i (the old AS/400 midrange systems). They're 128 bits. Not because there's a real 128-bit address space, but because the pointer also encodes information about the object, information used and enforced by both software and hardware to ensure safety and security. Importantly, this is language agnostic. An implementation of C in such an environment is very straightforward; you get object capabilities built directly into the language without even having to modify the semantics of the language or expose an API.
Language implementations like Swift, LuaJIT, and various JavaScript engines also make use of unused bits in 64-bit pointers for tagging data. On 32-bit targets this is either not possible, or they tag 64-bit doubles (NaN-boxing) instead of pointers. In any event, my point is that larger address spaces can actually make it much easier to optimize performance, because it's much simpler to encode metadata in a single, structured, shared identifier than to craft a system that relies on a broker to query metadata. Obviously you can't encode all metadata, but it's really nice to be able to encode some of the most important metadata, like type.
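Here's a minimal sketch of low-bit tagging, using plain Python integers to stand in for 8-byte-aligned 64-bit pointers; the tag assignments are invented for illustration:

```python
# Objects are 8-byte aligned, so the low 3 bits of a pointer are always zero
# and can carry a small type tag that is checked without any lookup.
TAG_MASK   = 0b111
TAG_INT    = 0b001
TAG_STRING = 0b010

def tag(addr: int, tag_bits: int) -> int:
    assert addr & TAG_MASK == 0, "address must be 8-byte aligned"
    return addr | tag_bits

def untag(ptr: int) -> tuple[int, int]:
    # Recover the real address and the type tag with two masks.
    return ptr & ~TAG_MASK, ptr & TAG_MASK

p = tag(0x7f00001000, TAG_STRING)
print(hex(p), untag(p))   # 0x7f00001002 (545460850688, 2)
```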
For IPv6, the 128 bits have their justification. They're supposed to enable proper hierarchical routing and to reduce the number of entries in the routing tables, which is the pain point where things get expensive. The idea is that no one, at any level, needs to request the allocation of a second subnet, because what they have is large enough by default. So you need more bits than strictly necessary to allow a little bit of "wastefulness" even after several layers of subnetting.
Moreover, the convention that no subnet should be smaller than a /64 enables stateless autoconfiguration for hosts. 64 bits is enough to fit common (supposedly unique) hardware identifiers, and is even large enough to assign random addresses (as with privacy extensions) with a very low probability of collisions.
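A small sketch of why that works, using made-up prefix and MAC values: a host can derive its own 64-bit interface identifier from its MAC (modified EUI-64) and append it to the advertised /64, with no server involved.

```python
import ipaddress

def slaac_address(prefix: str, mac: str) -> ipaddress.IPv6Address:
    # Modified EUI-64: flip the universal/local bit of the first octet and
    # insert ff:fe in the middle of the 48-bit MAC to get a 64-bit IID.
    b = bytearray(int(x, 16) for x in mac.split(":"))
    b[0] ^= 0x02
    iid = bytes(b[:3]) + b"\xff\xfe" + bytes(b[3:])
    net = ipaddress.ip_network(prefix)
    return ipaddress.IPv6Address(int(net.network_address) | int.from_bytes(iid, "big"))

print(slaac_address("2001:db8:1234:3::/64", "00:11:22:33:44:55"))
# -> 2001:db8:1234:3:211:22ff:fe33:4455
```

Note that the MAC is visible in the resulting address, which is exactly the privacy leak the next comment points out.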
That was the idea, but it didn't really turn out that way. Stateless autoconfiguration leaks your MAC address, which is a privacy issue. Most servers use static IPs and most desktops use random IPs, with checks for collisions.
IMHO, the 128-bit space was a big mistake. The addresses are twice as hard for humans to communicate, most languages and databases don't support the data type natively, and it complicates high-speed routing.
An average of 48 bits for a network and 15 for the host would have been better. For other reasons you almost never want more than a few hundred hosts on one layer-2 network anyway.
Except IPv6 being 128-bit makes fast hardware implementations much easier than if it had to deal with shorter prefix lengths. Nothing shorter than 128 really makes sense at all in an IPv4 replacement.
Consider a tiny piece of the internet with five devices and three hubs: an outlink and two smaller hubs, all connected in a triangle. With 4 bits, the left hub can have all the 0xxx addresses and the right hub can have all the 1xxx addresses. No matter where the devices connect, they can all get an IP, and the outlink only needs to remember a simple rule (starts with 1, go right; otherwise, go left).
Compare that to a 3-bit network. By moving IPs from hub to hub, all five devices can still always get an address, but the small hubs need to tell the outlink and each other which addresses they own in order to avoid address exhaustion on either hub. Routing a packet is slower because the routing is more complex.
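A toy sketch of the difference (addresses as small integers, hub names invented): the roomy plan routes on a single bit, while the tight plan needs a table of owned addresses that the hubs have to keep in sync.

```python
# 4-bit plan: the outlink only tests the top bit.
def route_4bit(addr: int) -> str:
    return "right" if addr & 0b1000 else "left"

# 3-bit plan: ownership must be tracked and exchanged as devices move.
owned_by_right = {0b101, 0b110, 0b111}   # changes whenever devices migrate
def route_3bit(addr: int) -> str:
    return "right" if addr in owned_by_right else "left"

print(route_4bit(0b1010), route_3bit(0b101))   # right right
```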
IPv6 is basically 64 bits for routing and 64 bits for the local network segment. It seems plausible that this is faster than trying to mask out the bits you need.
Hierarchy is nice, though. If you can model the bits as a tree, it becomes super quick to figure out where to route a packet. You can model stuff like that trivially with an FPGA.
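As an illustration of that tree view, here's a minimal sketch of a binary trie keyed on address bits, where routing is just a walk that remembers the last next-hop seen (longest-prefix match); the prefixes and next-hop names are invented:

```python
class TrieNode:
    def __init__(self):
        self.children = [None, None]   # one branch per address bit
        self.next_hop = None

def insert(root: TrieNode, prefix_bits: str, next_hop: str) -> None:
    node = root
    for b in prefix_bits:
        i = int(b)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop

def lookup(root: TrieNode, addr_bits: str) -> str:
    # Walk the address bits, remembering the most specific route seen.
    node, best = root, None
    for b in addr_bits:
        node = node.children[int(b)]
        if node is None:
            break
        if node.next_hop is not None:
            best = node.next_hop
    return best

root = TrieNode()
insert(root, "0", "left-hub")
insert(root, "1", "right-hub")
insert(root, "10", "fast-path")   # a more specific route wins
print(lookup(root, "1011"), lookup(root, "0110"))   # fast-path left-hub
```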
On the other hand, such committee bikeshedding seems to work rather well for PR. It lets them ignore hard problems and instead focus on things most people can understand and relate to, gaining more trust than a well-designed thing with nothing to understand or relate to would.
So they used 128 bit because of bikeshedding. Committees always make the most conservative decision possible. Like IPv6.