Is there a real fundamental difference between the two approaches? It seems to me that it's all about API, not underlying concepts. A filesystem may organize things hierarchically, such that the full path to a file is composed of the names of the various folders that contain it plus the name of the file itself, glued together with a path separator. Or a filesystem may organize things as a pure key-value store, where the file's name is the full path. Code that uses the filesystem could then treat it either way. You can address files purely by full path on a hierarchical system, and you can pick a "path separator" and treat it like there were directories on a pure key-value store.
Applied to code, it's really a matter of how you look up names. Prefixing is just the situation where there's no language assistance in looking up names. Namespacing is just prefixes plus some automatic help in looking up partial names.
> Is there a real fundamental difference between the two approaches?
Yes. With true directories, you can implement directory rename() with O(a+b) time complexity, where a and b are the depths of the paths to the directories. With a flat key/value store, directory rename() requires at least O(n) time complexity for n directory children, since each child must be renamed.
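To make the O(n) cost concrete, here is a minimal sketch (names and layout are illustrative, not any real filesystem's API) of a directory rename on a flat key/value store where each file's key is its full path:

```python
# Hypothetical sketch: directory rename on a flat key/value store,
# where a file's key is its full path. Renaming "a/b" to "a/B" must
# rewrite the key of every descendant -- O(n) in affected entries.
store = {
    "a/b/x.txt": "data1",
    "a/b/sub/y.txt": "data2",
    "a/c/z.txt": "data3",
}

def rename_dir(store, old_prefix, new_prefix):
    old_prefix += "/"
    new_prefix += "/"
    for key in list(store):  # scans every key in the store
        if key.startswith(old_prefix):
            store[new_prefix + key[len(old_prefix):]] = store.pop(key)

rename_dir(store, "a/b", "a/B")
print(sorted(store))  # ['a/B/sub/y.txt', 'a/B/x.txt', 'a/c/z.txt']
```

With true directories, the same operation is a single relink of the "b" entry inside "a", regardless of how many descendants it has.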
I'd be curious to know how to make it more efficient. Either the descendants of a directory include the directory's name in their identifier, or they don't. If they do, then to implement rename() you need to do something such that all descendants of the directory can be queried by a different identifier than the one they were given upon creation (where the difference is exactly the name change of an ancestor directory). You can:
* rename all descendants on directory rename (expensive--takes O(n) time for n descendants).
* log each rename() and translate a key with a new directory name into a key with the original directory name (requires O(k) space for k renames and O(d) time in the worst case for translating a path of depth d whose directories have all been renamed; introduces the need to garbage-collect log records when all descendants of a renamed directory get deleted).
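The second option could look roughly like this (a sketch under the assumptions above; the function names are made up for illustration): renames append to a log, and lookups translate a current path back to the original key.

```python
# Hypothetical sketch of the log-based option: files keep their
# original keys; directory renames are O(1) appends to a log, and
# lookups translate a current path back to the original key.
rename_log = []  # list of (current_prefix, original_prefix)
store = {"a/b/x.txt": "data"}

def rename_dir(old_prefix, new_prefix):
    rename_log.append((new_prefix, old_prefix))  # no keys rewritten

def resolve(path):
    # Worst case O(k) over k logged renames; with an index keyed by
    # prefix this can approach O(d) for a path of depth d.
    for current, original in reversed(rename_log):
        if path == current or path.startswith(current + "/"):
            path = original + path[len(current):]
    return path

rename_dir("a/b", "a/B")
print(resolve("a/B/x.txt"))         # 'a/b/x.txt' -- the original key
print(store[resolve("a/B/x.txt")])  # 'data'
```

This also makes the garbage-collection problem visible: the `("a/B", "a/b")` log entry must live until every descendant of the renamed directory is gone.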
Granted, this is more nuanced than what I said before about rename() requiring O(n) time complexity when implemented on top of a key/value store, but the second option is still more costly than rename() with true directories. I suspect (but cannot prove at this time) that we can't do better than the second option above, but would love to hear your thoughts.
I'm not debating that it's inefficient if you use the same data structure that is used now. That's why I suggested using a different one.
Instead of a flat list of file names for each directory, you store the file names in a trie, so that files with a shared prefix in their name have a shared ancestor node in the trie. To do a directory rename, you just have to find the node in the trie that corresponds to the directory, move that node to a different location in the trie and change the portion of the name that is stored in that node.
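A minimal sketch of that idea (the node layout here is illustrative; a real implementation would be a radix/patricia trie with more careful splitting): the rename touches one node, and the descendants are never visited.

```python
# Hypothetical sketch: file names stored in a trie where each node
# holds a fragment of the name. A directory rename is a local edit
# of one node, independent of how many descendants it has.
class Node:
    def __init__(self, fragment):
        self.fragment = fragment  # portion of the name stored here
        self.children = {}        # fragment -> Node

root = Node("")

def insert(path):
    node = root
    for part in path.split("/"):  # simplistic: one node per component
        part += "/"
        node = node.children.setdefault(part, Node(part))

insert("a/bc/d")
insert("a/bc/e")

# rename a/bc -> a/Bc: relabel and re-key one node; children untouched
bc = root.children["a/"].children.pop("bc/")
bc.fragment = "Bc/"
root.children["a/"].children["Bc/"] = bc
print(sorted(bc.children))  # ['d/', 'e/']
```

Path resolution is still O(d) for depth d, and the rename is O(a+b) for the source and destination paths, matching true directories.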
> Instead of a flat list of file names for each directory, you store the file names in a trie, so that files with a shared prefix in their name have a shared ancestor node in the trie.
But then, if they behave like directories, and the software treats them like directories, aren't you effectively implementing directories in everything but name? (I mean, of course current software doesn't actually treat directory names this way; but it seems like changing the data structure is going a long way to maintain a directory-less fiction.)
No, because you could also rename any shared prefix of file names, not just the parts that end with a slash. Instead of directories having a special significance, now any arbitrary prefix in a file name has the same significance.
I'm not seeing how this is any different from true directories in terms of time and space complexity, as far as rename and path resolution are concerned. Sure, the trie nodes don't have to be implemented as key/value pairs in the data store, but don't they otherwise serve the same purpose in the affected algorithms--pointing to key/value pairs or other directories? Doesn't this still mean that path resolution is O(d) for a path of depth d in the trie, and rename is O(a+b) for two paths of depth a and b?
The time and space complexity aren't different. That's the whole point I have been arguing: that you can support paths-as-arbitrary-strings semantics without losing efficiency on directory renames.
> No, because you could also rename any shared prefix of file names, not just the parts that end with a slash.
I think that
$ mkdir -p a/bc; touch a/bc/{d,e}
$ cd a; mv bc Bc
(changing just the 'b', not the whole 'bc') amounts to the same behaviour for directories. I guess you could argue that it's different because you still have to mention the `c`, even if you don't change it.
What if instead of just one file starting with b, you have multiple files starting with b?
If there are files with names bc, bd and be, `mv b B` now renames all of those to Bc, Bd and Be. With normal directories, there is no way to do that as one operation.
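In trie terms (again a rough sketch, not a real filesystem API): if bc, bd and be share a node for the prefix "b", then `mv b B` is one local edit to that node.

```python
# Hypothetical sketch: in a radix-style trie the shared prefix "b"
# of bc, bd, be lives in a single node, so renaming "b" -> "B" is
# one local edit that renames all three -- no per-file work.
class Node:
    def __init__(self, fragment):
        self.fragment = fragment
        self.children = {}

root = Node("")
b = Node("b")
root.children["b"] = b
for suffix in ("c", "d", "e"):
    b.children[suffix] = Node(suffix)

# rename the shared prefix "b" -> "B" in one step
del root.children["b"]
b.fragment = "B"
root.children["B"] = b

def names(node, prefix=""):
    prefix += node.fragment
    if not node.children:
        yield prefix
    for child in node.children.values():
        yield from names(child, prefix)

print(sorted(names(root)))  # ['Bc', 'Bd', 'Be']
```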
There's no efficiency requirement on atomicity. Who said we need POSIX filesystem semantics anyway? Is it so important that we should trade away scalability for it? Obviously that wasn't the S3 design choice.
Filesystem users (developers) already assume that renames are fast. Changing that behavior now would make some applications appear to hang, or behave strangely.
S3 users already know and should handle that.