Databases use hashes all the time, and ordered indexes are also used very frequently in programming (e.g. in the C++ STL).
I think ordered indexes are a good default for use in programming languages. They are deterministic and do not have strange corner cases that can be exploited.
E.g. the order when iterating over the elements of a hashtable is essentially random, whereas the iteration order of an ordered tree-based data structure is deterministic. This can easily make a program's output nondeterministic, e.g. when serializing the contents of a hashtable, and is a frequent cause of compilers producing non-deterministic builds.
I like my programs to be deterministic and want to precisely control where randomness enters the execution.
Hash collisions can lead to pathological performance, which can be exploited for DoS attacks unless the hash data structure in question properly randomizes its hash function. Rust's standard hash data structures do a good job here, but the issue simply does not exist for an ordered tree-based data structure.
The performance difference is often very small unless you have very complex keys or very large sets/maps. A hash-based data structure is a valid choice for me once the performance actually matters.
I agree, but I would use the term "path-dependent" for hash tables rather than "non-deterministic" because, after all, unless you salt your hash function, there is no randomness anywhere. What looks non-deterministic really is deterministic, but it is determined by the precise order of insertions. The opposite property, path-independence, is sometimes called memorylessness.
Even though tree-based data structures give you an iteration order that is predictable and useful, the structure itself is still not fully memoryless. For example, in a red-black tree, if you insert an element smaller than all preexisting elements and then immediately delete the minimum, the resulting tree can differ from the original.
> I agree, but I would use the term "path-dependent" for hash tables rather than "non-deterministic" because, after all, unless you salt your hash function, there is no randomness anywhere. What looks non-deterministic really is deterministic, but it is determined by the precise order of insertions. The opposite property, path-independence, is sometimes called memorylessness.
You are right that "non-deterministic" is not strictly correct for non-salted hash functions. That makes it even more annoying: it is deterministic enough that you might accidentally rely on the order, yet close enough to nondeterministic that the iteration order can change when the hash function changes, or even when some minor internal parameter of the hashtable changes.
> Even though tree-based data structures give you an iteration order that is predictable and useful, the structure itself is still not fully memoryless. For example, in a red-black tree, if you insert an element smaller than all preexisting elements and then immediately delete the minimum, the resulting tree can differ from the original.
Yes, I am aware of, and a very big fan of, data structures that are truly fully memoryless, such as all kinds of binary tries, radix trees, or just wrapped sorted arrays. The latter are totally underrated: they have a very compact in-memory representation (just the elements, in a single piece of heap) and very competitive lookup performance.
They of course totally suck when you want to do random single element updates, but then just use something else...
I am going to release an entire collections library based on sorted arrays for rust over Christmas...
If you like sorted arrays and Rust, you might want to take a look at my sorted-vec crate, which has O(sqrt(N)) insertions and deletions: https://github.com/senderista/sorted-vec. In my tests it takes less than half the memory of BTreeSet.
PS: It isn't difficult to devise hash tables with path-independent iteration order. You can simply order the keys by their hash code values. See, e.g., my implementation of bidirectional linear probing: https://github.com/senderista/hashtable-benchmarks/blob/mast....
My stuff is even simpler. It is just a flat array in memory and simply should not be used when you want random single-element inserts or deletions. But I find that rather uncommon in many use cases anyway. I made sure that building a new collection from an iterator is fast.
What I do use is a minimum-comparison sort that makes very common cases very fast in terms of the number of comparisons, plus some unsafe code to allow in-place updates and avoid allocations.
Not quite ready for prime time yet, but you can see the basic ideas being used in this data structure:
You could build a deterministic hash table by using a cryptographic hash function like SHA-256 and then storing the hashes in a binary trie. There is no need to deal with collisions, as far as we know. The hash is expensive, but if you need the hash for something else anyway, this is pretty neat...
I have played with hash tries in the past for determinism/dynamism compared to hash tables. I think they're an excellent choice for integer sets, where you can use a reversible hash function to losslessly randomize the elements (I use the same idea in the integer hash table I linked above). If I didn't need graceful resizing, though, I would use either Cleary hash tables or bucketized cuckoo tables for this application.
The hash set and hash map in Scala are actually based on hash tries, but with a non-cryptographic hash, so collisions still need to be handled.
I once wrote efficient set operations for scala.collection.immutable.HashSet. Lots of bit-twiddling, which is not as much fun on the JVM as it is in Rust...