Databases use hashes all the time, and ordered indexes are also used very frequently in programming (e.g. in the C++ STL).
I think ordered indexes are a good default for use in programming languages. They are deterministic and do not have strange corner cases that can be exploited.
E.g. the order when iterating over the elements of a hashtable is essentially random, whereas the iteration order of an ordered tree-based data structure is deterministic. This can easily make a program's output nondeterministic, e.g. when serializing the contents of a hashtable, and is a frequent cause of compilers producing non-deterministic builds.
I like my programs to be deterministic and want to precisely control where randomness enters the execution.
Hash collisions can lead to pathological performance, which can be exploited for DoS attacks unless the hash data structure in question properly randomizes its hash function. Rust's standard hash data structures do a good job here, but the issue simply does not exist for an ordered tree-based data structure.
The performance difference is often very small unless you have very complex keys or very large sets/maps. A hash-based data structure is a valid choice for me once the performance actually matters.
I agree, but I would use the term "path-dependent" for hash tables rather than "non-deterministic" because, after all, unless you salt your hash function, there is no randomness anywhere. What looks non-deterministic really is deterministic, but it is determined by the precise order of insertions. The opposite property, path-independence, is sometimes called memorylessness.
Even though tree-based data structures give you an iteration order that is predictable and useful, the structure itself is still not fully memoryless. For example, in a red-black tree, if you insert an element smaller than all preexisting elements and then immediately delete the minimum, the resulting tree can differ from the original.
> I agree, but I would use the term "path-dependent" for hash tables rather than "non-deterministic" because, after all, unless you salt your hash function, there is no randomness anywhere. What looks non-deterministic really is deterministic, but it is determined by the precise order of insertions. The opposite property, path-independence, is sometimes called memorylessness.
You are right that "non-deterministic" is not strictly correct for non-salted hash functions. That makes it even more annoying: it is deterministic enough that you might accidentally rely on the order, yet close enough to nondeterministic that the iteration order can change when the hash function changes, or even when some minor internal parameter of the hashtable changes.
> Even though tree-based data structures give you an iteration order that is predictable and useful, the structure itself is still not fully memoryless. For example, in a red-black tree, if you insert an element smaller than all preexisting elements and then immediately delete the minimum, the resulting tree can differ from the original.
Yes, I am aware of, and a very big fan of, data structures that are truly fully memoryless, such as all kinds of binary tries, radix trees, or just wrapped sorted arrays. The latter are totally underrated: they have a very compact in-memory representation (just the elements, in a single piece of heap) and very competitive lookup performance.
They of course totally suck when you want to do random single element updates, but then just use something else...
I am going to release an entire collections library based on sorted arrays for rust over Christmas...
If you like sorted arrays and Rust, you might want to take a look at my sorted-vec crate, which has O(sqrt(N)) insertions and deletions: https://github.com/senderista/sorted-vec. In my tests it takes less than half the memory of BTreeSet.
PS: It isn't difficult to devise hash tables with path-independent iteration order. You can simply order the keys by their hash code values. See, e.g., my implementation of bidirectional linear probing: https://github.com/senderista/hashtable-benchmarks/blob/mast....
My stuff is even simpler. It is just a flat array in memory and simply should not be used when you want random single-element inserts or deletions. But I find that rather uncommon in many use cases anyway. I made sure that building a new collection from an iterator is fast.
What I do use is a minimum-comparison sort that makes very common cases very fast in terms of the number of comparisons, plus some unsafe code to allow in-place updates and avoid allocations.
Not quite ready for prime time yet, but you can see the basic ideas being used in this data structure:
You could build a deterministic hash table by using a cryptographic hash function like SHA-256 and then storing the hashes in a binary trie. There is no need to deal with collisions, as far as we know. The hash is expensive, but if you need the hash for something else anyway, this is pretty neat...
I have played with hash tries in the past for determinism/dynamism compared to hash tables. I think they're an excellent choice for integer sets, where you can use a reversible hash function to losslessly randomize the elements (I use the same idea in the integer hash table I linked above). If I didn't need graceful resizing, though, I would use either Cleary hash tables or bucketized cuckoo tables for this application.
The hash set and hash map in Scala are actually based on hash tries, but with a non-cryptographic hash, so collisions still need to be handled.
I once wrote efficient set operations for scala.collection.immutable.HashSet. Lots of bit-twiddling, which is not as much fun on the JVM as it is in Rust...