> Instead of AVL and red black trees, we instead learned about 2-3 trees and B trees
When we were first taught about B trees in my databases course in college, the professor quipped something like "these are the reason you'll never actually use a red-black tree in practice". I've actually found that in practice I rarely end up needing any sort of data structure that maintains sort order; for whatever reason, I've run into cases far more often where I need to maintain insertion order than sort order, but I'm not confident about whether this is due more to the type of stuff I work on or if it's just a more common requirement in general.
> I've run into cases far more often where I need to maintain insertion order than sort order, but I'm not confident about whether this is due more to the type of stuff I work on or if it's just a more common requirement in general.
I've had the exact opposite experience, so I'm guessing it has a lot to do with the type of stuff you've (and I've) worked on.
He normally taught databases and one of my great regrets was dropping his 500 level database class one semester when things were rough. (Another regret was not taking networking from Professor Landweber https://en.wikipedia.org/wiki/Lawrence_Landweber .)
It could be that my professors were simply not that good, which is why I don't really understand this. I mostly grokked the subjects by working through the books and the exercises in them. The professors' lectures almost always went over some detail too quickly, so in a 3 hour lecture, I usually ended up following only the first hour or so.
My question is maybe better put as: what quality or approach makes them great lecturers?
Having a class in databases from the person the DeWitt Clause in Oracle licenses is named after.
I had his class before (the aforementioned sophomore data structures and algorithms) and I knew he was a good lecturer. That wasn't a question. It was a "too much going on that semester and I wasn't able to do the assignments in a timely manner".
Professor Landweber's class had a waiting list that I scoffed at as a college student, and I looked for something I could get into to fill the requirement without realizing who he was. It wasn't until later, while reading the UNIX and Linux System Administration Handbook and seeing an image in there attributed to him, that the name clicked, but he wasn't teaching the next semester and I had already filled that requirement with another class.
Another regret was dropping numerical methods with Professor de Boor ( https://en.wikipedia.org/wiki/Carl_R._de_Boor ) when I couldn't get my head around splines rather than going to his office hours (he was also the undergraduate advisor and very helpful).
Those classes could have changed the trajectory of my career. I was looking at being a sysadmin instead and writing perl and keeping some big iron running. My career pivoted from sysadmin to webmaster to web developer (perl) to web developer (Java). I have no regrets from that path - I do have regret for not learning what I could from the experts in their field when I was in a place and time to learn from them.
Slightly random anecdote. Had an intern one summer at a large tech company I worked for, who was determined to inject a red-black tree into the project they were working on for us, using hand-rolled code.
I'm not sure what was going through his head at the time, whether he'd just learned how to write them at college, or if he just thought they were the pinnacle of engineering.
None of the cases they were trying to shoehorn them into made sense. These were places where you could argue that even a straight array would be sufficient, given the limited number of items; the worst case of searching to the end of the array wouldn't matter. Doubly so given they weren't working on anything anywhere near a hot path or a time-sensitive operation. Standard library components would have been perfectly fine, and wouldn't have carried the maintenance burden.
They did, on at least a couple of occasions, smugly explain what a red-black tree was to their mentor, who had likely implemented his first red-black tree at college before that intern had ever even touched a computer for the first time. He showed remarkable patience.
> I'm not sure what was going through his head at the time, whether he'd just learned how to write them at college, or if he just thought they were the pinnacle of engineering.
That's understandable and rather common with fresh grads. They have the energy and motivation but may not quite have the experience. We were all there once, including me.
My history professor from 8th grade used to say: "There is a difference between knowing what you're doing and doing what you know".
> I'm not sure what was going through his head at the time, whether he'd just learned how to write them at college, or if he just thought they were the pinnacle of engineering.
I think it's easy to forget that a lot of interns might never have actually worked on a "real" codebase before; I know I certainly hadn't before my first internship! For someone who had only ever worked on homework code in classes before and maybe some side projects for fun, it's very easy to imagine that they honestly didn't understand that implementing the hardest data structure they knew by hand wouldn't actually be an impressive display of engineering to the company they were hoping to get a job offer from.
First code reviews with interns are always interesting; I love reading through them to learn stuff (I'm an operations-focused / sysadmin type; coding is more something I've picked up as I've gone along and wanted to automate stuff).
This particular intern was dead set on this red-black tree, despite it being rejected constructively, repeatedly, every time they tried to introduce it, and despite mentoring on the subject from the experienced developer who was his mentor, and guidance from other engineers in the group.
He was the only intern I ever had to deal with who didn't seem to recognise that production code is different from academic code, where the balance of clever vs maintainable is vastly different :)
I remember part of it being that they beat efficiency into your brain throughout most of early CS. In college, using the naive approach is treated as being ignorant.
Then you work on real software and realize that computers are pretty fast and basic hashmaps & arrays will likely suffice for the size of the datasets you're working with, and beyond that, performance is more about tool/database choice than rolling your own algorithms.
In many cases just using a hash table and sorting the keys is faster than using a B-tree anyway. I tried switching from a hash table with a separate sort to a B-tree in the Bevy game engine for binning of transparent objects recently, and it was a big regression.
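For the curious, the "hash table plus one sort at the end" pattern looks roughly like this in Rust (a minimal sketch to show the shape of the idea, not the actual Bevy code):

```rust
use std::collections::{BTreeMap, HashMap};

/// Ordered output via a HashMap plus a single sort at the end:
/// O(1) inserts, then one cache-friendly O(n log n) sort.
fn ordered_via_hashmap(items: &[(u32, &'static str)]) -> Vec<(u32, &'static str)> {
    let map: HashMap<u32, &'static str> = items.iter().copied().collect();
    let mut out: Vec<(u32, &'static str)> = map.into_iter().collect();
    out.sort_by_key(|&(k, _)| k); // one sort instead of per-insert rebalancing
    out
}

/// The same result via a BTreeMap, which pays to keep keys sorted on every insert.
fn ordered_via_btree(items: &[(u32, &'static str)]) -> Vec<(u32, &'static str)> {
    let map: BTreeMap<u32, &'static str> = items.iter().copied().collect();
    map.into_iter().collect()
}
```

Both produce identical sorted output; the difference is purely where the ordering work happens.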
Yep, this is the thought process I had; a red-black tree or B-tree is something I'd consider for a map that needed sort order, and a hashmap would be what I'd use by default if I didn't care about order.
For additional context, I work almost entirely in Rust nowadays, where BTreeMap and HashMap are the two map types in the standard library (https://doc.rust-lang.org/std/collections/index.html). I've ended up needing insertion-ordered maps quite often in my work, though, which is not something I've often seen taught or included in standard libraries, although there are plenty of third-party implementations out there. In most languages it's not too difficult to hand-roll one from an unordered map and a linked list: store both the value and a reference to the list node in each map entry, and you get constant-time insertion, lookup, and deletion. (Rust is somewhat notorious for _not_ being a language where implementing something like this by hand is easy, though, since tracking mutable references to individual nodes of a linked list across separate hashmap values is not something the borrow checker is enthusiastic about.)
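A toy version of that hashmap-plus-linked-list idea does work in Rust if you sidestep the borrow checker by using Vec indices in place of node references (all names here are made up for illustration; this is a sketch, not a production crate):

```rust
use std::collections::HashMap;

// One slot in a Vec-backed doubly linked list; indices stand in for pointers.
struct Node<K, V> {
    key: K,
    value: V,
    prev: Option<usize>,
    next: Option<usize>,
}

// Insertion-ordered map: a HashMap from key to slot index, plus linked slots.
struct InsertionOrderMap<K, V> {
    index: HashMap<K, usize>,
    nodes: Vec<Option<Node<K, V>>>, // None = removed slot (tombstone)
    head: Option<usize>,
    tail: Option<usize>,
}

impl<K: std::hash::Hash + Eq + Clone, V> InsertionOrderMap<K, V> {
    fn new() -> Self {
        Self { index: HashMap::new(), nodes: Vec::new(), head: None, tail: None }
    }

    /// O(1) insert at the back (updates the value in place if the key exists).
    fn insert(&mut self, key: K, value: V) {
        if let Some(&i) = self.index.get(&key) {
            self.nodes[i].as_mut().unwrap().value = value;
            return;
        }
        let i = self.nodes.len();
        self.nodes.push(Some(Node { key: key.clone(), value, prev: self.tail, next: None }));
        match self.tail {
            Some(t) => self.nodes[t].as_mut().unwrap().next = Some(i),
            None => self.head = Some(i),
        }
        self.tail = Some(i);
        self.index.insert(key, i);
    }

    /// O(1) lookup through the hash map.
    fn get(&self, key: &K) -> Option<&V> {
        self.index.get(key).map(|&i| &self.nodes[i].as_ref().unwrap().value)
    }

    /// O(1) removal: unlink the node and leave a tombstone in the Vec.
    fn remove(&mut self, key: &K) -> Option<V> {
        let i = self.index.remove(key)?;
        let node = self.nodes[i].take().unwrap();
        match node.prev {
            Some(p) => self.nodes[p].as_mut().unwrap().next = node.next,
            None => self.head = node.next,
        }
        match node.next {
            Some(n) => self.nodes[n].as_mut().unwrap().prev = node.prev,
            None => self.tail = node.prev,
        }
        Some(node.value)
    }

    /// Walk the links to visit keys in insertion order.
    fn keys_in_order(&self) -> Vec<K> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            let n = self.nodes[i].as_ref().unwrap();
            out.push(n.key.clone());
            cur = n.next;
        }
        out
    }
}
```

The tombstones waste a slot per removal, so a real implementation would also reuse freed slots, but the index trick is the standard way to express this kind of structure without fighting the borrow checker.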
Not really. Hashmaps don't have guaranteed performance characteristics and require the "correct" hash function to work properly. In particular, their performance degrades as they become more "full", requiring a re-allocation to a table of generally n-times the size. Balanced trees could be regarded as a more generic solution to the mapping problem and are more suitable in many applications.
Trees require a lot more memory indirections than hashmaps, which makes them slower to traverse. They require a lot more interior pointers, which makes them bigger. It’s easy for a hashmap to beat a tree.
Yes really, that’s why the vast majority of languages default to a hashmap (if they even have tree-based maps at all). Turns out the average application is a lot more interested in a low-constant-factor O(1) best case than in a high-constant-factor O(log n) worst case.
> Hashmaps don't have guaranteed performance characteristics
Of course they do.
> require the "correct" hash function to work properly.
And tree maps require correct comparators to work correctly.
> In particular, their performance degrades as they become more "full" requiring a re-allocation to a table of generally n-times the size.
Yes? Also “n-times” seems like pretty fuddy fud. And it seems weird to be more worried by the reallocation of a hashmap when a tree requires an allocation per node, so your average balanced tree requires a heap allocation per entry. And two pointers, for up to 16x overhead not including the allocator’s own metadata.
> Balanced trees could be regarded as a more generic solution to the mapping problem and are more suitable in many applications.
And the amphibious cycle could be regarded as the more generic solution to the transport problem.
>> Hashmaps don't have guaranteed performance characteristics
>Of course they do
No, they do not, at least not worst case O(1). If you write a very poor (but functional!) hash function or an adversary is choosing your items, then congratulations, your hashmap is now a linked list with extra steps.
Sure, you could use a robust cryptographic hash function, but you didn't because (a) it's not the default in your language and (b) it's slow and you are "a lot more interested in a low-constant-factor O(1) best case".
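To make that concrete, here's a deliberately bad but fully functional hasher in Rust (`ConstantHasher` and friends are made-up names for this sketch): the map stays correct, but every key collides, so inserts and lookups degrade toward linear scans.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasher, Hasher};

// A terrible (but functional) hasher: it ignores its input and hashes
// every key to 0, so all entries collide and probing degenerates to a
// linear scan -- the "linked list with extra steps" described above.
#[derive(Default, Clone)]
struct ConstantHasher;

impl Hasher for ConstantHasher {
    fn finish(&self) -> u64 { 0 }            // every key gets hash 0
    fn write(&mut self, _bytes: &[u8]) {}    // input is ignored entirely
}

#[derive(Default, Clone)]
struct ConstantBuildHasher;

impl BuildHasher for ConstantBuildHasher {
    type Hasher = ConstantHasher;
    fn build_hasher(&self) -> ConstantHasher { ConstantHasher }
}

// Builds a map that still behaves correctly, just slowly: each insert
// and lookup must probe past every existing entry.
fn degenerate_map(n: u64) -> HashMap<u64, u64, ConstantBuildHasher> {
    let mut m: HashMap<u64, u64, ConstantBuildHasher> = HashMap::default();
    for i in 0..n {
        m.insert(i, i * 2);
    }
    m
}
```

Correctness is preserved, which is exactly why a hash function this bad can go unnoticed until the map gets large or an adversary feeds it colliding keys.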
I mean it's kind of pedantic but yes hash maps do have guaranteed performance characteristics. They are not guaranteed to be O(1), but they will never be something like O(2^N) either. You can give guarantees about their performance if you can make guarantees about the distribution of the inputs and the properties of the hash used.
In my own experience, most uses of hash maps are typically O(log n). A good rule of thumb is hash maps are preferable unless you need predictable worst case performance or your hash maps are directly exposed to the public.
> A good rule of thumb is hash maps are preferable unless you need predictable worst case performance or your hash maps are directly exposed to the public.
Also, obviously, features only one of them provides e.g. if you need range queries then a hash table is quite useless. Meanwhile if you need insertion ordering, that’s reasonably easy to do with a hashmap.
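For example, with Rust's std types, a range query is a one-liner on a BTreeMap, while a HashMap simply has no equivalent short of scanning every entry (the function name here is made up for illustration):

```rust
use std::collections::BTreeMap;

/// Returns the names whose keys fall in [lo, hi). BTreeMap::range walks
/// only the matching keys, in sorted order, without touching the rest
/// of the map -- a hash table cannot offer this.
fn names_in_range(m: &BTreeMap<u32, &'static str>, lo: u32, hi: u32) -> Vec<&'static str> {
    m.range(lo..hi).map(|(_, &name)| name).collect()
}
```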
Also if it's big enough that recursing will blow out your stack.
Yes, you can throw the nodes on an array in the heap to allow tail recursion, but at that point you're going to be losing to a hashmap with all but the biggest datasets, and even then the copies really aren't desirable.
> Yes? Also “n-times” seems like pretty fuddy fud. And it seems weird to be more worried by the reallocation of a hashmap when a tree requires an allocation per node, so your average balanced tree requires a heap allocation per entry.
Yes, it requires allocation for every entry whereas a hash map requires it for some entries. You don't know which ones. So real time constraints are out the window. But no it's not something your average web programmer (including me) needs to worry about. Some people do, though.
To clarify, I definitely don't mean I've been implementing these structures by hand! Even when using data structures from the standard library or other dependencies though, I just don't end up needing sort-ordered collections very often in my work, and I thought that was interesting (although sibling comments might indicate that this isn't necessarily typical).