Amazing, servicing *thousands* of users for $150 a month. I'd clearly start writ...

pjscott · on Aug 29, 2013

You don't know the size of their dataset, or the number of queries per second they need to support, or the latency requirements they have to meet. They do, and found that the obvious solution -- prefix queries on B-trees -- was too slow in practice. That their actual measurements trump your sarcasm, and you would know this if you'd read beyond the first paragraph before contemptuously dismissing the whole thing.

_3u10 · on Aug 30, 2013

You don't know that I didn't read the whole article, and since I did I know their dataset is < 8GB because they are running EC2 medium instances and they said they have a budget of $150 which means they have a max of 2 servers. Also because I read the code I know it doesn't write to disk which means the dataset has to fit in RAM unless they are relying on the pager to swap memory for them.

Since an 8 GB dataset is a fucking joke I made fun of their post in a sarcastic way.

Also databases that lock are a fucking joke too, maybe I'll just repeat webscale a few times and that will make it performant.

Should be titled "Startup throws out Mongo DB, gets decent performance"

argusdusty · on Aug 30, 2013

EC2 medium instances come with 480GB space. Our database currently takes roughly 60GB.

Ferret isn't running on this entire database, though - it's for auto-complete searches over our artist names, roughly 4MB.

EDIT: Since I can't seem to reply to fleitz' below comment, I'll post a response here:

1. Boyer-Moore takes linear time. Ferret takes logarithmic time. Also, it's intended for searching over a single string, so using it on a dictionary would require some sort of hack like a termination character, taking slightly more memory and time.

2. A trie was one of the original iterations of Ferret. The reason it lost out was memory usage, requiring quadratic memory from every word (because this is a suffix search, not just a prefix search).

thedufer · on Aug 30, 2013

I don't understand why the trie would take quadratic memory. I would expect it to be high-constant linear. Can you explain?

argusdusty · on Aug 30, 2013

Because a trie works great for prefix searches, but we want substring searches, so we'd have to modify the trie to be more like a suffix tree, in it's simplest form inserting each possible suffix into the trie and doing a prefix search over that. For each word, adding up the length of each suffix gives quadratic memory.

thedufer · on Aug 30, 2013

That's linear. It's quadratic on each word, but that doesn't matter (this is why I mentioned high-constant). It's linear on the size of the data set, which is the important part. I'd be happy to code it up if you'd like. Even a binary tree on all suffixes (which would be somewhat simpler to write) would have the linear memory/logarithmic search time that you're looking for.

Edit: I can't reply to you yet, so here goes:

Calling it O(n.m^2) is, at best, an abuse of big-O notation. Since m is a constant (1), that is large-constant O(n). Which is what I've been saying.

Yes, it increases the size of the data set. That would be the "large-constant" part.

(1) If this is not the case, I would love to hear why.

argusdusty · on Aug 30, 2013

I did say "requiring quadratic memory from every word." I'm not sure you can really call that linear - it's O(n m^2), where n is the number of words, and m is the average word length. It's not the O(n m) that Ferret achieves.

For even a medium sized dictionary of a few dozen MB, you'll find yourself quickly running out of memory. The 4MB dictionary running on the demo would jump to 328MB. Both a Trie and a binary search tree (both of which I've coded and tested) take significantly more memory, and the binary search tree is somewhat slower (try doing a tree traversal between two points as fast as the same traversal over an array).

EDIT: I'll just reply to your edit here for simplicity and because I really don't want to start another thread as I have to get to sleep. I view m as about as constant as n - adding or subtracting words generally changes them both. Try thinking of m as c/n, where c is the total number of characters in your dictionary. O(n m^2) -> O(c^2 / n), whereas ferret uses O(c). The 1,000,000 most frequent English words might have an m of 6-7, but the 100,000 longest English words might have an m of 12-15, taking more memory in a trie, but less memory in Ferret.

But the point is really irrelevant, tbh. We both know how m and n work, and how much memory the trie and Ferret cost for different dictionaries, which is all that should really matter.

_3u10 · on Aug 30, 2013

4MB? Wouldn't boyer-moore, or a trie be very efficient for this?

thedufer · on Aug 30, 2013

boyer-moore doesn't make sense if you're searching the same data set over and over again - it doesn't do any pre-processing on the data, which is why it can't be better than linear time.