Web search 30M rows sub-second, cheaply

arafalov · on Nov 22, 2015

Did you try looking at Solr or Elasticsearch? Inverted index (Lucene) is what they are based on. You even said "search" in your own article....

thedarkproject · on Nov 22, 2015

Would recommend looking into ElasticSearch too. Recently had my first hands-on experiences and I am pretty amazed.

lqdc13 · on Nov 22, 2015

Not sure about 2GB limit though. It doesn't work well at low RAM. Probably better to just use your own suffix tree or some trie for all the stuff you want to search for and use that. I bet you can get it down to something like 100MB.

loopbit · on Nov 22, 2015

Depending on the index size (and type of queries, and schema, and expected performance...), 2GB might be more than enough. Solr does a very good work in caching results, query filters and so on in memory. Apart from that, it's disk based.

Of course, performance won't be stellar if it has to go to disk to find every result, but even with that you should be able to get sub-second results with some tweaking.

endymi0n · on Nov 22, 2015

He didn't really define what he was going to "search" for, so yeah, when meaning full text search, Elastic is probably the way to go. When going for exact lookups, Aerospike will actually make use of his SSDs, when going for slice & dice on wide rows, BigQuery or Redshift will fit his bill - unfortunately he didn't say much about his exact use case. But inventing your own data store is probably one of the worst ideas ever except for gaining knowledge / as a rite of passage.

isoos · on Nov 22, 2015

ElasticSearch (and Solr and Lucene) can be used to do index (and lookup) exact expressions, e.g. string keys/enums, or numbers. It can work with the combination of text search and exact search terms. I've had great results of using these for general purpose search and database lookups.

jgord · on Nov 22, 2015

.. I should give a fair and proper comparison to Lucene / Solr / Elasticsearch .. and indeed Sphinx - but haven't used these much.

I guess I assumed the entry-point is a 2 to 4 node cluster ? But this may be my own wrong assumptions.. they may be really ram efficient.

I confess I have a somewhat irrational dislike of what I call big-java .. I really wanted to like Cassandra+Spark for instance, but the install is just such a heavy download of code, I just feel an aversion to installing that.

I have worked with Java, but not recently.. there may be incredibly efficient and small codebases that belie the big-java stereotype.

Iv'e been enjoying the node.js / npm ecosystem myself - but I know smart people who love java-land. We can tolerate each others rants and still share a beer :]

acdha · on Nov 23, 2015

ElasticSearch is usually less work to install or manage than the average node app – it completely belies the Java stereotypes.

The Lucene-based search engines are also very well optimized. I'd tend to think of a 30M row dataset as a single machine size unless you need failover or the records are quite large.

Can_Not · on Nov 25, 2015

> ElasticSearch is usually less work to install or manage than the average node app – it completely belies the Java stereotypes.

Wow, that is definitely the opposite of true.

wumpus · on Nov 22, 2015

Or Xapian.

pgaddict · on Nov 23, 2015

Sadly the post is very light on important details. For example, which version of PostgreSQL and which kind of indexes? Because GIN indexes do pretty much exactly what he describes as 'tag->id' (but in form that is compressed and very efficient for processing).

jgord · on Nov 23, 2015

You and others make a good case - my post was to summarize what I ended up doing, as an interesting hack / datapoint - not to slam postgres [ or Redis, or Lucene/solr, for that matter ]

The main takeaway is really how fast SSDs are and how much bang for your buck you can get on a single commodity host instance [ thru some creative wrangling of data structures ]

I will endeavor to followup with a fair comparison using postgres...

Can you suggest the best approach / config to keep the top-index working set in ram ?

pgaddict · on Nov 23, 2015

I think the form of the blog post (pretty much just short bullet points) makes it rather difficult to get an idea of what's the intended takeaway.

I do share your conclusion that SSDs are very good value for workloads that need to do a lot of random I/O.

Regarding the best config for PostgreSQL, it really depends on the hardware. If you really want to use boxes with just 2GB of RAM, use smaller shared buffers and let the page cache do the rest.

If you need more help with tuning PostgreSQL for the benchmark later, either ping me at tomas(at)pgaddict.com or join pgsql-performance@postgresql.org (I do occasionally lurk there, but others will surely help too).

jgord · on Nov 23, 2015

yeah.. I started with bullets as a draft, intending to expand into prose, but it seemed borderline readable, so I thought Id leave it as an exercise in blogging.

awesome, thanks Thomas - Ill check out your blog too.

joshu · on Nov 22, 2015

This is one of those "computer science matters" things.

Learn your data structures. Learn what the storage engines use. Use the right structure for the right problem.

jgord · on Nov 22, 2015

..your right.

The spirit of my hack was really just that - to apply the well known data structures and algos to solve a problem.

Its unusual to want fast, large and cheap all together - if you really want to squeeze things you might look at how modern SSD performance can be exploited - my hack is just one instance of that.

lukaslalinsky · on Nov 23, 2015

Most comments here are about using an existing search solution, but sometimes inventing your own solution has a lot of benefits.

I needed something similar when I was building https://acoustid.org/ and for the first iteration I used PostgreSQL's GIN indexes, which are nice and easy to work with, but once the index grew beyond some point, it was getting very slow. I spent some time designing a minimalistic inverted index that does just what I need, nothing else, while compacting the data as much as possible, to keep it in RAM, and the results were just excellent. It was super fast compared to the GIN indxes. Now I have 10x more data and it's still as fast as it was before. If you are interested, here is the code - https://bitbucket.org/acoustid/acoustid-index

jgord · on Nov 24, 2015

Really interesting to see you had a similar use-case...

[ you explain the approach much better than OP :]

fizx · on Nov 22, 2015

So, it seems like someone doesn't realize that postgres supports a non-cobbled-together full text search?

http://www.postgresql.org/docs/8.3/static/textsearch-indexes...

jgord · on Nov 22, 2015

I was aware, but definitely worth reiterating.

My post should have a disclaimer - "unless you have several years using many sql and nosql datastores and some background in algorithms and data-structures - DONT DO THIS"

My reason for _not_ using postgres and its GIST style indexes and full search primitives ( which I like and use often, btw ) was that I could find no way to configure postgres to hold the full top level index in RAM. I could run thru a query over all data to load it into the working set, but ideally I would like to mark it 'preload' - do you know a way to do this ? I did consider putting the inverted index in its own postgres DB on a RAM disk, but it seemed non-recommended.

The spirit of the post was : in most cases yes you can find something good off the open source shelf.. [ particularly redis which was very close to being what I needed ] but sometimes rolling a custom data structure that fits the problem can be a surprising win.

Particularly now that SSDs give a really different performance profile to what we are used to. I think DB implementations will adapt to fit SSD characteristics better over the next couple years.. just as they have made the desktop much more snappy.

New DB engines such as rethink and aerospike are very likely tuned more towards SSDs out of the box .. I haven't spent much time with those yet.

Theres a class of problems can be solved by scaling vertically, if you approach it by creatively applying the basics of algorithms and data structures. This is why many people use redis, its a superb bag of useful data-structures you can co-opt to your needs.

fizx · on Nov 23, 2015

Did you benchmark against GIST? Its hard to tell because you posted very loose benchmarks on a shared VM, but the performance wasn't particularly impressive.

Is there a reason that you're unwilling to trust Postgres's query planner/buffer cache to keep your index in RAM? Even if it didn't, the premise of the article is that SSDs are fast. You're spending an order of magnitude more time retrieving content than searching for it, so why optimize the fast part?

Also, consider that GIST or another fulltext aware index (e.g. Lucene/Sphinx/etc) should compress your index (with vints, etc) far better than your approach, and therefore both query faster and save you precious RAM.

The article itself was sparse on details, but until I learn more, this seems to me to be a classic example of optimizing before measuring.

spyder · on Nov 23, 2015

With a quick google search I found this stackexchange answer to preload data into cache with the pg_prewarm extension (from Postgres 9.4)

http://dba.stackexchange.com/a/116536

But this doesn't guarantees that the data will stay in the cache because: "Prewarmed data also enjoys no special protection from cache evictions, so it is possible for other system activity may evict the newly prewarmed blocks..."

In MySQL you can create tables with the MEMORY storage engine: https://dev.mysql.com/doc/refman/5.5/en/memory-storage-engin...

jgord · on Nov 23, 2015

prewarm extension sounds interesting, Ill take a look.

_3u10 · on Nov 23, 2015

That's the exactly way to 'preload', turn the firewall on at boot, 'preload' by running a query and/or reading the index then open the firewall to normal connections.

nreece · on Nov 22, 2015

I prefer, and suggest, Sphinx over Elasticsearch. It has better performance, easier ETL and handles large datasets better.

boyter · on Nov 22, 2015

Agreed. Elastic search is fantastic, but for simple search that needs to be fast Sphinx is hard to beat. I have been using it for years on searchcode.com and have never hit an issue with it that was not due to my own mistakes.

A fantastic piece of software.

STRML · on Nov 22, 2015

Wow. What a great plug. Loving searchcode.com. Thank you for running such a great service! I have endless frustration trying to search Google or DDG for these things.

boyter · on Nov 23, 2015

Thank you!

jgord · on Nov 22, 2015

I had a look at the obvious things, by no means exhaustive - but I hadn't heard of Sphinx, thanks for the suggestion.

jtchang · on Nov 22, 2015

I recommend elasticsearch as well. Just make sure to firewall it off and be careful opening it up. Elasticsearch a few versions back had a remote exploit.

GlennS · on Nov 22, 2015

In particular, ElasticSearch offers the possibility to include snippets of scripts from a number of languages (e.g. Groovy) in your search. If you do so, the file-access capability of those languages remains available.

I'd be surprised if it's an issue these days, because:

1. It's disabled by default, and has been for a while.

2. If you do turn it on, the config options let you be more specific about where those scripts are allowed to come from.

_3u10 · on Nov 23, 2015

Snatching defeat from the jaws of victory, Postgres right choice, missed tsearch2

Also 1TB of data fits easily in RAM

illumen · on Nov 23, 2015

1TB of ram still costs more than 2GB of ram + SSD