Out of interest, what would you say the weaknesses are of the full-text search i...

pilif · on March 24, 2014

I can't speak for the original poster, but I have three issues with the postgres offering:

1) it has suboptimal support for handling compound words (like finding the "wurst" in "bratwurst"). If the body you're searching is in a language that uses compounds (like german), then you have to use ispell dictionaries which have rudimentary support for compounds and which aren't maintained any more in many cases because ispell has been more or less replaced by hunspell which has far superior compound support which in turn is not supported by postgres.

2) If you use a dictionary for FTS (which you have to if you need to support compounds), the dictionary has to be loaded once per connection. Loading a 20MB dictionary takes about 0.5 seconds, so if you use Postgres FTS, you practically have to use persistent connections or some kind of proxy (like pgbouncer). Not a huge issue, but more infrastructure to keep in mind.

3) It's really hard to do google-suggest like query suggestions. In the end I had to resort to a bad hack in my case.

Nothing unsolvable, but not-quite-elastic search either.

karavelov · on March 24, 2014

1) I do not have experience with languages with heavy use of compound words so no suggestion from me.

2) There is a shared dict extension that load the dict only once. See: http://pgxn.org/dist/shared_ispell/

3) About the suggestions: you should look at pg_trgm contrib extension.

pilif · on March 24, 2014

Thank you very much for 2) - this will allow me to considerably lessen the amount of infrastructure I'm having to take care of (pgbouncer can go now).

About 3: That's what I was using before moving to tsearch, but pg_trgm based suggestions sometimes were really senseless and confusing to users. Also, using pg_trgm for suggestion will cause suggestions to be shown that then don't return results when I later run the real search using tsearch.

I'm happy with my current hack though, so I just wanted to give a heads-up.

arethuza · on March 24, 2014

Thanks - I'd be delighted to get to the point where I have those kinds of problems!

What I will do is bear those limitations in mind and if I ever do have those problems there is, I guess, a reasonable chance that postgres development will have addressed them by them or I will redesign things to use a separate search engine.

JohnBooty · on March 24, 2014

This was a really informative post; thanks!

  If you use a dictionary for FTS (which you have to if 
  you need to support compounds), the dictionary has to 
  be loaded once per connection. Loading a 20MB dictionary
  takes about 0.5

Does this happen for every connection, or just those that use FTS?

I'm thinking of a case where only maybe ~0.5% of queries will use FTS. I'm working on discussion software, and I'd like the discussions to be searchable, but realistically only a small percentage of queries will actually involve full text searching.

pilif · on March 24, 2014

Only happens for connections that use FTS and only if you have configured it to use a dictionary (I believe there's no need to in case of an english body). But keep in mind that this means that you'll have to pay at least 0.5 seconds per search. If you do some google suggest style incremental searching on key press, you'll need to be faster than that.

So either use a persistent connection for full text searching (I'm connecting via pgbouncer for FTS-using connections) or, as karavelov recommended below, use http://pgxn.org/dist/shared_ispell/ which will load the dictionary once into shared memory (much less infrastructure needed for this one)

JohnBooty · on March 24, 2014

Thanks so much for the tip, you saved me some hours for sure!

Totoradio · on March 24, 2014

About your first point, you should look at GiST and GIN indexes, they are made for full text searches: http://www.postgresql.org/docs/current/static/textsearch-ind...

pilif · on March 24, 2014

This isn't a problem of the index type (I'm using GIN in this case), but of the tsearch lexizer which only has limited support for hunspell.

The step that causes the trouble is ts_lexize(), not storing and consequently looking it up.

Again, I made it work for my case, but it was some hassle and involved running a home-grown script over an ispell dictionary I've created by converting a hunspell one into an ispell one.

aragot · on March 24, 2014

1) Subwords

Why not using: MY_COLUMN ~ '.wurst\y.' ? Here is the doc for LIKE, SIMILAR and regexes: http://www.postgresql.org/docs/9.2/static/functions-matching...

pilif · on March 24, 2014

Because using a real full text solution goes much farther than using regexes and LIKE. It also allows the use of indexes which many regexes and many LIKE patterns would not allow.

For example, I can now find the "wurst" in "Weisswürste" which, yes, I could do with a regex, but I can also find the "haus" in "Krankenhäuser" and all other special cases in the language I'm working with without having to write special regexes for every special term I might come up with.

rch · on March 24, 2014

Exactly. The only reason I have to bother with Solr at the moment is to get efficient ngram indexing for sequence data; which consists of lists of 7-8 character strings from various sources. What I have works, but feels like overkill for my case.

Intermernet · on March 24, 2014

Thankyou.

That's an incredibly helpful answer, which provides solutions as well as pertinent info.

petergeoghegan · on March 24, 2014

Full text search is expected to benefit considerably from work in Postgres 9.4: https://www.postgresql.eu/events/sessions/pgconfeu2013/sessi...

These advances within the GIN inverted index infrastructure will also greatly benefit jsonb, since it has two GIN operator classes (this is more or less the compelling way to query jsonb).

elchief · on March 24, 2014

Some differences between Solr and PostgreSQL Text Search:

1. Solr doesn't handle multi-word synonyms (without a hack), PG does. (ex: "Northern Ireland" => "UK")

2. Solr uses TF-IDF out of the box, and PG doesn't.

3. PG is good enough for 90% of cases, but Solr has some advanced stuff that PG doesn't. Like integration with OpenNLP, things like the WordDelimiterFilter. (Andre3000 = Andre 3000)

4. PG is kinda annoying in that it will parse "B.C." as a hostname, even though I want it to be a province.

5. Solr is faster than PG, but PG has everything in one server.

6. Solr handles character-grams and word-grams better.

jeltz · on March 24, 2014

> 5. Solr is faster than PG, but PG has everything in one server.

With the GIN optimizations in 9.4 this need to re-evaluated. Solr will probably still be faster but maybe not enough for it to matter.

bjourne · on March 24, 2014

Another weakness not mentioned is document similarity search. You have one document and you want to find the N most similar in your collection to offer suggestions like Amazon does. You can do it with fts in pg, but it is way to slow to be practical. Custom fts engines are much better at that task.

netghost · on March 24, 2014

If search is your application's main purpose, you'll want something more flexible.

Otherwise, it's fairly easy to implement, and you get full SQL support so joins, transactions, etc. So you can always prototype it and see what limitations you run into.

If you're just getting started with Postrges, I have some examples of using its' FTS search here: http://monkeyandcrow.com/blog/postgres_railsconf2013/

It's targeted at Rails, but I always show the SQL first, so you should be able to adapt it.