Yeah agreed, at the very least the engine should be able to switch internally to this... let's call it "WHERE unrolling"... when asked for "WHERE id IN (x, y, z)".
In the tests I've done, WHERE id IN (x, y, z) was unrolled as WHERE id = x OR id = y OR id = z. That is fine if the list is short, but I had to find a solution when a developer on my team built a dynamic query where the IN list had ~4000 elements. A JOIN solved the problem quickly and easily in that particular case.
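For illustration, a minimal sketch of that rewrite, assuming a hypothetical items table with an integer id column (the names are made up):

    -- Original: a dynamically built, very long IN list
    SELECT * FROM items WHERE id IN (1, 2, 3 /* ... ~4000 more */);

    -- Rewritten as a join against an inline VALUES list, which the
    -- planner can treat like a small table instead of unrolling ORs
    SELECT i.*
    FROM items i
    JOIN (VALUES (1), (2), (3) /* ... */) AS wanted(id)
      ON i.id = wanted.id;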
If you need to do very large IN clauses, one option is a global temporary table where you preload the values; instead of doing an IN, you do a join against the GTT.
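Roughly like this, with made-up names; PostgreSQL's per-session CREATE TEMP TABLE is shown as a stand-in for a true GTT (in Oracle you'd create a GLOBAL TEMPORARY TABLE once up front):

    -- Session-scoped staging table for the lookup keys
    CREATE TEMP TABLE wanted_ids (id int);

    -- Preload the values (in practice via COPY or a batch insert)
    INSERT INTO wanted_ids VALUES (1), (2), (3); -- ...

    -- Join instead of IN
    SELECT i.*
    FROM items i
    JOIN wanted_ids w ON i.id = w.id;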
The best I can understand is that the engines are (or were) just not set up to do repeated single-key lookups, which is really the right strategy for a WHERE ... IN. As a result, they do some kind of scan, which looks at a lot more data.
The problem is that these repeated single-key lookups are random I/O for the database engine. So the engine has to predict a threshold for when a scan or random I/O is cheaper, which is very hard to get right, and your I/O layer changes this threshold drastically. A spinning disk may always be faster with sequential I/O, while flash-based systems show a wide variety of performance profiles.
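You can watch the planner make that call with EXPLAIN; a sketch against the hypothetical items table from above:

    -- A short list: usually cheap enough for individual index lookups
    EXPLAIN SELECT * FROM items WHERE id IN (1, 2, 3);

    -- A huge key set can tip the estimate over the threshold,
    -- flipping the plan to a sequential scan plus hash join
    EXPLAIN SELECT * FROM items
    WHERE id IN (SELECT generate_series(1, 100000));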
To tackle this problem, PostgreSQL has a setting where you can tune the cost factor for random I/O.
As an aside, this setting is almost certainly too high for you by default. The default of random_page_cost = 4 assumes that random pages are four times more expensive to read than sequential pages, but with SSDs the real cost is much, much lower, and chances are you're using an SSD these days.
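Checking and adjusting it looks like this (1.1 is just a commonly suggested value for SSDs, not gospel):

    -- Check the current value
    SHOW random_page_cost;

    -- Try a lower value in the current session only
    SET random_page_cost = 1.1;

    -- Or make it permanent server-wide and reload the config
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();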
I managed to speed up queries by about 10x by just adjusting this. One Weird Trick To Speed Up Your PostgreSQL And Put DBAs Out Of a Job!
Thanks for the hint. I wonder where the value of 4 comes from. Is this an old value derived from a (then) fast RAID across many spinning-rust disks? As you pointed out, today, using SSDs, the cost should be lower; I'd think in some cases (e.g. here, a DB used for the catalog of a backup system) the backing store is exactly one spinning-rust drive, where I'd expect a much higher cost for random access: roughly 80 MB/s sequential vs. 100 seeks/s × 8 KiB per page ≈ 0.8 MB/s random, a ratio of about 100.
Ah, [1] has the answer: they acknowledge that random accesses might be much slower, but expect a good share of accesses to be satisfied by the cache (that'll be a combination of various caches).
I'd be delighted to learn what's going on.