Speeding up SQL queries by orders of magnitude using UNION (foxhound.systems)
402 points by tate on March 20, 2021 | 156 comments


Huh. Like other commenters, I’m surprised to see a multi-part JOIN be the answer for “I want the results of Query A as well as the results of Query B”.

If anything, I think many folks would say “whatever, make the two requests in parallel and zip them together in the application”. If not, WITH expressions to make the two independent queries and then a final SELECT to merge them however you prefer would be reasonable (whether that’s a join flavor or UNION/INTERSECT/etc.), but the super explosion JOIN just doesn’t cross my mind.

I think there must be a different example that they tried to anonymize for this post, otherwise they don’t need the IDs (and probably want a sum of quantity). If the author is around, what was the more realistic example behind:

“In order to audit inventory, the logistics team at the corporate headquarters requests a tool that can generate a report containing all meal_items that left a given store’s inventory on a particular day. This requires a query that includes items that were both sold to customers as well as recorded as employee markouts for the specified store on the specified day.”?


Where this fails (and I don’t see the article addressing this concern) is if you want to do pagination. Say you are retrieving page 17 of 8000. Will the UNION still be performant then? After all, it will require a stable sort order that may depend on one of the branches. You certainly can’t paginate without doing a whole lot of logic in your application if you use two separate queries. My typical solution to this is to do one query for employee IDs, another one for customer IDs, and then do an IN clause. Or you could use an intermediate ID of some kind that will collapse your diamond branch into a single IN.


In my experience this is the number one driver for infinite scroll, dropping paging simplifies a lot


Having infinite scroll doesn't avoid paging results from a query. Like the OP mentioned, if you have 8000 page * results/page records, you may only want to retrieve 50 records at a time to pass back to the client. In that case, you will still need a way to query for the next set / page. Infinite scroll is a UX implementation.


Yes - very true, however when we used paging we found users were going "I know my results are on page 19, skip to there" but when they were faced with the task of scrolling to page 19, their behaviour changed and they used filtering


Did that make the user experience worse or better?


Haha no comment


> "I know my results are on page 19, skip to there"

What on Earth is your application where that's the case? ..A forum maybe? (Revisiting a thread looking for a particular reply I mean.)


Have you ever used pornhub?


Document registry, but I've heard users doing this in just about every application that allows it - just doing a broad search and memorising the page that the thing they actually want is on


Was about to say, either doc registry, or library catalog.

That's the neat thing about consistent search indexing that isn't being maliciously subverted or tweaked. Once a "good enough" structure is achieved, generally someone can quickly find their way around. I call it the Librarian effect. I first grokked its impact when I'd volunteer to reshelve books in the library. Given a relatively ordered cart of books to put back on shelves, and an arbitrary starting point in the library to begin restacking (imagine going to the bathroom and some student comes and moves your cart on you), once you've done it a few times, you can usually unconsciously navigate the entire data store.

Compare this to most Search Engines, Reddit, or most Internet forums, where you have people maliciously attacking your overall index structure in a constant conflict for top page rank, or constantly shifting page numbers based on time elapsed since last reply/engagement sorts.

Swear to God, might one day make a Dewey Decimal System-like search engine. I've not encountered any other way of organizing information that seems intuitively mappable quite like it. Though I'll admit, I'm not sure what gives it the "stickiness" for me, the numbers, or the physical location. I'll preferentially choose a library whose physical layout I'm familiar with over one I don't, but if you take that physical layout and switch around all the numbers, I'll find you, and beat you with the heaviest reference volume in the Library once I find it.

Seriously though, even if you throw me in an unfamiliar layout, I'll generally quickly refresh on general identifiers then do a quick jog through the Library to nail down the layout. Haven't done many non-Dewey libraries, and hate the ones set up like bookstores where you don't have interleaved levels of sorting.


If users often search for known things then why not optimize the engine/UI to the use case instead of removing the use case?

Like, remembering the association <search term - selected results> and prominently displaying these?


Search results were boosted based on usage, but the performance benefits of infinite scroll outweighed the use case


Yes it can - if you do it right. When you do paging, the database needs to count past n·50 or whatever items:

  LIMIT <page_size> OFFSET <page_number>*<page_size>
Even with reasonable indexes, the OFFSET is the killer. When doing infinite scroll (or variable-size paging by predetermined items), you can ask for the next batch with a greater-than filter:

  SELECT ... where data > last_item ..
  LIMIT 50; -- no OFFSET
Note that you should probably use query-by-data even if performance is not a concern, as row addition/removal create artifacts in infinite scroll.
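
For example (table and column names here are made up), a stable keyset query usually needs a unique tie-breaker column in both the sort and the filter; in PostgreSQL a row-value comparison does this neatly:

  SELECT id, created, title
  FROM articles
  WHERE (created, id) > ('2021-02-03 10:00', 1234)  -- values taken from the last row of the previous page
  ORDER BY created, id
  LIMIT 50;                                         -- still no OFFSET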

That being said, from user POV, I usually don't like infinite scroll - it can be time consuming, as you usually can't binary / guesstimate search the result set.


> If anything, I think many folks would say “whatever, make the two requests in parallel and zip them together in the application”

That was my first reaction.

Keep in mind that high load applications have limited database connections. Running queries in parallel will be an anti-pattern in such applications.


Keep your matching logic simple.

With SQL, matching is (by far) the most computationally expensive bit of most queries. Your goal should be to use only sargable[1] operators, and eliminate any functions, in all your WHERE clauses and JOINs.

Complicated matching criteria will force the engine to make tough trade-off decisions, which can be affected by out-of-date table statistics, lousy indexes, and a million other things, and will often force the engine to rule-out more optimized algorithms and fall back to the cruder, primitive algorithms.
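
As a sketch (orders/created_at are placeholder names): wrapping the column in a function hides it from an index on created_at, while expressing the same filter as a range on the bare column keeps it sargable:

  -- not sargable: the function call prevents an index seek on created_at
  SELECT * FROM orders WHERE CAST(created_at AS date) = '2021-02-03';

  -- sargable: same rows, expressed as a range on the bare column
  SELECT * FROM orders
  WHERE created_at >= '2021-02-03' AND created_at < '2021-02-04';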

[1] For most queries, "OR" is not usually a sargable operator: https://en.wikipedia.org/wiki/Sargable


> Your goal should be to use only sargable[1] operators, and eliminate any functions, in all your WHERE clauses and JOINs.

Or just use indexes properly; it's fairly common for modern databases to support indexes on functions of column values; using an indexed (deterministic) function in WHERE/JOIN criteria isn’t problematic. The Wikipedia article you link addresses valid concerns, but the specifics it suggests don't seem accurate for the features of modern RDBMSs.


Also common for ancient RDBMSes


Really seems like you don’t know what sargable means (and didn’t bother to find out if this is your reply).


I think it might be the other way around.


No, its pretty clear from reading his entire response that he understands exactly what sargable means. "just use indexes properly" is pretty ambiguous, but the rest of his comment provides context. It doesn't make much sense to me to be indexing on fn(x) for predicates unless an alternative sargable predicate is too complex or not possible, but that doesn't mean that it's never the best solution.


> unless an alternative sargable predicate is too complex or not possible...

As you note, functional indexes can sometimes be the best solution - some common examples: when working with different geospatial projections in PostGIS; when building a full-text index using `to_tsvector`. For anyone who works with relational databases that support this, definitely a good tool to have in your toolkit!
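
For example, in PostgreSQL an expression index like the one below lets a full-text predicate use the index, as long as the query repeats the same expression (documents/body are placeholder names):

  CREATE INDEX documents_body_fts_idx
      ON documents
      USING GIN (to_tsvector('english', body));

  SELECT id
  FROM documents
  WHERE to_tsvector('english', body) @@ to_tsquery('english', 'union & sql');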


Shouldn’t OR operations in an otherwise sargable[0] query just constitute merging two sargable queries? I don’t see why OR should affect the property

[0] new term for me, and wow do I not like it


This depends on the RDBMS engine. Even if it might make sense, the planner could go with nested loops because of statistics and cost estimation.

In the article, the devil is in how the second query explodes in size. We are suddenly cross-joining employees with customers. A left join of stores, employees and markouts gives you more than 2 rows. It gives you all the employees assigned to the store, regardless of the fact that only 2 have markouts.

Next, you join the full customers table with it, and keep on multiplying the size by further left joining. The result size might quickly exceed the memory limits for efficient joins, and the RDBMS might have to resort to poor nested loops.


The problem with the author's slow query is that it joins 7 tables (forcing the SQL engine to compute the cartesian product of 7 tables); this is the primary source of the slowdown.

The faster query with UNION joins 4 tables and unions that with a 5-table join, so the faster query is two orders of magnitude faster, because the SQL engine has to compute a cartesian product over 5 tables instead of 7.


Author here. Between this and your other response, where you expound on the same point, I think you're being far too hand wavy about what causes performance issues. The number of joins alone doesn't have much to do with the performance characteristics of any query.

What's more important for performance of queries on larger data sets than the number of joins is that there are indexes and that the query is written in a way that can utilize indexes to avoid multiplicative slowdowns. The reason the UNION query is fast is because the query on either side effectively utilizes the indexes so that the database engine can limit the number of rows immediately, rather than filter them out after multiplying them all together. I can expand this schema to have a UNION query with two 10-table joins and it would still perform better than the 7 table query.
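
To make that concrete, indexes along these lines (illustrative only; the actual index definitions aren't shown in this thread) are the kind of thing that lets each branch of the UNION seek directly to its rows:

  CREATE INDEX ON employees (store_id);
  CREATE INDEX ON employee_markouts (employee_id, created);
  CREATE INDEX ON customers (store_id);
  CREATE INDEX ON customer_orders (customer_id, created);
  CREATE INDEX ON customer_order_items (customer_order_id);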

I think someone new to SQL is likely to read your statement and think "okay joins are slow so I guess I should avoid joins". This is not true and this belief that joins are slow leads people down the path that ends at "SQL just doesn't scale" and "let's build a complicated Redis-backed caching layer".

SQL performance is a complex topic. The point of our post was to illustrate that a UNION query can simplify how you join your tables and allow you to write constituent queries that have better performance characteristics. Morphing this into "the number of joins is smaller so the performance is better" is just incorrect.


I think it's good to note in the article that the second query is slow because the RDBMS doesn't use indexes (and which ones). Currently, the text hand-waves the problem and moves on.

If the article had, instead, listed indexes, shown they were used in simple cases, shown they weren't used in the second query, dug into why they weren't (maybe they were but it was still hella slow) - that would be a ton of value!


Sorry if I was hand-wavy. I was trying to give other people a simple framework that I use myself (Big O calculated as the number of rows in each table). Although I don't know how many rows you have in each table, the Big O framework still works, because the cartesian product of the tables is dominated by two huge, different datasets (customers' orders being joined to employees' orders). The UNION simply calculates them separately, rather than doing a cartesian product of two completely different datasets that represent different entities.

Well-written predicates and indexes can help, just as poorly written predicates can make it worse. So there is a balance. This is not a shortcut or a silver bullet; it is a trade-off being made. More indexes -> faster selects and slower updates/inserts. One bad index -> a failed insert and possibly losing customer data (happened to me once).

I agree with you that "SQL performance is a complex topic" and one should definitely study the query execution plan to understand the bottlenecks and make optimization decisions.


Sql queries never really adhere to that simple Big O analysis though. Sure, if you had zero indexes so every operation was nested sequential scans, then it’s Cartesian, but that’s never the case. Most often you get really fast index lookups, by design.


I invite you to share the execution plans of queries #2 and #4 with the public so that people can decide what is actually slowing things down, whether it is really the cartesian product or something else.


The author didn't mention the reason the query is slow.

I dare to challenge the claim that the non-optimal query is something that a person with advanced SQL skills would write. A seasoned engineer knows that by joining two tables, you are looking at, worst case, NxM comparisons.

The problem is that you are joining the customers table with THREE other tables - the merged result of stores, employees and markouts. No matter how hard you try, you can't escape the fact that you are joining way more rows than just stores and customers.

I would call it a layman issue that SQL beginners learn the hard way. Advanced SQL users know the very basics of the db they are working with.


I don't even understand the thought process that would have led to the non-optimal query. It would simply never occur to me to use anything other than UNION ALL for this kind of scenario, because it is the only appropriate translation of the english-language business request into a SQL query.


After working 20+ years in the industry I'm always surprised by the aversion (or lack of thought) to draw up a Venn diagram on a white board to visualise the relation, let alone then write it out using set notation.


It could be that the query was gradually modified. It's pretty simple to spot the union when the problem's being focused on, laid out nicely in a blog post, and the answer already given!


Probably the result of some early attempt at either sorting or removing duplicates


> I would call it a layman issue SQL beginners learn the hard way. Advanced SQL users know the very basics of the db they are working with.

So?

Does every blog post have to be for expert users of the system? What's wrong with a blog post that explains mistakes that a beginner might make, and a method for mitigating them?


Completely nothing is wrong with it! That's not what I'm arguing with. I'm challenging the statement in the Conclusion section:

> Arriving at query #2 to get the combined results was the intuitive way of thinking through the problem, and something that someone with intermediate or advanced SQL skills could come up with.

The issue in question is simple. It's a very good lesson for people early in their SQL journey. It would be a superb read if the author had dug into the details of why it's slow.

But advanced engineers should know that LEFT JOINing 6+ tables will be huge. I disagree that it's a mistake an experienced person would make.


Ah, fair play. I must’ve missed that part of the article, or just didn’t have it top of mind when reading your comment.


I think an advanced SQL user would probably notice that the entire problem here is caused by an incorrectly normalized schema. An order has to be placed by a customer, which is usually a type of person, but sometimes the customer person has a status of staff member, which in the example usecase was denormalized into a different table, resulting in the naturally inefficient access pattern.


This is the ABC of SQL; every SQL developer has to learn this lesson sooner or later: every additional JOIN you make creates a cartesian product with all previously joined records and the number of rows grows exponentially => you join tables A to B to C to D -> the SQL engine has to create AxBxCxD rows (each letter representing the number of rows in that table).

Of course specifying useful predicates that filter down the dataset and having indexes on these columns helps avoid row explosion, but you can't have an index on each and every column, especially on an OLTP database.

In example #2 the author joined 7 tables (multiply the row counts of the 7 tables and this gives you an idea of how many rows the SQL engine has to churn through) - the big O complexity of the query is O(AxBxCxDxExFxG), so of course it will be SLOW.

In example #4 he joins 4 tables and unions that with a 5-table join, so the big O complexity of the query is AxBxCxD(1+E).

Same as in programming, there is Big O complexity in your SQL queries, so it helps to know the O() complexity of the queries that you write.

TLDR: learn the SQL Big-O and stop complaining about the query planner, pls


> Every additional JOIN you make creates a cartesian product with all previously joined records and the number of rows grows exponentially => you join tables A to B to C to D -> the SQL engine has to create AxBxCxD rows (each letter representing the number of rows in that table)

I think a cartesian product would be:

SELECT * FROM A, B; Given sizes a, b the resulting number of rows would be a X b or O(N^2).

With a join, SELECT * FROM A INNER JOIN B ON A.id = B.id, the resulting number of rows would be MAX(a,b), or O(N).

A JOIN is a predicate that filters down the dataset.


All queries with JOINs can be rewritten as cartesian products: select * from A join B on col1=col2 will be rewritten by the query optimizer into:

select * from A, B /* this is cartesian that slows down your query */ WHERE a.col1=B.col2

in fact both queries produce the same execution plan if you check yourself


I think the point is rather that a join with a condition (which is the norm) is almost never actually executed as a cartesian product. take for example from tfa

  select * from customer_order_items
the grain is one row per customer order item, and the next two joins don't change that

  select * from customer_order_items join customer_orders on order_id
the grain is still one row per customer order item

  select * from customer_order_items join customer_orders on order_id join customers on customer_id
the grain is still one row per customer order item

..etc..

of course later on they screw it up and take the cartesian product of customer_order_items by employee_markouts, but it's just 2 big factors not 7 - their query did finish after a few seconds. usually mistakes involving cartesian products with 7 factors just run for hours and eventually throw out of memory.


All queries could technically be expressed as a cartesian product but that is not necessary and not what happens in practice. Both of the above might produce the same plan because they are both treated as joins. One is expressed as an explicit join, the other as an implicit join, but neither requires the query engine to produce a cartesian product on execution. If it did, queries I run in seconds would require more storage or RAM than probably exists in my entire workplace, and I'm not using anything that would usually be considered "big data".


>>queries I run in seconds would require more storage or RAM than probably exists in my city

My explanation for this: CPUs are very fast. My CPU is 4GHz so a single core can do a lot of computations in one second, and a SQL engine is smart enough to make the cartesian computation (as part of a query plan) and discard the result if a row does not meet the predicate condition.

In fact I agree that not the entire cartesian product is being computed, if you specify enough predicates. But the query still multiplies rows. In the author's article he is joining employees when the customer_id column is NULL, so this is technically a cartesian product, because a NULL predicate is not very selective (= there are a lot of rows with a value of NULL).


You are not using the term cartesian product correctly as applied to how database engines execute queries. Multiple people have detailed the problems with your ideas on how such things work. A fast processor cannot overcome the need for multiple terabytes of RAM that would be required to process common queries if databases worked as you describe. Databases are significantly more likely to get bottlenecked by RAM than CPU, and your incorrect understanding of how databases work would require as much RAM as a large "big data" cluster to run queries on moderately sized data sets. Then even if it had that RAM, the bandwidth of the system's bus channels would be incapable of transporting it fast enough to return results in 500ms. Certainly not on my system with a few dozen other people working at the same time, and especially not when I have to query the live production system with a few thousand end users instead of the data warehouse. Databases do not work this way.


Multiple people can be wrong, so that is not a valid argument. I can write 10 nested loops and overwrite the value of a single CPU register with a single value so many times that it will exceed any big data cluster's capacity. This is what I meant by CPUs are fast.

If you study engine internals, especially MergeJoin, HashJoin, NestedLoopJoin - they all do compute a cartesian product while simultaneously applying predicates. Some operations are faster because of sorting and hashing, but they still do a cartesian product.


If they are applying predicates to reduce the number of rows processed then they are not performing cartesian joins. You don't need to take the word of people here, you merely need to read any source about how database engines process queries. I am of course open to the possibility of being wrong should you find authoritative sources that show cartesian joins produced for all queries. However, nearly two decades of working with a variety of engines tells me you are unlikely to find such a source. Your comments betray a fundamental lack of understanding on the topic. Your unwillingness to recognize the exponential resources required for your understanding to be correct also betrays a fundamental misunderstanding. You even contradict yourself by in one comment insisting that such queries would be "SLOW" while in another comment stating that you obtain speedy results due to a fast CPU. Which in itself betrays yet another fundamental misunderstanding in how databases utilize resources and where bottlenecks arise.

Last, you have offered no significant support for your claims, and as the initiator of the discussion making the claim of cartesian joins, the burden of proof is on you to provide evidence for your claims.

In any case, your persistence in digging your hole deeper and deeper on this issue is embarrassing, and I won't be cruel enough to enable that further. Feel free to reply, but I am done, with a final recommendation that you continue your learning and not stop at the point of your current understanding. You at least seem earnest in your interests in this topic, so you should pursue it further.


> if you study engine internals, especially MergeJoin, HashJoin, NestedLoopJoin

it's called merge join because it employs the merge algorithm (linear time constant space)[1] - this can only be used when the inputs are sorted. likewise the hash join entails building a hash table and using it (linear time linear space)[2] but the inputs don't have to be sorted. the point of these is to avoid the O(N*M) nested loop

[1] https://en.wikipedia.org/wiki/Merge_algorithm

[2] https://en.wikipedia.org/wiki/Hash_join


Umm.. virtually no join on tables with more than a handful of rows is done as a cartesian product. A suitable set of columns is used to sort or hash each side of the join, such that the actual join logic is basically linear performance (each row in table A is probably only ever paired with approximately 1 row in table B when evaluating the predicates). A cartesian product would involve testing the predicates on every combination of A and B (each row in table A being paired with each row in table B)

(Note when I say linear performance, I mean the join predicate is executed a linear amount of times, but the initial sort/hash operation was probably more like O(n log n))


Cartesian product means a very specific thing - testing the predicates on every possible combination. This is not what occurs in practice


? not how it works behind the scenes at all


> you join tables A to B to C to D -> SQL engine has to create AxBxCxD number rows

this isn't really true. not every join is a cross join, and not every join even changes the grain of the result set. it is important to pay attention when your joins do change the grain of the result set, especially when they grow it by some factor - then it's more like a product as you say.


all queries with JOINs can be rewritten as cartesians: select * from a join B on col1=col2 will be rewritten by query optimizer into: select * from A, B /* this is cartesian that slows down your query */ WHERE a.col1=B.col2

in fact both queries produce the same execution plan if you check yourself


the query is not the only input into the plan generator, the other inputs are the schema (with constraints and indexes) and their statistics - so you can't really say "check the plan for yourself" without all the inputs. and the only time the plan for "select * from A, B" is going to be the same as "select * from A, B where A.col1=B.col2" is when col1 and col2 are both neither primary keys nor foreign keys and that would be a weird condition under which to run such a query.


pls reread my post, I wrote that the query select * from A join B on col1=col2 has the same plan as select * from A, B where col1=col2

this is true regardless of schema and statistics, it is simply how the query planner works under the hood

and you can check it yourself on any database


Those are both inner joins, just a different syntax for writing them.


They might be logically equivalent, but they are not identical if more than two tables are joined. For instance, the current Postgres docs say [1]:

> This latter equivalence does not hold exactly when more than two tables appear, because JOIN binds more tightly than comma. For example FROM T1 CROSS JOIN T2 INNER JOIN T3 ON condition is not the same as FROM T1, T2 INNER JOIN T3 ON condition because the condition can reference T1 in the first case but not the second.
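
A quick way to see the difference (t1/t2/t3 are placeholder tables with a column a):

  -- valid: the ON clause can reference t1
  SELECT * FROM t1 CROSS JOIN t2 INNER JOIN t3 ON t3.a = t1.a;

  -- error: JOIN binds more tightly than the comma, so t1 is not visible to the ON clause
  SELECT * FROM t1, t2 INNER JOIN t3 ON t3.a = t1.a;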

[1] https://www.postgresql.org/docs/current/queries-table-expres...


I don't understand what you mean by saying each join creates a cartesian product. Join criteria avoid this. If I left outer join 7 rows on table A and 5 on table B and join on a unique key then I still only get 7 rows, not the cartesian product of 7x5 rows. I query across a dozen or more tables on a regular basis with aggregates, window functions, with statements, having clauses and so on. Performance is rarely a problem.


Something that helps with the exploding cartesian problem is using json_agg for the joined tables. It's available in Postgres, and it works great.
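
A minimal sketch using the article's customers/customer_orders tables: the result stays at one row per customer, with the joined rows folded into a JSON array instead of multiplying the row count:

  SELECT c.id,
         json_agg(co ORDER BY co.created) AS orders
  FROM customers c
  LEFT JOIN customer_orders co ON co.customer_id = c.id
  GROUP BY c.id;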


Your understanding is flawed, these aren’t cross joins.


I've found denormalization to be helpful for such performance issues in that it reduces the number of joins. It can also be useful for debugging if you want to keep track of values from other tables at the time that data was processed within the primary table.


A certain amount of denormalization can be helpful, but often there's a trade-off in terms of maintainability.


Not who you replied to.

But yes,

As an example, we keep our data normalized and add extra tables to do the operations we need. It’s expensive since we have to precompute, etc. But then on lookups it’s quicker.

Like everything, it depends on what you want to optimize for.


Yep, in my case we wanted to make user-facing queries fast (i.e. a reports page). The extra complexity (not much; we just have to remember when to update certain values in other tables when they change) is definitely worth it since load time dropped from 30 seconds sometimes to almost instant.

Denormalizing everything is definitely a pain; keeping data in sync in mongodb (which lacks traditional joins) on a previous project was not fun; now using postgresql.


denormalization should only be seen as a last resort


denormalization aka memoization aka "dynamic programming of SQL" :-) - all three provide a speed-up because they compute a value once and reuse it later


Except that your source of truth aka your raw data is now memoized, hardly a win. This is not a goal of memoization or dynamic programming at all.

Perfect for a reporting database though.


Just to note, the easiest heuristic I've been using to figure out when optimization might be useful is when you have an OR clause on more than one table. Column A or B is fine, but Table A column A or Table B column A is gonna cause interesting problems.

Also:

union = distinct values

union all = no distinct
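
Quick illustration:

  SELECT 1 AS n UNION     SELECT 1;  -- one row: duplicates removed (needs a sort/hash to dedupe)
  SELECT 1 AS n UNION ALL SELECT 1;  -- two rows: duplicates kept, no dedup work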


Important difference between union and union all, especially if you don't query a unique id column


The author doesn't describe why they thought a cross product of two large tables, using a group by to remove duplicates, would be a good basis for this query.


The author(s?) almost certainly have no idea what they're talking about. Wtf are "diamond schemas" and "multi-branch joins"? This blog is the sole google result for that made up nonsense that's supposedly there to imply that it's somehow normal to put an OR expression with columns from two different tables into a join clause. They're using select * in a group by query with no aggregates. And they're talking about keywords as operations as if their SQL is directly executed and doing stuff rather than merely being the input for a plan generator (this despite the first paragraph - maybe one of them knows better than the other?).

They say that query #2 is "the intuitive way of thinking through the problem, and something that someone with intermediate or advanced SQL skills could come up with" and is slow because it "eliminated the optimizer's ability to create an efficient query plan" - and they're wrong on both counts. That query is a nightmare no advanced SQL expert would write, and the plan that was generated was probably perfectly optimal to give them what they asked for. They just don't realize they asked for ridiculous nonsense that accidentally matches their expected results.


> The author(s?) almost certainly have no idea what they're talking about.

> They're using select * in a group by query with no aggregates.

The author wrote something akin to

SELECT A.*, B.x FROM A JOIN B GROUP BY A.id, b.x

This is perfectly valid SQL, as per the SQL standard: "Each <column reference> in the <group by clause> shall unambiguously reference a column of T. A column referenced in a <group by clause> is a grouping column". From the SQL-92 reference, section 7.7, http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

When using offensive language, you'd better be sure of the technical quality of what you write, if you don't want to show an ugly portrait of yourself.


> This is perfectly valid SQL

no, it's not. this query only runs because postgres is not compliant with SQL-99 T301. the query is unambiguously invalid under SQL-92.

> When using offensive language

fwiw I regret my tone but this article deserves to be criticized. I love and respect rdbms technology and it's exciting to see anything "SQL" in a headline on HN - but then it bums me out to see bogus terms tossed around and insane queries presented as normal. if this were some beginner blog writing about lessons learned that's one thing, but this is a professional consultancy firm writing with the air of "you can trust us, we're experts".


A.* assumes every column in A is functionally dependent on A.id. This may be the case, but it is a huge source of bugs I have seen related to MySQL, and it would set my spidey sense off if I saw it in code. The problem is that older versions of MySQL allowed non-functionally dependent columns in the select list and would randomly put one of the values in the result set (every other sane database at the time properly errored out). At this point, I would consider it bad practice to run MySQL in anything other than only_full_group_by mode.
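
A hypothetical example of the failure mode (orders is a placeholder table where status is not functionally dependent on customer_id):

  -- rejected when ONLY_FULL_GROUP_BY is enabled (the default since MySQL 5.7.5);
  -- older defaults silently returned an arbitrary status per customer
  SELECT customer_id, status
  FROM orders
  GROUP BY customer_id;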

See here: https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.ht...

And here: https://www.percona.com/blog/2019/05/13/solve-query-failures...


Everyone the store is selling to is a customer; an employee is a customer too, except the price might be zero. What if a field is added to the customers table: employee_id (it would be null by default as most customers would not be employees). Then the query can be run without any union, bypassing stores/employees/employee_markouts etc:

    SELECT
        meal_items.*,
        customer_id,
        customers.employee_id
    FROM customers
        INNER JOIN customer_orders
            ON customers.id = customer_orders.customer_id
        INNER JOIN customer_order_items
            ON customer_orders.id = customer_order_items.customer_order_id
        INNER JOIN meal_items
            ON customer_order_items.meal_item_id = meal_items.id
Of course this means that customer_id will never be null but that kind of is the point.
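
The schema change being proposed would be roughly this (column type and constraint are assumptions):

  ALTER TABLE customers
      ADD COLUMN employee_id integer REFERENCES employees (id);  -- NULL for regular customers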

This can also be approached by adding a customer_id field to the employees table instead (in this example the no. of employees and customers are both 20000 but typically the no. of employees is less than the no. of customers so adding customer_id to employees table would be more typical - in which case the query would obviously change).


I think this is called denormalization, right? What you could try, instead of modifying the domain for this specific use case, is to create a read model (if you don't want to pollute your domain). Sometimes it's very easy to write data, but reading is a pain. For such cases, you can prepare read models that contain all the data necessary.


This is not specifically denormalization because I'm not combining/flattening two models - instead, I'm moving the special treatment for employees from a totally separate path to a property of the existing customer model without removing/merging the employee model. Regarding read optimizations, there are several ways, some of which are vendor specific, e.g. PostgreSQL/Oracle allow materialized views which can make accessing this data simple (if often required, though they have their own issues).


UNION is also helpful for reducing round trips between the db client and db server.

I've run into lots of cases where a SQL join is clear to write but slow, while a simple query to get ids plus a union of fetches for each id is fast. Again, a union of single-id fetches is often faster than a single query with in (x, y, z). We can wish the SQL engine could figure it out, but my experience from a few years ago is that they usually don't.
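
For example (items is a placeholder table), instead of one IN list each branch gets its own index lookup:

  SELECT * FROM items WHERE id = 17
  UNION ALL
  SELECT * FROM items WHERE id = 42
  UNION ALL
  SELECT * FROM items WHERE id = 99;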


Does anyone with enough db experience know why the engine can't figure it out? Is it easy to explain?

I'm not a db expert, but this seems like something that should be solvable.


Yeah agreed, at the very least the thing should be able to conceive of switching to this... let's call it "WHERE unrolling" internally, when asked for "WHERE id IN (x, y, z)"

I'd be delighted to learn what's going on


That one is annoying. IN tends to be implemented as a loop. I've sped many queries up by putting an IN list into a temporary table, then joining against it.

I have actually seen it be a win to combine it like this:

    SELECT ...
    FROM foo
      JOIN (
        select 'a' as value
          union all
        select 'b' as value
          ...
      ) bar


In the tests I've done, WHERE id IN (x,y,z) was unrolled as WHERE id = x OR id = y OR id = z. It is fine if the list is short, I had to find solutions when a developer in my team built a dynamic query where the IN list had ~ 4000 elements. A JOIN solved the problem fast and easy in that particular case.


if you need to do very large "IN" clauses, one option can be a global temporary table where you preload values, so instead of doing an IN you do a join against the GTT.
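
Roughly like this (a PostgreSQL-flavoured sketch of the same idea, with a plain temporary table and placeholder names):

  CREATE TEMPORARY TABLE wanted_ids (id bigint PRIMARY KEY);

  INSERT INTO wanted_ids (id) VALUES (1), (2), (3);  -- thousands of values in the real case

  SELECT f.*
  FROM foo f
  JOIN wanted_ids w ON w.id = f.id;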


Where unrolling is an excellent term, thanks.

The best I can understand is that the engines are (or were) just not set up so they can do repeated single-key lookups, which is really the right strategy for a WHERE IN. As a result, they do some kind of scan, which looks at a lot more data.


The problem is that these repeated single-key lookups are random IO for the database engine. So the database engine has to predict a threshold for when a scan or random IO is cheaper, which is very hard to get right, and your IO layer changes this threshold drastically. A spinning disk may be faster all the time with sequential IO, and for flash-based systems there's a wide variety of performance profiles.

To tackle this problem, PostgreSQL has a setting where you can tune the cost factor for random IO.


As an aside, this setting is almost certainly too high for you by default. The default of random_page_cost=4 assumes that random pages are 4 times more expensive to read than sequential pages, but with SSDs this cost is much much lower and chances are you're using an SSD these days.

I managed to speed up queries by about 10x by just adjusting this. One Weird Trick To Speed Up Your PostgreSQL And Put DBAs Out Of a Job!
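
For anyone who wants to try it (the value is illustrative; tune it for your storage):

  SET random_page_cost = 1.1;               -- per-session, to experiment

  ALTER SYSTEM SET random_page_cost = 1.1;  -- or persist it for SSD-backed storage
  SELECT pg_reload_conf();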


Thanks for the hint. I wonder where the value of 4 comes from. Is this an old value derived from a (then) fast RAID across many spinning rust disks? As you pointed out, today --- using SSDs -- the cost should be lower; I'd think in some cases (e.g. here a DB used for a catalog of a back-up system) the backing store is exactly one spinning rust drive, where I'd expect a much higher cost for random access (about 80MB/s / (100/s*8KiB)).

Ah, [1] has the answer: they acknowledge that random accesses might be much slower, but expect a good share of accesses to be satisfied by the cache (that'll be a combination of various caches).

[1] https://www.postgresql.org/docs/13/runtime-config-query.html


https://postgresqlco.nf/doc/en/param/ is a pretty good additional resource on these kind of things by the way, e.g. https://postgresqlco.nf/doc/en/param/random_page_cost/


> I'm not a db expert, but this seems like something that should be solvable.

It is very highly database specific. MySQL tends to be terrible. PostgreSQL tends to be quite good. The reasons why it did not figure it out in particular cases tend to be complicated.

Except for MySQL, in which case the problem is simple. There is a tradeoff between time spent optimizing a query and time spent running it. Other databases assume that you prepare a query and then use it many times so can spend time on optimizations. MySQL has an application base of people who prepare and use queries just once. Which means that they use simple heuristics and can't afford the overhead of a complex search for a better plan.


I was under the impression that MySQL was smart enough to skip parts of the range that aren't needed. https://dev.mysql.com/doc/refman/8.0/en/range-optimization.h...


"MySQL has an application base of people who prepare and use queries just once". Is it really likely that mysql is unique in its usage patterns? Isn't it more likely that the devs have decided to keep things simple because that's the historical story for mysql. For example it originally didn't even support foreign key constraints if I recall correctly (that's a while ago now of course).


Yeah, I'm pretty sure that it is unique to MySQL and SQLite.

Back in the 90s and early 2000s, MySQL told people to just run queries. PHP encouraged the same.

Every other database was competing on ability to run complex queries. And so it became common knowledge that applications written for MySQL didn't port well to, say, Oracle. Exactly because of this issue.

Those applications and more like them are out there. And mostly expect to run on MySQL. So that use case has to be supported.


I agree on MySQL but hard disagree on SQLite because the API makes it rather obvious very quickly to any developer that there is a benefit to preparing beforehand.


> "MySQL has an application base of people who prepare and use queries just once". Is it really likely that mysql is unique in its usage patterns?

It's quite plausible that it is distinct in the usage patterns it has chosen to optimize for, and that that design choice has been instrumental in determining what use cases it became popular for and which use cases led people to migrate off of it for a different engine, yes.

To the extent that is at play here, it is somewhat of an oversimplification (even if the target audience was a factor in the design decision) to describe that as simple unidirectional causation from usage to design, and it is possibly more accurate (though still oversimplified) the other way around.


> It is very highly database specific.

I was about to comment the same thing. Every engine has its quirks, and the programmer learns them over time. I went from years on MSSQL to MySQL and it was a bit rough to be generous. But now I know many of the MySQL quirks and it's fine.


I imagine its bad query optimizer doing the wrong thing. Maybe a different db engine would do better.


I don't like it when people who cannot write efficient SQL or design an efficient database schema start blaming a "bad" query optimizer.

Pro SQL developers extremely rarely have a problem with the query optimizer, and when they do, it is an extremely exotic one.

it is usually ORM users that have a problem with SQL performance and blame the optimizer


The comment I'm responding to is asking why two equivalent queries have different runtimes. The answer pretty obviously is that the optimizer is smart enough to choose the best query plan in one case and not the other. I think it's fairly obvious that in theory a better optimizer could figure out the right plan in both cases.

That's not to say that it's impossible to work around, or that in a real application you would ever be "stuck" by this. At the end of the day you deal with the software you have, but the optimizer can still have weaknesses without it totally derailing your app.


the optimizer is constrained, it still has to give you what you asked for. i can't even translate into english what's being asked for in query #2 but it's only a coincidence that it has the same result as the more precise and correct thing being asked for in the final query.


There are two possible cases:

a) with everything known to the database (i.e. ids not null) the two queries produce the same result.

b) for some contents of the schema the results differ, but in the presented case some additional condition (not known to the db) holds that makes them equivalent.

Only in case b) it can be called coincidence. In case a) it’s fair to ask if the optimizer can’t do better. After all it tries to avoid cartesian products for simple joins even though that’s the text book definition.

“Not worth” to improve the optimizer is still a valid answer.


the query in question is definitely case b - it's a group by query with an explicit granularity, there's just no actual aggregates being asked for but the very presence or absence of a row in the results carries some meaning. the explicit grain is something like employee markouts plus customers from stores with at least one employee without markouts.


I agree they shouldn't start there, but you don't necessarily need exotic queries to run into edge cases that work differently across say MySQL, MSSQL, and Oracle. The classic example being exists vs left join null.


There could be a lot of reasons that are highly engine dependent, for this specific case.

A general answer, perhaps... not specific to the case you've specified.

Query optimization is a science with multiple dimensions [1]. I'd wager every problem in computer science plays a role somehow in query optimization.

Query performance is based on a combination of actions you take to optimize the design of your system to get the best performance (e.g. data modelling, index design, hardware, query style, and more), and the patterns the system can recognize based on your inputs and the data itself, with the resources it has available, to optimize your queries.

There are known patterns for optimization that are discovered over the years, many hard learned from practical experience. This is why older "popular" engines sometimes are more mature and more performant - they have optimizations built for the common use cases over long periods of time. That is not to say older engines are always better, just that they have often had more exposure to the variety of problems that occur.

The reason why the engine "can't figure it out" is that most engines, even the best ones, are quite complex - combinations of known rules as well as more fuzzy logic, where the engine uses a combination of information and heuristics to essentially explore a possible solution space, to try to find the optimal execution plan. Making the right decision, well, can be difficult and given the nature of these things, sometimes the optimizer makes the wrong decision (this is why "hints" exist, sometimes, you can force the optimizer to do what you see is obvious - but this is suboptimal for you).

In some cases, finding an optimal execution plan can actually be quite computationally expensive, and/or quite time consuming, or the engine in question may simply have no logic coded to handle the case. Optimization is all about finding the balance between finding the most performant query plan, but in the least amount of time, with the least computational and I/O impact to the overall system, that returns the right result. Optimizers are also highly dependent on the capabilities of the engineering teams that build them.

It is not an easy problem, and it is an area which one could liken to almost machine learning/artificial intelligence, in one way. There are so many possible options, the problem space so big, with so many different ways to approach a given scenario, that it can be difficult for the "engine" to decide.

This is why known patterns were created, for example, dimensional data models for analytical queries vs. 3rd normal form. Dimensional data models enable certain optimizations, for example, star schemas [2]. If you take a combination of implementing known patterns, along with optimizers written by engineers that exploit those patterns, you can get to a world of better performance.

However, in a world that is, let's say.. more "open ended" - for example, the world of data in a "data lake", where data models are not optimized, data comes in unpredictable multiple shapes/sizes, then it often comes down to combinations of elegant/complex engines that can interpret the shapes of data, cardinality, and other characteristics, make use of much larger distributed compute and system performance, and in some cases - often brute force to arrive at the best query plan or performance possible.

There are so many levels of optimization.. for example, if you were to look at things like Trino [3], which started its genesis as PrestoDb in Facebook - you will see special CPU optimizations (e.g. SIMD instructions), vectorized/pipelined operations - there are storage engine optimizations, memory optimizations, etc. etc. It truly is a complex and fascinating problem.

Source: I was a technical product manager for a federated query engine.

[1] https://en.wikipedia.org/wiki/Query_optimization

[2] https://en.wikipedia.org/wiki/Star_schema

[3] https://trino.io/


I often read things that say that if you just write the query properly, you can trust the query optimizer to select the right plan. Just trust the optimizer! Unfortunately, in my experience, the more complex your query gets, the more opportunities the optimizer has to get it terribly wrong. I've learned to accept that, if the "clear/clean" version of the SQL (that the human likes to read and write) does not produce acceptable results, you just have to drop more explicit hints to the optimizer about what you want it to do. Query optimizers are truly awesome pieces of software, you just have to learn to work around their limitations when you hit them.


I'm very interested to learn what you are saying.

Can you give an example ?


At work I ended up with something similar. A reporting query has about 20 sum()'ed columns, and I found the easiest and by far fastest approach was just to have 20+ queries glued together by UNION ALL, and aggregate the sums over that.

It's heavy in its raw form, but if you add some search terms to narrow things down, the DB really cuts through.

It's about 320 lines of SQL, so to make debugging a bit easier I added a dummy "identifier" column to each of the union'ed sub-queries, so the first would have value 100, the second 200, etc., so I could quickly identify which of the sub-queries produced a given row. Not counting in 1s allowed for easy insertion of additional sub-queries near the relevant "identifiers".
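
Shape-wise, something like this heavily trimmed sketch (table and column names are made up):

  SELECT SUM(revenue) AS revenue, SUM(refunds) AS refunds
  FROM (
    -- src tells you which branch produced a row when the inner query is run on its own
    SELECT 100 AS src, amount AS revenue, 0 AS refunds FROM sales
    UNION ALL
    SELECT 200 AS src, 0 AS revenue, amount AS refunds FROM refunds
  ) AS per_branch;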


> Retrieving both employee and customer meal items

When you're getting two separate sets of things in one query you should expect to use union. This isn't so much a performance optimization as just expressing the query properly.

Imagine starting with just this weird SQL that gets both employee and customer meal items using only joins, and then trying to infer the original intent.


Is there any way to exploit parallelism in the customer vs employee queries then merge the results?

My intuition would have been to tackle this in 3 steps with a temp table (create temp table with employee markouts, append customer meals, select ... order by). I can see the overheads of my solution but I’m wondering if I could beat the UNION ALL approach somehow. My temp table wouldn’t benefit from an index so I’m left wondering about parallelism.

If I remember correctly though, there’s no concurrency, and therefore no opportunity for parallelism, for transactions in the same session.


MS SQL will do most of the work in parallel, but you can control if that is what you want.


I'm pretty sure later versions of PG can indeed execute both in parallel.


Not sure about each side of the UNION, but postgres can do parallel index scans for even simple criteria: https://www.postgresql.org/docs/10/parallel-plans.html#PARAL...

Also the article says these tests were done on postgres 12.6


Temp table as in using WITH?


Author of the original article here. Temporary tables are different from using WITH (which creates common table expressions, or CTEs). In many database engines, you can make a temporary table that will persist for a single session. The syntax is the same as table creation, it just starts with CREATE TEMPORARY TABLE ....
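
A minimal sketch of the pattern (using table/column names from the article's schema):

  CREATE TEMPORARY TABLE markout_items AS
      SELECT meal_item_id, employee_id, created
      FROM employee_markouts
      WHERE created >= '2021-02-03' AND created < '2021-02-04';

  -- usable like a normal table for the rest of the session, then dropped automatically
  SELECT * FROM markout_items;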

More info in the PostgreSQL docs [1]:

> If specified, the table is created as a temporary table. Temporary tables are automatically dropped at the end of a session, or optionally at the end of the current transaction (see ON COMMIT below). Existing permanent tables with the same name are not visible to the current session while the temporary table exists, unless they are referenced with schema-qualified names. Any indexes created on a temporary table are automatically temporary as well.

[1]: https://www.postgresql.org/docs/13/sql-createtable.html


If this is a database for reporting, using a temporary table is probably fine and a union all wouldn’t concern me.

On MySQL, using a union all creates a temp table, which can perform catastrophically under database load. I’ve seen a union all query with zero rows in the second half render a database server unresponsive when the database was under high load, causing a service disruption. We ended up rewriting the union all query as two database fetches and have not seen a single problem in that area since.

I was shocked by this union all behavior, but it is apparently a well known thing on MySQL.

I can’t speak to Postgres behavior for this kind of query.


Yeah, you _really_ have to watch out for this. I once spent months chasing down a serious but rare performance problem in a large-scale mysql 5.6 deployment, which was eventually root-caused to kernel slab reclaim pressure, caused by very high XFS metadata mutation rate, caused by MySQL creating an on-disk temporary ISAM file for every query with a union, which was most of the traffic.


In the past we worked on a system that used MySQL 8. We used UNION (not UNION ALL, but I assume it doesn't matter) in several places, applying it to improve performance as we described in the article. There were definitely cases in the system where one side of the UNION would return zero rows, but we never ran into any of the types of issues you're describing.


What is the best book to learn performant SQL querying? I recently needed to query a large database for exploratory purposes. The database has historical data about many years of sales transactions of a huge company. It is using partitioning keys. I know pretty well how to write queries to get what I want but given the size of the database I need to write them performantly. How can I learn that best?


SQL Performance Explained, hands-down https://sql-performance-explained.com/

Covers all the main DBMS, explained very simply from the bottom up, maintained constantly and has been read and recommended by people for years and years.


Perhaps I'm missing something, but why is query 2's meal_items a left join when the where clause then forces it to act as if it were an inner join? Would changing that have any impact?


Author here, you are indeed correct that Query #2's final join can be an INNER join. However, I just tested it against our test data set and it makes no impact on the performance.


Ah well, thanks for humouring me. Good thing those query optimisers do the right thing on occasion.


Is the article essentially saying that [1] is faster than [2]?

I have used [1] many times for that reason although [2] is probably more intuitive for what you want to do.

[1]

  SELECT vegetable_id, SUM(price) as price, SUM(weight) as weight
  FROM
  (
    SELECT vegetable_id, price, NULL as weight
    FROM prices
    UNION ALL
    SELECT vegetable_id, NULL as price, weight
    FROM weights
  ) AS combined
  GROUP BY vegetable_id
[2]

  SELECT vegetable_id, price, weight
  FROM prices 
    JOIN weights 
      ON prices.vegetable_id = weights.vegetable_id


The relative performance of these two queries will vary by data volume. Swap in a sales table for the weights table (and make that a massive sales table at that), and joining it to the much smaller prices table can be much faster than a group by. Stated differently, a join can be faster than a group by. This is even more true when the small table can fit into memory and a hash join can be used, while the data in the group by can't fit into memory.


Yes I'd say that I would intuitively do that only if the tables were both sufficiently large.

I assume somewhere there is a similar assumption in TFA.


These queries have different results. [2] retrieves only the vegetable_ids which are in both tables; [1] gives all ids which are in prices or weights. If a null vegetable_id exists in either table, [1] results in an extra row with id null; this doesn't happen for query [2].


If I replace JOIN with FULL OUTER JOIN, you'll get what you describe. It was just a quick example, but you are right.

There are also things to say about what happens if either table has duplicate vegetable_ids. At some point it is assumed that you have processed the data in such a way that the table is well formed for the purpose.


Do you lose the indices on the columns after the UNION?


Definitely, but that is more than made up for by the much smaller size.


I'm now curious (but too lazy to set up and test it) how the following query would fare:

  SELECT meal_items.*, employee_markouts.employee_id, customer_orders.customer_id
  FROM meal_items
  LEFT JOIN employee_markouts ON employee_markouts.meal_item_id = meal_items.id
  LEFT JOIN employees ON employees.id = employee_markouts.employee_id
  LEFT JOIN customer_order_items ON customer_order_items.meal_item_id = meal_items.id
  LEFT JOIN customer_orders ON customer_orders.id = customer_order_items.customer_order_id
  LEFT JOIN customers ON customers.id = customer_orders.customer_id
  WHERE (employees.store_id = 250 AND 
         employee_markouts.created >= '2021-02-03' AND
         employee_markouts.created < '2021-02-04') OR
        (customers.store_id = 250 AND
         customer_orders.created >= '2021-02-03' AND
         customer_orders.created < '2021-02-04')


Author here. This doesn't give the correct results. It produces meal_items that have both customer_id and employee_id. Here's an excerpt (the full result set is thousands of rows, as opposed to the expected 45):

    id |label       |price|employee_id|customer_id|
    ---|------------|-----|-----------|-----------|
    ...
    344|Poke        | 4.18|       3772|      13204|
    344|Poke        | 4.18|       3313|      13204|
    344|Poke        | 4.18|       2320|      13204|
    344|Poke        | 4.18|        632|      13204|
    344|Poke        | 4.18|       4264|      13204|
    344|Poke        | 4.18|        699|      13204|
    344|Poke        | 4.18|       1070|      13204|
    344|Poke        | 4.18|       3022|      13204|
    344|Poke        | 4.18|       1501|      13204|
    344|Poke        | 4.18|        808|      13204|
    344|Poke        | 4.18|       2793|      13204|
    344|Poke        | 4.18|       1660|      13204|
    344|Poke        | 4.18|        932|      13204|
    ...
To be clear, there are ways to write this query without UNION that have both good performance and correct results, but they're very fiddly and harder to reason about than just writing the two comparatively simple queries and then mashing the results together.
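Roughly, the UNION approach mashes the two branches together like this (a simplified sketch, not the exact query from the article):

  SELECT meal_items.*, employee_markouts.employee_id, NULL AS customer_id
  FROM meal_items
  JOIN employee_markouts ON employee_markouts.meal_item_id = meal_items.id
  JOIN employees ON employees.id = employee_markouts.employee_id
  WHERE employees.store_id = 250
    AND employee_markouts.created >= '2021-02-03'
    AND employee_markouts.created < '2021-02-04'
  UNION ALL
  SELECT meal_items.*, NULL AS employee_id, customer_orders.customer_id
  FROM meal_items
  JOIN customer_order_items ON customer_order_items.meal_item_id = meal_items.id
  JOIN customer_orders ON customer_orders.id = customer_order_items.customer_order_id
  JOIN customers ON customers.id = customer_orders.customer_id
  WHERE customers.store_id = 250
    AND customer_orders.created >= '2021-02-03'
    AND customer_orders.created < '2021-02-04'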


> Author here. This doesn't give the correct results.

Right you are! :) Should have given it more thought before posting. Thanks for jumping in.


this is probably how I'd write it (assuming pg properly pushes the predicates down to the subquery in the lateral join)

  select mi.*, x.employee_id, x.customer_id
  from meal_items mi
  join lateral (
    select e.store_id, em.created, em.employee_id, null as customer_id
    from employee_markouts em
    join employees e on e.id = em.employee_id
    where em.meal_item_id = mi.id
    union all
    select c.store_id, co.created, null as employee_id, co.customer_id
    from customer_order_items coi
    join customer_orders co on co.id = coi.customer_order_id
    join customers c on c.id = co.customer_id
    where coi.meal_item_id = mi.id) x on true
  where x.store_id = 250 and x.created >= '2021-02-03' and x.created < '2021-02-04'
  order by mi.id


OR really messes things up. UNION instead makes things fast.


Oh, I wouldn't presume that it will be faster than the union but I wonder how it compares to the original query in the article (the one with OR).


Not speaking negatively about the author of this post, but union should have been the obvious tool to use in a scenario like this.

SQL is very powerful and really not so complicated. Schema, index, and constraint design can be quite challenging, depending on the needs, but that's true whether using SQL or an ORM.

The problem I see often is that people use ORMs and never learn SQL. Also, one often has to really work to understand how their ORM handles certain situations to avoid problems like N+1 queries or cartesian products.

In the end, it seems that rather than learning SQL, the developer ends up learning just as much or more about a specific ORM to get the same quality/performing results they would have gotten by using direct SQL.


I suspect query #2 would have performed better if they used a COALESCE instead of an OR condition. One of the most important optimization lessons you can learn is not to use OR for any join conditions.

Eg instead of this:

    ON (customer_order_items.meal_item_id = meal_items.id
        OR employee_markouts.meal_item_id = meal_items.id
       )
Do this:

    ON meal_items.id = COALESCE( employee_markouts.meal_item_id, customer_order_items.meal_item_id )
The latter can be optimized properly. Probably still slower than the UNION, but it would be more in line with expected performance. It might also be more useful in certain cases, such as when building a view.


No need to wrap the union queries in a new select to order the result. An ORDER BY after the last SELECT in the union orders the whole result set, as long as the column names match in each part of the union.
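E.g., with the prices/weights tables from upthread:

  SELECT vegetable_id, price AS amount FROM prices
  UNION ALL
  SELECT vegetable_id, weight AS amount FROM weights
  ORDER BY vegetable_id  -- sorts the combined result, not just the last branch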


This depends on the database engine. The syntax will error out in some (perhaps most).


Works in mysql, postgres, oracle, MS, DB2. Can you name an engine where it fails?


Seems I was completely wrong. I've learned something new today - thanks.


That works in MS SQL, not guaranteed in others.


Works in mysql, postgres, oracle, MS, DB2


I know of at least one commercial DBMS that will perform this transformation automatically in some circumstances, and I’d be surprised if most of the other major systems couldn’t do the same.


All database systems could implement many more optimizations. Heck, all these tricks to get the queries to complete faster are only needed because the database engine did not find this fast execution plan.

The problem is that the more optimizations you implement, the more possibilities the query planner has to evaluate, which may result in (1) an overly long query-planning phase or (2) de-optimizing a good query because an optimization was wrongly predicted to be faster.


Clever optimization engines are IMO no excuse for blindly throwing queries at them and not knowing what's going on. Every optimizer has corner cases in which it fails, and I'd argue that as a good developer you should know when they can happen.

Additionally, in this example, the rewritten query is easier to read and maintain.


Why is it so important to express this as one SQL query? Why not just run the two simple queries? It feels as though two different types of things are being joined.


Depends on your goals. If you're a software engineer, it won't make a difference, just send two queries to the DB and combine the results before sending them off. If you're an analyst, you'll often get repeated ad-hoc requests, and if you have to run two queries and then stitch them together every time you're asked for a report, you're wasting tons of (really boring) time. Even if you ultimately write some code to run and stitch the results together for you, it's just faster and easier to write one query wherever possible. Just let the database do the work!


> If you're a software engineer, it won't make a difference, just send two queries to the DB and combine the results before sending them off.

This isn't true if you have certain use cases in your application, pagination being one of them. There's no simple way to implement pagination when you have two or more queries that return an unknown number of results and you're tasked with maintaining a consistent order across page changes.

If you make a single query do all the work, it's easy to implement paging in any number of ways, the most performant being to filter by the last id the client saw. If your users aren't likely to paginate too far into the data, LIMIT and OFFSET will work fine as well.
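A sketch of the keyset ("last id the client saw") approach over a combined UNION query, using the tables from this thread; the store/date filters are omitted for brevity, and :last_seen_id and the page size are placeholders:

  SELECT combined.*
  FROM (
    SELECT mi.id, mi.label, em.employee_id, NULL AS customer_id
    FROM meal_items mi
    JOIN employee_markouts em ON em.meal_item_id = mi.id
    UNION ALL
    SELECT mi.id, mi.label, NULL AS employee_id, co.customer_id
    FROM meal_items mi
    JOIN customer_order_items coi ON coi.meal_item_id = mi.id
    JOIN customer_orders co ON co.id = coi.customer_order_id
  ) AS combined
  WHERE combined.id > :last_seen_id  -- resume after the last id from the previous page
  ORDER BY combined.id               -- in practice add a tiebreaker if ids can repeat
  LIMIT 50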


You're falling into the trap of thinking like the developer looking at the implementation, not the end user. The goal is:

>In order to audit inventory, the logistics team at the corporate headquarters requests a tool that can generate a report containing all meal_items that left a given store’s inventory on a particular day

That's one thing. The end user doesn't see these as two separate requests just because that's how it's modeled in the database.


The end user here being the person writing the union SQL query?



Would be cool to post an SQL dump of the data somewhere so we could play with the queries against the same data.


Am I the only one who rejects this data model? Customers belong to a store and orders don't? I can see how a customer might have a primary store, but I would link the customer to the order and link the order to the store. UNIONs are awesome, but proper data modeling is better. :)
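Something like this, i.e. the order carries the store directly (a sketch; the column definitions and the stores table are assumed):

  CREATE TABLE customer_orders (
    id          bigint PRIMARY KEY,
    customer_id bigint NOT NULL REFERENCES customers (id),
    store_id    bigint NOT NULL REFERENCES stores (id),  -- order -> store link
    created     timestamp NOT NULL
  );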


I wonder if I'm the only one who sees the phrase "order(s) of magnitude" and expects a vague story.

I don't mean to denigrate this article, just wonder if others have the same reaction.


Speeding up SQL queries isn't hard, but it takes time and it's worth it.



