
Is the article essentially saying that [1] is faster than [2]?

I have used [1] many times for that reason, although [2] is probably more intuitive for what you want to do.

[1]

  SELECT vegetable_id, SUM(price) AS price, SUM(weight) AS weight
  FROM
  (
    SELECT vegetable_id, price, NULL AS weight
    FROM prices
    UNION ALL
    SELECT vegetable_id, NULL AS price, weight
    FROM weights
  ) AS combined
  GROUP BY vegetable_id
[2]

  SELECT prices.vegetable_id, price, weight
  FROM prices
    JOIN weights
      ON prices.vegetable_id = weights.vegetable_id


The relative performance of these two queries will vary with data volume. Swap in a massive sales table for the weights table, and joining it to the much smaller prices table can be much faster than a group by. Stated differently, a join can be faster than a group by. This is even more true when the small table fits into memory so a hash join can be used, while the data in the group by can't fit into memory.
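A rough sqlite3 sketch of that skewed workload, with made-up table and column names (a large sales table, a small prices table). Timings vary by machine and engine, so the sketch just times both shapes and checks they return the same rows:

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prices (vegetable_id INTEGER PRIMARY KEY, price REAL);
    CREATE TABLE sales  (vegetable_id INTEGER, quantity REAL);
""")
# small prices table, much larger sales table
con.executemany("INSERT INTO prices VALUES (?, ?)",
                [(i, float(i)) for i in range(100)])
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(i % 100, 1.0) for i in range(200_000)])

def run(sql):
    t0 = time.perf_counter()
    rows = con.execute(sql).fetchall()
    return rows, time.perf_counter() - t0

# shape [1]: GROUP BY over a UNION ALL of both tables
union_rows, union_t = run("""
    SELECT vegetable_id, SUM(price) AS price, SUM(quantity) AS quantity
    FROM (
        SELECT vegetable_id, price, NULL AS quantity FROM prices
        UNION ALL
        SELECT vegetable_id, NULL AS price, quantity FROM sales
    ) AS combined
    GROUP BY vegetable_id
    ORDER BY vegetable_id
""")

# shape [2]: aggregate the big table first, then join the small one
join_rows, join_t = run("""
    SELECT s.vegetable_id, p.price, s.quantity
    FROM (SELECT vegetable_id, SUM(quantity) AS quantity
          FROM sales GROUP BY vegetable_id) AS s
    JOIN prices AS p ON p.vegetable_id = s.vegetable_id
    ORDER BY s.vegetable_id
""")

assert union_rows == join_rows
print(f"union+group by: {union_t:.3f}s, aggregate+join: {join_t:.3f}s")
```

Which shape wins depends on the planner; the point is only that both compute the same answer, so the choice is a performance question.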


Yes, I'd say I would intuitively do that only if both tables were sufficiently large.

I assume somewhere there is a similar assumption in TFA.


These queries have different results. [2] retrieves only the vegetable_ids that are in both tables, while [1] gives all ids that are in prices or weights. And if a NULL vegetable_id exists in either table, [1] results in an extra row with a NULL id, which doesn't happen with query [2].
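A quick sqlite3 sketch of the difference, using the queries above on made-up data (one id in both tables, one only in prices, one only in weights):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prices  (vegetable_id INTEGER, price  REAL);
    CREATE TABLE weights (vegetable_id INTEGER, weight REAL);
    -- id 1 is in both tables, id 2 only in prices, id 3 only in weights
    INSERT INTO prices  VALUES (1, 10.0), (2, 20.0);
    INSERT INTO weights VALUES (1, 5.0),  (3, 7.0);
""")

# [1] UNION ALL + GROUP BY: keeps ids that appear in either table
union_rows = con.execute("""
    SELECT vegetable_id, SUM(price) AS price, SUM(weight) AS weight
    FROM (
        SELECT vegetable_id, price, NULL AS weight FROM prices
        UNION ALL
        SELECT vegetable_id, NULL AS price, weight FROM weights
    ) AS combined
    GROUP BY vegetable_id
    ORDER BY vegetable_id
""").fetchall()

# [2] inner JOIN: keeps only ids that appear in both tables
join_rows = con.execute("""
    SELECT prices.vegetable_id, price, weight
    FROM prices JOIN weights ON prices.vegetable_id = weights.vegetable_id
""").fetchall()

print(union_rows)  # [(1, 10.0, 5.0), (2, 20.0, None), (3, None, 7.0)]
print(join_rows)   # [(1, 10.0, 5.0)]
```

The ids missing from one side come back with a NULL in that column under [1], and disappear entirely under [2].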


If I replace JOIN with FULL OUTER JOIN, you'll get what you describe. It was just a quick example, but you are right.

There is also the question of what happens if either table has duplicate vegetable_ids. At some point you have to assume the data has been processed so that each table is well formed for the purpose.
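The duplicate case can be sketched with sqlite3 on made-up data: with two price rows for the same id, the group-by query [1] sums them into one row, while the join [2] fans out to one output row per matching pair.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE prices  (vegetable_id INTEGER, price  REAL);
    CREATE TABLE weights (vegetable_id INTEGER, weight REAL);
    -- vegetable_id 1 appears twice in prices
    INSERT INTO prices  VALUES (1, 10.0), (1, 15.0);
    INSERT INTO weights VALUES (1, 5.0);
""")

# [1] collapses the duplicates into a single summed row
grouped = con.execute("""
    SELECT vegetable_id, SUM(price) AS price, SUM(weight) AS weight
    FROM (
        SELECT vegetable_id, price, NULL AS weight FROM prices
        UNION ALL
        SELECT vegetable_id, NULL AS price, weight FROM weights
    ) AS combined
    GROUP BY vegetable_id
""").fetchall()

# [2] produces one output row per matching price/weight pair
joined = con.execute("""
    SELECT prices.vegetable_id, price, weight
    FROM prices JOIN weights ON prices.vegetable_id = weights.vegetable_id
    ORDER BY price
""").fetchall()

print(grouped)  # [(1, 25.0, 5.0)]
print(joined)   # [(1, 10.0, 5.0), (1, 15.0, 5.0)]
```

Note the join also repeats the weight 5.0 on both rows, so summing its output would double-count weights.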


Do you lose the indices on the columns after the UNION?


Definitely, but that is more than made up for by the much smaller size.



