
> queries I run in seconds would require more storage or RAM than probably exists in my city

my explanation for this: CPUs are very fast. My CPU runs at 4 GHz, so a single core can do a lot of computations in one second, and a SQL engine is smart enough to do the cartesian computation (as part of a query plan) and discard a row if it does not meet the predicate condition.

in fact I agree that the entire cartesian product is not being computed if you specify enough predicates. But the query still multiplies rows. In the author's article he is joining employees where the customer_id column is NULL, so this is technically a cartesian, because a NULL predicate is not very selective (= there are a lot of rows with a NULL value)
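to make concrete what I mean, here is a toy python sketch of that nested-loop picture (table contents invented, not the author's actual data):

    employees = [
        {"id": 1, "customer_id": None},
        {"id": 2, "customer_id": 7},
    ]
    customers = [
        {"id": 7, "name": "Acme"},
        {"id": 8, "name": "Globex"},
    ]

    def nested_loop_join(left, right, predicate):
        # O(len(left) * len(right)): every combination is generated,
        # and the predicate discards the non-matching pairs
        for l in left:
            for r in right:
                if predicate(l, r):
                    yield (l, r)

    pairs = list(nested_loop_join(
        employees, customers,
        lambda e, c: e["customer_id"] == c["id"]))
    # only the (id=2, Acme) pair survives the predicate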



You are not using the term cartesian product correctly as applied to how database engines execute queries. Multiple people have detailed the problems with your ideas on how such things work. A fast processor cannot overcome the need for the multiple terabytes of RAM that would be required to process common queries if databases worked as you describe. Databases are significantly more likely to be bottlenecked by RAM than by CPU, and your incorrect understanding of how databases work would require as much RAM as a large "big data" cluster to run queries on moderately sized data sets. And even with that RAM, the bandwidth of the system's bus channels would be incapable of transporting the data fast enough to return results in 500ms. Certainly not on my system with a few dozen other people working at the same time, and especially not when I have to query the live production system with a few thousand end users instead of the data warehouse. Databases do not work this way.


multiple people can be wrong, so that is not a valid argument. I can write 10 nested loops and overwrite the value of a single CPU register with a single value so many times that the number of operations would exceed any big data cluster's capacity. This is what I meant by "CPUs are fast".

if you study engine internals, especially MergeJoin, HashJoin, NestedLoopJoin - they all do compute a cartesian product while simultaneously applying predicates. Some operations are faster because of sorting and hashing, but they still do a cartesian.


If they are applying predicates to reduce the number of rows processed, then they are not performing cartesian joins. You don't need to take the word of people here; you merely need to read any source about how database engines process queries. I am of course open to the possibility of being wrong should you find authoritative sources showing that cartesian joins are produced for all queries. However, nearly two decades of working with a variety of engines tells me you are unlikely to find such a source. Your comments betray a fundamental lack of understanding of the topic. Your unwillingness to recognize the exponential resources your understanding would require also betrays a fundamental misunderstanding. You even contradict yourself by insisting in one comment that such queries would be "SLOW" while stating in another that you obtain speedy results due to a fast CPU, which in itself betrays yet another fundamental misunderstanding of how databases utilize resources and where bottlenecks arise.

Last, you have offered no significant support for your position, and as the initiator of the discussion making the claim of cartesian joins, the burden of proof is on you to provide evidence.

In any case, your persistence in digging your hole deeper and deeper on this issue is embarrassing, and I won't be cruel enough to enable that further. Feel free to reply, but I am done, with a final recommendation that you continue your learning and not stop at the point of your current understanding. You at least seem earnest in your interests in this topic, so you should pursue it further.


> if you study engine internals, especially MergeJoin, HashJoin, NestedLoopJoin

it's called merge join because it employs the merge algorithm (linear time, constant space)[1] - this can only be used when the inputs are sorted. likewise the hash join entails building a hash table and probing it (linear time, linear space)[2], but the inputs don't have to be sorted. the point of these is to avoid the O(N*M) nested loop (toy sketch of the hash version below)

[1] https://en.wikipedia.org/wiki/Merge_algorithm

[2] https://en.wikipedia.org/wiki/Hash_join
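a toy hash join in python, to make the contrast concrete (equality keys only, since that's the case hash joins handle; names are made up):

    from collections import defaultdict

    def hash_join(left, right, left_key, right_key):
        # build phase: O(len(left)) time and space
        buckets = defaultdict(list)
        for l in left:
            buckets[left_key(l)].append(l)
        # probe phase: O(len(right)) probes; only rows sharing a key are
        # ever paired - no N*M cross product is materialized
        for r in right:
            for l in buckets.get(right_key(r), []):
                yield (l, r)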


Umm... virtually no join on tables with more than a handful of rows is done as a cartesian product. A suitable set of columns is used to sort or hash each side of the join, such that the actual join logic is basically linear: each row in table A is typically paired with approximately one row in table B when evaluating the predicates. A cartesian product would involve testing the predicates on every combination of A and B (each row in table A being paired with every row in table B).

(Note: when I say linear performance, I mean the join predicate is executed a linear number of times; the initial sort/hash operation was probably more like O(n log n).)
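A toy merge join in Python, assuming both inputs are already sorted on the join key (names invented for illustration), shows why the pairing work stays linear:

    def merge_join(left, right, key):
        # both inputs MUST already be sorted on the join key
        i = j = 0
        while i < len(left) and j < len(right):
            lk, rk = key(left[i]), key(right[j])
            if lk < rk:
                i += 1      # left row has no partner; skip it
            elif lk > rk:
                j += 1      # right row has no partner; skip it
            else:
                # emit pairs for the run of right rows sharing this key,
                # then advance the left pointer
                j2 = j
                while j2 < len(right) and key(right[j2]) == lk:
                    yield (left[i], right[j2])
                    j2 += 1
                i += 1
        # pointers only move forward; non-matching combinations are never
        # generated, so nothing resembling a cartesian product is formed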


Cartesian product means a very specific thing - testing the predicates on every possible combination. This is not what occurs in practice.



