Since there's nothing limiting the set of rows from project_commits, it might as well table-scan it. The primary key on commits will be used for each left join lookup.
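A minimal sketch of the query shape, with made-up column names rather than the article's exact schema:

    -- Nothing restricts project_commits, so MySQL scans it in full;
    -- every row then probes the primary key of commits once.
    SELECT pc.project_id, c.author_date
    FROM project_commits pc
    LEFT JOIN commits c ON c.id = pc.commit_id;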
Nested loops aren't that different, performance-wise, from sort-merge joins. Sorting takes O(n log n), whereas the nested loop does n index lookups, each taking O(log n), for a similar O(n log n) overall. Memory allowing, a hash join has more potential for speedup.
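Spelled out for two inputs of roughly n rows each (constants and I/O ignored):

    \begin{aligned}
    \text{sort-merge:}~  & O(n \log n)~\text{sort} + O(n)~\text{merge} \\
    \text{nested loop:}~ & n \times O(\log n)~\text{index lookups} = O(n \log n) \\
    \text{hash join:}~   & O(n)~\text{build} + O(n)~\text{probe} = O(n)
    \end{aligned}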
There should be a locality win from a sorted merge - depending on the correlation between scan order and foreign key order, the index lookups in the nested loop may be all over the place. Usually this doesn't matter much because you don't normally do 5+ billion row joins.
Postgres won't use an index if the statistics suggest that doing so would be slower than a full table scan; I'm not sure of the details in MySQL, but I'd expect it to behave similarly.
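You can see this in Postgres directly; the table and column names here are hypothetical:

    -- If the statistics say most rows match, the plan will be a Seq Scan
    -- even though an index on author_id exists.
    EXPLAIN SELECT * FROM commits WHERE author_id = 42;

    -- To confirm it's a cost decision, make sequential scans look
    -- prohibitively expensive for the session and re-run the EXPLAIN:
    SET enable_seqscan = off;
    EXPLAIN SELECT * FROM commits WHERE author_id = 42;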
With 5bn+ rows and a memory constraint, the type of index begins to make a difference - e.g. in Postgres I would have tried using a bloom index.
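For what it's worth, bloom indexes in Postgres live in a contrib extension; something along these lines (table and parameters are illustrative):

    CREATE EXTENSION IF NOT EXISTS bloom;
    -- A compact, lossy signature index; each hit must be rechecked
    -- against the heap. length is the signature size in bits,
    -- colN the bits generated per column.
    CREATE INDEX commits_bloom ON commits
      USING bloom (author_id, committer_id)
      WITH (length = 80, col1 = 2, col2 = 2);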
It can be very frustrating when, even with updated statistics, the optimizer decides to go a different way with a query. This happened quite a lot in Sybase when tables got rather large (corporation-large, not Google-large). The normal response is to force the indexes on the query. Back in the Ingres days I even remember running an awk script between queries because of log-space issues.
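In MySQL (since that's what the article used) forcing an index looks like this; Sybase had its own (INDEX ...) hint syntax. Table and index names are hypothetical:

    -- FORCE INDEX treats a table scan as prohibitively expensive, so
    -- the optimizer uses the named index wherever it possibly can.
    SELECT pc.project_id, c.author_date
    FROM project_commits pc
    JOIN commits c FORCE INDEX (PRIMARY) ON c.id = pc.commit_id;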
Only for every returned row. The projection does not affect the underlying result set; it merely "shapes" the rows you get back. The WHERE clause is always evaluated before the SELECT list.
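One way to see the evaluation order, using a hypothetical table:

    -- The alias defined in the SELECT list does not exist yet when the
    -- WHERE clause is evaluated, so this fails with "unknown column 'd'":
    SELECT commit_id, author_date AS d
    FROM commits
    WHERE d > '2017-01-01';
    -- WHERE filters first; the projection only shapes the surviving rows.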
From the article, the database did use an index for the lookup. However, if each index lookup hits spinning rust, it can take 10ms or so per entry, and that adds up.
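Back-of-the-envelope, assuming the article's roughly 5 billion rows and the worst case where every lookup is a seek:

    5 \times 10^9~\text{lookups} \times 10~\text{ms} = 5 \times 10^7~\text{s} \approx 580~\text{days}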
In contrast, if you have sorted/indexed both tables on the join key, then the database is able to do a merge join. This is effectively what the command line implementation did, and is much faster, assuming you can get hold of sorted data quickly to begin with. If the tables are unsorted, then the extra cost of sorting them first needs to be added.
Most databases will evaluate several methods of running the query; something like Postgres's EXPLAIN will show which plan was chosen.
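For example, in Postgres (plan abridged; index names are made up), sorted inputs on both sides let the planner pick a merge join:

    EXPLAIN SELECT *
    FROM project_commits pc
    JOIN commits c ON c.id = pc.commit_id;
    --  Merge Join
    --    Merge Cond: (pc.commit_id = c.id)
    --    ->  Index Scan using project_commits_commit_id_idx on project_commits pc
    --    ->  Index Scan using commits_pkey on commits c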
MySQL used the index on the joined table, but didn't use an index on the primary table because there wasn't anything to filter on. The join condition alone would still force a full scan of one table to gather all of the records.
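That's visible in MySQL's EXPLAIN output (columns abridged, names hypothetical): type ALL is a full table scan of the driving table, while eq_ref is one primary-key lookup per joined row:

    EXPLAIN SELECT pc.project_id, c.author_date
    FROM project_commits pc
    LEFT JOIN commits c ON c.id = pc.commit_id;
    -- +----+-------------+-------+--------+---------+
    -- | id | select_type | table | type   | key     |
    -- +----+-------------+-------+--------+---------+
    -- |  1 | SIMPLE      | pc    | ALL    | NULL    |
    -- |  1 | SIMPLE      | c     | eq_ref | PRIMARY |
    -- +----+-------------+-------+--------+---------+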