Since there's nothing limiting the set of rows from project_commits, it might as well table-scan it. The primary key on commits will be used for each left join lookup.
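A minimal sketch of the query shape, with made-up column names rather than the article's exact schema:

    -- Nothing restricts project_commits, so MySQL scans it in full;
    -- every row then probes the primary key of commits once.
    SELECT pc.project_id, c.author_date
    FROM project_commits pc
    LEFT JOIN commits c ON c.id = pc.commit_id;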
Nested loops aren't that different, performance-wise, from sort-merge joins. Sorting takes O(n log n), whereas the nested loop does n index lookups, each taking O(log n), for a similar O(n log n) overall. Memory allowing, a hash join has more potential for speedup.
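Spelled out for two inputs of roughly n rows each (constants and I/O ignored):

    \begin{aligned}
    \text{sort-merge:}~  & O(n \log n)~\text{sort} + O(n)~\text{merge} \\
    \text{nested loop:}~ & n \times O(\log n)~\text{index lookups} = O(n \log n) \\
    \text{hash join:}~   & O(n)~\text{build} + O(n)~\text{probe} = O(n)
    \end{aligned}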
There should be a locality win from a sorted merge - depending on the correlation between scan order and foreign key order, the index lookups in the nested loop may be all over the place. Usually this doesn't matter much because you don't normally do 5+ billion row joins.
Postgres won't use an index if the statistics suggest that doing so would be slower than a full table scan; I'm not sure of the details in MySQL, but I'd expect it to behave similarly.
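You can see this in Postgres directly; the table and column names here are hypothetical:

    -- If the statistics say most rows match, the plan will be a Seq Scan
    -- even though an index on author_id exists.
    EXPLAIN SELECT * FROM commits WHERE author_id = 42;

    -- To confirm it's a cost decision, make sequential scans look
    -- prohibitively expensive for the session and re-run the EXPLAIN:
    SET enable_seqscan = off;
    EXPLAIN SELECT * FROM commits WHERE author_id = 42;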
With 5bn+ rows and a memory constraint, the type of index begins to make a difference - e.g. in Postgres I would have tried using a bloom index.
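For what it's worth, bloom indexes in Postgres live in a contrib extension; something along these lines (table and parameters are illustrative):

    CREATE EXTENSION IF NOT EXISTS bloom;
    -- A compact, lossy signature index; each hit must be rechecked
    -- against the heap. length is the signature size in bits,
    -- colN the bits generated per column.
    CREATE INDEX commits_bloom ON commits
      USING bloom (author_id, committer_id)
      WITH (length = 80, col1 = 2, col2 = 2);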
It can be very frustrating when, even with updated statistics, the optimizer decides to go a different way with a query. This happened quite a lot in Sybase when tables got rather large (corporation-large, not Google-large). The normal response is to force the indexes on the query. Back in the Ingres days I even remember running an awk script between queries because of log-space issues.
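In MySQL (since that's what the article used) forcing an index looks like this; Sybase had its own (INDEX ...) hint syntax. Table and index names are hypothetical:

    -- FORCE INDEX treats a table scan as prohibitively expensive, so
    -- the optimizer uses the named index wherever it possibly can.
    SELECT pc.project_id, c.author_date
    FROM project_commits pc
    JOIN commits c FORCE INDEX (PRIMARY) ON c.id = pc.commit_id;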
Only for every returned row. The projection does not affect the underlying result set; it merely "shapes" the rows you get back. The WHERE clause is always evaluated before the SELECT list.
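One way to see the evaluation order, using a hypothetical table:

    -- The alias defined in the SELECT list does not exist yet when the
    -- WHERE clause is evaluated, so this fails with "unknown column 'd'":
    SELECT commit_id, author_date AS d
    FROM commits
    WHERE d > '2017-01-01';
    -- WHERE filters first; the projection only shapes the surviving rows.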
From the article, the database did use an index for the lookup. However, if each index lookup hits spinning rust, it can take 10ms or so per entry, and that adds up.
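Back-of-the-envelope, assuming the article's roughly 5 billion rows and the worst case where every lookup is a seek:

    5 \times 10^9~\text{lookups} \times 10~\text{ms} = 5 \times 10^7~\text{s} \approx 580~\text{days}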
In contrast, if you have sorted/indexed both tables on the join key, then the database is able to do a merge join. This is effectively what the command line implementation did, and is much faster, assuming you can get hold of sorted data quickly to begin with. If the tables are unsorted, then the extra cost of sorting them first needs to be added.
Most databases will evaluate several methods of running the query; something like Postgres's EXPLAIN will show which plan was chosen.
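For example, in Postgres (plan abridged; index names are made up), sorted inputs on both sides let the planner pick a merge join:

    EXPLAIN SELECT *
    FROM project_commits pc
    JOIN commits c ON c.id = pc.commit_id;
    --  Merge Join
    --    Merge Cond: (pc.commit_id = c.id)
    --    ->  Index Scan using project_commits_commit_id_idx on project_commits pc
    --    ->  Index Scan using commits_pkey on commits c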
MySQL used the index on the joined table, but didn't use an index on the primary table because there wasn't anything to filter on. The join condition alone would still force a full scan of one table to gather all of the records.
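That's visible in MySQL's EXPLAIN output (columns abridged, names hypothetical): type ALL is a full table scan of the driving table, while eq_ref is one primary-key lookup per joined row:

    EXPLAIN SELECT pc.project_id, c.author_date
    FROM project_commits pc
    LEFT JOIN commits c ON c.id = pc.commit_id;
    -- +----+-------------+-------+--------+---------+
    -- | id | select_type | table | type   | key     |
    -- +----+-------------+-------+--------+---------+
    -- |  1 | SIMPLE      | pc    | ALL    | NULL    |
    -- |  1 | SIMPLE      | c     | eq_ref | PRIMARY |
    -- +----+-------------+-------+--------+---------+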