The last point, I think, is the most important: "very subtle and silently introduced mistakes" -- LLMs may be able to complete many tasks as well as (or better than) humans, but that doesn't mean they complete them the same way, and that's critically important when considering failure modes.
In particular, code review is one layer of the conventional Swiss cheese model of preventing bugs, but code review becomes much less effective when the categories of errors to look out for suddenly change.
When I review a PR with large code moves, it was historically relatively safe to assume that a block of code was moved as-is (sadly only an assumption, because GitHub still doesn't have indicators of duplicated/moved code like Phabricator had 10 years ago...), so I could focus my attention on higher-level concerns, like whether the new API design makes sense. But if an LLM did the refactor, I need to scrutinize every character that was touched in the block of code that was "moved", because, as the parent commenter points out, that "moved" code may actually have been ingested, summarized, and then rewritten from scratch based on that summary.
For this reason, I'm a big advocate of an "AI use" section in PR description templates; not because I care whether you used AI or not, but because some hints about where or how you used it will help me focus my efforts when reviewing your change, and tune the categories of errors I look out for.
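Concretely, it can be as small as a few prompts in .github/pull_request_template.md; this is just a sketch of the kind of questions I mean, not a prescription:

    ## AI use
    <!-- Optional, but it helps reviewers aim their attention. -->
    - Which parts of this change were generated or refactored with AI assistance?
    - Did you review/edit that output by hand, or is it largely as-generated?
    - Anything you'd like extra scrutiny on (moved code, constants, URLs, tests)?

The point isn't policing; it's telling the reviewer where the "rewritten from a summary" failure mode might be hiding.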
I was about to write my own tool for this but then I discovered:
git diff --color-moved=dimmed-zebra
That shows a lot of code that was properly moved/copied in gray (even if it's an insertion). So gray stuff exactly matches something that was there before. Can also be enabled by default in the git config.
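Concretely, enabling it by default is one setting (assuming a reasonably recent git):

    git config --global diff.colorMoved dimmed-zebra

After that, plain git diff and git show pick up the moved-code dimming without the flag.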
I used autochrome[0] for Clojure code to do this. (I also made some improvements to show added/removed comments, top-level form moves, and within-string/within-comment edits the way GitHub does.)
At first I didn't like the color scheme and replaced it with something prettier, but then I discovered it's actually nice to have it kinda ugly; it makes the diffs easier to spot.
That's a great solution and I'm adding it to my fallbacks. But people might also be interested in diff-so-fancy[0]. I also like using batcat as a pager.
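If anyone wants to wire that up, the usual setup (going from memory of the diff-so-fancy README, so double-check the exact flags) is to point git's pager at it:

    git config --global core.pager "diff-so-fancy | less --tabs=4 -RF"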
When using a reasonably smart LLM, code moves are usually fine, but you have to pay attention whenever uncommon words (like URLs or numbers) are involved.
It kind of forces you to always put such data in external files, which is better for code organization anyway.
If it's not necessary for understanding the code, I'll usually even leave this data out entirely when passing the code over.
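A minimal sketch of what that looks like in Python (the filename and keys here are made up for illustration):

    import json
    from pathlib import Path

    # Keep the literal strings the model tends to mangle out of the code it edits.
    config = json.loads(Path("endpoints.json").read_text())
    API_URL = config["api_url"]          # e.g. "https://api.example.com/v2"
    RETRY_LIMIT = config["retry_limit"]

The code the model touches then only ever references names, and the exact strings live somewhere it never rewrites.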
In Python code I often see Gemini add a second h to a random header file extension. It always feels like the LLM is making sure that I'm still paying attention.