You're correct that low-level details matter a great deal in the HPC space. The optimizations described in this paper are exactly that kind of detail! To see what they look like in practice, check out Halide's cool visualizations/documentation (Halide is what this paper compares itself to): https://halide-lang.org/tutorials/tutorial_lesson_05_schedul...
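For anyone who doesn't click through, here's a rough sketch of the idea those visualizations animate. This is not the paper's method, and it's plain Python rather than Halide. The point is that one "algorithm" (here a made-up f(x, y) = x + y on a tiny 4x4 grid) can be run under different loop "schedules". The traversal order changes, and with it cache behavior on real hardware, but the computed values do not. The function names below are hypothetical; they only loosely mirror Halide's reorder() and tile() scheduling directives.

```python
# Hypothetical illustration: one algorithm, several schedules.
# Only the order in which points are computed changes; the result is identical.

W, H = 4, 4  # tiny grid so the evaluation orders are easy to eyeball


def algorithm(x, y):
    # The pure definition of the stage, independent of any schedule.
    return x + y


def row_major():
    # Default scanline order: y is the outer loop, x the inner loop.
    return [(x, y) for y in range(H) for x in range(W)]


def column_major():
    # Loosely like Halide's reorder(y, x): y innermost, x outermost.
    return [(x, y) for x in range(W) for y in range(H)]


def tiled(tile_w=2, tile_h=2):
    # Loosely like tiling: visit tiles in scanline order,
    # then visit each pixel inside the current tile.
    order = []
    for ty in range(0, H, tile_h):
        for tx in range(0, W, tile_w):
            for y in range(ty, min(ty + tile_h, H)):
                for x in range(tx, min(tx + tile_w, W)):
                    order.append((x, y))
    return order


def realize(schedule):
    # Evaluate the algorithm at each point in the order the schedule dictates.
    out = [[None] * W for _ in range(H)]
    for x, y in schedule:
        out[y][x] = algorithm(x, y)
    return out


if __name__ == "__main__":
    for name, sched in [("row-major", row_major()),
                        ("column-major", column_major()),
                        ("tiled 2x2", tiled())]:
        # First few evaluated points show how the traversal differs.
        print(name, sched[:6])
```

In Halide proper, the schedule is written separately from the algorithm, and the compiler generates the loop nest, vectorization, and parallelism. That separation is what makes it practical to explore these low-level choices at all.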