Indeed. The power of attention is that it searches the space of functions and surfaces the best function given the constraints. This is why I think linear attention will never come close to the ability of standard attention: the quadratic term is a necessary feature of searching over all pairs of inputs and outputs.
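To make the contrast concrete, here is a rough NumPy sketch of where the quadratic term shows up. It is my own illustration, not code from any particular library. The feature map `phi` (elu(x) + 1) is just one common choice for linear attention and is an assumption here, as are the function names and the single-head, unmasked setup.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard attention: builds an explicit (n, n) matrix of pairwise scores."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])        # (n, n): every query against every key
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ V, weights

def phi(x):
    """Assumed feature map: elu(x) + 1, which is strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelised attention: keys and values are compressed into a (d, d_v) summary."""
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V          # (d, d_v): fixed-size summary, independent of sequence length
    z = Kf.sum(axis=0)     # (d,): normaliser
    return (Qf @ kv) / (Qf @ z)[:, None]

if __name__ == "__main__":
    n, d = 6, 4
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    out, weights = softmax_attention(Q, K, V)
    print(weights.shape)                    # pairwise weights grow with n * n
    print(linear_attention(Q, K, V).shape)  # no (n, n) matrix is ever formed
```

The point of the sketch is only the shapes: the softmax version keeps a separate weight for every query-key pair, while the linear version routes everything through a summary whose size does not depend on `n`. Whether that bottleneck is as limiting in practice as I'm claiming is the part that's open to debate.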