
Indeed. The power of attention is that it searches the space of functions and surfaces the best function given the constraints. This is why I think linear attention will never come close to the ability of standard attention: the quadratic term is a necessary feature of searching over all pairs of inputs and outputs.
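
To make the "all pairs" point concrete, here is a rough pure-Python sketch of the two mechanisms. The function names are mine, and the elu(x) + 1 feature map is just one common choice for kernelized linear attention, assumed here for illustration. Standard attention scores every (query, key) pair explicitly; the linear variant folds all keys and values into one fixed-size summary that every query reads from.

```python
import math

def softmax_attention(Q, K, V):
    """Standard attention: one score per (query, key) pair, so O(n^2) scores."""
    n, d, dv = len(Q), len(Q[0]), len(V[0])
    out = []
    for i in range(n):
        # Explicit comparison of query i against every key j (the quadratic part).
        scores = [sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
                  for j in range(n)]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) / z
                    for c in range(dv)])
    return out

def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Kernelized attention: no n x n matrix, only a d x d_v summary of K and V."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]  # running sum of phi(k_j) v_j^T
    z = [0.0] * d                       # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(dv):
                S[a][c] += fk[a] * v[c]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        # Each query only sees the shared summary, never an individual key.
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / denom
                    for c in range(dv)])
    return out
```

Both return a weighted average of the value rows. The difference is that the softmax version can put nearly all of its weight on a single key per query, while the linear version has to route everything through the d x d_v summary.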

