(Post author here.) Curious to hear more details about your workload, because a 5+-year-old Fermi would truly be hard-pressed to outperform Maxwell or even a Kepler K40, let alone Pascal.
It's parameter sweeps of delay differential equations, one simulation per thread. This requires a lot of complex array indexing and global memory access, so arithmetic intensity is far from optimal. Still, it's a real-world workload that benefits hugely from GPU acceleration.
Moving from a GTX 480 to a Kepler or Maxwell card, the spec-sheet numbers go up, but the measured performance doesn't. I might have a corner case, but before investing in new hardware I'd want to benchmark first rather than blindly follow the numbers.
People bought 400-series cards for their compute performance long after they were outdated. If your software wanted an Nvidia card, it was either that or step up to a Quadro. People bought the first Titan for the same reason.