Unless there is some major advance in nonconvex optimization here, (sub)gradient descent on the function you describe in your post is almost certainly converging to a local minimum as well. I guess the surprise is that local minima can perform well? But again, this does not seem like such a surprise when you consider the number of parameters being fitted. I do not have much experience with LSTMs per se, but modern conv-nets, for example, are basically O(10^8)-dimensional nonlinear maps. It just doesn't seem all that unexpected that you can embed a huge number of patterns into such an object.
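
Just to illustrate the first point (this is a toy sketch, not anything to do with the actual network in the post): plain gradient descent on even a simple nonconvex objective just settles into whichever basin it starts in. The function and starting point below are made up for illustration.

```python
import numpy as np

# Toy nonconvex objective with two minima: a deeper (global) one near x ~ -1.30
# and a shallower (local) one near x ~ 1.13, separated by a local max near x ~ 0.17.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

x = 1.0    # start in the basin of the shallow local minimum
lr = 0.01  # fixed step size
for _ in range(1000):
    x -= lr * grad_f(x)

# Converges to the local minimum near x ~ 1.13, not the global one near x ~ -1.30.
print(f"converged to x = {x:.3f}, f(x) = {f(x):.3f}")
```

The interesting question is why, in the high-dimensional case, the local minima you land in happen to be good ones, which is the parameter-count point above.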