
Whilst the mathematical theory is clear, it's not obvious that it should work well in practice.

The weights are initialized as noise, the optimization problem isn't convex so there are numerous local optima to get stuck in, and the distance between the relevant input and output can be dozens or hundreds of timesteps. The last point matters: gradients propagated back through that many steps tend to either shrink toward zero or blow up (the vanishing / exploding gradient problem). Those troublesome gradients are why traditional RNNs don't work particularly well - they can't actually make the connection between a distant input and the output. There's a rough sketch of the effect below.
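As a toy illustration (my own sketch, not from any paper or library - the dimensions and the spectral-radius rescaling are just illustrative choices): backprop through T timesteps multiplies T recurrent Jacobians together, so the gradient norm scales roughly like (spectral radius of the recurrent matrix)^T. Slightly below 1 it vanishes, slightly above 1 it explodes:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 64    # hidden size (arbitrary)
    T = 100   # timesteps to backpropagate through

    for rho in (0.9, 1.0, 1.1):  # target spectral radius of the recurrent matrix
        W = rng.standard_normal((n, n))
        W *= rho / np.abs(np.linalg.eigvals(W)).max()  # rescale to spectral radius rho
        grad = np.eye(n)
        for _ in range(T):
            grad = grad @ W.T  # one recurrent Jacobian per timestep; tanh' (<= 1) omitted
        print(f"rho={rho}: gradient norm after {T} steps ~ {np.linalg.norm(grad):.2e}")

With rho=0.9 the norm collapses to something negligible, and with rho=1.1 it grows by orders of magnitude - and a real network sits somewhere on that knife edge at initialization.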

Others can likely explain the intricacies better :)


