In the late 90s and early 2000s, the mainstream view of numerical optimization was that it was easy-ish when the problem was linear, and that if you had to rely on nonlinear optimization you were basically lost. People did use EM (an earlier relative of what is now called Bayesian learning), but they knew it was sensitive to initialization and that they probably weren't hitting a good enough maximum. Late-90s neural networks were basically a parlor trick: you could make them do little tricks, but almost everything we have now - lots of compute, good initialization, regularization techniques, pretraining - was absent back then.
Then, in the mid-to-late 2000s, the mainstream method became convex optimization: you had a proof that there was a single global optimum, and a wide range of optimization methods were guaranteed to reach it from essentially any initialization point. In parallel, the theory underlying SVMs and CRFs was developed, showing that you could do a large variety of things and still use these easy, dependable optimization techniques. And people hammered home the need for regularization.
In the late 2000s to early 2010s, several things came together again: one was the discovery of dropout as a regularization technique - and the understanding that that is what it was; another was the development of good initializers that made it possible to train deeper networks. Add to that greatly improved compute power, including CUDA, which turned GPUs - originally built to speed up graphics and texture computation - into the general-purpose parallel processors we know today.
All of this enabled a rediscovery of neural network learning, which could take off where linear learning methods (SVMs, CRFs) had plateaued. Often you ended up with a DNN that did what the linear classifier before it did, but could additionally learn its own features - so it could be seen as finding a solution that was strictly better.
But the lack of a global optimum means that - even with good initializers and regularization packaged into the NN modules of modern DNN software - the whole thing is far more finicky than CRFs ever were. (It would be wrong to say that CRFs are trivial to implement or never finicky, and many well-understood NN architectures do have a good out-of-the-box experience in TF/PyTorch etc., so take this as a general statement that may not hold in all cases.)
Deep learning is a form of optimization. Optimization means moving along a high-dimensional surface to find the lowest point. In principle this can be nearly impossible, because the surface might be covered in dramatic peaks, valleys, and saddle points that obscure the route to the lowest point. Some simulations suggest that this is not what the loss surfaces of deep networks look like, and that they instead resemble a big gentle slope down to the minimum, with only small bumps along the way.
Picture linear regression. Given a bunch of data points you want to fit a line to, for any candidate line you can add up the (squared) vertical distances between your data points and the line and get a measure of how inaccurate that line is. This is called the "loss". If you plotted the loss over different values of the slope and intercept, you would find that it is shaped like a big bowl, or valley. Fitting the regression by gradient descent is then the process of repeatedly taking a step downhill until you can't go anywhere but up, and that point gives you the optimal line.
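A minimal sketch of that downhill walk (the toy data, learning rate, and step count here are illustrative assumptions, not anything from the discussion above):

```python
import numpy as np

# Toy data: points roughly on the line y = 2x + 1, plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0   # start with an arbitrary line
lr = 0.1          # step size (learning rate)

for step in range(500):
    pred = w * x + b
    loss = np.mean((pred - y) ** 2)        # the bowl-shaped loss
    grad_w = np.mean(2 * (pred - y) * x)   # slope of the bowl in w
    grad_b = np.mean(2 * (pred - y))       # slope of the bowl in b
    w -= lr * grad_w                       # step downhill
    b -= lr * grad_b

print(w, b)  # ends up near 2 and 1
```

Because the loss is a single bowl, it doesn't matter where w and b start; the steps always lead to the same bottom.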
Neural networks train in a similar way. You have a loss function that adds up how wrong your predictions are compared to the training data. You compute which direction of change in the network's weights sends the loss downhill the fastest (the gradient), step the weights in that direction, and repeat. Since the loss function in this case is far more complex, it isn't a single valley but potentially many valleys, and you can end up at a decent local minimum rather than the global one.
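The same loop, sketched with PyTorch (the architecture and hyperparameters are arbitrary choices for illustration): the main differences from the regression example are that the gradients are computed automatically and the loss surface now has many valleys.

```python
import torch
import torch.nn as nn

# Toy data again, but now the target is a nonlinear function of x.
torch.manual_seed(0)
x = torch.linspace(-1, 1, 200).unsqueeze(1)
y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)

# A small network: its loss surface is no longer a single bowl.
model = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(x), y)   # how wrong the current weights are
    loss.backward()               # which direction is downhill for each weight
    opt.step()                    # take a step in that direction

print(loss.item())  # a decent (local) minimum, not guaranteed to be global
```

Rerun it with a different seed and you may land in a different valley with a slightly different loss - which is exactly the finickiness described above.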