Actually, there's already a better way to perform initialization, based on the so-called lottery ticket hypothesis [1]. I haven't gotten through the article yet, so I'll just regurgitate the abstract, but basically there frequently are subnetworks, exposed by pruning trained networks, that perform on par with the full-size neural nets while using roughly 20% of the parameters and training substantially faster. It turns out that, with some magic algorithm described in the paper, one can initialize weights so as to quickly find these "winning tickets" and drastically reduce network size and training time.
As far as I understand, there is no quick magic algorithm to find them: you train the full architecture the long and hard way as usual, then you identify the right subnetwork, and only then can you retrain faster from the architecture and initialization of just that subnetwork (roughly as sketched below).
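To make that loop concrete, here is a minimal sketch of one round of "train, prune by magnitude, rewind to the original init, retrain", in the spirit of the paper. The toy model, random data, training lengths, and the 20% keep-fraction are all placeholder assumptions, not the paper's actual setup.

```python
# One round of magnitude pruning with weight rewinding (lottery-ticket style).
# Everything here (model, data, hyperparameters) is illustrative only.
import copy
import torch
import torch.nn as nn

def train(model, data, targets, steps=200, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(data), targets).backward()
        opt.step()

# Toy setup: a small MLP on random data, just so the sketch runs end to end.
torch.manual_seed(0)
data = torch.randn(256, 20)
targets = torch.randint(0, 2, (256,))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

# 1. Remember the original initialization, then train the full network the long way.
init_state = copy.deepcopy(model.state_dict())
train(model, data, targets)

# 2. Identify the subnetwork: keep only the largest-magnitude weights (here 20%).
keep_fraction = 0.20
masks = {}
for name, param in model.named_parameters():
    if param.dim() > 1:  # prune weight matrices, leave biases dense
        threshold = torch.quantile(param.detach().abs(), 1 - keep_fraction)
        masks[name] = (param.detach().abs() >= threshold).float()

# 3. Rewind the surviving weights to their original initialization (the "winning ticket").
model.load_state_dict(init_state)
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in masks:
            param.mul_(masks[name])

# 4. Retrain just the subnetwork, re-applying the mask so pruned weights stay zero.
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(data), targets).backward()
    opt.step()
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.mul_(masks[name])
```

Note that step 1 is the expensive part: the full dense training run is still required before the subnetwork and its initialization can even be identified.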
>Additionally, we provide strong counterexamples to two recently proposed theories that models learned through pruning techniques can be trained from scratch to the same test set performance of a model learned with sparsification as part of the optimization process. Our results highlight the need for large-scale benchmarks in sparsification and model compression.
1. https://arxiv.org/abs/1803.03635