Not really the whole dataset. For ImageNet, I have a mini-batch size of 256, and I need to allocate the whole network on the GPU for that mini-batch (which is 256 * the number of neurons in the network), plus the parameters (the parameters are about 200MiB * 3, for the weights, the updates and the momentum). Also, to speed up certain operations the data needs to be reshaped, and there is 500MiB of scratch space just for that purpose. In total, I am using close to 6GiB of GPU memory. You can probably get down to about 4GiB if the batch size is 128.
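Rough arithmetic, if it helps (the per-image activation count below is just a placeholder I picked so the numbers land near the ~6GiB figure, not the real network's count):

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers only; the real per-layer activation counts
         * depend on the exact network configuration. */
        double batch_size = 256;
        double neurons_per_image = 5e6;   /* assumed activations per image */
        double bytes_per_float = 4;

        double activations = batch_size * neurons_per_image * bytes_per_float;
        double parameters  = 200e6 * 3;   /* weights + updates + momentum, ~200MiB each */
        double scratch     = 500e6;       /* reshape / scratch buffers */

        double total = activations + parameters + scratch;
        printf("approx. GPU memory: %.1f GiB\n", total / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

With those placeholder numbers it comes out to roughly 5.8GiB, and the activation term scales linearly with the batch size, which is why 128 images bring it under 4GiB.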
The code is by no means optimized to the extreme. I optimized it to the point of being able to finish in a reasonable time (9 days for 100 epochs). The convolutional kernels are parametrized (with templates and some macro tricks, the forward and backward propagation convolutional kernels are specialized into hundreds of small functions), and the best parameters are chosen with a mini benchmark at the beginning of the training process.
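A minimal sketch of the pattern in plain C rather than the actual CUDA code (the macro, the function names and the timing harness here are all made up for illustration; the real kernels are specialized on things like unrolling factors and tile sizes):

    #include <stdio.h>
    #include <time.h>

    /* Stamp out one specialized "kernel" per unrolling factor.  In the real
     * code these would be GPU kernels specialized by template/macro
     * parameters; a plain C loop stands in here. */
    #define DEFINE_CONV(unroll)                                               \
    static void conv_forward_##unroll(const float *in, float *out, int n)     \
    {                                                                         \
        for (int i = 0; i + unroll <= n; i += unroll)                         \
            for (int j = 0; j < unroll; j++)                                  \
                out[i + j] = in[i + j] * 0.5f; /* placeholder math */         \
    }

    DEFINE_CONV(2)
    DEFINE_CONV(4)
    DEFINE_CONV(8)

    typedef void (*conv_fn)(const float *, float *, int);

    /* Mini benchmark: time each candidate once and keep the fastest. */
    static conv_fn pick_best(const float *in, float *out, int n)
    {
        conv_fn candidates[] = { conv_forward_2, conv_forward_4, conv_forward_8 };
        conv_fn best = candidates[0];
        double best_t = 1e30;
        for (int i = 0; i < 3; i++) {
            clock_t start = clock();
            candidates[i](in, out, n);
            double t = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (t < best_t) { best_t = t; best = candidates[i]; }
        }
        return best;
    }

    int main(void)
    {
        static float in[1024], out[1024];
        conv_fn best = pick_best(in, out, 1024);
        best(in, out, 1024);
        printf("picked a kernel variant\n");
        return 0;
    }

The benchmark cost is paid once at startup, which is negligible against 9 days of training.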
So... duh. You would need > 1.2 TB to fit all of ImageNet on the card :). Thanks for clarifying, and pardon my brain lapse! Also, thanks for putting this out there - if I get some time I may send some pull requests your way. Awesome stuff.
You definitely need a 1.2TB SSD to train on the complete ImageNet dataset. The data is loaded into GPU memory only one batch (256 images) at a time, but the loading will be the bottleneck if you use a rotational disk.
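A generic way to hide some of that latency (just a sketch of the usual prefetching pattern, not necessarily what the code does; the image size below is an assumption) is to load the next batch on a separate thread while the GPU works on the current one:

    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BATCH_SIZE 256
    #define IMAGE_BYTES (3 * 224 * 224)   /* assumed decoded input size */

    static unsigned char buffers[2][BATCH_SIZE * IMAGE_BYTES];

    /* Stand-in for reading and decoding 256 images from disk. */
    static void load_batch_from_disk(unsigned char *dst, int batch_index)
    {
        memset(dst, batch_index & 0xff, BATCH_SIZE * IMAGE_BYTES);
    }

    /* Stand-in for uploading the batch to the GPU and running fwd/bwd. */
    static void train_on_batch(const unsigned char *src)
    {
        (void)src;
    }

    struct loader_args { unsigned char *dst; int batch_index; };

    static void *loader_thread(void *p)
    {
        struct loader_args *a = p;
        load_batch_from_disk(a->dst, a->batch_index);
        return NULL;
    }

    int main(void)
    {
        int total_batches = 8;   /* illustrative */
        load_batch_from_disk(buffers[0], 0);
        for (int b = 0; b < total_batches; b++) {
            pthread_t loader;
            struct loader_args args = { buffers[(b + 1) % 2], b + 1 };
            int prefetch = b + 1 < total_batches;
            if (prefetch)
                pthread_create(&loader, NULL, loader_thread, &args);
            train_on_batch(buffers[b % 2]);   /* overlaps with the disk I/O */
            if (prefetch)
                pthread_join(loader, NULL);
        }
        printf("done\n");
        return 0;
    }

Even with that overlap, if a rotational disk can't deliver a batch in less time than the GPU needs to process one, the disk stays the bottleneck.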