Training TensorFlow Inception-v3 on ImageNet with a modest hardware setup

I've been training Inception V3 on a modest machine with a single GPU (GeForce GTX 980 Ti, 6GB). The maximum batch size appears to be around 40.

I've used the default learning rate settings specified in the inception_train.py file: initial_learning_rate = 0.1, num_epochs_per_decay = 30 and learning_rate_decay_factor = 0.16. After a couple of weeks of training the best accuracy I was able to achieve is as follows (About 500K-1M iterations):

2016-06-06 12:07:52.245005: precision @ 1 = 0.5767 recall @ 5 = 0.8143 [50016 examples]
2016-06-09 22:35:10.118852: precision @ 1 = 0.5957 recall @ 5 = 0.8294 [50016 examples]
2016-06-14 15:30:59.532629: precision @ 1 = 0.6112 recall @ 5 = 0.8396 [50016 examples]
2016-06-20 13:57:14.025797: precision @ 1 = 0.6136 recall @ 5 = 0.8423 [50016 examples]

I've tried fiddling with the settings towards the end of the training session, but couldn't see any improvements in accuracy.

I've started a new training session from scratch with num_epochs_per_decay = 10 and learning_rate_decay_factor = 0.001 based on some other posts in this forum, but it's sort of grasping in the dark here.

Any recommendations on good defaults for a small hardware setup like mine?

Thundersquall answered 8/7, 2016 at 4:59

TL;DR: There is no known method for training an Inception V3 model from scratch in a tolerable amount of time on a modest hardware setup. I would strongly suggest retraining a pre-trained model on your desired task.

On a small hardware setup like yours, it will be difficult to achieve maximum performance. Generally speaking, CNNs perform best when trained with the largest batch size possible. This means that for CNNs the training procedure is often limited by the maximum batch size that can fit in GPU memory.

The Inception V3 model available for download here was trained with an effective batch size of 1600 across 50 GPUs, where each GPU ran a batch size of 32.

Given your modest hardware, my number one suggestion would be to download the pre-trained model from the link above and retrain it for the individual task you have at hand. This would make your life much happier.

As a thought experiment (but hardly practical): if you feel especially compelled to exactly match the training performance of the pre-trained model by training from scratch, you could run the following insane procedure on your 1 GPU:

1. Run with a batch size of 32.
2. Store the gradients from the run.
3. Repeat this 50 times.
4. Average the gradients from the 50 batches.
5. Update all variables with the averaged gradients.
6. Repeat.
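
To make the steps above concrete, here is a minimal sketch of gradient accumulation on a toy linear model in plain Python. It is not Inception code: the model, the data, and the names `batch_gradient` and `accumulated_step` are all made up for illustration. The only point is that averaging the gradients of 50 batches of 32 gives the same update as one batch of 1600.

```python
import random

def batch_gradient(w, b, batch):
    """Mean gradient of 0.5 * (w*x + b - y)**2 over one batch of (x, y) pairs."""
    n = len(batch)
    gw = sum((w * x + b - y) * x for x, y in batch) / n
    gb = sum((w * x + b - y) for x, y in batch) / n
    return gw, gb

def accumulated_step(w, b, data, micro_batch=32, num_micro=50, lr=0.1):
    """One parameter update built from num_micro stored micro-batch gradients."""
    sum_gw = sum_gb = 0.0
    for i in range(num_micro):
        batch = data[i * micro_batch:(i + 1) * micro_batch]
        gw, gb = batch_gradient(w, b, batch)  # run one batch of 32
        sum_gw += gw                          # store (accumulate) its gradient
        sum_gb += gb
    avg_gw = sum_gw / num_micro               # average over the 50 batches
    avg_gb = sum_gb / num_micro
    # Update all variables once, using the averaged gradients.
    return w - lr * avg_gw, b - lr * avg_gb, (avg_gw, avg_gb)

# Toy data (hypothetical): 1600 points on the line y = 3x + 1.
random.seed(0)
data = [(x, 3.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(1600))]

w, b, avg_grad = accumulated_step(0.0, 0.0, data)
full_grad = batch_gradient(0.0, 0.0, data)  # gradient of one batch of 1600
```

Here `avg_grad` and `full_grad` agree up to floating-point error, because all 50 batches have the same size. The cost is that each update needs 50 forward/backward passes, which is why the procedure is so slow on one GPU.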

I am only mentioning this to give you a conceptual sense of what would need to be accomplished to achieve the exact same performance. Given the speed numbers you mentioned, this procedure would take months to run. Hardly practical.

More realistically, if you are still strongly interested in training from scratch and doing the best you can, here are some general guidelines:

Always run with the largest batch size possible. It looks like you are already doing that. Great.
Make sure that you are not CPU bound. That is, make sure that the input processing queues are always modestly full as displayed on TensorBoard. If not, increase the number of preprocessing threads or use a different CPU if available.
Re: learning rate. If you are always running synchronous training (which must be the case if you only have 1 GPU), then the higher the batch size, the higher the tolerable learning rate. I would try a series of several quick runs (e.g. a few hours each) to identify the highest learning rate possible which does not lead to NaNs. After you find such a learning rate, knock it down by say 5-10% and run with that.
As for num_epochs_per_decay and learning_rate_decay_factor, there are several strategies. The strategy highlighted by 10 epochs per decay, 0.001 decay factor is to hammer the model for as long as possible until the eval accuracy asymptotes, and then lower the learning rate. This is a simple strategy, which is nice. I would verify that this is what you see in your model: monitor the eval accuracy and confirm that it has indeed asymptoted before you allow the model to decay the learning rate. Finally, the decay factor is a bit ad hoc, but lowering by say a power of 10 seems to be a good rule of thumb.
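
To see how these settings translate into training steps, here is a rough sketch of the decay schedule in plain Python. It rests on two assumptions that are not stated in this thread: that the learning rate follows a staircase exponential decay (the behaviour of `tf.train.exponential_decay` with `staircase=True`), and that one epoch is the 1,281,167 training images of ILSVRC-2012. The function names are made up for illustration.

```python
def steps_per_decay(batch_size, epochs_per_decay, examples_per_epoch=1281167):
    """Number of training steps between two learning rate drops."""
    batches_per_epoch = examples_per_epoch / batch_size
    return int(batches_per_epoch * epochs_per_decay)

def learning_rate_at(step, batch_size, initial_lr=0.1, epochs_per_decay=30.0,
                     decay_factor=0.16, examples_per_epoch=1281167):
    """Staircase exponential decay: multiply by decay_factor every decay period."""
    period = steps_per_decay(batch_size, epochs_per_decay, examples_per_epoch)
    return initial_lr * decay_factor ** (step // period)

# Settings from the question: batch size 40 on one GPU.
default_period = steps_per_decay(40, 30)   # defaults: 30 epochs per decay
new_period = steps_per_decay(40, 10)       # new run: 10 epochs per decay
lr_at_500k = learning_rate_at(500000, 40)
lr_at_1m = learning_rate_at(1000000, 40)
```

If those assumptions hold, the default settings at batch size 40 drop the learning rate only once every 960,875 steps, so a run of 500K-1M iterations sees at most a single decay. With 10 epochs per decay the period shrinks to 320,291 steps. That is worth checking against the point where your eval accuracy actually flattens.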

Note again that these are general guidelines and others might even offer differing advice. The reason why we cannot give you more specific guidance is that CNNs of this size are just not often trained from scratch on a modest hardware setup.

Snath answered 10/7, 2016 at 21:33

Excellent tips. There is precedent for training with a setup similar to yours. Check this out - http://3dvision.princeton.edu/pvt/GoogLeNet/ These people trained GoogLeNet, but using Caffe. Still, studying their experience would be useful.

Holm answered 6/12, 2016 at 9:50
