I'm trying to understand how `TrainingParameterScheduleDouble` works in the CNTK C# API. Unfortunately, there is no documentation, and the previous SO thread here appears to be incorrect/incomplete, so I've tried to reverse-engineer the behavior myself. Can anyone confirm my conclusions and answer the lingering questions I have?
Overload #1
`TrainingParameterScheduleDouble(value, minibatchSize)`
This sets the learning rate to `value` per `minibatchSize` samples, regardless of the actual minibatch size passed to `GetNextMinibatch`. Thus, using `minibatchSize: 1` is an easy way to specify a per-sample learning rate.
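For concreteness, here is how I'm constructing the per-sample case, following the pattern in the C# training examples; `model`, `loss`, and `evalError` are just placeholders for an arbitrary network:

```csharp
// 0.001 is interpreted per sample because minibatchSize is 1,
// no matter what size is later passed to GetNextMinibatch.
var lrPerSample = new TrainingParameterScheduleDouble(0.001, 1);

var learner = Learner.SGDLearner(model.Parameters(), lrPerSample);
var trainer = Trainer.CreateTrainer(model, loss, evalError,
                                    new List<Learner> { learner });
```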
It seems to me that calling the second parameter `minibatchSize` is very misleading in this context, since it's totally unrelated to the actual size of each minibatch. I think a better name would have been something like `perNumSamples`, or am I missing something?
Overload #2
`TrainingParameterScheduleDouble(value)`
This is the same as setting `minibatchSize: 0` above, and has the effect of using the "natural" minibatch size, i.e. whatever size is actually passed to `GetNextMinibatch`, as the number of samples. So with `GetNextMinibatch(64)`, `new TrainingParameterScheduleDouble(0.001)` results in a per-sample learning rate 64x smaller than `new TrainingParameterScheduleDouble(0.001, 1)`.
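To make the comparison concrete, this is the kind of pair I've been testing (`minibatchSource` and `device` stand in for whatever source and `DeviceDescriptor` you're training with):

```csharp
// Actual minibatch of 64 samples drawn from the source.
var minibatchData = minibatchSource.GetNextMinibatch(64, device);

// Interpreted per "natural" minibatch, i.e. per 64 samples here:
var lrNatural = new TrainingParameterScheduleDouble(0.001);

// Interpreted per single sample, regardless of the 64 above:
var lrPerSample = new TrainingParameterScheduleDouble(0.001, 1);
```

If that's right, the two schedules only coincide when the actual minibatch size happens to be 1.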
Overload #3
`TrainingParameterScheduleDouble(schedule)`
This changes the learning rate over time, using the "natural" minibatch size. So a schedule of `(30, 0.321), (1, 0.123)` will use a per-actual-minibatch learning rate of 0.321 for the first 30 minibatches and a rate of 0.123 thereafter.
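As far as I can tell, the schedule argument is a vector of (count, value) pairs. I couldn't find documentation for the pair/vector types, so the names below (`VectorPairSizeTDouble` and `PairSizeTDouble`) are just my guess at what the SWIG-generated bindings expose; substitute whatever the generated types are actually called in your build:

```csharp
// (count, value) pairs; type names are my assumption, see note above.
var schedule = new VectorPairSizeTDouble();
schedule.Add(new PairSizeTDouble(30, 0.321));  // 0.321 for the first 30 units
schedule.Add(new PairSizeTDouble(1, 0.123));   // 0.123 for everything after that

// Overload #3: the schedule is applied per "natural" minibatch.
var lr = new TrainingParameterScheduleDouble(schedule);
```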
Overload #4
`TrainingParameterScheduleDouble(schedule, epochSize)`
Passing an `epochSize` causes `IsSweepBased()` to return `false` instead of `true`, but otherwise it has no apparent effect on the learning rate or anything else. This is surprising. Can anyone explain the purpose of `epochSize` in this context?
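For example, with an arbitrary `epochSize` of 1000 (and `schedule` built as above), the only difference I can observe is:

```csharp
var withoutEpochSize = new TrainingParameterScheduleDouble(schedule);
var withEpochSize    = new TrainingParameterScheduleDouble(schedule, 1000);

Console.WriteLine(withoutEpochSize.IsSweepBased());  // True
Console.WriteLine(withEpochSize.IsSweepBased());     // False
```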
Overload #5
`TrainingParameterScheduleDouble(schedule, epochSize, minibatchSize)`
This is the only way to change the learning rate over time without using the natural minibatch size. So a schedule of `(30, 0.321), (1, 0.123)` with `minibatchSize: 1` will use a per-sample learning rate of 0.321 for the first 30 samples (regardless of the actual minibatch size) and a rate of 0.123 thereafter. As before, the epoch size has no apparent effect.
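So, if I've understood this correctly, a per-sample schedule that ignores the actual minibatch size looks like this (the `epochSize` of 1000 is again chosen arbitrarily, since it doesn't seem to matter, and the pair/vector type names are the same assumption as above):

```csharp
var schedule = new VectorPairSizeTDouble();
schedule.Add(new PairSizeTDouble(30, 0.321));  // 0.321 for the first 30 samples
schedule.Add(new PairSizeTDouble(1, 0.123));   // 0.123 from sample 31 onwards

// minibatchSize: 1 makes the schedule count samples rather than minibatches.
var lr = new TrainingParameterScheduleDouble(schedule, 1000, 1);
```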
Assuming this is correct, it's not clear to me what happens if the learning rate changes in the middle of a minibatch. Can anyone clarify?