Tensorflow: How to set the learning rate in log scale and some Tensorflow questions

I am a deep learning and Tensorflow beginner and I am trying to implement the algorithm from this paper in Tensorflow. The paper uses Matconvnet + Matlab, and I am curious whether Tensorflow has equivalent functions to achieve the same thing. The paper says:

The network parameters were initialized using the Xavier method [14]. We used the regression loss across four wavelet subbands under l2 penalty and the proposed network was trained by using the stochastic gradient descent (SGD). The regularization parameter (λ) was 0.0001 and the momentum was 0.9. The learning rate was set from 10^-1 to 10^-4, which was reduced in log scale at each epoch.

This paper uses a wavelet transform (WT) and a residual learning method (where the residual image = WT(HR) - WT(HR'), and HR' is used for training). The Xavier method suggests initializing the variables from a normal distribution with

stddev = sqrt(2/(filter_size*filter_size*num_filters))

Q1. How should I initialize the variables? Is the code below correct?

weights = tf.Variable(tf.random_normal[img_size, img_size, 1, num_filters], stddev=stddev)

This paper does not explain how to construct the loss function in detail. I am unable to find an equivalent Tensorflow function to set the learning rate in log scale (only exponential_decay). I understand MomentumOptimizer is equivalent to stochastic gradient descent with momentum.

Q2: Is it possible to set the learning rate in log scale?

Q3: How to create the loss function described above?

I followed this website to write the code below. Assume the model() function returns the network mentioned in this paper and λ = 0.0001:

inputs = tf.placeholder(tf.float32, shape=[None, patch_size, patch_size, num_channels])
labels = tf.placeholder(tf.float32, [None, patch_size, patch_size, num_channels])

# get the model output and weights for each conv
pred, weights = model()

# define loss function
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=pred)

regularizers = 0.0
for weight in weights:
    regularizers += tf.nn.l2_loss(weight)

loss = tf.reduce_mean(loss + 0.0001 * regularizers)

learning_rate = tf.train.exponential_decay(???) # Not sure if we can have custom learning rate for log scale
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(loss, global_step)

NOTE: As I am a deep learning/Tensorflow beginner, I copy-pasted code from here and there, so please feel free to correct it if you can ;)

Locally asked 22/11, 2017 at 10:0 Comment(2)
Which TF and Python versions are you using? - Clintonclintonia
@MaxB I am using TF 1.4.0 and Python 2.7.14 or 3.6.3 - Locally

Q1. How should I initialize the variables? Is the code below correct?

That's correct (although it is missing an opening parenthesis: it should be tf.random_normal([img_size, img_size, 1, num_filters], stddev=stddev)). You could also look into tf.get_variable if the variables are going to be reused.
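
For reference, a minimal sketch of such an initialization with tf.get_variable (TF 1.x; the filter_size and num_filters names come from the question, and the choice of initializer here is my assumption, not the paper's exact recipe):

# Sketch only (TF 1.x). xavier_initializer() is the Glorot scheme;
# tf.variance_scaling_initializer(scale=2.0, mode='fan_out') roughly matches the
# sqrt(2/(filter_size*filter_size*num_filters)) rule quoted in the question.
init = tf.contrib.layers.xavier_initializer()
weights = tf.get_variable("conv1_weights",
                          shape=[filter_size, filter_size, 1, num_filters],
                          initializer=init)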

Q2: Is it possible to set the learning rate in log scale?

Exponential decay decreases the learning rate at every step. I think what you want is tf.train.piecewise_constant, with a boundary at each epoch.
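
For example, a rough sketch of a log-scale schedule with tf.train.piecewise_constant, assuming a steps_per_epoch value computed from your dataset (the 10^-1 to 10^-4 range comes from the paper; everything else here is illustrative):

# Sketch only: drop the learning rate by 10x at each epoch boundary,
# i.e. a straight line on a log scale. steps_per_epoch is assumed to be
# the number of training batches per epoch.
global_step = tf.Variable(0, trainable=False)
boundaries = [steps_per_epoch, 2 * steps_per_epoch, 3 * steps_per_epoch]
values = [1e-1, 1e-2, 1e-3, 1e-4]  # one value per interval
learning_rate = tf.train.piecewise_constant(global_step, boundaries, values)
optimizer = tf.train.MomentumOptimizer(learning_rate, 0.9).minimize(loss, global_step=global_step)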

EDIT: Look at the other answer, use the staircase=True argument!

Q3: How to create the loss function described above?

Your loss function looks correct.

Blinni answered 23/11, 2017 at 0:33 Comment(2)
Actually I have not used the variable logits, and I am not sure how I should associate it with labels and pred to construct the first loss function. Can you please suggest how to modify these two lines? - Locally
Thanks for your suggestion to use tf.train.piecewise_constant. Can you please show me how to use this function to set the learning rate in log scale? - Locally

Q1. How should I initialize the variables? Is the code below correct?

Use tf.get_variable or switch to slim (it does the initialization automatically for you); see the sketch below.
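
For example, a minimal sketch with TF-Slim (the layer name and kernel size are illustrative; slim.conv2d uses a Xavier-style weight initializer by default, so no manual tf.Variable is needed):

import tensorflow.contrib.slim as slim

# Sketch: one conv layer; slim creates and initializes the filter weights for you.
net = slim.conv2d(inputs, num_filters, [3, 3], scope='conv1')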

Q2: Is it possible to set the learning rate in log scale?

You can, but do you need it? This is not the first thing that you need to solve in this network. Please check Q3 below.

However, just for reference, you can use the following notation:

global_step = tf.Variable(0, trainable=False)

learning_rate_node = tf.train.exponential_decay(learning_rate=0.001, global_step=global_step,
                                                decay_steps=10000, decay_rate=0.98, staircase=True)

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_node).minimize(loss, global_step=global_step)

Q3: How to create the loss function described above?

First of all, you have not shown the "pred" to "image" conversion in this post (based on the paper, you need to apply a subtraction and the IDWT to obtain the final image).

There is one problem here: the logits have to be calculated based on your label data. I.e., if you will use the marked data as "Y : Label", you need to write:

pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
loss = tf.reduce_mean(tf.abs(logits - labels))

This will give you the "Y : Label" output to be used.

If your dataset's labeled images are the denoised ones, you instead need to follow this one:

pred = model()
pred = tf.matmul(pred, weights) + biases
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits)  # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))

The logits are the output of your network; you will use them to calculate the rest. Instead of a matmul, you can add a conv2d layer here without batch normalization or an activation function, and set the output feature count to 4. Example:

pred = model()
pred = slim.conv2d(pred, 4, [3, 3], activation_fn=None, padding='SAME', scope='output')
logits = tf.nn.softmax(pred)
image = apply_IDWT("X : input", logits)  # this will apply IDWT(x_label - y_label)
loss = tf.reduce_mean(tf.abs(image - labels))

This loss function will give you basic training capability. However, this is the L1 distance and it may suffer from some issues. Consider the following situation:

Let's say you have the following array as output, [10, 10, 10, 0, 0], and you are trying to achieve [10, 10, 10, 10, 10]. In this case, your loss is 20 (10 + 10). However, you have 3/5 success. Also, it may indicate some overfitting.

For the same target, consider the output [6, 6, 6, 6, 6]. It still has a loss of 20 (4 + 4 + 4 + 4 + 4). However, if you apply a threshold of 5, you achieve 5/5 success. Hence, this is the case that we want.

If you use the L2 loss, for the first case you will have 10^2 + 10^2 = 200 as the loss output, while for the second case you will get 4^2 * 5 = 80. Hence, the optimizer will try to run away from case #1 as quickly as possible to achieve global success rather than perfect success on some outputs and complete failure on the others. You can apply a loss function like this for that:

tf.reduce_mean(tf.nn.l2_loss(logits - image))
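
As a quick numeric check of the example above (plain NumPy, just to illustrate the L1 vs. L2 behaviour):

import numpy as np

target = np.array([10., 10., 10., 10., 10.])
a = np.array([10., 10., 10., 0., 0.])   # 3/5 exact, two large misses
b = np.array([6., 6., 6., 6., 6.])      # every element slightly off

print(np.abs(a - target).sum(), np.abs(b - target).sum())        # L1: 20.0 vs 20.0
print(np.square(a - target).sum(), np.square(b - target).sum())  # L2: 200.0 vs 80.0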

Alternatively, you can look into the cross-entropy loss function (it applies softmax internally, so do not apply softmax twice):

tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=image, logits=pred))

Philbin answered 27/11, 2017 at 3:5 Comment(10)
As per the paper, the model() function should have a lot of conv layers. As per the Tensorflow tutorial, each conv should use a weight for its filter, so we should have a list of weights. Should we return the weights in model() and use them to calculate the loss function? In addition, I did not see how the regularization is applied to the loss function (I guess it should be similar to this website). Lastly, I would like to implement the log scale as per the paper if possible. Is it possible to define a custom learning rate? - Locally
You don't apply the loss to the weights; you apply the loss to the output. The loss is exactly defined as (what you get as output - what you should get). Regularization is a completely different thing. Try to understand the loss function first; then jumping to regularization will make more sense. - Philbin
It seems like you are lost in the paper. Consider the following roadmap: train a network disregarding the details of the paper, and make sure everything works and the network learns the output (the current one should not learn). Then look into how to apply regularization on the weight parameters (there are plenty of explanations for it). Then look into the log-scale learning rate. The last two steps refine the output as an enhancement; without solving the first one, the last two do not make sense. - Philbin
Yes, I am lost, as I am a beginner, and I appreciate you pointing out the roadmap. It would be very helpful if you could put the above description into code and explain step by step how I should construct the loss, apply regularization, set the learning rate in log scale, and use the optimizer to minimize the loss. The reason I asked whether we should apply the weights to the loss is that I have googled a lot of code and saw some of it done this way. My understanding is that we use these weights (for conv2d) to create the model, so it makes sense to evaluate them in the loss. - Locally
One more thing: the paper did mention L2 loss and regularization, so I do not think this website is wrong to add regularization to the loss. Can you please have a look at this website and see if it makes sense? - Locally
Regularization is something you use to prevent overfitting. In some cases, some neurons may become too responsive to noise; that's why we sum a penalty over the weights and biases and add it to the loss function. However, this is not essential for initial training. What you are asking for as a tutorial is the topic of a book rather than a post. The paper you are trying to implement is a high-end one. Start with easier implementations available in public (e.g. VGG 16 classification). As soon as you can train a simple network, jump into this problem. Buy a book that explains it step by step. - Philbin
I have updated the code above to use softmax_cross_entropy_with_logits. I hope it now makes more sense. Please feel free to comment. - Locally
Yes, this makes much more sense and it is the correct usage of the regularization, except for a minor mistake: regularizer = tf.nn.l2_loss(weights) has to be regularizer = tf.nn.l2_loss(weight) (the element, not the array). I also edited the exponential decay function as a reference. If my answer correctly replies to your question, please mark it as accepted. - Philbin
If you answer my key question, "How to set the learning rate in log scale", I will accept this as an answer. I had a look at the API and believe there is no such function in Tensorflow, so is it possible to define something like a math function to have our own custom learning rate? If so, how? - Locally
Also check #33920448 for how to set the learning rate over time manually. - Philbin

The other answers are very detailed and helpful. Here is a code example that uses a placeholder to decay the learning rate at log scale. HTH.

import tensorflow as tf
import numpy as np

# data simulation
N = 10000
D = 10
x = np.random.rand(N, D)
w = np.random.rand(D, 1)
y = np.dot(x, w)

print(y.shape)

# modeling
batch_size = 100
tni = tf.truncated_normal_initializer()
X = tf.placeholder(tf.float32, [batch_size, D])
Y = tf.placeholder(tf.float32, [batch_size, 1])
W = tf.get_variable("w", shape=[D, 1], initializer=tni)
B = tf.zeros([1])

# the learning rate is fed in at every step, so the schedule lives in Python
lr = tf.placeholder(tf.float32)

pred = tf.add(tf.matmul(X, W), B)
print(pred.shape)
mse = tf.reduce_sum(tf.losses.mean_squared_error(Y, pred))
opt = tf.train.MomentumOptimizer(lr, 0.9)

train_op = opt.minimize(mse)

learning_rate = 0.0001

do_train = True
acc_err = 0.0
sess = tf.Session()
sess.run(tf.global_variables_initializer())
while do_train:
  for i in range(100000):
     if i > 0 and i % N == 0:
       # epoch done, decrease learning rate by 2 (a straight line on a log2 scale)
       learning_rate /= 2
       print("Epoch completed. LR =", learning_rate)

     idx = i // batch_size + i % batch_size
     f = {X: x[idx:idx+batch_size, :], Y: y[idx:idx+batch_size, :], lr: learning_rate}
     _, err = sess.run([train_op, mse], feed_dict=f)
     acc_err += err
     if i % 5000 == 0:
       print("Average error = {}".format(acc_err/5000))
       acc_err = 0.0
Avrilavrit answered 2/12, 2017 at 19:23 Comment(8)
Sorry for my poor maths. Can you please explain why learning_rate /= 2 means decaying the learning rate at log scale? - Locally
@Locally The reasoning is similar to why a search in a binary search tree is O(log2(N)): you are halving your search space. Here, for example, let's say you started with a learning rate of 32; after 5 epochs your learning rate will be 32/2/2/2/2/2 = 1, and log2(32) = 5. If you plotted log2(lr) against the epoch number, it would be a straight line. You can choose any base you like and divide by that. - Avrilavrit
So if you plot the graph of the decay at log scale, should it look concave upwards (with negative slope) rather than concave downwards with negative slope? - Locally
At log scale it will be a straight line with negative slope. - Avrilavrit
I understand that mathematically, at log scale, it will be a straight line. But how did you tell Tensorflow to interpret the learning rate at log scale (is it just as simple as learning_rate /= 2)? I am confused, so this question may sound silly. - Locally
The learning rate schedule is something that you decide; TF does not try to interpret how you want to change the learning rate. When you specify something like tf.train.exponential_decay, you are making the learning rate a function of the global step and passing its value on to the optimizer. You can define your schedule explicitly through a placeholder (as above) or through a helper function provided by TF, e.g. tf.train.exponential_decay. - Avrilavrit
So are you saying that tf.train.exponential_decay already implements learning-rate decay at log scale? - Locally
Yes. learning_rate = tf.train.exponential_decay(starter_learning_rate, global_step, epoch_size/batch_size, 0.5, staircase=True) should be similar to the above. - Avrilavrit
