TensorFlow: Linear regression with non-negative constraints
I am trying to implement a linear regression model in TensorFlow, with additional constraints (coming from the domain) that the W and b terms must be non-negative.

I believe there are a couple of ways to do this.

  1. We can modify the cost function to penalize negative weights [Lagrangian approach] (see: TensorFlow - best way to implement weight constraints).
  2. We can compute the gradients ourselves and project the updated parameters onto [0, infinity) [projected gradient approach].

Approach 1: Lagrangian

When I tried the first approach, I would often end up with negative b.

I had modified the cost function from:

cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)

to:

cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 100.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint

Keeping the coefficients of nn_b and nn_w very high leads to instability and a very high cost.

Here is the complete code.

import numpy as np
import tensorflow as tf

n_samples = 50
train_X = np.linspace(1, 50, n_samples)
train_Y = 10 * train_X + 6 + 40 * np.random.randn(n_samples)

X = tf.placeholder("float")
Y = tf.placeholder("float")

# Set model weights
W = tf.Variable(np.random.randn(), name="weight")
b = tf.Variable(np.random.randn(), name="bias")

# Construct a linear model
pred = tf.add(tf.multiply(X, W), b)

# Gradient descent
learning_rate = 0.0001
# Initializing the variables
init = tf.global_variables_initializer()

# Mean squared error
cost = tf.reduce_sum(tf.pow(pred-Y, 2))/(2*n_samples)
nn_w = tf.reduce_sum(tf.abs(W) - W)
nn_b = tf.reduce_sum(tf.abs(b) - b)
constraint = 1.0*nn_w + 100*nn_b
cost_with_constraint = cost + constraint
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost_with_constraint)

training_epochs=200
with tf.Session() as sess:
    sess.run(init)

    # Fit all training data
    cost_array = np.zeros(training_epochs)
    W_array = np.zeros(training_epochs)
    b_array = np.zeros(training_epochs)

    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(optimizer, feed_dict={X: x, Y: y})
        # Record the parameters and the cost once per epoch
        W_array[epoch] = sess.run(W)
        b_array[epoch] = sess.run(b)
        cost_array[epoch] = sess.run(cost, feed_dict={X: train_X, Y: train_Y})

The following is the mean of b across 10 different runs.

0   -1.101268
1    0.169225
2    0.158363
3    0.706270
4   -0.371205
5    0.244424
6    1.312516
7   -0.069609
8   -1.032187
9   -1.711668

Clearly, the first approach is not optimal. Further, there is a lot of art involved in choosing the coefficients of the penalty terms.

Approach 2: Projected gradient

I then turned to the second approach, which should come with stronger guarantees.

gr = tf.gradients(cost, [W, b])

We manually compute the gradients and update W and b ourselves:

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            W_del, b_del = sess.run(gr, feed_dict={X: x, Y: y})
            W = max(0, (W - W_del)*learning_rate) # Project onto [0, infinity)
            b = max(0, (b - b_del)*learning_rate) # Project onto [0, infinity)

This approach seems to be very slow.

I am wondering if there is a better way to run the second approach, or guarantee the results with the first approach. Can we somehow allow the optimizer to ensure that the learnt weights are non-negative?

Edit: How to do this in Autograd

https://github.com/HIPS/autograd/issues/207
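
For reference, here is a minimal projected-gradient sketch in Autograd (my own illustration, not code from the linked issue; it reuses train_X, train_Y, n_samples, and learning_rate from above):

import autograd.numpy as np
from autograd import grad

def cost(params):
    W, b = params
    pred = W * train_X + b
    return np.sum((pred - train_Y) ** 2) / (2 * n_samples)

cost_grad = grad(cost)

W, b = 0.1, 0.1
for epoch in range(200):
    dW, db = cost_grad((W, b))
    # Gradient step, then projection onto [0, infinity)
    W = max(0.0, W - learning_rate * dW)
    b = max(0.0, b - learning_rate * db)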

Sedulous answered 28/3/2017 at 9:36. Comments (2):
I haven't read everything, so this might not apply to you, but if you only want to use positive weights, you can do things like learning the logarithm of b instead of b. In effect, this means you use exp(b) instead of b in the prediction; the effective bias will then always be > 0. - Wakashan
Thanks for the comment @etarion; both W and b should be non-negative. Do you then recommend that we learn the log of both W and b? Doesn't that change Y = W.X + b? - Sedulous
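
To illustrate the idea from the comment above: one can reparameterize and learn log-variables, so the model still computes Y = W.X + b, but the effective W and b are always positive (a sketch; log_W and log_b are illustrative names):

# Learn unconstrained log-parameters; the effective W and b are always > 0
log_W = tf.Variable(np.random.randn(), name="log_weight")
log_b = tf.Variable(np.random.randn(), name="log_bias")
W = tf.exp(log_W)
b = tf.exp(log_b)
pred = tf.add(tf.multiply(X, W), b)  # still Y = W*X + b, with W, b > 0

Note that with this parameterization W and b can get arbitrarily close to 0 but never reach it exactly.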

If you modify your linear model to:

pred = tf.add(tf.multiply(X, tf.abs(W)), tf.abs(b))

it will have the same effect as using only positive W and b values.
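
In the question's code, only the model line changes and the penalty terms disappear; a sketch:

# Reparameterized model: the effective weight and bias are |W| and |b|
pred = tf.add(tf.multiply(X, tf.abs(W)), tf.abs(b))
cost = tf.reduce_sum(tf.pow(pred - Y, 2)) / (2 * n_samples)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# After training, read out the effective non-negative parameters:
# W_eff, b_eff = sess.run([tf.abs(W), tf.abs(b)])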

The reason your second approach is slow is that you clip the W and b values outside of the TensorFlow graph. (Also, it will not converge: (W - W_del)*learning_rate must instead be W - W_del*learning_rate.)

Edit:

You can implement the clipping inside the TensorFlow graph like this:

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)

In this case the W and b values will be clipped to exactly 0, not to small positive numbers.
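
Applied to the linear-regression code from the question, a minimal sketch looks like this (reusing W, b, X, Y, cost, learning_rate, and training_epochs defined there):

train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

# Run the clip ops after every gradient step, inside the graph
with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            sess.run(train_step_with_clip, feed_dict={X: x, Y: y})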

Here is a small MNIST example with clipping:

import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

x = tf.placeholder(tf.uint8, [None, 28, 28])
# Flatten each image to a 784-vector and scale pixel values to [0, 1]
x_vec = tf.cast(tf.reshape(x, [-1, 784]), tf.float32) / 255.

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.matmul(x_vec, W) + b

y_target = tf.placeholder(tf.uint8, [None])
y_target_one_hot = tf.one_hot(y_target, 10)

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_target_one_hot, logits=y))

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.control_dependencies([train_step]):
    clip_W = W.assign(tf.maximum(0., W))
    clip_b = b.assign(tf.maximum(0., b))
    train_step_with_clip = tf.group(clip_W, clip_b)

correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_target_one_hot, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

with tf.Session() as sess:
  tf.global_variables_initializer().run()

  # Feed batches of 100 examples, cycling through the training set
  for i in range(1000):
    sess.run(train_step_with_clip, feed_dict={
        x: x_train[(i*100)%len(x_train):((i+1)*100)%len(x_train)],
        y_target: y_train[(i*100)%len(x_train):((i+1)*100)%len(x_train)]})

    if not i%100:
      print("Min_W:", sess.run(tf.reduce_min(W)))
      print("Min_b:", sess.run(tf.reduce_min(b)))

  print("Accuracy:", sess.run(accuracy, feed_dict={
      x: x_test,
      y_target: y_test}))
Bigelow answered 2/4/2017 at 17:55. Comments (3):
Thanks for the answer. I get the following error: ---> 11 optimizer_step = tf.identity(optimizer_step). TypeError: Can't convert Operation 'GradientDescent_2' to Tensor (target dtype=None, name=u'input', as_ref=False). I think it would be great if you could help with a minimal fully working example for the second approach. - Sedulous
@Nipun Batra Yes, I made a mistake there; the identity operator can only be used on tensors. I edited the answer into a working version with an example. - Bigelow
Thanks, this answer is super useful. I will now update my question to reflect the answer for the simple dummy-data case! I have also awarded you the bounty :) - Sedulous

I actually was not able to reproduce your problem of getting negative b values with your first approach. But I agree that it is not optimal for your use case and can result in negative values.

You should be able to constrain your parameters to non-negative values like so:

W *= tf.cast(W > 0., tf.float32)
b *= tf.cast(b > 0., tf.float32)

(Exchange > with >= if necessary; the cast is needed because the comparison operators produce boolean values.) You would then optimize for the "standard cost" without the additional constraint terms. However, this does not work in every case; for example, initializing W or b with negative values should be avoided.
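
Taken literally, the two lines above rebind the Python names rather than updating the variables in place; an in-graph version of the same idea could use assign ops (a sketch; zero_negatives is an illustrative name):

# Zero out any negative entries of W and b
zero_negatives = tf.group(
    W.assign(W * tf.cast(W > 0., tf.float32)),
    b.assign(b * tf.cast(b > 0., tf.float32)))

# In the training loop, run it after each optimizer step:
# sess.run(train_step, feed_dict={X: x, Y: y})
# sess.run(zero_negatives)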

Your second (and probably better) approach can be accelerated by defining the update logic in the computational graph itself, i.e. after the definition of cost:

params = [W, b]
grads = tf.gradients(cost, params)
optimizer = [tf.assign(param, tf.maximum(0., param - grad*learning_rate))
             for param, grad in zip(params, grads)]

I think your solution is slow because it creates new computation nodes on every iteration, which is probably very costly when repeated so often inside the loops.
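
With these assign ops, the training loop only executes pre-built graph nodes, e.g.:

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(training_epochs):
        for (x, y) in zip(train_X, train_Y):
            # Executes the projected update: param <- max(0, param - grad*learning_rate)
            sess.run(optimizer, feed_dict={X: x, Y: y})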

Update: using a TensorFlow optimizer

In my solution above, it is not the gradients that are clipped but rather the resulting update values. Along the lines of this answer, you could clip each gradient to be at most the value of its parameter, like so:

params = [W, b]
opt = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = opt.compute_gradients(cost, params)
clipped_grads_vars = [(tf.clip_by_value(grad, -np.inf, var), var) for grad, var in grads_and_vars]
optimizer = opt.apply_gradients(clipped_grads_vars)

This way an update will never decrease a parameter to a value below 0: since each gradient is clipped to at most the parameter's current value, the update param - learning_rate*grad stays non-negative as long as learning_rate <= 1. However, I think this will not work if the variable is already negative, or if the optimizing algorithm somehow multiplies the clipped gradient by a value greater than 1. The latter might actually never happen, but I'm not 100% sure.

Vince answered 31/3/2017 at 18:36. Comments (2):
Thanks. This does seem to be getting to the required solution. However, it would be great if the computed (and clipped) gradient could then be passed on to the optimization algorithm, so that the optimizer takes care of things like adapting the learning rate. It might also be faster, since the optimizer code is highly efficient. - Sedulous
Thanks. I think the more principled approach is tf.assign(param, tf.maximum(0., param - grad*learning_rate)). It would be super useful if you could help with a complete working example. - Sedulous
