Gradient descent algorithm won't converge

I'm trying to write out a bit of code for the gradient descent algorithm explained in the Stanford Machine Learning lecture (lecture 2 at around 25:00). Below is the implementation I used at first, and I think it's properly copied over from the lecture, but it doesn't converge when I add large numbers (>8) to the training set.

I'm inputting a number X, and the point (X,X) is added to the training set, so at the moment, I'm only trying to get it to converge to y=ax+b where a=1=theta[1] and b=0=theta[0]. The training set is the arrays x and y, where (x[i], y[i]) is a point.

void train()
{
    double delta;
    for (int i = 0; i < x.size(); i++)
    {
        delta = y[i]-hypothesis(x[i]);   // prediction error for point i
        theta[1] += alpha*delta*x[i];    // LMS update for the slope
        theta[0] += alpha*delta*1;       // LMS update for the intercept (constant feature 1)
    }
}

void C_Approx::display()
{
    std::cout<<theta[1]<<"x + "<<theta[0]<<" \t "<<"f(x)="<<hypothesis(1)<<std::endl;
}

Some of the results I'm getting (I input a number, it runs train() a few times, then display()):

1
0.33616x + 0.33616   f(x)=0.67232
1
0.482408x + 0.482408     f(x)=0.964816
1
0.499381x + 0.499381     f(x)=0.998762
1
0.499993x + 0.499993     f(x)=0.999986
1
0.5x + 0.5   f(x)=1

An example of it diverging after it passed 8:

1
0.33616x + 0.33616   f(x)=0.67232
2
0.705508x + 0.509914     f(x)=1.21542
3
0.850024x + 0.449928     f(x)=1.29995
4
0.936062x + 0.330346     f(x)=1.26641
5
0.951346x + 0.231295     f(x)=1.18264
6
0.992876x + 0.137739     f(x)=1.13062
7
0.932206x + 0.127372     f(x)=1.05958
8
1.00077x + 0.000493063   f(x)=1.00126
9
-0.689325x + -0.0714712      f(x)=-0.760797
10
4.10321e+08x + 4.365e+07     f(x)=4.53971e+08
11
1.79968e+22x + 1.61125e+21   f(x)=1.9608e+22
12
-3.9452e+41x + -3.26957e+40      f(x)=-4.27216e+41

I tried the solution proposed here of scaling the step and ended up with similar results. What am I doing wrong?

Janerich answered 16/10, 2011 at 8:45 Comment(0)

Your implementation is good. Generally, stochastic gradient descent might diverge when α is too large. What you would do with a large dataset is take a reasonably sized random sample, find α that gives you the best results, and then use it for the rest.
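
A minimal sketch of that tuning loop, assuming made-up sample data and candidate α values; trainAndScore is a hypothetical helper (not part of the question's class) that runs the question's LMS updates and reports the squared-error cost J(θ):

#include <cmath>
#include <iostream>
#include <vector>

// Run the LMS updates from the question for a candidate alpha, then
// return the squared-error cost J(theta) so alphas can be compared.
double trainAndScore(const std::vector<double>& x, const std::vector<double>& y,
                     double alpha, int epochs)
{
    double theta0 = 0.0, theta1 = 0.0;
    for (int e = 0; e < epochs; e++)
        for (std::size_t i = 0; i < x.size(); i++)
        {
            double delta = y[i] - (theta1*x[i] + theta0);
            theta1 += alpha*delta*x[i];
            theta0 += alpha*delta;
        }
    double J = 0.0;
    for (std::size_t i = 0; i < x.size(); i++)
    {
        double r = y[i] - (theta1*x[i] + theta0);
        J += r*r;
    }
    return J / (2.0*x.size());
}

int main()
{
    // Assumed sample: a handful of (X, X) points drawn from the training set.
    std::vector<double> x = {1, 3, 5, 8, 10, 12};
    std::vector<double> y = x;

    double bestAlpha = 0.0, bestJ = 1e300;
    for (double alpha : {0.1, 0.03, 0.01, 0.003, 0.001})
    {
        double J = trainAndScore(x, y, alpha, 100);
        std::cout << "alpha=" << alpha << "  J=" << J << "\n";
        // Divergent alphas produce inf/nan costs, which isfinite filters out.
        if (std::isfinite(J) && J < bestJ) { bestJ = J; bestAlpha = alpha; }
    }
    std::cout << "best alpha: " << bestAlpha << "\n";
}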

Bellona answered 16/10, 2011 at 14:34 Comment(4)
How would you determine α based on a random sample? – Janerich
@howardh, simply by trying different values and choosing one that converges quickly to a small J(θ). – Bellona
So I just create a new set of data points selected randomly from the original training set, call train() with that set with a certain α, and if the error doesn't decrease with every step, I decrease α and repeat? – Janerich
It does not have to decrease every single step. It might be easier to take a range of values, like 1, 0.1, 0.01, etc., and try each over the whole small training set. SGD is one of the simplest classification methods, and sometimes a very effective one, but it takes a bit of guesswork. – Bellona

I experienced the same problem (albeit in Java) because my learning rate was too big.
In short, I was using α = 0.001 and I had to push it down to 0.000001 to see actual convergence.

Of course, these values depend on your dataset.

Lowgrade answered 3/3, 2014 at 10:5 Comment(0)

When your cost function increases or cycles up and down, you usually have too large a value for alpha. What alpha are you using?

Start out with alpha = 0.001 and see if that converges. If not, try various alphas (0.003, 0.01, 0.03, 0.1, 0.3, 1) and find one that converges quickly.

Scaling the data (normalization) won't help you with only 1 feature (your theta[1]) as normalization only applies to 2+ features (multivariate linear regression).

Also bear in mind that for a small number of features you can use the Normal Equation to get the correct answer.
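
For the single-feature case in the question, the Normal Equation theta = (X^T X)^-1 X^T y reduces to the textbook closed-form least-squares fit. A minimal sketch (the data points are assumptions):

#include <iostream>
#include <vector>

int main()
{
    // Assumed data: (X, X) points like the question's training set.
    std::vector<double> x = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<double> y = x;

    double n = x.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < x.size(); i++)
    {
        sx += x[i];  sy += y[i];
        sxx += x[i]*x[i];  sxy += x[i]*y[i];
    }
    // theta = (X^T X)^-1 X^T y, written out for one feature plus intercept.
    double theta1 = (n*sxy - sx*sy) / (n*sxx - sx*sx);
    double theta0 = (sy - theta1*sx) / n;
    std::cout << theta1 << "x + " << theta0 << std::endl;  // expect 1x + 0
}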

Datum answered 20/10, 2011 at 15:8 Comment(0)

Use backtracking line search to guarantee convergence. It is very simple to implement. See Stephen Boyd, Convex Optimization, for reference. You can choose standard alpha, beta values for backtracking line search, for example 0.3 and 0.8.
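
A minimal 1-D sketch of that procedure with those parameters, following the backtracking (Armijo) rule from Boyd & Vandenberghe; the objective f(x) = 20x² and the stopping constants are assumptions:

#include <cmath>
#include <iostream>

int main()
{
    const double k = 20.0;                       // curvature of the quadratic
    auto f  = [&](double x) { return k*x*x; };   // objective
    auto df = [&](double x) { return 2*k*x; };   // gradient

    const double alpha = 0.3, beta = 0.8;  // standard backtracking parameters
    double x = 12.0;

    for (int i = 0; i < 100 && std::fabs(df(x)) > 1e-6; i++)
    {
        double dx = -df(x);   // descent direction
        double t = 1.0;
        // Shrink t until the sufficient-decrease condition holds.
        while (f(x + t*dx) > f(x) + alpha*t*df(x)*dx)
            t *= beta;
        x += t*dx;
        std::cout << "iter " << i << ": x = " << x << ", f(x) = " << f(x) << "\n";
    }
}

Because t shrinks until the sufficient-decrease condition holds, the step adapts to the curvature automatically, which is exactly what a fixed α cannot do.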

Makalu answered 22/4, 2016 at 6:19 Comment(0)

It's not clear from your description what problem you're solving. (Also, it's risky to post links to external resources; you can get blocked on Stack Overflow.)

In any case, gradient descent (and subgradient descent too) with a fixed step size (the ML community calls it the learning rate) does not necessarily converge.

P.S. The machine learning community is not interested in the "convergence condition" or "convergence to what"; it is interested in creating something that passes cross-validation with a good result.

If you're curious about optimization, start looking into convex optimization. Unfortunately it's hard to find a job in it, but it gives a clean view of what happens in various mathematical optimization problems.

Here is source code which demonstrates it for a simple quadratic objective:

#!/usr/bin/env python3
# Gradient descent with a fixed step size need not converge,
# even for a convex objective.

alpha = 0.1   # fixed step size

#k = 10.0    # oscillates ("jumps") around the minimum forever
k = 20.0     # diverges
#k = 0.001   # converges, but the gap to the optimum is still big at i > 50

def f(x): return k * x * x   # objective
def g(x): return 2 * k * x   # its gradient

x0 = 12.0
xNext = x0
i = 0
threshold = 0.01

while True:
    i += 1
    xNext = xNext - alpha * g(xNext)   # fixed-step gradient step
    obj = f(xNext)                     # objective value at the new iterate
    print("Iteration: %i, Iterate: %f, Objective: %f, Optimality Gap: %f"
          % (i, xNext, obj, obj - f(0.0)))

    if abs(g(xNext)) < threshold:
        break
    if i > 50:
        break

print("\nYou launched application with x0=%f, threshold=%f" % (x0, threshold))
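
For this quadratic the iteration is x_next = (1 - 2*alpha*k) * x, so it converges exactly when |1 - 2*alpha*k| < 1, i.e. alpha < 1/k: with alpha = 0.1, k = 10 sits right on the oscillating boundary, k = 20 diverges, and k = 0.001 converges but only very slowly, matching the comments in the code.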
Cadell answered 8/6, 2018 at 21:40 Comment(0)

If I understand you correctly, your training set only has a non-zero gradient at the edge of a line? Unless you start on the line (actually, exactly at one of your training points), you won't find the line. You are always at a local minimum.

Ferland answered 16/10, 2011 at 14:22 Comment(0)
