Estimating linear regression with Gradient Descent (Steepest Descent)

Example data

X <- matrix(c(rep(1, 97), runif(97)), nrow = 97, ncol = 2)
y <- matrix(runif(97), nrow = 97, ncol = 1)

I have succeeded in creating the cost function:

COST <- function(theta, X, y) {
  ### calculate half of the MSE
  sum((X %*% theta - y)^2) / (2 * length(y))
}

However, when I run the gradient descent function below, it seems to fail to converge over 100 iterations.

theta <- matrix(0, nrow = 2, ncol = 1)
num.iters <- 1500
delta <- 0

GD <- function(X, y, theta, alpha, num.iters) {
    for (i in num.iters) {

        while (max(abs(delta)) < tolerance) {

            error <- X %*% theta - y
            delta <- (t(X) %*% error) / length(y)
            theta <- theta - alpha * delta
            cost_histo[i] <- COST(theta, X, y)
            theta_histo[[i]] <- theta

        }
    }
    return(list(cost_histo, theta_histo))
}

Can someone help me?

Cheers

Illumine answered 4/4, 2017 at 20:59 Comment(7)
I have tried this and it doesn't work... In the function GD, every line works if I run it outside the loop, but when I run the loop I get NaN... In GD2 I just get the mentioned error anyway...Illumine
It is odd that the error mentions `t(X) %*% t(theta)` but that does not seem to be in your code.Victorvictoria
Sorry, I have edited the error... but the situation still causes an error, and t(X) %*% theta, t(theta) %*% X and X %*% t(theta) all give me an error as well.Illumine
Could you please include the loop in your code to make a complete minimum working example?Gony
It is there, in the functionsIllumine
Ok, I will do that, I just didn't get what you meant by some.toleranceIllumine
I have edited the postIllumine

The algorithmic part of your implementation is correct. The problems lie in:

  • The loop structure in GD is not right; the for loop is redundant and variables lack proper initialization.
  • A naive implementation of gradient descent with a fixed alpha is risky. The usual advice is to choose alpha small enough that every step moves downhill on the objective function, but this is hard to do in practice: how small is small enough? If alpha is too small, convergence is slow; if it is too large, the search can fall into a 'zig-zag' path or even diverge (see the short fixed-step sketch right after this list).
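
To make the risk concrete, here is a minimal fixed-step sketch (my own illustration, using the example X, y and the COST function from the question; fixed_step_GD is not part of the solution below): a large alpha makes the half MSE blow up, while a tiny alpha has barely moved after the same number of iterations.

# plain gradient descent with a fixed step size alpha (illustration only)
fixed_step_GD <- function(X, y, theta, alpha, num.iters) {
  cost <- numeric(num.iters)
  for (i in seq_len(num.iters)) {
    error <- X %*% theta - y
    delta <- crossprod(X, error) / length(y)  # gradient of the half MSE
    theta <- theta - alpha * c(delta)
    cost[i] <- COST(theta, X, y)
  }
  cost
}

# a large fixed step diverges: the cost blows up
tail(fixed_step_GD(X, y, c(0, 0), alpha = 5, num.iters = 100), 1)
# a tiny fixed step is stable but has barely moved after 100 iterations
tail(fixed_step_GD(X, y, c(0, 0), alpha = 0.001, num.iters = 100), 1)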

Here is a robust version of gradient descent for estimating a linear regression. The improvement comes from a step-halving strategy that avoids 'zig-zag' searching and divergence; see the comments along the code. Under this strategy it is safe to use a large alpha, and convergence is guaranteed.

# theta: initial guess on regression coef
# alpha: initial step scaling factor
GD <- function(X, y, theta, alpha) {
  cost_histo <- numeric(0)
  theta_histo <- numeric(0)
  # an arbitrary initial gradient, to pass the initial while() check
  delta <- rep(1, ncol(X))
  # cost (half MSE) at the initial theta
  old.cost <- COST(theta, X, y)
  # main iteration loop
  while (max(abs(delta)) > 1e-7) {
    # gradient of the half MSE
    error <- X %*% theta - y
    delta <- crossprod(X, error) / length(y)
    # proposal step
    trial.theta <- theta - alpha * c(delta)
    trial.cost <- COST(trial.theta, X, y)
    # step halving to avoid divergence
    while (trial.cost >= old.cost) {
      trial.theta <- (theta + trial.theta) / 2
      trial.cost <- COST(trial.theta, X, y)
    }
    # accept proposal
    cost_histo <- c(cost_histo, trial.cost)
    theta_histo <- c(theta_histo, trial.theta)
    # update old.cost and theta
    old.cost <- trial.cost
    theta <- trial.theta
  }
  list(cost_histo, theta_histo = matrix(theta_histo, nrow = ncol(X)))
}

On return,

  • the length of cost_histo tells you how many iterations have been taken (excluding step halving);
  • each column of theta_histo gives theta per iteration.

Step halving in fact speeds up convergence greatly. You can get more efficiency if you use a faster computation method for COST. (Most useful for large datasets. See https://mcmap.net/q/298305/-fastest-way-to-compute-row-wise-dot-products-between-two-skinny-tall-matrices-in-r)

COST <- function(theta, X, y) {
  c(crossprod(X %*% theta - y)) / (2 * length(y))
}
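
A quick sanity check (my own; theta_test is just an arbitrary point introduced for the comparison) that the crossprod form returns the same value as the original sum() form:

theta_test <- c(0.5, 0.5)
all.equal(sum((X %*% theta_test - y)^2) / (2 * length(y)),        # original form
          c(crossprod(X %*% theta_test - y)) / (2 * length(y)))   # crossprod form
# should be TRUE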

Now, let's run it on your example X and y.

oo <- GD(X, y, c(0,0), 5)

After 107 iterations it converges. We can view the trace of MSE:

plot(oo[[1]])

Note that in the first few steps the MSE decreases very fast, but then the curve is almost flat. This reveals the fundamental drawback of the gradient descent algorithm: convergence gets slower and slower as we approach the minimum.
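
You can also see this numerically by looking at how much the cost drops per iteration (a small check on the oo object above; the exact numbers depend on your runif draw):

decrease <- -diff(oo[[1]])
head(decrease)  # large drops during the first iterations
tail(decrease)  # tiny drops near the end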

Now, we extract the final estimated coefficient:

oo[[2]][, 107]

We can also compare this with direct estimation by QR factorization:

.lm.fit(X, y)$coef

They are pretty close.
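
To quantify "pretty close", you can compare the two fits directly (a one-line check; the exact figure depends on the simulated data):

max(abs(oo[[2]][, ncol(oo[[2]])] - .lm.fit(X, y)$coef))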

Mccollum answered 5/4, 2017 at 15:55 Comment(13)
I have changed the title!Illumine
I have also tried normalizing the data, (value - mean)/range, and it still doesn't work.Illumine
Oh, I got it, you changed it to max(abs(delta)) > 1e-7 instead of < in your previous post.Illumine
When I normalize the data it works, but when I don't it gives me the error Error in while (max(abs(delta)) > 1e-07) { : missing value where TRUE/FALSE needed. Any idea why?Illumine
Why would it work with this alpha and the same function and data in Octave but not in R?Illumine
It works with alpha = 0.001, but I wrote the same function in Octave (MATLAB) and I don't get the same result, which is quite curious; pretty close though (0.1 difference).Illumine
This is veryyyy gooodd !!! Thank you, I am now writing this in Python ! Thank you !!!!Illumine
I want to understand why you take the average of the thetas? Is this conventional?Illumine
So you reduce the step over the iterations? AAAAHH S.M.A.R.T.!!! Thank you.Illumine
I really like it, but the while loop makes it very demanding for my computer! I will work on that!Illumine
I would ask you the same, and go check out my previous post! :)Illumine
Just like it if you like it ! Or improve it :) ! Keep in touch !Illumine
It seems to give my CPU trouble. I have replicated the function in Python and will compare convergence times.Illumine

The crossprod version is surprisingly slower than the previous method:

  • Previous method: 14 secs mean over 50 iterations
  • Crossprod method: 16 secs mean over 50 iterations
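
For anyone who wants to reproduce this comparison, here is a minimal timing sketch (my own addition; it assumes the GD function from the accepted answer and swaps in each COST definition before timing, with 50 runs to mirror the benchmark above; absolute numbers will of course vary by machine):

# time 50 runs of GD with the original sum()-based COST
COST <- function(theta, X, y) sum((X %*% theta - y)^2) / (2 * length(y))
t_sum <- system.time(for (k in 1:50) GD(X, y, c(0, 0), 5))[["elapsed"]]

# time 50 runs of GD with the crossprod-based COST
COST <- function(theta, X, y) c(crossprod(X %*% theta - y)) / (2 * length(y))
t_crossprod <- system.time(for (k in 1:50) GD(X, y, c(0, 0), 5))[["elapsed"]]

c(sum_based = t_sum, crossprod_based = t_crossprod)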

Illumine answered 5/4, 2017 at 19:17 Comment(0)
