Why does a single-layer perceptron converge so slowly without normalization, even when the margin is large?

This question was completely rewritten after I confirmed my results (the Python notebook can be found here) with a piece of code written by someone else (it can be found here). Here is that code, instrumented by me to work with my data and to count the epochs until convergence:

import numpy as np
from matplotlib import pyplot as plt

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        self.W = np.zeros(input_size+1)
        #self.W = np.random.randn(input_size+1)
        # add one for bias
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        # step activation on the weighted sum (the bias is W[0], paired with the constant input 1)
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0: print('Epoch',epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
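                # prepend a constant 1 so that W[0] acts as the bias weight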
                x = np.insert(X[i], 0, 1)
                y = self.predict(x)
                e = d[i] - y
                total_error += np.abs(e)
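                # perceptron update rule: W changes only when the sample is misclassified (e is +1 or -1)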
                self.W = self.W + self.lr * e * x
                #print('W: ', self.W)
            errors += [total_error]
            if (total_error == 0):
                print('Done after', epoch, 'epochs')
                nPlot = 100
                plt.plot(list(range(len(errors)-nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break

if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                            [306.23240054, 128.3794866 ,   1.        ],
                            [216.67811217, 148.58167262,   1.        ],
                            [223.64431813, 197.75745016,   1.        ],
                            [486.68209275,  96.09115377,   1.        ],
                            [400.71323154, 125.18183395,   1.        ],
                            [288.87299305, 204.52217766,   1.        ],
                            [245.1492875 ,  55.75847006,  -1.        ],
                            [ 14.95991122, 185.92681911,   1.        ],
                            [393.92908798, 193.40527965,   1.        ],
                            [494.15988362, 179.23456285,   1.        ],
                            [235.59039363, 175.50868526,   1.        ],
                            [423.72071607,   9.50166894,  -1.        ],
                            [ 76.52735621, 208.33663341,   1.        ],
                            [495.1492875 ,  -7.73818431,  -1.        ]])
    X = trainingSet[:, :2]
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)
    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)

The training set consists of 15 points, with a large separation margin. The Perceptron algorithm finds a separator as shown below, but after as many as 122,346 epochs:

[Plot: the 15 training points and the separating line found by the perceptron]

As the Wikipedia article explains, the number of updates (mistakes) the Perceptron needs before it converges is proportional to the square of the size of the input vectors and inversely proportional to the square of the margin. In my data, the vectors are large, but the margin is large as well.
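
For concreteness, here is a minimal sketch (added just for illustration, not part of the original script) of that bound: with R the largest norm of a bias-augmented input vector and gamma the geometric margin of a separator, the Perceptron makes at most (R / gamma)^2 mistakes before converging. R is computed from the X defined in the script above; gamma is only a hypothetical placeholder, not the true margin of this data.

import numpy as np

# Mistake bound from the Perceptron convergence theorem: at most (R / gamma)**2 updates,
# where R is the largest norm of a bias-augmented training vector and gamma is the
# geometric margin of a separating line. gamma here is a placeholder value.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # same bias trick as in fit()
R = np.linalg.norm(X_aug, axis=1).max()           # a bit over 500 for this data
gamma = 10.0                                      # hypothetical margin estimate
print('Mistake bound (R / gamma)**2:', (R / gamma) ** 2)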

I seek to understand why so many epochs are required.

Update: As per the request in the comments, I updated the code to plot the total errors of the last 100 epochs. Here is the plot:

[Plot: total error (number of misclassified samples) per epoch over the last 100 epochs before convergence]

P.S.: After scaling the features to be distributed as N(0,1), the algorithm converges after two epochs. However, I still do not understand why the algorithm does not converge in a reasonable amount of time without such scaling.
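
A minimal sketch of that scaling step (here with scikit-learn's StandardScaler; any equivalent z-scoring of the two features should behave the same):

from sklearn.preprocessing import StandardScaler

# z-score each feature: subtract its mean and divide by its standard deviation
X = StandardScaler().fit_transform(X)
perceptron = Perceptron(input_size=2)
perceptron.fit(X, d)  # converges after two epochs on this data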

Yvetteyvon answered 13/12, 2019 at 9:31 Comment(22)
Would you mind trying to scale the data beforehand and get back to us? Using a standard scaler or minmax scalerSenskell
@AlwaysLearning, it's explained here, but scaling helps the gradient converge and hopefully avoids saturation, meaning that it keeps the weights from getting close to 1 and simply repeating the previous inputs without learning anything. It's more of a numerical problem in the computation.Senskell
I don't know, I didn't read the code because I didn't want to download stuff from Dropbox. But if your network is stuck in a loop of just repeating the same inputs all the time, then it's possible you end up with a huge number of epochs. Anyway, almost all networks require scaling of the data beforehand.Senskell
@Yvetteyvon If you're willing, you can take a look at scikit learn implementation of the MLP and check how they do it, there's also a plethora of tutorials out there, be it in plain Python or numpySenskell
@Yvetteyvon Done! That’s a bunch of code. I’ll take a look at it tomorrow :)Edaedacious
@AlexanderCécile If you could also delete the very first and the very last comments of yours, that would help. The actual code of Perceptron is quite short -- four short functions beginning with perceptronUpdate. I will very much appreciate your pass.Yvetteyvon
@NaturalFrequency I have done what you suggested and confirmed my results. The question is now re-written.Yvetteyvon
@Yvetteyvon Your activation function seems weird too, are you going for ReLU? If that's the case, then it should return x when x >= 0 and 0 otherwise, not 1. Another thing: your learning rate is bonkers, 1 should totally not be the learning rate, more like 0.1.Senskell
@NaturalFrequency Note that this code (not mine!) uses 0 and 1 (not -1 and 1) as labels. I got rid of the activation function, so it should not confuse you and others. Changing the learning rate makes no difference whatsoever.Yvetteyvon
Weird, try to initialize the first weights W as a vector of random weights instead of zeros. self.W = np.random.randn(input_size+1)Senskell
@NaturalFrequency Makes things even a little worse: 123,677 epochs.Yvetteyvon
Then I don't know but you should normalize, have random weights and mess around with new data to see if the problem is reproducibleSenskell
@NaturalFrequency As I wrote, normalizing solves the problem. Other data does not exhibit the problem either. I do not want to simply forget about it, because there is potential for increased understanding here. Thank you anyways for trying. (and, an upvote would be nice; it's not a bad question after all, ah?)Yvetteyvon
You might want to plot the actual error convergence wrt epoch. I'm curious about the assessment of convergence if (total_error == 0); regardless of language, floating-point equality is fraught with issues, and I highly suspect that your error is somewhere very close to zero but not exactly for a very long time.Taverner
@Taverner In the shown code, total_error is an integer (the number of misclassified samples), isn't it?Yvetteyvon
@AlwaysLearning, after walking through the code more carefully, yes, total_error is an integer. Sorry for wasting your time. Still a plot of error versus epoch will likely be helpful in determining if convergence is just hovering for a while.Taverner
@Taverner I added the plot of the total errors for the last 100 epochs. If this is what you suspected, how can this be intuitively understood and how does it agree with the theory?Yvetteyvon
Have you tried running without scaling but with very small learning rate, e.g. 0.01 or 0.001? How does convergence change as a result?Physicalism
@TomaszBartkowiak Interestingly enough, the learning rate does not have any influence at all. The number of epochs stays exactly the same. This is actually one more thing I would like to understand...Yvetteyvon
Just a note, although it does not solve your question: do some resampling of your negative minority examples and you get a less unreasonable number of epochs (12K). I'd say anyway it is a numerical problem and not much to scratch from it. Good luck.Soper
@Soper Can you detail what you mean by a "numerical problem" and point me to further reading about the phenomenon? This could be the reply.Yvetteyvon
Ok, I'll try to post something close to an answer, hope I'll find some time in a day or twoSoper

The problem you're facing can be summarized in a simple statement: the numbers in your example do not favor convergence of your perceptron.

Honestly, I'm not sure what exactly can be learned from your synthetic example; anyway, please don't get me wrong, it is always good to play around in the laboratory and learn from it. There are a number of recommendations that are generic when fitting neural nets, and some of them are reflected in the comments to your question. This paper is old but good and you'll see it referenced around.

About your problem in particular: it is not really a matter of standardizing but of centering. The problem is that when you update your weights

self.W = self.W + self.lr * e * x

your error term e will be either +1 or -1 depending on the example that you misclassify (e.g. +1 if the example target is 1 and it is classified as 0), but mostly +1 since there are more positive classes, and your coordinates in x are mostly positive values. So, most of the time, you will be adding to your weights, not subtracting, and this way it is obviously quite slow for the perceptron to find a solution.

If you just scale your X

X = scale(X, with_mean=True, with_std=False)

convergence takes only 1,461 epochs.
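
For reference, scale here is scikit-learn's sklearn.preprocessing.scale (it is imported in the full script further down); with with_std=False it is plain mean-centering:

from sklearn.preprocessing import scale
import numpy as np

X_centered = scale(X, with_mean=True, with_std=False)
# equivalent to subtracting the per-feature mean
assert np.allclose(X_centered, X - X.mean(axis=0))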

The classifier looks like this

[Plot: decision boundary found with the centered data]

and it makes sense that the boundary is very close to the positive classes, since there are many of them; as soon as the perceptron gets all the positive classes right, the job is nearly done.

Additionally, if you rebalance your data (I've done it in this lazy way, as a test, by duplicating the negative examples)

trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                        [306.23240054, 128.3794866 ,   1.        ],
                        [216.67811217, 148.58167262,   1.        ],
                        [223.64431813, 197.75745016,   1.        ],
                        [486.68209275,  96.09115377,   1.        ],
                        [400.71323154, 125.18183395,   1.        ],
                        [288.87299305, 204.52217766,   1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [ 14.95991122, 185.92681911,   1.        ],
                        [393.92908798, 193.40527965,   1.        ],
                        [494.15988362, 179.23456285,   1.        ],
                        [235.59039363, 175.50868526,   1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [ 76.52735621, 208.33663341,   1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ]])

it takes 2 epochs (surprisingly) to get this classifier

[Plot: decision boundary found with the rebalanced data]

Hope it helps.


EDIT after comments

(1) About the errors only adding or only subtracting

Let's take an example of the positive class

[279.25746446, 162.44072328,   1.        ]

For these, since d is equal to 0 (the labels are remapped so that the original +1 class becomes 0), e can only be 0 if the classifier gets it right and -1 if it gets it wrong.

e = d[i] - self.predict(x)

(predict returns either 0 or 1)

When updating the weights, the update adds nothing if the classifier gets it right, and -1 * lr * x if it gets it wrong. For this example, assuming lr == 1, it will subtract exactly (1, 279.25746446, 162.44072328) if there is an error on this positive example.

Now, take a look at all the positive examples. If you don't transform X, all coordinates have positive values, thus all the classification errors will subtract from the weights.

Now let's take a negative example:

[245.1492875 ,  55.75847006,  -1.        ]

For these, since d is equal to 1, e can only be 0 if the classifier gets it right and +1 if it gets it wrong. Again, all coordinates are positive, except for one coordinate in the 3rd negative example. Thus nearly every mistake on the negative class will add to the weights.

But there are only 3 examples of the negative class and 12 of the positive class. Thus the errors will mostly be subtracting from, not adding to, the weights. (Sorry, I put it the other way around in my text before the edit.) It is reasonable then to think that convergence will be slow if you do nothing, and faster if you center the data. (One could even wonder how it converges at all.)
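
If you want to see this pull directly, here is a minimal sketch (a helper I'm adding just for illustration, not part of the code above) that runs one pass over the data with the same update rule and counts how many updates subtract from the weights versus add to them:

import numpy as np

def count_update_directions(W, X, d, lr=0.1):
    # one pass with the perceptron update rule, tallying the direction of each update
    subtracting, adding = 0, 0
    for i in range(d.shape[0]):
        x = np.insert(X[i], 0, 1)      # same bias trick as in fit()
        y = 1 if W.dot(x) >= 0 else 0
        e = d[i] - y
        if e == -1:                    # error on a positive example (d == 0)
            subtracting += 1           # the update subtracts lr * x from W
        elif e == 1:                   # error on a negative example (d == 1)
            adding += 1                # the update adds lr * x to W
        W = W + lr * e * x
    return subtracting, adding, W

Calling this epoch after epoch (feeding the returned W back in) makes the imbalance described above visible.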

(2) About resampling

I meant to say that convergence with resampling (and centering) is surprisingly fast, 2 epochs. However, it is reasonable that resampling makes convergence faster, since there is more balance between the errors pulling the output in one direction or the other.
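
For a less manual version of the same rebalancing, one could upsample the negative class, e.g. with sklearn.utils.resample (a sketch assuming the trainingSet array from the question; the random_state is arbitrary):

import numpy as np
from sklearn.utils import resample

pos = trainingSet[trainingSet[:, -1] == 1]
neg = trainingSet[trainingSet[:, -1] == -1]

# upsample the 3 negative examples with replacement until both classes have the same size
neg_upsampled = resample(neg, replace=True, n_samples=len(pos), random_state=0)
balancedSet = np.vstack([pos, neg_upsampled])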

Hope it is more clear now.


EDIT after more comments

I understand that maybe the importance of the balance between samples, and of how they pull the solution, is not really intuitive. Actually, the way I approached your question was probably the opposite: by looking at your loss function, thinking about what the problem could be, and recalling similar problems I faced in the past and intuitions I had, I thought about rebalancing; then I tried to rebalance and afterwards to center the data, and confirmed my intuitions about your loss function. Only afterwards did I try to build an explanation for you.

Of course, it is not that I process the loss function in my mind and know what it is doing. Anyway, I would suggest that you build your own intuitions, since your target is learning, and you could do it this way: plot how the separation line moves epoch after epoch.

From your code:

import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import scale

labels = [1, 0]
labelColors = ['blue', 'green']

def showData(X, y, plt=plt):
    colors = [(labelColors[0] if el == labels[0] else labelColors[1]) for el in y]
    plt.scatter(X[:,0], X[:,1], c=colors)

def plotW(xs, w):
    # decision boundary w[0] + w[1]*x + w[2]*y = 0, i.e. y = (w[0] + w[1]*x) / -w[2]
    plt.plot(xs, (w[0] + w[1] * xs)/-w[2], color='red', linewidth=4)

class Perceptron(object):
    """Implements a perceptron network"""
    def __init__(self, input_size, lr=0.1, epochs=1000000):
        self.W = np.zeros(input_size+1)
        #self.W = np.random.randn(input_size+1)
        # add one for bias
        self.epochs = epochs
        self.lr = lr

    def predict(self, x):
        # step activation on the weighted sum (the bias is W[0], paired with the constant input 1)
        z = self.W.T.dot(x)
        return 1 if z >= 0 else 0

    def fit(self, X, d):
        errors = []
        for epoch in range(self.epochs):
            if (epoch + 1) % 10000 == 0: print('Epoch',epoch + 1)
            total_error = 0
            for i in range(d.shape[0]):
                x = np.insert(X[i], 0, 1)
                y = self.predict(x)
                e = d[i] - y
                total_error += np.abs(e)
                self.W = self.W + self.lr * e * x
                #print('W: ', self.W)
            errors += [total_error]
            showData(X, d)
            plotW(X[:,0], self.W)
            plt.show()
            if epoch == 100:
                break
            if (total_error == 0):
                print('Done after', epoch, 'epochs')
                nPlot = 100
                plt.plot(list(range(len(errors)-nPlot, len(errors))), errors[-nPlot:])
                plt.show()
                break

if __name__ == '__main__':
    trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                            [306.23240054, 128.3794866 ,   1.        ],
                            [216.67811217, 148.58167262,   1.        ],
                            [223.64431813, 197.75745016,   1.        ],
                            [486.68209275,  96.09115377,   1.        ],
                            [400.71323154, 125.18183395,   1.        ],
                            [288.87299305, 204.52217766,   1.        ],
                            [245.1492875 ,  55.75847006,  -1.        ],
                            [ 14.95991122, 185.92681911,   1.        ],
                            [393.92908798, 193.40527965,   1.        ],
                            [494.15988362, 179.23456285,   1.        ],
                            [235.59039363, 175.50868526,   1.        ],
                            [423.72071607,   9.50166894,  -1.        ],
                            [ 76.52735621, 208.33663341,   1.        ],
                            [495.1492875 ,  -7.73818431,  -1.        ]])
    X = trainingSet[:, :2]
    X = scale(X, with_mean=True, with_std=False)
    d = trainingSet[:, -1]
    d = np.where(d == -1, 1, 0)
    perceptron = Perceptron(input_size=2)
    perceptron.fit(X, d)
    print(perceptron.W)

And compare the evolution of the line in the different setups. If you compare the first 100 epochs when centering versus not centering, you will see that when you do not center the data, the line tends to bounce around in a sort of loop, while when centering it moves more smoothly. (That's actually the same kind of effect you usually get when slowing down the learning rate, as some people suggested in the comments.)

I don't mean to say that looking at those plots is analytical evidence for the behavior of your loss function. I don't even pretend that this is a real answer to your question. But anyway, if it helps you build an intuition, then it will be worth it.

There's loads of work about convergence, which has been studied extensively in deep learning since it is a key issue, as you probably know. Surely you've heard about the different optimizers and how they affect the convergence of a loss function that, in deep learning or in complex neural nets in general, is certainly difficult to understand and impossible to tackle analytically.

Soper answered 19/12, 2019 at 9:24 Comment(11)
No, rebalancing is a good idea but 2 epochs is maybe surprisingly lowSoper
i.e. you would usually not get such a dramatic improvement when rebalancingSoper
Why are you saying that we will always be adding? When the line moves too much in the positive direction, the errors will become negative, won't they?Yvetteyvon
Also, what does rebalancing do? (a link to simple explanation would be much appreciated)Yvetteyvon
Ok. Run out of time for now. I'll try to clarify these later today.Soper
Thank you for the edit. Your writing is very clear, but I don't understand how the imbalance provides an explanation. Whenever we subtract from the weights based on a positive example, the separator line defined by the weights moves so as to allow that positive example to be on the positive side of the line. After a number of such "subtracting epochs", the positive examples will be correctly classified and thus contribute 0, while the negative examples will be on the positive side and so will begin playing their role no matter how few negative examples we have compared to the positive ones.Yvetteyvon
The point is that, regardless of where the line is, the errors from positive examples subtract to the weights... I'll try to post some explanation tomorrowSoper
Isn't it correct that only samples that are currently misclassified affect the weights?Yvetteyvon
In other words, you are saying that you are giving up on translating your intuition into words. As for me, I don't have an intuition why Perceptron converges at all in the first place, because an update takes into account only one example and can be bad for other examples. I was not able to find any explanation for that. In any case, I might try just what you said. I thank you for the attempt.Yvetteyvon
Well, I would not say giving up, but ok. I don't think there is so much that you can gain from this example in particular. My intuitions come from quite a few examples that I faced, in which the loss function was quite obscure by the way. Thank you for the effort you've put in your question.Soper
And about why it converges, maybe the famous course by Andrew Ng coursera.org/learn/machine-learning? If you haven't done it yet, it's very interesting, and some parts may help you start building those intuitions.Soper

When I could not properly answer your question one month ago, I kind of regretted it; now I'll give it another try. I leave the older answer for the record.

I think the problem relates to the convexity and local minima of the loss function, which makes it difficult to converge. However, with your problem as you set it up, I'm not really sure about the derivative of your loss function, so I've modified your activation function to a sigmoid, so that I can apply the log loss easily.

This is the new predict,

def predict(self, x):
    z = self.W.T.dot(x)
    return 1/(1+np.exp(-z))

And this is the loop for the training data, also calculating the loss.

 loss = 0
 dw = 0
 for i in range(d.shape[0]):
     x = np.insert(X[i], 0, 1)
     y = self.predict(x)
     e = d[i] - (1 if y > 0.5 else 0)
     total_error += np.abs(e)
     dw += self.lr * e * x
     loss2add = (-1) * (np.log(y) if d[i] else np.log(1-y))
     if np.isinf(loss2add) or np.isnan(loss2add):
         loss += 500
     else:
         loss += loss2add
 self.W = self.W + dw
 errors += [total_error]
 losses += [loss/d.shape[0]]

It converges in 103K epochs, so I hope you believe this behaves similarly to your initial setup.

Then I plot the cost function with respect to W. To keep it simple, I fix two of the weights at values from a known solution and only vary the remaining one. This is the code (it could be cleaner, I know):

import numpy as np
from matplotlib import pyplot as plt

def predict(W, x):
    z = W.dot(x)
    return 1/(1+np.exp(-z))

trainingSet = np.array([[279.25746446, 162.44072328,   1.        ],
                        [306.23240054, 128.3794866 ,   1.        ],
                        [216.67811217, 148.58167262,   1.        ],
                        [223.64431813, 197.75745016,   1.        ],
                        [486.68209275,  96.09115377,   1.        ],
                        [400.71323154, 125.18183395,   1.        ],
                        [288.87299305, 204.52217766,   1.        ],
                        [245.1492875 ,  55.75847006,  -1.        ],
                        [ 14.95991122, 185.92681911,   1.        ],
                        [393.92908798, 193.40527965,   1.        ],
                        [494.15988362, 179.23456285,   1.        ],
                        [235.59039363, 175.50868526,   1.        ],
                        [423.72071607,   9.50166894,  -1.        ],
                        [ 76.52735621, 208.33663341,   1.        ],
                        [495.1492875 ,  -7.73818431,  -1.        ]])
X = trainingSet[:, :2]
d = trainingSet[:, -1]
d = np.where(d == -1, 1, 0)
losses = []
ws = []
n_points = 10001
for w1 in np.linspace(-40, 40, n_points):
    ws += [w1]
    W = np.array([3629., w1, -238.21109877])
    loss = 0
    for i in range(d.shape[0]):
        x = np.insert(X[i], 0, 1)
        y = predict(W, x)
        loss2add = (-1) * (np.log(y) if d[i] else np.log(1-y))
        if np.isinf(loss2add) or np.isnan(loss2add):
            loss += 500
        else:
            loss += loss2add
    losses += [loss]
plt.plot(ws, losses)
plt.show()

The solution for w1 is 39.48202635. Take a look at the loss:

[Plot: log loss as a function of w1 on the raw (uncentered) data]

which has some peaks, and thus some local minima, in which it can easily get stuck.

However if you center the data with

X = scale(X, with_mean=True, with_std=False)

and set the w's to

W = np.array([-550.3, w1, -59.65467824])

you get the following loss function

[Plot: log loss as a function of w1 on the centered data]

which has its minimum in the expected area (the solution for w1 is -11.00208344).

I'd expect a smoother function for the balanced dataset.

Hope it is clearer now!


EDIT after comments

This is the loss function when standardizing; it converges in 26 epochs.

[Plot: log loss as a function of w1 on the standardized data]

(Not centering in this case!)

The solution is about 0.7, and the loss is even smoother. It makes sense that standardizing works so well with logistic regression, since it does not saturate the output of the activation function.
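
For reference, the scaling assumed here (dividing each feature by its standard deviation without centering, to match the note above) would be something like:

from sklearn.preprocessing import scale

# divide each feature by its standard deviation, without subtracting the mean
X = scale(X, with_mean=False, with_std=True)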

For the rest, I don't have anything to add on how to reconcile this with the theory you mention. I guess the theorem fixes an upper bound, but beyond that, no idea. Cheers.

Soper answered 15/1, 2020 at 20:39 Comment(3)
Thank you for this reply. Indeed this looks like a fruitful direction. Two things remain to be understood: (1) why does normalization remove the peaks? and (2) How does all this fit the theory which predicts the number of iterations for convergence and does not condition it on normalization of features?Yvetteyvon
Longer answer tonight. Short answer (1): in this case, centering the data is better than normalizing. But anyway, normalization would help, since it keeps the sigmoid in its central area, rather than in the extremes, and the loss will be smoother. (2) I'm not familiar with such a theory, I'm sorry; clearly in practice the values of the features affect the loss function, which is why normalizing, centering and balancing the data usually help, and why people recommend it. But I cannot provide a theory for that. I can only try to prove it with an example.Soper
The theory is cited in the question. I will appreciate the longer answer. Thank you!Yvetteyvon
