Trying to find object coordinates (x,y) in image, my neural network seems to optimize error without learning [closed]

I generate images of a single coin pasted over a white background of size 200x200. The coin is randomly chosen among 8 euro coin images (one for each coin) and has:

random rotation ;
random size (between fixed bounds) ;
random position (so that the coin is not cropped).

Here are two examples (center markers added): [image: two dataset examples]

I am using Python + Lasagne. I feed the color image into a neural network whose output layer is 2 fully connected linear neurons, one for x and one for y. The targets associated with the generated coin images are the coordinates (x,y) of the coin center.

I have tried (following the "Using convolutional neural nets to detect facial keypoints" tutorial):

Dense layer architecture with various numbers of layers and units (500 max) ;
Convolution architecture (with 2 dense layers before output) ;
Sum or mean of squared difference (MSE) as loss function ;
Target coordinates in the original range [0,199] or normalized [0,1] ;
Dropout layers between layers, with dropout probability of 0.2.

I always used simple SGD, tuning the learning rate to get a smoothly decreasing error curve.

I found that as I train the network, the error decreases until a point where the output is always the center of the image. The output looks independent of the input: the network seems to output the average of the targets I give. Since the positions of the coins are uniformly distributed over the image, this is simply the constant output that minimizes the error. This is not the desired behavior.
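A quick numerical check of this interpretation, as a minimal sketch: it assumes coin centers drawn uniformly over a 200x200 image, and the sample size and all names are illustrative rather than taken from my actual code. For an output that ignores the input, the constant that minimizes the MSE is the mean of the targets, which lands near the image center.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets: coin centers drawn uniformly over a 200x200 image.
targets = rng.uniform(0, 199, size=(10_000, 2))

def mse(constant_prediction, targets):
    """Mean squared error when the same (x, y) is predicted for every image."""
    return np.mean(np.sum((targets - constant_prediction) ** 2, axis=1))

mean_target = targets.mean(axis=0)   # close to the image center
corner = np.array([0.0, 0.0])        # any other constant does worse

print("mean of targets:", mean_target)
print("MSE at mean    :", mse(mean_target, targets))
print("MSE at corner  :", mse(corner, targets))
```

So a network that collapses to the image center has reached the best error available without using the input at all.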

I have the feeling that the network is not learning but is just trying to optimize the output coordinates to minimize the mean error against the targets. Am I right? How can I prevent this? I tried removing the bias of the output neurons, because I thought maybe I was just modifying the bias while all other parameters were being set to zero, but this didn't work.

Is it possible for a neural network alone to perform well at this task? I have read that one can also train a net for present/not present binary classification and then scan the image to find possible locations of objects. But I just wondered if it was possible just using the forward computation of a neural net.

Question : How can I prevent this [overfitting without improvement to test scores]?

What needs to be done is to re-architect your neural net. A neural net just isn't going to do a good job at predicting an X and Y coordinate. It can, though, create a heat map of where it detects a coin; said another way, you could have it turn your color picture into a "coin-here" probability map.
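To train a net toward such a map, each (x, y) target has to become an image-shaped target. Here is a minimal sketch of one way to do that, assuming a Gaussian blob centered on the coin; the blob shape, `sigma`, and the function name `make_heatmap` are illustrative assumptions, not something your setup prescribes.

```python
import numpy as np

def make_heatmap(cx, cy, size=200, sigma=10.0):
    """Return a (size, size) map that peaks at 1.0 at the coin center (cx, cy).

    The Gaussian shape and sigma are assumptions made for illustration.
    """
    ys, xs = np.mgrid[0:size, 0:size]            # row (y) and column (x) indices
    squared_dist = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-squared_dist / (2.0 * sigma ** 2))

# Example: a coin centered at x=120, y=45.
heatmap = make_heatmap(cx=120, cy=45)
print(heatmap.shape, heatmap.max())
```

The network would then be trained pixel-wise against `heatmap` instead of against the two coordinates.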

Why? Neurons are well suited to estimating probabilities, not coordinates. Neural nets are not the magic machines they are sold to be; they really do follow the program laid out by their architecture. You'd have to lay out a pretty fancy architecture to have the neural net first create an internal spatial representation of where the coins are, then another internal representation of their center of mass, then another to use the center of mass and the original image size to somehow learn to scale the X coordinate, then repeat the whole thing for Y.

Easier, much easier, is to create a convolutional coin detector that converts your color image into a black-and-white probability-a-coin-is-here map. Then feed that output to your own hand-written code that turns the probability map into an X/Y coordinate.
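The hand-written step can be very small. The sketch below takes the probability-weighted centroid of the map; the threshold value, the function name `heatmap_to_xy`, and the synthetic disc standing in for a real network output are all assumptions made for illustration.

```python
import numpy as np

def heatmap_to_xy(prob_map, threshold=0.5):
    """Estimate the coin center as the probability-weighted centroid.

    Pixels below `threshold` are ignored; returns None if nothing is detected.
    """
    weights = np.where(prob_map >= threshold, prob_map, 0.0)
    total = weights.sum()
    if total == 0:
        return None
    ys, xs = np.indices(prob_map.shape)          # row (y) and column (x) indices
    return (xs * weights).sum() / total, (ys * weights).sum() / total

# Synthetic stand-in for a network output: a disc of radius 15 at x=120, y=45.
ys, xs = np.indices((200, 200))
prob_map = ((xs - 120) ** 2 + (ys - 45) ** 2 <= 15 ** 2).astype(float)

x, y = heatmap_to_xy(prob_map)
print(x, y)
```

Taking the argmax of the map would also work; the centroid is just less sensitive to a single noisy pixel.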


