Deep Belief Networks vs Convolutional Neural Networks

Asked 3/7, 2014 at 5:38 Answered 8/1, 2015 at 0:44

Solved machine-learning computer-vision neural-network dbn autoencoder

I am new to the field of neural networks and I would like to know the difference between Deep Belief Networks and Convolutional Networks. Also, is there a Deep Convolutional Network which is the combination of Deep Belief and Convolutional Neural Nets?

This is what I have gathered till now. Please correct me if I am wrong.

For an image classification problem, Deep Belief networks have many layers, each of which is trained using a greedy layer-wise strategy. For example, if my image size is 50 x 50, and I want a Deep Network with 4 layers namely

Input Layer
Hidden Layer 1 (HL1)
Hidden Layer 2 (HL2)
Output Layer

My input layer will have 50 x 50 = 2500 neurons, HL1 = 1000 neurons (say), HL2 = 100 neurons (say) and the output layer = 10 neurons. To train the weights (W1) between the Input Layer and HL1, I use an autoencoder (2500 - 1000 - 2500) and learn W1 of size 2500 x 1000 (this is unsupervised learning). Then I feed all images forward through the first hidden layer to obtain a set of features, use another autoencoder (1000 - 100 - 1000) to get the next set of features, and finally use a softmax layer (100 - 10) for classification. Only learning the weights of the last layer (HL2 - Output, which is the softmax layer) is supervised learning.

(I could use RBM instead of autoencoder).
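To make the layer-wise schedule above concrete, here is a minimal sketch that only does the bookkeeping: which autoencoder is trained at each stage and how many weights result. The helper names (`pretraining_stages`, `weight_shapes`) are made up for illustration, biases are ignored, and no actual training is performed.

```python
# Layer widths from the example: 50 x 50 input, two hidden layers, 10 classes.
layer_sizes = [50 * 50, 1000, 100, 10]


def pretraining_stages(sizes):
    """One (input, hidden, reconstruction) autoencoder per hidden layer."""
    hidden_pairs = zip(sizes[:-2], sizes[1:-1])
    return [(n_in, n_hidden, n_in) for n_in, n_hidden in hidden_pairs]


def weight_shapes(sizes):
    """Shape (fan_in, fan_out) of each weight matrix kept in the final network."""
    return list(zip(sizes[:-1], sizes[1:]))


stages = pretraining_stages(layer_sizes)             # unsupervised stages
softmax_stage = (layer_sizes[-2], layer_sizes[-1])   # supervised stage
shapes = weight_shapes(layer_sizes)
total_weights = sum(n_in * n_out for n_in, n_out in shapes)

print(stages)         # [(2500, 1000, 2500), (1000, 100, 1000)]
print(softmax_stage)  # (100, 10)
print(total_weights)  # 2601000
```

Almost all of the weights sit in W1 (2500 x 1000), which is why the input image size matters so much for a fully connected stack.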

If the same problem was solved using Convolutional Neural Networks, then for 50x50 input images, I would develop a network using only 7 x 7 patches (say). My layers would be

Input Layer (7 x 7 = 49 neurons)
HL1 (25 neurons for 25 different features) - (convolution layer)
Pooling Layer
Output Layer (Softmax)

And for learning the weights, I take 7 x 7 patches from images of size 50 x 50 and feed them forward through the convolutional layer, so I will have 25 different feature maps, each of size (50 - 7 + 1) x (50 - 7 + 1) = 44 x 44.

I then use a window of, say, 11 x 11 for pooling and hence get 25 feature maps of size 4 x 4 as the output of the pooling layer. I use these feature maps for classification.
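The shape arithmetic above can be checked with a short sketch. It assumes a "valid" convolution with stride 1 and non-overlapping pooling (stride equal to the window size); the function names are illustrative and not taken from any library.

```python
def conv_output_size(input_size, filter_size, stride=1):
    """Side length of a feature map after a 'valid' convolution."""
    return (input_size - filter_size) // stride + 1


def pool_output_size(input_size, window):
    """Side length after non-overlapping pooling (stride == window)."""
    return input_size // window


num_filters = 25
conv_side = conv_output_size(50, 7)              # 50 - 7 + 1
pooled_side = pool_output_size(conv_side, 11)    # 44 / 11
feature_vector_length = num_filters * pooled_side * pooled_side

print(conv_side, pooled_side, feature_vector_length)  # 44 4 400
```

So the classifier on top sees a 400-dimensional vector (25 maps of 4 x 4), regardless of where in the image each feature fired inside its pooling window.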

While learning the weights, I don't use the layer wise strategy as in Deep Belief Networks (Unsupervised Learning), but instead use supervised learning and learn the weights of all the layers simultaneously. Is this correct or is there any other way to learn the weights?

Is what I have understood correct?

So if I want to use DBN's for image classification, I should resize all my images to a particular size (say 200x200) and have that many neurons in the input layer, whereas in case of CNN's, I train only on a smaller patch of the input (say 10 x 10 for an image of size 200x200) and convolve the learnt weights over the entire image?

Do DBNs provide better results than CNNs or is it purely dependent on the dataset?

Thank You.

Jinnyjinrikisha answered 3/7, 2014 at 5:38 Comment(1)

you can also ask in dsp.stackexchange. Might get a better answer. – Pokpoke 3/7, 2014 at 7:38

Generally speaking, DBNs are generative neural networks that stack Restricted Boltzmann Machines (RBMs). You can think of RBMs as being generative autoencoders. If you want a deep belief net you should be stacking RBMs and not plain autoencoders, as Hinton and his student Teh proved that stacking RBMs results in sigmoid belief nets.

Convolutional neural networks have performed better than DBNs by themselves in current literature on benchmark computer vision datasets such as MNIST. If the dataset is not a computer vision one, then DBNs can most definitely perform better. In theory, DBNs should be the best models, but it is very hard to estimate joint probabilities accurately at the moment. You may be interested in Lee et al.'s (2009) work on Convolutional Deep Belief Networks, which looks to combine the two.

Leinster answered 5/7, 2014 at 20:37 Comment(5)

I have a catalog of images with shoes, shirts, watches etc., and I want my classification to be accurate enough to say that a given image (taken from a camera) is a watch with a round dial, or sports shoes, or a woman's heels. These images are much larger (400×400) than the 30×30 images on which most neural net algorithms have been tested (MNIST, STL). So I am guessing a deep belief network is not going to scale (too many parameters to compute) and hence I should use a convolutional deep belief network? – Jinnyjinrikisha 6/7, 2014 at 6:51

@Jinnyjinrikisha You can just rescale your 400 x 400 image to a smaller size (e.g. 50 x 50) - that will greatly reduce the number of parameters and shouldn't affect performance. And yeah, you can try out Conv. DBNs; there are a lot of cool new variants of ConvNets (e.g. ConvNets w/ Maxout, see the Goodfellow et al. paper) which you can also try out. Lots of new inventions in deep learning are continuously happening in general, so there are lots of things to try. – Leinster 8/7, 2014 at 23:18

I will try resizing them to different sizes and check the performance and I will also look into convolutional DBNs. I want to compare this method with the traditional CNN approach. Is there any way to decide on the filter sizes, number of filters and number of layers in the CNN? Thanks – Jinnyjinrikisha 9/7, 2014 at 10:43

No, not really. The most automated approach I can think of is Bayesian hyperparameter optimization. See: github.com/JasperSnoek/spearmint – Leinster 10/7, 2014 at 3:35

I would use a CNN. It has worked well for image recognition as others have also repeatedly proved. It is also computationally more efficient atm. – Argentic 2/1, 2017 at 3:3

I will try to explain the situation through learning shoes.

If you use a DBN to learn those images, here is the bad thing that will happen in your learning algorithm:

- There will be shoes in different places.
- All the neurons will try to learn not only the shoes but also the place of the shoes in the images, because the weights have no concept of a 'local image patch'.
- A DBN makes sense if all your images are aligned in terms of size, translation and rotation.

The idea of convolutional networks is that there is a concept called weight sharing. If I try to extend this 'weight sharing' concept:

- First you looked at 7x7 patches. Following your example, you can say that 3 of your neurons in the first layer learned the shoes' 'front', 'back-bottom' and 'back-upper' parts, as these would look alike in a 7x7 patch across all shoes.
- Normally the idea is to have multiple convolution layers one after another to learn
  - lines/edges in the first layer,
  - arcs, corners in the second layer,
  - higher concepts in higher layers, like a shoe's front, an eye in a face, a wheel in a car, or rectangles, cones and triangles, which are primitive yet still combinations of the previous layers' outputs.
- You can think of these 3 different things I told you as 3 different neurons. And such areas/neurons in your images will fire when there are shoes in some part of the image.
- Pooling will protect your higher activations while sub-sampling your images and creating a lower-dimensional space to make things computationally easier and feasible.
- So at the last layer, when you look at your 25 x 4 x 4 (in other words, 400-dimensional) vector, if there is a shoe somewhere in the picture your 'shoe neuron(s)' will be active whereas non-shoe neurons will be close to zero.
- And to understand which neurons are for shoes and which ones are not, you will feed that 400-dimensional vector to another supervised classifier (this can be anything, like a multi-class SVM or, as you said, a softmax layer).
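The points above about weight sharing and position can be illustrated with a toy sketch in plain Python. It assumes a 2 x 2 pattern standing in for a 'shoe part', a 6 x 6 image, and a single shared filter equal to that pattern; all names are hypothetical and nothing is learned here, the weights are set by hand.

```python
def correlate_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' cross-correlation)."""
    k = len(kernel)
    out_size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


def place_pattern(size, pattern, top, left):
    """Blank size x size image with the pattern pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image


def dense_response(image, weights):
    """A fully connected unit: one separate weight per pixel position."""
    return sum(p * w
               for img_row, w_row in zip(image, weights)
               for p, w in zip(img_row, w_row))


pattern = [[1, 0], [1, 1]]                  # toy stand-in for a shoe part
image_a = place_pattern(6, pattern, 0, 0)   # pattern in the top-left corner
image_b = place_pattern(6, pattern, 3, 2)   # same pattern, shifted

# Shared filter followed by global max pooling: position does not matter.
conv_a = max(max(row) for row in correlate_valid(image_a, pattern))
conv_b = max(max(row) for row in correlate_valid(image_b, pattern))

# Dense unit whose weights memorised the pattern at the top-left only.
dense_weights = image_a
dense_a = dense_response(image_a, dense_weights)
dense_b = dense_response(image_b, dense_weights)

print(conv_a, conv_b)    # 3 3
print(dense_a, dense_b)  # 3 0
```

The shared filter responds equally strongly wherever the pattern appears, while the fully connected unit only fires for the position it was tuned to, which is why unaligned images are a problem for a plain fully connected DBN.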

I can advise you to have a glance at Fukushima's 1980 paper to understand what I am trying to say about translation invariance and the line -> arc -> semicircle -> shoe front -> shoe idea (http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf). Even just looking at the images in the paper will give you some idea.

Dmso answered 8/1, 2015 at 0:44 Comment(1)

Well, this is true for the naive RBM, but there have been significant developments which this answer did not mention. Lee et al. (cs.toronto.edu/~rgrosse/icml09-cdbn.pdf) introduced probabilistic max-pooling as well as the convolutional DBN. The strengths of CNNs that you mentioned can easily be adapted to DBNs, and Prof. Lee managed to get state-of-the-art performance at the time. Recently (CVPR15), Prof. Xiao at Princeton applied the convolutional DBN to 3D shape classification and reconstruction :) – Ascogonium 1/7, 2015 at 2:23
