Viola-Jones' face detection claims 180k features

Asked 10/11, 2009 at 12:30 Answered 23/6, 2020 at 8:55

Solved algorithm image-processing computer-vision face-detection viola-jones

I've been implementing an adaptation of Viola-Jones' face detection algorithm. The technique relies upon placing a subframe of 24x24 pixels within an image, and subsequently placing rectangular features inside it in every position with every size possible.

These features can consist of two, three or four rectangles. The following example is presented.

Rectangle features

They claim the exhaustive set is more than 180k (section 2):

Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.

The following statements are not explicitly stated in the paper, so they are assumptions on my part:

1. There are only 2 two-rectangle features, 2 three-rectangle features and 1 four-rectangle feature. The logic behind this is that we are observing the difference between the highlighted rectangles, not explicitly the color or luminance or anything of that sort.
2. We cannot define feature type A as a 1x1 pixel block; it must be at least 1x2 pixels. Also, type D must be at least 2x2 pixels, and the same rule applies to the other features.
3. We cannot define feature type A as a 1x3 pixel block, as the middle pixel cannot be partitioned, and subtracting it from itself is identical to a 1x2 pixel block; this feature type is only defined for even widths. Also, the width of feature type C must be divisible by 3, and the same rule applies to the other features.
4. We cannot define a feature with a width and/or height of 0. Therefore, we iterate x and y to 24 minus the size of the feature.

Based upon these assumptions, I've counted the exhaustive set:

const int frameSize = 24;
const int features = 5;
// All five feature types:
const int feature[features][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};

int count = 0;
// Each feature:
for (int i = 0; i < features; i++) {
    int sizeX = feature[i][0];
    int sizeY = feature[i][1];
    // Each position:
    for (int x = 0; x <= frameSize-sizeX; x++) {
        for (int y = 0; y <= frameSize-sizeY; y++) {
            // Each size fitting within the frameSize:
            for (int width = sizeX; width <= frameSize-x; width+=sizeX) {
                for (int height = sizeY; height <= frameSize-y; height+=sizeY) {
                    count++;
                }
            }
        }
    }
}

The result is 162,336.

The only way I found to approximate the "over 180,000" Viola & Jones speak of is by dropping assumption #4 and introducing bugs in the code. This involves changing the width and height loops, respectively, to:

for (int width = 0; width < frameSize-x; width+=sizeX)
for (int height = 0; height < frameSize-y; height+=sizeY)

The result is then 180,625. (Note that this will effectively prevent the features from ever touching the right and/or bottom of the subframe.)
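For anyone who wants to reproduce both numbers side by side, here is a minimal Python sketch that mirrors the loops above. The function name count_features and the buggy flag are my own illustrative choices, not anything from the paper; the flag only switches between the original size loops and the modified ones.

```python
FRAME_SIZE = 24
# Base (width, height) of the five feature shapes used in the question.
SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]


def count_features(frame_size=FRAME_SIZE, buggy=False):
    """Count feature placements the same way as the loops above.

    buggy=False: sizes start at one base unit and may reach the frame edge.
    buggy=True:  sizes start at 0 and stop short of the frame edge,
                 which mimics the two modified loops.
    """
    total = 0
    for size_x, size_y in SHAPES:
        for x in range(frame_size - size_x + 1):
            for y in range(frame_size - size_y + 1):
                if buggy:
                    widths = range(0, frame_size - x, size_x)
                    heights = range(0, frame_size - y, size_y)
                else:
                    widths = range(size_x, frame_size - x + 1, size_x)
                    heights = range(size_y, frame_size - y + 1, size_y)
                # Every width can be paired with every height.
                total += len(widths) * len(heights)
    return total


print(count_features())            # 162336
print(count_features(buggy=True))  # 180625
```

The two inner size loops are replaced by a product of range lengths, which gives the same count without iterating over every size.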

Now of course the question: have they made a mistake in their implementation? Does it make any sense to consider features with a surface of zero? Or am I seeing it the wrong way?

Sarabia answered 10/11, 2009 at 12:30 Comment(10)

Why do I get count=114829 when I run your code? – Allerie 10/11, 2009 at 13:13

Why do your x/y loops start at 1? I assume x/y is the top left coordinate of the feature rectangle. Shouldn't x/y start at 0/0 then? – Allerie 10/11, 2009 at 13:31

Aside from whether it starts at 0 or 1, ending at x < size has to do with assumption #4: I want the feature to remain within the subframe, but have a dimension of at least 1x1. As to whether the dimension of the feature should not extend outside of the subframe, well, perhaps that is an assumption, too. – Sarabia 10/11, 2009 at 13:37

Similarly, if I started x at 0, it would have to run to x < size - 1, so there is no gain. – Sarabia 10/11, 2009 at 13:39

I've done a zillion for loops. This seems wrong to me. x < size would keep x from ever becoming 24; starting at 0 will give you 0...23. With a dimension of 1 pixel wide, the rectangle will never leave the frame. – Shallow 10/11, 2009 at 13:43

Well, if x/y start at 0, I get 162336. If I drop assumption #4 and let width/height start at 0, I get 212256. I wonder how they got 180k... – Allerie 10/11, 2009 at 13:48

Haha, brilliant, now we're all at the same stage! Good point, Breton, the size can never reach 24 the way it is. I'll take another look. – Sarabia 10/11, 2009 at 14:10

Ayep, I can't get it to go anywhere near 180,000 either. Here's my lousy attempt: jsbin.com/imase/edit (press the output tab for the result, the javascript tab for the source code) – Shallow 10/11, 2009 at 14:49

@Breton: I can code to demonstrate the 180k+ by dropping assumption #4 and by introducing the bugs we discussed in the code: jsbin.com/ibahe/edit – Sarabia 10/11, 2009 at 15:35

thanks for bringing up this paper, I'd never heard of it before. It's really neat. – Chambertin 10/11, 2009 at 15:54

Upon closer look, your code looks correct to me; which makes one wonder whether the original authors had an off-by-one bug. I guess someone ought to look at how OpenCV implements it!

Nonetheless, one suggestion to make it easier to understand is to flip the order of the for loops by going over all sizes first, then looping over the possible locations given the size:

#include <stdio.h>
int main()
{
    int i, x, y, sizeX, sizeY, width, height, count, c;

    /* All five shape types */
    const int features = 5;
    const int feature[][2] = {{2,1}, {1,2}, {3,1}, {1,3}, {2,2}};
    const int frameSize = 24;

    count = 0;
    /* Each shape */
    for (i = 0; i < features; i++) {
        sizeX = feature[i][0];
        sizeY = feature[i][1];
        printf("%dx%d shapes:\n", sizeX, sizeY);

        /* each size (multiples of basic shapes) */
        for (width = sizeX; width <= frameSize; width+=sizeX) {
            for (height = sizeY; height <= frameSize; height+=sizeY) {
                printf("\tsize: %dx%d => ", width, height);
                c=count;

                /* each possible position given size */
                for (x = 0; x <= frameSize-width; x++) {
                    for (y = 0; y <= frameSize-height; y++) {
                        count++;
                    }
                }
                printf("count: %d\n", count-c);
            }
        }
    }
    printf("%d\n", count);

    return 0;
}

This gives the same result as before: 162,336.

To verify it, I tested the case of a 4x4 window and manually checked all cases (easy to count since 1x2/2x1 and 1x3/3x1 shapes are the same only 90 degrees rotated):

2x1 shapes:
        size: 2x1 => count: 12
        size: 2x2 => count: 9
        size: 2x3 => count: 6
        size: 2x4 => count: 3
        size: 4x1 => count: 4
        size: 4x2 => count: 3
        size: 4x3 => count: 2
        size: 4x4 => count: 1
1x2 shapes:
        size: 1x2 => count: 12             +-----------------------+
        size: 1x4 => count: 4              |     |     |     |     |
        size: 2x2 => count: 9              |     |     |     |     |
        size: 2x4 => count: 3              +-----+-----+-----+-----+
        size: 3x2 => count: 6              |     |     |     |     |
        size: 3x4 => count: 2              |     |     |     |     |
        size: 4x2 => count: 3              +-----+-----+-----+-----+
        size: 4x4 => count: 1              |     |     |     |     |
3x1 shapes:                                |     |     |     |     |
        size: 3x1 => count: 8              +-----+-----+-----+-----+
        size: 3x2 => count: 6              |     |     |     |     |
        size: 3x3 => count: 4              |     |     |     |     |
        size: 3x4 => count: 2              +-----------------------+
1x3 shapes:
        size: 1x3 => count: 8                  Total Count = 136
        size: 2x3 => count: 6
        size: 3x3 => count: 4
        size: 4x3 => count: 2
2x2 shapes:
        size: 2x2 => count: 9
        size: 2x4 => count: 3
        size: 4x2 => count: 3
        size: 4x4 => count: 1
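The per-size counts in the table are products of a horizontal and a vertical count, so the total can also be computed one axis at a time. The sketch below is only an illustration of that observation; axis_count and count_features are names I made up for it.

```python
# Base (width, height) of the five shapes, as in the C code above.
SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]


def axis_count(frame_size, step):
    """Number of (offset, length) pairs along one axis.

    The length runs over the multiples of `step` that fit in the frame,
    and a segment of a given length fits at frame_size - length + 1 offsets.
    """
    return sum(frame_size - length + 1
               for length in range(step, frame_size + 1, step))


def count_features(frame_size, shapes=SHAPES):
    """Total placements: for each shape, horizontal count times vertical count."""
    return sum(axis_count(frame_size, base_w) * axis_count(frame_size, base_h)
               for base_w, base_h in shapes)


print([axis_count(24, step) for step in (1, 2, 3)])  # [300, 144, 92]
print(count_features(4))    # 136
print(count_features(24))   # 162336
```

For the 24x24 window this is 2*144*300 + 2*92*300 + 144*144 = 162,336, and for the 4x4 window it is 2*4*10 + 2*2*10 + 4*4 = 136, matching the table.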

Noblewoman answered 10/11, 2009 at 21:2 Comment(5)

Convincing. So convincing that I'm fairly sure that we're right. I've sent an e-mail to the author to see if I've made some fundamental mistake in my reasoning. We'll see if a guy that busy has time to respond. – Sarabia 10/11, 2009 at 22:7

keep in mind this thing has been out for a couple of years now, and many improvements were made since then – Noblewoman 10/11, 2009 at 22:32

The original paper where the 180k was stated comes from the proceedings for the 2001 Conference on Computer Vision and Pattern Recognition. A revised paper, accepted in 2003 and published in the International Journal of Computer Vision in 2004, states on p. 139 (end of section 2): "the exhaustive set of rectangles is quite large, 160,000". Looks like we were right! – Sarabia 17/11, 2009 at 11:16

Great, thanks for the update. For those interested, I found a link to the IJCV'04 paper: lear.inrialpes.fr/people/triggs/student/vj/viola-ijcv04.pdf – Noblewoman 17/11, 2009 at 16:53

Yes, that's it. 160k, not 180k. – Sarabia 20/11, 2009 at 14:56

There is still some confusion in Viola and Jones' papers.

In their CVPR'01 paper it is clearly stated that

"More specifically, we use three kinds of features. The value of a two-rectangle feature is the difference between the sum of the pixels within two rectangular regions. The regions have the same size and shape and are horizontally or vertically adjacent (see Figure 1). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally a four-rectangle feature".

In the IJCV'04 paper, exactly the same thing is said. So altogether, 4 features. But strangely enough, they stated this time that the exhaustive feature set is 45396! That does not seem to be the final version. Here I guess that some additional constraints were introduced, such as min_width, min_height, width/height ratio, and even position.

Note that both papers are downloadable on his webpage.

Scolecite answered 21/7, 2010 at 12:42 Comment(0)

Having not read the whole paper, the wording of your quote sticks out at me:

Given that the base resolution of the detector is 24x24, the exhaustive set of rectangle features is quite large, over 180,000. Note that unlike the Haar basis, the set of rectangle features is overcomplete.

"The set of rectangle features is overcomplete" "Exhaustive set"

It sounds to me like a setup, where I expect the paper's author to follow up with an explanation of how they cull the search space down to a more effective set by, for example, getting rid of trivial cases such as rectangles with zero surface area.

edit: or using some kind of machine learning algorithm, as the abstract hints at. Exhaustive set implies all possibilities, not just "reasonable" ones.

Shallow answered 10/11, 2009 at 12:50 Comment(8)

I should include the footnote after "overcomplete": "A complete basis has no linear dependence between basis elements and has the same number of elements as the image space, in this case 576. The full set of 180,000 thousand features is many times over-complete." They do not explicitly get rid of classifiers with no surface, they use AdaBoost to determine that "a very small number of these features can be combined to form an effective classifier". Ok, so the zero-surface features will be dropped immediately, but why consider them in the first place? – Sarabia 10/11, 2009 at 12:57

Well it sounds like the reasoning of someone really into set theory. – Shallow 10/11, 2009 at 12:59

I agree, the exhaustive set would imply all possibilities. But consider that if you take 1 to 24 for x and width <= x, the feature will extend 1 pixel outside of the subframe! – Sarabia 10/11, 2009 at 13:0

Are you sure your code isn't riddled with "off by one" bugs? I just had a closer look, and you sure do have a funny way of writing a for loop. – Shallow 10/11, 2009 at 13:3

I should qualify that- I just thought it over a bit, and if you have a rectangle that is 1 pixel tall, 2 pixels tall, 3 pixels tall, all the way to 24 pixels tall, you have 24 kinds of rectangle, all of which fit into a 24 pixel high subframe. What overhangs? – Shallow 10/11, 2009 at 13:16

You're right; the for-loops were sloppy. I had confused the dimensions with the location of the feature. I've edited it in the OP. You're also right about the overhang: there is none. The only way I can replicate 180k+ is by setting the for-loops for the width and height to begin at 0. – Sarabia 10/11, 2009 at 13:23

So, in summary, it appears that Viola & Jones considered their overcomplete set of rectangle features to include those with zero surface. Does this sound logical to you? – Sarabia 10/11, 2009 at 13:25

Well mostly, except that, as nikie points out above, you start your x/y coordinates at 1 instead of 0, which may account for your discrepancy. – Shallow 10/11, 2009 at 13:35

There is no guarantee that any author of any paper is correct in all their assumptions and findings. If you think that assumption #4 is valid, then keep that assumption, and try out your theory. You may be more successful than the original authors.

Wastepaper answered 10/11, 2009 at 13:0 Comment(5)

Experimentation shows that it performs seemingly precisely the same. I believe AdaBoost simply drops those additional zero-surface features in the first cycle, but I haven't actually looked into this. – Sarabia 10/11, 2009 at 13:3

Viola and Jones are very big names in computer vision. In fact, this particular paper is considered seminal. Everyone makes mistakes, but this particular algorithm has been proven to work very well. – Estremadura 10/11, 2009 at 15:29

Definitely, and I don't doubt their method at all. It's efficient and works very well! The theory is sound, but I believe they might have mistakenly cropped their detector one pixel short and included needless zero-surface features. If not, I challenge you to demonstrate the 180k features! – Sarabia 10/11, 2009 at 15:48

The fact is that everyone is human. Everyone makes mistakes. When a big name makes mistakes, they often lay hidden for generations because people are afraid to question the received wisdom. But true science, follows the scientific method and does not worship anybody, no matter how big their name is. If it is science, then mere mortals can put in the effort, understand how it works and adapt it to their circumstances. – Wastepaper 10/11, 2009 at 16:18

We'll see; I've sent an e-mail to the author. – Sarabia 10/11, 2009 at 16:43

Quite a good observation, but they might implicitly zero-pad the 24x24 frame, or "overflow" and start using the first pixels when a feature goes out of bounds (as in rotational shifts), or, as Breton said, they might consider some features "trivial" and then discard them with AdaBoost.
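As a rough check of the wrap-around idea only, here is a sketch under my own assumption that "overflow" means every top-left corner in the frame is allowed for every feature size, with out-of-bounds pixels wrapping to the other side. Nothing in the paper says this is what was done, and count_with_wraparound is a name invented for the illustration.

```python
FRAME_SIZE = 24
# Base (width, height) of the five feature shapes from the question.
SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]


def count_with_wraparound(frame_size=FRAME_SIZE):
    """Count placements if features may wrap around the frame edges.

    Assumption (not from the paper): sizes are still multiples of the base
    shape and at most frame_size, but every (x, y) corner is allowed.
    """
    total = 0
    for base_w, base_h in SHAPES:
        n_widths = len(range(base_w, frame_size + 1, base_w))
        n_heights = len(range(base_h, frame_size + 1, base_h))
        # With wrap-around, each size fits at all frame_size * frame_size corners.
        total += n_widths * n_heights * frame_size * frame_size
    return total


print(count_with_wraparound())  # 635904
```

Under this particular reading the count is far above 180,000, so a full wrap-around does not seem to explain the figure by itself. The zero-padding variant depends on how many pixels of padding are assumed, so it is not sketched here.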

In addition, I wrote Python and Matlab versions of your code so that I could test it myself (easier for me to debug and follow). I am posting them here in case anyone finds them useful.

Python:

frameSize = 24
features = 5
# All five feature types:
feature = [[2,1], [1,2], [3,1], [1,3], [2,2]]

count = 0
# Each feature:
for i in range(features):
    sizeX = feature[i][0]
    sizeY = feature[i][1]
    # Each position:
    for x in range(frameSize-sizeX+1):
        for y in range(frameSize-sizeY+1):
            # Each size fitting within the frameSize:
            for width in range(sizeX,frameSize-x+1,sizeX):
                for height in range(sizeY,frameSize-y+1,sizeY):
                    count += 1
print(count)

Matlab:

frameSize = 24;
features = 5;
% All five feature types:
feature = [[2,1]; [1,2]; [3,1]; [1,3]; [2,2]];

count = 0;
% Each feature:
for ii = 1:features
    sizeX = feature(ii,1);
    sizeY = feature(ii,2);
    % Each position:
    for x = 0:frameSize-sizeX
        for y = 0:frameSize-sizeY
            % Each size fitting within the frameSize:
            for width = sizeX:sizeX:frameSize-x
                for height = sizeY:sizeY:frameSize-y
                    count=count+1;
                end
            end
        end
    end
end

display(count)

Hamitic answered 12/4, 2017 at 18:6 Comment(1)

Why do you use 5 features when only 4 are posted in the main question? But thanks anyway for the Python version. – Mills 25/3, 2018 at 7:37

In their original 2001 paper they only state that they used three kinds of features:

we use three kinds of features

with two, three and four rectangles respectively.

Since each kind has two orientations (that differ by 90 degrees), perhaps for the computation of the total number of features they used 2*3 types of features: 2 two-rectangle features, 2 three-rectangle features and 2 four-rectangle features. With this assumption there are indeed over 180,000 features:

feature_types = [(1,2), (2,1), (1,3), (3,1), (2,2), (2,2)]
window_size = (24,24)

total_features = 0
for f_type in feature_types:
    for f_height in range(f_type[0], window_size[0] + 1, f_type[0]):
        for f_width in range(f_type[1], window_size[1] + 1, f_type[1]):
            total_features += (window_size[0] - f_height + 1) * (window_size[1] - f_width + 1)
            
print(total_features)
# 183072

The second four-rectangle feature differs from the first only by a sign, so there is no need to keep it; dropping it reduces the total number of features to 162,336.
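To illustrate the sign argument, here is a small sketch that evaluates a four-rectangle feature with both polarities on a random image and checks that the values are always negatives of each other. It assumes the feature is the difference between the two diagonal pairs of quadrants; the helper names (rect_sum, four_rect_feature, flipped_is_always_negated) are made up, and plain sums are used instead of an integral image to keep it short.

```python
import random


def rect_sum(img, x, y, w, h):
    """Sum of the pixels in the w-by-h rectangle with top-left corner (x, y)."""
    return sum(img[row][col]
               for row in range(y, y + h)
               for col in range(x, x + w))


def four_rect_feature(img, x, y, w, h, flipped=False):
    """Four-rectangle feature over a w-by-h area (w and h must be even).

    One diagonal pair of quadrants is added and the other subtracted;
    flipped=True swaps which diagonal pair is the positive one.
    """
    half_w, half_h = w // 2, h // 2
    main_diag = (rect_sum(img, x, y, half_w, half_h)
                 + rect_sum(img, x + half_w, y + half_h, half_w, half_h))
    anti_diag = (rect_sum(img, x + half_w, y, half_w, half_h)
                 + rect_sum(img, x, y + half_h, half_w, half_h))
    if flipped:
        return anti_diag - main_diag
    return main_diag - anti_diag


def flipped_is_always_negated(img):
    """Check every even-sized placement inside the square image."""
    size = len(img)
    for w in range(2, size + 1, 2):
        for h in range(2, size + 1, 2):
            for x in range(size - w + 1):
                for y in range(size - h + 1):
                    plain = four_rect_feature(img, x, y, w, h)
                    other = four_rect_feature(img, x, y, w, h, flipped=True)
                    if plain != -other:
                        return False
    return True


random.seed(0)
SIZE = 8  # small window to keep the check fast; the detector uses 24
image = [[random.randint(0, 255) for _ in range(SIZE)] for _ in range(SIZE)]
print(flipped_is_always_negated(image))  # True
```

Because the second orientation carries no information beyond a sign flip, counting it separately only inflates the total, which is consistent with 183,072 versus 162,336.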

Lianaliane answered 23/6, 2020 at 8:55 Comment(0)
