In today’s tutorial, we will learn how to apply the AdaBoost classifier in face detection using Haar cascades.
Face detection using Haar cascades
Object detection using Haar feature-based cascade classifiers is an effective object detection method proposed by Paul Viola and Michael Jones in their paper Rapid Object Detection using a Boosted Cascade of Simple Features in 2001. It is a machine-learning-based approach where a cascade function is trained from a lot of positive and negative images. It is then used to detect objects in other images.
Here, we will work with face detection. Initially, the algorithm needs a lot of positive images (images of faces) and negative images (images without faces) to train the classifier. Then we need to extract features from it. Features are nothing but numerical information extracted from the images that can be used to distinguish one image from another; for example, a histogram (distribution of intensity values) is one of the features that can be used to define several characteristics of an image even without looking at the image, such as dark or bright image, the intensity range of the image, contrast, and so on. We will use Haar features to detect faces in an image. Here is a figure showing different Haar features:
These features are just like the convolution kernel; to know about convolution, you need to wait for the following chapters. For a basic understanding, convolutions can be described as in the following figure:
So we can summarize convolution with these steps:
- Pick a pixel location from the image.
- Now crop a sub-image with the selected pixel as the center from the source image with the same size as the convolution kernel.
- Calculate an element-wise product between the values of the kernel and sub- image.
- Add the result of the product.
- Put the resultant value into the new image at the same place where you picked up the pixel location.
Now we are going to do a similar kind of procedure, but with a slight difference for our images. Each feature of ours is a single value obtained by subtracting the sum of the pixels under the white rectangle from the sum of the pixels under the black rectangle.
Now, all possible sizes and locations of each kernel are used to calculate plenty of features. (Just imagine how much computation it needs. Even a 24×24 window results in over 160,000 features.) For each feature calculation, we need to find the sum of the pixels under the white and black rectangles. To solve this, we will use the concept of integral image; we will discuss this concept very briefly here, as it’s not a part of our context.
Integral images are those images in which the pixel value at any (x,y) location is the sum of the all pixel values present before the current pixel. Its use can be understood by the following example:
Image on the left and the integral image on the right.
Let’s see how this concept can help reduce computation time; let us assume a matrix A of size 5×5 representing an image, as shown here:
Now, let’s say we want to calculate the average intensity over the area highlighted:
Region for addition
Normally, you’d do the following:
9 + 1 + 2 + 6 + 0 + 5 + 3 + 6 + 5 = 37
37 / 9 = 4.11
This requires a total of 9 operations. Doing the same for 100 such operations would require:
100 * 9 = 900 operations.
Now, let us first make a integral image of the preceding image:
Making this image requires a total of 56 operations.
Again, focus on the highlighted portion:
To calculate the avg intensity, all you have to do is:
(76 – 20) – (24 – 5) = 37
37 / 9 = 4.11
This required a total of 4 operations.
To do this for 100 such operations, we would require:
56 + 100 * 4 = 456 operations.
For just a hundred operations over a 5×5 matrix, using an integral image requires about 50% less computations. Imagine the difference it makes for large images and other such operations.
Creation of an integral image changes other sum difference operations by almost O(1) time complexity, thereby decreasing the number of calculations.
It simplifies the calculation of the sum of pixels—no matter how large the number of pixels—to an operation involving just four pixels. Nice, isn’t it? It makes things superfast.
However, among all of these features we calculated, most of them are irrelevant. For example, consider the following image. The top row shows two good features. The first feature selected seems to focus on the property that the region of the eyes is often darker than the region of the nose and cheeks. The second feature selected relies on the property that the eyes are darker than the bridge of the nose. But the same windows applying on cheeks or any other part is irrelevant. So how do we select the best features out of 160000+ features? It is achieved by AdaBoost.
To do this, we apply each and every feature on all the training images. For each feature, it finds the best threshold that will classify the faces as positive and negative. Obviously, there will be errors or misclassifications. We select the features with the minimum error rate, which means they are the features that best classify the face and non-face images.
|Note: The process is not as simple as this. Each image is given an equal weight in the beginning. After each classification, the weights of misclassified images are increased. Again, the same process is done. New error rates are calculated among the new weights. This process continues until the required accuracy or error rate is achieved or the required number of features is found.|
The final classifier is a weighted sum of these weak classifiers. It is called weak because it alone can’t classify the image, but together with others, it forms a strong classifier. The paper says that even 200 features provide detection with 95% accuracy. Their final setup had around 6,000 features. (Imagine a reduction from 160,000+ to 6000 features. That is a big gain.)
Face detection framework using the Haar cascade and AdaBoost algorithm
So now, you take an image take each 24×24 window, apply 6,000 features to it, and check if it is a face or not. Wow! Wow! Isn’t this a little inefficient and time consuming? Yes, it is. The authors of the algorithm have a good solution for that.
In an image, most of the image region is non-face. So it is a better idea to have a simple method to verify that a window is not a face region. If it is not, discard it in a single shot. Don’t process it again. Instead, focus on the region where there can be a face. This way, we can find more time to check a possible face region.
For this, they introduced the concept of a cascade of classifiers. Instead of applying all the 6,000 features to a window, we group the features into different stages of classifiers and apply one by one (normally first few stages will contain very few features). If a window fails in the first stage, discard it. We don’t consider the remaining features in it. If it passes, apply the second stage of features and continue the process. The window that passes all stages is a face region. How cool is the plan!!!
The authors’ detector had 6,000+ features with 38 stages, with 1, 10, 25, 25, and 50 features in the first five stages (two features in the preceding image were actually obtained as the best two features from AdaBoost). According to the authors, on average, 10 features out of 6,000+ are evaluated per subwindow.
So this is a simple, intuitive explanation of how Viola-Jones face detection works. Read the paper for more details.
If you found this post useful, do check out the book Ensemble Machine Learning to learn different machine learning aspects such as bagging, boosting, and stacking.