
In this article by Foat Akhmadeev, author of the book Computer Vision for the Web, we will discuss how we can detect an object in an image using several JavaScript libraries. In particular, we will look at FAST feature detection and at matching BRIEF and ORB descriptors. Finally, a complete object detection example will be presented.

There are many ways to detect an object in an image. Color-based object detection and the detection of changes in image intensity are simple computer vision methods; beyond them, there are some fundamental techniques that every computer vision enthusiast should know. The libraries we use here are:

  • JSFeat (http://inspirit.github.io/jsfeat/)
  • tracking.js (http://trackingjs.com)


Detecting key points

What information do we get when we see an object in an image? An object usually consists of some regular parts or unique points, which represent this particular object. Of course, we could compare every pixel of the image, but that is too expensive in terms of computational speed. We could instead pick points at random, which reduces the computation cost significantly, but random points do not tell us much about the object. Using all the pixels, we get too much noise and lose important parts of the object representation; using random pixels, we simply miss them. So both ideas, taking all pixels and taking random pixels, are really bad. What can we do instead?

We are working with a grayscale image, and we need to get its unique points, so we should focus on the intensity information, for example, by extracting object edges with the Canny edge detector or the Sobel filter. That brings us closer to the solution, but not close enough. What if we have a long edge? It is not ideal to keep the many near-identical pixels that lie on that edge. An edge of an object has end points or corners; if we reduce the edge to those corners, we keep enough unique pixels and remove the unnecessary information.

There are various methods of getting keypoints from an image, many of which extract corners as those keypoints. To get them, we will use the FAST (Features from Accelerated Segment Test) algorithm. It is really simple and you can easily implement it by yourself if you want. But you do not need to. The algorithm implementation is provided by both tracking.js and JSFeat libraries.

The idea of the FAST algorithm can be captured from the following image:

Suppose we want to check whether the pixel P is a corner. We check the 16 pixels on a circle around it. If at least 9 contiguous pixels on that circle are all much darker or all much brighter than P, then we say that P is a corner. How much darker or brighter should they be? The decision is made by applying a threshold to the difference between the value of P and the values of the pixels around it.
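
To make the test concrete, here is a simplified sketch of the segment test for a single pixel. This is our own illustration, not the code of either library, and it skips the quick pre-test and the border handling that real implementations use:

// Offsets of the 16 pixels on the circle of radius 3 around the candidate pixel.
var circle16 = [
   [0, -3], [1, -3], [2, -2], [3, -1], [3, 0], [3, 1], [2, 2], [1, 3],
   [0, 3], [-1, 3], [-2, 2], [-3, 1], [-3, 0], [-3, -1], [-2, -2], [-1, -3]
];

// img is a 1-channel grayscale array stored row by row with cols pixels per row.
function isFastCorner(img, cols, x, y, threshold) {
   var p = img[y * cols + x];
   // Mark each circle pixel: +1 if much brighter, -1 if much darker, 0 otherwise.
   var state = circle16.map(function (o) {
      var v = img[(y + o[1]) * cols + (x + o[0])];
      return v > p + threshold ? 1 : (v < p - threshold ? -1 : 0);
   });
   // Look for at least 9 contiguous pixels with the same non-zero mark;
   // we walk the circle twice to catch runs that wrap around its start.
   var run = 0, prev = 0;
   for (var i = 0; i < 32; i++) {
      var s = state[i % 16];
      run = (s !== 0 && s === prev) ? run + 1 : (s !== 0 ? 1 : 0);
      prev = s;
      if (run >= 9) return true;
   }
   return false;
}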

A practical example

First, we will start with an example of FAST corner detection using the tracking.js library. Before we do anything else, we can set the detector threshold. The threshold defines the minimum difference between a tested corner and the points around it:

tracking.Fast.THRESHOLD = 30;

It is usually good practice to apply a Gaussian blur to the image before we start the method; it significantly reduces the image noise:

var imageData = context.getImageData(0, 0, cols, rows);
var gray = tracking.Image.grayscale(imageData.data, cols, rows, true);
var blurred4 = tracking.Image.blur(gray, cols, rows, 3);

Remember that the blur function returns a four-channel (RGBA) array, so we need to convert it to a single channel. Since we can simply take every fourth value and skip the other channels, this is not a problem:

var blurred1 = new Array(blurred4.length / 4);
for (var i = 0, j = 0; i < blurred4.length; i += 4, ++j) {
   blurred1[j] = blurred4[i];
}

Next, we run a corner detection function on our image array:

var corners = tracking.Fast.findCorners(blurred1, cols, rows);

The function returns an array whose length is twice the number of detected corners, in the format [x0, y0, x1, y1, …], where [xn, yn] are the coordinates of a detected corner. To print the result on a canvas, we will use the fillRect function:

for (i = 0; i < corners.length; i += 2) {
   context.fillStyle = '#0f0';
   context.fillRect(corners[i], corners[i + 1], 3, 3);
}

Let’s see an example with the JSFeat library, for which the steps are very similar to those of tracking.js. First, we set the global threshold with a function:

jsfeat.fast_corners.set_threshold(30);

Then, we apply a Gaussian blur to an image matrix and run the corner detection:

jsfeat.imgproc.gaussian_blur(matGray, matBlurred, 3);

We need to preallocate keypoints for the corners result. The keypoint_t function is just a new type that is useful for image keypoints. The first two parameters represent the coordinates of a point, and the remaining parameters set the point score (which indicates whether the point is good enough to be a keypoint), the point level (which you can use, for example, in an image pyramid), and the point angle (which is usually used for the gradient orientation):

var corners = [];
var i = cols * rows;
while (--i >= 0) {
   corners[i] = new jsfeat.keypoint_t(0, 0, 0, 0, -1);
}

After all this, we execute the FAST corner detection method. The last parameter of the detection function defines a border size. The border is needed because the circle of 16 pixels cannot be evaluated near the image boundary; for example, you cannot say whether the pixel [0, 0] is a corner, since there is no pixel [0, -3] in our matrix:

var count = jsfeat.fast_corners.detect(matBlurred, corners, 3);

Since we preallocated the corners, the function returns the number of corners it found. The result is an array of structures with x and y fields, so we can print it using those fields:

for (var i = 0; i < count; i++) {
   context.fillStyle = '#0f0';
   context.fillRect(corners[i].x, corners[i].y, 3, 3);
}

The result is nearly the same for both algorithms; the differences lie in the details of the implementation. Let’s look at the following example:

From left to right: tracking.js without blur, JSFeat without blur, tracking.js and JSFeat with blur.

If you look closely, you can see the difference between tracking.js and JSFeat results, but it is not easy to spot it. Look at how much noise was reduced by applying just a small 3 x 3 Gaussian filter! A lot of noisy points were removed from the background. And now the algorithm can focus on points that represent flowers and the pattern of the vase.

We have extracted keypoints from our image, and we successfully reached the goal of reducing their number while focusing on the unique points of the image. Now, we need to compare, or match, those points somehow. How can we do that?

Descriptors and object matching

Image features by themselves are a bit useless. Yes, we have found unique points in the image, but what did we get? Only pixel values, and that is it. If we try to compare these values directly, it will not give us much information. Moreover, if we change the overall image brightness, we will not even find the same keypoints in the same image! Taking all of this into account, we need the information that surrounds our keypoints, and we need a method to compare this information efficiently. In other words, we need to describe the image features, which is where image descriptors come in. In this part, we will see how these descriptors can be extracted and matched. The tracking.js and JSFeat libraries provide different methods for image descriptors; we will discuss both.

BRIEF and ORB descriptors

Descriptors capture the changes in intensity of the pixels around a feature. The tracking.js library provides BRIEF (Binary Robust Independent Elementary Features) descriptors, and JSFeat provides their extension, ORB (Oriented FAST and Rotated BRIEF). As the ORB name suggests, it is rotation invariant: even if you rotate an object, the algorithm can still detect it. Moreover, the authors of the JSFeat library provide an example that uses an image pyramid, which makes the detection scale invariant too.

Let’s start by explaining BRIEF, since it is the basis for ORB descriptors. As a first step, the algorithm takes the computed image features and, for each of them, picks unique pairs of pixel positions around the feature. Based on the intensities of these pairs, it forms a binary string: for a pair of positions i and j, if I(i) < I(j) (where I(pos) means the image value at the position pos), the result is 1, otherwise 0, and we append this bit to the binary string. We do that for N pairs, where N is taken as a power of two (128, 256, 512). Since descriptors are just binary strings, we can compare them very efficiently. To match these strings, the Hamming distance is usually used. It is the minimum number of substitutions required to change one string into the other. For example, take the two binary strings 10011 and 11001: the Hamming distance between them is 2, since we need to change 2 bits to turn the first string into the second.
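
As a small illustration, this is how such a distance could be computed in JavaScript. The hammingDistance helper below is our own, not a function of either library, and it assumes each descriptor is stored as an array of integer words:

function hammingDistance(a, b) {
   var dist = 0;
   for (var i = 0; i < a.length; i++) {
      var v = a[i] ^ b[i];     // bits in which the two descriptors differ
      while (v) {              // count the set bits (Kernighan's trick)
         v &= v - 1;
         dist++;
      }
   }
   return dist;
}

// The strings 10011 and 11001 from the example above differ in 2 bits:
console.log(hammingDistance([parseInt('10011', 2)], [parseInt('11001', 2)])); // 2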

The JSFeat library provides the functionality to apply ORB descriptors. The core idea is very similar to BRIEF. However, there are two major differences:

  • The implementation is scale invariant, since the descriptors are computed for an image pyramid.
  • The descriptors are rotation invariant; an orientation is computed from the intensities of the patch around a feature, and, using this orientation, ORB manages to compute the BRIEF descriptor in a rotation-invariant manner.

Implementation of descriptors and their matching

Our goal is to find an object from a template on a scene image. We can do that by finding features and descriptors on both images and matching descriptors from a template to an image.

We start from the tracking.js library and BRIEF descriptors. The first thing that we can do is set the number of location pairs:

tracking.Brief.N = 512;

By default, it is 256, but you can choose a higher value. The larger the value, the more information you get, and the more memory and computation it requires.

Before starting the computation, do not forget to apply the Gaussian blur to reduce the image noise. Next, we find the FAST corners and compute descriptors on both images. Here and in the next example, we use the suffix Object for a template image and Scene for a scene image:

var cornersObject = tracking.Fast.findCorners(grayObject, colsObject, rowsObject);
var cornersScene = tracking.Fast.findCorners(grayScene, colsScene, rowsScene);
var descriptorsObject = tracking.Brief.getDescriptors(grayObject, colsObject, cornersObject);
var descriptorsScene = tracking.Brief.getDescriptors(grayScene, colsScene, cornersScene);

Then we do the matching:

var matches = tracking.Brief.reciprocalMatch(cornersObject, descriptorsObject, cornersScene, descriptorsScene);

We need to pass both the corners and the descriptors to the function, since it returns coordinates as part of the result.

Next, we print both images on one canvas, side by side. To draw the matches with this trick, we need to shift the scene keypoints by the width of the template image, since keypoint1 of a match is a point on the template and keypoint2 is a point on the scene image. Both keypoint1 and keypoint2 are arrays with the x and y coordinates at indexes 0 and 1, respectively:

for (var i = 0; i < matches.length; i++) {
   var color = '#' + Math.floor(Math.random() * 16777215).toString(16);
   context.fillStyle = color;
   context.strokeStyle = color;
   context.fillRect(matches[i].keypoint1[0], matches[i].keypoint1[1], 5, 5);
   context.fillRect(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1], 5, 5);
   context.beginPath();
   context.moveTo(matches[i].keypoint1[0], matches[i].keypoint1[1]);
   context.lineTo(matches[i].keypoint2[0] + colsObject, matches[i].keypoint2[1]);
   context.stroke();
}

The JSFeat library provides most of the code for pyramids and scale-invariant features not in the library itself, but in its examples, which are available at https://github.com/inspirit/jsfeat/blob/gh-pages/sample_orb.html. We will not reproduce the full code here, because it would take too much space, but we will highlight the main points.

Let’s start with the functions that are included in the library. First, we need to preallocate the descriptors matrix, where 32 is the length of a descriptor in bytes (32 bytes, that is, 256 bits, again a power of two) and 500 is the maximum number of descriptors:

var descriptors = new jsfeat.matrix_t(32, 500, jsfeat.U8C1_t);

Then, we compute the ORB descriptors for each corner; we need to do that for both the template and the scene images:

jsfeat.orb.describe(matBlurred, corners, num_corners, descriptors);

The matching itself is done by the example’s match_pattern function, which uses global variables that mainly define the input descriptors and the output matches:

function match_pattern()
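
The full matcher is too long to reproduce here, but its core is essentially a brute-force nearest-neighbour search over Hamming distances. Here is a simplified sketch of that idea (our own version, reusing the hammingDistance helper from earlier and ignoring the pyramid levels); each descriptor is assumed to be an array holding the 32 bytes of one matrix row, and the pattern_idx and screen_idx names mirror the match_t fields described next:

// For every template descriptor, find the closest scene descriptor and keep
// the pair if the distance is below a chosen threshold.
function matchDescriptors(templateDescriptors, sceneDescriptors, maxDistance) {
   var matches = [];
   for (var i = 0; i < templateDescriptors.length; i++) {
      var best = -1, bestDist = 257;   // 256 bits is the largest possible distance
      for (var j = 0; j < sceneDescriptors.length; j++) {
         var d = hammingDistance(templateDescriptors[i], sceneDescriptors[j]);
         if (d < bestDist) { bestDist = d; best = j; }
      }
      if (best >= 0 && bestDist <= maxDistance) {
         matches.push({ pattern_idx: i, screen_idx: best, distance: bestDist });
      }
   }
   return matches;
}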

The result match_t contains the following fields:

  • screen_idx: This is the index of a scene descriptor
  • pattern_lev: This is the index of a pyramid level
  • pattern_idx: This is the index of a template descriptor

Since ORB works with the image pyramid, it returns corners and matches for each level:

var s_kp = screen_corners[m.screen_idx];
var p_kp = pattern_corners[m.pattern_lev][m.pattern_idx];

We can print each match as shown here. Again, we apply a shift, since we computed the descriptors on separate images but draw the result on one canvas:

context.fillRect(p_kp.x, p_kp.y, 4, 4);
context.fillRect(s_kp.x + shift, s_kp.y, 4, 4);

Working with a perspective

Let’s take a short detour. Sometimes, an object you want to detect is affected by perspective distortion. In that case, you may want to rectify the object plane. For example, a building wall:

Looks good, doesn’t it? How do we do that? Let’s look at the code:

var imgRectified = new jsfeat.matrix_t(mat.cols, mat.rows, jsfeat.U8_t | jsfeat.C1_t);
var transform = new jsfeat.matrix_t(3, 3, jsfeat.F32_t | jsfeat.C1_t);

jsfeat.math.perspective_4point_transform(transform,
       0, 0, 0, 0, // first pair x1_src, y1_src, x1_dst, y1_dst
       640, 0, 640, 0, // x2_src, y2_src, x2_dst, y2_dst and so on.
       640, 480, 640, 480,
       0, 480, 180, 480);
jsfeat.matmath.invert_3x3(transform, transform);
jsfeat.imgproc.warp_perspective(mat, imgRectified, transform, 255);

First, as we did earlier, we define a result matrix object. Next, we compute the matrix for the image perspective transformation from four pairs of corresponding points. For example, the last, that is, the fourth point of the original image, which is [0, 480], should be projected to the point [180, 480] in the rectified image; here, the first coordinate refers to x and the second to y. Then, we invert the transform matrix so that we can apply it to the original image, the mat variable. We pick white as the background color (255 for an unsigned byte). As a result, we get a nice image without any perspective distortion.

Finding an object location

Returning to our primary goal: we found the matches, which is great, but we have not yet found the object’s location. There is no function for that in the tracking.js library, but JSFeat provides such functionality in its examples section.

First, we need to compute a perspective transform matrix; this is why we discussed such a transformation previously. We have matching points on the two images, but we do not yet have a transformation for the whole image.

First, we define a transform matrix:

var homo3x3 = new jsfeat.matrix_t(3, 3, jsfeat.F32C1_t);

To compute the homography, we need only four points, but after the matching we get many more. In addition, some of them can be noisy, and we need to skip them somehow. For that, we use the RANSAC (Random Sample Consensus) algorithm. It is an iterative method for estimating a mathematical model from a dataset that contains outliers (noise): it identifies the outliers and computes the model using only the remaining data.
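
The following is a conceptual sketch of the RANSAC loop, not JSFeat’s implementation; the fitHomography and reprojectionError helpers are hypothetical and used only for illustration:

// Repeatedly fit a model to a minimal random sample and keep the model
// that is supported by the largest number of correspondences (inliers).
function ransacSketch(srcPoints, dstPoints, iterations, threshold) {
   var bestModel = null, bestInliers = 0;
   for (var it = 0; it < iterations; it++) {
      // 1. Pick 4 random correspondences (the minimum needed for a homography).
      var idx = [];
      while (idx.length < 4) {
         var r = Math.floor(Math.random() * srcPoints.length);
         if (idx.indexOf(r) < 0) idx.push(r);
      }
      // 2. Fit a candidate model to the sample (hypothetical helper).
      var model = fitHomography(idx.map(function (i) { return srcPoints[i]; }),
                                idx.map(function (i) { return dstPoints[i]; }));
      // 3. Count the correspondences that agree with the candidate model.
      var inliers = 0;
      for (var i = 0; i < srcPoints.length; i++) {
         if (reprojectionError(model, srcPoints[i], dstPoints[i]) < threshold) {
            inliers++;
         }
      }
      // 4. Keep the best-supported model seen so far.
      if (inliers > bestInliers) { bestInliers = inliers; bestModel = model; }
   }
   return bestModel;
}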

Before we start, we need to define the algorithm parameters. The first parameter is a match mask, in which every match will be marked as good (1) or bad (0):

var match_mask = new jsfeat.matrix_t(500, 1, jsfeat.U8C1_t);

The mathematical model we want to estimate is a 2D homography:

var mm_kernel = new jsfeat.motion_model.homography2d();

Minimum number of points to estimate a model (4 points to get a homography):

var num_model_points = 4;

The maximum reprojection error for a data point to be classified as an inlier, that is, a good match:

var reproj_threshold = 3;

Finally, we create the variable that holds the main parameters. Its last two arguments define the expected ratio of outliers (0.5) and the desired probability of success (0.99), which the algorithm uses to decide when it can stop:

var ransac_param = new jsfeat.ransac_params_t(num_model_points,
       reproj_threshold, 0.5, 0.99);

Then, we run the RANSAC algorithm. The last parameter is the maximum number of iterations for the algorithm:

jsfeat.motion_estimator.ransac(ransac_param, mm_kernel,
       object_xy, screen_xy, count, homo3x3, match_mask, 1000);

The shape finding works with both the tracking.js and JSFeat libraries; you just need to provide the matches as object_xy and screen_xy, where both arguments must hold arrays of objects with x and y fields.
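
With the tracking.js matches from the previous section, for example, these arrays could be built with a few lines of glue code (a sketch of our own; reciprocalMatch stores the coordinates in keypoint1 and keypoint2):

var object_xy = [];
var screen_xy = [];
for (var i = 0; i < matches.length; i++) {
   // keypoint1 belongs to the template image, keypoint2 to the scene image.
   object_xy.push({ x: matches[i].keypoint1[0], y: matches[i].keypoint1[1] });
   screen_xy.push({ x: matches[i].keypoint2[0], y: matches[i].keypoint2[1] });
}
var count = matches.length;

After we have found the transformation matrix, we project the shape of the template onto the scene image: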

var shape_pts = tCorners(homo3x3.data, colsObject, rowsObject);
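
The tCorners helper also comes from the JSFeat sample; conceptually, it applies the homography to the four corners of the template. A rough sketch of that idea (our own version, assuming homo3x3.data holds the nine matrix entries in row-major order) is shown here:

// Project the four corners of a cols x rows template through a 3x3 homography.
function projectCorners(M, cols, rows) {
   function project(x, y) {
      var z = M[6] * x + M[7] * y + M[8];
      return {
         x: (M[0] * x + M[1] * y + M[2]) / z,
         y: (M[3] * x + M[4] * y + M[5]) / z
      };
   }
   return [project(0, 0), project(cols, 0), project(cols, rows), project(0, rows)];
}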

After the computation is done, we draw the computed shapes on our images.

As we can see, our program successfully found the object in both cases. In practice, the two methods can show different performance, which mainly depends on the thresholds you set.

Summary

Image features and descriptor matching are powerful tools for object detection. Both the JSFeat and tracking.js libraries provide different functionality to match objects using these features. In addition, the JSFeat project contains algorithms for finding an object’s location. These methods can be useful for tasks such as unique object detection, face tracking, and building human interfaces by tracking various objects very efficiently.
