In this article by David Millán Escrivá, Prateek Joshi, and Vinícius Godoy, the authors of the book OpenCV By Example, we will look at how to detect moving objects. In order to detect moving objects, we first need to build a model of the background. This is not the same as direct frame differencing because we are actually modeling the background and using this model to detect moving objects. When we say that we are modeling the background, we are building a mathematical formulation that can be used to represent the background, so this performs much better than the simple frame differencing technique. This technique tries to detect the static parts of the scene and incorporates them into the background model by updating it over time. This background model is then used to detect background pixels. So, it's an adaptive technique that can adjust according to the scene.
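As a concrete illustration of such an adaptive model, here is a minimal sketch using OpenCV's built-in MOG2 background subtractor. This particular class is not derived in this excerpt; it is shown only as one readily available implementation of an adaptive background model:

#include <opencv2/opencv.hpp>
using namespace cv;

int main()
{
    VideoCapture cap(0);
    if( !cap.isOpened() )
        return -1;
    // MOG2 models each pixel as a mixture of Gaussians and updates it over time
    Ptr<BackgroundSubtractor> pMOG2 = createBackgroundSubtractorMOG2();
    Mat frame, fgMask;
    while(true)
    {
        cap >> frame;
        if( frame.empty() )
            break;
        // Update the background model and extract the foreground mask
        pMOG2->apply(frame, fgMask);
        imshow("Foreground Mask", fgMask);
        // Exit when the user presses the Esc key
        if( waitKey(30) == 27 )
            break;
    }
    return 0;
}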
Let’s start the discussion from the beginning. What does a background subtraction process look like? Consider the following image:
The preceding image represents the background scene. Now, let’s introduce a new object into this scene:
As shown in the preceding image, there is a new object in the scene. So, if we compute the difference between this image and our background model, we should be able to identify the location of the TV remote:
The overall process looks like this:
There’s a reason why we call it the naive approach. It works under ideal conditions, and as we know, nothing is ideal in the real world. It does a reasonably good job of computing the shape of the given object, but it does so under some constraints. One of the main requirements of this approach is that the color and intensity of the object should be sufficiently different from that of the background. Some of the factors that affect these kinds of algorithms are image noise, lighting conditions, autofocus in cameras, and so on.
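To make this naive approach concrete, here is a minimal sketch of background subtraction against a fixed background image. The filenames and the threshold value of 30 are illustrative assumptions, not values from the original text:

#include <opencv2/opencv.hpp>
using namespace cv;

int main()
{
    // Load the static background model and the current frame as grayscale
    Mat background = imread("background.jpg", IMREAD_GRAYSCALE);
    Mat frame = imread("frame.jpg", IMREAD_GRAYSCALE);
    if( background.empty() || frame.empty() )
        return -1;
    Mat diff, mask;
    // Pixels that differ strongly from the background are marked as foreground
    absdiff(frame, background, diff);
    threshold(diff, mask, 30, 255, THRESH_BINARY);
    imshow("Foreground Mask", mask);
    waitKey(0);
    return 0;
}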
Once a new object enters our scene and stays there, it will be difficult to detect new objects that are in front of it. This is because we don’t update our background model, and the new object is now part of our background. Consider the following image:
Now, let’s say a new object enters our scene:
We identify this to be a new object, which is fine. Let’s say another object comes into the scene:
It will be difficult to identify the location of these two different objects because their locations overlap. Here’s what we get after subtracting the background and applying the threshold:
In this approach, we assume that the background is static. If some parts of our background start moving, then those parts will start getting detected as new objects. So, even if the movements are minor, say a waving flag, it will cause problems in our detection algorithm. This approach is also sensitive to changes in illumination, and it cannot handle any camera movement. Needless to say, it’s a delicate approach! We need something that can handle all these things in the real world.
We know that we cannot keep a static background image that can be used to detect objects. So, one of the ways to fix this would be to use frame differencing. It is one of the simplest techniques that we can use to see what parts of the video are moving. When we consider a live video stream, the difference between successive frames gives a lot of information. The concept is fairly straightforward. We just take the difference between successive frames and display the difference.
If I move my laptop rapidly, we can see something like this:
Instead of moving the laptop, let's move the subject in front of the camera and see what happens. If I rapidly shake my head, it will look something like this:
As you can see in the preceding images, only the moving parts of the video get highlighted. This gives us a good starting point to see the areas that are moving in the video. Let’s take a look at the function to compute the frame difference:
#include <opencv2/opencv.hpp>
using namespace cv;

Mat frameDiff(Mat prevFrame, Mat curFrame, Mat nextFrame)
{
    Mat diffFrames1, diffFrames2, output;
    // Compute the absolute difference between the current frame and the next frame
    absdiff(nextFrame, curFrame, diffFrames1);
    // Compute the absolute difference between the current frame and the previous frame
    absdiff(curFrame, prevFrame, diffFrames2);
    // Bitwise "AND" the two difference images so only pixels that changed in both remain
    bitwise_and(diffFrames1, diffFrames2, output);
    return output;
}
Frame differencing is fairly straightforward. We compute the absolute difference between the current frame and the previous frame, and between the current frame and the next frame. We then combine these two frame differences with the bitwise AND operator, which highlights the moving parts of the image. If you just compute the difference between the current frame and the previous frame, it tends to be noisy. Hence, we use the bitwise AND operator between successive frame differences to get some stability when we visualize the moving objects.
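If the raw difference image is still too noisy for your scene, one common refinement (not part of the original code) is to binarize the result with a threshold; the value of 30 below is only an illustrative choice:

    // Keep only pixels whose combined difference exceeds the threshold
    Mat motionMask;
    threshold(frameDiff(prevFrame, curFrame, nextFrame), motionMask, 30, 255, THRESH_BINARY);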
Let’s take a look at the function that can extract and return a frame from the webcam:
Mat getFrame(VideoCapture cap, float scalingFactor)
{
    Mat frame, output;
    // Capture the current frame
    cap >> frame;
    // Resize the frame
    resize(frame, frame, Size(), scalingFactor, scalingFactor, INTER_AREA);
    // Convert to grayscale
    cvtColor(frame, output, COLOR_BGR2GRAY);
    return output;
}
As we can see, it’s pretty straightforward. We just need to resize the frame and convert it to grayscale. Now that we have the helper functions ready, let’s take a look at the main function and see how it all comes together:
int main(int argc, char* argv[])
{
    Mat frame, prevFrame, curFrame, nextFrame;
    // Create the capture object
    // 0 -> input arg that specifies it should take the input from the webcam
    VideoCapture cap(0);
    // If we cannot open the webcam, stop the execution!
    if( !cap.isOpened() )
        return -1;
    // Create the GUI window
    namedWindow("Object Movement");
    // Scaling factor to resize the input frames from the webcam
    float scalingFactor = 0.75;
    prevFrame = getFrame(cap, scalingFactor);
    curFrame = getFrame(cap, scalingFactor);
    nextFrame = getFrame(cap, scalingFactor);
    // Iterate until the user presses the Esc key
    while(true)
    {
        // Show the object movement
        imshow("Object Movement", frameDiff(prevFrame, curFrame, nextFrame));
        // Update the variables and grab the next frame
        prevFrame = curFrame;
        curFrame = nextFrame;
        nextFrame = getFrame(cap, scalingFactor);
        // Get the keyboard input and check if it's 'Esc'
        // 27 -> ASCII value of 'Esc' key
        char ch = waitKey(30);
        if (ch == 27) {
            break;
        }
    }
    // Release the video capture object
    cap.release();
    // Close all windows
    destroyAllWindows();
    return 0;
}
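If you want to try this out, a typical way to build it on Linux is shown below, assuming OpenCV 4 is installed with pkg-config support and the source file is named frame_diff.cpp (both assumptions; adjust for your setup):

g++ frame_diff.cpp -o frame_diff $(pkg-config --cflags --libs opencv4)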
As we can see, frame differencing addresses a couple of the important problems we faced earlier. It adapts quickly to lighting changes and camera movements, and if an object enters the frame and stays there, it will not keep being detected in future frames. One of the main drawbacks of this approach is that it struggles with uniformly colored objects: it can only detect the edges of such an object, because the object's interior produces very low pixel differences, as shown in the following image:
Let’s say this object moved slightly. If we compare this with the previous frame, it will look like this:
Hence, very few pixels on the object are labeled as moving. Another concern is that it is difficult to tell whether an object is moving toward the camera or away from it.
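If you want to go one step further and localize the moving regions instead of just visualizing them, a common follow-up (not covered in this excerpt) is to threshold the difference image and extract contours. Here is a minimal sketch, assuming diff holds the grayscale output of frameDiff() and curFrame is the current grayscale frame:

    Mat mask;
    threshold(diff, mask, 30, 255, THRESH_BINARY);
    std::vector<std::vector<Point> > contours;
    findContours(mask, contours, RETR_EXTERNAL, CHAIN_APPROX_SIMPLE);
    // Draw a bounding box around each detected moving region
    for (size_t i = 0; i < contours.size(); i++)
        rectangle(curFrame, boundingRect(contours[i]), Scalar(255), 2);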