
In this article by Roy Shilkrot, coauthor of the book Mastering OpenCV 3, we will discuss the notion of Structure from Motion (SfM), or better put, extracting geometric structure from images taken with a camera under motion, using OpenCV’s API to help us. First, let’s constrain the otherwise very broad approach to SfM using a single camera, usually called a monocular approach, and a discrete and sparse set of frames rather than a continuous video stream. These two constraints will greatly simplify the system we will sketch out in the coming pages and help us understand the fundamentals of any SfM method.

In this article, we will cover the following:

  • Structure from Motion concepts
  • Estimating the camera motion from a pair of images


Throughout the article, we assume the use of a calibrated camera—one that was calibrated beforehand. Calibration is a ubiquitous operation in computer vision, fully supported in OpenCV using command-line tools. We, therefore, assume the existence of the camera’s intrinsic parameters embodied in the K matrix and the distortion coefficients vector—the outputs from the calibration process.
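For instance, if the calibration results were saved to a file, they can be loaded back at the start of the pipeline. The following is only a minimal sketch: it assumes the intrinsics were written with cv::FileStorage under the keys camera_matrix and distortion_coefficients, and the file name and key names here are assumptions you should adapt to your own calibration output:

#include <opencv2/opencv.hpp>
using namespace cv;

// load previously saved calibration results (file name and keys are assumptions)
Mat K, distortion;
FileStorage fs("calibration.yml", FileStorage::READ);
fs["camera_matrix"] >> K;                    // 3x3 intrinsic matrix
fs["distortion_coefficients"] >> distortion; // distortion coefficients vector
fs.release();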

To make things clear in terms of language, from this point on, we will refer to a camera as a single view of the scene rather than to the optics and hardware taking the image. A camera has a position in space and a direction of view. Between two cameras, there is a translation element (movement through space) and a rotation of the direction of view.

We will also treat the terms scene point, world point, real point, and 3D point as the same thing: a point that exists in our real world. The same goes for image points or 2D points, which are points in image coordinates where some real 3D point was projected onto the camera sensor at that location and time.

Structure from Motion concepts

The first discrimination we should make is the difference between stereo (or indeed any multiview), 3D reconstruction using calibrated rigs, and SfM. A rig of two or more cameras assumes we already know what the “motion” between the cameras is, while in SfM, we don’t know what this motion is and we wish to find it. Calibrated rigs, from a simplistic point of view, allow a much more accurate reconstruction of 3D geometry because there is no error in estimating the distance and rotation between the cameras—it is already known. The first step in implementing an SfM system is finding the motion between the cameras. OpenCV may help us in a number of ways to obtain this motion, specifically using the findFundamentalMat and findEssentialMat functions.

Let’s think for one moment of the goal behind choosing an SfM algorithm. In most cases, we wish to obtain the geometry of the scene, for example, where objects are in relation to the camera and what their form is. Having found the motion between the cameras picturing the same scene, from a reasonably similar point of view, we would now like to reconstruct the geometry. In computer vision jargon, this is known as triangulation, and there are plenty of ways to go about it. It may be done by way of ray intersection, where we construct two rays: one from each camera’s center of projection through a point on its image plane. Ideally, these rays will intersect at the one 3D point in the real world that was imaged in each camera, as shown in the following diagram:

In reality, ray intersection is highly unreliable. This is because the rays usually do not intersect, making us fall back to using the middle point on the shortest segment connecting the two rays. OpenCV contains a simple API for a more accurate form of triangulation, the triangulatePoints function, so this part we do not need to code on our own.
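To give a flavor of the API, here is a minimal sketch of how triangulatePoints can be used; it assumes we already have the two 3 x 4 camera matrices (Pleft and Pright) and the aligned 2D point vectors (leftPts and rightPts) that we will compute later in this article:

// triangulate matched points from two views (a sketch, assuming Pleft, Pright,
// leftPts and rightPts were computed as shown later in this article)
Mat points4D; // output: 4xN homogeneous coordinates
triangulatePoints(Pleft, Pright, leftPts, rightPts, points4D);

// convert from homogeneous coordinates to Euclidean 3D points
Mat points3D;
convertPointsFromHomogeneous(points4D.t(), points3D);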

After you have learned how to recover 3D geometry from two views, we will see how you can incorporate more views of the same scene to get an even richer reconstruction. At that point, most SfM methods try to optimize the bundle of estimated positions of our cameras and 3D points by means of Bundle Adjustment. OpenCV contains means for Bundle Adjustment in its new Image Stitching Toolbox. However, the beauty of working with OpenCV and C++ is the abundance of external tools that can be easily integrated into the pipeline. We will, therefore, see how to integrate an external bundle adjuster, the Ceres non-linear optimization package.

Now that we have sketched an outline of our approach to SfM using OpenCV, we will see how each element can be implemented.

Estimating the camera motion from a pair of images

Before we set out to actually find the motion between two cameras, let’s examine the inputs and the tools we have at hand to perform this operation. First, we have two images of the same scene from (hopefully not extremely) different positions in space. This is a powerful asset, and we will make sure that we use it. As for tools, we should take a look at mathematical objects that impose constraints over our images, cameras, and the scene.

Two very useful mathematical objects are the fundamental matrix (denoted by F) and the essential matrix (denoted by E). They are mostly similar, except that the essential matrix assumes the use of calibrated cameras; this is the case for us, so we will choose it. OpenCV allows us to find the fundamental matrix via the findFundamentalMat function and the essential matrix via the findEssentialMat function. Finding the essential matrix can be done as follows:

Mat E = findEssentialMat(leftPoints, rightPoints, focal, pp);

This function makes use of matching points in the “left” image, leftPoints, and “right” image, rightPoints, which we will discuss shortly, as well as two additional pieces of information from the camera’s calibration: the focal length, focal, and principal point, pp.

The essential matrix, E, is a 3 x 3 matrix, which imposes the following constraint on a point in one image and a point in the other image: x'ᵀK⁻ᵀEK⁻¹x = 0, where x is a point in the first image, x' is the corresponding point in the second image, and K is the calibration matrix. This is extremely useful, as we are about to see. Another important fact we use is that the essential matrix is all we need in order to recover the two cameras’ positions from our images, although only up to an arbitrary unit of scale. So, if we obtain the essential matrix, we know where each camera is positioned in space and where it is looking. We can easily calculate the matrix if we have enough of those constraint equations, simply because each equation can be used to solve for a small part of the matrix. In fact, OpenCV internally calculates it using just five point-pairs, but through the Random Sample Consensus algorithm (RANSAC), many more pairs can be used to make for a more robust solution.
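To make the constraint concrete, here is a small illustrative sketch (not taken from the book’s code) that evaluates x'ᵀK⁻ᵀEK⁻¹x for every matched pair, assuming E was found as above and K is the calibration matrix stored as a 64-bit floating-point Mat:

// numerically check the epipolar constraint for each match (illustrative only)
Mat Kinv = K.inv();
for (size_t i = 0; i < leftPoints.size(); i++) {
    Mat x  = (Mat_<double>(3, 1) << leftPoints[i].x,  leftPoints[i].y,  1.0);
    Mat xp = (Mat_<double>(3, 1) << rightPoints[i].x, rightPoints[i].y, 1.0);
    // x'^T K^-T E K^-1 x should be close to zero for a correct match
    double err = Mat(xp.t() * Kinv.t() * E * Kinv * x).at<double>(0, 0);
    // a large value of err hints at an outlier match
}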

Point matching using rich feature descriptors

Now we will make use of our constraint equations to calculate the essential matrix. To get our constraints, remember that for each point in image A, we must find a corresponding point in image B. We can achieve such a matching using OpenCV’s extensive 2D feature-matching framework, which has greatly matured in the past few years.

Feature extraction and descriptor matching is an essential process in computer vision and is used in many methods to perform all sorts of operations, for example, detecting the position and orientation of an object in the image or searching a big database of images for similar images through a given query. In essence, feature extraction means selecting points in the image that would make for good features and computing a descriptor for them. A descriptor is a vector of numbers that describes the image neighborhood around a feature point. Different methods have different lengths and data types for their descriptor vectors. Descriptor matching is the process of finding, for a feature in one set, the corresponding feature in another set using its descriptor. OpenCV provides very easy and powerful methods to support feature extraction and matching.

Let’s examine a very simple feature extraction and matching scheme:

vector<KeyPoint> keypts1, keypts2;
Mat desc1, desc2;

// detect keypoints and extract ORB descriptors
Ptr<Feature2D> orb = ORB::create(2000);
orb->detectAndCompute(img1, noArray(), keypts1, desc1);
orb->detectAndCompute(img2, noArray(), keypts2, desc2);

// matching descriptors
Ptr<DescriptorMatcher> matcher = DescriptorMatcher::create("BruteForce-Hamming");
vector<DMatch> matches;
matcher->match(desc1, desc2, matches);

You may have already seen similar OpenCV code, but let’s review it quickly. Our goal is to obtain three elements: feature points for two images, descriptors for them, and a matching between the two sets of features. OpenCV provides a range of feature detectors, descriptor extractors, and matchers. In this simple example, we use the ORB class to get both the 2D locations of Oriented FAST and Rotated BRIEF (ORB) feature points (where BRIEF stands for Binary Robust Independent Elementary Features) and their respective descriptors. We use a brute-force binary matcher to get the matching, which is the most straightforward way to match two feature sets: by comparing each feature in the first set to each feature in the second set (hence the phrasing “brute-force”).

In the following image, we will see a matching of feature points on two images from the Fountain-P11 sequence found at http://cvlab.epfl.ch/~strecha/multiview/denseMVS.html:

Practically, raw matching like we just performed is good only up to a certain level, and many matches are probably erroneous. For that reason, most SfM methods perform some form of filtering on the matches to ensure correctness and reduce errors. One form of filtering, which is built into OpenCV’s brute-force matcher, is cross-check filtering. That is, a match is considered true only if a feature of the first image matches a feature of the second image, and the reverse check also matches that feature of the second image back to the feature of the first image. Another common filtering mechanism, used in the provided code, is to filter based on the fact that the two images are of the same scene and have a certain stereo-view relationship between them. In practice, the filter tries to robustly calculate the fundamental or essential matrix and retain those feature pairs that correspond to this calculation with small errors.
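To illustrate, here are two common filtering options sketched in code; they are alternatives we might use, not the book’s exact implementation. The first enables cross-checking directly in the brute-force matcher, and the second applies Lowe’s ratio test over the two nearest neighbors (the 0.8 threshold is a common choice, not a prescribed value):

// Option 1: brute-force matching with cross-check enabled
BFMatcher crossMatcher(NORM_HAMMING, true /*crossCheck*/);
vector<DMatch> crossMatches;
crossMatcher.match(desc1, desc2, crossMatches);

// Option 2: ratio test over the two nearest neighbors
vector<vector<DMatch> > knnMatches;
matcher->knnMatch(desc1, desc2, knnMatches, 2);
vector<DMatch> goodMatches;
for (size_t i = 0; i < knnMatches.size(); i++) {
    if (knnMatches[i].size() == 2 &&
        knnMatches[i][0].distance < 0.8f * knnMatches[i][1].distance) {
        goodMatches.push_back(knnMatches[i][0]);
    }
}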

An alternative to using rich features, such as ORB, is to use optical flow. The following information box provides a short overview of optical flow. It is possible to use optical flow instead of descriptor matching to find the required point matching between two images, while the rest of the SfM pipeline remains the same. OpenCV recently extended its API for getting the flow field from two images, and now it is faster and more powerful.

Optical flow

It is the process of matching selected points from one image to another, assuming that both images are part of a sequence and relatively close to one another. Most optical flow methods compare a small region, known as the search window or patch, around each point from image A to the same area in image B. Following a very common rule in computer vision, called the brightness constancy constraint (among other names), the small patches of the image will not change drastically from one image to the other, and therefore the magnitude of their subtraction should be close to zero. In addition to matching patches, newer methods of optical flow use a number of additional techniques to get better results. One is using image pyramids, which are smaller and smaller resized versions of the image that allow for working coarse-to-fine, a very well-used trick in computer vision. Another method is to define global constraints on the flow field, assuming that points close to each other move together in the same direction.
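As a sketch of how optical flow could replace descriptor matching in our pipeline, the following code tracks points from img1 to img2 with the pyramidal Lucas-Kanade method; the parameter values here (2,000 corners, a 21 x 21 window, 3 pyramid levels) are common defaults and only an assumption, not values from the book:

// pick good points to track in the first (grayscale) image
vector<Point2f> prevPts, nextPts;
goodFeaturesToTrack(img1, prevPts, 2000, 0.01, 10);

// track them into the second image with pyramidal Lucas-Kanade
vector<uchar> flowStatus;
vector<float> flowErr;
calcOpticalFlowPyrLK(img1, img2, prevPts, nextPts, flowStatus, flowErr,
                     Size(21, 21), 3);

// keep only the successfully tracked point pairs as our matching
vector<Point2f> leftPts, rightPts;
for (size_t i = 0; i < flowStatus.size(); i++) {
    if (flowStatus[i]) {
        leftPts.push_back(prevPts[i]);
        rightPts.push_back(nextPts[i]);
    }
}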

Finding camera matrices

Now that we have obtained matches between keypoints, we can calculate the essential matrix. However, we must first align our matching points into two arrays, where an index in one array corresponds to the same index in the other. This is required by the findEssentialMat function, as we saw earlier. We will also need to convert the KeyPoint structures to Point2f structures. We must pay special attention to the queryIdx and trainIdx member variables of DMatch, the OpenCV struct that holds a match between two keypoints, as they must align with the way we used the DescriptorMatcher::match() function. The following code section shows how to align a matching into two corresponding sets of 2D points, and how these can be used to find the essential matrix:

vector<KeyPoint> leftKpts, rightKpts;
// ... obtain keypoints using a feature extractor
vector<DMatch> matches;
// ... obtain matches using a descriptor matcher

//align left and right point sets
vector<Point2f> leftPts, rightPts;
for (size_t i = 0; i < matches.size(); i++) {
    // queryIdx is the "left" image
    leftPts.push_back(leftKpts[matches[i].queryIdx].pt);
    // trainIdx is the "right" image
    rightPts.push_back(rightKpts[matches[i].trainIdx].pt);
}

//robustly find the Essential Matrix
Mat status;
Mat E = findEssentialMat(
            leftPts,     //points from left image
            rightPts,    //points from right image
            focal,       //camera focal length factor
            pp,          //camera principal point
            cv::RANSAC,  //use RANSAC for a robust solution
            0.999,       //desired solution confidence level
            1.0,         //point-to-epipolar-line threshold
            status);     //output binary vector of inliers

We may later use the status binary vector to keep only the points that align with the recovered essential matrix and prune the rest. Refer to the following image for an illustration of point matching after pruning. The red arrows mark feature matches that were removed in the process of finding the matrix, and the green arrows are feature matches that were kept:
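For example, a small sketch of such pruning (an assumption of how it could look, not the book’s exact code) keeps only the point pairs marked as inliers in the status vector:

// keep only the point pairs that agree with the recovered essential matrix
vector<Point2f> leftInliers, rightInliers;
for (size_t i = 0; i < leftPts.size(); i++) {
    if (status.at<uchar>(i)) { // non-zero means this pair is an inlier
        leftInliers.push_back(leftPts[i]);
        rightInliers.push_back(rightPts[i]);
    }
}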

Now we are ready to find the camera matrices; however, the new OpenCV 3 API makes things very easy for us by introducing the recoverPose function. First, we will briefly examine the structure of the camera matrix we will use:

This is the model for our camera, P = [R | t]; it consists of two elements, rotation (denoted as R) and translation (denoted as t). The interesting thing about it is that it holds a very essential equation: x = PX, where x is a 2D point on the image and X is a 3D point in space. There is more to it, but this matrix gives us a very important relationship between the image points and the scene points. So, now that we have a motivation for finding the camera matrices, we will see how it can be done. The following code section shows how to decompose the essential matrix into the rotation and translation elements:

Mat E;
// ... find the essential matrix

Mat R, t;  //placeholders for rotation and translation
Mat mask;  //inlier mask, for example the one output by findEssentialMat

//Find Pright camera matrix from the essential matrix
//Cheirality check is performed internally.
recoverPose(E, leftPts, rightPts, R, t, focal, pp, mask);

Very simple. Without going too deep into the mathematical interpretation, this conversion of the essential matrix to rotation and translation is possible because the essential matrix was originally composed of these two elements. Strictly for satisfying our curiosity, we can look at the equation for the essential matrix that appears in the literature: E = [t]×R, where [t]× is the skew-symmetric matrix form of the translation vector. We see that it is composed of (some form of) a translation element, t, and a rotational element, R.

Note that a cheirality check is internally performed in the recoverPose function. The cheirality check makes sure that all triangulated 3D points are in front of the reconstructed camera. Camera matrix recovery from the essential matrix has in fact four possible solutions, but the only correct solution is the one that will produce triangulated points in front of the camera, hence the need for a cheirality check.
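Strictly for illustration, we can also inspect the four candidate poses ourselves with OpenCV’s decomposeEssentialMat; recoverPose already does this, along with the cheirality check, so the following sketch is not needed in the pipeline:

// decompose E into its two possible rotations and one translation (up to sign)
Mat R1, R2, tCand;
decomposeEssentialMat(E, R1, R2, tCand);
// The four candidate poses are [R1 | t], [R1 | -t], [R2 | t], and [R2 | -t];
// only one of them places the triangulated points in front of both cameras.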

Note that what we just did only gives us one camera matrix, and for triangulation, we require two camera matrices. This operation assumes that one camera matrix is fixed and canonical, that is, with no rotation and no translation: P0 = [I | 0].

The other camera that we recovered from the essential matrix has moved and rotated in relation to the fixed one. This also means that any of the 3D points that we recover from these two camera matrices will have the first camera at the world origin point (0, 0, 0).

One more thing we can think of adding to our method is error checking. Many times, the calculation of an essential matrix from point matching is erroneous, and this affects the resulting camera matrices. Continuing to triangulate with faulty camera matrices is pointless. We can add a check to see whether the rotation element is a valid rotation matrix. Keeping in mind that rotation matrices must have a determinant of 1 (or -1, in the case of a reflection), we can simply check that the magnitude of the determinant is close to 1, as follows:

bool CheckCoherentRotation(const cv::Mat_<double>& R) {
    //a valid rotation matrix has a determinant whose magnitude is 1
    if (fabs(fabs(determinant(R)) - 1.0) > 1e-07) {
        cerr << "rotation matrix is invalid" << endl;
        return false;
    }
    return true;
}

We can now see how all these elements combine into a function that recovers the P matrices. First, we will introduce some convenience data structures and type shorthands:

typedef std::vector<cv::KeyPoint> Keypoints;
typedef std::vector<cv::Point2f>  Points2f;
typedef std::vector<cv::Point3f>  Points3f;
typedef std::vector<cv::DMatch>   Matching;

struct Features { //2D features
    Keypoints keyPoints;
    Points2f  points;
    cv::Mat   descriptors;
};

struct Intrinsics { //camera intrinsic parameters
    cv::Mat K;
    cv::Mat Kinv;
    cv::Mat distortion;
};

Now, we can write the camera matrix finding function:

void findCameraMatricesFromMatch(
        const Intrinsics& intrin,
        const Matching&   matches,
        const Features&   featuresLeft,
        const Features&   featuresRight,
        cv::Matx34f&      Pleft,
        cv::Matx34f&      Pright)
{
    //Note: assuming fx = fy
    const double focal = intrin.K.at<float>(0, 0);
    const cv::Point2d pp(intrin.K.at<float>(0, 2),
                         intrin.K.at<float>(1, 2));

    //align left and right point sets using the matching
    Features left;
    Features right;
    GetAlignedPointsFromMatch(
        featuresLeft,
        featuresRight,
        matches,
        left,
        right);

    //find essential matrix
    Mat E, mask;
    E = findEssentialMat(
            left.points,
            right.points,
            focal,
            pp,
            RANSAC,
            0.999,
            1.0,
            mask);

    Mat_<double> R, t;

    //Find Pright camera matrix from the essential matrix
    recoverPose(E, left.points, right.points, R, t, focal, pp, mask);

    Pleft  = Matx34f::eye();
    Pright = Matx34f(R(0,0), R(0,1), R(0,2), t(0),
                     R(1,0), R(1,1), R(1,2), t(1),
                     R(2,0), R(2,1), R(2,2), t(2));
}

At this point, we have the two cameras that we need in order to reconstruct the scene: the canonical first camera, in the Pleft variable, and the second camera, which we calculated from the essential matrix, in the Pright variable.

Choosing the image pair to use first

Given we have more than just two image views of the scene, we must choose which two views we will start the reconstruction from. In their paper, Snavely et al. suggest that we pick the two views that have the least number of homography inliers. A homography is a relationship between two images or sets of points that lie on a plane; the homography matrix defines the transformation from one plane to another. In case of an image or a set of 2D points, the homography matrix is of size 3 x 3.

When Snavely et al. look for the lowest inlier ratio, they essentially suggest to calculate the homography matrix between all pairs of images and pick the pair whose points mostly do not correspond with the homography matrix. This means the geometry of the scene in these two views is not planar or at least not the same plane in both views, which helps when doing 3D reconstruction. For reconstruction, it is best to look at a complex scene with non-planar geometry, with things closer and farther away from the camera.

The following code snippet shows how to use OpenCV’s findHomography function to count the number of inliers between two views whose features were already extracted and matched:

int findHomographyInliers(
        const Features& left,
        const Features& right,
        const Matching& matches) {

    //Get aligned feature vectors
    Features alignedLeft;
    Features alignedRight;
    GetAlignedPointsFromMatch(left,
                              right,
                              matches,
                              alignedLeft,
                              alignedRight);

    //Calculate homography with at least 4 points
    Mat inlierMask;
    Mat homography;
    if (matches.size() >= 4) {
        homography = findHomography(alignedLeft.points,
                                    alignedRight.points,
                                    cv::RANSAC, RANSAC_THRESHOLD,
                                    inlierMask);
    }

    if (matches.size() < 4 || homography.empty()) {
        return 0;
    }

    return countNonZero(inlierMask);
}

The next step is to perform this operation on all pairs of image views in our bundle and sort them based on the ratio of homography inliers to outliers:

//sort pairwise matches to find the pairs with the lowest homography inlier ratio
map<float, ImagePair> pairInliersCt;
const size_t numImages = mImages.size();

//scan all possible image pairs (symmetric)
for (size_t i = 0; i < numImages - 1; i++) {
    for (size_t j = i + 1; j < numImages; j++) {
        if (mFeatureMatchMatrix[i][j].size() < MIN_POINT_CT) {
            //Not enough points in matching
            pairInliersCt[1.0] = {i, j};
            continue;
        }

        //Find number of homography inliers
        const int numInliers = findHomographyInliers(
                mImageFeatures[i],
                mImageFeatures[j],
                mFeatureMatchMatrix[i][j]);

        const float inliersRatio =
                (float)numInliers /
                (float)(mFeatureMatchMatrix[i][j].size());

        pairInliersCt[inliersRatio] = {i, j};
    }
}

Note that the std::map<float, ImagePair> will internally sort the pairs based on the map’s key: the inlier ratio. We then simply need to traverse this map from the beginning to find the image pair with the lowest inlier ratio, and if that pair cannot be used, we can easily skip ahead to the next pair.
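A minimal sketch of that traversal could look as follows; tryInitialReconstruction is a hypothetical helper standing in for the pose recovery and triangulation steps described earlier:

// walk the pairs from the lowest homography-inlier ratio upward
for (const auto& pairEntry : pairInliersCt) {
    const ImagePair& pair = pairEntry.second;
    if (tryInitialReconstruction(pair)) { // hypothetical helper
        break; // this pair works; use it as the starting point
    }
    // otherwise, skip ahead to the next-best pair
}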

Summary

In this article, we saw how OpenCV v3 can help us approach Structure from Motion in a manner that is both simple to code and to understand. OpenCV v3’s new API contains a number of useful functions and data structures that make our lives easier and also assist in a cleaner implementation.

However, state-of-the-art SfM methods are far more complex. There are many issues we choose to disregard in favor of simplicity, and plenty more error examinations that are usually in place. Our chosen methods for the different elements of SfM can also be revisited. Some methods even use N-view triangulation once they establish the relationships between features across multiple views.

If we would like to extend and deepen our familiarity with SfM, we will certainly benefit from looking at other open source SfM libraries. One particularly interesting project is libMV, which implements a vast array of SfM elements that may be interchanged to get the best results. There is a great body of work from the University of Washington that provides tools for many flavors of SfM (Bundler and VisualSfM). This work inspired Microsoft’s online product PhotoSynth and Autodesk’s 123D Catch. There are many more implementations of SfM readily available online, and one must only search to find quite a lot of them.
