Marker-based Augmented Reality on iPhone or iPad

Creating an iOS project that uses OpenCV

In this section we will create a demo application for iPhone/iPad devices that uses the OpenCV (Open Source Computer Vision) library to detect markers in the camera frame and render 3D objects on them. This example will show you how to get access to the raw video data stream from the device camera, perform image processing using the OpenCV library, find a marker in an image, and render an AR overlay.

We will start by creating a new Xcode project using the iOS Single View Application template, as shown in the following screenshot:

Now we have to add OpenCV to our project. This step is necessary because in this application we will use a lot of functions from this library to detect markers and estimate their position.

OpenCV is a library of programming functions for real-time computer vision. It was originally developed by Intel and is now supported by Willow Garage and Itseez. This library is written in C and C++ languages. It also has an official Python binding and unofficial bindings to Java and .NET languages.

Adding OpenCV framework

Fortunately the library is cross-platform, so it can be used on iOS devices. Starting from version 2.4.2, the OpenCV library is officially supported on the iOS platform, and you can download the distribution package from the library website. The OpenCV for iOS link points to the compressed OpenCV framework. Don’t worry if you are new to iOS development; a framework is like a bundle of files. Usually each framework package contains a set of header files and a set of statically linked libraries. Application frameworks provide an easy way to distribute precompiled libraries to developers.

Of course, you can build your own libraries from scratch. OpenCV documentation explains this process in detail. For simplicity, we follow the recommended way and use the framework for this article.

After downloading the file we extract its content to the project folder, as shown in the following screenshot:

To tell the Xcode IDE to use a framework during the build stage, click on Project options and locate the Build Phases tab. From there we can add or remove frameworks involved in the build process. Click on the plus sign to add a new framework, as shown in the following screenshot:

From here we can choose from a list of standard frameworks. But to add a custom framework we should click on the Add other button. The open file dialog box will appear. Point it to opencv2.framework in the project folder as shown in the following screenshot:

Including OpenCV headers

Now that we have added the OpenCV framework to the project, everything is almost done. One last thing—let’s add OpenCV headers to the project’s precompiled headers. The precompiled headers are a great feature to speed up compilation time. By adding OpenCV headers to them, all your sources automatically include OpenCV headers as well. Find a .pch file in the project source tree and modify it in the following way.

The following code shows how to modify the .pch file in the project source tree:

//
// Prefix header for all source files of the 'Example_MarkerBasedAR'
//

#import <Availability.h>

#ifndef __IPHONE_5_0
#warning "This project uses features only available in iOS SDK 5.0 and later."
#endif

#ifdef __cplusplus
#include <opencv2/opencv.hpp>
#endif

#ifdef __OBJC__
  #import <UIKit/UIKit.h>
  #import <Foundation/Foundation.h>
#endif

Now you can call any OpenCV function from any place in your project.

That’s all. Our project template is configured and we are ready to move further. Free advice: make a copy of this project; this will save you time when you are creating your next one!

Application architecture

Each iOS application contains at least one instance of the UIViewController interface that handles all view events and manages the application’s business logic. This class provides the fundamental view-management model for all iOS apps. A view controller manages a set of views that make up a portion of your app’s user interface. As part of the controller layer of your app, a view controller coordinates its efforts with model objects and other controller objects—including other view controllers—so your app presents a single coherent user interface.

The application that we are going to write will have only one view; that’s why we choose a Single-View Application template to create one. This view will be used to present the rendered picture. Our ViewController class will contain three major components that each AR application should have (see the next diagram):

  • Video source

  • Processing pipeline

  • Visualization engine

The video source is responsible for providing new frames taken from the built-in camera to the user code. This means that the video source should be capable of choosing a camera device (front- or back-facing camera), adjusting its parameters (such as resolution of the captured video, white balance, and shutter speed), and grabbing frames without freezing the main UI.

The image processing routine will be encapsulated in the MarkerDetector class. This class provides a very thin interface to user code, usually a set of functions such as processFrame and getResult. That is all that ViewController needs to know about it. We must not expose low-level data structures and algorithms to the view layer without strong necessity. VisualizationController contains all the logic concerned with visualization of the Augmented Reality on our view. VisualizationController is also a facade that hides a particular implementation of the rendering engine. Loose coupling between these components gives us the freedom to change any of them without rewriting the rest of the code.

Such an approach also gives you the freedom to use independent modules on other platforms and compilers. For example, you can easily use the MarkerDetector class to develop desktop applications on Mac, Windows, and Linux systems without any changes to the code. Likewise, you can decide to port VisualizationController to the Windows platform and use Direct3D for rendering. In this case you only need to write a new VisualizationController implementation; the other parts of the code will remain the same.

The main processing routine starts when a new frame is received from the video source. The video source informs the user code about this event with a callback. ViewController handles this callback and performs the following operations:

  1. Sends a new frame to the visualization controller.

  2. Performs processing of the new frame using our pipeline.

  3. Sends the detected markers to the visualization stage.

  4. Renders a scene.
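The four operations above can be sketched in plain C++ as a single per-frame handler. This is only an illustration of the call order: Frame, Marker, DetectorStub, VisualizationStub, and onFrameReady are hypothetical stand-ins invented for this sketch, not the classes from the sample project.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the article's types; only the call order matters here.
struct Frame { int width; int height; };
struct Marker { int id; };

// Records what the visualization side was asked to do.
struct VisualizationStub {
    std::vector<std::string> log;
    void updateBackground(const Frame&) { log.push_back("background"); }
    void setMarkers(const std::vector<Marker>& markers) {
        log.push_back("markers:" + std::to_string(markers.size()));
    }
    void drawFrame() { log.push_back("draw"); }
};

// Pretends to find exactly one marker in every frame.
struct DetectorStub {
    std::vector<Marker> found;
    void processFrame(const Frame&) { found.assign(1, Marker{42}); }
    const std::vector<Marker>& getResult() const { return found; }
};

// The per-frame routine: the four operations, in order.
void onFrameReady(const Frame& frame, DetectorStub& detector, VisualizationStub& view) {
    view.updateBackground(frame);           // 1. send the new frame to the visualization controller
    detector.processFrame(frame);           // 2. run the processing pipeline
    view.setMarkers(detector.getResult());  // 3. hand the detected markers over
    view.drawFrame();                       // 4. render the scene
}
```

Because the handler only talks to the two narrow interfaces, either side could be swapped for another implementation without touching the other.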

Let’s examine this routine in detail. The rendering of an AR scene includes drawing a background image that holds the content of the last received frame; artificial 3D objects are drawn later on top of it. When we send a new frame for visualization, we are copying image data to the internal buffers of the rendering engine. This is not actual rendering yet; we are just updating the texture with a new bitmap.

The second step is the processing of the new frame and marker detection. We pass our image as input and receive a list of the markers detected in it. These markers are passed to the visualization controller, which knows how to deal with them. Let’s take a look at the following sequence diagram where this routine is shown:

We start development by writing a video capture component. This class will be responsible for all frame grabbing and for sending notifications of captured frames via user callback. Later on we will write a marker detection algorithm. This detection routine is the core of your application. In this part of our program we will use a lot of OpenCV functions to process images, detect contours on them, find marker rectangles, and estimate their position. After that we will concentrate on visualization of our results using Augmented Reality. After bringing all these things together we will complete our first AR application. So let’s move on!

Accessing the camera

An Augmented Reality application is impossible to create without two major things: video capturing and AR visualization. The video capture stage consists of receiving frames from the device camera, performing the necessary color conversion, and sending them to the processing pipeline. As the single frame processing time is so critical to AR applications, the capture process should be as efficient as possible. The best way to achieve maximum performance is to have direct access to the frames received from the camera. This became possible starting from iOS Version 4. Existing APIs from the AVFoundation framework provide the necessary functionality to read directly from image buffers in memory.

You can find a lot of examples that use the AVCaptureVideoPreviewLayer class and the UIGetScreenImage function to capture videos from the camera. This technique was used for iOS Version 3 and earlier. It has now become outdated and has two major disadvantages:

  • Lack of direct access to frame data. To get a bitmap, you have to create an intermediate instance of UIImage, copy an image to it, and get it back. For AR applications this price is too high, because each millisecond matters. Losing a few frames per second (FPS) significantly decreases overall user experience.

  • To draw an AR, you have to add a transparent overlay view that will present the AR. Referring to Apple guidelines, you should avoid non-opaque layers because their blending is hard for mobile processors.

Classes AVCaptureDevice and AVCaptureVideoDataOutput allow you to configure, capture, and specify unprocessed video frames in 32 bpp BGRA format. Also you can set up the desired resolution of output frames. However, it does affect overall performance since the larger the frame the more processing time and memory is required.
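To make the cost of resolution concrete, here is a small sketch that computes the memory needed for one tightly packed 32 bpp frame. The function name bgraFrameBytes is made up for this example, and real capture buffers may add padding at the end of each row, so the result should be read as a lower bound.

```cpp
#include <cstddef>

// Bytes needed for one tightly packed 32 bpp BGRA frame (4 bytes per pixel).
// Real capture buffers may add row padding, so treat this as a lower bound.
constexpr std::size_t bgraFrameBytes(std::size_t width, std::size_t height) {
    return width * height * 4;
}
```

For example, a 640 x 480 frame needs 1,228,800 bytes, while a 1280 x 720 frame needs 3,686,400 bytes, three times as much data to copy and process for every frame.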

There is a good alternative for high-performance video capture. The AVFoundation API offers a much faster and more elegant way to grab frames directly from the camera. But first, let’s take a look at the following figure where the capturing process for iOS is shown:

AVCaptureSession is a root capture object that we should create. Capture session requires two components—an input and an output. The input device can either be a physical device (camera) or a video file (not shown in diagram). In our case it’s a built-in camera (front or back). The output device can be presented by one of the following interfaces:

  • AVCaptureMovieFileOutput

  • AVCaptureStillImageOutput

  • AVCaptureVideoPreviewLayer

  • AVCaptureVideoDataOutput

The AVCaptureMovieFileOutput interface is used to record video to a file, the AVCaptureStillImageOutput interface is used to take still images, and the AVCaptureVideoPreviewLayer interface is used to play a video preview on the screen. We are interested in the AVCaptureVideoDataOutput interface because it gives you direct access to video data.

The iOS platform is built on top of the Objective-C programming language. So to work with AVFoundation framework, our class also has to be written in Objective-C. In this section all code listings are in the Objective-C++ language.

To encapsulate the video capturing process, we create the VideoSource interface as shown by the following code:

@protocol VideoSourceDelegate<NSObject>

-(void)frameReady:(BGRAVideoFrame) frame;

@end

@interface VideoSource : NSObject<AVCaptureVideoDataOutputSampleBufferDelegate>
{
}

@property (nonatomic, retain) AVCaptureSession *captureSession;
@property (nonatomic, retain) AVCaptureDeviceInput *deviceInput;
@property (nonatomic, retain) id<VideoSourceDelegate> delegate;

- (bool) startWithDevicePosition:(AVCaptureDevicePosition) devicePosition;
- (CameraCalibration) getCalibration;
- (CGSize) getFrameSize;

@end

In the frame capture callback, we lock the image buffer to prevent modifications by any new frames and obtain a pointer to the image data along with the frame dimensions. Then we construct a temporary BGRAVideoFrame object that is passed to the outside via a special delegate. This delegate has the following prototype:

@protocol VideoSourceDelegate<NSObject>

-(void)frameReady:(BGRAVideoFrame) frame;

@end

Through VideoSourceDelegate, the VideoSource interface informs the user code that a new frame is available.

The step-by-step guide for the initialization of video capture is listed as follows:

  1. Create an instance of AVCaptureSession and set the capture session quality preset.

  2. Choose and create AVCaptureDevice. You can choose the front- or back-facing camera or use the default one.

  3. Initialize AVCaptureDeviceInput using the created capture device and add it to the capture session.

  4. Create an instance of AVCaptureVideoDataOutput and initialize it with the video frame format, the callback delegate, and the dispatch queue.

  5. Add the capture output to the capture session object.

  6. Start the capture session.

Let’s explain some of these steps in more detail. After creating the capture session, we can specify the desired quality preset to ensure that we will obtain optimal performance. We don’t need to process HD-quality video, so 640 x 480 or an even lesser frame resolution is a good choice:

- (id)init
{
  if ((self = [super init]))
  {
    AVCaptureSession * capSession = [[AVCaptureSession alloc] init];

    if ([capSession canSetSessionPreset:AVCaptureSessionPreset640x480])
    {
      [capSession setSessionPreset:AVCaptureSessionPreset640x480];
      NSLog(@"Set capture session preset AVCaptureSessionPreset640x480");
    }
    else if ([capSession canSetSessionPreset:AVCaptureSessionPresetLow])
    {
      [capSession setSessionPreset:AVCaptureSessionPresetLow];
      NSLog(@"Set capture session preset AVCaptureSessionPresetLow");
    }

    self.captureSession = capSession;
  }
  return self;
}

Always check hardware capabilities using the appropriate API; there is no guarantee that every camera will be capable of setting a particular session preset.

After creating the capture session, we should add the capture input—the instance of AVCaptureDeviceInput will represent a physical camera device. The cameraWithPosition function is a helper function that returns the camera device for the requested position (front, back, or default):

- (bool) startWithDevicePosition:(AVCaptureDevicePosition)devicePosition
{
  AVCaptureDevice *videoDevice = [self cameraWithPosition:devicePosition];

  if (!videoDevice)
    return FALSE;

  {
    // Start from nil so the pointer is never read uninitialized
    NSError *error = nil;

    AVCaptureDeviceInput *videoIn = [AVCaptureDeviceInput deviceInputWithDevice:videoDevice error:&error];
    self.deviceInput = videoIn;

    // A nil result means the input could not be created
    if (videoIn)
    {
      if ([[self captureSession] canAddInput:videoIn])
      {
        [[self captureSession] addInput:videoIn];
      }
      else
      {
        NSLog(@"Couldn't add video input");
        return FALSE;
      }
    }
    else
    {
      NSLog(@"Couldn't create video input");
      return FALSE;
    }
  }

  [self addRawViewOutput];
  [self.captureSession startRunning];
  return TRUE;
}

Please notice the error handling code. Taking care of return values is good practice for something as important as hardware setup. Without this, your code can crash in unexpected cases without informing the user what has happened.

We created a capture session and added a source of the video frames. Now it’s time to add a receiver, an object that will receive the actual frame data. The AVCaptureVideoDataOutput class is used to process uncompressed frames from the video stream. The camera can provide frames in BGRA or YCbCr (YUV) pixel formats. For our purposes the BGRA format fits best, as we will use this frame for both visualization and image processing. The following code shows the addRawViewOutput function:

- (void) addRawViewOutput
{
  /* We set up the output */
  AVCaptureVideoDataOutput *captureOutput = [[AVCaptureVideoDataOutput alloc] init];

  /* While a frame is processed in the -captureOutput:didOutputSampleBuffer:fromConnection: delegate method,
     no other frames are added to the queue.
     If you don't want this behaviour, set the property to NO */
  captureOutput.alwaysDiscardsLateVideoFrames = YES;

  /* We create a serial queue to handle the processing of our frames */
  dispatch_queue_t queue;
  queue = dispatch_queue_create("com.Example_MarkerBasedAR.cameraQueue", NULL);
  [captureOutput setSampleBufferDelegate:self queue:queue];
  dispatch_release(queue);

  // Set the video output to store frames in BGRA (it is supposed to be faster)
  NSString* key = (NSString*)kCVPixelBufferPixelFormatTypeKey;
  NSNumber* value = [NSNumber numberWithUnsignedInt:kCVPixelFormatType_32BGRA];
  NSDictionary* videoSettings = [NSDictionary dictionaryWithObject:value forKey:key];
  [captureOutput setVideoSettings:videoSettings];

  // Register an output
  [self.captureSession addOutput:captureOutput];
}

Now the capture session is finally configured. When started, it will capture frames from the camera and send them to user code. When a new frame is available, the captureOutput:didOutputSampleBuffer:fromConnection: delegate callback is invoked. In this function, we will perform a minor data conversion operation to get the image data in a more usable format and pass it to user code:

- (void)captureOutput:(AVCaptureOutput *)captureOutput
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection
{
  // Get an image buffer holding the video frame
  CVImageBufferRef imageBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);

  // Lock the image buffer
  CVPixelBufferLockBaseAddress(imageBuffer,0);

  // Get information about the image
  uint8_t *baseAddress = (uint8_t *)CVPixelBufferGetBaseAddress(imageBuffer);
  size_t width = CVPixelBufferGetWidth(imageBuffer);
  size_t height = CVPixelBufferGetHeight(imageBuffer);
  size_t stride = CVPixelBufferGetBytesPerRow(imageBuffer);

  BGRAVideoFrame frame = {width, height, stride, baseAddress};
  [delegate frameReady:frame];

  /* We unlock the image buffer */
  CVPixelBufferUnlockBaseAddress(imageBuffer,0);
}

We obtain a reference to the image buffer that stores our frame data. Then we lock it to prevent modifications by new frames. Now we have exclusive access to the frame data. With the help of the CoreVideo API, we get the image dimensions, the stride (number of bytes per row), and the pointer to the beginning of the image data.
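Since the stride is measured in bytes and may be larger than width x 4 because of row padding, pixel addresses should be computed from the stride rather than from the width. The sketch below shows the arithmetic. BGRAFrameView is a hypothetical struct whose field order is only inferred from the initializer {width, height, stride, baseAddress} in the callback, and pixelAt is a helper invented for this example.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical frame description; layout inferred from the callback's initializer.
struct BGRAFrameView {
    std::size_t width;    // pixels per row
    std::size_t height;   // number of rows
    std::size_t stride;   // bytes per row, may exceed width * 4
    std::uint8_t* data;   // first byte of the first row
};

// Returns a pointer to the 4 bytes (B, G, R, A) of pixel (x, y).
inline std::uint8_t* pixelAt(const BGRAFrameView& frame, std::size_t x, std::size_t y) {
    return frame.data + y * frame.stride + x * 4;
}
```

With a 2-pixel-wide frame and a stride of 12 bytes, the second row starts at byte offset 12, not 8, which is exactly the mistake that width-based indexing would make.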

Pay attention to the CVPixelBufferLockBaseAddress/CVPixelBufferUnlockBaseAddress function calls in the callback code. While we hold a lock on the pixel buffer, the consistency and correctness of its data are guaranteed. Reading of pixels is available only after you have obtained a lock. When you’re done, don’t forget to unlock it to allow the OS to fill it with new data.

Marker detection

A marker is usually designed as a rectangular image holding black and white areas inside it. Because of these known constraints, the marker detection procedure is a simple one. First we need to find closed contours in the input image, then unwarp the image inside each contour to a rectangle, and finally check the result against our marker model.

In this sample the 5 x 5 marker will be used. Here is what it looks like:

In the sample project that you will find in this book, the marker detection routine is encapsulated in the MarkerDetector class:

/**
 * A top-level class that encapsulates the marker detector algorithm
 */
class MarkerDetector
{
public:

  /**
   * Initialize a new instance of marker detector object
   * @calibration[in] - Camera calibration necessary for pose estimation.
   */
  MarkerDetector(CameraCalibration calibration);

  void processFrame(const BGRAVideoFrame& frame);

  const std::vector<Transformation>& getTransformations() const;

protected:
  bool findMarkers(const BGRAVideoFrame& frame, std::vector<Marker>& detectedMarkers);

  void prepareImage(const cv::Mat& bgraMat, cv::Mat& grayscale);

  void performThreshold(const cv::Mat& grayscale, cv::Mat& thresholdImg);

  void findContours(const cv::Mat& thresholdImg,
                    std::vector<std::vector<cv::Point> >& contours,
                    int minContourPointsAllowed);

  void findMarkerCandidates(const std::vector<std::vector<cv::Point> >& contours,
                            std::vector<Marker>& detectedMarkers);

  void detectMarkers(const cv::Mat& grayscale, std::vector<Marker>& detectedMarkers);

  void estimatePosition(std::vector<Marker>& detectedMarkers);

private:
};

To help you better understand the marker detection routine, a step-by-step processing on one frame from a video will be shown. A source image taken from an iPad camera will be used as an example:

Marker identification

Here is the workflow of the marker detection routine:

  1. Convert the input image to grayscale.

  2. Perform binary threshold operation.

  3. Detect contours.

  4. Search for possible markers.

  5. Detect and decode markers.

  6. Estimate marker 3D pose.

Grayscale conversion

The conversion to grayscale is necessary because markers usually contain only black and white blocks and it’s much easier to operate with them on grayscale images. Fortunately, OpenCV color conversion is simple enough.

Please take a look at the following code listing in C++:

void MarkerDetector::prepareImage(const cv::Mat& bgraMat, cv::Mat& grayscale)
{
  // Convert to grayscale
  cv::cvtColor(bgraMat, grayscale, CV_BGRA2GRAY);
}

This function will convert the input BGRA image to grayscale (it will allocate image buffers if necessary) and place the result into the second argument. All further steps will be performed with the grayscale image.
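To show what such a conversion computes per pixel, here is a minimal standard-library sketch using the common luma weights Y = 0.299 R + 0.587 G + 0.114 B, which are the weights documented for OpenCV's RGB-to-gray conversion. The function names are invented for this example, the buffer is assumed to be tightly packed, and OpenCV's optimized implementation may round slightly differently.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Weighted luma conversion for one pixel: Y = 0.299 R + 0.587 G + 0.114 B.
inline std::uint8_t bgraToGray(std::uint8_t b, std::uint8_t g, std::uint8_t r) {
    return static_cast<std::uint8_t>(std::lround(0.299 * r + 0.587 * g + 0.114 * b));
}

// Converts a tightly packed BGRA buffer (4 bytes per pixel, no row padding)
// to one gray byte per pixel. The alpha channel is ignored.
std::vector<std::uint8_t> toGrayscale(const std::vector<std::uint8_t>& bgra) {
    std::vector<std::uint8_t> gray(bgra.size() / 4);
    for (std::size_t i = 0; i < gray.size(); ++i) {
        gray[i] = bgraToGray(bgra[4 * i], bgra[4 * i + 1], bgra[4 * i + 2]);
    }
    return gray;
}
```

Note the channel order: in a BGRA buffer the first byte of each pixel is blue, so green contributes the most to the result and blue the least.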

Image binarization

The binarization operation will transform each pixel of our image to black (zero intensity) or white (full intensity). This step is required to find contours. There are several threshold methods; each has strong and weak sides. The easiest and fastest method is absolute threshold. In this method the resulting value depends on current pixel intensity and some threshold value. If pixel intensity is greater than the threshold value, the result will be white (255); otherwise it will be black (0).

This method has a huge disadvantage: it depends on lighting conditions and soft intensity changes. A preferable method is the adaptive threshold. The major difference is that this method compares each pixel against all pixels within a given radius around it. Using the average intensity of that neighborhood as the threshold gives good results and makes corner detection more robust.
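The difference between the two methods can be seen in a small standard-library sketch. It is a simplification under stated assumptions: the adaptive variant uses a plain unweighted mean over a square window clipped at the image border, both functions use the inverted convention (dark pixels become white), and the names Image, absoluteThresholdInv, and adaptiveMeanThresholdInv are invented for this example. The image is assumed to be rectangular.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Image = std::vector<std::vector<std::uint8_t>>;  // rows of gray pixels

// Absolute (global) inverted threshold: pixels at or below `threshold` become white.
Image absoluteThresholdInv(const Image& src, int threshold) {
    Image dst = src;
    for (auto& row : dst)
        for (auto& px : row)
            px = static_cast<std::uint8_t>(px <= threshold ? 255 : 0);
    return dst;
}

// Adaptive inverted threshold using the unweighted mean of a square window of the
// given radius, clipped at the image border. A pixel becomes white when it is at
// least `c` darker than its local mean.
Image adaptiveMeanThresholdInv(const Image& src, int radius, int c) {
    Image dst = src;
    const int rows = static_cast<int>(src.size());
    for (int y = 0; y < rows; ++y) {
        const int cols = static_cast<int>(src[y].size());
        for (int x = 0; x < cols; ++x) {
            long sum = 0, count = 0;
            for (int wy = y - radius; wy <= y + radius; ++wy) {
                if (wy < 0 || wy >= rows) continue;
                for (int wx = x - radius; wx <= x + radius; ++wx) {
                    if (wx < 0 || wx >= cols) continue;
                    sum += src[wy][wx];
                    ++count;
                }
            }
            // Same test as pixel <= mean - c, written without a division.
            const bool dark = src[y][x] * count <= sum - c * count;
            dst[y][x] = static_cast<std::uint8_t>(dark ? 255 : 0);
        }
    }
    return dst;
}
```

Consider a single row whose background fades from 200 to 110 with two dark marks: {200, 190, 120, 170, 160, 150, 140, 70, 120, 110}. A global threshold of 125 also marks the last two background pixels because they sit in the darker part of the gradient, while the adaptive version with radius 1 and c = 7 marks only the two dark marks.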

The following code snippet shows the performThreshold function of MarkerDetector:

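void MarkerDetector::performThreshold(const cv::Mat& grayscale, cv::Mat& thresholdImg)
{
  cv::adaptiveThreshold(grayscale,   // Input image
    thresholdImg,                    // Result binary image
    255,                             // Value assigned to pixels that pass the test
    cv::ADAPTIVE_THRESH_GAUSSIAN_C,  // Adaptive method (Gaussian-weighted neighborhood)
    cv::THRESH_BINARY_INV,           // Threshold type (inverted binary)
    7,                               // Neighborhood block size in pixels
    7                                // Constant subtracted from the weighted mean
  );
}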

After applying adaptive threshold to the input image, the resulting image looks similar to the following one:

Each marker usually looks like a square figure with black and white areas inside it. So the best way to locate a marker is to find closed contours and approximate them with polygons of 4 vertices.
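Polygon approximation of this kind is commonly done with the Douglas-Peucker algorithm, which is also what OpenCV documents for cv::approxPolyDP. The following is a minimal sketch of the idea for an open polyline only (closed contours need extra handling that is omitted here); Pt, distanceToLine, simplifyRange, and simplifyPolyline are names invented for this example.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x, y; };

// Perpendicular distance from p to the line through a and b
// (falls back to the distance to a when a and b coincide).
double distanceToLine(Pt p, Pt a, Pt b) {
    const double dx = b.x - a.x, dy = b.y - a.y;
    const double len = std::hypot(dx, dy);
    if (len == 0.0) return std::hypot(p.x - a.x, p.y - a.y);
    return std::fabs(dx * (p.y - a.y) - dy * (p.x - a.x)) / len;
}

// Appends the simplified points after pts[first], up to and including pts[last].
void simplifyRange(const std::vector<Pt>& pts, std::size_t first, std::size_t last,
                   double eps, std::vector<Pt>& out) {
    double maxDist = 0.0;
    std::size_t index = first;
    for (std::size_t i = first + 1; i < last; ++i) {
        const double d = distanceToLine(pts[i], pts[first], pts[last]);
        if (d > maxDist) { maxDist = d; index = i; }
    }
    if (maxDist > eps) {
        // The farthest point is significant: keep it and recurse on both halves.
        simplifyRange(pts, first, index, eps, out);
        simplifyRange(pts, index, last, eps, out);
    } else {
        // Everything in between is within eps of the chord: drop it.
        out.push_back(pts[last]);
    }
}

std::vector<Pt> simplifyPolyline(const std::vector<Pt>& pts, double eps) {
    if (pts.size() < 3) return pts;
    std::vector<Pt> out;
    out.push_back(pts.front());
    simplifyRange(pts, 0, pts.size() - 1, eps, out);
    return out;
}
```

For instance, the slightly noisy L-shape (0,0), (5,0.1), (10,0), (10.1,5), (10,10) collapses to its three corner points with eps = 0.5. A clean marker outline behaves the same way and collapses to four vertices.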

Contours detection

The cv::findContours function will detect contours on the input binary image:

void MarkerDetector::findContours(const cv::Mat& thresholdImg,
                                  std::vector<std::vector<cv::Point> >& contours,
                                  int minContourPointsAllowed)
{
  std::vector< std::vector<cv::Point> > allContours;
  cv::findContours(thresholdImg, allContours, CV_RETR_LIST, CV_CHAIN_APPROX_NONE);

  contours.clear();
  for (size_t i=0; i<allContours.size(); i++)
  {
    int contourSize = allContours[i].size();
    if (contourSize > minContourPointsAllowed)
    {
      contours.push_back(allContours[i]);
    }
  }
}

The output of this function is a list of polygons in which each polygon represents a single contour. The function skips contours whose number of points (which, for unapproximated contours, roughly equals the perimeter in pixels) does not exceed the value of the minContourPointsAllowed variable. This is because we are not interested in small contours. (They will probably contain no marker, or the marker will be too small to be detected.)

The following figure shows the visualization of detected contours:

Candidates search

After finding contours, the polygon approximation stage is performed. This is done to decrease the number of points that describe the contour shape. It is a good quality check for filtering out areas without markers, because a marker can always be represented with a polygon that contains four vertices. If the approximated polygon has more or fewer than four vertices, it is definitely not what we are looking for. The following code implements this idea:

void MarkerDetector::findCandidates
(
  const ContoursVector& contours,
  std::vector<Marker>& detectedMarkers
)
{
  std::vector<cv::Point> approxCurve;
  std::vector<Marker> possibleMarkers;

  // For each contour, analyze if it is a quadrilateral likely to be the marker
  for (size_t i=0; i<contours.size(); i++)
  {
    // Approximate to a polygon
    double eps = contours[i].size() * 0.05;
    cv::approxPolyDP(contours[i], approxCurve, eps, true);

    // We are interested only in polygons that contain exactly four points
    if (approxCurve.size() != 4)
      continue;

    // And they have to be convex
    if (!cv::isContourConvex(approxCurve))
      continue;

    // Ensure that the distance between consecutive points is large enough
    float minDist = std::numeric_limits<float>::max();
    for (int k = 0; k < 4; k++)
    {
      cv::Point side = approxCurve[k] - approxCurve[(k+1)%4];
      float squaredSideLength = side.dot(side);
      minDist = std::min(minDist, squaredSideLength);
    }

    // Check that distance is not very small
    if (minDist < m_minContourLengthAllowed)
      continue;

    // All tests are passed. Save marker candidate:
    Marker m;
    for (int k = 0; k<4; k++)
      m.points.push_back( cv::Point2f(approxCurve[k].x,approxCurve[k].y) );

    // Sort the points in anti-clockwise order
    // Trace a line between the first and second point.
    // If the third point is at the right side, then the points are anti-clockwise
    cv::Point v1 = m.points[1] - m.points[0];
    cv::Point v2 = m.points[2] - m.points[0];
    double o = (v1.x * v2.y) - (v1.y * v2.x);

    if (o < 0.0) // if the third point is on the left side, then sort in anti-clockwise order
      std::swap(m.points[1], m.points[3]);

    possibleMarkers.push_back(m);
  }

  // Remove the elements whose corners are too close to each other.
  // First detect candidates for removal:
  std::vector< std::pair<int,int> > tooNearCandidates;
  for (size_t i=0;i<possibleMarkers.size();i++)
  {
    const Marker& m1 = possibleMarkers[i];

    // Calculate the average distance of each corner to the nearest corner of the other marker candidate
    for (size_t j=i+1;j<possibleMarkers.size();j++)
    {
      const Marker& m2 = possibleMarkers[j];
      float distSquared = 0;
      for (int c = 0; c < 4; c++)
      {
        cv::Point v = m1.points[c] - m2.points[c];
        distSquared += v.dot(v);
      }
      distSquared /= 4;

      if (distSquared < 100)
      {
        tooNearCandidates.push_back(std::pair<int,int>(i,j));
      }
    }
  }

  // Mark for removal the element of the pair with smaller perimeter
  std::vector<bool> removalMask (possibleMarkers.size(), false);
  for (size_t i=0; i<tooNearCandidates.size(); i++)
  {
    float p1 = perimeter(possibleMarkers[tooNearCandidates[i].first ].points);
    float p2 = perimeter(possibleMarkers[tooNearCandidates[i].second].points);

    size_t removalIndex;
    if (p1 > p2)
      removalIndex = tooNearCandidates[i].second;
    else
      removalIndex = tooNearCandidates[i].first;

    removalMask[removalIndex] = true;
  }

  // Return candidates
  detectedMarkers.clear();
  for (size_t i=0;i<possibleMarkers.size();i++)
  {
    if (!removalMask[i])
      detectedMarkers.push_back(possibleMarkers[i]);
  }
}
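The listing relies on a perimeter helper that is not shown, and on a cross-product test to give all candidates the same corner order. The sketch below shows one plausible way to write both with the standard library only. Point2, perimeter, cross, and normalizeWinding are written for this illustration and are assumptions, not the sample project's own code.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Point2 { float x, y; };

// Sum of the side lengths of a closed polygon.
float perimeter(const std::vector<Point2>& pts) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const Point2& a = pts[i];
        const Point2& b = pts[(i + 1) % pts.size()];
        sum += std::hypot(b.x - a.x, b.y - a.y);
    }
    return sum;
}

// z-component of the cross product (p1 - p0) x (p2 - p0); its sign encodes the winding.
float cross(const Point2& p0, const Point2& p1, const Point2& p2) {
    return (p1.x - p0.x) * (p2.y - p0.y) - (p1.y - p0.y) * (p2.x - p0.x);
}

// Gives every quadrilateral the same winding by swapping corners 1 and 3 when needed.
void normalizeWinding(std::vector<Point2>& quad) {
    if (quad.size() == 4 && cross(quad[0], quad[1], quad[2]) < 0.0f)
        std::swap(quad[1], quad[3]);
}
```

Swapping corners 1 and 3 reverses the traversal direction while keeping corner 0 in place, so after normalization the cross product is non-negative for every candidate. A consistent corner order matters later, when the corners are matched against the reference marker corners.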

Now we have obtained a list of quadrilaterals that are likely to be the markers. To verify whether they are markers or not, we need to perform three steps:

  1. First, we should remove the perspective projection so as to obtain a frontal view of the rectangle area.

  2. Then we perform thresholding of the image using the Otsu algorithm. This algorithm assumes a bimodal distribution and finds the threshold value that maximizes the between-class (inter-class) variance, which is equivalent to minimizing the within-class (intra-class) variance.

  3. Finally we perform identification of the marker code. If it is a marker, it has an internal code. The marker is divided into a 7 x 7 grid, of which the internal 5 x 5 cells contain ID information. The rest correspond to the external black border. Here, we first check whether the external black border is present. Then we read the internal 5 x 5 cells and check if they provide a valid code. (It might be required to rotate the code to get the valid one.)
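The border check and the rotation mentioned in the last step can be sketched on a plain grid of cells. This example assumes the 7 x 7 cells have already been read into an array where 1 means white and 0 means black; Grid7, Grid5, hasBlackBorder, innerCode, and rotate90 are names invented for this illustration. How a 5 x 5 code is validated is not covered here.

```cpp
#include <array>
#include <cstddef>

using Grid7 = std::array<std::array<int, 7>, 7>;  // 1 = white cell, 0 = black cell
using Grid5 = std::array<std::array<int, 5>, 5>;

// True when every cell on the outer ring of the 7 x 7 grid is black.
bool hasBlackBorder(const Grid7& cells) {
    for (std::size_t i = 0; i < 7; ++i) {
        if (cells[0][i] || cells[6][i] || cells[i][0] || cells[i][6]) return false;
    }
    return true;
}

// Copies the inner 5 x 5 block that carries the ID bits.
Grid5 innerCode(const Grid7& cells) {
    Grid5 code{};
    for (std::size_t r = 0; r < 5; ++r)
        for (std::size_t c = 0; c < 5; ++c)
            code[r][c] = cells[r + 1][c + 1];
    return code;
}

// Rotates the code by 90 degrees clockwise (row r, column c moves to row c, column 4 - r).
Grid5 rotate90(const Grid5& in) {
    Grid5 out{};
    for (std::size_t r = 0; r < 5; ++r)
        for (std::size_t c = 0; c < 5; ++c)
            out[c][4 - r] = in[r][c];
    return out;
}
```

A candidate without a black border can be rejected immediately. Otherwise the inner code can be tested in its original orientation and after one, two, and three applications of rotate90, which covers all four ways the marker may face the camera.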

To get the rectangular marker image, we have to unwarp the input image using a perspective transformation. The transformation matrix can be calculated with the help of the cv::getPerspectiveTransform function, which finds the perspective transformation from four pairs of corresponding points. The first argument is the marker coordinates in image space and the second argument is the coordinates of the square marker image. The estimated transformation will bring the marker to square form and let us analyze it:

cv::Mat canonicalMarker;
Marker& marker = detectedMarkers[i];

// Find the perspective transformation that brings the current marker to rectangular form
cv::Mat M = cv::getPerspectiveTransform(marker.points, m_markerCorners2d);

// Transform image to get a canonical marker image
cv::warpPerspective(grayscale, canonicalMarker, M, markerSize);
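A perspective transformation is a 3 x 3 matrix applied in homogeneous coordinates: the point (x, y) is extended to (x, y, 1), multiplied by the matrix, and the result is divided by its third component. The following standard-library sketch shows that arithmetic for a single point; Mat3, Vec2, and applyPerspective are names made up for this example and are not OpenCV types.

```cpp
#include <array>

using Mat3 = std::array<std::array<double, 3>, 3>;
struct Vec2 { double x, y; };

// Applies a 3 x 3 perspective matrix to a 2D point using homogeneous coordinates.
Vec2 applyPerspective(const Mat3& m, Vec2 p) {
    const double xh = m[0][0] * p.x + m[0][1] * p.y + m[0][2];
    const double yh = m[1][0] * p.x + m[1][1] * p.y + m[1][2];
    const double w  = m[2][0] * p.x + m[2][1] * p.y + m[2][2];
    return { xh / w, yh / w };  // w == 0 would mean the point maps to infinity
}
```

The division by w is what separates a perspective transformation from an affine one: when the last row is (0, 0, 1), w is always 1 and the mapping reduces to rotation, scale, shear, and translation.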

Image warping transforms our image to a rectangle form using perspective transformation:

Now we can test the image to verify if it is a valid marker image. Then we try to extract the bit mask with the marker code. As we expect our marker to contain only black and white colors, we can perform Otsu thresholding to remove gray pixels and leave only black and white pixels:

//threshold image
cv::threshold(markerImage, markerImage, 125, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
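When the cv::THRESH_OTSU flag is set, OpenCV computes the threshold itself and uses it instead of the value passed in (125 here). As a rough illustration of what the Otsu method computes, the sketch below scans all 256 candidate thresholds over a histogram and keeps the one with the largest between-class variance. It is a textbook-style version written for this example (otsuThreshold is an invented name), not OpenCV's implementation.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the threshold t that maximizes the between-class variance when pixels
// are split into "at or below t" and "above t". Returns 0 for an empty input.
int otsuThreshold(const std::vector<std::uint8_t>& pixels) {
    std::array<double, 256> hist{};
    for (std::uint8_t p : pixels) hist[p] += 1.0;

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * hist[i];

    double weightLow = 0.0, sumLow = 0.0, bestVariance = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightLow += hist[t];
        sumLow += t * hist[t];
        if (weightLow == 0.0) continue;        // no pixels at or below t yet
        const double weightHigh = total - weightLow;
        if (weightHigh == 0.0) break;          // every pixel is at or below t
        const double meanLow = sumLow / weightLow;
        const double meanHigh = (sumAll - sumLow) / weightHigh;
        const double diff = meanLow - meanHigh;
        const double between = weightLow * weightHigh * diff * diff;
        if (between > bestVariance) { bestVariance = between; best = t; }
    }
    return best;
}
```

For an image made of two clearly separated intensity groups, such as the dark and light cells of an unwarped marker, the returned threshold falls between the two groups regardless of how bright the frame is overall.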

