In this article, Michael Beyeler, author of the book OpenCV with Python Blueprints, develops an app that detects and tracks simple hand gestures in real time using the output of a depth sensor, such as a Microsoft Kinect 3D sensor or an Asus Xtion. The app will analyze each captured frame to perform the following tasks:
- Hand region segmentation: The user’s hand region will be extracted in each frame by analyzing the depth map output of the Kinect sensor, which is done by thresholding, applying some morphological operations, and finding connected components
- Hand shape analysis: The shape of the segmented hand region will be analyzed by determining contours, convex hull, and convexity defects
- Hand gesture recognition: The number of extended fingers will be determined based on the hand contour’s convexity defects, and the gesture will be classified accordingly (with no extended finger corresponding to a fist, and five extended fingers corresponding to an open hand)
Gesture recognition is an ever-popular topic in computer science. This is because it not only enables humans to communicate with machines (human-machine interaction or HMI), but also constitutes the first step for machines to begin understanding human body language. With affordable sensors, such as Microsoft Kinect or Asus Xtion, and open source software such as OpenKinect and OpenNI, it has never been easier to get started in the field yourself. So what shall we do with all this technology?
The beauty of the algorithm that we are going to implement in this article is that it works well for a number of hand gestures, yet is simple enough to run in real time on a generic laptop. And if we want, we can easily extend it to incorporate more complicated hand pose estimations. The end product looks like this:
No matter how many fingers of my left hand I extend, the algorithm correctly segments the hand region (white), draws the corresponding convex hull (the green line surrounding the hand), finds all convexity defects that belong to the spaces between fingers (large green points) while ignoring others (small red points), and infers the correct number of extended fingers (the number in the bottom-right corner), even for a fist.
This article assumes that you have a Microsoft Kinect 3D sensor installed. Alternatively, you may use an Asus Xtion or any other depth sensor for which OpenCV has built-in support. First, install OpenKinect and libfreenect from http://www.openkinect.org/wiki/Getting_Started. Then, you need to build (or rebuild) OpenCV with OpenNI support. The GUI used in this article will again be designed with wxPython, which can be obtained from http://www.wxpython.org/download.php.
Planning the app
The final app will consist of the following modules and scripts:
- gestures: A module that consists of an algorithm for recognizing hand gestures. We separate this algorithm from the rest of the application so that it can be used as a standalone module without the need for a GUI.
- gestures.HandGestureRecognition: A class that implements the entire process flow of hand gesture recognition. It accepts a single-channel depth image (acquired from the Kinect depth sensor) and returns an annotated RGB color image with an estimated number of extended fingers.
- gui: A module that provides a wxPython GUI application to access the capture device and display the video feed. In order to have it access the Kinect depth sensor instead of a generic camera, we will have to extend some of the base class functionality.
- gui.BaseLayout: A generic layout from which more complicated layouts can be built.
Setting up the app
Before we can get to the nitty-gritty of our gesture recognition algorithm, we need to make sure that we can access the Kinect sensor and display a stream of depth frames in a simple GUI.
Accessing the Kinect 3D sensor
Accessing Microsoft Kinect from within OpenCV is not much different from accessing a computer’s webcam or camera device. The easiest way to integrate a Kinect sensor with OpenCV is by using an OpenKinect module called freenect. For installation instructions, see the preceding information box. The following code snippet grants access to the sensor using cv2.VideoCapture:
import cv2
import freenect

device = cv2.cv.CV_CAP_OPENNI
capture = cv2.VideoCapture(device)
On some platforms, the first call to cv2.VideoCapture fails to open a capture channel. In this case, we provide a workaround by opening the channel ourselves:
if not capture.isOpened():
    capture.open(device)
If you want to connect to your Asus Xtion, the device variable should be assigned the cv2.cv.CV_CAP_OPENNI_ASUS value instead.
In order to give our app a fair chance to run in real time, we will limit the frame size to 640 x 480 pixels:
capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)
If you are using OpenCV 3, the constants you are looking for might be called cv2.CAP_PROP_FRAME_WIDTH and cv2.CAP_PROP_FRAME_HEIGHT.
The read() method of cv2.VideoCapture is inappropriate when we need to synchronize a set of cameras or a multihead camera, such as a Kinect. In this case, we should use the grab() and retrieve() methods instead. An even easier way when working with OpenKinect is to use the sync_get_depth() and sync_get_video() methods.
For the purpose of this article, we will need only the Kinect’s depth map, which is a single-channel (grayscale) image in which each pixel value is the estimated distance from the camera to a particular surface in the visual scene. The latest frame can be grabbed via this code:
depth, timestamp = freenect.sync_get_depth()
The preceding code returns both the depth map and a timestamp. We will ignore the latter for now. By default, the map is in 11-bit format, which cannot be displayed with cv2.imshow right away. Thus, it is a good idea to convert the image to 8-bit precision first.
In order to reduce the range of depth values in the frame, we will clip the maximal distance to a value of 1,023 (or 2**10-1). This will get rid of values that correspond either to noise or distances that are far too large to be of interest to us:
np.clip(depth, 0, 2**10-1, depth)
depth >>= 2
Then, we convert the image into 8-bit format and display it:
depth = depth.astype(np.uint8)
cv2.imshow("depth", depth)
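The conversion can also be tried without a sensor attached. The following is a minimal sketch that applies the same clip-and-shift steps to a synthetic array standing in for the output of freenect.sync_get_depth(). The helper name depth_to_uint8 and the sample values are assumptions made for illustration only:

```python
import numpy as np

def depth_to_uint8(depth):
    """Clip an 11-bit depth map to 10 bits and scale it down to 8 bits."""
    depth = depth.copy()                  # leave the caller's array untouched
    np.clip(depth, 0, 2**10 - 1, depth)   # in-place clip to [0, 1023]
    depth >>= 2                           # 1023 >> 2 == 255
    return depth.astype(np.uint8)

# synthetic stand-in for a raw Kinect depth frame (11-bit values)
raw = np.array([[0, 512, 1023], [1024, 1500, 2047]], dtype=np.uint16)
converted = depth_to_uint8(raw)
print(converted)
```

Every raw value above 1,023 ends up at 255, so anything beyond the clipping distance collapses into a single brightness level.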
Running the app
In order to run our app, we will need to execute a main function routine that accesses the Kinect, generates the GUI, and executes the main loop of the app:
import numpy as np
import wx
import cv2
import freenect

from gui import BaseLayout
from gestures import HandGestureRecognition

def main():
    device = cv2.cv.CV_CAP_OPENNI
    capture = cv2.VideoCapture()
    if not(capture.isOpened()):
        capture.open(device)

    capture.set(cv2.cv.CV_CAP_PROP_FRAME_WIDTH, 640)
    capture.set(cv2.cv.CV_CAP_PROP_FRAME_HEIGHT, 480)
We will design a suitable layout (KinectLayout) for the current project:
    # start graphical user interface
    app = wx.App()
    layout = KinectLayout(None, -1, 'Kinect Hand Gesture Recognition', capture)
    layout.Show(True)
    app.MainLoop()

if __name__ == '__main__':
    main()
The Kinect GUI
The layout chosen for the current project (KinectLayout) is as plain as it gets. It should simply display the live stream of the Kinect depth sensor at a comfortable frame rate of 10 frames per second. Therefore, there is no need to further customize BaseLayout:
class KinectLayout(BaseLayout):
    def _create_custom_layout(self):
        pass
The only parameter that needs to be initialized this time is the recognition class. This will be useful in just a moment:
    def _init_custom_layout(self):
        self.hand_gestures = HandGestureRecognition()
Instead of reading a regular camera frame, we need to acquire a depth frame via the freenect method sync_get_depth(). This can be achieved by overriding the following method:
As mentioned earlier, by default, this function returns a single-channel depth image with 11-bit precision and a timestamp. However, we are not interested in the timestamp, and we simply pass on the frame if the acquisition was successful:
        frame, _ = freenect.sync_get_depth()
        # return success if a frame was acquired
        if frame is not None:
            return (True, frame)
        else:
            return (False, frame)
The rest of the visualization pipeline is handled by the BaseLayout class. We only need to make sure that we provide a _process_frame method. This method accepts a depth image with 11-bit precision, processes it, and returns an annotated 8-bit RGB color image. Conversion to a regular grayscale image is the same as mentioned in the previous subsection:
    def _process_frame(self, frame):
        # clip max depth to 1023, convert to 8-bit grayscale
        np.clip(frame, 0, 2**10 - 1, frame)
        frame >>= 2
        frame = frame.astype(np.uint8)
The resulting grayscale image can then be passed to the hand gesture recognizer, which will return the estimated number of extended fingers (num_fingers) and the annotated RGB color image mentioned earlier (img_draw):
num_fingers, img_draw = self.hand_gestures.recognize(frame)
In order to simplify the segmentation task of the HandGestureRecognition class, we will instruct the user to place their hand in the center of the screen. To provide a visual aid for this, let’s draw a rectangle around the image center and highlight the center pixel of the image in orange:
        height, width = frame.shape[:2]
        cv2.circle(img_draw, (width // 2, height // 2), 3, [255, 102, 0], 2)
        cv2.rectangle(img_draw, (width // 3, height // 3),
                      (width * 2 // 3, height * 2 // 3), [255, 102, 0], 2)
In addition, we print num_fingers on the screen:
        cv2.putText(img_draw, str(num_fingers), (30, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255))
        return img_draw
Tracking hand gestures in real time
The bulk of the work is done by the HandGestureRecognition class, especially by its recognize method. This class starts off with a few parameter initializations, which will be explained and used later:
class HandGestureRecognition:
    def __init__(self):
        # maximum depth deviation for a pixel to be considered
        # within range
        self.abs_depth_dev = 14

        # cut-off angle (deg): everything below this is a convexity
        # point that belongs to two extended fingers
        self.thresh_deg = 80.0
The recognize method is where the real magic takes place. This method handles the entire process flow, from the raw grayscale image all the way to a recognized hand gesture. It implements the following procedure:
- It extracts the user’s hand region by analyzing the depth map (img_gray) and returning a hand region mask (segment):
    def recognize(self, img_gray):
        segment = self._segment_arm(img_gray)
- It performs contour analysis on the hand region mask (segment). Then, it returns the largest contour area found in the image (contours) and any convexity defects (defects):
[contours, defects] = self._find_hull_defects(segment)
- Based on the contours found and the convexity defects, it detects the number of extended fingers (num_fingers) in the image. Then, it annotates the output image (img_draw) with contours, defect points, and the number of extended fingers:
        img_draw = cv2.cvtColor(img_gray, cv2.COLOR_GRAY2RGB)
        [num_fingers, img_draw] = self._detect_num_fingers(contours, defects, img_draw)
- It returns the estimated number of extended fingers (num_fingers) as well as the annotated output image (img_draw):
return (num_fingers, img_draw)
Hand region segmentation
The automatic detection of an arm, and later the hand region, could be designed to be arbitrarily complicated, maybe by combining information about the shape and color of an arm or hand. However, using skin color as a determining feature to find hands in visual scenes might fail terribly in poor lighting conditions or when the user is wearing gloves. Instead, we choose to recognize the user’s hand by its shape in the depth map. Allowing hands of all sorts to be present in any region of the image unnecessarily complicates the mission of this article, so we make two simplifying assumptions:
- We will instruct the user of our app to place their hand in front of the center of the screen, orienting their palm roughly parallel to the orientation of the Kinect sensor so that it is easier to identify the corresponding depth layer of the hand.
- We will also instruct the user to sit roughly 1 to 2 meters away from the Kinect, and to slightly extend their arm in front of their body so that the hand will end up in a slightly different depth layer than the arm. However, the algorithm will still work even if the full arm is visible.
In this way, it will be relatively straightforward to segment the image based on the depth layer alone. Otherwise, we would have to come up with a hand detection algorithm first, which would unnecessarily complicate our mission. If you feel adventurous, feel free to do this on your own.
Finding the most prominent depth of the image center region
Once the hand is placed roughly in the center of the screen, we can start finding all image pixels that lie on the same depth plane as the hand.
For this, we simply need to determine the most prominent depth value of the center region of the image. The simplest approach would be as follows: look only at the depth value of the center pixel:
height, width = depth.shape
center_pixel_depth = depth[height // 2, width // 2]
Then, create a mask in which all pixels at a depth of center_pixel_depth are white and all others are black:
import numpy as np

depth_mask = np.where(depth == center_pixel_depth, 255, 0).astype(np.uint8)
However, this approach will not be very robust, because chances are that:
- Your hand is not placed perfectly parallel to the Kinect sensor
- Your hand is not perfectly flat
- The Kinect sensor values are noisy
Therefore, different regions of your hand will have slightly different depth values.
The _segment_arm method takes a slightly better approach, that is, looking at a small neighborhood in the center of the image and determining the median (meaning the most prominent) depth value. First, we find the center (for example, 21 x 21 pixels) region of the image frame:
    def _segment_arm(self, frame):
        """Segments the arm region based on depth"""
        # frame dimensions, also used by the later flood-fill step
        self.height, self.width = frame.shape[:2]

        # half-width of the 21 x 21 pixel center region (21 // 2)
        center_half = 10
        lowerHeight = self.height // 2 - center_half
        upperHeight = self.height // 2 + center_half + 1  # slice end is exclusive
        lowerWidth = self.width // 2 - center_half
        upperWidth = self.width // 2 + center_half + 1
        center = frame[lowerHeight:upperHeight, lowerWidth:upperWidth]
We can then reshape the depth values of this center region into a one-dimensional vector and determine the median depth value, med_val:
med_val = np.median(center)
We can now compare med_val with the depth value of all pixels in the image and create a mask in which all pixels whose depth values are within a particular range [med_val-self.abs_depth_dev, med_val+self.abs_depth_dev] are white and all other pixels are black. However, for reasons that will be clear in a moment, let’s paint the pixels gray instead of white:
        frame = np.where(abs(frame - med_val) <= self.abs_depth_dev, 128, 0).astype(np.uint8)
The result will look like this:
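Without a sensor at hand, the median-band thresholding can be tried on a synthetic depth frame. The following is a minimal sketch under stated assumptions: the function name depth_band_mask, the frame size, and the simulated depth values are all made up for illustration, and the frame is cast to a signed type before the subtraction so that unsigned arithmetic cannot wrap around:

```python
import numpy as np

def depth_band_mask(frame, abs_depth_dev=14, center_half=10):
    """Paint pixels close to the median depth of the center region gray (128)."""
    height, width = frame.shape[:2]
    center = frame[height // 2 - center_half:height // 2 + center_half + 1,
                   width // 2 - center_half:width // 2 + center_half + 1]
    med_val = np.median(center)
    # signed arithmetic, so differences below zero stay meaningful
    diff = np.abs(frame.astype(np.int32) - med_val)
    return np.where(diff <= abs_depth_dev, 128, 0).astype(np.uint8)

# synthetic 8-bit depth frame: background at 200, a noisy "hand" around 100
rng = np.random.default_rng(0)
frame = np.full((120, 160), 200, dtype=np.uint8)
frame[30:90, 50:110] = 100 + rng.integers(-5, 6, size=(60, 60))

mask = depth_band_mask(frame)
print(np.unique(mask))
```

Because the simulated noise stays within the tolerance band, the whole hand region ends up gray, whereas an exact comparison against the center pixel would keep only the pixels that happen to share its value.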
Applying morphological closing to smoothen the segmentation mask
A common problem with segmentation is that a hard threshold typically results in small imperfections (that is, holes, as in the preceding image) in the segmented region. These holes can be alleviated using morphological opening and closing. Opening removes small objects from the foreground (assuming that the objects are bright on a dark background), whereas closing removes small holes (dark regions).
This means that we can get rid of the small black regions in our mask by applying morphological closing (dilation followed by erosion) with a small 3 x 3 pixel kernel:
        kernel = np.ones((3, 3), np.uint8)
        frame = cv2.morphologyEx(frame, cv2.MORPH_CLOSE, kernel)
The result looks a lot smoother, as follows:
Notice, however, that the mask still contains regions that do not belong to the hand or arm, such as what appears to be one of my knees on the left and some furniture on the right. These objects just happen to be on the same depth layer as my arm and hand. If possible, we could now combine the depth information with another descriptor, maybe a texture-based or skeleton-based hand classifier, that would weed out all non-skin regions.
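To see what closing does at the pixel level, here is a NumPy-only sketch of dilation followed by erosion with a 3 x 3 square kernel. It is a stand-in written for illustration, not the implementation behind cv2.morphologyEx, and the helper names _shifted_stack and close_3x3 are hypothetical:

```python
import numpy as np

def _shifted_stack(img, pad_value):
    """Stack the nine 3 x 3 neighborhood shifts of img along a new axis."""
    padded = np.pad(img, 1, mode="constant", constant_values=pad_value)
    h, w = img.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def close_3x3(mask):
    """Morphological closing: dilation followed by erosion."""
    dilated = _shifted_stack(mask, 0).max(axis=0)     # grow bright regions
    return _shifted_stack(dilated, 255).min(axis=0)   # shrink them back

mask = np.zeros((9, 9), dtype=np.uint8)
mask[2:7, 2:7] = 128    # gray blob ...
mask[4, 4] = 0          # ... with a one-pixel hole

closed = close_3x3(mask)
print(closed[4, 4])
```

The dilation fills the hole and grows the blob by one pixel, and the erosion then shrinks the blob back to its original outline, so only the hole disappears.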
Finding connected components in a segmentation mask
An easier approach is to realize that, most of the time, hands are not connected to knees or furniture. We already know that the center region belongs to the hand, so we can simply apply cv2.floodFill to find all the connected image regions.
Before we do this, we want to be absolutely certain that the seed point for the flood fill belongs to the right mask region. This can be achieved by assigning a grayscale value of 128 to the seed point. But we also want to make sure that the center pixel does not, by any coincidence, lie within a cavity that the morphological operation failed to close. So, let’s set a small 7 x 7 pixel region with a grayscale value of 128 instead:
        small_kernel = 3
        frame[self.height // 2 - small_kernel:self.height // 2 + small_kernel + 1,
              self.width // 2 - small_kernel:self.width // 2 + small_kernel + 1] = 128
Because flood filling (as well as morphological operations) is potentially dangerous, the Python version of later OpenCV versions requires specifying a mask that avoids flooding the entire image. This mask has to be 2 pixels wider and taller than the original image and can be used in combination with the cv2.FLOODFILL_MASK_ONLY flag. It can be very helpful in constraining the flood filling to a small region of the image or a specific contour so that we need not connect two neighboring regions that should have never been connected in the first place. It’s better to be safe than sorry, right?
Ah, screw it! Today, we feel courageous! Let’s make the mask entirely black:
mask = np.zeros((self.height+2, self.width+2), np.uint8)
Then we can apply the flood fill to the center pixel (seed point) and paint all the connected regions white:
        flood = frame.copy()
        cv2.floodFill(flood, mask, (self.width // 2, self.height // 2), 255,
                      flags=4 | (255 << 8))
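The flags argument of cv2.floodFill packs several settings into one integer. According to the OpenCV documentation, the lower 8 bits hold the pixel connectivity (4 or 8) and bits 8 to 16 hold the value that is written into the mask. A small sketch of that packing in plain Python, with helper names that are purely illustrative:

```python
def pack_floodfill_flags(connectivity, mask_fill_value):
    """Combine the connectivity (4 or 8) and the mask fill value (0-255)."""
    return connectivity | (mask_fill_value << 8)

def unpack_floodfill_flags(flags):
    """Recover (connectivity, mask_fill_value) from the packed integer."""
    return flags & 0xFF, (flags >> 8) & 0xFF

flags = pack_floodfill_flags(4, 255)
print(flags, unpack_floodfill_flags(flags))
```

A connectivity of 4 means that only the horizontal and vertical neighbors of a pixel are considered connected, which is the stricter of the two options.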
At this point, it should be clear why we decided to start with a gray mask earlier. We now have a mask that contains white regions (arm and hand), gray regions (neither arm nor hand but other things in the same depth plane), and black regions (all others). With this setup, it is easy to apply a simple binary threshold to highlight only the relevant regions of the pre-segmented depth plane:
ret, flooded = cv2.threshold(flood, 129, 255, cv2.THRESH_BINARY)
This is what the resulting mask looks like:
The resulting segmentation mask can now be returned to the recognize method, where it will be used as an input to _find_hull_defects as well as a canvas for drawing the final output image (img_draw).
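The gray, white, and black logic of the last two steps can be reproduced on a toy mask. The sketch below swaps in a small breadth-first flood fill written in plain Python instead of cv2.floodFill, so that it runs without OpenCV. The function name flood_fill_4 and the toy mask are assumptions for illustration, and the seed is given as (row, column), unlike the (x, y) order that OpenCV expects:

```python
from collections import deque
import numpy as np

def flood_fill_4(mask, seed, new_value):
    """Fill the 4-connected region of equal-valued pixels around seed."""
    out = mask.copy()
    old_value = out[seed]
    if old_value == new_value:
        return out
    queue = deque([seed])
    out[seed] = new_value
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < out.shape[0] and 0 <= nc < out.shape[1]
                    and out[nr, nc] == old_value):
                out[nr, nc] = new_value
                queue.append((nr, nc))
    return out

# gray (128) regions: a "hand" near the center and an unrelated blob
mask = np.zeros((12, 20), dtype=np.uint8)
mask[3:9, 8:14] = 128
mask[1:4, 1:4] = 128

flood = flood_fill_4(mask, (6, 10), 255)                    # hand turns white
hand_only = np.where(flood > 129, 255, 0).astype(np.uint8)  # drop gray regions
print(int((hand_only == 255).sum()))
```

Only the region connected to the seed turns white, so the subsequent threshold discards the unrelated blob even though it lies in the same depth band.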
Hand shape analysis
Now that we (roughly) know where the hand is located, we aim to learn something about its shape.
Determining the contour of the segmented hand region
The first step involves determining the contour of the segmented hand region. Luckily, OpenCV comes with a pre-canned version of such an algorithm—cv2.findContours. This function acts on a binary image and returns a set of points that are believed to be part of the contour. Because there might be multiple contours present in the image, it is possible to retrieve an entire hierarchy of contours:
    def _find_hull_defects(self, segment):
        contours, hierarchy = cv2.findContours(segment, cv2.RETR_TREE,
                                               cv2.CHAIN_APPROX_SIMPLE)
Furthermore, because we do not know which contour we are looking for, we have to make an assumption to clean up the contour result. Since it is possible that some small cavities are left over even after the morphological closing—but we are fairly certain that our mask contains only the segmented area of interest—we will assume that the largest contour found is the one that we are looking for. Thus, we simply traverse the list of contours, calculate the contour area (cv2.contourArea), and store only the largest one (max_contour):
max_contour = max(contours, key=cv2.contourArea)
Finding the convex hull of a contour area
Once we have identified the largest contour in our mask, it is straightforward to compute the convex hull of the contour area. The convex hull is basically the envelope of the contour area. If you think of all the pixels that belong to the contour area as a set of nails sticking out of a board, then the convex hull is the shape formed by a tight rubber band that surrounds all the nails.
We can get the convex hull directly from our largest contour (max_contour):
hull = cv2.convexHull(max_contour, returnPoints=False)
Because we now want to look at convexity defects in this hull, we are instructed by the OpenCV documentation to set the returnPoints optional flag to False.
The convex hull drawn in green around a segmented hand region looks like this:
Finding convexity defects of a convex hull
As is evident from the preceding screenshot, not all points on the convex hull belong to the segmented hand region. In fact, all the fingers and the wrist cause severe convexity defects, that is, points of the contour that are far away from the hull.
We can find these defects by looking at both the largest contour (max_contour) and the corresponding convex hull (hull):
defects = cv2.convexityDefects(max_contour, hull)
The output of this function (defects) is an array with one row per defect. Each row is a 4-tuple that contains start_index (the index of the contour point where the defect begins), end_index (the index of the contour point where the defect ends), farthest_pt_index (the index of the point within the defect that is farthest from the convex hull), and fixpt_depth (a fixed-point approximation of the distance between the farthest point and the convex hull). We will make use of this information in just a moment when we reason about fingers.
But for now, our job is done. The extracted contour (max_contour) and convexity defects (defects) can be passed to recognize, where they will be used as inputs to _detect_num_fingers:
        return (max_contour, defects)
Hand gesture recognition
What remains to be done is classifying the hand gesture based on the number of extended fingers. For example, if we find five extended fingers, we assume the hand to be open, whereas no extended fingers imply a fist. All that we are trying to do is count from zero to five and make the app recognize the corresponding number of fingers.
This is actually trickier than it might seem at first. For example, people in Europe might count to three by extending their thumb, index finger, and middle finger. If you do that in the US, people there might get horrendously confused, because people do not tend to use their thumbs when signaling the number two. This might lead to frustration, especially in restaurants (trust me). If we could find a way to generalize these two scenarios—maybe by appropriately counting the number of extended fingers—we would have an algorithm that could teach simple hand gesture recognition to not only a machine but also (maybe) to an average waitress.
As you might have guessed, the answer has to do with convexity defects. As mentioned earlier, extended fingers cause defects in the convex hull. However, the inverse is not true; that is, not all convexity defects are caused by fingers! There might be additional defects caused by the wrist as well as the overall orientation of the hand or the arm. How can we distinguish between these different causes for defects?
Distinguishing between different causes for convexity defects
The trick is to look at the angle formed at the defect point that is farthest from the convex hull (farthest_pt_index) by the lines to the start and end points of the defect (start_index and end_index, respectively), as illustrated in the following screenshot:
In this screenshot, the orange markers serve as a visual aid to center the hand in the middle of the screen, and the convex hull is outlined in green. Each red dot corresponds to the point farthest from the convex hull (farthest_pt_index) for every convexity defect detected. If we compare a typical angle that belongs to two extended fingers (such as θj) to an angle that is caused by general hand geometry (such as θi), we notice that the former is much smaller than the latter. This is because humans can spread their fingers only a little, thus creating a narrow angle made by the farthest defect point and the neighboring fingertips.
Therefore, we can iterate over all convexity defects and compute the angle between the said points. For this, we will need a utility function that calculates the angle (in radians) between two arbitrary, list-like vectors, v1 and v2:
def angle_rad(v1, v2):
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))
This method uses the cross product to compute the angle, rather than the standard way. The standard way of calculating the angle between two vectors v1 and v2 is to calculate their dot product, divide it by the product of the norm of v1 and the norm of v2, and take the arccosine of the result. However, this method has two imperfections:
- You have to manually avoid division by zero if either the norm of v1 or the norm of v2 is zero
- The method returns relatively inaccurate results for small angles
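Both imperfections can be observed directly. The sketch below compares the cross-product formulation with the arccosine formula; the name angle_rad_acos and the test vectors are assumptions chosen for illustration, and three-dimensional vectors are used so that np.cross behaves the same across NumPy versions:

```python
import numpy as np

def angle_rad(v1, v2):
    """Angle from the cross-product norm and the dot product."""
    return np.arctan2(np.linalg.norm(np.cross(v1, v2)), np.dot(v1, v2))

def angle_rad_acos(v1, v2):
    """Arccosine formula; needs an explicit guard against zero vectors."""
    n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("angle is undefined for a zero-length vector")
    return np.arccos(np.clip(np.dot(v1, v2) / (n1 * n2), -1.0, 1.0))

v1 = [1.0, 0.0, 0.0]
v2 = [1.0, 1e-9, 0.0]          # almost parallel to v1

small_atan2 = angle_rad(v1, v2)
small_acos = angle_rad_acos(v1, v2)
zero_case = angle_rad([0.0, 0.0, 0.0], v1)   # no division, no exception
print(small_atan2, small_acos, zero_case)
```

For the nearly parallel pair, the cosine rounds to exactly 1 in double precision, so the arccosine formula loses the angle entirely, while the arctan2 version still recovers it.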
Similarly, we provide a simple function to convert an angle from degrees to radians:
def deg2rad(angle_deg):
    return angle_deg/180.0*np.pi
Classifying hand gestures based on the number of extended fingers
What remains to be done is actually classifying the hand gesture based on the number of extended fingers. The _detect_num_fingers method will take as input the detected contour (contours), the convexity defects (defects), and a canvas to draw on (img_draw):
def _detect_num_fingers(self, contours, defects, img_draw):
Based on these parameters, it will then determine the number of extended fingers.
However, we first need to define a cut-off angle that can be used as a threshold to classify convexity defects as being caused by extended fingers or not. Except for the angle between the thumb and the index finger, it is rather hard to get anything close to 90 degrees, so anything close to that number should work. We do not want the cut-off angle to be too high, because that might lead to misclassifications:
self.thresh_deg = 80.0
For simplicity, let’s focus on the special cases first. If we do not find any convexity defects, it means that we possibly made a mistake during the convex hull calculation, or there are simply no extended fingers in the frame, so we return 0 as the number of detected fingers:
        if defects is None:
            return [0, img_draw]
But we can take this idea even further. Due to the fact that arms are usually slimmer than hands or fists, we can assume that the hand geometry will always generate at least two convexity defects (which usually belong to the wrist). So if there are no additional defects, it implies that there are no extended fingers.
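One way to express this special case is to compare the number of rows in the defects array with the two defects attributed to the wrist. This is a minimal sketch rather than the exact code of the app; the function name no_fingers_extended and its num_wrist_defects parameter are hypothetical:

```python
import numpy as np

def no_fingers_extended(defects, num_wrist_defects=2):
    """True if there are no defects beyond those attributed to the wrist.

    defects is expected in the layout returned by cv2.convexityDefects:
    None, or an array of shape (N, 1, 4).
    """
    return defects is None or defects.shape[0] <= num_wrist_defects

fist_like = np.zeros((2, 1, 4), dtype=np.int32)   # only the two wrist defects
open_hand = np.zeros((6, 1, 4), dtype=np.int32)   # additional defects present
print(no_fingers_extended(fist_like), no_fingers_extended(open_hand))
```

Inside _detect_num_fingers, a check of this kind would lead to an early return of 0 fingers, just like the check for defects being None.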
Now that we have ruled out all special cases, we can begin counting real fingers. If there are a sufficient number of defects, we will find a defect between every pair of fingers. Thus, in order to get the number right (num_fingers), we should start counting at 1:
num_fingers = 1
Then we can start iterating over all convexity defects. For each defect, we will extract the four elements and draw its hull for visualization purposes:
        for i in range(defects.shape[0]):
            # each defect point is a 4-tuple
            start_idx, end_idx, farthest_idx, _ = defects[i, 0]
            start = tuple(contours[start_idx][0])
            end = tuple(contours[end_idx][0])
            far = tuple(contours[farthest_idx][0])

            # draw the hull
            cv2.line(img_draw, start, end, [0, 255, 0], 2)
Then we will compute the angle between the two edges from far to start and from far to end. If the angle is smaller than self.thresh_deg degrees, it means that we are dealing with a defect that is most likely caused by two extended fingers. In this case, we want to increment the number of detected fingers (num_fingers), and we draw the point with green. Otherwise, we draw the point with red:
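            # if angle is below a threshold, defect point belongs
            # to two extended fingers
            if angle_rad(np.subtract(start, far),
                         np.subtract(end, far)) < deg2rad(self.thresh_deg):
                # increment the number of fingers
                num_fingers += 1

                # draw the point in green (marker size is illustrative)
                cv2.circle(img_draw, far, 5, [0, 255, 0], -1)
            else:
                # draw the point in red (marker size is illustrative)
                cv2.circle(img_draw, far, 3, [255, 0, 0], -1)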
After iterating over all convexity defects, we pass the number of detected fingers and the assembled output image to the recognize method:
return (min(5, num_fingers), img_draw)
This will make sure that we do not exceed the common number of fingers per hand.
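The counting logic can be exercised in isolation by feeding it hand-picked defect points instead of real contours. The sketch below makes several assumptions: the function count_fingers and the sample coordinates are invented for illustration, the special cases for missing or wrist-only defects are left out, and the two-dimensional cross product is written out explicitly:

```python
import numpy as np

def angle_rad(v1, v2):
    """Angle between two 2D vectors via arctan2."""
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return np.arctan2(abs(cross), np.dot(v1, v2))

def count_fingers(defect_points, thresh_deg=80.0):
    """defect_points: iterable of (start, end, far) pixel coordinates."""
    num_fingers = 1                       # n gaps separate n + 1 fingers
    for start, end, far in defect_points:
        angle = angle_rad(np.subtract(start, far), np.subtract(end, far))
        if angle < np.deg2rad(thresh_deg):
            num_fingers += 1
    return min(5, num_fingers)

defect_points = [
    ((40, 10), (60, 10), (50, 60)),     # narrow gap, about 23 degrees
    ((60, 10), (80, 12), (70, 60)),     # narrow gap, about 23 degrees
    ((10, 100), (90, 100), (50, 90)),   # wide defect, about 152 degrees
]
num_fingers = count_fingers(defect_points)
print(num_fingers)
```

Two narrow gaps are counted and the wide, wrist-like defect is ignored, so the sketch reports three extended fingers.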
The result can be seen in the following screenshots:
Interestingly, our app is able to detect the correct number of extended fingers in a variety of hand configurations. Defect points between extended fingers are easily classified as such by the algorithm, and others are successfully ignored.
This article showed a relatively simple and yet surprisingly robust way of recognizing a variety of hand gestures by counting the number of extended fingers.
The article first showed how a task-relevant region of the image can be segmented using depth information acquired from a Microsoft Kinect 3D sensor, and how morphological operations can be used to clean up the segmentation result. By analyzing the shape of the segmented hand region, the algorithm classifies hand gestures based on the types of convexity defects found in the image. Once again, mastering our use of OpenCV to perform a desired task did not require us to produce a large amount of code. Instead, we were challenged to gain an important insight that made us use the built-in functionality of OpenCV in the most effective way possible.
Gesture recognition is a popular but challenging field in computer science, with applications in a large number of areas, such as human-computer interaction, video surveillance, and even the video game industry. You can now use your advanced understanding of segmentation and structure analysis to build your own state-of-the-art gesture recognition system.