
Google researchers have unveiled a new real-time hand tracking algorithm that could be a breakthrough for people communicating via sign language. The algorithm uses machine learning to compute the 3D keypoints of a hand from a single video frame. The research is implemented in MediaPipe, an open-source cross-platform framework for building multimodal (e.g. video, audio, or any time series data) applied ML pipelines. What is notable is that the 3D hand perception runs in real time on a mobile phone.

How does real-time hand perception and gesture recognition work with MediaPipe?

The algorithm is built using the MediaPipe framework. Within this framework, the pipeline is built as a directed graph of modular components.

The pipeline employs three different models: a palm detector, a hand landmark model, and a gesture recognizer.

The palm detector operates on the full image and outputs an oriented bounding box around the hand. The researchers employ a single-shot detector model called BlazePalm, which achieves an average precision of 95.7% in palm detection.
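To make the hand-off to the next stage concrete, here is a minimal sketch of how a downstream step might crop the region described by an oriented bounding box, rotating the frame so the crop is upright before it is passed on. The (center, size, angle) box format and the OpenCV-based crop are assumptions for illustration, not the exact BlazePalm output.

```python
# Illustrative crop of an oriented bounding box: rotate the frame so the
# detected box is axis-aligned, then cut out an upright patch.
import cv2
import numpy as np

def crop_oriented_box(frame, center, size, angle_deg):
    """Rotate the frame around the box center and return an upright crop."""
    rotation = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    rotated = cv2.warpAffine(frame, rotation, (frame.shape[1], frame.shape[0]))
    w, h = size
    x, y = int(center[0] - w / 2), int(center[1] - h / 2)
    return rotated[y:y + h, x:x + w]

# Example: a dummy 480x640 frame and a hypothetical detection
frame = np.zeros((480, 640, 3), dtype=np.uint8)
crop = crop_oriented_box(frame, center=(320.0, 240.0), size=(128, 128), angle_deg=15.0)
print(crop.shape)  # (128, 128, 3)
```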

Next, the hand landmark model takes the cropped image defined by the palm detector and returns 21 3D hand keypoints. To train the model to detect these keypoints, the researchers manually annotated around 30K real-world images with 21 coordinates each. They also generated a synthetic dataset to improve the robustness of the hand landmark model.
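For a sense of what that output looks like downstream, here is a small sketch that groups the 21 keypoints by finger and scales normalized coordinates back to pixel positions in the crop. The index layout follows the commonly documented MediaPipe hand landmark order (wrist first, then four joints per finger); the coordinate values here are stand-ins.

```python
# Working with the 21 keypoints returned by the hand landmark model.
import numpy as np

WRIST = 0
FINGERS = {
    "thumb":  [1, 2, 3, 4],
    "index":  [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring":   [13, 14, 15, 16],
    "pinky":  [17, 18, 19, 20],
}

# 21 keypoints, each (x, y, z) normalized to the crop; random stand-in values.
keypoints = np.random.rand(21, 3)

def to_pixels(keypoints, crop_width, crop_height):
    """Scale normalized x/y coordinates back to pixel positions in the crop."""
    return keypoints[:, :2] * np.array([crop_width, crop_height])

print(to_pixels(keypoints, 256, 256)[FINGERS["index"]])  # index-finger joints in pixels
```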

The gesture recognizer then classifies the previously computed keypoint configuration into a discrete set of gestures. The algorithm determines the state of each finger, e.g. bent or straight, from the accumulated angles of its joints. The existing pipeline supports counting gestures from multiple cultures, e.g. American, European, and Chinese, and various hand signs including “Thumb up”, closed fist, “OK”, “Rock”, and “Spiderman”. The researchers also trained their models to work in a wide variety of lighting situations and with a diverse range of skin tones.
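The accumulated-angle idea can be sketched in a few lines: sum the bend at each joint along a finger and call the finger straight if the total stays small. The 45-degree threshold and the wrist-to-fingertip landmark chain below are illustrative assumptions, not values from the paper.

```python
# A minimal sketch of classifying a finger as bent or straight from joint angles.
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at point b formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))

def finger_is_straight(keypoints, finger_indices, wrist=0, threshold_deg=45.0):
    """Accumulate the deviation from 180 degrees at each joint of one finger."""
    chain = [wrist] + finger_indices  # wrist -> base joint -> ... -> fingertip
    total_bend = 0.0
    for i in range(1, len(chain) - 1):
        a, b, c = keypoints[chain[i - 1]], keypoints[chain[i]], keypoints[chain[i + 1]]
        total_bend += 180.0 - joint_angle(a, b, c)
    return total_bend < threshold_deg

# Example: an index finger laid out in a straight line is classified as straight.
keypoints = np.zeros((21, 3))
for step, idx in enumerate([5, 6, 7, 8], start=1):
    keypoints[idx] = [0.0, step * 0.1, 0.0]
print(finger_is_straight(keypoints, [5, 6, 7, 8]))  # True
```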

Gesture recognition – Source: Google blog

With MediaPipe, the researchers built their pipeline as a directed graph of modular components, called Calculators. Individual calculators, such as cropping, rendering, and neural network computation, can run entirely on the GPU. They use TFLite GPU inference, which is available on most modern phones. The researchers are open sourcing the hand tracking and gesture recognition pipeline in the MediaPipe framework along with the source code.
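For readers who want to try the pipeline without building the graph themselves, MediaPipe also wraps this hand tracking pipeline in its Python package as the Hands solution. The snippet below is a minimal usage sketch, assuming the `mediapipe` package is installed and your version exposes this API, and "hand.jpg" stands in for any image containing a hand.

```python
# Hedged usage sketch of the MediaPipe Hands solution from Python.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=2, min_detection_confidence=0.5)

frame = cv2.imread("hand.jpg")  # hypothetical path to any BGR image with a hand in it
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        # 21 normalized (x, y, z) landmarks per detected hand
        wrist = hand_landmarks.landmark[0]
        print(wrist.x, wrist.y, wrist.z)

hands.close()
```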

The researchers Valentin Bazarevsky and Fan Zhang write in a blog post, “Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method achieves real-time performance on a mobile phone, and even scales to multiple hands. We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.”

People commended the fact that this algorithm can run on mobile devices and is useful for people who communicate via sign language.

