Google researchers have unveiled a new real-time hand tracking algorithm that could be a breakthrough for people communicating via sign language. Their algorithm uses machine learning to compute 3D keypoints of a hand from a single video frame. The research is implemented in MediaPipe, an open-source, cross-platform framework for building multimodal (e.g., video, audio, or any time-series data) applied ML pipelines. Notably, the 3D hand perception runs in real time on a mobile phone.
How does real-time hand perception and gesture recognition work with MediaPipe?
The algorithm is built using the MediaPipe framework. Within this framework, the pipeline is built as a directed graph of modular components.
The pipeline employs three different models: a palm detector, a hand landmark model, and a gesture recognizer.
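The division of labor among the three models can be sketched as a simple pipeline in which each stage consumes the previous stage's output. The function names and return values below are illustrative placeholders, not MediaPipe's actual API:

```python
# Illustrative sketch of the three-model pipeline; every function here is
# a placeholder stand-in, not MediaPipe code.

def detect_palm(frame):
    """Stand-in for the BlazePalm detector: return an oriented bounding
    box around a palm, or None if no palm is found."""
    return {"center": (120, 80), "size": (64, 64), "angle_deg": 15.0}

def detect_landmarks(frame, palm_box):
    """Stand-in for the hand landmark model: return 21 (x, y, z)
    keypoints inside the cropped palm region."""
    return [(0.5, 0.5, 0.0)] * 21  # dummy keypoints

def recognize_gesture(landmarks):
    """Stand-in for the gesture recognizer: map keypoints to a label."""
    return "open_palm"

def process_frame(frame):
    """Chain the three stages for one video frame."""
    palm_box = detect_palm(frame)
    if palm_box is None:
        return None
    landmarks = detect_landmarks(frame, palm_box)
    return recognize_gesture(landmarks)
```

The key design point is that each model only has to solve a narrow problem: the palm detector localizes, the landmark model regresses keypoints on a tight crop, and the recognizer works purely on keypoints.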
The palm detector operates on the full image and outputs an oriented bounding box around the palm. The researchers employ a single-shot detector model called BlazePalm, which achieves an average precision of 95.7% in palm detection.
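An oriented bounding box (center, size, and rotation) is just a rotated rectangle, so before cropping it must be converted into image-space corner points. The helper below is a hedged geometry sketch of that conversion, not MediaPipe code:

```python
import math

def oriented_box_corners(cx, cy, w, h, angle_deg):
    """Return the four corners of an oriented bounding box given its
    center (cx, cy), width w, height h, and rotation in degrees.
    Corners are listed counter-clockwise starting at the top-left."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    corners = []
    for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)]:
        # Rotate each half-extent offset, then translate by the center.
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners
```

Orienting the crop this way is what lets the downstream landmark model always see the hand in a roughly canonical rotation.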
Next, the hand landmark model takes the cropped image defined by the palm detector and returns 21 3D hand keypoints. To train it, the researchers manually annotated around 30K real-world images with 21 coordinates each. They also generated a synthetic dataset to improve the robustness of the hand landmark detection model.
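The 21 keypoints follow a fixed topology: one wrist point plus four joints per finger, ordered from the base of each finger to its tip. The lookup table below reflects that published layout; the dictionary and helper themselves are just an illustration:

```python
# Index layout of the 21 hand landmarks: the wrist, then four joints per
# finger ordered from the finger's base to its tip.
HAND_LANDMARKS = {
    "wrist": [0],
    "thumb": [1, 2, 3, 4],
    "index": [5, 6, 7, 8],
    "middle": [9, 10, 11, 12],
    "ring": [13, 14, 15, 16],
    "pinky": [17, 18, 19, 20],
}

def finger_keypoints(landmarks, finger):
    """Pick one finger's (x, y, z) keypoints out of the 21-point list."""
    return [landmarks[i] for i in HAND_LANDMARKS[finger]]
```

Because the indices are fixed, any downstream consumer (such as the gesture recognizer) can address "the index fingertip" without any extra detection work.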
The gesture recognizer then classifies the previously computed keypoint configuration into a discrete set of gestures. The algorithm determines the state of each finger, e.g. bent or straight, by the accumulated angles of joints. The existing pipeline supports counting gestures from multiple cultures, e.g. American, European, and Chinese, and various hand signs including “Thumb up”, closed fist, “OK”, “Rock”, and “Spiderman”. They also trained their models to work in a wide variety of lighting situations and with a diverse range of skin tones.
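The finger-state step described above can be approximated with plain vector math: compute the angle at each joint along a finger, accumulate how far those angles deviate from a straight line, and threshold the total. This is a simplified sketch of the idea, not the researchers' exact rule; the threshold value is illustrative:

```python
import math

def joint_angle(a, b, c):
    """Angle in radians at joint b, formed by 3D points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    cos = max(-1.0, min(1.0, dot / (n1 * n2)))  # clamp rounding error
    return math.acos(cos)

def finger_is_straight(points, bend_threshold_rad=0.5):
    """points: 4 keypoints from finger base to tip. The finger counts as
    straight when the accumulated deviation from 180 degrees at its two
    interior joints stays below the (illustrative) threshold."""
    bend = 0.0
    for i in range(1, 3):
        bend += math.pi - joint_angle(points[i - 1], points[i], points[i + 1])
    return bend < bend_threshold_rad
```

A discrete gesture such as "Thumb up" then reduces to checking which fingers are straight and which are bent, which is why the recognizer can run on keypoints alone rather than on pixels.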
Gesture recognition – Source: Google blog
With MediaPipe, the researchers built their pipeline as a directed graph of modular components called Calculators. Individual calculators, such as cropping, rendering, and neural network computations, can be performed exclusively on the GPU; the team employed TFLite GPU inference on most modern phones. The researchers are open-sourcing the hand tracking and gesture recognition pipeline in the MediaPipe framework along with the source code.
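The "graph of Calculators" idea can be illustrated, in Python for readability, as a tiny executor that runs named stages once their input streams are available. MediaPipe's real calculators are native components wired together through graph configuration files, so this is purely a conceptual sketch:

```python
# Toy illustration of a calculator graph: named stages wired by their
# input stream names and executed in dependency order.

def run_graph(calculators, inputs):
    """calculators: {name: (fn, [input_stream_names])}.
    Runs each stage once all of its inputs exist, storing the stage's
    output under its own name, and returns every stream produced."""
    streams = dict(inputs)
    pending = dict(calculators)
    while pending:
        ready = [name for name, (_, deps) in pending.items()
                 if all(d in streams for d in deps)]
        if not ready:
            raise ValueError("cycle or missing input in graph")
        for name in ready:
            fn, deps = pending.pop(name)
            streams[name] = fn(*[streams[d] for d in deps])
    return streams

# Wiring the hand-tracking stages (placeholder lambdas, not real models):
GRAPH = {
    "palm_box": (lambda frame: "box", ["frame"]),
    "landmarks": (lambda frame, box: "kpts", ["frame", "palm_box"]),
    "gesture": (lambda kpts: "thumbs_up", ["landmarks"]),
}
```

Expressing the pipeline as a graph is what makes it easy to move individual stages, such as rendering or inference, onto the GPU without restructuring the rest of the pipeline.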
The researchers Valentin Bazarevsky and Fan Zhang write in a blog post, “Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method achieves real-time performance on a mobile phone, and even scales to multiple hands. We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues.”
People commended the fact that this algorithm can run on mobile devices and is useful for people who communicate via sign language.
Ok this is neat! I initially thought YOLOv3 with landmark detection, what's new? Then I read the blog. Doing this On Mobile Device is very cool. Also cool is the hierarchical and multimodal mixing of standard object detection with VAE-like. This is important work. I like!
— Dr. Stephen Odaibo (@SOdaibo) August 19, 2019
Google just released a significant amount of research for hand gesture recognition using #AI/ML for Real-Time on-device hand tracking. They've also open-sourced their work as well. The applications for #VR and #AR could be huge. Rock paper scissors guys? https://t.co/32Ipa4keb3 pic.twitter.com/LLwSN7recz
— Anshel Sag (@anshelsag) August 19, 2019
@GoogleAI absolutely love the use case. One that’s dear to my heart. Would love to see how this works in practice when signs are much less “clear” and very quickly made.
— Jonathan Corey (@JonCorey1) August 21, 2019