Gesture recognition

From VR & AR Wiki

Gesture recognition is the computational interpretation of human gestures, most often movements of the hands and fingers but also of the head, face, or whole body, so that a machine can treat them as input. In virtual reality (VR) and augmented reality (AR) it is the technology behind controller-free interaction: a headset or sensor observes the user's hands, estimates their pose, and maps recognised gestures such as a pinch, a tap, or a swipe to commands like selecting an object, scrolling a menu, or grabbing a virtual item.

Gesture recognition predates modern headsets. Early systems read finger bend from instrumented gloves, while later work moved to cameras and depth sensors that recover hand pose without any worn hardware. Contemporary XR devices, including the Meta Quest family, the Microsoft HoloLens 2, and the Apple Vision Pro, ship markerless camera-based hand tracking that recognises gestures directly from images, in several cases as the primary input method rather than an accessory to handheld controllers.

Definition and scope

A gesture is a meaningful movement of part of the body that conveys information or a command. Gesture recognition is the process by which a system detects that movement and classifies it. The literature commonly divides hand gestures into two classes: static gestures (also called postures), which are a single configuration of the hand such as an open palm or a thumbs-up and can be classified from one image, and dynamic gestures, which are a sequence of poses over time, such as a wave or a swipe, and require multiple frames to recognise.[1]
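The static/dynamic split can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the per-finger "extended" flags, the posture names, and the travel threshold are hypothetical and not taken from any particular platform. The point is only that a posture is decided from one frame, while a swipe cannot be decided without a sequence.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # x, y, z in metres (assumed units)


def classify_static(extended: Sequence[bool]) -> str:
    """Classify ONE frame from five per-finger 'extended' flags,
    ordered thumb, index, middle, ring, little (hypothetical input).
    Hand orientation is ignored, so this is only a rough posture test."""
    if all(extended):
        return "open_palm"
    if extended[0] and not any(extended[1:]):
        return "thumbs_up"
    if not any(extended):
        return "fist"
    return "unknown"


def classify_dynamic(palm_path: List[Point], min_travel: float = 0.15) -> str:
    """Classify a SEQUENCE of palm positions as a horizontal swipe.
    A single position carries no motion, so at least two frames are needed."""
    if len(palm_path) < 2:
        return "none"
    dx = palm_path[-1][0] - palm_path[0][0]
    dy = palm_path[-1][1] - palm_path[0][1]
    # Require enough sideways travel, and that it dominates vertical drift.
    if abs(dx) >= min_travel and abs(dx) > 2 * abs(dy):
        return "swipe_right" if dx > 0 else "swipe_left"
    return "none"
```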

Gesture recognition is closely related to, but distinct from, hand tracking. Hand tracking is the continuous estimation of the position and orientation of the hand and its joints; gesture recognition is the layer that interprets those tracked poses, or their change over time, as discrete commands or continuous controls. A typical XR pipeline performs hand tracking first, producing a skeletal model of the hand, and then runs gesture recognition on top of that skeleton.[1][2]
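The two-stage arrangement can be sketched as follows. The skeleton is assumed to arrive from a hand tracker as a dictionary of named joint positions; the key names `thumb_tip` and `index_tip`, the class name, and the distance thresholds are hypothetical. The recognition layer converts the continuous tracked pose into discrete events.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float, float]


class PinchDetector:
    """Turns a stream of tracked skeletons into discrete pinch events."""

    def __init__(self, press_dist: float = 0.015, release_dist: float = 0.03):
        # Two thresholds (hysteresis) stop the state from flickering when the
        # fingertip distance hovers near a single cut-off. The values are
        # illustrative and expressed in metres.
        self.press_dist = press_dist
        self.release_dist = release_dist
        self.pinching = False

    def update(self, skeleton: Dict[str, Point]) -> Optional[str]:
        """Consume one tracked frame; return 'pinch_start', 'pinch_end' or None."""
        gap = math.dist(skeleton["thumb_tip"], skeleton["index_tip"])
        if not self.pinching and gap < self.press_dist:
            self.pinching = True
            return "pinch_start"
        if self.pinching and gap > self.release_dist:
            self.pinching = False
            return "pinch_end"
        return None
```

An application would call `update` once per tracked frame and react only to the returned events, leaving the raw joint data to the tracking layer.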

History

The idea of reading hand gestures for computer input dates to the late 1970s. The Sayre Glove, developed in 1977 by Daniel Sandin and Thomas DeFanti at the Electronic Visualization Laboratory of the University of Illinois at Chicago and based on an idea by colleague Richard Sayre, used flexible tubes with a light source at one end and a photocell at the other so that bending a finger reduced the light reaching the cell, giving a rough measure of finger flexion.[3]

The first widely cited gesture interface came from VPL Research, the company founded in 1984 by Jaron Lanier. Its DataGlove measured finger bend with optical fibres routed along the fingers and tracked hand position and orientation with ultrasonic or magnetic sensors. The system was described in the 1987 paper "A hand gesture interface device" by Thomas G. Zimmerman, Jaron Lanier, Chuck Blanchard, Steve Bryson, and Young Harvill, presented at the CHI+GI conference. The paper reports real-time gesture, position, and orientation sensing, with applications including driving a 3D model of the hand to manipulate computer-generated objects and interpreting finger-spelling.[4][5] VPL licensed related glove technology to Mattel, which produced the Power Glove for the Nintendo Entertainment System; it sold strongly in 1989 as a novelty but worked poorly as a game controller.[5]

Glove-based gesture input remained the dominant approach into the 1990s with products such as the CyberGlove, which used resistive flex sensors to measure joint angles. The next shift was toward vision, where a camera rather than a worn device observes the hand. A widely deployed example outside headsets was Microsoft's Kinect, launched for the Xbox 360 in late 2010, which paired an RGB camera with a depth sensor and used machine learning to label body parts and reconstruct a skeleton of about 20 joints, allowing full-body gestures to control games without any handheld device.[6]

How it works

Gesture recognition systems differ mainly in how they sense the hand or body. Three families of techniques are common, and many products combine them.

Sensor-based and glove-based methods

The earliest approach instruments the body directly. A data glove carries flex sensors (originally optical fibres, later resistive or capacitive elements) that measure how far each finger joint bends, plus a separate sensor for hand position and orientation. Wearable systems may instead use inertial measurement units, containing accelerometers and gyroscopes, mounted on the hand or fingers. These methods do not depend on line of sight or lighting and can be precise, but they require the user to wear and calibrate hardware.[4][7]
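The calibration step mentioned above can be sketched as a simple mapping from a raw sensor reading to a joint angle. This is a minimal illustration that assumes a linear sensor response, which real flex sensors often do not have; the function name, parameters, and the 90-degree range are hypothetical.

```python
def flex_to_angle(raw: float, raw_flat: float, raw_bent: float,
                  max_angle_deg: float = 90.0) -> float:
    """Map a raw flex-sensor reading to an approximate joint angle.

    raw_flat and raw_bent are readings captured during a per-user
    calibration step, with the finger straight and fully curled.
    A linear response is assumed here purely for illustration.
    """
    if raw_flat == raw_bent:
        raise ValueError("calibration readings must differ")
    t = (raw - raw_flat) / (raw_bent - raw_flat)
    t = min(1.0, max(0.0, t))  # clamp readings outside the calibrated range
    return t * max_angle_deg
```

Because the mapping is normalised by the two calibration readings, it also covers sensors whose reading falls as the finger bends, as with an optical design in which bending reduces the light reaching the detector.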

Camera and computer-vision methods

Vision-based recognition uses one or more cameras to observe the hand and recover its pose with computer vision. A marker-based variant attaches reflective or coloured markers to a glove or the fingers to simplify detection, while the markerless variant analyses the bare hand directly from camera images. Depth cameras and infrared illumination can supplement ordinary colour cameras to make the hand easier to separate from the background and to recover its distance, a process related to depth sensing. The output is typically a skeletal hand model: a set of keypoints for the fingertips and joints, each with a position and sometimes an orientation. Markerless vision is the dominant approach in current XR headsets because it needs no worn hardware, though it depends on the hand staying within the camera's field of view and on adequate lighting.[7][8]
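How depth helps separate the hand from the background can be shown with a deliberately simple sketch: keep only pixels whose depth falls in the band where a hand is expected. The depth band, the millimetre units, and the convention that 0 marks a missing reading are assumptions; production systems use learned detectors rather than a fixed threshold.

```python
from typing import List, Optional, Tuple


def segment_by_depth(depth_mm: List[List[int]],
                     near_mm: int = 200, far_mm: int = 700) -> List[List[bool]]:
    """Return a mask that is True where a pixel lies in the expected hand range.
    Readings of 0 (assumed to mean 'no data') fall outside the band."""
    return [[near_mm <= d <= far_mm for d in row] for row in depth_mm]


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) of the True pixels, or None if empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, hit in enumerate(row) if hit]
    if not rows:
        return None
    return min(rows), min(cols), max(rows), max(cols)
```

The resulting box could then be cropped and passed to a pose estimator that produces the skeletal keypoints.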

Machine-learning classification

Once the hand or body has been sensed, a recognition stage decides which gesture, if any, is being performed. Static postures can be classified from a single frame of joint positions, for example with a neural network trained on labelled hand configurations. Dynamic gestures, which unfold over several frames, are handled with sequence models. Classical work used hidden Markov models, particle filters, and dynamic time warping, while more recent systems use recurrent neural networks such as long short-term memory (LSTM) networks, often trained on large datasets of hand images or skeletons.[1][9] Modern markerless hand tracking on headsets relies on deep learning models that infer the full skeleton even when fingers are partly hidden. Developers can then define gestures either by reading the skeleton (for example, checking whether the thumb and index fingertips are touching) or by training a dedicated classifier.[1][2]
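Of the classical sequence techniques named above, dynamic time warping is the easiest to show in a few lines. The sketch below compares a recorded one-dimensional trajectory (for example, palm x-position per frame) against stored templates and picks the closest. The template names and values are hypothetical, and a real recogniser would use multi-dimensional features and a rejection threshold.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping cost between two 1-D sequences.
    Sequences of different lengths and speeds can still align."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance a only, advance b only, advance both.
            cost[i][j] = step + min(cost[i - 1][j],
                                    cost[i][j - 1],
                                    cost[i - 1][j - 1])
    return cost[n][m]


def classify(sample: Sequence[float],
             templates: Dict[str, Sequence[float]]) -> str:
    """Return the name of the template with the lowest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(sample, templates[name]))
```

Because the alignment may stretch or compress time, a slow swipe and a fast swipe along the same path match the same template, which is the property that made the method attractive for dynamic gestures.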

Use in virtual and augmented reality

Gesture recognition lets a headset accept input from the user's bare hands, either alongside handheld input devices or instead of them. The major platforms differ in how aggressively they treat hands as a primary control method.

Meta Quest

The Meta Quest standalone headsets perform markerless hand tracking with their onboard cameras and recognise a set of system gestures. Pinching the thumb and index finger acts as a click for menus and the system interface, and a palm-up pinch opens the system menu, so the headset can be navigated without the Touch controllers. In 2025 Meta added microgestures, an OpenXR extension that recognises small thumb movements on the side of the index finger: a tap on the middle segment of the index finger, and left, right, forward, and backward swipes performed in one smooth motion, turning the index finger into a directional pad. Meta describes these as intuitive, low-effort inputs for repetitive actions such as scrolling a browser or teleporting in an app, and lists support on Quest 2, Quest Pro, and the Quest 3 family.[10][11]

Microsoft HoloLens 2

The HoloLens 2 AR headset, released in 2019, introduced what Microsoft calls a fully articulated hand-tracking system that recognises the user's hands as left and right skeletal models and supports direct manipulation: users press buttons, grab objects, and operate 2D panels by touching holograms with their fingers, with no symbolic gestures to memorise. Microsoft contrasts this with the first HoloLens, which used a small fixed gesture set including the air tap (a downward tap of the raised index finger, equivalent to a mouse click) and the bloom gesture. For targets out of reach, HoloLens 2 also casts a hand ray from the palm for point-and-commit interaction. Secondary documentation describes the articulated system as tracking up to 25 joints per hand.[12][13] The cross-vendor OpenXR standard later settled on a convention of 26 hand joints, including the wrist and palm, for hand-tracking extensions.[14]
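The 26-joint convention can be enumerated in a short sketch. The names below are abbreviated from the `XR_HAND_JOINT_..._EXT` constants of the OpenXR hand-tracking extension and the ordering is intended to follow that enumeration; the helper function itself is hypothetical, and the specification remains the authority on names and indices.

```python
from typing import List

FINGERS = ["THUMB", "INDEX", "MIDDLE", "RING", "LITTLE"]
SEGMENTS = ["METACARPAL", "PROXIMAL", "INTERMEDIATE", "DISTAL", "TIP"]


def hand_joint_names() -> List[str]:
    """Build the joint list: palm, wrist, then each finger from base to tip."""
    names = ["PALM", "WRIST"]
    for finger in FINGERS:
        for segment in SEGMENTS:
            # The thumb has no intermediate joint, so it contributes four entries.
            if finger == "THUMB" and segment == "INTERMEDIATE":
                continue
            names.append(f"{finger}_{segment}")
    return names


JOINTS = hand_joint_names()
THUMB_TIP = JOINTS.index("THUMB_TIP")
INDEX_TIP = JOINTS.index("INDEX_TIP")
```

The count works out as 2 (palm and wrist) plus 4 for the thumb plus 5 for each of the other four fingers, giving 26.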

Apple Vision Pro

The Apple Vision Pro, released in the United States in February 2024, ships with no physical controllers and is operated by a combination of eye gaze, hand gestures, and voice, which Apple describes as control by eyes, hands, and voice. The user looks at an interface element to target it, then pinches the thumb and index finger together to select, equivalent to a tap; keeping the fingers pinched and flicking the wrist scrolls, and a two-handed pinch-and-move zooms or rotates content. Apple's description is that users "browse through apps by simply looking at them, tapping their fingers to select, flicking their wrist to scroll, or using voice to dictate."[15][16] Because the headset's downward- and outward-facing cameras cover a wide area, the gestures can be made with the hands resting in the lap rather than held up, reducing arm fatigue. The Vision Pro's R1 chip processes input from 12 cameras and additional sensors, including dedicated infrared cameras and illuminators for eye tracking and hand tracking.[16][17]
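The look-then-pinch pattern can be sketched generically. This is not Apple's implementation or API: the function names, the element dictionary, and the 3-degree tolerance are hypothetical. The sketch only shows the division of labour, with the eyes choosing the target and the pinch committing the selection.

```python
import math
from typing import Dict, Optional, Tuple

Vec = Tuple[float, float, float]


def _angle_between(u: Vec, v: Vec) -> float:
    """Angle in radians between two vectors; pi if either has zero length."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0.0:
        return math.pi
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def gaze_target(eye: Vec, gaze_dir: Vec, elements: Dict[str, Vec],
                max_angle_deg: float = 3.0) -> Optional[str]:
    """Return the element closest to the gaze ray, within an angular tolerance."""
    best, best_angle = None, math.radians(max_angle_deg)
    for name, pos in elements.items():
        to_elem = tuple(p - e for p, e in zip(pos, eye))
        angle = _angle_between(gaze_dir, to_elem)
        if angle <= best_angle:
            best, best_angle = name, angle
    return best


def select(eye: Vec, gaze_dir: Vec, elements: Dict[str, Vec],
           pinch_started: bool) -> Optional[str]:
    """Targeting comes from the eyes; the pinch only commits the choice."""
    if not pinch_started:
        return None
    return gaze_target(eye, gaze_dir, elements)
```

Since the hand contributes only the commit signal and not a pointing direction, its position does not matter as long as the cameras can see the pinch.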

Leap Motion and Ultraleap

A dedicated peripheral approach to XR gesture input came from Leap Motion, founded in 2010, whose Leap Motion Controller used two infrared cameras to track both hands and their fingers, resolving many individual bones and joints. In February 2016 the company released Orion, a rebuilt tracking runtime aimed at hand tracking in VR; mounting the small sensor on the front of a headset let developers add bare-hand gesture input to PC VR. In 2019 Leap Motion merged with Ultrahaptics to form Ultraleap, which has continued the hand-tracking line, including the Ultraleap Leap Motion Controller 2, and licenses its tracking software to third-party headset makers.[18][19]

Limitations

Camera-based gesture recognition, the dominant approach in XR, has several practical constraints. The hands must stay within the field of view of the headset's cameras; once a hand moves out of view its gestures cannot be read. Recognition degrades in poor lighting and when fingers occlude one another, although infrared illumination and deep-learning pose estimation that infers hidden joints mitigate this.[1][7] Bare-hand input also lacks the tactile feedback of a physical controller, so platforms compensate with visual and audio cues; Microsoft's HoloLens 2 guidance, for example, adds proximity shaders and on-press visual feedback because there is no physical click.[12] Holding the hands up for extended interaction can cause fatigue, which is part of the reason the Apple Vision Pro decouples targeting (done with the eyes) from the pinch gesture, allowing the hands to rest.[16]
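The field-of-view constraint can be expressed as a simple geometric test. The sketch assumes a single camera at the origin with a symmetric conical field of view, which is a simplification: real headsets combine several cameras with rectangular frusta. The function name and the 90-degree default are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def in_field_of_view(hand_pos: Vec, cam_forward: Vec = (0.0, 0.0, -1.0),
                     fov_deg: float = 90.0) -> bool:
    """Return True if the hand lies inside the camera's viewing cone.

    hand_pos is expressed in camera space with the camera at the origin.
    """
    dist = math.hypot(*hand_pos)
    fwd = math.hypot(*cam_forward)
    if dist == 0.0 or fwd == 0.0:
        return False
    cos_angle = sum(h * f for h, f in zip(hand_pos, cam_forward)) / (dist * fwd)
    # Inside the cone when the angle to the forward axis is at most half the FOV.
    return cos_angle >= math.cos(math.radians(fov_deg / 2.0))
```

An application might use such a test to cancel a held gesture cleanly when the hand leaves view, rather than leaving an object stuck in a grabbed state.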

References

  1. "How do you implement hand tracking and gesture recognition in VR?". https://zilliz.com/ai-faq/how-do-you-implement-hand-tracking-and-gesture-recognition-in-vr.
  2. "What is Hand Tracking?". https://www.autovrse.com/glossary/hand-tracking.
  3. "Sayre Glove (first wired data glove)". https://www.evl.uic.edu/research/2162.
  4. Zimmerman, Thomas G.; Lanier, Jaron; Blanchard, Chuck; Bryson, Steve; Harvill, Young (1987). "A hand gesture interface device". Proceedings of the SIGCHI/GI Conference on Human Factors in Computing Systems and Graphics Interface (CHI+GI '87). https://dl.acm.org/doi/10.1145/29933.275628.
  5. "VPL Research". https://en.wikipedia.org/wiki/VPL_Research.
  6. "Body Part Recognition and the Development of Kinect". https://www.microsoft.com/en-us/research/video/body-part-recognition-and-the-development-of-kinect/.
  7. "Mastering Gesture-Based Interaction in VR/AR". https://www.numberanalytics.com/blog/gesture-based-interaction-in-vr-ar-development.
  8. "Computer Vision in AR and VR: The Complete Guide". https://viso.ai/computer-vision/augmented-reality-virtual-reality/.
  9. Wang, Xiaoyan (2012). "Hidden-Markov-Models-Based Dynamic Hand Gesture Recognition". https://www.hindawi.com/journals/mpe/2012/986134/. Retrieved 2026-06-16.
  10. "Hand tracking microgestures OpenXR extension". https://developers.meta.com/horizon/documentation/unity/unity-microgestures/.
  11. "Meta Quest now supports microgestures: what does that mean?". 2025-03-17. https://mixed-news.com/en/meta-quest-micro-gestures/.
  12. "Direct manipulation with hands". 2019-04-02. https://learn.microsoft.com/en-us/windows/mixed-reality/design/direct-manipulation.
  13. "Microsoft's HoloLens 2 Team Answers More Questions About Biometric Security, Audio, and Hand Tracking". https://hololens.reality.news/news/microsofts-hololens-2-team-answers-more-questions-about-biometric-security-audio-hand-tracking-0194712/.
  14. "XrHandJointLocationsEXT". https://registry.khronos.org/OpenXR/specs/1.0/man/html/XrHandJointLocationsEXT.html.
  15. "Introducing Apple Vision Pro: Apple's first spatial computer". 2023-06-05. https://www.apple.com/newsroom/2023/06/introducing-apple-vision-pro/.
  16. "How You Control Apple Vision Pro With Your Eyes and Hands". https://www.uploadvr.com/apple-vision-pro-gesture-controls/.
  17. "Apple Vision Pro Cameras". https://www.techinsights.com/blog/apple-vision-pro-cameras.
  18. "Leap Motion". https://en.wikipedia.org/wiki/Leap_Motion.
  19. "UltraLeap Gemini review: use both hands in VR!". 2021-01-27. https://skarredghost.com/2021/01/27/ultraleap-leap-motion-gemini-review/.