A common approach uses one or more infrared or RGB cameras to visually capture the hands and then applies computer vision algorithms to recognize the hand's pose (the positions of the palm and each finger joint) in 3D space. Advanced [[machine learning]] models are often trained to detect keypoints of the hand (such as knuckle and fingertip positions) from the camera images, reconstructing an articulated hand model that updates as the user moves. A typical pipeline includes:


#'''Detection''': Find hands in the camera frame (often with a palm detector)
#'''Landmark regression''': Predict 2D/3D keypoints for wrist and finger joints (commonly 21 landmarks per hand in widely used models)<ref name="MediaPipeHands" />
#'''Pose / mesh estimation''': Fit a kinematic skeleton or hand mesh consistent with human biomechanics for stable interaction and animation
#'''Temporal smoothing & prediction''': Filter jitter and manage short occlusions for responsive feedback (a minimal smoothing sketch follows this list)
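
For the smoothing step, even a simple per-landmark exponential filter can noticeably reduce frame-to-frame jitter. The following is a minimal sketch under that assumption; the <code>Landmark</code> container and the blend factor <code>alpha</code> are illustrative, and production trackers may use more adaptive filters together with short-term prediction to bridge brief occlusions.

<syntaxhighlight lang="python">
# Minimal sketch of the temporal-smoothing step: per-landmark exponential
# smoothing. The Landmark container and the alpha value are illustrative
# assumptions, not part of any specific hand-tracking runtime.
from dataclasses import dataclass


@dataclass
class Landmark:
    x: float
    y: float
    z: float


class LandmarkSmoother:
    """Blends each incoming landmark with its previous smoothed value."""

    def __init__(self, alpha: float = 0.5):
        # Higher alpha follows the raw data more closely (less smoothing);
        # lower alpha suppresses more jitter at the cost of added latency.
        self.alpha = alpha
        self._previous: list[Landmark] | None = None

    def update(self, landmarks: list[Landmark]) -> list[Landmark]:
        # First frame (or a change in landmark count) passes through unchanged.
        if self._previous is None or len(self._previous) != len(landmarks):
            self._previous = list(landmarks)
            return list(landmarks)
        a = self.alpha
        smoothed = [
            Landmark(a * cur.x + (1 - a) * prev.x,
                     a * cur.y + (1 - a) * prev.y,
                     a * cur.z + (1 - a) * prev.z)
            for cur, prev in zip(landmarks, self._previous)
        ]
        self._previous = smoothed
        return smoothed
</syntaxhighlight>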


This positional data is then provided to the VR/AR system (often through standard interfaces like [[OpenXR]]) so that applications can respond to the user's hand gestures and contacts with virtual objects. Google's MediaPipe Hands, for example, infers 21 3D landmarks per hand from a single RGB frame and runs in real time on mobile-class hardware, illustrating the efficiency of modern approaches.<ref name="MediaPipeHands" />
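
As a rough illustration of consuming such landmark data in application code, the sketch below uses the MediaPipe Hands Python solution with OpenCV to read webcam frames, access the 21 landmarks of each detected hand, and test for a simple thumb-index pinch. The camera index, confidence values, and pinch threshold are illustrative choices, and a VR/AR application would typically obtain equivalent joint data through runtime interfaces such as OpenXR rather than running its own model.

<syntaxhighlight lang="python">
# Minimal sketch: run MediaPipe Hands on webcam frames, read the 21 landmarks
# of each detected hand, and report a simple pinch gesture. The camera index,
# confidence thresholds, and pinch threshold are illustrative assumptions.
import math

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
capture = cv2.VideoCapture(0)  # default webcam (assumption)

with mp_hands.Hands(max_num_hands=2,
                    min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while capture.isOpened():
        ok, frame_bgr = capture.read()
        if not ok:
            break
        # MediaPipe expects RGB input; OpenCV delivers BGR frames.
        results = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
        if not results.multi_hand_landmarks:
            continue
        for hand in results.multi_hand_landmarks:
            # 21 landmarks per hand: normalized x/y image coordinates,
            # with z as depth relative to the wrist.
            thumb_tip = hand.landmark[mp_hands.HandLandmark.THUMB_TIP]
            index_tip = hand.landmark[mp_hands.HandLandmark.INDEX_FINGER_TIP]
            pinch_distance = math.hypot(thumb_tip.x - index_tip.x,
                                        thumb_tip.y - index_tip.y)
            if pinch_distance < 0.05:  # illustrative threshold
                print("pinch detected")

capture.release()
</syntaxhighlight>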