Markerless outside-in tracking
- See also: Outside-in tracking, Markerless tracking and Positional tracking
Introduction
Markerless outside-in tracking is a form of positional tracking for virtual reality (VR) and augmented reality (AR) that estimates a user’s six-degree-of-freedom pose from externally mounted depth-sensing or RGB cameras without requiring any fiducial markers. Instead, per-frame depth or colour images are processed by computer vision algorithms that segment the scene, classify body parts and fit a kinematic skeleton, enabling real-time motion capture and interaction.[1]
Underlying technology
A typical pipeline combines specialised hardware with software-based human-pose estimation:
- Sensing layer – One or more fixed RGB-D or infra-red depth cameras stream point clouds. The original Microsoft Kinect projects a near-IR structured-light pattern, whereas the Kinect v2 and Azure Kinect use time-of-flight ranging.[2] The effective operating range for Kinect v1 is ≈ 0.8 – 4.5 m (specification upper limit 5 m).[3]
- Segmentation – Foreground extraction isolates user pixels from background geometry.
- Per-pixel body-part classification – A Randomised Decision Forest labels each pixel (head, hand, torso, …).[1]
- Skeletal reconstruction and filtering – Joint positions are inferred from the labelled pixels and temporally filtered to reduce jitter, producing head- and hand-pose data consumable by VR/AR engines (a simplified sketch of the classification and smoothing steps follows this list).
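As a rough illustration of the classification and filtering stages, the Python sketch below approximates the published approach[1] with a depth-difference feature, a centroid-based joint proposal and exponential smoothing. The body-part labels, probe offsets, classifier choice and smoothing constant are illustrative assumptions, not the actual Kinect pipeline.

```python
# Illustrative sketch of per-pixel body-part features, joint proposals and
# temporal smoothing. Labels, offsets and the classifier are placeholders,
# not the Kinect implementation.
import numpy as np

PARTS = {1: "head", 2: "torso", 3: "left_hand", 4: "right_hand"}  # toy label set

def depth_difference_features(depth, pixels, offsets):
    """Shotton-style features: the depth difference between two probe points,
    with offsets scaled by 1/depth so the feature is roughly depth-invariant.
    `pixels` is an (N, 2) array of foreground (row, col) coordinates from the
    segmentation step; `offsets` is a list of ((dy1, dx1), (dy2, dx2)) pairs."""
    h, w = depth.shape
    feats = np.empty((len(pixels), len(offsets)), dtype=np.float32)
    for i, (y, x) in enumerate(pixels):
        d = max(float(depth[y, x]), 1e-3)  # guard against invalid (zero) depth
        for j, (u, v) in enumerate(offsets):
            p1 = np.clip([y + u[0] / d, x + u[1] / d], 0, [h - 1, w - 1]).astype(int)
            p2 = np.clip([y + v[0] / d, x + v[1] / d], 0, [h - 1, w - 1]).astype(int)
            feats[i, j] = depth[p1[0], p1[1]] - depth[p2[0], p2[1]]
    return feats

# A per-pixel classifier (e.g. sklearn.ensemble.RandomForestClassifier) trained
# offline on labelled depth images would consume these features per frame.

def joints_from_labels(depth, pixels, labels):
    """Propose one joint per part as the centroid of its labelled pixels
    (the published method uses mean shift on a density estimate instead)."""
    joints = {}
    for part, name in PARTS.items():
        mask = labels == part
        if mask.any():
            pts = pixels[mask]
            z = float(depth[pts[:, 0], pts[:, 1]].mean())
            joints[name] = (float(pts[:, 0].mean()), float(pts[:, 1].mean()), z)
    return joints

class JointSmoother:
    """Exponential smoothing of joint positions to reduce frame-to-frame jitter."""
    def __init__(self, alpha=0.5):
        self.alpha, self.state = alpha, {}
    def update(self, joints):
        for name, pos in joints.items():
            prev = self.state.get(name, pos)
            self.state[name] = tuple(self.alpha * p + (1 - self.alpha) * q
                                     for p, q in zip(pos, prev))
        return dict(self.state)
```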
Although a single depth camera can suffice, multi-camera rigs expand coverage and reduce occlusions. Open-source and proprietary middleware (e.g., OpenNI/NiTE 2, Microsoft Kinect SDK) expose joint-stream APIs for developers.[4] Measured end-to-end skeleton latency for Kinect ranges from roughly 60 to 90 ms, depending on model and SDK settings.[5]
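A minimal sketch of how such a rig might be fused is shown below, assuming each camera's middleware reports named joints with confidence values and that camera-to-world extrinsics are known from calibration; all names and numbers are placeholders rather than any particular SDK's API.

```python
# Illustrative fusion of skeletons reported by two fixed depth cameras.
# The extrinsics, joint dictionaries and confidences are placeholders for
# whatever calibration and middleware output a real rig would provide.
import numpy as np

def to_world(joint_xyz, extrinsic):
    """Apply a 4x4 camera-to-world transform to a 3-D joint position."""
    p = np.append(np.asarray(joint_xyz, dtype=float), 1.0)
    return (extrinsic @ p)[:3]

def fuse_skeletons(skeletons, extrinsics):
    """Average joints seen by several cameras, weighted by confidence, so a
    part occluded in one view can still be tracked from another."""
    fused = {}
    for skel, ext in zip(skeletons, extrinsics):
        for name, (xyz, conf) in skel.items():
            w_xyz = to_world(xyz, ext)
            if name in fused:
                old_xyz, old_conf = fused[name]
                total = old_conf + conf
                fused[name] = ((old_xyz * old_conf + w_xyz * conf) / total, total)
            else:
                fused[name] = (w_xyz, conf)
    return {name: xyz for name, (xyz, _) in fused.items()}

# Toy example: camera A defines the world frame; camera B is rotated 90° about
# the vertical axis. The left hand is occluded from camera B's viewpoint.
cam_a = {"head": ((0.1, 1.6, 2.0), 0.9), "left_hand": ((-0.3, 1.1, 1.8), 0.7)}
cam_b = {"head": ((-2.0, 1.6, 0.1), 0.8)}
ext_a = np.eye(4)
ext_b = np.array([[0., 0., 1., 0.],
                  [0., 1., 0., 0.],
                  [-1., 0., 0., 0.],
                  [0., 0., 0., 1.]])
print(fuse_skeletons([cam_a, cam_b], [ext_a, ext_b]))
```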
Markerless vs. marker-based tracking
Marker-based outside-in systems such as **Vicon** optical motion capture or the HTC Vive **Lighthouse** system rely on retro-reflective spheres attached to the tracked object, or on-device photodiodes that detect sweeping IR lasers from the base stations, achieving sub-millimetre precision and motion-to-photon latency below 10 ms.[6][7] Markerless alternatives remove physical targets, improving comfort and setup time, but at the cost of:
- Lower positional accuracy and higher latency – Depth-sensor noise plus the 60 – 90 ms processing pipeline produce millimetre- to centimetre-level error (see the worked example after this list).[8]
- Sensitivity to occlusion – Body parts outside the camera’s line-of-sight are temporarily lost.
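To put the latency figure in perspective (an illustrative calculation, not a measured value): a hand moving at about 1 m/s is displaced by roughly 1 m/s × 0.075 s ≈ 7.5 cm over a 75 ms markerless pipeline, whereas a sub-10 ms marker-based pipeline keeps the corresponding lag below about 1 cm.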
History and notable systems
| Year | System | Notes |
| --- | --- | --- |
| 2003 | EyeToy (PlayStation 2) | 2-D silhouette tracking with a single RGB camera.[9] |
| 2010 | Kinect for Xbox 360 | First consumer structured-light depth sensor with real-time full-body skeletons (up to six users detected, two tracked with full skeletons).[10] |
| 2014 – 2016 | Research prototypes | Academic work showed the Kinect v2 could deliver 6-DOF head- and hand-pose input for DIY VR HMDs.[5] |
| 2017 | Kinect production ends | Microsoft discontinued Kinect hardware as commercial VR shifted toward inside-out and marker-based solutions.[11] |
Applications
- **Gaming & entertainment** – Titles such as Kinect Sports map whole-body actions to avatars; some VR chat platforms still use Kinect skeletons.
- **Rehabilitation & exercise** – Clinicians monitor range-of-motion without attaching markers (see the joint-angle sketch after this list).[12]
- **Interactive installations** – Depth cameras create “magic-mirror” AR exhibits in museums.
- **Telepresence** – Multi-camera arrays stream volumetric avatars into shared virtual spaces.
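As a hedged sketch of how skeleton output supports range-of-motion assessment, the fragment below computes an elbow flexion angle from three 3-D joint positions; the joint names and coordinates are hypothetical and not tied to a specific SDK.

```python
# Illustrative range-of-motion measure: an elbow flexion angle computed from
# three 3-D joint positions. Joint names and coordinates are placeholders.
import numpy as np

def joint_angle(a, b, c):
    """Angle (degrees) at joint b formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# Example frame (metres): shoulder, elbow and wrist of the right arm.
shoulder, elbow, wrist = (0.20, 1.45, 2.0), (0.25, 1.20, 2.0), (0.45, 1.05, 2.0)
print(f"Elbow flexion: {joint_angle(shoulder, elbow, wrist):.1f} deg")
```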
Advantages
- No wearable markers.
- Rapid single-sensor setup; no Lighthouse base-station calibration.
- Simultaneous multi-user support.
- Lower hardware cost than professional optical mocap rigs.
Disadvantages
- Occlusion sensitivity – furniture or other players can block tracking.
- Reduced accuracy and 60 – 90 ms latency compared with Lighthouse or Vicon systems.[8][7]
- Environmental constraints – bright sunlight or glossy surfaces degrade depth quality.
- Limited range and FOV – reliable only within ≈ 0.8 – 4.5 m for Kinect-class sensors.[3]
References
1. Shotton J. et al. (2011). “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” *CVPR 2011.*
2. Zhang Z. (2012). “Microsoft Kinect Sensor and Its Effect.” *IEEE MultiMedia* 19 (2): 4–10.
3. Khoshelham K.; Elberink S. (2012). “Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications.” *Sensors* 12 (2): 1437–1454.
4. OpenNI Foundation (2013). “NiTE 2.0 User Guide.”
5. Livingston M. A. et al. (2012). “Performance Measurements for the Microsoft Kinect Skeleton.” *IEEE VR 2012 Workshop.*
6. Vicon Motion Systems. “Vicon Tracker – Latency down to 2.5 ms.” Product sheet.
7. Malventano A. (2016). “SteamVR HTC Vive In-depth – Lighthouse Tracking System Dissected.” *PC Perspective.*
8. Guffanti D. et al. (2020). “Accuracy of the Microsoft Kinect V2 Sensor for Human Gait Analysis.” *Sensors* 20 (16): 4405.
9. Pham A. (2004-01-18). “EyeToy Springs From One Man’s Vision.” *Los Angeles Times.*
10. Microsoft News Center (2010-11-04). “The Future of Entertainment Starts Today as Kinect for Xbox 360 Leaps and Lands at Retailers Nationwide.”
11. Good O. S. (2017-10-25). “Kinect is officially dead. Really. Officially. It’s dead.” *Polygon.*
12. Wade L. et al. (2022). “Applications and Limitations of Current Markerless Motion Capture Methods for Clinical Gait Biomechanics.” *PeerJ* 10: e12995.