{{see also|Outside-in tracking|Markerless tracking|Positional tracking}}


==Introduction==
'''[[Markerless outside-in tracking]]''' is a form of [[positional tracking]] used in [[virtual reality]] (VR) and [[augmented reality]] (AR). External [[camera]]s or other [[depth sensing]] devices positioned in the environment estimate the six-degree-of-freedom ([[6DOF]]) [[pose]] of a user or object without relying on any [[fiducial marker]]s. Instead, [[computer vision]] algorithms analyse the incoming colour or depth stream to segment the scene, classify body parts, and fit a kinematic skeleton, enabling real-time [[motion capture]] and interaction.<ref name="Shotton2011" />
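To make the 6DOF output concrete, the short sketch below represents a tracked pose as a 3-D position plus a unit quaternion for orientation; the class name, field layout, and example values are illustrative assumptions rather than the output format of any particular tracking system.

<syntaxhighlight lang="python">
# Illustrative 6DOF pose container: three translational and three rotational
# degrees of freedom (the rotation stored as a unit quaternion).
from dataclasses import dataclass
import math


@dataclass
class Pose6DOF:
    x: float   # position in metres, in the external camera's coordinate frame
    y: float
    z: float
    qw: float  # unit quaternion (w, x, y, z) encoding orientation
    qx: float
    qy: float
    qz: float


def yaw_rotation(yaw_deg: float) -> tuple[float, float, float, float]:
    """Quaternion for a rotation of yaw_deg about the vertical (y) axis."""
    half = math.radians(yaw_deg) / 2.0
    return (math.cos(half), 0.0, math.sin(half), 0.0)


# Example: a head 1.7 m up and 2.5 m in front of the sensor, turned 30 degrees.
qw, qx, qy, qz = yaw_rotation(30.0)
head = Pose6DOF(0.0, 1.7, 2.5, qw, qx, qy, qz)
print(head)
</syntaxhighlight>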


==Underlying technology==
A typical markerless outside-in pipeline combines specialised hardware with software-based human-pose estimation:
* '''Sensing layer''' – One or more fixed [[RGB-D]] or [[infrared]] depth cameras stream per-frame point clouds. The original Microsoft Kinect projects a near-IR [[structured light]] pattern, whereas Kinect v2 and Azure Kinect use [[time-of-flight camera|time-of-flight]] ranging.<ref name="Zhang2012" /> The effective operating range of Kinect v1 is roughly 0.8–4.5 m (specification upper limit 5 m),<ref name="DepthRange2012" /> and measured end-to-end skeleton latency ranges from 60–90 ms depending on model and SDK settings.<ref name="Livingston2012" />
* '''Segmentation''' – Foreground extraction or person segmentation isolates user pixels from the static background.
* '''Per-pixel body-part classification''' – A machine-learning model labels each pixel as “head”, “hand”, “torso”, and so on (for example, the randomised decision forest used in the original Kinect).<ref name="Shotton2011" />
* '''Skeletal reconstruction and filtering''' – The system fits a kinematic skeleton to the classified pixels and applies temporal filtering to reduce jitter, producing smooth head- and hand-pose data that can drive VR/AR applications (see the sketch below).

Although a single depth camera can suffice, multi-camera rigs extend coverage and mitigate occlusion. Open-source and proprietary middleware (for example [[OpenNI]]/NiTE 2 or the [[Microsoft Kinect]] SDK) exposes joint-stream APIs for developers.<ref name="OpenNI2013" /><ref name="NiTE2013" />
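A minimal sketch of the filtering step is shown below: it applies simple exponential smoothing to noisy 3-D joint positions. The joint names, noise level, and smoothing constant are illustrative assumptions rather than the filter used by any particular SDK; production systems often use more elaborate schemes such as double-exponential or Kalman filtering.

<syntaxhighlight lang="python">
# Minimal sketch of per-joint temporal smoothing, as performed conceptually in
# the "skeletal reconstruction and filtering" stage. All names and constants
# here are illustrative.
import numpy as np


class JointSmoother:
    """Exponential smoothing of noisy 3-D joint positions to reduce jitter."""

    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha   # 0 < alpha <= 1; lower values smooth more but lag more
        self.state = {}      # joint name -> last filtered position

    def update(self, joints: dict) -> dict:
        """joints maps joint names to raw (x, y, z) positions in metres."""
        filtered = {}
        for name, raw in joints.items():
            raw = np.asarray(raw, dtype=float)
            prev = self.state.get(name, raw)             # first frame: no history
            smoothed = self.alpha * raw + (1.0 - self.alpha) * prev
            self.state[name] = smoothed
            filtered[name] = smoothed
        return filtered


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    smoother = JointSmoother(alpha=0.3)
    true_head = np.array([0.0, 1.7, 2.5])                # head ~2.5 m from the sensor
    for frame in range(5):
        noisy = {"head": true_head + rng.normal(scale=0.01, size=3)}  # ~1 cm noise
        print(frame, smoother.update(noisy)["head"].round(3))
</syntaxhighlight>

Lowering <code>alpha</code> suppresses more jitter but adds perceptible lag, which is the same accuracy-versus-latency trade-off discussed in the next section.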


==Markerless vs. marker-based tracking==
[[Outside-in tracking|Marker-based outside-in systems]] such as '''Vicon''' optical motion capture, [[HTC Vive]] [[Lighthouse]], and [[PlayStation VR]] attach retro-reflective spheres or active LEDs to the headset and controllers, or equip them with photodiodes that detect sweeping infrared lasers from external base stations, achieving sub-millimetre precision and motion-to-photon latency below 10 ms.<ref name="ViconSpec" /><ref name="Lighthouse2016" /> Markerless alternatives dispense with physical targets, improving user comfort and reducing setup time, but at the cost of:
 
* '''Lower positional accuracy and higher latency''' – Depth-sensor noise plus the roughly 60–90 ms processing pipeline produce millimetre- to centimetre-level error (see the rough estimate below).<ref name="Guffanti2020" />
* '''Sensitivity to occlusion''' – If a body part leaves the camera’s line of sight, the model loses track of it until it re-enters view.
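To give a sense of scale for the latency figures above, the back-of-envelope calculation below converts an assumed end-to-end delay into the apparent positional lag of a moving hand; both numbers are illustrative assumptions, not measurements from the cited studies.

<syntaxhighlight lang="python">
# Back-of-envelope estimate: how far a tracked hand appears to lag behind its
# true position for a given end-to-end latency. Values are illustrative.
latency_s = 0.075          # assumed 75 ms end-to-end skeleton latency
hand_speed_m_per_s = 1.0   # a moderate hand movement of 1 m/s

lag_m = hand_speed_m_per_s * latency_s
print(f"Apparent lag: {lag_m * 100:.1f} cm")   # -> Apparent lag: 7.5 cm
</syntaxhighlight>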


==History and notable systems==
{| class="wikitable"
! Year !! System !! Notes
|-
| 2003 || [[EyeToy]] (PlayStation 2) || 2-D silhouette tracking with a single RGB camera for casual gesture-based games.<ref name="EyeToy2004" />
|-
| 2010 || [[Kinect]] for Xbox 360 || First consumer structured-light depth sensor delivering real-time full-body skeletons (up to six users).<ref name="Microsoft2010" />
|-
| 2014–2016 || Research prototypes || Academic work showed Kinect v2 could supply 6-DOF head, hand, and body input to DIY VR HMDs.<ref name="Livingston2012" />
|-
| 2017 || Kinect production ends || Microsoft discontinued Kinect hardware as commercial VR shifted toward marker-based and inside-out solutions.<ref name="Microsoft2017" />
|}


==Applications==
* '''Gaming and entertainment''' – Titles such as ''Kinect Sports'' map whole-body actions onto avatars, and some enthusiast VR chat platforms still use Kinect skeletons to animate full-body avatars.
* '''Rehabilitation and exercise''' – Clinicians employ depth-based pose tracking to monitor range-of-motion exercises without encumbering patients with markers.<ref name="Wade2022" />
* '''Interactive installations''' – Museums deploy wall-mounted depth cameras to create “magic-mirror” AR exhibits that overlay virtual costumes onto visitors in real time.
* '''Telepresence''' – Multi-camera arrays stream volumetric representations of remote participants into shared virtual spaces.


==Advantages==
* '''No wearable markers''' – Users remain unencumbered, enhancing comfort and lowering entry barriers.
* '''Rapid setup''' – A single sensor can cover an entire play area; no lighthouse calibration or reflector placement is necessary.
* '''Multi-user support''' – Commodity depth cameras distinguish and skeletonise several people simultaneously.
* '''Lower hardware cost''' – RGB or RGB-D sensors are inexpensive compared with professional optical-mocap rigs.


==Disadvantages==
* '''Occlusion sensitivity''' – Furniture or other players can block the line of sight, causing intermittent loss of tracking.
* '''Reduced accuracy and latency''' – Compared with lighthouse or Vicon systems, joint estimates exhibit higher positional noise and roughly 60–90 ms latency, especially during fast or complex motion.<ref name="Guffanti2020" /><ref name="Lighthouse2016" />
* '''Environmental constraints''' – Bright sunlight, glossy surfaces, and feature-poor backgrounds degrade depth or feature-extraction quality.
* '''Limited range and FOV''' – Kinect-class sensors operate reliably only within roughly 0.8–4.5 m; beyond that, depth resolution and skeleton stability decrease.<ref name="DepthRange2012" />


==References==
<references>
<ref name="Shotton2011">Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. (2011). “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” ''Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)'', pp. 1297–1304. doi:10.1109/CVPR.2011.5995316. https://ieeexplore.ieee.org/document/5995316</ref>
<ref name="Zhang2012">Zhang, Z. (2012). “Microsoft Kinect Sensor and Its Effect.” ''IEEE MultiMedia'' 19 (2): 4–10. doi:10.1109/MMUL.2012.24.</ref>
<ref name="DepthRange2012">Khoshelham, K.; Elberink, S. (2012). “Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications.” ''Sensors'' 12 (2): 1437–1454.</ref>
<ref name="OpenNI2013">OpenNI Foundation (2010). ''OpenNI 1.5.2 User Guide''. https://www.cs.rochester.edu/courses/577/fall2011/kinect/openni-user-guide.pdf</ref>
<ref name="NiTE2013">OpenNI Foundation (2013). “NiTE 2.0 User Guide.”</ref>
<ref name="Livingston2012">Livingston, M. A.; ''et al.'' (2012). “Performance Measurements for the Microsoft Kinect Skeleton.” ''IEEE Virtual Reality 2012 Workshops''.</ref>
<ref name="Guffanti2020">Guffanti, D.; ''et al.'' (2020). “Accuracy of the Microsoft Kinect V2 Sensor for Human Gait Analysis.” ''Sensors'' 20 (16): 4405.</ref>
<ref name="ViconSpec">Vicon Motion Systems. “Vicon Tracker – Latency down to 2.5 ms.” Product sheet.</ref>
<ref name="Lighthouse2016">Malventano, A. (2016). “SteamVR HTC Vive In-depth – Lighthouse Tracking System Dissected.” ''PC Perspective''.</ref>
<ref name="EyeToy2004">Pham, A. (18 January 2004). “EyeToy Springs From One Man’s Vision.” ''Los Angeles Times''. https://www.latimes.com/archives/la-xpm-2004-jan-18-fi-eyetoy18-story.html</ref>
<ref name="Microsoft2010">Microsoft News Center (4 November 2010). “The Future of Entertainment Starts Today as Kinect for Xbox 360 Leaps and Lands at Retailers Nationwide.” Press release. https://news.microsoft.com/2010/11/04/the-future-of-entertainment-starts-today-as-kinect-for-xbox-360-leaps-and-lands-at-retailers-nationwide/</ref>
<ref name="Wade2022">Wade, L.; ''et al.'' (2022). “Applications and Limitations of Current Markerless Motion Capture Methods for Clinical Gait Biomechanics.” ''PeerJ'' 10: e12995.</ref>
<ref name="Microsoft2017">Good, O. S. (25 October 2017). “Kinect Is Officially Dead. Really. Officially. It’s Dead.” ''Polygon''. https://www.polygon.com/2017/10/25/16543192/kinect-discontinued-microsoft-announcement</ref>
</references>


[[Category:Terms]]