{{see also|Outside-in tracking|Markerless tracking|Positional tracking}}


==Introduction==
'''[[Markerless outside-in tracking]]''' is a form of [[positional tracking]] for [[virtual reality]] (VR) and [[augmented reality]] (AR) that estimates a user’s six-degree-of-freedom pose from externally mounted [[depth sensing|depth-sensing]] or RGB cameras without requiring any [[fiducial marker]]s. Instead, per-frame depth or colour images are processed by [[computer vision]] algorithms that segment the scene, classify body parts and fit a kinematic skeleton, enabling real-time [[motion capture]] and interaction.<ref name="Shotton2011" />


==Underlying technology==
A typical markerless outside-in pipeline combines specialised hardware with software-based human-pose estimation (a minimal code sketch follows below):
* '''Sensing layer''' – One or more fixed [[RGB-D]] or [[infrared]] depth cameras stream per-frame point clouds. The original Microsoft Kinect projects a near-IR [[structured light]] pattern, whereas Kinect V2 and Azure Kinect use [[time-of-flight camera|time-of-flight]] ranging.<ref name="Zhang2012" /> The effective operating range for Kinect v1 is ≈ 0.8–4.5 m (specification upper limit 5 m).<ref name="DepthRange2012" />
* '''Segmentation''' – Foreground extraction or person segmentation isolates user pixels from the static background.
* '''Per-pixel body-part classification''' – A machine-learning model labels each pixel as “head”, “hand”, “torso”, and so on (e.g., the Randomised Decision Forest used in the original Kinect).<ref name="Shotton2011" />
* '''Skeletal reconstruction and filtering''' – The system fits a kinematic skeleton to the classified pixels and applies temporal filtering to reduce jitter, producing smooth head- and hand-pose data that can drive VR/AR applications.

Although a single depth camera can suffice, multi-camera rigs extend coverage and mitigate occlusions. Open-source and proprietary middleware (e.g., [[OpenNI]]/NiTE 2, the Microsoft Kinect SDK) expose joint-stream APIs for developers.<ref name="NiTE2013" /> Measured end-to-end skeleton latency for Kinect ranges from 60–90 ms, depending on model and SDK settings.<ref name="Livingston2012" />
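The stages above can be illustrated with a minimal, self-contained Python sketch. It is not based on any particular SDK: the depth threshold, the height-based stub classifier and the smoothing constant are placeholder assumptions standing in for a trained per-pixel model and a tuned filter.

<syntaxhighlight lang="python">
import numpy as np

BODY_PARTS = ("head", "torso", "left_hand", "right_hand")

def segment_user(depth_frame, background, tolerance_mm=50):
    """Foreground extraction: pixels whose depth differs from the static
    background by more than the tolerance are treated as user pixels."""
    return np.abs(depth_frame - background) > tolerance_mm

def classify_pixels(depth_frame, mask):
    """Per-pixel body-part classification.  A real system would use a
    trained model (e.g. a randomised decision forest); here every
    foreground pixel is labelled by a crude image-height heuristic."""
    labels = np.full(depth_frame.shape, -1, dtype=int)
    rows = np.arange(depth_frame.shape[0])[:, None]
    height = rows / depth_frame.shape[0]            # 0.0 = top of image
    labels[mask & (height < 0.2)] = BODY_PARTS.index("head")
    labels[mask & (height >= 0.2)] = BODY_PARTS.index("torso")
    return labels

def estimate_joints(labels):
    """Skeletal reconstruction: each joint is approximated by the
    centroid of the pixels classified as that body part."""
    joints = {}
    for idx, part in enumerate(BODY_PARTS):
        ys, xs = np.nonzero(labels == idx)
        if len(xs) > 0:
            joints[part] = np.array([xs.mean(), ys.mean()])
    return joints

def smooth(prev, current, alpha=0.3):
    """Temporal filtering: exponential smoothing trades a little extra
    latency for reduced joint jitter."""
    if prev is None:
        return current
    return {p: alpha * current[p] + (1 - alpha) * prev.get(p, current[p])
            for p in current}

# Toy usage with a synthetic 240x320 depth frame (values in millimetres).
background = np.full((240, 320), 4000.0)
frame = background.copy()
frame[40:200, 140:180] = 2000.0        # a "user" standing about 2 m away
mask = segment_user(frame, background)
joints = estimate_joints(classify_pixels(frame, mask))
print(smooth(None, joints))
</syntaxhighlight>

In a real system the heuristic classifier would be replaced by the trained per-pixel model and the exponential filter by a more sophisticated smoother, but the data flow (depth frame → foreground mask → per-pixel labels → joint estimates → filtered pose) is the same.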


==Markerless vs. marker-based tracking==
Marker-based outside-in systems such as '''Vicon''' optical motion capture, PlayStation VR and the HTC Vive '''Lighthouse''' attach active LEDs or retro-reflective spheres that external sensors triangulate, or use on-device photodiodes that read sweeping IR lasers from base stations, achieving sub-millimetre precision and motion-to-photon latency below 10 ms.<ref name="ViconSpec" /><ref name="Lighthouse2016" />

Markerless alternatives dispense with physical targets, improving user comfort and reducing setup time, but at the cost of:
* '''Lower positional accuracy and higher latency''' – Depth-sensor noise plus the 60–90 ms processing pipeline produce millimetre- to centimetre-level error (see the worked example after this list).<ref name="Guffanti2020" />
* '''Sensitivity to occlusion''' – If a body part leaves the camera’s line of sight, the model loses track until the part re-enters view.
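To put these figures in perspective, the following back-of-the-envelope calculation (assuming a 90 Hz headset refresh rate, which is typical of desktop VR HMDs) converts the latency numbers quoted above into frames of lag behind the user’s real motion:

<syntaxhighlight lang="python">
# Back-of-the-envelope latency comparison (illustrative numbers only).
hmd_refresh_hz = 90                       # assumed headset refresh rate
frame_time_ms = 1000 / hmd_refresh_hz     # ≈ 11.1 ms per displayed frame

marker_based_ms = 10                      # Lighthouse-class motion-to-photon latency
markerless_ms = (60, 90)                  # measured Kinect skeleton latency range

print(f"Marker-based lag: {marker_based_ms / frame_time_ms:.1f} frames")
for ms in markerless_ms:
    print(f"Markerless lag at {ms} ms: {ms / frame_time_ms:.1f} frames")
# Roughly 0.9 frames of lag versus 5–8 frames of lag.
</syntaxhighlight>

At roughly one frame of lag, marker-based tracking stays within the display’s refresh budget, whereas a 60–90 ms markerless pipeline trails the user by several displayed frames.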


==History and notable systems==
{| class="wikitable"
! Year !! System !! Notes
|-
| 2003 || [[EyeToy]] (PlayStation 2) || 2-D silhouette tracking with a single RGB camera for casual gesture-based games.<ref name="EyeToy2004" />
|-
| 2010 || [[Kinect]] for Xbox 360 || First consumer structured-light depth sensor delivering real-time full-body skeletons (up to six users).<ref name="Microsoft2010" />
|-
| 2014–2016 || Research prototypes || Academic work showed Kinect V2 could supply 6-DOF head-, hand- and body-pose input for DIY VR HMDs.<ref name="Livingston2012" />
|-
| 2017 || Kinect production ends || Microsoft discontinued Kinect hardware as commercial VR shifted toward inside-out and marker-based solutions.<ref name="Microsoft2017" />
|}


==Applications==
* '''Gaming and entertainment''' – Titles such as ''Kinect Sports'' map whole-body actions directly onto avatars; enthusiast VR chat platforms still use Kinect skeletons to animate full-body avatars.
* '''Rehabilitation and exercise''' – Clinicians employ depth-based pose tracking to monitor range-of-motion exercises without encumbering patients with sensors.<ref name="Wade2022" />
* '''Interactive installations''' – Museums deploy wall-mounted depth cameras to create “magic-mirror” AR exhibits that overlay virtual costumes onto visitors in real time.
* '''Telepresence''' – Multi-camera arrays stream volumetric representations of remote participants into shared virtual spaces.


==Advantages==
* '''No wearable markers''' – Users remain unencumbered, enhancing comfort and lowering entry barriers.
* '''Rapid setup''' – A single sensor covers an entire play area; no lighthouse calibration or reflector placement is necessary.
* '''Multi-user support''' – Commodity depth cameras distinguish and skeletonise several people simultaneously.
* '''Lower hardware cost''' – RGB or RGB-D sensors are inexpensive compared with professional optical-mocap rigs.


==Disadvantages==
* '''Occlusion sensitivity''' – Furniture or other players can block the line of sight, causing intermittent loss of tracking (a simple fallback sketch follows this list).
* '''Reduced accuracy and jitter''' – Compared with marker-based solutions, joint estimates exhibit higher positional noise and 60–90 ms of skeleton latency, especially during fast or complex motion.<ref name="Guffanti2020" /><ref name="Lighthouse2016" />
* '''Environmental constraints''' – Bright sunlight, glossy surfaces, and feature-poor backgrounds degrade depth or feature-extraction quality.
* '''Limited range and FOV''' – Most consumer depth cameras operate reliably only within ≈ 0.8–4.5 m; beyond that, depth resolution and skeleton stability decrease.<ref name="DepthRange2012" />
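Applications usually have to cope with the occlusion case explicitly rather than rely on the sensor. A minimal, hypothetical sketch of a common coping strategy is shown below; the Joint structure, the confidence field and the 0.5 threshold are illustrative assumptions, not part of any specific SDK:

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Joint:
    x: float
    y: float
    z: float
    confidence: float        # 0.0 (not seen) … 1.0 (fully tracked)

def select_pose(current: dict, last_valid: dict, min_conf: float = 0.5) -> dict:
    """Per-joint fallback: keep the freshly tracked joint when its
    confidence is high enough, otherwise reuse the last valid estimate
    so the avatar does not snap or disappear during brief occlusions."""
    merged = {}
    for name, joint in current.items():
        if joint.confidence >= min_conf or name not in last_valid:
            merged[name] = joint
        else:
            merged[name] = last_valid[name]   # occluded: hold last pose
    return merged

# Example: the right hand is momentarily hidden behind the user's body.
last = {"right_hand": Joint(0.30, 1.10, 2.00, 0.9)}
now = {"right_hand": Joint(0.00, 0.00, 0.00, 0.1)}   # low-confidence reading
print(select_pose(now, last))                         # keeps the previous joint
</syntaxhighlight>

Holding the last valid estimate avoids visible avatar snapping during brief occlusions, at the cost of a limb appearing frozen if the occlusion persists.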


==References==
<references>
<ref name="Shotton2011">Shotton J. ''et al.'' (2011). “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” ''CVPR 2011''.</ref>
<ref name="Zhang2012">Zhang Z. (2012). “Microsoft Kinect Sensor and Its Effect.” ''IEEE MultiMedia'' 19 (2): 4–10.</ref>
<ref name="DepthRange2012">Khoshelham K.; Elberink S. (2012). “Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications.” ''Sensors'' 12 (2): 1437–1454.</ref>
<ref name="NiTE2013">OpenNI Foundation (2013). “NiTE 2.0 User Guide.”</ref>
<ref name="Livingston2012">Livingston M. A. ''et al.'' (2012). “Performance Measurements for the Microsoft Kinect Skeleton.” ''IEEE VR 2012 Workshop''.</ref>
<ref name="Guffanti2020">Guffanti D. ''et al.'' (2020). “Accuracy of the Microsoft Kinect V2 Sensor for Human Gait Analysis.” ''Sensors'' 20 (16): 4405.</ref>
<ref name="ViconSpec">Vicon Motion Systems. “Vicon Tracker – Latency down to 2.5 ms.” Product sheet.</ref>
<ref name="Lighthouse2016">Malventano A. (2016). “SteamVR HTC Vive In-depth – Lighthouse Tracking System Dissected.” ''PC Perspective''.</ref>
<ref name="EyeToy2004">Pham A. (2004-01-18). “EyeToy Springs From One Man’s Vision.” ''Los Angeles Times''.</ref>
<ref name="Microsoft2010">Microsoft News Center (2010-11-04). “The Future of Entertainment Starts Today as Kinect for Xbox 360 Leaps and Lands at Retailers Nationwide.”</ref>
<ref name="Wade2022">Wade L. ''et al.'' (2022). “Applications and Limitations of Current Markerless Motion Capture Methods for Clinical Gait Biomechanics.” ''PeerJ'' 10:e12995.</ref>
<ref name="Microsoft2017">Good O. S. (2017-10-25). “Kinect is officially dead. Really. Officially. It’s dead.” ''Polygon''.</ref>
</references>
 


[[Category:Terms]]