{{see also|Outside-in tracking|Markerless tracking|Positional tracking}}


==Introduction==
'''[[Markerless outside-in tracking]]''' is a form of [[positional tracking]] for [[virtual reality]] (VR) and [[augmented reality]] (AR) that estimates a user’s six-degree-of-freedom pose from externally mounted [[depth sensing|depth-sensing]] or RGB cameras without requiring any [[fiducial marker]]s. Instead, per-frame depth or colour images are processed by [[computer vision]] algorithms that segment the scene, classify body parts and fit a kinematic skeleton, enabling real-time [[motion capture]] and interaction.<ref name="Shotton2011" />


==Underlying technology==
A typical markerless outside-in pipeline combines specialised hardware with software-based human-pose estimation (a minimal code sketch follows below):
* '''Sensing layer''' – One or more fixed [[RGB-D]] or [[infrared]] depth cameras stream per-frame point clouds. The original Microsoft Kinect projects a near-IR [[structured light]] pattern, whereas Kinect V2 and Azure Kinect use [[time-of-flight camera|time-of-flight]] ranging.<ref name="Zhang2012" /> The effective operating range for Kinect v1 is ≈ 0.8–4.5 m (specification upper limit 5 m).<ref name="DepthRange2012" />
* '''Segmentation''' – Foreground extraction or person segmentation isolates user pixels from the static background.
* '''Per-pixel body-part classification''' – A machine-learning model labels each pixel as “head”, “hand”, “torso”, and so on (e.g., the Randomised Decision Forest used in the original Kinect).<ref name="Shotton2011" />
* '''Skeletal reconstruction and filtering''' – The system fits a kinematic skeleton to the classified pixels and applies temporal filtering to reduce jitter, producing smooth head- and hand-pose data that can drive VR/AR applications.

Although a single depth camera can suffice, multi-camera rigs extend coverage and mitigate occlusions. Open-source and proprietary middleware (e.g., [[OpenNI]]/NiTE 2, the Microsoft Kinect SDK) expose joint-stream APIs for developers.<ref name="NiTE2013" /> Measured end-to-end skeleton latency for Kinect ranges from 60–90 ms, depending on model and SDK settings.<ref name="Livingston2012" />
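The stages above can be illustrated with a minimal, self-contained Python sketch. It is not based on any particular SDK: the depth threshold, the height-based stub classifier and the smoothing constant are placeholder assumptions standing in for a trained per-pixel model and a tuned filter.

<syntaxhighlight lang="python">
import numpy as np

BODY_PARTS = ("head", "torso", "left_hand", "right_hand")

def segment_user(depth_frame, background, tolerance_mm=50):
    """Foreground extraction: pixels whose depth differs from the static
    background by more than the tolerance are treated as user pixels."""
    return np.abs(depth_frame - background) > tolerance_mm

def classify_pixels(depth_frame, mask):
    """Per-pixel body-part classification.  A real system would use a
    trained model (e.g. a randomised decision forest); here every
    foreground pixel is labelled by a crude image-height heuristic."""
    labels = np.full(depth_frame.shape, -1, dtype=int)
    rows = np.arange(depth_frame.shape[0])[:, None]
    height = rows / depth_frame.shape[0]            # 0.0 = top of image
    labels[mask & (height < 0.2)] = BODY_PARTS.index("head")
    labels[mask & (height >= 0.2)] = BODY_PARTS.index("torso")
    return labels

def estimate_joints(labels):
    """Skeletal reconstruction: each joint is approximated by the
    centroid of the pixels classified as that body part."""
    joints = {}
    for idx, part in enumerate(BODY_PARTS):
        ys, xs = np.nonzero(labels == idx)
        if len(xs) > 0:
            joints[part] = np.array([xs.mean(), ys.mean()])
    return joints

def smooth(prev, current, alpha=0.3):
    """Temporal filtering: exponential smoothing trades a little extra
    latency for reduced joint jitter."""
    if prev is None:
        return current
    return {p: alpha * current[p] + (1 - alpha) * prev.get(p, current[p])
            for p in current}

# Toy usage with a synthetic 240x320 depth frame (values in millimetres).
background = np.full((240, 320), 4000.0)
frame = background.copy()
frame[40:200, 140:180] = 2000.0        # a "user" standing about 2 m away
mask = segment_user(frame, background)
joints = estimate_joints(classify_pixels(frame, mask))
print(smooth(None, joints))
</syntaxhighlight>

In a real system the heuristic classifier would be replaced by the trained per-pixel model and the exponential filter by a more sophisticated smoother, but the data flow (depth frame → foreground mask → per-pixel labels → joint estimates → filtered pose) is the same.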


==Markerless vs. marker-based tracking==
Marker-based outside-in systems such as '''Vicon''' optical motion capture, PlayStation VR and the HTC Vive '''Lighthouse''' attach active LEDs or retro-reflective spheres that external sensors triangulate, or use on-device photodiodes that read sweeping IR lasers from base stations, achieving sub-millimetre precision and motion-to-photon latency below 10 ms.<ref name="ViconSpec" /><ref name="Lighthouse2016" />

Markerless alternatives dispense with physical targets, improving user comfort and reducing setup time, but at the cost of:
* '''Lower positional accuracy and higher latency''' – Depth-sensor noise plus the 60–90 ms processing pipeline produce millimetre- to centimetre-level error (see the worked example after this list).<ref name="Guffanti2020" />
* '''Sensitivity to occlusion''' – If a body part leaves the camera’s line of sight, the model loses track until the part re-enters view.
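To put these figures in perspective, the following back-of-the-envelope calculation (assuming a 90 Hz headset refresh rate, which is typical of desktop VR HMDs) converts the latency numbers quoted above into frames of lag behind the user’s real motion:

<syntaxhighlight lang="python">
# Back-of-the-envelope latency comparison (illustrative numbers only).
hmd_refresh_hz = 90                       # assumed headset refresh rate
frame_time_ms = 1000 / hmd_refresh_hz     # ≈ 11.1 ms per displayed frame

marker_based_ms = 10                      # Lighthouse-class motion-to-photon latency
markerless_ms = (60, 90)                  # measured Kinect skeleton latency range

print(f"Marker-based lag: {marker_based_ms / frame_time_ms:.1f} frames")
for ms in markerless_ms:
    print(f"Markerless lag at {ms} ms: {ms / frame_time_ms:.1f} frames")
# Roughly 0.9 frames of lag versus 5–8 frames of lag.
</syntaxhighlight>

At roughly one frame of lag, marker-based tracking stays within the display’s refresh budget, whereas a 60–90 ms markerless pipeline trails the user by several displayed frames.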


==History and notable systems==
{| class="wikitable"
! Year !! System !! Notes
|-
| 2003 || [[EyeToy]] (PlayStation 2) || 2-D silhouette tracking with a single RGB camera for casual gesture-based games.<ref name="EyeToy2004" />
|-
| 2010 || [[Kinect]] for Xbox 360 || First consumer structured-light depth sensor delivering real-time full-body skeletons (up to six users).<ref name="Microsoft2010" />
|-
| 2014–2016 || Research prototypes || Academic work showed Kinect V2 could supply 6-DOF head-, hand- and body-pose input for DIY VR HMDs.<ref name="Livingston2012" />
|-
| 2017 || Kinect production ends || Microsoft discontinued Kinect hardware as commercial VR shifted toward inside-out and marker-based solutions.<ref name="Microsoft2017" />
|}


==Applications==
* '''Gaming and entertainment''' – Titles such as ''Kinect Sports'' map whole-body actions directly onto avatars; enthusiast VR chat platforms still use Kinect skeletons to animate full-body avatars.
* '''Rehabilitation and exercise''' – Clinicians employ depth-based pose tracking to monitor range-of-motion exercises without encumbering patients with sensors.<ref name="Wade2022" />
* '''Interactive installations''' – Museums deploy wall-mounted depth cameras to create “magic-mirror” AR exhibits that overlay virtual costumes onto visitors in real time.
* '''Telepresence''' – Multi-camera arrays stream volumetric representations of remote participants into shared virtual spaces.


==Advantages==
* '''No wearable markers''' – Users remain unencumbered, enhancing comfort and lowering entry barriers.
* '''Rapid setup''' – A single sensor covers an entire play area; no lighthouse calibration or reflector placement is necessary.
* '''Multi-user support''' – Commodity depth cameras distinguish and skeletonise several people simultaneously.
* '''Lower hardware cost''' – RGB or RGB-D sensors are inexpensive compared with professional optical-mocap rigs.


==Disadvantages==
* '''Occlusion sensitivity''' – Furniture or other players can block the line of sight, causing intermittent loss of tracking (a simple fallback sketch follows this list).
* '''Reduced accuracy and jitter''' – Compared with marker-based solutions, joint estimates exhibit higher positional noise and 60–90 ms of skeleton latency, especially during fast or complex motion.<ref name="Guffanti2020" /><ref name="Lighthouse2016" />
* '''Environmental constraints''' – Bright sunlight, glossy surfaces, and feature-poor backgrounds degrade depth or feature-extraction quality.
* '''Limited range and FOV''' – Most consumer depth cameras operate reliably only within ≈ 0.8–4.5 m; beyond that, depth resolution and skeleton stability decrease.<ref name="DepthRange2012" />
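Applications usually have to cope with the occlusion case explicitly rather than rely on the sensor. A minimal, hypothetical sketch of a common coping strategy is shown below; the Joint structure, the confidence field and the 0.5 threshold are illustrative assumptions, not part of any specific SDK:

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Joint:
    x: float
    y: float
    z: float
    confidence: float        # 0.0 (not seen) … 1.0 (fully tracked)

def select_pose(current: dict, last_valid: dict, min_conf: float = 0.5) -> dict:
    """Per-joint fallback: keep the freshly tracked joint when its
    confidence is high enough, otherwise reuse the last valid estimate
    so the avatar does not snap or disappear during brief occlusions."""
    merged = {}
    for name, joint in current.items():
        if joint.confidence >= min_conf or name not in last_valid:
            merged[name] = joint
        else:
            merged[name] = last_valid[name]   # occluded: hold last pose
    return merged

# Example: the right hand is momentarily hidden behind the user's body.
last = {"right_hand": Joint(0.30, 1.10, 2.00, 0.9)}
now = {"right_hand": Joint(0.00, 0.00, 0.00, 0.1)}   # low-confidence reading
print(select_pose(now, last))                         # keeps the previous joint
</syntaxhighlight>

Holding the last valid estimate avoids visible avatar snapping during brief occlusions, at the cost of a limb appearing frozen if the occlusion persists.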


==References==
<references>
<ref name="Shotton2011">Shotton J. ''et al.'' (2011). “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” ''CVPR 2011''.</ref>
<ref name="Zhang2012">Zhang Z. (2012). “Microsoft Kinect Sensor and Its Effect.” ''IEEE MultiMedia'' 19 (2): 4–10.</ref>
<ref name="DepthRange2012">Khoshelham K.; Elberink S. (2012). “Accuracy and Resolution of Kinect Depth Data for Indoor Mapping Applications.” ''Sensors'' 12 (2): 1437–1454.</ref>
<ref name="NiTE2013">OpenNI Foundation (2013). “NiTE 2.0 User Guide.”</ref>
<ref name="Livingston2012">Livingston M. A. ''et al.'' (2012). “Performance Measurements for the Microsoft Kinect Skeleton.” ''IEEE VR 2012 Workshop''.</ref>
<ref name="Guffanti2020">Guffanti D. ''et al.'' (2020). “Accuracy of the Microsoft Kinect V2 Sensor for Human Gait Analysis.” ''Sensors'' 20 (16): 4405.</ref>
<ref name="ViconSpec">Vicon Motion Systems. “Vicon Tracker – Latency down to 2.5 ms.” Product sheet.</ref>
<ref name="Lighthouse2016">Malventano A. (2016). “SteamVR HTC Vive In-depth – Lighthouse Tracking System Dissected.” ''PC Perspective''.</ref>
<ref name="EyeToy2004">Pham A. (2004-01-18). “EyeToy Springs From One Man’s Vision.” ''Los Angeles Times''.</ref>
<ref name="Microsoft2010">Microsoft News Center (2010-11-04). “The Future of Entertainment Starts Today as Kinect for Xbox 360 Leaps and Lands at Retailers Nationwide.”</ref>
<ref name="Wade2022">Wade L. ''et al.'' (2022). “Applications and Limitations of Current Markerless Motion Capture Methods for Clinical Gait Biomechanics.” ''PeerJ'' 10:e12995.</ref>
<ref name="Microsoft2017">Good O. S. (2017-10-25). “Kinect is officially dead. Really. Officially. It’s dead.” ''Polygon''.</ref>
</references>
 


[[Category:Terms]]