Markerless outside-in tracking: Difference between revisions

Revision as of 17:15, 30 April 2025

This page is a stub. Please expand it if you have more information.

See also Outside-in tracking, Markerless tracking, Positional tracking

Introduction

Markerless outside-in tracking is a subtype of positional tracking used in both virtual reality (VR) and augmented reality (AR). It places external cameras or other depth sensing devices around the play area and estimates a user’s six-degree-of-freedom pose without any worn fiducial markers. Instead, the system runs computer vision algorithms—most famously the per-pixel body-part classifier introduced for Microsoft’s Kinect—to create a real-time motion capture skeleton.^[1]

Underlying technology

A typical markerless outside-in pipeline includes:

Sensing layer – One or more fixed RGB-D or infrared depth cameras (e.g., the first-generation Kinect) acquire point-cloud frames. Depth is measured with structured light or time-of-flight illumination.^[2]^[3]
Segmentation – Foreground extraction isolates user pixels from the static background.
Body-part classification – A decision-forest classifier labels each depth pixel as head, hand, torso, and so on, following Shotton et al.^[1]
Skeletal fitting and filtering – Joint hypotheses are fitted to a kinematic model and temporally smoothed, generating continuous head- and hand-pose streams.
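
As a rough illustration of the segmentation and classification steps above, the following Python sketch subtracts a stored empty-room depth image to find user pixels, then evaluates the depth-comparison feature described by Shotton et al., in which two probe offsets are scaled by the inverse of the depth at the pixel being classified. Background subtraction is only one possible way to do foreground extraction. The function names, the 5 cm tolerance, the BACKGROUND constant, and the list-of-lists image layout are assumptions made for this example. It is not the Kinect implementation, and a real system would feed many such features into a trained decision forest.

```python
from typing import List, Tuple

Depth = List[List[float]]   # depth image in metres, row-major; 0.0 means "no reading"
Mask = List[List[bool]]     # True where a pixel belongs to the user

BACKGROUND = 1e6  # assumed large value for probes that miss the user or leave the image


def segment_foreground(frame: Depth, background: Depth, tol: float = 0.05) -> Mask:
    """Mark pixels with a valid reading that are more than `tol` metres closer
    than the empty-room reference image."""
    return [
        [d > 0.0 and b - d > tol for d, b in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


def probe(frame: Depth, mask: Mask, row: int, col: int) -> float:
    """Depth at (row, col), or BACKGROUND if off-image or not a user pixel."""
    if 0 <= row < len(frame) and 0 <= col < len(frame[0]) and mask[row][col]:
        return frame[row][col]
    return BACKGROUND


def depth_feature(
    frame: Depth,
    mask: Mask,
    pixel: Tuple[int, int],
    u: Tuple[float, float],
    v: Tuple[float, float],
) -> float:
    """Depth-comparison feature: d(x + u / d(x)) - d(x + v / d(x)).

    `pixel` must be a foreground pixel with a valid (non-zero) depth.
    Dividing the offsets by the depth makes the probes cover roughly the
    same body-relative distance whether the user is near or far.
    """
    r, c = pixel
    d = frame[r][c]

    def at(offset: Tuple[float, float]) -> float:
        return probe(frame, mask, r + round(offset[0] / d), c + round(offset[1] / d))

    return at(u) - at(v)
```

A feature near zero means both probes landed on surfaces at similar depth, while a very large value means the first probe fell off the body, which is the kind of cue a tree node can threshold.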

Open software stacks such as OpenNI/NITE expose these joint streams to developers.^[4]
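
An application consuming such a joint stream often applies its own temporal filter. The Python sketch below shows one simple option, exponential smoothing of each joint position. It is a minimal example under stated assumptions: the dictionary-per-frame layout, the joint names, and the smooth_stream function are hypothetical and are not part of the OpenNI or NITE APIs.

```python
from typing import Dict, Iterable, Iterator, Tuple

Vec3 = Tuple[float, float, float]
Skeleton = Dict[str, Vec3]  # hypothetical frame format: joint name -> (x, y, z) in metres


def smooth_stream(frames: Iterable[Skeleton], alpha: float = 0.5) -> Iterator[Skeleton]:
    """Exponentially smooth each joint across frames.

    alpha = 1.0 passes the raw data through; smaller values smooth more.
    A joint missing from a frame (for example, because it is occluded)
    keeps its last smoothed position.
    """
    state: Skeleton = {}
    for frame in frames:
        for name, pos in frame.items():
            prev = state.get(name, pos)  # first sighting: start at the raw position
            state[name] = tuple(
                alpha * p + (1.0 - alpha) * q for p, q in zip(pos, prev)
            )
        yield dict(state)  # copy so callers cannot mutate the filter state


if __name__ == "__main__":
    raw = [{"head": (0.0, 1.7, 2.0)}, {"head": (0.1, 1.7, 2.0)}, {}]
    for skeleton in smooth_stream(raw, alpha=0.5):
        print(skeleton)
```

Lower values of alpha suppress more jitter but make the output trail further behind fast movements, so the setting trades noise against responsiveness.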

Markerless vs. marker-based tracking

Marker-based outside-in systems (HTC Vive Lighthouse, PlayStation VR) attach active LEDs or reflective spheres to the headset or controllers, achieving millimetre-level accuracy. Markerless systems remove that hardware layer but incur:

Susceptibility to occlusion and environmental lighting.
Higher positional noise and latency (~20–30 ms end-to-end).^[5]

History and notable systems

Year	System	Technical note
2003	EyeToy (PlayStation 2)	2-D silhouette tracking with a single RGB webcam.^[6]
2010	Kinect for Xbox 360	Structured-light depth sensor that detects up to six people and tracks full-body skeletons for two of them at a time.^[7]
2011	Kinect + FAAST middleware	Demonstrated low-cost VR interaction with markerless tracking.^[8]
2017	Kinect production ends	Microsoft ceased manufacturing Kinect as the industry moved to other tracking paradigms.^[9]

Applications

Gaming and entertainment – Titles such as Kinect Sports map whole-body gestures to avatars; hobbyists still use Kinect for full-body VR chat avatars.
Rehabilitation and exercise – Depth-based pose tracking supports remote physiotherapy and balance-training systems.^[5]
Interactive exhibits – Museums mount depth cameras to create “magic-mirror” AR overlays.
Telepresence – Multi-camera arrays stream volumetric avatars into shared virtual spaces.

Advantages

No wearable markers, enhancing comfort.
Quick single-sensor setup and lower hardware cost.
Ability to track multiple users at once.

Disadvantages

Occlusion sensitivity and limited camera field of view.
Lower accuracy than marker-based alternatives.^[10]
Performance degradation in bright sunlight or on reflective surfaces.

References

[1] Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” Proceedings of CVPR 2011. IEEE, 2011.
[2] Zeng, W.; Zhang, Z. “Microsoft Kinect Sensor and Its Effect.” IEEE MultiMedia, 19 (2), 2012, pp. 4–10.
[3] “Structured-light 3D scanner.” Wikipedia. Accessed 1 May 2025.
[4] OpenNI Foundation. OpenNI 1.5.2 User Guide. 2013.
[5] Pfister, A.; West, N.; et al. “Applications and limitations of current markerless motion capture methods for clinical gait biomechanics.” Journal of Biomechanics, 129 (2022) 110844.
[6] Pham, A. “EyeToy Springs From One Man’s Vision.” Los Angeles Times, 27 Nov 2003.
[7] Microsoft News Center. “The Future of Entertainment Starts Today as Kinect for Xbox 360 …”, 4 Nov 2010.
[8] Lange, B.; Rizzo, A.; Chang, C-Y.; Suma, E.; Bolas, M. “Markerless Full Body Tracking: Depth-Sensing Technology within Virtual Environments.” I/ITSEC 2011.
[9] Good, O. “Kinect is officially dead. Really. Officially. It’s dead.” Polygon, 25 Oct 2017.
[10] Remocapp. “Marker vs Markerless Motion Capture by Accuracy and Detail Level.” Blog post, 2024.


@@ Line 1: / Line 1: @@
-{{see also|Terms|Technical Terms}}
+{{stub}}
 :''See also [[Outside-in tracking]], [[Markerless tracking]], [[Positional tracking]]''
 ==Introduction==
-'''[[Markerless outside-in tracking]]''' is a subtype of [[positional tracking]] used in [[virtual reality]] (VR) and [[augmented reality]] (AR). In this approach, external [[camera]]s or other [[depth sensing]] devices positioned in the environment estimate the six-degree-of-freedom pose of a user or object without relying on any [[fiducial marker]]s. Instead, [[computer vision]] algorithms analyse the incoming colour or depth stream to detect and follow natural scene features or the user’s own body, enabling real-time [[motion capture]] and interaction.<ref name="Shotton2011" />
+'''[[Markerless outside-in tracking]]''' is a subtype of [[positional tracking]] used in both [[virtual reality]] (VR) and [[augmented reality]] (AR). It places external [[camera]]s or other [[depth sensing]] devices around the play area and estimates a user’s six-degree-of-freedom pose without any worn [[fiducial marker]]s. Instead, the system runs [[computer vision]] algorithms—most famously the per-pixel body-part classifier introduced for Microsoft’s Kinect—to create a real-time [[motion capture]] skeleton.<ref name="Shotton2011">Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” ''Proceedings of CVPR 2011''. IEEE, 2011.</ref>
 ==Underlying technology==
-A typical markerless outside-in pipeline combines specialised hardware with software-based human-pose estimation:
+A typical markerless outside-in pipeline includes:
-* **Sensing layer** – One or more fixed [[RGB-D]] or [[infrared]] depth cameras acquire per-frame point clouds. Commodity devices such as the Microsoft Kinect project a [[structured light]] pattern or use [[time-of-flight]] methods to compute depth maps.<ref name="Zhang2012" />
+* '''Sensing layer''' – One or more fixed [[RGB-D]] or [[infrared]] depth cameras (e.g., the first-generation [[Kinect]]) acquire point-cloud frames. Depth is measured with [[structured light]] or [[time-of-flight]] illumination.<ref name="Zhang2012">Zeng, W.; Zhang, Z. “Microsoft Kinect Sensor and Its Effect.” ''IEEE MultiMedia'', 19 (2), 2012, pp. 4–10.</ref><ref name="StructuredLight">“Structured-light 3D scanner.” ''Wikipedia''. Accessed 1 May 2025.</ref>
-* **Segmentation** – Foreground extraction or person segmentation isolates user pixels from the static background.
+* '''Segmentation''' – Foreground extraction isolates user pixels from the static background.
-* **Per-pixel body-part classification** – A machine-learning model labels each pixel as “head”, “hand”, “torso”, and so on (e.g., the Randomised Decision Forest used in the original Kinect).<ref name="Shotton2011" />
+* '''Body-part classification''' – A decision-forest classifier labels each depth pixel as head, hand, torso, and so on, following Shotton ''et al.''<ref name="Shotton2011" />
-* **Skeletal reconstruction and filtering** – The system fits a kinematic skeleton to the classified pixels and applies temporal filtering to reduce jitter, producing smooth head- and hand-pose data that can drive VR/AR applications.
+* '''Skeletal fitting and filtering''' – Joint hypotheses are fitted to a kinematic model and temporally smoothed, generating continuous head- and hand-pose streams.
-Although a single camera can suffice, multi-camera rigs extend coverage and mitigate occlusion problems. Open source and proprietary middleware (e.g., [[OpenNI]]/NITE, the Microsoft Kinect SDK) expose joint-stream APIs for developers.<ref name="OpenNI2013" />
+Open software stacks such as [[OpenNI]]/NITE expose these joint streams to developers.<ref name="OpenNI2013">OpenNI Foundation. ''OpenNI 1.5.2 User Guide''. 2013.</ref>
 ==Markerless vs. marker-based tracking==
-Marker-based outside-in systems (HTC Vive Lighthouse, PlayStation VR) attach active LEDs or retro-reflective spheres to the headset or controllers; external sensors triangulate these explicit targets, achieving sub-millimetre precision and sub-10 ms latency. Markerless alternatives dispense with physical targets, improving user comfort and reducing setup time, but at the cost of:
+Marker-based outside-in systems (HTC Vive Lighthouse, PlayStation VR) attach active LEDs or reflective spheres to the headset or controllers, achieving millimetre-level accuracy. Markerless systems remove that hardware layer but incur:
-* **Lower positional accuracy and higher latency** – Depth-sensor noise and computational overhead introduce millimetre- to centimetre-level error and ~20–30 ms end-to-end latency.<ref name="Baker2016" />
+* Susceptibility to occlusion and environmental lighting.
-* **Sensitivity to occlusion** – If a body part leaves the camera’s line of sight, the model loses track until the part re-enters view.
+* Higher positional noise and latency (~20–30 ms end-to-end).<ref name="Pfister2022">Pfister, A.; West, N.; et al. “Applications and limitations of current markerless motion capture methods for clinical gait biomechanics.” ''Journal of Biomechanics'', 129 (2022) 110844.</ref>
 ==History and notable systems==
 {| class="wikitable"
-! Year !! System !! Notes
+! Year !! System !! Technical note
 |-
-| 2003 || [[EyeToy]] (PlayStation 2) || 2-D silhouette tracking with a single RGB camera for casual gesture-based games.<ref name="Sony2003" />
+| 2003 || [[EyeToy]] (PlayStation 2) || 2-D silhouette tracking with a single RGB webcam.<ref name="EyeToy2003">Pham, A. “EyeToy Springs From One Man’s Vision.” ''Los Angeles Times'', 27 Nov 2003.</ref>
 |-
-| 2010 || [[Kinect]] for Xbox 360 || Consumer launch of a structured-light depth sensor delivering real-time full-body skeletons (up to six users).<ref name="Microsoft2010" />
+| 2010 || [[Kinect]] for Xbox 360 || Structured-light depth sensor providing full-body skeletons for up to six users.<ref name="Kinect2010">Microsoft News Center. “The Future of Entertainment Starts Today as Kinect for Xbox 360 …”, 4 Nov 2010.</ref>
 |-
-| 2014 – 2016 || Research prototypes || Studies showed Kinect V2 could supply 6-DOF head, hand, and body input to DIY VR HMDs.<ref name="KinectVRStudy" />
+| 2011 || Kinect + FAAST middleware || Demonstrated low-cost VR interaction with markerless tracking.<ref name="Lange2011">Lange, B.; Rizzo, A.; Chang, C-Y.; Suma, E.; Bolas, M. “Markerless Full Body Tracking: Depth-Sensing Technology within Virtual Environments.” ''I/ITSEC 2011''.</ref>
 |-
-| 2017 || Kinect production ends || Microsoft discontinued Kinect hardware as commercial VR shifted toward marker-based and inside-out solutions.<ref name="Microsoft2017" />
+| 2017 || Kinect production ends || Microsoft ceased manufacturing Kinect as industry moved to other tracking paradigms.<ref name="KinectDead2017">Good, O. “Kinect is officially dead. Really. Officially. It’s dead.” ''Polygon'', 25 Oct 2017.</ref>
 |}
 ==Applications==
-* **Gaming and Entertainment** – Titles like ''Kinect Sports'' mapped whole-body actions directly onto avatars. Enthusiast VR chat platforms still use Kinect skeletons to animate full-body avatars.
+* **Gaming and entertainment** – Titles such as ''Kinect Sports'' map whole-body gestures to avatars; hobbyists still use Kinect for full-body VR chat avatars.
-* **Rehabilitation and Exercise** – Clinicians employ depth-based pose tracking to monitor range-of-motion exercises without encumbering patients with sensors.<ref name="Baker2016" />
+* **Rehabilitation and exercise** – Depth-based pose tracking supports remote physiotherapy and balance-training systems.<ref name="Pfister2022" />
-* **Interactive installations** – Museums deploy wall-mounted depth cameras to create “magic-mirror” AR exhibits that overlay virtual costumes onto visitors in real time.
+* **Interactive exhibits** – Museums mount depth cameras to create “magic-mirror” AR overlays.
-* **Telepresence** – Multi-Kinect arrays stream volumetric representations of remote participants into shared virtual spaces.
+* **Telepresence** – Multi-camera arrays stream volumetric avatars into shared virtual spaces.
 ==Advantages==
-* '''No wearable markers''' – Users remain unencumbered, enhancing comfort and lowering entry barriers.
+* No wearable markers, enhancing comfort.
-* '''Rapid setup''' – A single sensor covers an entire play area; no lighthouse calibration or reflector placement is necessary.
+* Quick single-sensor setup and lower hardware cost.
-* '''Multi-user support''' – Commodity depth cameras distinguish and skeletonise several people simultaneously.
+* Ability to track multiple users at once.
-* '''Lower hardware cost''' – RGB or RGB-D sensors are inexpensive compared with professional optical-mocap rigs.
 ==Disadvantages==
-* '''Occlusion sensitivity''' – Furniture or other players can block the line of sight, causing intermittent loss of tracking.
+* Occlusion sensitivity and limited camera field-of-view.
-* '''Reduced accuracy and jitter''' – Compared with marker-based solutions, joint estimates exhibit higher positional noise, especially during fast or complex motion.<ref name="Baker2016" />
+* Lower accuracy than marker-based alternatives.<ref name="Remocapp2024">Remocapp. “Marker vs Markerless Motion Capture by Accuracy and Detail Level.” Blog post, 2024.</ref>
-* '''Environmental constraints''' – Bright sunlight, glossy surfaces, and feature-poor backgrounds degrade depth or feature extraction quality.
+* Performance degradation in bright sunlight or on reflective surfaces.
-* '''Limited range and FOV''' – Most consumer depth cameras operate effectively only within 0.8–5 m; beyond that, depth resolution and skeleton stability decrease.
 ==References==
-<ref>Shotton, Jamie; Fitzgibbon, Andrew; Cook, Mat; Sharp, Toby; Finocchio, Mark; Moore, Bob; Kipman, Alex; Blake, Andrew (2011). Real-Time Human Pose Recognition in Parts from a Single Depth Image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2011
+<references/>
-microsoft.com
-.</ref>
-<ref>Zeng, Wenjun; Zhang, Zhengyou (2012). Microsoft Kinect Sensor and Its Effect. IEEE MultiMedia, 19(2):4–10
-microsoft.com
-.</ref>
-<ref>OpenNI Foundation (2010). OpenNI 1.5.2 User Guide. “OpenNI is an open source API that is publicly available at www.OpenNI.org.”</ref>
-<ref>Pfister, Andreas; West, Niels; et al. (2022). Applications and limitations of current markerless motion capture methods for clinical gait biomechanics. Journal of Biomechanics, 129:110844. “While markerless temporospatial measures generally appear equivalent to marker-based systems, joint center locations and joint angles are not yet sufficiently accurate for clinical applications.”
-pmc.ncbi.nlm.nih.gov
-.</ref>
-<ref>Pham, Alex (2004-01-18). EyeToy Springs From One Man’s Vision. Los Angeles Times. “the $50 EyeToy, a tiny camera that enables video game players to control the action by jumping around and waving their arms…”
-latimes.com
-.</ref>
-<ref>Microsoft News Center (2010-11-04). The Future of Entertainment Starts Today as Kinect for Xbox 360 Leaps and Lands at Retailers Nationwide. “Kinect for Xbox 360 lets you use your body and voice to play your favorite games... No buttons. No barriers. Just you.”
-news.microsoft.com
-.</ref>
-<ref>Lange, Belinda; Rizzo, Skip; Chang, Chien-Yen; Suma, Evan A.; Bolas, Mark (2011). Markerless Full Body Tracking: Depth-Sensing Technology within Virtual Environments. Proc. I/ITSEC 2011. “FAAST is middleware to facilitate integration of full-body control with virtual reality applications... (e.g. Microsoft Kinect).”
-illusioneering.cs.umn.edu
-.</ref>
-<ref>Good, Owen S. (2017-10-25). Kinect is officially dead. Really. Officially. It’s dead. Polygon. “Microsoft has confirmed it is no longer manufacturing Kinect and none will be sold once retailers run out.”
-polygon.com
-.</ref>
 [[Category:Terms]]
 [[Category:Technical Terms]]
-[[Category:Tracking]]
-[[Category:Tracking Types]]