Markerless outside-in tracking - VR & AR Wiki


:''See also [[Outside-in tracking]], [[Markerless tracking]], [[Positional tracking]]''

==Introduction==

'''[[Markerless outside-in tracking]]''' is a subtype of [[positional tracking]] used in [[virtual reality]] (VR) and [[augmented reality]] (AR). In this approach, external [[camera]]s or other [[depth sensing]] devices positioned in the environment estimate the six-degree-of-freedom ([[6DOF]]) [[pose]] of a user or object without relying on any [[fiducial marker]]s. Instead, [[computer vision]] algorithms analyse the incoming colour or depth stream to detect and follow natural scene features or the user’s own body, enabling real-time [[motion capture]] and interaction.<ref name="Shotton2011" />

==Underlying technology==

A typical markerless outside-in pipeline combines specialised hardware with software-based human-pose estimation:

* '''Sensing layer''' – One or more fixed [[RGB-D]] or [[infrared]] depth cameras acquire per-frame point clouds. Commodity devices such as the Microsoft Kinect project a [[structured light]] pattern or use [[time-of-flight]] methods to compute depth maps.<ref name="Zhang2012" />

* '''Segmentation''' – Foreground extraction or person segmentation isolates user pixels from the static background.

* '''Per-pixel body-part classification''' – A machine-learning model labels each pixel as “head”, “hand”, “torso”, and so on (for example the Randomised Decision Forest used in the original Kinect).<ref name="Shotton2011" />

* '''Skeletal reconstruction and filtering''' – The system fits a kinematic skeleton to the classified pixels and applies temporal filtering to reduce jitter, producing smooth head- and hand-pose data that can drive VR/AR applications.
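The stages listed above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Kinect implementation: the pinhole intrinsics (<code>FX</code>, <code>FY</code>, <code>CX</code>, <code>CY</code>) are hypothetical values, the per-pixel labels are assumed to come from an already trained classifier, joint positions are taken as plain centroids of the labelled pixels, and the temporal filter is a basic exponential moving average.

```python
from typing import Dict, List, Optional, Tuple

Point3 = Tuple[float, float, float]
DepthMap = List[List[float]]  # metres; 0.0 marks an invalid reading

# Hypothetical pinhole intrinsics in pixels (assumption, not from any datasheet).
FX = FY = 525.0
CX, CY = 319.5, 239.5


def back_project(u: int, v: int, depth_m: float) -> Point3:
    """Sensing: convert pixel (u, v) with metric depth to a camera-space point."""
    return ((u - CX) * depth_m / FX, (v - CY) * depth_m / FY, depth_m)


def segment_foreground(depth: DepthMap, background: DepthMap,
                       min_diff_m: float = 0.10) -> List[Tuple[int, int]]:
    """Segmentation: keep pixels clearly closer than the static background."""
    pixels = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            if d > 0 and background[v][u] - d > min_diff_m:
                pixels.append((u, v))
    return pixels


def joint_centroids(depth: DepthMap, labels: List[List[Optional[str]]],
                    pixels: List[Tuple[int, int]]) -> Dict[str, Point3]:
    """Skeletal step (simplified): average the 3-D points of each body part."""
    sums: Dict[str, List[float]] = {}
    for u, v in pixels:
        part = labels[v][u]  # output of the per-pixel classifier (assumed given)
        if part is None:
            continue
        x, y, z = back_project(u, v, depth[v][u])
        acc = sums.setdefault(part, [0.0, 0.0, 0.0, 0.0])
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1.0
    return {p: (a[0] / a[3], a[1] / a[3], a[2] / a[3]) for p, a in sums.items()}


def smooth(previous: Optional[Point3], current: Point3,
           alpha: float = 0.5) -> Point3:
    """Filtering: exponential moving average to damp frame-to-frame jitter."""
    if previous is None:
        return current
    return tuple(alpha * c + (1 - alpha) * p for p, c in zip(previous, current))
```

A lower <code>alpha</code> gives smoother output at the price of extra lag, so the value is a trade-off between jitter and responsiveness.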

Although a single camera can suffice, multi-camera rigs extend coverage and mitigate occlusion problems. Open source and proprietary middleware (for example [[OpenNI]]/NITE, the [[Microsoft Kinect]] SDK) expose joint-stream APIs for developers.<ref name="OpenNI2013" />
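As a rough illustration of how a multi-camera rig can mitigate occlusion, the sketch below transforms each camera's joint estimate into a shared world frame and takes a confidence-weighted average. The function names, the per-view confidence value, and the assumption that each camera's extrinsic rotation and translation are already calibrated are all hypothetical; real middleware may fuse views quite differently.

```python
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]


def to_world(rotation: Mat3, translation: Vec3, point: Vec3) -> Vec3:
    """Apply a camera's rigid extrinsic transform: world = R * point + t."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def fuse_joint(observations: List[Tuple[Vec3, float]]) -> Optional[Vec3]:
    """Confidence-weighted mean of world-space estimates of one joint.

    A view in which the joint is occluded contributes confidence 0.0.
    Returns None when no camera currently sees the joint.
    """
    total = sum(conf for _, conf in observations)
    if total <= 0:
        return None
    return tuple(
        sum(p[i] * conf for p, conf in observations) / total for i in range(3)
    )
```

With two cameras facing each other, a hand hidden from one sensor is still reported as long as the other view has non-zero confidence.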

==Markerless vs. marker-based tracking==

[[Outside-in tracking|Marker-based outside-in systems]] ([[HTC Vive]] [[Lighthouse]], [[PlayStation VR]]) attach active LEDs or retro-reflective spheres to the headset or controllers; external sensors triangulate these explicit targets, which can achieve sub-millimetre precision and sub-10 ms latency. Markerless alternatives dispense with physical targets, improving user comfort and reducing setup time, but at the cost of:

* '''Lower positional accuracy and higher latency''' – Depth-sensor noise and computational overhead introduce millimetre- to centimetre-level error and ~20–30 ms end-to-end latency.<ref name="Pfister2022" />

* '''Sensitivity to occlusion''' – If a body part leaves the camera’s line of sight, the model loses track until the part re-enters view.
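To give a feel for the latency figures above, the apparent positional lag of a moving body part is roughly its speed multiplied by the end-to-end latency. The sketch below computes that lag and shows one common mitigation, constant-velocity extrapolation. The hand speed of 2 m/s used in the example is a hypothetical value chosen for illustration, and the function names are not taken from any SDK.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def latency_lag_m(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance a tracked point travels while one frame is still in flight."""
    return speed_m_per_s * latency_ms / 1000.0


def extrapolate(prev: Vec3, curr: Vec3, frame_dt_s: float,
                latency_s: float) -> Vec3:
    """Constant-velocity prediction: push the newest sample forward by the latency."""
    return tuple(c + (c - p) / frame_dt_s * latency_s for p, c in zip(prev, curr))
```

For a hand moving at an assumed 2 m/s, 30 ms of latency corresponds to about 6 cm of lag, compared with about 2 cm at 10 ms. Extrapolation hides part of this lag but amplifies sensor noise, so it is usually combined with filtering.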

==History and notable systems==

{| class="wikitable"

! Year !! System !! Notes

|-

| 2003 || [[EyeToy]] (PlayStation 2) || 2-D silhouette tracking with a single RGB camera for casual gesture-based games.<ref name="Pham2004" />

|-

| 2010 || [[Kinect]] for Xbox 360 || Consumer launch of a structured-light depth sensor delivering real-time full-body skeletons (up to six users).<ref name="Microsoft2010" />

|-

| 2011 || Kinect + FAAST middleware || Demonstrated low-cost VR interaction with markerless tracking.<ref name="Lange2011" />

|-

| 2014 – 2016 || Research prototypes || Studies showed Kinect V2 could supply 6-DOF head, hand, and body input to DIY VR HMDs.

|-

| 2017 || Kinect production ends || Microsoft discontinued Kinect hardware as commercial VR shifted toward marker-based and inside-out solutions.<ref name="Microsoft2017" />

|}

==Applications==

* '''Gaming and entertainment''' – Titles like ''Kinect Sports'' mapped whole-body actions directly onto avatars. Hobbyists on social VR platforms still use Kinect skeletons to animate full-body avatars.

* '''Rehabilitation and exercise''' – Clinicians employ depth-based pose tracking to monitor range-of-motion exercises without encumbering patients with sensors.

* '''Interactive installations''' – Museums deploy wall-mounted depth cameras to create “magic-mirror” AR exhibits that overlay virtual costumes onto visitors in real time.

* '''Telepresence''' – Multi-Kinect arrays stream volumetric representations of remote participants into shared virtual spaces.

==Advantages==

* '''No wearable markers''' – Users remain unencumbered, enhancing comfort and lowering entry barriers.

* '''Rapid setup''' – A single sensor covers an entire play area; no lighthouse calibration or reflector placement is necessary.

* '''Multi-user support''' – Commodity depth cameras distinguish and skeletonise several people simultaneously.

* '''Lower hardware cost''' – RGB or RGB-D sensors are inexpensive compared with professional optical-mocap rigs.

==Disadvantages==

* '''Occlusion sensitivity''' – Furniture or other players can block the line of sight, causing intermittent loss of tracking.

* '''Reduced accuracy and jitter''' – Compared with marker-based solutions, joint estimates exhibit higher positional noise, especially during fast or complex motion.

* '''Environmental constraints''' – Bright sunlight, glossy surfaces, and feature-poor backgrounds degrade depth or feature extraction quality.

* '''Limited range and FOV''' – Most consumer depth cameras operate effectively only within 0.8–5 m; beyond that, depth resolution and skeleton stability decrease.
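The range and field-of-view limits can be expressed as a simple frustum test that an application might run before trusting a skeleton. This is a minimal sketch: the 0.8–5 m range comes from the text above, while the 57° by 43° field of view is an assumed value roughly matching a first-generation Kinect, and the function name is hypothetical.

```python
import math


def in_tracking_volume(x: float, y: float, z: float,
                       near_m: float = 0.8, far_m: float = 5.0,
                       hfov_deg: float = 57.0, vfov_deg: float = 43.0) -> bool:
    """Return True if a camera-space point (metres, z forward) is trackable.

    The point must lie between the near and far planes and inside the
    horizontal and vertical field of view of the sensor.
    """
    if not (near_m <= z <= far_m):
        return False
    half_width = z * math.tan(math.radians(hfov_deg / 2))
    half_height = z * math.tan(math.radians(vfov_deg / 2))
    return abs(x) <= half_width and abs(y) <= half_height
```

At 2 m from the sensor the assumed horizontal field of view spans a little over 1 m to either side of the optical axis, which is why users who step sideways or too close drop out of tracking.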

==References==

<references>

<ref name="Shotton2011">Shotton, J.; Fitzgibbon, A.; Cook, M.; Sharp, T.; Finocchio, M.; Moore, R.; Kipman, A.; Blake, A. “Real-Time Human Pose Recognition in Parts from a Single Depth Image.” ''Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)'', 2011, pp. 1297–1304. DOI: 10.1109/CVPR.2011.5995316. Available at: https://ieeexplore.ieee.org/document/5995316 (accessed 3 May 2025).</ref>

<ref name="Zhang2012">Zhang, Z. “Microsoft Kinect Sensor and Its Effect.” ''IEEE MultiMedia'', vol. 19, no. 2, 2012, pp. 4–10. DOI: 10.1109/MMUL.2012.24. Available at: https://dl.acm.org/doi/10.1109/MMUL.2012.24 (accessed 3 May 2025).</ref>

<ref name="OpenNI2013">OpenNI Foundation. ''OpenNI 1.5.2 User Guide'', 2010. PDF. Available at: https://www.cs.rochester.edu/courses/577/fall2011/kinect/openni-user-guide.pdf (accessed 3 May 2025).</ref>

<ref name="Pfister2022">Pfister, A.; West, N.; et al. “Applications and Limitations of Current Markerless Motion Capture Methods for Clinical Gait Biomechanics.” ''Journal of Biomechanics'', vol. 129, 2022, Article 110844. DOI: 10.1016/j.jbiomech.2021.110844. Available at: https://pubmed.ncbi.nlm.nih.gov/35237469/ (accessed 3 May 2025).</ref>

<ref name="Pham2004">Pham, A. “EyeToy Springs From One Man’s Vision.” ''Los Angeles Times'', 18 Jan 2004. Available at: https://www.latimes.com/archives/la-xpm-2004-jan-18-fi-eyetoy18-story.html (accessed 3 May 2025).</ref>

<ref name="Microsoft2010">Microsoft News Center. “The Future of Entertainment Starts Today as Kinect for Xbox 360 Leaps and Lands at Retailers Nationwide.” Press release, 4 Nov 2010. Available at: https://news.microsoft.com/2010/11/04/the-future-of-entertainment-starts-today-as-kinect-for-xbox-360-leaps-and-lands-at-retailers-nationwide/ (accessed 3 May 2025).</ref>

<ref name="Lange2011">Lange, B.; Rizzo, A.; Chang, C.-Y.; Suma, E. A.; Bolas, M. “Markerless Full Body Tracking: Depth-Sensing Technology within Virtual Environments.” ''Interservice/Industry Training, Simulation and Education Conference (I/ITSEC)'', 2011. PDF. Available at: http://ict.usc.edu/pubs/Markerless%20Full%20Body%20Tracking-%20Depth-Sensing%20Technology%20within%20Virtual%20Environments.pdf (accessed 3 May 2025).</ref>

<ref name="Microsoft2017">Good, O. S. “Kinect Is Officially Dead. Really. Officially. It’s Dead.” ''Polygon'', 25 Oct 2017. Available at: https://www.polygon.com/2017/10/25/16543192/kinect-discontinued-microsoft-announcement (accessed 3 May 2025).</ref>

</references>

[[Category:Terms]]

[[Category:Technical Terms]]

[[Category:Tracking]]

[[Category:Tracking Types]]