Machine learning
Machine learning (ML) is a branch of artificial intelligence in which algorithms learn patterns from data rather than following rules written by hand. A model is trained on examples, adjusts its internal parameters to reduce error on those examples, and then makes predictions on new inputs. Modern systems mostly use deep learning, which trains many-layered artificial neural networks; convolutional neural networks (CNNs) handle images, and other architectures handle sequences and three-dimensional data.
In virtual reality and augmented reality, machine learning underlies most perception and tracking tasks and several rendering tasks. The cameras and sensors on a headset produce raw image and motion data, and learned models turn that data into the pose of the head, the hands, and the eyes, into a map of the surrounding space, and into reconstructed or upscaled imagery. Tasks that were previously attempted with hand-tuned computer vision (locating a hand in a camera frame, estimating where a user is looking, building a map from a moving camera) are now done with trained networks, partly because those networks perform more reliably across the wide range of users, lighting, and environments a consumer device must handle.
How machine learning works
A machine learning system is defined by a model, a dataset, and a training objective. During training the model is shown labelled examples (for hand tracking, camera images paired with the true hand pose; for gaze estimation, eye images paired with the true gaze direction) and an optimization procedure adjusts the model's parameters to minimize the difference between its predictions and the labels. This is supervised learning, the most common setup for VR/AR perception. After training, inference runs the fixed model on new data, which on a headset must happen in a few milliseconds per frame and within a tight power budget.
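The training and inference steps described above can be illustrated with a toy model. The following is a minimal sketch, assuming a two-parameter linear model fitted by gradient descent on mean squared error; the data, learning rate, and function names are illustrative assumptions, and real perception models are deep networks trained with dedicated frameworks rather than this loop.

```python
def train_linear(xs, ys, lr=0.05, steps=2000):
    """Fit y ≈ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction error on every labelled example.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step the parameters against the gradient to reduce the error.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(params, x):
    """Inference: apply the fixed, trained parameters to a new input."""
    w, b = params
    return w * x + b


# Toy labelled data generated from y = 2x + 1 (an assumption for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
params = train_linear(xs, ys)
```

Calling predict(params, 5.0) after training applies the learned parameters to an input that was not in the training set, which is the same split between training and inference that a headset model follows at much larger scale.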
Deep neural networks dominate because they learn useful features directly from raw pixels and signals instead of requiring an engineer to specify them. The cost is data and compute: networks need large, varied training sets, and they need hardware that can run them quickly. On standalone headsets that hardware is a mobile system-on-chip; on PC-tethered systems it is a GPU. Some GPUs include dedicated units for neural network math, such as the Tensor Cores used by Nvidia's Deep Learning Super Sampling (DLSS).
Tracking and input
Hand tracking
Controller-free hand tracking is one of the most visible uses of machine learning in consumer VR. Hand tracking was introduced on the Oculus Quest as an experimental feature in 2019. Meta (then Facebook Reality Labs) built the system to use only the headset's existing monochrome cameras, with no depth sensor.[1] Deep neural networks predict the location of the hands and of landmarks such as the finger joints, and those predictions are combined with a model of the hand to reconstruct a 26 degree-of-freedom pose of the hand and fingers in real time, entirely on the device.[2][3] Later updates were largely improvements to the underlying models: Hand Tracking 2.1 added a new neural network to reduce overshoot and produce smoother poses during fast motion.[3]
Inside-out tracking and SLAM
Standalone headsets locate themselves in a room with inside-out tracking built on SLAM (simultaneous localization and mapping) and visual-inertial odometry, which fuse camera images with data from inertial measurement units. Meta's Oculus Insight system, which shipped with the Quest in 2019, builds a three-dimensional map of the environment from the headset cameras, identifies visual landmarks, and tracks the headset and controllers against that map at six degrees of freedom, all running on the device's mobile chipset.[4][5] Within these pipelines, learned models are used for sub-tasks such as detecting and matching natural visual features and segmenting parts of the scene, complementing the classical geometric estimation at the core of SLAM.[5]
Eye tracking and gaze estimation
Eye-tracking headsets such as the Meta Quest Pro and Apple Vision Pro use cameras pointed at the eyes and infer gaze direction from the eye images. A 2025 review of eye tracking and gaze estimation for AR/VR reports that modern gaze estimation increasingly relies on learned models, including convolutional and recurrent neural networks, to map eye appearance to gaze direction.[6] Gaze data drives interaction (selecting objects by looking at them) and is the input that makes gaze-contingent foveated rendering possible.[6]
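Once a gaze direction has been estimated, selecting an object by looking at it reduces to a geometric test. The following is a minimal sketch, assuming the application already has an eye position, a gaze direction vector, and target positions in the same coordinate frame; the function names, the target dictionary, and the 5 degree threshold are hypothetical and not taken from any headset SDK.

```python
import math


def angle_between(u, v):
    """Angle in radians between two 3D vectors, or None if either has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return None
    dot = sum(a * b for a, b in zip(u, v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_angle)


def select_by_gaze(eye_pos, gaze_dir, targets, max_angle_deg=5.0):
    """Return the name of the target nearest the gaze ray, or None if none is close."""
    best_name = None
    best_angle = math.radians(max_angle_deg)
    for name, pos in targets.items():
        to_target = [p - e for p, e in zip(pos, eye_pos)]
        angle = angle_between(gaze_dir, to_target)
        if angle is not None and angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name


# Hypothetical scene: two targets two metres in front of the eye.
targets = {"button": (0.0, 0.0, -2.0), "lamp": (1.0, 0.0, -2.0)}
looked_at = select_by_gaze((0.0, 0.0, 0.0), (0.05, 0.0, -1.0), targets)
```

The angular threshold is a design choice in this sketch: it tolerates some gaze-estimation error, at the cost of ambiguity when targets sit close together.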
Rendering
Super-resolution and upscaling
Rendering at the high resolution and refresh rate a headset needs is expensive, so several techniques render fewer pixels and use a neural network to fill in the rest. Nvidia's Deep Learning Super Sampling (DLSS) renders a frame at a lower resolution and uses a network running on the GPU's Tensor cores to upscale it toward the quality of a natively rendered frame. DLSS 2.1 added VR support in September 2020, and games including No Man's Sky, Into the Radius, and Wrench were among the first VR titles to use it; Nvidia reported that DLSS roughly doubled VR performance in No Man's Sky at its Ultra preset while holding 90 FPS on a Meta Quest 2 driven by a GeForce RTX 3080.[7]
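The saving from rendering fewer pixels can be made concrete with simple arithmetic. The following is a rough sketch, assuming a single per-axis render scale; the 2000 by 2000 output size and the 0.5 scale are illustrative numbers, not settings documented for DLSS or for any headset.

```python
def shaded_pixels(width, height, axis_scale):
    """Pixels the renderer shades when each axis is rendered at axis_scale of the output size."""
    return round(width * axis_scale) * round(height * axis_scale)


def pixel_saving(width, height, axis_scale):
    """Fraction of output pixels left for the upscaler to fill in."""
    return 1.0 - shaded_pixels(width, height, axis_scale) / (width * height)


# Illustrative per-eye output of 2000 x 2000 rendered at half scale on each axis.
rendered = shaded_pixels(2000, 2000, 0.5)
saving = pixel_saving(2000, 2000, 0.5)
```

Because the scale applies to both axes, halving it cuts the shaded pixel count to a quarter, which is why modest reductions in render resolution translate into large reductions in shading work.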
Foveated reconstruction
A related idea narrows rendering to where the user is looking. Because visual acuity falls off sharply away from the fovea, foveated rendering renders the periphery at lower quality. Meta's DeepFovea, presented at SIGGRAPH Asia 2019, pushed this further: a generative adversarial network reconstructs a plausible full-quality peripheral image from a small fraction of the pixels, by matching the sparse input to a learned model of natural video, and it runs fast enough to drive a gaze-contingent head-mounted display in real time.[8]
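The sparse input to such a reconstruction network can be pictured as a mask whose density drops with distance from the gaze point. The following is a minimal sketch of one way to build such a mask; the linear falloff, the 5 degree foveal radius, the 2 percent peripheral floor, and the small-angle conversion from pixels to degrees are all assumptions for illustration and are not the sampling pattern used by DeepFovea.

```python
import math
import random


def keep_probability(ecc_deg, fovea_deg=5.0, falloff_deg=30.0, floor=0.02):
    """Probability of shading a pixel at a given angular distance from the gaze point."""
    if ecc_deg <= fovea_deg:
        return 1.0
    # Linear falloff from 1.0 at the edge of the fovea down to the floor.
    t = min(1.0, (ecc_deg - fovea_deg) / falloff_deg)
    return 1.0 + t * (floor - 1.0)


def sparse_mask(width, height, gaze_xy, deg_per_pixel, seed=0):
    """Boolean mask of pixels to render; density drops away from the gaze point."""
    rng = random.Random(seed)
    gx, gy = gaze_xy
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # Small-angle approximation: eccentricity grows linearly with pixel distance.
            ecc = math.hypot(x - gx, y - gy) * deg_per_pixel
            row.append(rng.random() < keep_probability(ecc))
        mask.append(row)
    return mask


mask = sparse_mask(64, 64, gaze_xy=(32, 32), deg_per_pixel=1.0)
kept = sum(sum(row) for row in mask)
```

Only the pixels marked True would be rendered; a reconstruction network would then be responsible for filling in the rest of the frame.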
Neural rendering
A newer class of methods represents a whole scene with a learned model and renders novel viewpoints from it. Neural radiance fields (NeRF), introduced by Mildenhall and colleagues at ECCV 2020, encode a scene in a small neural network that maps a 3D position and viewing direction to color and density; given only a set of photographs with known camera poses, the network can synthesize new views by volume rendering along camera rays.[9] 3D Gaussian splatting, presented by Kerbl and colleagues at SIGGRAPH 2023, represents a scene as a large set of 3D Gaussian primitives instead of a neural network and rasterizes them, reaching real-time novel-view synthesis at 1080p resolution above 100 frames per second.[10] Both methods turn photographs into explorable 3D scenes, which is relevant to VR for captured environments and virtual tours, but their cost has to be managed to hit VR frame rates. VR-Splatting, published in 2025 by Franke, Fink, and Stamminger, combines a sharp neural point representation in the foveal region with lighter-weight Gaussian splatting in the periphery to render at full per-eye resolution (2016x2240 pixels) in about 10.9 ms per frame, under the 11.1 ms budget for 90 Hz.[11]
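The volume rendering step used by NeRF can be sketched for a single camera ray. The following minimal example applies the standard discrete compositing rule, in which each sample's opacity is 1 - exp(-density × segment length) and is weighted by the light not yet absorbed. In a real system the densities and colors would come from the trained network; here they are passed in as plain lists, and the function name is hypothetical.

```python
import math


def composite_ray(densities, colors, deltas):
    """Accumulate a pixel color along one ray from per-sample density and color.

    densities: volume density at each sample, ordered near to far
    colors:    (r, g, b) at each sample
    deltas:    length of the ray segment each sample represents
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Empty space followed by a dense red surface: the ray should come out red.
pixel, remaining = composite_ray(
    densities=[0.0, 1e6],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    deltas=[0.1, 0.1],
)
```

Every pixel requires many such samples, each needing a network evaluation in the original NeRF formulation, which is one reason the cost has to be managed for VR frame rates.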
Avatars
Machine learning is also used to generate and drive realistic avatars. Meta Reality Labs' Codec Avatars use neural networks trained on faces captured in a multi-camera dome to produce photorealistic avatars and to animate them from a headset's sensors.[12] Because running these models in real time is demanding, Reality Labs has prototyped a headset with a custom accelerator chip dedicated to the neural network computation needed to render a Codec Avatar on a standalone device.[12] Related research lowers the capture requirement: a system demonstrated in 2022 generates a person-specific avatar from a short scan with an iPhone, using a neural "universal prior" model trained on scans of hundreds of people.[13]
References
- "Introducing Hand Tracking on Oculus Quest - Bringing Your Real Hands into VR". 2019-09-25. https://www.meta.com/blog/introducing-hand-tracking-on-oculus-quest-bringing-your-real-hands-into-vr/.
- "Using deep neural networks for accurate hand-tracking on Oculus Quest". 2019-09-25. https://ai.meta.com/blog/hand-tracking-deep-neural-networks/.
- "Quest Hand Tracking 2.1 Reduces Tracking Loss and Improves Stability". 2022-10-14. https://www.uploadvr.com/quest-hand-tracking-2-1-fast-movements/.
- "Oculus Insight: Facebook Details Quest's Inside Out Tracking System". 2019-08-22. https://www.uploadvr.com/oculus-insight-details-quest/.
- "C'mon and SLAM: How Oculus tackled portable, 6DOF tracking for the Quest". 2019-08-22. https://www.gamedeveloper.com/extended-reality/c-mon-and-slam-how-oculus-tackled-portable-6dof-tracking-for-the-quest.
- "Recent Progress on Eye-Tracking and Gaze Estimation for AR/VR Applications: A Review" (2025). Electronics. 14(17). https://www.mdpi.com/2079-9292/14/17/3352. Retrieved 2026-06-16.
- "'No Man's Sky', 'Into the Radius' and 'Wrench' Among First VR Games to Support DLSS". 2021-05-13. https://www.roadtovr.com/into-the-radius-vr-nvidia-dlss/.
- Kaplanyan, Anton S.; Sochenov, Anton; Leimkühler, Thomas; Okunev, Mikhail; Goodall, Todd; Rufo, Gizem (2019). "DeepFovea: Neural Reconstruction for Foveated Rendering and Video Compression using Learned Statistics of Natural Videos". ACM SIGGRAPH Asia. https://ai.meta.com/research/publications/deepfovea-neural-reconstruction-for-foveated-rendering-and-video-compression-using-learned-statistics-of-natural-videos/.
- Mildenhall, Ben; Srinivasan, Pratul P.; Tancik, Matthew; Barron, Jonathan T.; Ramamoorthi, Ravi; Ng, Ren (2020). "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis". European Conference on Computer Vision (ECCV). https://www.matthewtancik.com/nerf.
- Kerbl, Bernhard; Kopanas, Georgios; Leimkühler, Thomas; Drettakis, George (2023). "3D Gaussian Splatting for Real-Time Radiance Field Rendering". ACM Transactions on Graphics. 42(4). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/. Retrieved 2026-06-16.
- Franke, Linus; Fink, Laura; Stamminger, Marc (2025). "VR-Splatting: Foveated Radiance Field Rendering via 3D Gaussian Splatting and Neural Points". Proceedings of the ACM on Computer Graphics and Interactive Techniques. 8(1). https://dl.acm.org/doi/10.1145/3728302. Retrieved 2026-06-16.
- "Prototype Meta Headset Includes Custom Silicon for Photorealistic Avatars on Standalone". 2022-05-05. https://www.roadtovr.com/meta-reality-labs-research-custom-silicon-codec-avatars-standalone-vr-headset/.
- "Meta's Photorealistic Avatars Can Be Generated with Just an iPhone". 2022-06-14. https://petapixel.com/2022/06/14/metas-photorealistic-avatars-can-be-generated-with-just-an-iphone/.