Artificial intelligence

Artificial intelligence (AI) is the field of computer science concerned with building machines and software that perform tasks normally associated with human intelligence, such as perception, reasoning, learning, and language. The term was coined by John McCarthy, who defined it as "the science and engineering of making intelligent machines."^[1] Most contemporary AI is built on machine learning, in which statistical models, typically deep neural networks, learn patterns from large datasets rather than following hand-written rules.^[1]

In virtual reality (VR), augmented reality (AR), and the broader category of mixed and extended reality, AI is used in two distinct ways. Inside the device, machine learning runs the perception that makes a headset or pair of glasses work at all: estimating the pose of the head and hands, reconstructing the surrounding room, and predicting where the eyes are looking. Inside applications, generative and language models produce content and behaviour, from synthesised characters that hold a conversation to voice assistants that answer questions about what the user is seeing. This article describes both, after a short note on what AI is.

Background

AI as a research field dates to a 1956 workshop at Dartmouth College, organised by John McCarthy, Marvin Minsky, Nathaniel Rochester, and Claude Shannon.^[1] Early work emphasised symbolic reasoning and logic. The methods that dominate VR and AR today instead come from machine learning and, more specifically, deep learning, in which multi-layer neural networks are trained on large sets of examples, often labelled. Computer vision, the subfield concerned with extracting information from images, supplies most of the on-device AI in headsets, because the central problem of an immersive device is interpreting the cameras that look at the user and at the world.

AI for tracking and perception

The sensing that lets a head-mounted display place virtual content in a fixed position relative to the room depends heavily on machine learning. Inside-out tracking, in which cameras on the headset observe the environment to compute the headset's own pose, is implemented with simultaneous localization and mapping (SLAM) and computer vision. Meta has stated that on its Quest headsets a machine-learning-based approach replaced earlier hand-written tracking algorithms, improving how reliably the system recalls a space and how stable tracking stays under difficult lighting.^[2]

Hand tracking is a clear example. Rather than requiring controllers, modern standalone headsets detect the hands and fingers directly from camera images using neural networks trained to estimate joint positions. Meta credits custom neural network architectures for its hand tracking and says its Hand Tracking 2.2 release cut latency by up to 40 percent in typical use and up to 75 percent during fast motion.^[2] The same blog describes Inside-Out Body Tracking, which infers wrist, elbow, shoulder, and torso positions from the Meta Quest 3 side cameras, and a feature called Generative Legs that uses a model to produce plausible leg motion from upper-body pose, so an avatar can have moving legs without any sensor on them.^[2]

Eye tracking and the related technique of foveated rendering also rely on AI. In foveated rendering the system renders the small region the eye is fixating at full detail and the periphery at lower detail, cutting the work the GPU must do. Meta's Quest Pro uses eye-tracked foveated rendering for that purpose.^[2] A research project from Meta, DeepFovea, pushed this further: it used a generative adversarial network to reconstruct a full-quality peripheral image from a small fraction of the rendered pixels. The 2019 SIGGRAPH paper by Anton Kaplanyan and colleagues reported a 47x reduction in rendered pixels at the 50 percent detectability threshold, running at 90 Hz in an Oculus Rift.^[3] Software-only variants that predict gaze without a dedicated eye tracker have continued to appear in the research literature.^[4]
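The saving from foveated rendering can be illustrated with a small calculation. The sketch below is not Meta's implementation: the function names, the 5 and 15 degree thresholds, the three shading rates, and the tile size are all assumptions chosen for illustration. It assigns each screen tile a resolution scale based on its angular distance from the gaze point and estimates what fraction of the full-resolution pixels would actually be shaded.

```python
import math

def eccentricity_deg(pixel, gaze, pixels_per_degree):
    """Angular distance in degrees between a pixel and the gaze point.

    Uses a flat-screen approximation: distance in pixels divided by an
    assumed constant pixels-per-degree value.
    """
    dx = pixel[0] - gaze[0]
    dy = pixel[1] - gaze[1]
    return math.hypot(dx, dy) / pixels_per_degree

def shading_rate(ecc_deg, fovea_deg=5.0, mid_deg=15.0):
    """Per-axis resolution scale for a given eccentricity.

    The thresholds and rates are illustrative assumptions, not values
    taken from any shipping headset.
    """
    if ecc_deg <= fovea_deg:
        return 1.0   # full detail where the eye is fixating
    if ecc_deg <= mid_deg:
        return 0.5   # half resolution on each axis
    return 0.25      # quarter resolution in the far periphery

def rendered_pixel_fraction(width, height, gaze, pixels_per_degree, tile=32):
    """Estimate the fraction of full-resolution pixels that get shaded."""
    shaded = 0.0
    total = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            centre = (tx + tile / 2, ty + tile / 2)
            rate = shading_rate(eccentricity_deg(centre, gaze, pixels_per_degree))
            tile_pixels = min(tile, width - tx) * min(tile, height - ty)
            shaded += tile_pixels * rate * rate  # the scale applies on both axes
            total += tile_pixels
    return shaded / total

if __name__ == "__main__":
    fraction = rendered_pixel_fraction(1920, 1920, gaze=(960, 960), pixels_per_degree=20)
    print(f"shaded fraction of full-resolution pixels: {fraction:.3f}")
```

A learned reconstruction such as DeepFovea goes further than this fixed scheme by filling in the periphery from sparse samples with a neural network; the sketch only shows why rendering cost falls as detail is reduced away from the gaze point.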

On Apple's headset the same pattern holds. Apple states that the Apple Vision Pro uses "a wide range of advanced machine learning and AI models" for foundational capabilities including hand tracking and room mapping, accelerated by the Neural Engine in its chip.^[5] Its Optic ID authentication uses neural networks to analyse the iris and surrounding region for spoof resistance, with the iris data processed on the device's Secure Enclave and a protected portion of the Apple Neural Engine.^[6]

AI for scene understanding

Beyond tracking the device, AI is used to understand the scene the device is in: detecting surfaces and objects, building a spatial map of the room, and recognising things the user looks at. Computer vision models perform object recognition and semantic segmentation so that virtual content can react to real geometry, for example placing an object on a detected table or occluding it behind a real wall. Google describes its Android XR platform's Gemini assistant as able to "understand what you're seeing and take actions on your behalf" on a headset, and on glasses to "see and hear what you do, so they understand your context."^[7] This camera-based contextual understanding is the same capability that lets a pair of AI glasses answer "what am I looking at" or translate a sign in view.
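Occluding virtual content behind real geometry comes down to a per-pixel depth comparison once the scene has been reconstructed. The following is a minimal sketch under stated assumptions: the function name and the row-of-lists image layout are hypothetical, and the real-world depth map is assumed to come from the device's scene reconstruction. A virtual pixel is shown only where it is closer to the viewer than the real surface at that pixel.

```python
def composite_with_occlusion(camera_rgb, real_depth, virtual_rgb, virtual_depth):
    """Blend a rendered virtual layer over a camera image, respecting occlusion.

    All arguments are 2D lists (rows of pixels) with the same shape.
    Depths are distances from the viewer in metres; a virtual depth of
    None means no virtual content covers that pixel.
    """
    out = []
    for cam_row, rd_row, v_row, vd_row in zip(camera_rgb, real_depth,
                                              virtual_rgb, virtual_depth):
        row = []
        for cam_px, rd, v_px, vd in zip(cam_row, rd_row, v_row, vd_row):
            # The virtual pixel wins only if it exists and lies in front
            # of the real surface seen at this pixel.
            visible = vd is not None and vd < rd
            row.append(v_px if visible else cam_px)
        out.append(row)
    return out

if __name__ == "__main__":
    cam = [["wall", "wall", "table"]]
    real = [[2.0, 2.0, 1.0]]          # the table is 1 m away, the wall 2 m
    virt = [[None, "cube", "cube"]]
    vdepth = [[None, 1.5, 1.5]]       # the cube sits 1.5 m away
    print(composite_with_occlusion(cam, real, virt, vdepth))
```

In the example the cube is drawn in front of the wall but hidden where the nearer table covers it.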

Generative AI and 3D content

Creating 3D environments and assets by hand is slow, and a body of recent work applies generative AI to the problem. Two reconstruction methods are widely used to turn ordinary photographs or video into explorable 3D scenes: neural radiance fields (NeRF), which represent a scene as a function learned by a neural network, and 3D Gaussian splatting, which represents a scene as a cloud of coloured 3D Gaussians and renders quickly enough for real-time use. Gaussian splatting in particular has been studied for extended reality because its rasterised rendering suits the high, steady frame rates a headset needs.^[8] Separate text-to-3D systems generate new scenes from a written prompt rather than reconstructing real ones; the research literature includes layout-guided generators built on Gaussian splatting and methods aimed at producing complete immersive environments.^[8] These techniques are still largely in research and early tooling rather than shipping mainstream products, but they target the same goal: lowering the cost of making the 3D worlds that VR and AR consume.
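The rendering step shared by these methods is front-to-back alpha compositing: the colour of a pixel is the sum of each contribution's colour weighted by its opacity and by the transmittance left over from everything in front of it, C = Σ c_i α_i Π_{j<i} (1 − α_j). The sketch below illustrates this for a single pixel. It assumes each splat's opacity at the pixel has already been evaluated from its projected Gaussian, and the function name and tuple layout are hypothetical.

```python
def composite_front_to_back(splats, min_transmittance=1e-4):
    """Composite the splats covering one pixel, nearest first.

    splats: iterable of (depth, (r, g, b), alpha), where alpha is the
    splat's opacity at this pixel (assumed already evaluated).
    Returns the accumulated colour and the remaining transmittance.
    """
    colour = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):
        weight = alpha * transmittance
        for k in range(3):
            colour[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
        if transmittance < min_transmittance:
            break  # the pixel is effectively opaque; stop early
    return tuple(colour), transmittance

if __name__ == "__main__":
    pixel_splats = [
        (2.0, (0.0, 0.0, 1.0), 1.0),  # opaque blue splat, farther away
        (1.0, (1.0, 0.0, 0.0), 0.5),  # half-transparent red splat, nearer
    ]
    print(composite_front_to_back(pixel_splats))
```

The early exit once the pixel is nearly opaque is one reason this style of rasterised rendering can sustain high frame rates.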

AI-driven characters and NPCs

Inside VR experiences, large language models are used to drive non-player characters (NPCs) that converse in natural language instead of choosing from scripted dialogue trees. At the 2024 Game Developers Conference, Inworld AI and NVIDIA showed Covert Protocol, a tech demo built in Unreal Engine 5 in which the player is a detective questioning AI-driven characters; it combined Inworld's character engine with NVIDIA's ACE technologies, including Riva speech recognition and Audio2Face for lip-sync.^[9]

Academic work has examined how these characters perform specifically in VR, where the player speaks aloud and expects an embodied response. A 2024 study of LLM-driven NPCs found that players felt more present when characters responded believably, but that unnatural conversational flow, inconsistent answers, and the model's limited memory of past exchanges could break immersion.^[10] Reported mitigations include prompt engineering and retrieval-augmented generation to keep a character in role and give it access to relevant facts.^[10]
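Retrieval-augmented generation for a character can be sketched in a few lines. The example below is a toy illustration, not the method used in the cited study: it ranks a character's known facts by bag-of-words cosine similarity to the player's utterance and places the best matches in a prompt that also instructs the model to stay in role. All function names, the persona, and the facts are hypothetical, and a production system would more likely use learned embeddings for retrieval.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def retrieve(question, facts, k=2):
    """Return the k facts most similar to the player's question."""
    q = tokenize(question)
    ranked = sorted(facts, key=lambda f: cosine(q, tokenize(f)), reverse=True)
    return ranked[:k]

def build_prompt(persona, facts, question, k=2):
    """Assemble a prompt that keeps the character in role and grounded."""
    context = "\n".join(f"- {fact}" for fact in retrieve(question, facts, k))
    return (
        f"You are {persona}. Stay in character.\n"
        f"Known facts:\n{context}\n"
        f"Player says: {question}\n"
        "Reply in character:"
    )

if __name__ == "__main__":
    facts = [
        "The hotel key was found in the fountain.",
        "Marta works the night shift at the front desk.",
        "The lift has been broken since Tuesday.",
    ]
    print(build_prompt("a hotel porter", facts, "Who works at the front desk?", k=1))
```

The resulting prompt would then be sent to a language model; that call is omitted here because it depends on the model and service in use.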

On-device AI assistants in smart glasses

The clearest consumer use of AI in wearable AR is the voice assistant in smart glasses. The Ray-Ban Meta glasses run Meta AI: the wearer says "Hey Meta" to ask a question, and because the glasses have a camera, the assistant can answer about what is in view.^[2] Meta extended this with the Meta Ray-Ban Display, announced at Connect 2025, which adds a full-colour monocular display so that Meta AI can show answers, walking directions, and live captions rather than only speaking them. The glasses are priced at 799 US dollars including the Meta Neural Band and went on sale on 30 September 2025.^[11] The Meta Neural Band is a surface-electromyography wristband, rated for up to 18 hours of battery life, that reads muscle signals so the wearer can scroll and select with small finger movements.^[11]

Google's Android XR takes the same approach with the Gemini assistant, which it positions as the primary way to interact with both glasses and headsets; the first Android XR headset named by Google was Samsung's Project Moohan.^[7] Across these products the pattern is consistent: the camera and microphones feed an AI model, and the model returns spoken or on-screen help without the user reaching for a phone.

Privacy and accuracy concerns

Because the camera-and-AI combination in a headset or glasses observes the wearer and bystanders continuously, it raises privacy questions distinct from those of a phone. Vendors describe on-device processing as a mitigation; Apple's Optic ID, for instance, keeps iris data inside the Secure Enclave and does not expose it to apps.^[6] For generative features, the same hallucination and consistency problems documented for language models apply, and the VR NPC research above found that incorrect or out-of-character responses directly reduced the sense of presence the experience was trying to create.^[10]

References

[1] Stanford Institute for Human-Centered Artificial Intelligence. What is artificial intelligence (AI)? https://hai.stanford.edu/ai-definitions/what-is-artificial-intelligence-ai
[2] Meta. The Magic Under the Hood: How AI Is Powering Meta's Technologies Today and in the Future. Meta Quest Blog. https://www.meta.com/blog/ai-powered-technologies-quest-3-pro-ray-ban-meta-smart-glasses/
[3] Kaplanyan, A., Sochenov, A., Leimkuhler, T., Okunev, M., Goodall, T. and Rufo, G. (2019). DeepFovea: neural reconstruction for foveated rendering and video compression using learned natural video statistics. ACM SIGGRAPH 2019 Talks. doi:10.1145/3306307.3328186. https://dl.acm.org/doi/10.1145/3306307.3328186
[4] Ebadulla, F., Mudlapur, C. and BV, G. (2025). GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering. arXiv:2508.13546. https://arxiv.org/abs/2508.13546
[5] Apple (2024). Apple Vision Pro brings a new era of spatial computing to business. Apple Newsroom. https://www.apple.com/newsroom/2024/04/apple-vision-pro-brings-a-new-era-of-spatial-computing-to-business/
[6] Apple. About Optic ID advanced technology. Apple Support. https://support.apple.com/en-us/118483
[7] Google (2025). A new look at how Android XR will bring Gemini to glasses and headsets. The Keyword (Google blog). https://blog.google/products/android/android-xr-gemini-glasses-headsets/
[8] Qiu, S. and colleagues (2024). Advancing Extended Reality with 3D Gaussian Splatting: Innovations and Prospects. arXiv:2412.06257. https://arxiv.org/abs/2412.06257
[9] NVIDIA (2024). NVIDIA Digital Human Technologies Bring AI Characters to Life. NVIDIA Newsroom. https://nvidianews.nvidia.com/news/nvidia-digital-human-technologies-bring-ai-characters-to-life-6900750
[10] Christiansen, F. R. and colleagues (2024). Exploring Presence in Interactions with LLM-Driven NPCs: A Comparative Study of Speech Recognition and Dialogue Options. Proceedings of the 30th ACM Symposium on Virtual Reality Software and Technology (VRST '24). doi:10.1145/3641825.3687716. https://dl.acm.org/doi/fullHtml/10.1145/3641825.3687716
[11] Meta (2025). Meta Ray-Ban Display: Breakthrough AI Glasses Available Now. Meta Quest Blog. https://www.meta.com/blog/meta-ray-ban-display-ai-glasses-connect-2025/
