Depth sensing

Depth sensing is the measurement of the distance from a device to the real-world surfaces around it, producing a depth map in which each pixel records how far away the scene is at that point.^[1] In virtual reality and augmented reality, it gives a headset or phone a model of the geometry of the room rather than just a flat camera image, which is what lets virtual content interact correctly with the physical world. Depth sensing underpins several features that users notice directly, including occlusion of virtual objects by real ones, reconstruction of the room as a mesh, and mixed reality passthrough, and it contributes to inside-out tracking and hand tracking.

A depth map is a single-channel image, similar to a grayscale image, where the value of each pixel is the distance from the sensor to the nearest object along that line of sight rather than a brightness.^[1] Paired with the ordinary color image from a camera, this gives an RGB-D image in which every pixel carries both color and distance.^[1] Because a single depth map only stores one distance per line of sight, it captures the visible front surfaces of a scene rather than a full volumetric model; it is sometimes described as 2.5D for that reason.^[2]

Why depth matters in VR and AR

A plain camera overlay has no idea where real surfaces are, so it draws every virtual pixel on top of the camera feed. The virtual object then floats in front of everything, including real objects that are physically nearer, and the illusion that it occupies the room collapses.^[3] Depth sensing solves this by telling the system, for every direction it can see, how far the real world is in that direction. With that information the renderer can compare the distance of each virtual pixel against the distance of the real surface behind it and decide which one the viewer should see.^[3] Depth is therefore the bridge between the flat image a camera captures and the three-dimensional space that VR and AR software needs to reason about.

How depth is measured

Several distinct technologies are used to recover depth. They split broadly into passive methods, which only observe ambient light, and active methods, which emit their own light, usually infrared, and measure how it comes back. Each makes different trade-offs in range, accuracy, power draw, and how well it copes with sunlight or moving objects.

Stereo cameras

Stereo depth works the way human binocular vision does. Two cameras a known distance apart, called the baseline, photograph the same scene from slightly different viewpoints. Software finds matching points between the two images, and the horizontal shift of a point between the left and right view, called the disparity, encodes its distance.^[4] Disparity is inversely proportional to depth: nearer objects shift more between the two views, distant objects shift less. With the focal length f and baseline B known, the depth Z of a point follows from triangulation as Z = fB / d, where d is the disparity, so a wider baseline gives more disparity and finer depth resolution for the same point.^[4]^[5]
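The triangulation relation above can be sketched in a few lines of Python. This is a minimal illustration, not any particular device's implementation: the function name, the 600-pixel focal length, and the 6 cm baseline are assumptions chosen only to make the arithmetic easy to follow, and the focal length is assumed to be expressed in pixels so that it shares units with the disparity.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Return depth Z in meters from the stereo relation Z = f * B / d.

    focal_px     -- focal length f, in pixels (hypothetical rig value)
    baseline_m   -- camera separation B, in meters
    disparity_px -- horizontal shift d of the point between views, in pixels
    """
    if disparity_px <= 0:
        # Zero disparity corresponds to a point infinitely far away.
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: f = 600 px, B = 0.06 m.
near = depth_from_disparity(600.0, 0.06, 36.0)  # large shift, close point
far = depth_from_disparity(600.0, 0.06, 9.0)    # small shift, distant point

# Effect of a one-pixel matching error at each distance.
near_error = depth_from_disparity(600.0, 0.06, 35.0) - near
far_error = depth_from_disparity(600.0, 0.06, 8.0) - far
```

With these assumed values the near point comes out at about 1 m and the far point at about 4 m. A one-pixel disparity error moves the near estimate by roughly 3 cm but the far estimate by about 0.5 m, which is why depth resolution falls off with distance and why a wider baseline helps.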

Pure stereo is passive, so it needs no emitter and works outdoors in daylight. Its main weakness is that it depends on finding matching features between the two images. On textureless or repetitive surfaces, such as a blank white wall or a uniform floor, there is too little to match and the depth estimate degrades or fails, and the same is true in very low light.^[6] Some systems address this with active stereo, projecting an infrared texture onto the scene so that even blank surfaces gain matchable detail.^[6] Stereo depth from a headset's own tracking cameras is the basis of the depth used for passthrough on several standalone headsets.
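The matching problem on textureless surfaces can be seen in a toy example. The sketch below uses sum-of-absolute-differences block matching along a single scanline, a simple classical technique picked here for illustration; it is not claimed to be the matcher used in any product. The function name, window size, and pixel values are all hypothetical, and the two rows are assumed to come from a rectified pair so that a point at column x in the left image appears at column x minus the disparity in the right image.

```python
def best_disparity(left_row, right_row, x, half_window=2, max_disparity=8):
    """Return (disparity, cost, is_ambiguous) for column x of the left row.

    Each candidate disparity d is scored by the sum of absolute differences
    (SAD) between a window around x in the left row and the same window
    shifted d pixels to the left in the right row.
    """
    if x - half_window - max_disparity < 0 or x + half_window >= len(left_row):
        raise ValueError("window or search range falls outside the scanline")
    costs = []
    for d in range(max_disparity + 1):
        cost = 0
        for offset in range(-half_window, half_window + 1):
            cost += abs(left_row[x + offset] - right_row[x + offset - d])
        costs.append(cost)
    best = min(costs)
    # If several disparities tie for the lowest cost, the match is not unique.
    return costs.index(best), best, costs.count(best) > 1


# Textured scanline: the left view is the right view shifted by 3 pixels.
right_textured = [3, 9, 1, 7, 4, 8, 2, 6, 5, 0, 9, 3, 7, 1, 8, 2, 6, 4, 0, 5]
left_textured = [0, 0, 0] + right_textured[:-3]
textured_result = best_disparity(left_textured, right_textured, x=12)

# Blank wall: every pixel has the same brightness in both views.
flat = [5] * 20
flat_result = best_disparity(flat, flat, x=12)
```

On the textured row the search recovers the 3-pixel shift with a single lowest-cost candidate. On the flat row every candidate disparity scores the same, so the reported value is arbitrary and the ambiguity flag is set, which is the failure that active stereo is meant to avoid by projecting texture.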

Structured light

A structured light sensor projects a known pattern, typically a dense grid of infrared dots, onto the scene and watches how that pattern deforms when it lands on surfaces at different distances. Because the projected pattern is known in advance, the way the dots shift and change size reveals the depth at each point through triangulation between the projector and the camera.^[7] The original Microsoft Kinect for the Xbox 360, released in November 2010, is the best-known example. Its depth technology was licensed from the Israeli company PrimeSense and projected an infrared speckle pattern that a separate infrared camera read to compute distance.^[8]^[9]

Structured light can be very accurate at close range, with high spatial resolution and sub-millimeter precision near the sensor.^[10] Its drawbacks are that the projected pattern is easily washed out by bright ambient sunlight, which limits outdoor use, and that building a frame can require capturing the pattern over a short interval, which makes fast-moving objects harder to handle.^[10] Interference can also occur when more than one structured light device illuminates the same scene.

Time of flight

A time-of-flight (ToF) sensor emits a pulse or a modulated wave of infrared light and measures how long the light takes to travel out to a surface and bounce back. Since the speed of light is constant, that round-trip time converts directly into distance: the distance is half the round-trip time multiplied by the speed of light.^[9] ToF captures depth for the whole frame in a single measurement cycle rather than scanning a pattern, which makes it well suited to dynamic scenes, and because it supplies its own light it can work in very low light or total darkness.^[10]^[11] The trade-offs are that depth noise grows with distance, and reflections that reach the sensor by more than one path, called multipath interference, can corrupt the reading.^[10] Microsoft switched the second-generation Kinect, released in late 2013, from structured light to time of flight, illustrating the industry move toward ToF for depth.^[9]
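The conversion from travel time to distance is simple enough to write out. The following sketch covers only the pulsed case, where the round-trip time is measured directly; the function names are illustrative and sensors that use a modulated wave recover the delay from phase instead, which is not shown here.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # defined value of c in vacuum


def distance_from_round_trip(round_trip_s: float) -> float:
    """Distance to the surface in meters; the light covers it twice."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def round_trip_for_distance(distance_m: float) -> float:
    """Round-trip travel time in seconds for a surface distance_m away."""
    return 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S


# A surface 2 m away returns light after roughly 13 nanoseconds.
two_meter_delay_ns = round_trip_for_distance(2.0) * 1e9

# A timing error of 1 nanosecond corresponds to about 15 cm of distance.
error_per_ns_m = distance_from_round_trip(1e-9)
```

The numbers show why the timing electronics matter: room-scale distances correspond to delays of only a few nanoseconds, so centimeter-level depth accuracy requires timing resolution well below one nanosecond.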

LiDAR

LiDAR, short for light detection and ranging, is a form of direct time-of-flight sensing in which the device emits laser pulses and times each returning pulse to find distance. Direct ToF is the principle behind what Apple calls the LiDAR Scanner, which it added to the iPad Pro and iPhone Pro lines. In the LiDAR Scanner, a vertical-cavity surface-emitting laser (VCSEL) emits an array of infrared points, and a single-photon avalanche diode (SPAD) array detects the returning photons and times their flight.^[12] Apple describes the scanner as measuring the distance to surrounding objects up to 5 meters away, working both indoors and outdoors, and operating "at the photon level at nano-second speeds."^[13]

Comparison of depth-sensing methods

Method	Principle	Active or passive	Strengths	Weaknesses
Stereo cameras	Triangulation from the disparity between two views a known baseline apart^[4]	Passive (active variants project texture)	No emitter needed, works in daylight, simple hardware^[6]	Fails on textureless, repetitive, or low-light surfaces where matches cannot be found^[6]
Structured light	A known infrared pattern is projected and its distortion gives depth^[7]	Active	High spatial resolution and sub-millimeter accuracy at close range^[10]	Washed out by bright sunlight, weaker with fast motion, device-to-device interference^[10]
Time of flight	The round-trip travel time of emitted light is measured^[9]	Active	Captures a full frame at once, handles dynamic scenes, works in darkness^[11]^[10]	Noise grows with distance, multipath reflections corrupt readings^[10]
LiDAR (direct ToF)	Laser pulses from a VCSEL are timed by a SPAD detector^[12]	Active	Fast, works indoors and outdoors, ranges to several meters^[13]	Sparse point grid that software must densify, added cost and components^[12]

What depth sensing enables

Occlusion

The most visible payoff of depth sensing in AR is correct occlusion, the ability for virtual objects to pass behind real ones. Google describes occlusion as "the ability for digital objects to accurately appear in front of or behind real world objects."^[14] Once a depth map of the real scene exists, the renderer compares the depth of each virtual pixel against the real depth behind it. Where the virtual object is nearer, its color is drawn; where the real surface is nearer, the camera pixel is kept and the virtual pixel is discarded, which is the same depth-buffer logic used in ordinary 3D graphics extended so that the real world acts as one more set of depth values.^[3]
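The per-pixel comparison described above can be sketched as a small compositing function. This is an illustrative sketch using plain nested lists rather than a GPU depth buffer, and it is not taken from any vendor's API: the function name, the use of None to mark pixels with no virtual content, and the tiny one-row image are all assumptions made for readability.

```python
def composite_with_occlusion(camera_rgb, real_depth, virtual_rgb, virtual_depth):
    """Blend a virtual layer over a camera image using a per-pixel depth test.

    camera_rgb, virtual_rgb -- 2D lists of color values, same shape
    real_depth              -- 2D list of real-world distances in meters
    virtual_depth           -- 2D list of virtual distances in meters,
                               or None where there is no virtual content
    """
    output = []
    for cam_row, real_row, virt_row, vdepth_row in zip(
        camera_rgb, real_depth, virtual_rgb, virtual_depth
    ):
        out_row = []
        for cam_px, real_z, virt_px, virt_z in zip(
            cam_row, real_row, virt_row, vdepth_row
        ):
            if virt_z is not None and virt_z < real_z:
                out_row.append(virt_px)  # virtual object is nearer: draw it
            else:
                out_row.append(cam_px)   # real surface is nearer: keep camera
        output.append(out_row)
    return output


# One-row example: a virtual cube 1.5 m away spans the first two pixels.
camera = [["wall", "chair", "wall"]]
real = [[3.0, 1.0, 3.0]]            # the chair is only 1 m from the viewer
virtual = [["cube", "cube", None]]
virtual_z = [[1.5, 1.5, None]]

frame = composite_with_occlusion(camera, real, virtual, virtual_z)
```

In the example the cube is drawn over the distant wall but hidden where the nearer chair stands in front of it, so the result is cube, chair, wall. A real renderer would do the same comparison in a shader, and would typically soften the edge between real and virtual pixels because real depth maps are noisy.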

Scene reconstruction and meshing

Depth maps can be accumulated as the device moves and fused into a single three-dimensional model of the room. Each depth pixel becomes a point in space, the points form a point cloud, and the cloud is converted into a triangle mesh that approximates the real surfaces.^[2] This mesh lets virtual objects rest on real tables, bounce off real walls, and be hidden by real furniture even when the user is not looking directly at those surfaces. Apple's Scene Geometry API, added with ARKit 3.5, uses the LiDAR Scanner to scan the environment and create mesh geometry that apps can use for occlusion and lighting.^[15]
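The first step of that pipeline, turning depth pixels into points in space, can be sketched with a pinhole camera model. The code below is a simplified illustration under stated assumptions: the depth value is taken to be the distance along the camera's optical axis, the intrinsics fx, fy, cx, and cy are hypothetical values in pixels, and zero or missing depth is treated as invalid. Fusing several views and converting the cloud into a triangle mesh are separate steps not shown here.

```python
def depth_map_to_points(depth_m, fx, fy, cx, cy):
    """Back-project a depth map into a list of (x, y, z) camera-space points.

    depth_m -- 2D list of depths in meters; 0 or None marks an invalid pixel
    fx, fy  -- focal lengths in pixels (assumed pinhole intrinsics)
    cx, cy  -- principal point in pixels
    """
    points = []
    for v, row in enumerate(depth_m):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue  # no measurement for this pixel
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with one invalid pixel and made-up intrinsics.
depth = [
    [2.0, 2.0],
    [0.0, 4.0],
]
cloud = depth_map_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Each valid pixel yields one point, so this example produces three points, with the farther pixel spreading wider from the optical axis than the nearer ones. Points from successive frames would then be transformed by the device pose into a shared world frame before meshing.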

Passthrough

In mixed reality passthrough, the headset shows the user a live camera view of their surroundings with virtual content composited in. Depth is what makes that composite read as a single space, because it lets the system place virtual objects at the right distance and hide them behind real ones. On the Meta Quest 3, the real-time depth map used for passthrough is computed by a computer-vision algorithm that compares the views from the two forward-facing tracking cameras, while a separate depth sensor is used mainly to build the room mesh during setup.^[16]

Inside-out tracking and hand tracking

Depth information also supports inside-out tracking, where the headset locates itself using its own outward-facing sensors, and hand tracking, where the system follows the user's bare hands. On the original Microsoft HoloLens, the custom time-of-flight depth camera served two roles: it helped with hand tracking and it performed the surface reconstruction needed to place holograms on real objects.^[17] Knowing the distance to the hand and to surrounding surfaces makes it easier to separate the hand from the background and to estimate the pose of the fingers in three dimensions.

Example hardware and software

Microsoft HoloLens

The first-generation HoloLens, released in 2016, carried a custom time-of-flight depth sensor derived from Kinect technology, alongside an inertial measurement unit and four environment-tracking cameras, and used the depth data for both hand tracking and spatial mapping.^[17] The second-generation HoloLens 2 uses a one-megapixel time-of-flight depth camera based on the Azure Kinect sensor, with a 1024 by 1024 sensor and a range of about 0.5 to 5 meters, to drive its environmental understanding and hand tracking.^[18]

Meta Quest 3

The Meta Quest 3 includes a dedicated depth projector and sensor that improve room meshing and produce more accurate virtual boundaries for mixed reality, while the depth map used for live passthrough is generated in software from the two front greyscale tracking cameras.^[16] For developers, the Meta Depth API provides "a real-time depth map that represents the physical environment's depth as it's seen from the user's point of view," which enables dynamic occlusion so that virtual elements can be hidden behind moving real objects, and the Mesh API exposes a Scene Mesh that reconstructs the room into "a single triangle-based mesh."^[19]

Apple LiDAR, ARKit, and ARCore

Apple added the LiDAR Scanner to the iPad Pro in March 2020 and to the iPhone 12 Pro and iPhone 12 Pro Max later that year.^[13]^[15] On those devices ARKit exposes a dense per-pixel depth map and, combined with the Scene Geometry mesh, uses it to make virtual object occlusion more realistic and to improve people occlusion and motion capture.^[15] On Android, Google's ARCore takes a different approach with its Depth API, publicly launched in 2020. It runs a depth-from-motion algorithm that captures multiple images from slightly different viewpoints as the device moves and compares them to estimate the distance to every pixel, so it works on ordinary phones with a single camera and no dedicated depth sensor; where a time-of-flight sensor is present, the algorithm merges its data for better accuracy.^[20]^[14] ARCore reports the best depth results between about 0.5 and 5 meters, with usable estimates out to roughly 65 meters, and uses the depth maps for occlusion, physics, and surface interaction.^[20]

Relationship to computer vision

Depth sensing sits within the broader field of computer vision, the discipline concerned with extracting structured information from images. Passive stereo depth is a classic computer-vision problem solved by image matching, and the depth-from-motion approach in ARCore applies the same idea over time, using machine learning to improve the result even when the user barely moves.^[20] Active sensors such as structured light, time of flight, and LiDAR add purpose-built hardware to recover depth more directly, but their output still feeds the same downstream computer-vision tasks of meshing, segmentation, and tracking that turn a depth map into a usable model of the world.

References

[1] "Image-Guided Depth Upsampling via Hessian and TV Priors". https://arxiv.org/pdf/1910.14377.
[2] "Real-time scene reconstruction and triangle mesh generation using multiple RGB-D cameras". https://link.springer.com/article/10.1007/s11554-017-0736-x.
[3] "What is occlusion in AR, and how is it managed?". Milvus. https://milvus.io/ai-quick-reference/what-is-occlusion-in-ar-and-how-is-it-managed.
[4] "Stereo Vision and Depth Estimation". GeeksforGeeks. https://www.geeksforgeeks.org/computer-vision/stereo-vision-and-depth-estimation/.
[5] "Reliable Disparity Estimation Using Multiocular Vision with Adjustable Baseline". PMC. https://pmc.ncbi.nlm.nih.gov/articles/PMC11723057/.
[6] "A Comparison and Evaluation of Stereo Matching on Active Stereo Images". Sensors (MDPI). https://www.mdpi.com/1424-8220/22/9/3332.
[7] "Understanding Depth Cameras: Structured Light, TOF, Stereo". DFRobot Wiki. https://wiki.dfrobot.com/tutorial/20145.
[8] "How Microsoft's PrimeSense-based Kinect Really Works". Electronic Design. https://www.electronicdesign.com/technologies/embedded/article/21795925/how-microsofts-primesense-based-kinect-really-works.
[9] "Kinect range sensing: Structured-light versus Time-of-Flight Kinect". Computer Vision and Image Understanding (ScienceDirect). https://www.sciencedirect.com/science/article/abs/pii/S1077314215001071.
[10] "Stereo Vision vs. Structured Light vs. Time of Flight (ToF)". RF Wireless World. https://www.rfwireless-world.com/terminology/stereo-vision-vs-structured-light-vs-time-of-flight.
[11] "Advantages and Disadvantages of Time-of-Flight Cameras". FRAMOS. https://framos.com/articles/advantages-and-disadvantages-of-time-of-flight-cameras/.
[12] "LiDAR: Apple LiDAR and dTOF Analysis". 4Sense (Medium). https://4sense.medium.com/lidar-apple-lidar-and-dtof-analysis-cc18056ec41a.
[13] "Apple unveils new iPad Pro with LiDAR Scanner and trackpad support in iPadOS". Apple Newsroom. 2020-03-18. https://www.apple.com/newsroom/2020/03/apple-unveils-new-ipad-pro-with-lidar-scanner-and-trackpad-support-in-ipados/.
[14] "Google Launches Depth API For ARCore, Increasing Realism And Improving Occlusion". UploadVR. 2020-06-24. https://www.uploadvr.com/google-arcore-depth-api/.
[15] "Apple releases ARKit 3.5, adding Scene Geometry API and lidar support". VentureBeat. 2020-03-24. https://venturebeat.com/technology/apple-releases-arkit-3-5-adding-scene-geometry-api-and-lidar-support.
[16] "Quest 3 Firmware Clip Shows Depth Sensor 3D Room Meshing". UploadVR. https://www.uploadvr.com/quest-3-firmware-clip-shows/.
[17] "What's Inside Microsoft's HoloLens And How It Works". Tom's Hardware. https://www.tomshardware.com/news/microsoft-hololens-components-hpu-28nm,32546.html.
[18] "HoloLens 2, All the Specs". Next Reality. https://hololens.reality.news/news/hololens-2-all-specs-these-are-technical-details-driving-microsofts-next-foray-into-augmented-reality-0194141/.
[19] "Build Believable Mixed Reality Experiences with Mesh API and Depth API". Meta Horizon OS Developers. https://developers.meta.com/horizon/blog/mesh-depth-api-meta-quest-3-developers-mixed-reality/.
[20] "Depth adds realism". Google for Developers. https://developers.google.com/ar/develop/depth.
