Depth perception

From VR & AR Wiki

Depth perception is the ability of the visual system to judge the distance of objects and to perceive the three-dimensional structure of a scene from the largely two-dimensional images formed on the retinas.[1] The brain combines several independent sources of information, known as depth cues, that each give partial evidence about distance. These cues are conventionally divided into binocular cues, which require both eyes, and monocular cues, which are available to a single eye.[1][2]

Depth perception is central to virtual reality (VR) and augmented reality (AR) because a head-mounted display must recreate enough of these cues to make a flat, near-eye image read as a volumetric space. Stereoscopic displays reproduce many cues faithfully, but they cannot reproduce accommodation, the focusing of the eye. The resulting mismatch, called the vergence-accommodation conflict, is a recognised source of eye strain and visual fatigue in VR.[3]

History

The role of two eyes in seeing depth was established by Charles Wheatstone, who in 1838 described how each eye receives a slightly different view of a solid object and how the brain fuses the two views into a single impression of depth.[4] To demonstrate the effect, Wheatstone built the stereoscope, an instrument that uses mirrors to present a separate image to each eye, so that a pair of flat pictures drawn or photographed from two slightly separated viewpoints appears three-dimensional.[1][4] This principle of showing each eye its own image is the direct ancestor of the dual-image optics used in modern VR headsets.

For more than a century after Wheatstone it was assumed that the visual system needed to recognise object outlines before it could compute depth from the two eyes. In 1960 Bela Julesz, working at Bell Labs, disproved this with the random-dot stereogram: a pair of images made of random dots that contains no recognisable shapes to either eye alone, yet produces a clear impression of a shape floating in depth when the two are viewed one to each eye.[4][5] Julesz showed that depth from the difference between the two eyes' images can arise on its own, without any other cue, which is the perceptual basis on which stereoscopic 3D displays rely.[4]
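The construction of a random-dot stereogram can be sketched in a few lines. The following is a minimal illustration, not Julesz's original procedure: the image size, square size, pixel shift and the function name `random_dot_stereogram` are all assumptions chosen for the example. Both images start as the same random dots; a central square is displaced horizontally in one image, and the strip it uncovers is refilled with fresh random dots, so neither image contains a visible outline on its own.

```python
import random

def random_dot_stereogram(width=64, height=64, square=24, shift=4, seed=0):
    """Return (left, right) images as lists of rows of 0/1 dots.

    All sizes are illustrative assumptions, not values from the literature.
    """
    rng = random.Random(seed)
    left = [[rng.randint(0, 1) for _ in range(width)] for _ in range(height)]
    right = [row[:] for row in left]          # start as an exact copy
    top = (height - square) // 2
    x0 = (width - square) // 2
    for y in range(top, top + square):
        # Copy the central square into the right image, displaced leftwards.
        for x in range(x0, x0 + square):
            right[y][x - shift] = left[y][x]
        # The strip uncovered by the shift has no partner in the left image,
        # so it is filled with new random dots.
        for x in range(x0 + square - shift, x0 + square):
            right[y][x] = rng.randint(0, 1)
    return left, right
```

Viewed one image to each eye, the only difference between the two is the horizontal displacement of the square, which is exactly the disparity signal the visual system has to recover.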

Binocular cues

Binocular cues use the fact that the two eyes, separated horizontally by the interpupillary distance, see the world from slightly different positions.

  • Binocular disparity (the basis of stereopsis) is the small difference between the images of a scene in the left and right eyes. The visual system measures this difference and uses it to triangulate distance. It is the cue that gives the strongest sense of solid, tangible depth and is the one stereoscopic displays are built to deliver.[1][4]
  • Convergence (a form of vergence) is the inward rotation of the eyes needed to point both at the same near object. The brain reads the muscular effort of this rotation as a distance signal. Convergence and disparity are most useful within roughly ten metres and become negligible at long range.[1]
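The geometry behind both cues can be sketched with simple trigonometry. The example below assumes a point straight ahead on the midline, symmetric convergence, and an interpupillary distance of 63 mm (an assumed typical adult value, not a figure from this article); the function names are hypothetical.

```python
import math

def vergence_angle_deg(distance_m, ipd_m=0.063):
    """Angle between the two lines of sight when both eyes fixate a point
    straight ahead at distance_m (symmetric convergence assumed)."""
    return math.degrees(2.0 * math.atan(ipd_m / (2.0 * distance_m)))

def relative_disparity_deg(near_m, far_m, ipd_m=0.063):
    """Difference in vergence angle between two midline points, a common
    way of expressing the binocular disparity between them."""
    return vergence_angle_deg(near_m, ipd_m) - vergence_angle_deg(far_m, ipd_m)

if __name__ == "__main__":
    for d in (0.3, 1.0, 3.0, 10.0, 100.0):
        print(f"{d:6.1f} m  vergence {vergence_angle_deg(d):7.3f} deg")
```

Because the angle falls off roughly in proportion to 1/distance, the same step in depth produces a much smaller angular change far away than close up, which is why these cues lose their usefulness at long range.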

The horizontal overlap of the two eyes' fields, the binocular overlap, sets the region of the visual field where these cues are available; VR headsets are designed with a deliberate amount of overlap between the two displays for this reason.

Monocular cues

Monocular cues are available to one eye and therefore work on ordinary flat pictures and screens as well as in life. Most are pictorial, meaning they survive in a still image:

  • Occlusion (interposition): a nearer object partly hides a farther one, so the object that is covered is judged to be behind. Occlusion gives only the order of objects in depth, not how far apart they are, but it is the most reliable ordinal cue.[2]
  • Relative size: an object that projects a smaller image on the retina is judged farther away than an identical object that projects a larger image.[1][2]
  • Linear perspective: parallel edges, such as the sides of a road, appear to converge with distance, and the degree of convergence indicates how far away parts of the scene are.[2]
  • Texture gradient: the texture of a surface appears finer and more closely packed as it recedes.[1][2]
  • Aerial (atmospheric) perspective: distant objects appear hazier, lower in contrast and less saturated in colour because of light scattering in the air.[1]
  • Lighting and shading: the pattern of light and shadow on a surface reveals its shape and its position relative to a light source.[1]
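The relative size cue in the list above follows from the visual angle an object subtends. As a small sketch (the function name and the example sizes are assumptions for illustration), the angle for an object of a given physical size viewed face-on is:

```python
import math

def visual_angle_deg(size_m, distance_m):
    """Visual angle subtended by an object of size_m viewed face-on
    from distance_m."""
    return math.degrees(2.0 * math.atan(size_m / (2.0 * distance_m)))

if __name__ == "__main__":
    # Two identical 1.8 m tall figures at different distances.
    for d in (5.0, 10.0, 20.0):
        print(f"{d:5.1f} m  {visual_angle_deg(1.8, d):6.2f} deg")
```

For small angles, doubling the distance roughly halves the projected size, so two objects known to be identical can be ranked in depth from their retinal sizes alone.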

Two further monocular cues depend on movement or on the eye's optics rather than on a static picture:

  • Motion parallax: when the observer's head moves, near objects sweep across the field of view faster than far ones, and this difference in apparent speed specifies relative depth. Motion parallax is one of the strongest monocular cues and gives a more compelling sense of depth than the static pictorial cues alone.[1][2]
  • Accommodation: the focusing of the eye's crystalline lens, driven by the ciliary muscle, changes with the distance of whatever is being fixated, and the brain uses this muscular state as a weak distance signal at close range, within about two metres.[1]
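Motion parallax can be made concrete with a simplified case: the head translates sideways at constant speed and the observed point lies perpendicular to the direction of travel at that instant. Under those assumptions the point's angular speed is the head speed divided by the distance. The function name and the sample values below are hypothetical.

```python
import math

def parallax_deg_per_s(head_speed_m_s, distance_m):
    """Instantaneous angular speed of a stationary point that lies
    perpendicular to the direction of head translation."""
    return math.degrees(head_speed_m_s / distance_m)

if __name__ == "__main__":
    # A slow 10 cm/s sideways head movement.
    for d in (0.5, 2.0, 10.0):
        print(f"{d:5.1f} m  {parallax_deg_per_s(0.1, d):6.2f} deg/s")
```

The ratio of two angular speeds equals the inverse ratio of the distances, which is the sense in which the difference in apparent speed specifies relative depth.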

Depth perception in VR and AR

A stereoscopic head-mounted display reproduces depth by rendering the virtual scene twice, once from the position of each eye, and showing each rendering to the corresponding eye through a separate lens. This delivers binocular disparity directly, and the renderer reproduces the pictorial monocular cues (occlusion, relative size, perspective, texture, shading) as a natural by-product of drawing a three-dimensional scene.[6] Headsets that track head position also reproduce motion parallax, because the rendered viewpoint shifts as the user moves.[6] Recreating these cues is part of why a well-made VR scene can produce a sense of presence and immersion.
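The two-viewpoint rendering described above can be sketched as follows. This is a deliberately simplified model rather than any headset's actual pipeline: it assumes two parallel pinhole cameras looking along +z, a 63 mm interpupillary distance, and hypothetical helper names. Real runtimes supply per-eye poses and asymmetric projection matrices.

```python
def eye_positions(head_pos, right_dir, ipd_m=0.063):
    """Offset the tracked head position by half the IPD along the head's
    right-pointing unit vector to get the left and right eye positions."""
    half = ipd_m / 2.0
    left = tuple(p - half * r for p, r in zip(head_pos, right_dir))
    right = tuple(p + half * r for p, r in zip(head_pos, right_dir))
    return left, right

def project_x(point, eye, focal=1.0):
    """Horizontal image coordinate of a point for a pinhole camera at
    `eye` looking along +z (an assumed convention for this sketch)."""
    return focal * (point[0] - eye[0]) / (point[2] - eye[2])

def screen_disparity(point, head_pos=(0.0, 0.0, 0.0),
                     right_dir=(1.0, 0.0, 0.0), ipd_m=0.063):
    """Difference between the point's horizontal position in the
    left-eye and right-eye images."""
    left_eye, right_eye = eye_positions(head_pos, right_dir, ipd_m)
    return project_x(point, left_eye) - project_x(point, right_eye)

if __name__ == "__main__":
    for z in (0.5, 2.0, 10.0):
        print(f"depth {z:5.1f} m  disparity {screen_disparity((0.0, 0.0, z)):.5f}")
```

In this model the disparity of a point is simply the IPD divided by its depth (scaled by the focal length), and updating `head_pos` from tracking each frame is what produces motion parallax.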

The cue a conventional headset cannot reproduce correctly is accommodation. The two displays sit at a single fixed optical distance, usually set near two metres, so the eye must always focus at that one plane no matter how near or far a virtual object is meant to be. Disparity tells the eyes to converge on, say, an object held at arm's length, while the focus the eye must adopt to see a sharp image stays fixed at the display plane. The two signals, which always agree in natural vision, now disagree. This is the vergence-accommodation conflict.[3][6]
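The size of the conflict is usually expressed in diopters, the reciprocal of distance in metres. A minimal sketch, assuming the two-metre focal plane mentioned above and hypothetical function names:

```python
import math

def diopters(distance_m):
    """Optical power needed to focus at distance_m (1 / metres)."""
    return 1.0 / distance_m

def vac_mismatch_d(virtual_distance_m, focal_distance_m=2.0):
    """Vergence demand minus accommodation demand, in diopters.
    Positive values mean the virtual object is nearer than the focal plane."""
    return diopters(virtual_distance_m) - diopters(focal_distance_m)

if __name__ == "__main__":
    for d in (0.3, 0.6, 2.0, 10.0, math.inf):
        print(f"virtual object at {d:>5} m  mismatch {vac_mismatch_d(d):+.2f} D")
```

With a fixed two-metre plane the mismatch for distant objects can never exceed 0.5 D, while it grows without bound as virtual objects approach the face, which is why near content is the harder case.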

In the most cited study of the effect, Hoffman, Girshick, Akeley and Banks (2008) used a special bench display that could present matching or mismatching focus cues. When focus cues matched the simulated distance, observers fused stereoscopic images faster, discriminated finer differences in depth (higher stereoacuity), perceived depth with less distortion, and reported significantly less eye and head fatigue. The authors described this as the first demonstration that a mismatch between the stimuli to vergence and accommodation by itself causes visual fatigue and discomfort.[3] A later study by Shibata, Kim, Hoffman and Banks (2011) mapped a "zone of comfort," a range of vergence-accommodation mismatch within which most viewers are comfortable, and found that comfort depends on both the size and the direction of the mismatch and on the viewing distance.[7] The conflict is one of the contributing factors in visual discomfort and VR sickness.[8]

Approaches to reproducing focus

Several display approaches aim to restore the missing accommodation cue so that focus matches vergence:

| Approach | How it addresses focus |
| --- | --- |
| Varifocal display | Tracks the user's gaze and physically or optically changes the focal distance of the display to match where the eyes are looking.[8][9] |
| Multifocal display | Presents several focal planes at once so the eye can accommodate to the nearest available plane and receive approximately correct blur.[6][8] |
| Light field display | Recreates the directions of light rays as they would arrive from a real scene, letting each eye focus naturally between near and far virtual objects.[8] |
| Holographic / retinal display | Forms the image so that the eye can adopt a focus appropriate to the simulated depth rather than to a fixed screen.[8] |
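For the multifocal approach, choosing "the nearest available plane" is most naturally done in diopters rather than metres, since focus demand scales with the reciprocal of distance. The sketch below illustrates that selection step only; the plane distances and the function name are hypothetical and do not describe any particular display.

```python
def nearest_focal_plane(target_m, planes_m):
    """Return the focal plane (in metres) whose optical power is closest
    to that of the target distance, comparing in diopters (1 / metres)."""
    target_d = 1.0 / target_m
    return min(planes_m, key=lambda p: abs(1.0 / p - target_d))

# Hypothetical set of focal planes, in metres.
PLANES_M = (0.33, 0.5, 1.0, 4.0)

if __name__ == "__main__":
    for d in (0.4, 2.0, 10.0):
        print(f"content at {d:4.1f} m -> plane at {nearest_focal_plane(d, PLANES_M)} m")
```

Note that content at 2 m maps to the 4 m plane here, not the 1 m plane: it is closer to the former in diopters even though it is closer to the latter in metres.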

The best-known varifocal research prototype is Half Dome, shown in 2018 by Facebook Reality Labs (formerly Oculus Research) and presented in detail at Display Week by a team led by Douglas Lanman. The first version used eye tracking to mechanically move the displays back and forth so that focus followed the user's gaze; later versions replaced the moving displays with stacks of liquid-crystal lenses that switch between focal states with no moving parts.[8][9] As of 2026 these focus-correct designs remain research and prototype hardware, and shipping consumer headsets still use a single fixed focal plane.[8]
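The control idea behind a gaze-driven varifocal display can be sketched generically: estimate how far away the eyes are converged, then drive the focal distance to that value within the range the optics support. The code below is an illustrative assumption, not a description of how Half Dome works; the 63 mm interpupillary distance, the focal range limits and the function names are all hypothetical.

```python
import math

def fixation_distance_m(vergence_deg, ipd_m=0.063):
    """Distance at which the two lines of sight cross, assuming symmetric
    convergence on a point straight ahead."""
    half = math.radians(vergence_deg) / 2.0
    if half <= 0.0:
        return math.inf            # parallel gaze: looking at infinity
    return ipd_m / (2.0 * math.tan(half))

def varifocal_target_m(vergence_deg, ipd_m=0.063, min_m=0.25, max_m=6.0):
    """Focal distance to request from the optics, clamped to an assumed
    supported range [min_m, max_m]."""
    distance = fixation_distance_m(vergence_deg, ipd_m)
    return min(max(distance, min_m), max_m)

if __name__ == "__main__":
    for angle in (0.0, 1.0, 3.6, 12.0, 60.0):
        print(f"vergence {angle:5.1f} deg -> focus at {varifocal_target_m(angle):.2f} m")
```

A practical system would also have to filter noisy eye-tracking data and limit how quickly focus changes, which this sketch leaves out.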

See also

References