Volumetric video
Volumetric video is a technique for capturing a real person, object, or scene as a dynamic three-dimensional representation that a viewer can look at from any angle. Unlike conventional flat video, which records a scene from one fixed viewpoint, a volumetric recording stores the 3D geometry and surface appearance of the subject over time, so it can be placed in a virtual or augmented environment and observed from a freely chosen position.[1][2]
Because the output is a moving 3D model rather than a 2D image plane, volumetric video supports viewing with six degrees of freedom (6DoF), meaning the viewer can change both orientation and position. This distinguishes it from 360-degree video, where the viewer can only rotate the view (yaw, pitch, and sometimes roll) while the viewpoint stays fixed.[3] Volumetric content is used as a source for virtual reality, augmented reality, and mixed reality experiences, as well as for visual effects, live broadcast, and playback on conventional 2D screens.[4] The recordings are often described as "holograms" in marketing, though they are not holograms in the optical-physics sense.[4]
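The difference between the two viewing models can be illustrated with a minimal sketch. The `Pose` class and `to_view_space` function are hypothetical names introduced here for illustration, and only yaw is modelled to keep the example short. With translation disabled the viewer is pinned to the capture position, as in 360-degree video; with translation enabled, stepping sideways changes where the subject appears.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Viewer pose: position in metres plus yaw in radians.

    Pitch and roll are left out to keep the sketch short.
    """
    x: float = 0.0
    y: float = 0.0
    z: float = 0.0
    yaw: float = 0.0


def to_view_space(point, pose, allow_translation):
    """Express a world-space point in the viewer's own frame.

    With allow_translation=False the viewer position is ignored, which
    mimics 360-degree video: only the rotation has any effect.
    """
    px, py, pz = point
    if allow_translation:
        px, py, pz = px - pose.x, py - pose.y, pz - pose.z
    # Undo the viewer's yaw: rotate by -yaw about the vertical (y) axis.
    c, s = math.cos(-pose.yaw), math.sin(-pose.yaw)
    return (c * px + s * pz, py, -s * px + c * pz)


subject = (0.0, 0.0, 2.0)      # a point on the subject, 2 m ahead of the origin
stepped_aside = Pose(x=1.0)    # viewer moves 1 m sideways without turning

view_6dof = to_view_space(subject, stepped_aside, allow_translation=True)
view_3dof = to_view_space(subject, stepped_aside, allow_translation=False)
print("6DoF:", view_6dof)
print("3DoF:", view_3dof)
```

In the 6DoF case the subject shifts by one metre in the viewer's frame, which is the parallax a volumetric recording can reproduce; in the 3DoF case the sideways step has no effect.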
History
The academic foundation of volumetric video lies in work on free-viewpoint and "virtualized" capture of dynamic events in the 1990s. In 1997 Takeo Kanade and colleagues at Carnegie Mellon University described Virtualized Reality, a system that recorded a moving scene with 51 cameras (each 512 by 512 pixels, running at 30 frames per second) mounted on a five-meter geodesic dome. The system computed the 3D structure of the event using multi-baseline stereo and then rendered the scene from arbitrary new viewpoints by triangulation and texture mapping.[5] This class of capture is referred to in the research literature as free-viewpoint video (FVV).[1]
A widely cited step toward production-quality, streamable output came in 2015, when Alvaro Collet and colleagues at Microsoft Research published "High-Quality Streamable Free-Viewpoint Video" at SIGGRAPH (ACM Transactions on Graphics, volume 34, issue 4). The paper presented what the authors called the first end-to-end system to record performances with a dense set of RGB and infrared cameras, reconstruct dynamic textured surfaces, and compress them into a streamable 3D video format. Their capture rig used 106 calibrated and synchronized cameras, and the system encoded the result as tracked textured triangle meshes inside an MPEG stream so it could be played back in real time on consumer devices.[1] This work became the technical basis for Microsoft's Mixed Reality Capture Studios.[4]
Commercial capture facilities followed. In June 2018 the joint venture Volucap GmbH opened the first volumetric video studio on the European continent in Potsdam-Babelsberg, founded by ARRI, the Fraunhofer-Gesellschaft, Interlake System, Studio Babelsberg, and UFA. The studio is built around 32 cameras mounted on a light rotunda nearly four meters high and uses Fraunhofer Heinrich Hertz Institute software called 3D Human Body Reconstruction, which turns the captured footage into dynamic 3D models that can be processed like computer-generated ones.[2][6]
How it works
A volumetric capture stage surrounds the subject with an array of synchronized cameras, typically several dozen to over a hundred, recording the subject from all sides at the same time. The cameras must be precisely calibrated so their images can be combined into a single consistent 3D coordinate space.[4][1] Many systems pair colour (RGB) cameras with depth sensing, using infrared structured light, time-of-flight sensors, or laser-based depth sensing to measure the distance to surfaces. Reconstruction software then merges the colour, depth, and silhouette information to recover the shape of the subject for each frame and to apply the recorded colour as a texture.[1][3]
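The role of calibration can be shown with a small sketch under simplifying assumptions: an ideal pinhole camera with no lens distortion, a metric depth value per pixel, and known camera-to-world extrinsics. The function names and all numeric values below are illustrative and are not taken from any particular capture system. Two cameras facing each other observe the same surface point, and after calibration both measurements map to the same world coordinate.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pixel (u, v) with metric depth -> 3D point in the camera's own frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def camera_to_world(p, rotation, translation):
    """Apply camera-to-world extrinsics: world = R * p + t."""
    return tuple(
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Assumed intrinsics shared by both cameras (focal lengths and principal point in pixels).
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

# Camera A sits at the world origin looking along +z.
R_A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T_A = (0.0, 0.0, 0.0)

# Camera B sits 4 m away, turned 180 degrees about the vertical axis to face camera A.
R_B = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]]
T_B = (0.0, 0.0, 4.0)

# The same surface point as seen by each camera: (pixel u, pixel v, depth in metres).
obs_a = (520.0, 240.0, 1.5)
obs_b = (200.0, 240.0, 2.5)

world_a = camera_to_world(backproject(*obs_a, FX, FY, CX, CY), R_A, T_A)
world_b = camera_to_world(backproject(*obs_b, FX, FY, CX, CY), R_B, T_B)
print(world_a, world_b)
```

If the extrinsics of either camera were slightly wrong, the two results would disagree, and the merged surface would show seams or doubled geometry. That is why calibration accuracy matters for reconstruction.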
Microsoft's stage is a representative example: the studio in San Francisco used 106 cameras, split into 53 RGB cameras that recorded colour video from many angles and 53 infrared cameras that read a pattern of laser dots projected onto the subject to map its surface. The combined feeds produced roughly 10 gigabytes of data per second, and the system devoted extra processing to faces so that expressions came through clearly. The active capture space was about eight feet in diameter and ten feet tall.[7] Microsoft later built mobile stages that use 64 cameras instead of 106.[4]
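The figures above can be put in perspective with some back-of-the-envelope arithmetic. The sketch assumes decimal gigabytes, an even split of the data across all cameras, and a cylindrical capture space; none of these assumptions comes from the cited source.

```python
import math

CAMERAS = 106
TOTAL_BYTES_PER_SECOND = 10e9   # "roughly 10 gigabytes" per second, decimal units assumed
FEET_TO_METRES = 0.3048

# Average raw rate per camera, assuming the load is spread evenly.
per_camera_mb_s = TOTAL_BYTES_PER_SECOND / CAMERAS / 1e6

# Raw data produced by one minute of capture.
one_minute_gb = TOTAL_BYTES_PER_SECOND * 60 / 1e9

# Capture space modelled as a cylinder 8 ft across and 10 ft tall.
radius_m = (8 / 2) * FEET_TO_METRES
height_m = 10 * FEET_TO_METRES
capture_volume_m3 = math.pi * radius_m ** 2 * height_m

print(f"per camera: {per_camera_mb_s:.1f} MB/s")
print(f"one minute: {one_minute_gb:.0f} GB")
print(f"capture volume: {capture_volume_m3:.1f} cubic metres")
```

Under these assumptions each camera contributes on the order of 94 megabytes per second, and a single minute of raw capture amounts to about 600 gigabytes.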
Data representations and compression
The reconstructed result can be stored in several forms. A point cloud represents the surface as a set of coloured 3D points; a voxel grid divides space into small 3D cells; and a textured mesh represents the surface as connected triangles with an image atlas applied to them.[3][1] Each frame of a volumetric recording is far larger than a frame of ordinary video, so compression is a central problem. The Collet system encoded tracked meshes into an MPEG video stream.[1] For point-cloud content, MPEG standardized Video-based Point Cloud Compression (V-PCC) as part of ISO/IEC 23090-5. V-PCC projects each 3D frame onto 2D image patches, packs the patches into video frames that existing codecs such as HEVC can compress, and stores accompanying atlas data that describes how to map the patches back into 3D. V-PCC is built on the more general Visual Volumetric Video-based Coding (V3C) framework.[8][9]
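The relationship between these representations can be sketched in a few lines. The example below is a simplified illustration, not the normative V-PCC algorithm: it quantizes a coloured point cloud into a sparse voxel grid, and separately projects the points orthographically along the z axis into a depth image and a colour image, which is the basic idea behind turning 3D data into 2D pictures that a video codec can handle. The function names, the cell size, and the sample points are all hypothetical.

```python
import math
from collections import defaultdict


def voxelize(points, cell_size):
    """Group coloured points (x, y, z, r, g, b) into voxels and average their colours."""
    buckets = defaultdict(list)
    for x, y, z, r, g, b in points:
        key = (
            math.floor(x / cell_size),
            math.floor(y / cell_size),
            math.floor(z / cell_size),
        )
        buckets[key].append((r, g, b))
    return {
        key: tuple(sum(c[i] for c in colours) // len(colours) for i in range(3))
        for key, colours in buckets.items()
    }


def project_along_z(points, cell_size, width, height):
    """Orthographic projection onto the xy plane, keeping the nearest point per pixel.

    Returns a depth image and a colour image; empty pixels hold None.
    """
    depth = [[None] * width for _ in range(height)]
    colour = [[None] * width for _ in range(height)]
    for x, y, z, r, g, b in points:
        u, v = math.floor(x / cell_size), math.floor(y / cell_size)
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
                colour[v][u] = (r, g, b)
    return depth, colour


# Three sample points: two fall in the same cell, one lies further along x.
cloud = [
    (0.01, 0.02, 0.03, 255, 0, 0),
    (0.04, 0.01, 0.02, 0, 0, 255),
    (0.26, 0.00, 0.00, 0, 255, 0),
]

voxels = voxelize(cloud, cell_size=0.1)
depth_image, colour_image = project_along_z(cloud, cell_size=0.1, width=3, height=1)
print(voxels)
print(depth_image)
print(colour_image)
```

A real encoder segments the cloud into many patches with different projection directions and records the mapping back to 3D as metadata; this sketch uses a single projection and keeps no such metadata.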
Volumetric video is related to several other 3D-capture and rendering methods but is distinguished by its focus on moving, time-varying subjects.
Photogrammetry reconstructs a 3D model from a set of overlapping still photographs of a static object or scene. It produces a single fixed model rather than a moving one, so a volumetric capture can be thought of, in part, as photogrammetry-like reconstruction repeated for every frame of a performance.[10]
Neural radiance fields (NeRF) use a neural network to learn a scene from images and synthesize photorealistic new viewpoints. NeRF was developed mainly for static scenes, and extending it to moving subjects (dynamic NeRF) is an active research area aimed at producing neural volumetric video with smaller data sizes than explicit meshes or point clouds.[10][11] Gaussian splatting, a newer representation that depicts a scene as many overlapping 3D Gaussian blobs, is likewise being adapted from static capture toward dynamic, volumetric use.[10]
Applications
VR, AR, and mixed reality
The most direct use of volumetric video is placing a recorded person into an immersive scene. Microsoft Mixed Reality Capture Studios recorded performances that could be viewed in augmented reality, virtual reality, and on 2D screens; documented examples include the New York Times "Ashley Graham: Unfiltered" interactive AR piece and corporate keynote holograms.[4] The company 8i built its business around streaming captured human "holograms" to phones, browsers, and VR or AR headsets, and in 2017 released a consumer AR app called Holo that let users place pre-recorded volumetric figures into their own photos and videos.[12][13]
Sports and live events
Intel's True View system (originally developed by the Israeli company Replay Technologies as freeD) installs an array of high-resolution cameras around a stadium to produce volumetric replays. The system uses dozens of cameras based on 20-megapixel industrial sensors to build a 3D voxel model of the play, which a rendering engine can then show from any angle within the cameras' coverage; a single 15-to-30-second clip can involve up to a terabyte of source data. True View has been deployed in the home venues of around twenty National Football League teams and in soccer stadiums in Europe.[14][15]
Current status
As of 2026, volumetric video is produced commercially by a network of specialist capture studios rather than by a single dominant platform. In August 2023 Microsoft announced that the volumetric specialist Arcturus had become the go-to-market partner for its Mixed Reality Capture Studios technology; the arrangement pairs creators with licensed MRCS studios (whose license holders include Metastage in Los Angeles, Dimension in London, and Jump in Korea) and combines the capture pipeline with Arcturus's HoloSuite editing and streaming tools.[16][17] Volucap continues to operate its studio near Berlin, and 8i remains an active volumetric video company.[13][6] Research effort has shifted toward neural and learned representations (dynamic NeRF and Gaussian splatting) and toward standardized compression such as MPEG V-PCC, with the shared goal of reducing the large data sizes that have limited streaming of volumetric content to consumer devices.[9][11]
References
- [1] Collet, A., Chuang, M., Sweeney, P., Gillett, D., Evseev, D., Calabrese, D., Hoppe, H., Kirk, A. and Sullivan, S. (2015). High-Quality Streamable Free-Viewpoint Video. ACM Transactions on Graphics (SIGGRAPH 2015), 34(4). Microsoft Corporation. https://dl.acm.org/doi/10.1145/2766945
- [2] Fraunhofer Heinrich Hertz Institute (2018). Volumetric studio opens in Babelsberg just outside Berlin. https://www.hhi.fraunhofer.de/en/news/nachrichten/2018/volumetric-studio-opens-in-babelsberg-just-outside-berlin.html
- [3] Alpha3D. Volumetric 3D video streaming explained for game developers. https://www.alpha3d.io/kb/future-of-3d/volumetric-3d-video-streaming/
- [4] Microsoft (2019). Microsoft Mixed Reality Capture Studios create holograms to educate and entertain. Microsoft Source. https://news.microsoft.com/source/features/work-life/microsoft-mixed-reality-capture-studios-create-holograms-to-educate-and-entertain/
- [5] Kanade, T., Rander, P. and Narayanan, P.J. (1997). Virtualized Reality: Constructing Virtual Worlds from Real Scenes. IEEE MultiMedia, 4(1), pp. 34-47. https://www.ri.cmu.edu/pub_files/pub4/kanade_takeo_2006_1/kanade_takeo_2006_1.pdf
- [6] Volucap. Volumetric studio opens in Babelsberg just outside Berlin. https://volucap.com/volucap-first-studio-europe/
- [7] Roettgers, J. (2018). 106 Cameras, Holograms and Sticky Tape: Inside Microsoft's Mixed Reality Capture Studios. Variety. https://variety.com/2018/digital/features/microsoft-mixed-reality-capture-behind-the-scenes-1202784950/
- [8] ISO/IEC 23090-5:2023. Information technology - Coded representation of immersive media - Part 5: Visual volumetric video-based coding (V3C) and video-based point cloud compression (V-PCC). https://www.iso.org/standard/83535.html
- [9] Schwarz, S. et al. (2022). An Overview of the MPEG Standard for Storage and Transport of Visual Volumetric Video-Based Coding. Frontiers in Signal Processing. https://www.frontiersin.org/journals/signal-processing/articles/10.3389/frsip.2022.883943/full
- [10] Varjo. Gaussian splatting vs. photogrammetry vs. NeRFs. https://get.teleport.varjo.com/blog/photogrammetry-vs-nerfs-gaussian-splatting-pros-and-cons
- [11] Wang, L. et al. (2022). NeuVV: Neural Volumetric Videos with Immersive Rendering and Editing. arXiv:2202.06088. https://arxiv.org/abs/2202.06088
- [12] Lang, B. (2017). 8i Lands $27M in Series B Funding and Reveals Tango-powered Mixed Reality App 'Holo'. Road to VR. https://www.roadtovr.com/8i-unveils-tango-powered-mixed-reality-app-holo-27m-funding/
- [13] 8i. The Future of Human Connection Through Volumetric Video. https://8i.com/about/
- [14] Vision Systems Design (2017). 360-degree sports replay vision system from Intel now installed in 11 NFL stadiums. https://www.vision-systems.com/boards-software/article/16750713/360-degree-sports-replay-vision-system-from-intel-now-installed-in-11-nfl-stadiums
- [15] Takahashi, D. (2019). Intel True View is a cool technology for immersive sports viewing. VentureBeat. https://venturebeat.com/2019/09/19/intel-true-view-is-a-cool-technology-for-immersive-sports-viewing/
- [16] Arcturus (2023). Arcturus Becomes Go-To-Market Partner for Microsoft's Mixed Reality Capture Studios Technology. https://arcturus.studio/blog/microsoft-partnership/
- [17] Peddie, J. (2023). Microsoft's MRCS rehomed at Arcturus. Jon Peddie Research. https://www.jonpeddie.com/news/microsofts-mrcs-rehomed-at-arcturus/