SLAM: Difference between revisions - VR & AR Wiki - Virtual Reality & Augmented Reality Wiki

Line 1:

[[SLAM]] (**S**imultaneous **L**ocalization **A**nd **M**apping) is a computational problem and a set of [[algorithms]] used primarily in robotics and autonomous systems, including [[VR headset]]s and [[AR headset]]s. The ~~goal of~~ SLAM is ~~for~~ a device, using data from its onboard [[sensors]] (like [[cameras]], [[IMU]]s, and sometimes [[depth sensors]]), to construct a [[map]] of an unknown [[environment]] while simultaneously determining its own position and orientation ([[pose]]) within that newly created map. This enables [[inside-out tracking]], meaning the device tracks its position in [[3D space]] without needing external sensors or markers (like [[Lighthouse]] base stations).

[[SLAM]] (**S**imultaneous **L**ocalization **A**nd **M**apping) is a computational problem and a set of [[algorithms]] used primarily in robotics and autonomous systems, including [[VR headset]]s and [[AR headset]]s. The core challenge SLAM addresses is often described as a "chicken-and-egg problem": to know where you are, you need a map, but to build a map, you need to know where you are. SLAM solves this by enabling a device, using data from its onboard [[sensors]] (such as [[cameras]], [[IMU]]s, and sometimes [[depth sensors]] such as [[Time-of-Flight|Time-of-Flight (ToF)]] sensors), to construct a [[map]] of an unknown [[environment]] while simultaneously determining its own position and orientation ([[pose]]) within that newly created map. This self-contained process enables [[inside-out tracking]], meaning the device tracks its position in [[3D space]] without needing external sensors or markers (like [[Lighthouse]] base stations).

=== How SLAM Works ===

SLAM systems typically involve several key components working together:

SLAM systems typically involve several key components working together in a continuous feedback loop:

* '''[[Feature Detection|Feature Detection/Tracking]]:''' Identifying salient points or features in the sensor data (e.g., corners in camera images). These features are tracked ~~over time~~ as the device moves.

* '''[[Feature Detection|Feature Detection/Tracking]]:''' Identifying salient points or features (often called [[landmarks]]) in the sensor data (e.g., corners in camera images using methods like the [[ORB feature detector]]). These features are tracked frame-to-frame as the device moves.

* '''[[Mapping]]:''' Using the tracked features and the device's estimated movement to build and update a representation (the map) of the environment. This map might consist of feature points~~, lines, planes,~~ or denser representations like point ~~clouds~~ or ~~meshes~~.

* '''[[Mapping]]:''' Using the tracked features and the device's estimated movement (odometry) to build and update a representation (the map) of the environment. This map might consist of sparse feature points (common for localization-focused SLAM) or denser representations like [[point cloud]]s or [[mesh]]es (useful for environmental understanding).

* '''[[Localization]] (or Pose Estimation):''' Estimating the device's current position and orientation (pose) relative to the map it has built.

* '''[[Localization]] (or Pose Estimation):''' Estimating the device's current position and orientation (pose) relative to the map it has built, often by observing how known landmarks appear from the current viewpoint.

* '''[[Loop Closure]]:''' Recognizing when the device has returned to a previously visited location. This is crucial for correcting accumulated drift in the map and pose estimate, leading to a globally consistent map.

* '''[[Loop Closure]]:''' Recognizing when the device has returned to a previously visited location by matching current sensor data to earlier map data (e.g., using appearance-based methods like [[bag-of-words]]). This is crucial for correcting accumulated [[Drift (tracking)|drift]] (incremental errors) in the map and pose estimate, leading to a globally consistent map.

* '''[[Sensor Fusion]]:''' Often combining data from multiple sensors ~~(e.g., cameras and [[IMU]]s in [[Visual Inertial Odometry|VIO]]) to improve~~ robustness ~~and accuracy~~ against ~~challenges like~~ fast motion or textureless surfaces.

* '''[[Sensor Fusion]]:''' Often combining data from multiple sensors. [[Visual Inertial Odometry|Visual-Inertial Odometry (VIO)]] is extremely common in modern SLAM, fusing camera data with [[IMU]] data. The IMU provides high-frequency motion updates, improving robustness against fast motion, motion blur, or visually indistinct (textureless) surfaces where camera tracking alone might struggle.
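
The interdependence of mapping and localization described above can be illustrated with a minimal 2D sketch. It is not taken from any particular SLAM library: the function names, the range/bearing observation model, and the assumption that heading is already known (for example from an IMU) are all simplifications chosen for illustration.

```python
import math

def landmark_from_pose(pose, rng, bearing):
    """Mapping step: place a landmark in world coordinates from a known pose.

    pose is (x, y, heading); rng and bearing describe where the landmark
    appears relative to the device (hypothetical observation model).
    """
    x, y, heading = pose
    angle = heading + bearing
    return (x + rng * math.cos(angle), y + rng * math.sin(angle))

def position_from_landmark(landmark, rng, bearing, heading):
    """Localization step: recover the device position from a mapped landmark.

    Assumes the heading is already known, e.g. from an IMU.
    """
    lx, ly = landmark
    angle = heading + bearing
    return (lx - rng * math.cos(angle), ly - rng * math.sin(angle))

# Map a landmark from a known pose, then re-localize against that landmark.
pose = (1.0, 2.0, math.pi / 2)
landmark = landmark_from_pose(pose, rng=2.0, bearing=0.0)
recovered = position_from_landmark(landmark, rng=2.0, bearing=0.0, heading=pose[2])
```

Each function needs the output of the other, which is the chicken-and-egg structure in miniature. A real system estimates both quantities jointly from many noisy observations rather than alternating between two exact formulas.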

=== SLAM vs. [[Visual Inertial Odometry]] (VIO) ===

While related and often used together, SLAM and [[Visual Inertial Odometry]] (VIO) have different primary goals:

* '''[[VIO]]''' primarily focuses on estimating the device's ego-motion (how it moves relative to its immediate surroundings) by fusing visual data from cameras and motion data from an [[IMU]]. It's excellent for short-term, low-latency tracking but can accumulate [[Drift (tracking)|drift]] over time and doesn't necessarily build a persistent, globally consistent map optimized for re-localization or ~~sharing~~.

* '''[[VIO]]''' primarily focuses on estimating the device's ego-motion (how it moves relative to its immediate surroundings) by fusing visual data from cameras and motion data from an [[IMU]]. It's excellent for short-term, low-latency tracking but can accumulate [[Drift (tracking)|drift]] over time and doesn't necessarily build a persistent, globally consistent map optimized for re-localization or loop closure. Systems like Apple's [[ARKit]] and Google's [[ARCore]] rely heavily on VIO for tracking, adding surface detection and limited mapping but typically without the global map optimization and loop closure found in full SLAM systems.

* '''SLAM''' focuses on building a map of the environment and localizing the device within that map. It aims for global consistency, often incorporating techniques like loop closure. Many modern VR/AR tracking systems use VIO for the high-frequency motion estimation component within a larger SLAM framework that handles mapping, persistence, and drift correction.

* '''SLAM''' focuses on building a map of the environment and localizing the device within that map. It aims for global consistency, often incorporating techniques like loop closure to correct drift. Many modern VR/AR tracking systems use VIO for the high-frequency motion estimation component within a larger SLAM framework that handles mapping, persistence, and drift correction. Essentially, VIO provides the odometry, while SLAM builds and refines the map using that odometry and sensor data.
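
The difference between odometry drift and loop-closure correction can be sketched in one dimension. This is a toy illustration, not how any production tracker works: the constant per-step bias, the function names, and the linear spreading of the error are assumptions made for brevity, whereas real SLAM systems typically solve a pose-graph optimization instead.

```python
def integrate_odometry(steps, start=0.0):
    """Dead reckoning: sum relative motion estimates into a trajectory."""
    poses = [start]
    for step in steps:
        poses.append(poses[-1] + step)
    return poses

def close_loop(poses):
    """Spread the end-point error linearly over the trajectory.

    Assumes a loop closure has established that the last pose should
    coincide with the first one.
    """
    error = poses[-1] - poses[0]
    n = len(poses) - 1
    return [p - error * i / n for i, p in enumerate(poses)]

# True motion is out-and-back (+1, +1, -1, -1); each step carries a +0.1 bias.
biased_steps = [1.1, 1.1, -0.9, -0.9]
drifted = integrate_odometry(biased_steps)   # ends away from the start
corrected = close_loop(drifted)              # ends back at the start
```

Odometry alone leaves the final pose offset from the starting point even though the device returned to it. Once the revisit is recognized, the accumulated error can be pushed back through the earlier poses.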

=== Importance in VR/AR ===

SLAM (often ~~in conjunction with~~ VIO) is fundamental technology for modern standalone [[VR headset]]s and [[AR headset]]s/[[Smart Glasses|glasses]]:

SLAM (often incorporating VIO) is fundamental technology for modern standalone [[VR headset]]s and [[AR headset]]s/[[Smart Glasses|glasses]]:

* '''[[6DoF]] Tracking:''' Enables full six-degrees-of-freedom tracking (positional and rotational) without external base stations ~~or markers~~, allowing users to move freely within their [[Playspace|playspace]].

* '''[[6DoF]] Tracking:''' Enables full six-degrees-of-freedom tracking (positional and rotational) without external base stations, allowing users to move freely within their [[Playspace|playspace]].

* '''[[World Locking|World-Locking]]:''' Ensures virtual objects appear stable and fixed in the real world (for AR/[[Mixed Reality|MR]]) or that the virtual environment remains stable relative to the user's playspace (for VR).

* '''[[Roomscale VR|Roomscale]] Experiences:''' Defines boundaries and understands the physical playspace for safety and ~~interaction~~.

* '''[[Roomscale VR|Roomscale]] Experiences & Environment Understanding:''' Defines boundaries (like [[Meta Quest Insight|Meta's Guardian]]) and understands the physical playspace (surfaces, obstacles) for safety, interaction, and realistic occlusion (virtual objects hidden by real ones).

* '''[[Passthrough AR|Passthrough]] and [[Mixed Reality]]:''' Helps align virtual content accurately with the real-world view captured by device cameras.

* '''Persistent Anchors & Shared Experiences:''' Allows digital content to be saved and anchored to specific locations in the real world (spatial anchors), enabling multi-user experiences where participants see the same virtual objects in the same real-world spots across different sessions or devices.

* '''Persistent Anchors & Shared Experiences:''' Allows digital content to be saved and anchored to specific locations in the real world ([[Spatial Anchor|spatial anchors]]), enabling multi-user experiences where participants see the same virtual objects in the same real-world spots across different sessions or devices.
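
World-locking and spatial anchors both depend on re-expressing a fixed world position in the device's current frame each time the pose estimate updates. The following is a minimal 2D sketch of that transform. The function name and the convention that the device's x-axis points forward are assumptions made for illustration, and real devices work with full 3D poses.

```python
import math

def world_to_device(pose, point):
    """Express a world-anchored point in the device frame.

    pose is (x, y, heading) of the device in the world; the device's
    x-axis is taken to point forward (an assumed convention).
    """
    x, y, heading = pose
    dx, dy = point[0] - x, point[1] - y
    c, s = math.cos(heading), math.sin(heading)
    # Apply the inverse rotation of the device heading to the offset.
    return (c * dx + s * dy, -s * dx + c * dy)

anchor = (1.0, 2.0)  # fixed position in the world map

# The anchor stays put in the world while the device walks toward it.
before = world_to_device((1.0, 0.0, math.pi / 2), anchor)
after = world_to_device((1.0, 1.0, math.pi / 2), anchor)
```

The anchor's world coordinates never change; only its device-frame coordinates do. Any error in the estimated pose therefore shows up directly as the virtual object appearing to slide or swim.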

=== Types ~~of SLAM~~ ===

=== Types and Algorithms ===

SLAM systems can be categorized based on the primary sensors used:

SLAM systems can be categorized based on the primary sensors used and the algorithmic approach:

* '''Visual SLAM (vSLAM):''' Relies mainly on [[cameras]]. Can be monocular (one camera), stereo (two cameras), or RGB-D (using a [[depth sensor]]). Often fused with [[IMU]] data ([[Visual Inertial Odometry|VIO-SLAM]]). ~~Popular research algorithms include~~ [[ORB-SLAM3]] and [[RTAB-Map]].

* '''Visual SLAM (vSLAM):''' Relies mainly on [[cameras]]. Can be monocular (one camera), stereo (two cameras), or RGB-D (using a [[depth sensor]]). Often fused with [[IMU]] data ([[Visual Inertial Odometry|VIO-SLAM]]).

* '''[[LiDAR]] SLAM:''' Uses Light Detection and Ranging sensors. Common in robotics and autonomous vehicles, and used in some high-end AR/MR devices (like [[Apple Vision Pro]]) often ~~in conjunction~~ with cameras for ~~improved~~ mapping and tracking robustness.

* '''[[ORB-SLAM2]]''': A widely cited open-source library using [[ORB feature detector|ORB features]]. It supports monocular, stereo, and RGB-D cameras but is purely vision-based (no IMU). Known for robust relocalization and creating sparse feature maps.

* '''Filter-based vs. Optimization-based:''' Historically, methods like [[Extended Kalman Filter|EKF-SLAM]] were common (filter-based). Modern systems often use graph-based optimization techniques (like [[bundle adjustment]]) ~~for higher accuracy~~, especially after loop closures.

* '''[[ORB-SLAM3]]''': An evolution of ORB-SLAM2 (released c. 2020/21) adding tight visual-inertial fusion (camera + IMU) for significantly improved accuracy and robustness, especially during fast motion. Supports [[fisheye lens|fisheye]] cameras and multi-map capabilities (handling different sessions or areas). It still produces a sparse map, and the system is considered state-of-the-art in research for VIO-SLAM accuracy.

* '''[[RTAB-Map]]''' (Real-Time Appearance-Based Mapping): An open-source graph-based SLAM approach focused on long-term and large-scale mapping. Uses appearance-based loop closure. While it can use sparse features, it's often used with RGB-D or stereo cameras to build *dense* maps (point clouds, [[occupancy grid]]s, meshes) useful for navigation or scanning. Can also incorporate [[LiDAR]] data. Tends to be more computationally intensive than sparse methods.

* '''[[LiDAR]] SLAM:''' Uses Light Detection and Ranging sensors. Common in robotics and autonomous vehicles, and used in some high-end AR/MR devices (like [[Apple Vision Pro]]), often fused with cameras and IMUs for enhanced mapping and tracking robustness.

* '''Filter-based vs. Optimization-based:''' Historically, filter-based methods like [[Extended Kalman Filter|EKF-SLAM]] were common. Modern systems often use graph-based optimization techniques (like [[bundle adjustment]]), which optimize the entire trajectory and map together, typically re-running after loop closures. This generally leads to higher accuracy.
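
The filter-based and optimization-based styles can be compared on a deliberately simple problem: estimating one static scalar (say, a landmark's coordinate) from a prior and several noisy measurements. This sketch uses a scalar Kalman measurement update for the filter and an inverse-variance weighted mean for the batch least-squares solution. The function names and numbers are illustrative assumptions, not part of any SLAM package.

```python
def kalman_update(mean, var, z, meas_var):
    """Filter style: fold one measurement into the running estimate."""
    gain = var / (var + meas_var)
    return mean + gain * (z - mean), (1.0 - gain) * var

def batch_estimate(prior_mean, prior_var, measurements, meas_var):
    """Optimization style: solve for the estimate using all data at once.

    For this linear-Gaussian problem the least-squares solution is the
    inverse-variance weighted mean of the prior and the measurements.
    """
    weight_sum = 1.0 / prior_var + len(measurements) / meas_var
    weighted = prior_mean / prior_var + sum(measurements) / meas_var
    return weighted / weight_sum, 1.0 / weight_sum

measurements = [2.0, 3.0]

# Sequential filtering, one measurement at a time.
mean, var = 0.0, 4.0
for z in measurements:
    mean, var = kalman_update(mean, var, z, meas_var=1.0)

# Batch solution over the same prior and measurements.
batch_mean, batch_var = batch_estimate(0.0, 4.0, measurements, meas_var=1.0)
```

In this linear case the two approaches give the same answer. The practical differences appear in nonlinear problems such as SLAM, where an optimizer can revisit and relinearize past poses while a filter has already folded them into its current estimate.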

=== Examples in VR/AR Devices ===

Many consumer VR/AR devices utilize SLAM or SLAM-like systems, often incorporating VIO:

* '''[[Meta Quest]] Headsets ([[Meta Quest 2]], [[Meta Quest 3]], [[Meta Quest Pro]]):''' Use [[Meta Quest Insight|Insight tracking]], a sophisticated inside-out system based heavily on VIO with SLAM components for mapping, boundary definition, and ~~persistence~~.

* '''[[Meta Quest]] Headsets ([[Meta Quest 2]], [[Meta Quest 3]], [[Meta Quest Pro]]):''' Use [[Meta Quest Insight|Insight tracking]], a sophisticated inside-out system based heavily on VIO (using 4 low-light [[fisheye lens|fisheye]] cameras and an IMU on Quest 2/Pro/3) with SLAM components for mapping (sparse feature map), boundary definition (Guardian), persistence, and enabling features like Passthrough and Space Sense. Considered a breakthrough for affordable, high-quality consumer VR tracking.

* '''[[Microsoft HoloLens|HoloLens 1]] & [[Microsoft HoloLens 2|HoloLens 2]]:''' Employ advanced SLAM systems using cameras, depth ~~sensors~~, and ~~IMUs~~ for robust spatial mapping and tracking.

* '''[[Microsoft HoloLens|HoloLens 1]] (2016) & [[Microsoft HoloLens 2|HoloLens 2]]:''' Employ advanced SLAM systems using multiple visible-light tracking cameras, a [[Time-of-Flight|ToF]] [[depth sensor]], and an IMU for robust spatial mapping (generating a [[mesh]] of the environment) and tracking. All processing is done on-device.

* '''[[Magic Leap 1]] & [[Magic Leap 2]]:''' Utilize SLAM for environment mapping and head tracking.

* '''[[Magic Leap 1]] (2018) & [[Magic Leap 2]]:''' Utilize SLAM ("Visual Perception") with an array of cameras and sensors for environment mapping (creating a digital mesh) and head tracking. [[Magic Leap 2]] allows saving and reusing mapped spaces ("Spatial Anchors").

* '''[[Apple Vision Pro]]:''' Features an advanced tracking system fusing data from numerous cameras, [[LiDAR]], and IMUs, implementing sophisticated VIO and SLAM techniques for detailed spatial understanding.

* '''[[Apple Vision Pro]]:''' Features an advanced tracking system fusing data from numerous cameras, [[LiDAR]], depth sensors, and IMUs, implementing sophisticated VIO and SLAM techniques for detailed spatial understanding and persistent anchoring.

* Many [[Windows Mixed Reality]] headsets.

* [[Pico Neo 3 Link|Pico Neo 3]], [[Pico 4]].

Line 40:

Line 43:

[[Category:Computer Vision]]

[[Category:Core Concepts]]

[[Category:Algorithms]]