Pose

From VR & AR Wiki

Pose is the combination of the position and orientation of a rigid body in three-dimensional space. In virtual reality (VR) and augmented reality (AR), the pose of the headset, of each hand controller, and of tracked accessories is the basic quantity a system must measure to draw a stable scene and to let a user reach into it. A full pose has six independent values: three for position (translation along the X, Y and Z axes) and three for orientation (rotation about those axes). For this reason a device that reports a complete pose is described as having six degrees of freedom (6DoF), while a device that reports orientation only has three degrees of freedom (3DoF).[1][2]

The process of measuring a pose continuously over time is called positional tracking when it includes position, or pose tracking more generally. Estimating and updating pose is the job of a VR or AR runtime's tracking subsystem, and the quality of that estimate (its accuracy, its update rate, and how little it lags real motion) is one of the determinants of comfort and presence in a head-mounted display.[1][3]

Definition and representation

A pose is a rigid transformation: it describes how to map an object's own local coordinate frame onto a reference frame, usually a fixed world frame, using only a rotation and a translation, with no scaling or shearing. The two parts are stored separately. Position is a three-component vector, conventionally in meters, that gives the offset of the object's origin from the world origin. Orientation describes how the object is turned and is most often stored as a unit quaternion, a four-number representation that avoids the gimbal-lock failure of Euler angles and is cheap to interpolate and to renormalize.[1][4]
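As a minimal sketch of this definition, the Python below stores a pose as a unit quaternion plus a position vector and uses it to map a point from the object's local frame into the world frame, rotating first and then translating. The names Pose, rotate and transform_point are hypothetical and are not taken from any particular SDK; the quaternion component order (x, y, z, w) and a right-handed frame are assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical pose container: unit quaternion (x, y, z, w) and position in meters."""
    orientation: tuple
    position: tuple


def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q = (x, y, z, w)."""
    x, y, z, w = q
    vx, vy, vz = v
    # t = 2 * cross(q.xyz, v)
    tx = 2.0 * (y * vz - z * vy)
    ty = 2.0 * (z * vx - x * vz)
    tz = 2.0 * (x * vy - y * vx)
    # v' = v + w * t + cross(q.xyz, t)
    return (
        vx + w * tx + (y * tz - z * ty),
        vy + w * ty + (z * tx - x * tz),
        vz + w * tz + (x * ty - y * tx),
    )


def transform_point(pose, p_local):
    """Map a point from the object's local frame to the world frame."""
    rotated = rotate(pose.orientation, p_local)  # rotation is applied first
    return tuple(r + t for r, t in zip(rotated, pose.position))  # then translation


# A head 1.6 m above the world origin, turned 90 degrees about the Y axis.
half = math.pi / 4
head = Pose(orientation=(0.0, math.sin(half), 0.0, math.cos(half)),
            position=(0.0, 1.6, 0.0))
point_world = transform_point(head, (1.0, 0.0, 0.0))
```

With these assumptions, a point one meter along the head's local X axis lands at roughly (0, 1.6, -1) in the world frame.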

Two cross-platform definitions illustrate the convention. In the OpenXR standard maintained by the Khronos Group, a pose is the structure XrPosef, which contains an orientation field of type XrQuaternionf and a position field of type XrVector3f; the specification states that position is expressed in meters, that orientation is a unit quaternion, and that the rotation described by the orientation is always applied before the translation described by the position.[4] Google's ARCore uses the same model: its Pose class is documented as an immutable rigid transformation from an object's local coordinate space to the world coordinate space, defined by a quaternion rotation followed by a translation. ARCore's C interface, ArPose, packs a pose into an array of seven floats in the order qx, qy, qz, qw, tx, ty, tz, that is, the four quaternion components followed by the three translation components.[5][6]

The same six values can be written in other forms. Orientation may be expressed as a 3 by 3 rotation matrix or as a set of Euler angles (roll, pitch and yaw), and the whole pose may be written as a single 4 by 4 homogeneous transformation matrix that combines rotation and translation, which is convenient because such matrices compose by multiplication. These forms are mathematically interchangeable; quaternions are favored for storage and for the running estimate inside a tracker, while a 4 by 4 matrix is what a graphics pipeline ultimately consumes.[1][5]
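The equivalence between the quaternion-plus-vector form and the 4 by 4 matrix form can be sketched as follows. The function names are hypothetical, the quaternion order (x, y, z, w) and column-vector convention are assumptions, and the matrices are plain nested lists so the example needs no third-party library.

```python
import math


def quat_to_matrix(q):
    """3x3 rotation matrix for a unit quaternion q = (x, y, z, w)."""
    x, y, z, w = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


def pose_to_matrix(q, t):
    """4x4 homogeneous matrix [R | t; 0 0 0 1] for a pose (column-vector convention)."""
    r = quat_to_matrix(q)
    return [r[0] + [t[0]], r[1] + [t[1]], r[2] + [t[2]], [0.0, 0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 4x4 matrices; composing poses is matrix multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


# Head pose in the world: 1.6 m up, turned 90 degrees about Y.
half = math.pi / 4
world_from_head = pose_to_matrix((0.0, math.sin(half), 0.0, math.cos(half)),
                                 (0.0, 1.6, 0.0))
# A hand 0.5 m in front of the head (along the head's local -Z), not rotated.
head_from_hand = pose_to_matrix((0.0, 0.0, 0.0, 1.0), (0.0, 0.0, -0.5))
# Chaining the two gives the hand pose in the world frame.
world_from_hand = matmul(world_from_head, head_from_hand)
hand_position = [world_from_hand[i][3] for i in range(3)]
```

Under these assumptions the hand ends up at roughly (-0.5, 1.6, 0): the head has turned so that its forward direction points along the world's negative X axis.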

The six degrees of freedom of a pose

| Component | Type | Degrees of freedom | Typical units or form |
|---|---|---|---|
| Position | Translation along X, Y, Z | 3 | meters (vector) |
| Orientation | Rotation about X, Y, Z (roll, pitch, yaw) | 3 | unit quaternion, rotation matrix, or Euler angles |
| Full pose | Position and orientation | 6 | quaternion plus vector, or 4 by 4 matrix |

How pose is estimated

No single sensor measures a full pose directly, so tracking systems combine complementary sensors whose errors differ. Welch and Foxlin's 2002 survey of motion tracking, written by researchers at the University of North Carolina at Chapel Hill and InterSense, argued that no single tracking technology works for every purpose and that practical systems pair sensors so that the strengths of one cover the weaknesses of another; the paper's title calls this a respectable arsenal rather than a silver bullet.[3]

The dominant pattern in modern headsets follows that advice. An inertial measurement unit (IMU), built from a gyroscope and an accelerometer (sometimes with a magnetometer), reports angular velocity and linear acceleration at a high rate, often around 1000 Hz. Integrating the gyroscope gives orientation almost instantly and smoothly, but recovering position means integrating acceleration twice after removing gravity with the estimated orientation, so small sensor errors grow quickly, a problem called drift. Optical sensing supplies the missing absolute reference: either external sensors observe infrared markers on the device (outside-in tracking) or cameras on the device observe natural features of the room (inside-out tracking), which fixes position and corrects the IMU's drift. Merging the fast, drifting inertial estimate with the slower, drift-free optical estimate is called sensor fusion and is commonly implemented with a Kalman filter or a similar estimator.[1][3][7]
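The fusion idea can be illustrated with a complementary filter, which is a much simpler stand-in for the Kalman filter mentioned above and is not how any specific headset is known to work. The sketch tracks a single yaw angle: a biased gyroscope is integrated at every step, and an occasional drift-free optical reading pulls the estimate back. The function name fuse_yaw, the blend factor alpha, and the simulated bias and rates are all assumptions chosen for illustration.

```python
def fuse_yaw(gyro_rates, optical_yaws, dt, alpha=0.9):
    """Complementary filter for one angle (radians).

    gyro_rates:   angular rate in rad/s for every IMU step.
    optical_yaws: absolute yaw in rad, or None on steps without a camera frame.
    alpha:        weight kept on the inertial estimate at each optical correction.
    """
    yaw = 0.0
    history = []
    for rate, optical in zip(gyro_rates, optical_yaws):
        yaw += rate * dt  # fast inertial prediction (accumulates bias)
        if optical is not None:
            # slow, drift-free correction blended into the running estimate
            yaw = alpha * yaw + (1.0 - alpha) * optical
        history.append(yaw)
    return history


# Simulated stationary head: true yaw is 0, but the gyro has a 0.1 rad/s bias.
steps, dt = 5000, 0.001                      # 5 s of IMU data at 1000 Hz
gyro = [0.1] * steps
camera = [0.0 if i % 10 == 0 else None for i in range(steps)]  # 100 Hz optical

gyro_only = fuse_yaw(gyro, [None] * steps, dt)[-1]   # drifts to about 0.5 rad
fused = fuse_yaw(gyro, camera, dt)[-1]               # stays within about 0.01 rad
```

In this simulation the gyro-only estimate drifts steadily while the fused estimate stays bounded, which is the behavior sensor fusion is meant to provide; a Kalman filter would additionally weight each source by its estimated uncertainty.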

When the optical part of that fusion uses a camera plus the IMU to track pose against the surrounding scene, the technique is called visual-inertial odometry (VIO). VIO is the basis of phone-based AR: Apple's ARKit world tracking and Google's ARCore both estimate the device's 6DoF pose by combining camera images with the phone's motion sensors, detecting feature points in the video and matching them across frames while the IMU bridges the gaps between camera updates.[5][7][8] On dedicated headsets the same idea drives standalone tracking systems such as Oculus Insight. Earlier tethered headsets instead relied on external optical references, such as the Constellation system of the Oculus Rift CV1 or the Lighthouse base stations used by SteamVR headsets for room-scale VR.[3]

Sensing modalities used to estimate pose

| Modality | What it measures well | Main limitation |
|---|---|---|
| Inertial (gyroscope, accelerometer) | Orientation and short-term motion at high rate | Position drifts as errors integrate |
| Optical, marker-based (outside-in tracking) | Accurate position and orientation in a fixed volume | Needs external sensors and line of sight |
| Optical, markerless (inside-out tracking, VIO) | Position and orientation without external hardware | Depends on lighting and scene texture |
| Magnetic / magnetometer | Absolute heading reference | Distorted by nearby metal and electronics |

Role in virtual and augmented reality

Pose is what links a user's real movement to the rendered viewpoint. For a head-mounted display, the headset pose sets the position and direction of the virtual camera each frame, so head pose tracking is the difference between a fixed image strapped to the face and a world the user can lean into and walk around. A 6DoF head pose enables room-scale VR, where physical steps move the in-world viewpoint, whereas a 3DoF system can only follow where the head is pointed.[1][2]
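One way to see how the head pose sets the virtual camera: a renderer typically needs the view matrix, which is the inverse of the head's world-space pose. Because a pose is a rigid transformation, the inverse can be written directly as the transposed rotation and a counter-rotated, negated translation. The sketch below assumes a 4 by 4 matrix in column-vector convention stored as nested lists; invert_rigid and apply are hypothetical helper names.

```python
def invert_rigid(m):
    """Invert a 4x4 rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[m[j][i] for j in range(3)] for i in range(3)]   # transpose of R
    t = [m[i][3] for i in range(3)]
    t_inv = [-sum(r_t[i][k] * t[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [t_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def apply(m, p):
    """Transform a 3D point by a 4x4 matrix (w component assumed to be 1)."""
    return tuple(sum(m[i][k] * p[k] for k in range(3)) + m[i][3] for i in range(3))


# Head pose: 1.6 m above the origin, turned 90 degrees about Y so that the
# head's forward direction (local -Z) points along world -X.
world_from_head = [
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.6],
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
view = invert_rigid(world_from_head)  # head_from_world, used as the view matrix

# A world point 2 m in front of the head at eye height.
point_in_view = apply(view, (-2.0, 1.6, 0.0))
```

Here the point comes out at about (0, 0, -2) in view space, that is, two meters straight ahead of the camera under the common convention that the camera looks down its local negative Z axis.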

Controller and hand pose let the user act on that world. The 6DoF pose of each tracked controller, or of the hands themselves under hand tracking, places virtual hands and tools so that grabbing, pointing and throwing line up with what the eyes see. OpenXR exposes these as poses an application queries each frame: action spaces such as a controller's grip and aim poses, and reference spaces for the head and the play area, are all XrPosef values resolved at a requested display time.[4]

In AR the role is registration: the device pose anchors virtual content to the physical world so that a placed object appears to stay on a real table as the phone or headset moves. Both ARKit and ARCore report the device pose relative to a world coordinate space established when a session starts, and developers attach virtual objects to that space through anchors so the content holds its place.[8][5]

Pose, latency, and prediction

Because rendering and display take time, the pose used to draw a frame is slightly stale by the time photons reach the eyes. The total delay from a real movement to the corresponding change on screen is the motion-to-photon latency, and excess latency is a known cause of discomfort in VR. Tracking systems therefore do not render the pose as last measured; they predict the pose the headset will have at the moment the frame is actually displayed, using the measured velocity and acceleration to extrapolate forward, an approach known as predictive tracking.[1][9]
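A minimal sketch of such extrapolation, assuming constant linear acceleration and constant angular velocity over the prediction interval, is shown below. Real runtimes use more elaborate motion models and filtering; the function names, the (x, y, z, w) quaternion order, and the choice to treat angular velocity as measured in the headset's own frame are assumptions for the example.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions in (x, y, z, w) order."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    )


def predict_pose(position, velocity, acceleration, orientation, angular_velocity, dt):
    """Extrapolate a pose dt seconds ahead.

    position (m), velocity (m/s), acceleration (m/s^2): world-frame 3-tuples.
    orientation: unit quaternion; angular_velocity: body-frame rad/s (assumed).
    """
    # Constant-acceleration model for position.
    pos = tuple(p + v * dt + 0.5 * a * dt * dt
                for p, v, a in zip(position, velocity, acceleration))
    # Constant-rate model for orientation: build the small rotation over dt.
    wx, wy, wz = angular_velocity
    speed = math.sqrt(wx * wx + wy * wy + wz * wz)
    if speed > 1e-9:
        s = math.sin(0.5 * speed * dt) / speed
        dq = (wx * s, wy * s, wz * s, math.cos(0.5 * speed * dt))
    else:
        dq = (0.0, 0.0, 0.0, 1.0)
    q = quat_mul(orientation, dq)            # body-frame rate: multiply on the right
    norm = math.sqrt(sum(c * c for c in q))  # renormalize to stay a unit quaternion
    return pos, tuple(c / norm for c in q)


# Predict 20 ms ahead for a head moving at 0.5 m/s and turning at 1 rad/s about Y.
pred_pos, pred_q = predict_pose(
    position=(0.0, 1.6, 0.0), velocity=(0.5, 0.0, 0.0), acceleration=(0.0, 0.0, 0.0),
    orientation=(0.0, 0.0, 0.0, 1.0), angular_velocity=(0.0, 1.0, 0.0), dt=0.02)
```

Over 20 ms this moves the predicted position 1 cm along X and turns the predicted orientation by 0.02 rad about Y, which gives a sense of how large the correction is at typical head speeds.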

A second pose-based correction happens after rendering. Asynchronous timewarp (ATW), added to the Oculus software by John Carmack around April 2014, takes the most recently rendered frame and, just before the display refreshes, reprojects it using the latest available head pose so the image matches where the head is at that instant. This decouples the displayed result from the application's frame rate and hides some of the remaining latency, at the cost of showing a reprojection rather than a freshly rendered view.[9][10]
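The core quantity in a rotation-only reprojection is the difference between the head orientation used to render the frame and the newest one. The sketch below computes that difference as a quaternion and applies it to a view direction; it illustrates the geometry only and is not the Oculus implementation, which operates on images on the GPU. All function names are hypothetical, and quaternions are assumed to be unit length in (x, y, z, w) order.

```python
import math


def quat_mul(a, b):
    """Hamilton product of two quaternions in (x, y, z, w) order."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return (
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    )


def conjugate(q):
    """Inverse of a unit quaternion."""
    return (-q[0], -q[1], -q[2], q[3])


def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q."""
    x, y, z, w = q
    tx = 2.0 * (y * v[2] - z * v[1])
    ty = 2.0 * (z * v[0] - x * v[2])
    tz = 2.0 * (x * v[1] - y * v[0])
    return (v[0] + w * tx + (y * tz - z * ty),
            v[1] + w * ty + (z * tx - x * tz),
            v[2] + w * tz + (x * ty - y * tx))


def warp_rotation(q_render, q_latest):
    """Rotation taking directions in the rendered view to the latest view."""
    return quat_mul(conjugate(q_latest), q_render)


def rotation_angle_deg(q):
    """Angle of the rotation represented by unit quaternion q, in degrees."""
    return math.degrees(2.0 * math.acos(min(1.0, abs(q[3]))))


# Frame rendered with the head looking straight ahead; by scan-out the head
# has turned 2 degrees to the left (positive rotation about Y).
q_render = (0.0, 0.0, 0.0, 1.0)
half = math.radians(1.0)
q_latest = (0.0, math.sin(half), 0.0, math.cos(half))

delta = warp_rotation(q_render, q_latest)
angle = rotation_angle_deg(delta)                  # about 2 degrees
center_dir = rotate(delta, (0.0, 0.0, -1.0))       # old image center in the new view
```

In this example the content that was at the center of the rendered image ends up slightly to the right in the latest view (positive X), which matches the intuition that turning the head left should slide the scene to the right.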

References