Face tracking

See also: Terms and Technical Terms

Face tracking (also written face-tracking or facial tracking, and sometimes called facial expression tracking) is a sensor technology that estimates the movements and expressions of a user's face, such as the mouth, lips, jaw, cheeks, and brows, and converts them into data that can drive an avatar or interface in real time. In the context of virtual reality (VR) and augmented reality (AR), it refers to the integration of inward-facing cameras or depth sensors within a head-mounted display (HMD), or in an add-on accessory, to reproduce a wearer's facial movements on a digital representation of themselves. The goal is to strengthen social presence and non-verbal communication in shared virtual spaces: when a person smiles, frowns, talks, or raises an eyebrow, their avatar does the same.^[1]

Face tracking is distinct from two related sensing modalities that are often discussed alongside it. Eye tracking measures where the user is looking (gaze) using the pupil and cornea, whereas face tracking is concerned with the muscles and surfaces of the rest of the face. Hand tracking estimates the pose of the hands and fingers. Some high-end headsets combine all three, treating eye tracking, face tracking, and hand tracking as separate inputs that together produce a more lifelike avatar and more natural interaction. Because facial movement is closely tied to identity, emotion, and speech, face tracking also raises privacy questions about how facial data is captured, processed, and stored.^[2]

Background

The representation underlying most modern face tracking traces to the Facial Action Coding System (FACS), a taxonomy of human facial movement. FACS is based on a system originally developed by the Swedish anatomist Carl-Herman Hjortsjö; it was adopted by the psychologists Paul Ekman and Wallace V. Friesen and published in 1978, and Ekman, Friesen, and Joseph C. Hager released a significant update in 2002. FACS decomposes any facial expression into a set of independent "action units" (AUs), each defined as the contraction or relaxation of one or more facial muscles, for example raising the inner brow, wrinkling the nose, or dropping the jaw. Because the action units are independent of any emotional interpretation, they can be combined to describe essentially any expression.^[3]

In computer animation, FACS action units are typically realized as blendshapes (also called morph targets), where each named shape corresponds to a facial movement and carries a weight, usually from 0.0 to 1.0, indicating how strongly that movement is activated. A face mesh is deformed by summing the active blendshapes, so that a smile, for instance, is produced by blending several shapes at once. This blendshape-with-weight model is the common output format of VR and AR face tracking systems, which lets the same tracking data drive many different avatar designs.^[3]^[4]
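The weighted-sum model described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation: the two-vertex "mesh", the shape names jawDrop and lipCornerPull, and the function apply_blendshapes are all made up for the example, and it assumes each blendshape is stored as per-vertex offsets from the neutral face.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


def apply_blendshapes(
    neutral: List[Vec3],
    deltas: Dict[str, List[Vec3]],
    weights: Dict[str, float],
) -> List[Vec3]:
    """Return the neutral mesh plus the weighted sum of per-shape vertex offsets."""
    result = [list(v) for v in neutral]
    for name, weight in weights.items():
        w = min(max(weight, 0.0), 1.0)  # keep the weight in the usual 0.0..1.0 range
        for i, (dx, dy, dz) in enumerate(deltas[name]):
            result[i][0] += w * dx
            result[i][1] += w * dy
            result[i][2] += w * dz
    return [(x, y, z) for x, y, z in result]


# Toy data: two vertices standing in for the mouth corners (hypothetical values).
neutral = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
deltas = {
    "jawDrop": [(0.0, -1.0, 0.0), (0.0, -1.0, 0.0)],
    "lipCornerPull": [(-0.5, 0.5, 0.0), (0.5, 0.5, 0.0)],
}
weights = {"jawDrop": 0.5, "lipCornerPull": 1.0}

deformed = apply_blendshapes(neutral, deltas, weights)
print(deformed)
```

Because the tracker only supplies the weights, the same stream of values can be applied to any avatar that provides its own offsets for the same named shapes.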

How it works

Face tracking in an HMD generally relies on one of two approaches: optical sensing of the face from cameras placed inside or beneath the headset, or, where no such cameras exist, inference of facial movement from other signals such as the microphone.

Optical (camera-based) tracking

Camera-based systems place small infrared (IR) cameras and IR illuminators so they can see parts of the face that the headset normally hides. Cameras pointed at the eyes and brows capture upper-face movement, while a downward or forward camera (often in an add-on under the visor) captures the lower face: the lips, cheeks, jaw, and sometimes the tongue. Computer vision algorithms analyze these images and output a stream of blendshape weights describing the current expression. IR illumination is used because it is invisible to the wearer and gives consistent imaging inside the dark interior of a headset. Depth-sensing front cameras, such as those used to scan the face before a session, can additionally build a 3D model of the user's features for a more lifelike result.^[4]^[5]

Audio-based estimation

Headsets that lack inward-facing cameras can still animate a mouth by estimating lip and jaw movement from the user's speech captured by the microphone. This audio-driven approach, sometimes called lip sync, does not see the face and cannot reproduce expressions that make no sound (for example a silent smile or a raised eyebrow), so it is less expressive than camera-based tracking. Meta uses microphone-based estimation for Meta Quest 2 and Meta Quest 3, which have no inward-facing cameras, and reserves true camera-based face tracking for headsets that do.^[4]
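The limitation described above can be illustrated with a deliberately crude stand-in for audio-driven estimation. The sketch below is an assumption made for illustration, not Meta's method: it maps the loudness (RMS) of one audio frame to a single jaw-open weight, whereas production lip sync typically infers mouth shapes from speech content. The function name jaw_open_from_audio and the gain value are hypothetical.

```python
import math
from typing import Sequence


def jaw_open_from_audio(samples: Sequence[float], gain: float = 4.0) -> float:
    """Map one audio frame (samples in -1.0..1.0) to a jaw-open weight in 0.0..1.0."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return min(1.0, gain * rms)  # louder speech opens the jaw further, capped at 1.0


# A silent frame and a synthetic 220 Hz tone standing in for speech (10 ms at 48 kHz).
silent_frame = [0.0] * 480
speech_frame = [0.2 * math.sin(2 * math.pi * 220 * n / 48000) for n in range(480)]

print(jaw_open_from_audio(silent_frame))
print(jaw_open_from_audio(speech_frame))
```

A silent smile produces the same all-zero input as the silent frame here, so no audio-only method can recover it.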

Representation and data levels

Whichever sensing method is used, the output is normally a set of named blendshapes with activation strengths rather than raw imagery. Meta's Movement SDK, for example, exposes facial expressions through OpenXR as defined blendshapes whose strength indicates activation; its expression sets include on the order of 63 to 71 named blendshapes covering the brows, cheeks, jaw, lips, and (in the larger set) the tongue. Apple's ARKit, used on iPhone and iPad for face-based AR, reports 52 blendshape coefficients (each 0.0 to 1.0) derived from the TrueDepth camera at 60 frames per second, which developers map onto an avatar or "puppet" that follows the user's expressions.^[4]^[6]
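Mapping tracker output onto an avatar, as described above, is largely a matter of renaming and clamping weights. The sketch below assumes a frame arrives as a dictionary of named weights; the source names follow ARKit's blendshape naming (jawOpen, mouthSmileLeft, browInnerUp), while the Avatar_ target names, the name_map table, and the retarget function are hypothetical.

```python
from typing import Dict


def retarget(weights: Dict[str, float], name_map: Dict[str, str]) -> Dict[str, float]:
    """Rename tracker blendshapes to avatar shape names, clamping weights to 0.0..1.0."""
    out: Dict[str, float] = {}
    for source_name, weight in weights.items():
        target_name = name_map.get(source_name)
        if target_name is None:
            continue  # the avatar has no matching shape, so the value is dropped
        clamped = min(max(weight, 0.0), 1.0)
        # If several source shapes feed one avatar shape, keep the strongest.
        out[target_name] = max(out.get(target_name, 0.0), clamped)
    return out


# Hypothetical avatar with one combined smile shape and no brow shapes.
name_map = {
    "jawOpen": "Avatar_JawDrop",
    "mouthSmileLeft": "Avatar_Smile",
    "mouthSmileRight": "Avatar_Smile",
}
frame = {"jawOpen": 0.7, "mouthSmileLeft": 1.2, "browInnerUp": 0.1}

avatar_weights = retarget(frame, name_map)
print(avatar_weights)
```

Taking the maximum when two source shapes share a target is one possible design choice; averaging or summing would also be reasonable depending on how the avatar's shapes were authored.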

Devices and implementations

Face tracking is far less common in consumer headsets than eye tracking, and several popular VR products that include eye tracking do not track facial expression at all.

Headsets with built-in face tracking

Meta Quest Pro: Released in October 2022, the Quest Pro was, in Meta's words, "the first headset we've built that integrates inward-facing sensors to capture natural facial expressions and eye tracking." Its face-tracking feature, branded Natural Facial Expressions, uses inward-facing cameras to detect facial movement, which the system converts into activations of FACS-based blendshapes such as jaw drop and nose wrinkle so that an avatar can mirror the wearer. The feature is off by default and the user chooses whether to enable it. Meta states that the images the headset captures of the face never leave the device and are deleted after processing, so neither Meta nor third-party apps receive the raw images.^[1]^[2]
Apple Vision Pro: Apple's headset does not animate a generic cartoon avatar but instead drives a Persona, a photorealistic digital likeness of the wearer used in FaceTime and other video apps. The Persona is created from a one-time facial scan in which the device's cameras "capture images and 3D measurements of your face, head, upper body, and facial expressions," and it is then animated live during a call to reflect the user's expressions and head movement. Apple states that the data used to build the Persona does not leave the device, and that during a call the user's Persona is sent securely to the other participants.^[5]^[7]

Add-on facial trackers

For headsets without integrated face tracking, manufacturers have sold accessories that clip onto the underside of the visor to capture the lower face.

HTC VIVE Facial Tracker: An add-on for the VIVE Pro and compatible PC VR headsets that tracks up to 38 facial movements across the lips, jaw, teeth, tongue, chin, and cheeks. It uses dual cameras with IR illumination at a 60 Hz tracking rate and a sub-10-millisecond response time, connecting over a built-in USB-C cable, with data exposed through HTC's SRanipal SDK for Unity and Unreal Engine. HTC suggested pairing it with the eye-tracking VIVE Pro Eye for full-face capture. This original tracker has since been discontinued.^[8]^[9]
VIVE Focus / XR Elite facial trackers: HTC later introduced facial trackers for its standalone line. The Facial Tracker for the VIVE Focus series captures expressions through 38 blendshapes across the lips, jaw, cheeks, chin, teeth, and tongue at 60 Hz, and the VIVE Full Face Tracker for the VIVE XR Elite, shown at CES 2024, combines lower-face tracking with eye tracking and automatic interpupillary distance calibration.^[10]^[11]

Headsets without face tracking

Several devices include eye tracking but no facial expression tracking. The PlayStation VR2 is a notable example: it carries two inward-facing IR cameras that provide Tobii-based eye tracking, used mainly for foveated rendering and gaze-based interaction, but it does not track the lower face or facial expressions for avatars. Other mainstream standalone headsets, such as the Meta Quest 2 and Meta Quest 3, have neither inward-facing cameras nor true facial expression tracking and rely on audio-based mouth estimation instead.^[12]^[4]

The table below summarizes facial sensing on several current devices.

Device	Eye tracking	Face (expression) tracking	Method
Meta Quest Pro	Yes	Yes (Natural Facial Expressions)	Inward-facing cameras, FACS blendshapes^[2]
Apple Vision Pro	Yes	Yes (drives a Persona)	Cameras plus depth sensing, on-device^[5]
PlayStation VR2	Yes (Tobii)	No	Eye-only IR cameras for foveated rendering^[12]
Meta Quest 2 / Meta Quest 3	No	No (audio lip estimation only)	Microphone-based mouth estimation^[4]
HTC VIVE Pro + VIVE Facial Tracker	Via VIVE Pro Eye	Yes (lower face, add-on)	Dual IR cameras, 38 movements, 60 Hz^[8]

Applications

Avatar expression in social VR

The primary use of face tracking is to animate a user's avatar in social VR so that conversation carries facial cues, not just voice. Meta promotes Natural Facial Expressions for products such as Horizon Worlds and Horizon Workrooms, where letting an avatar smile, raise an eyebrow, or make eye contact is intended to strengthen social presence, "the feeling that you're right there together with someone."^[1] Apple's Persona serves a similar purpose for telepresence-style video calls on Apple Vision Pro.^[5]

In the community-driven platform VRChat, face tracking is enabled through third-party tooling rather than a built-in feature. The open-source application VRCFaceTracking acts as a bridge between tracking hardware and VRChat, translating raw sensor output into the Open Sound Control (OSC) parameters that VRChat avatars expect. It works with a range of devices, including the Quest Pro (which exposes eye and face data through Meta's Face and Eye OpenXR extensions) and HTC's SRanipal-based trackers. Because VRChat avatars and trackers use many different shape conventions, VRCFaceTracking defines an open standard called Unified Expressions that is designed to be compatible with shapes from other standards such as ARKit/PerfectSync, SRanipal, and FACS. On the Quest Pro specifically, this tracking can be delivered only to the PC version of VRChat over a link connection, not to the standalone Quest build.^[13]^[14]

Other uses

Beyond avatars, facial expression data can feed user experience research and affective applications by providing an objective signal of expression during a session, and it can support more natural non-verbal communication in collaborative and training scenarios. Because the same blendshape data can drive any compatible character, face tracking is also used for virtual production and live performance, where a performer's expressions are mapped onto a digital character in real time.^[6]

Privacy considerations

Facial movement is sensitive biometric data: a face conveys identity, emotion, attention, and speech, so the capture and handling of facial data has drawn the same kind of scrutiny that surrounds eye tracking. The dominant industry response has been to process facial data on the device and avoid transmitting raw imagery. Meta states that Natural Facial Expressions is off by default, that the feature produces only a set of numbers (expression estimates) rather than storing pictures, and that the captured images of the face never leave the headset and are deleted after processing.^[2] Apple similarly says that the data used to build a Vision Pro Persona does not leave the device, although after a call a Persona may remain stored in encrypted form on the other participants' devices for up to 30 days.^[5] Even with on-device processing, the broader concern remains that detailed facial expression data could in principle reveal emotional or health-related information, which is why these features are typically opt-in and governed by per-application permissions.^[2]

References

1. "Meta Connect 2022: Meta Quest Pro, More Social VR and a Look Into the Future". Meta Platforms, Inc. 2022-10-11. https://about.fb.com/news/2022/10/meta-quest-pro-social-vr-connect-2022/
2. "Learn about Natural Facial Expressions on Meta Quest Pro". Meta Platforms, Inc. https://www.meta.com/help/quest/402982851992067/
3. "Facial Action Coding System". https://en.wikipedia.org/wiki/Facial_Action_Coding_System
4. "Face Tracking in Movement SDK for OpenXR". Meta Platforms, Inc. https://developers.meta.com/horizon/documentation/native/android/move-face-tracking/
5. "Persona & Privacy". Apple Inc. https://www.apple.com/legal/privacy/data/en/persona/
6. "blendShapes - ARFaceAnchor". Apple Inc. https://developer.apple.com/documentation/arkit/arfaceanchor/blendshapes
7. "Apple shares an in-depth look at Vision Pro privacy in new paper". 2024-02-13. https://9to5mac.com/2024/02/13/apple-vision-pro-privacy/
8. "About VIVE Facial Tracker". HTC Corporation. https://www.vive.com/us/support/facial-tracker/category_howto/about-the-tracker.html
9. "Vive Facial Tracker". https://docs.vrcft.io/docs/hardware/addons/vive/face-tracker
10. "Facial Tracker for VIVE Focus Series". HTC Corporation. https://www.vive.com/us/accessory/facial-tracker-for-vive-focus-series/
11. "Experience Ultimate VR Immersion with HTC Vive Full-Face Tracker at CES 2024". HTC Corporation. 2024-01-09. https://blog.vive.com/us/experience-ultimate-vr-immersion-with-htc-vive-full-face-tracker-ces-2024/
12. "PlayStation VR2". https://en.wikipedia.org/wiki/PlayStation_VR2
13. "Unified Expressions". https://docs.vrcft.io/docs/tutorial-avatars/tutorial-avatars-extras/unified-blendshapes
14. "Quest Pro". https://docs.vrcft.io/docs/v4.0/hardware/quest-pro
