Speech recognition

From VR & AR Wiki

Speech recognition, also called automatic speech recognition (ASR) or speech-to-text, is the conversion of spoken language into text or machine commands by a computer. A speech recognition system captures an audio signal through a microphone, processes it to remove noise, and uses acoustic and language models to infer the most probable sequence of words. The technology is distinct from speaker recognition, which identifies who is speaking rather than what is said, and from natural language understanding, which interprets the meaning of recognized text.[1]

In virtual reality (VR) and augmented reality (AR), speech recognition provides a hands-free input channel that works while a user's hands are occupied with motion controllers, while the user is reaching into a virtual scene, or when a head-mounted display has no physical keyboard. It is used for system commands, text dictation, search, and as one element of multimodal interfaces that combine voice with eye gaze and hand tracking. Major platforms with built-in speech input include Microsoft HoloLens, Apple Vision Pro, and Meta Quest headsets.[2][3][4]

How it works

A speech recognition pipeline begins with signal capture and pre-processing. The microphone produces a waveform that is cleaned to reduce background noise and normalize volume, then split into short overlapping frames from which acoustic features are extracted.[5]
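The framing step described above can be sketched in a few lines of Python. This is a minimal illustration, not any particular recognizer's front end: the function names are hypothetical, and the 25 ms frame length and 10 ms hop are common choices assumed here rather than values taken from the cited sources. Feature extraction itself is left out.

```python
def normalize_peak(samples):
    """Scale a waveform so its largest absolute sample is 1.0."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    return [s / peak for s in samples]


def split_into_frames(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a mono waveform into short overlapping frames.

    frame_ms and hop_ms are assumed defaults; a hop shorter than the
    frame length is what makes consecutive frames overlap.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must be at least one sample long")

    frames = []
    start = 0
    # Keep only complete frames; a trailing partial frame is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


if __name__ == "__main__":
    one_second = [0.0] * 16000  # one second of silence at 16 kHz
    frames = split_into_frames(normalize_peak(one_second), 16000)
    print(len(frames), len(frames[0]))
```

Each frame would then be passed to a feature extractor before reaching the acoustic model.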

Classical systems used two statistical models. An acoustic model mapped the audio features to phonetic units, the basic sound segments of a language, and a language model estimated which word sequences were most likely. For decades the dominant acoustic model was the hidden Markov model (HMM), often combined with Gaussian mixture models to represent the audio distributions.[6][7]
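The way the two models combine is usually written as a single decoding rule. The notation below is the common textbook formulation, assumed here because the cited sources do not spell it out: X stands for the sequence of acoustic features and W for a candidate word sequence.

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid X)
        \;=\; \arg\max_{W} \underbrace{P(X \mid W)}_{\text{acoustic model}} \;
                           \underbrace{P(W)}_{\text{language model}}
```

The second equality follows from Bayes' rule, since P(X) does not depend on W and can be dropped from the maximization.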

From the early 2010s, deep neural networks replaced Gaussian mixture models for the acoustic component, and large labelled datasets plus faster hardware improved accuracy.[8] Later systems moved toward end-to-end models that map audio directly to text without a separate, hand-built pronunciation stage. Connectionist temporal classification (CTC) allowed training on unsegmented audio; the Listen, Attend and Spell encoder-decoder architecture, introduced in 2015, used an attention mechanism; and the transformer architecture, published in 2017, became the basis for many later models.[8] OpenAI's Whisper, released in September 2022, is an encoder-decoder transformer trained with weak supervision on about 680,000 hours of multilingual audio collected from the web; OpenAI published its code and model weights under the MIT license.[9][8]
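CTC can work with unsegmented audio because the network emits one label per frame, including a special blank label, and a fixed rule collapses that frame-level path into text. The sketch below shows that collapsing rule with simple greedy (best-path) decoding. The function name, the toy alphabet, and the scores are hypothetical, and real systems often use beam search with a language model instead.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding: pick the best label per frame, then
    merge consecutive repeats and drop blanks.

    frame_scores: one list of per-label scores for each audio frame.
    alphabet: label strings indexed like the scores; alphabet[blank]
              is only a placeholder for the blank label.
    """
    # Best label index for every frame.
    best_path = [
        max(range(len(scores)), key=lambda i: scores[i])
        for scores in frame_scores
    ]

    decoded = []
    previous = None
    for index in best_path:
        # A label is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two identical labels
        # therefore keeps both of them.
        if index != previous and index != blank:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


if __name__ == "__main__":
    alphabet = ["-", "c", "a", "t"]  # index 0 is the blank
    scores = [
        [0.1, 0.8, 0.05, 0.05],  # c
        [0.1, 0.7, 0.1, 0.1],    # c (repeat, merged)
        [0.9, 0.0, 0.05, 0.05],  # blank
        [0.1, 0.1, 0.7, 0.1],    # a
        [0.1, 0.1, 0.1, 0.7],    # t
    ]
    print(ctc_greedy_decode(scores, alphabet))  # prints "cat"
```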

Recognition accuracy is commonly measured by the word error rate (WER): the number of inserted, deleted, or substituted words divided by the number of words in a reference transcript. In 2017 Microsoft reported a 5.1 percent WER on the Switchboard conversational telephone test set, which the company described as reaching parity with professional human transcribers on that task.[10][11] Accuracy in practice still depends on the acoustic environment: one clinical study of voice control on a head-mounted display found that recognition rates fell sharply as operating-room background noise rose, dropping below 40 percent once noise exceeded roughly 60 dB sound pressure level.[12]
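WER is typically computed as the word-level edit distance between the recognizer's output and the reference, divided by the number of reference words. The following is a minimal sketch with a hypothetical function name. It splits on whitespace only, whereas real evaluations usually also normalize case and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    using the minimum number of word edits (Levenshtein distance)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("a" -> "the") and one deletion ("now") over 4 words.
    print(word_error_rate("take a picture now", "take the picture"))  # 0.5
```

Because insertions count as errors, WER can exceed 100 percent when the output is much longer than the reference.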

History

The first speech recognizers handled only small, isolated vocabularies. In 1952 Bell Laboratories built Audrey (the Automatic Digit Recognizer), which could recognize the spoken digits zero to nine, but reliably only for its inventor's voice.[13] IBM's Shoebox, developed by William C. Dersch and demonstrated at the 1962 Seattle World's Fair, recognized 16 spoken words: the digits zero to nine plus six command words (plus, minus, total, subtotal, false, off) that drove an attached adding machine.[14]

In the 1970s the U.S. Defense Advanced Research Projects Agency funded the Speech Understanding Research program (1971-1976) with the goal of recognizing connected speech over a vocabulary of at least 1,000 words. Carnegie Mellon University's Harpy system, built by Bruce Lowerre under Raj Reddy, met the target using a graph-search method called beam search.[13][15] Statistical methods then took over: James and Janet Baker applied hidden Markov models in their Dragon work, and in 1987 Kai-Fu Lee at Carnegie Mellon produced Sphinx-I, an early speaker-independent continuous recognition system that combined HMMs with beam search.[13][7] The Bakers' company Dragon Systems shipped Dragon NaturallySpeaking in 1997, described as the first general continuous dictation product for personal computers, accepting natural speech at roughly 100 words per minute.[16]

Selected milestones in speech recognition
Year | System or event | Organization | Note
1952 | Audrey | Bell Labs | Recognized spoken digits 0-9, speaker-dependent[13]
1962 | Shoebox | IBM | 16 words (digits plus arithmetic commands), shown at Seattle World's Fair[14]
1971-1976 | Speech Understanding Research | DARPA | Funded 1,000-word connected-speech research[15]
1976 | Harpy | Carnegie Mellon University | Met the DARPA target using beam search[13]
1987 | Sphinx-I | Carnegie Mellon University | Early speaker-independent continuous recognition[7]
1997 | Dragon NaturallySpeaking | Dragon Systems | First continuous dictation product for PCs[16]
2017 | Human-parity result | Microsoft Research | 5.1 percent WER on Switchboard[10]
2022 | Whisper | OpenAI | Open-weight transformer ASR trained on 680,000 hours[9]

Use in virtual and augmented reality

Speech recognition gives VR and AR systems an input channel that does not require the user to look at a keyboard or free a hand. In a head-mounted display, on-screen typing relies on pointing at a virtual keyboard one key at a time, which is slow; dictation lets the user enter text by speaking instead. Voice also lets a user cut through nested menus with a single command rather than navigating step by step.[2]

A recurring design pattern for headsets is to pair voice with gaze. Because head or eye gaze already indicates which object the user is looking at, a short spoken command can act on that target without the user naming it. Microsoft documents this for HoloLens as a "see it, say it" model, in which the spoken label on a button is also its voice command, and it describes combining eye gaze with deictic phrases so a user can look at a hologram and say "put this" then look elsewhere and say "over here."[2] This descends from research on multimodal interfaces, notably Richard Bolt's 1980 Put-That-There system at the MIT Architecture Machine Group, which let a user move graphical objects on a large display by combining pointing gestures with spoken commands.[17]
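The "put this ... over here" interaction can be sketched as a small fusion step that matches each recognized phrase to whatever the user was looking at when it was spoken. This is an illustrative sketch only. It is not the HoloLens API, every name in it is hypothetical, and it assumes the gaze log is sorted by time and that phrases arrive with timestamps on the same clock.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class GazeSample:
    time: float   # seconds on a clock shared with the recognizer (assumed)
    target: str   # id of the object or location being looked at


def gaze_target_at(gaze_log, time):
    """Return the most recent gaze target at or before `time`, or None."""
    times = [sample.time for sample in gaze_log]
    position = bisect_right(times, time)
    if position == 0:
        return None  # the phrase was spoken before any gaze sample
    return gaze_log[position - 1].target


def resolve_move_command(gaze_log, phrases):
    """Turn timestamped phrases into an (object, destination) pair.

    "put this" selects the object under gaze; "over here" selects the
    destination. Returns None when either half is missing.
    """
    moved_object = destination = None
    for time, text in phrases:
        phrase = text.strip().lower()
        if phrase == "put this":
            moved_object = gaze_target_at(gaze_log, time)
        elif phrase == "over here":
            destination = gaze_target_at(gaze_log, time)
    if moved_object is None or destination is None:
        return None
    return moved_object, destination


if __name__ == "__main__":
    gaze_log = [GazeSample(0.0, "cube"), GazeSample(1.5, "table")]
    phrases = [(0.8, "put this"), (2.0, "over here")]
    print(resolve_move_command(gaze_log, phrases))  # ('cube', 'table')
```

The spoken phrases never name the object or the place; gaze supplies both, which is why the commands can stay short.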

Microsoft HoloLens

On HoloLens, voice is one of the primary input methods alongside gaze and hand gestures, and it runs on the same Windows speech engine used by other Universal Windows apps, in the device's display language.[2] The system command "select" acts like an air tap on whatever the gaze cursor is pointing at; it is handled by a low-power keyword detector so it can be spoken at any time with little battery cost.[2] Built-in HoloLens commands include "Go to Start," "What can I say?," "Take a picture," "Start recording," and brightness and volume controls, and the user can switch the holographic keyboard to a dictation mode by selecting its microphone button.[2] Microsoft's voice assistant Cortana was previously the route for conversational queries on the device. Microsoft retired the standalone Cortana app in Windows in spring 2023 and retired Cortana in several other products in fall 2023.[18] HoloLens 2 voice commands and dictation continue to function independently of Cortana, running on the device's Windows speech engine.[2] Microsoft's Mixed Reality Toolkit lets developers register their own keywords through a Speech Input Profile so any object can respond to a spoken word.[2]

Apple Vision Pro

Apple Vision Pro includes Voice Control, an accessibility feature that lets a user operate the headset entirely by speaking, including performing the system's gestures by voice. Apple states that the feature requires a Wi-Fi connection for a one-time download, after which it works without an internet connection.[3] When entering text, Voice Control distinguishes a default dictation mode, in which spoken words become text, from a spelling mode for entering words letter by letter and a command mode in which text entry is disabled so the system only acts on commands. Spoken commands such as "Tap," "Swipe up," "Open Control Center," "Go home," and "Take screenshot" map to gestures and actions, and the user can overlay item names, numbers, or a grid on screen to refer to elements by name.[3] Vision Pro also includes Apple's Siri assistant for voice queries and app control, with an option called Type to Siri for users who prefer to interact without speaking.[19]

Meta Quest

Meta Quest headsets provide Voice Dictation for entering text by speaking instead of typing on the virtual keyboard. Meta offers two modes: an online dictation mode, available to users in the United States, which sends audio to Meta servers for processing without recording or storing it, and an opt-in on-device dictation mode, available globally, which processes voice data on the headset and requires downloading a language model.[4] Separately, Meta provides the Voice SDK for developers, which adds voice interaction to Quest apps and is powered by the Wit.ai natural language understanding service. The SDK supports speech-to-text transcription and intent recognition, ships with more than 50 built-in intents, entities, and traits, and lets a developer integrate voice once for Quest and other platforms.[20]

Other uses

In enterprise and clinical AR, voice control keeps a worker's hands free for a task while they issue commands to a headset, for example a surgeon navigating an AR overlay without touching a non-sterile control.[12] Voice input is also studied as a way to disambiguate selection in dense 3D scenes: research combining a large language model with speech and pointing has examined selecting among multiple overlapping 3D objects in VR, and context-aware voice assistants such as the GazePointAR prototype use gaze and pointing to resolve pronouns like "this" and "that" in wearable AR.[21][22]

Limitations

Voice input has constraints that shape where it is used in VR and AR. Microsoft's design guidance notes that voice is poor for fine-grained continuous control, because a command like "make it a little louder" does not specify an amount, so scaling or moving holograms by voice is difficult.[2] Recognition can also misfire on unusual words, names, or abbreviations, and Microsoft advises that voice command actions be non-destructive and easy to undo in case someone speaking nearby triggers a command by accident.[2] Background noise degrades accuracy, which matters for headsets used in factories, operating rooms, or other loud settings.[12] Voice is also not always socially acceptable, since a user may be reluctant to talk to a device in a shared or quiet space or to dictate confidential text aloud, and at least one platform restricts cloud-based dictation by region while offering on-device processing as a privacy-preserving alternative.[2][4]

References

  1. "What is Automatic Speech Recognition? A Comprehensive Overview of ASR Technology". https://www.assemblyai.com/blog/what-is-asr.
  2. "Voice input - Mixed Reality". 2026-01-06. https://learn.microsoft.com/en-us/windows/mixed-reality/design/voice-input.
  3. "Use Voice Control to interact with Apple Vision Pro". https://support.apple.com/guide/apple-vision-pro/perform-actions-with-your-voice-tan14d179ad1/visionos.
  4. "Learn about Voice Dictation on Meta Quest". https://www.meta.com/help/quest/463323051789865/.
  5. "Automatic Speech Recognition (ASR): How it works and key applications". https://thelevel.ai/blog/automatic-speech-recognition-asr.
  6. "What is ASR and how do speech recognition models work?". https://www.gladia.io/blog/how-do-speech-recognition-models-work.
  7. "Speech Recognition from Audrey to Alexa: A Brief History". https://dictateit.com/speech-recognition-from-audrey-to-alexa-a-brief-history/.
  8. "Whisper (speech recognition system)". https://en.wikipedia.org/wiki/Whisper_(speech_recognition_system).
  9. "Introducing Whisper". 2022-09-21. https://openai.com/index/whisper/.
  10. "Microsoft researchers achieve new conversational speech recognition milestone". 2017-08-20. https://www.microsoft.com/en-us/research/blog/microsoft-researchers-achieve-new-conversational-speech-recognition-milestone/.
  11. Xiong, W. (2017). "The Microsoft 2017 Conversational Speech Recognition System". Technical report MSR-TR-2017-39. https://arxiv.org/abs/1708.06073.
  12. "Augmented reality during parotid surgery: real-life evaluation of voice control of a head mounted display". 2023. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9988782/. Retrieved 2026-06-21.
  13. "Audrey, Alexa, HAL, and More". https://computerhistory.org/blog/audrey-alexa-hal-and-more/.
  14. "IBM Shoebox". https://en.wikipedia.org/wiki/IBM_Shoebox.
  15. "History of ASR Technologies". https://www.uslegalsupport.com/blog/asr-history/.
  16. "Dragon NaturallySpeaking". https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking.
  17. Bolt, R.A. (1980). "Put-that-there: Voice and gesture at the graphics interface". Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '80). pp. 262-270. https://dl.acm.org/doi/10.1145/800250.807503.
  18. "End of support for Cortana". https://support.microsoft.com/en-us/topic/end-of-support-for-cortana-d025b39f-ee5b-4836-a954-0ab646ee1efa.
  19. "Find out what Siri can do on Apple Vision Pro". https://support.apple.com/guide/apple-vision-pro/find-out-what-siri-can-do-tan462de531e/visionos.
  20. "Voice SDK Overview". https://developers.meta.com/horizon/documentation/unity/voice-sdk-overview/.
  21. "Large Language Model-assisted Speech and Pointing Benefits Multiple 3D Object Selection in Virtual Reality". 2024. https://arxiv.org/abs/2410.21091.
  22. "GazePointAR: A Context-Aware Multimodal Voice Assistant for Pronoun Disambiguation in Wearable Augmented Reality". Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 2024. https://arxiv.org/abs/2404.08213.