This document outlines a proposed standards framework for configurable voice interactions on the web, extending W3C efforts such as the Web Speech API [[WEB-SPEECH-API]]. The framework aims to standardize controls for voice-driven user interfaces (e.g., virtual assistants, podcasts, voice-enabled web applications), addressing gaps in timing, state management, context, user preferences, task handling, and error recovery.
Six API components are proposed: Temporal Control, Interaction State Machine, Context Management, Configuration Schema, Task Queue, and Recovery Protocol. Each builds upon or complements existing W3C standards to provide a coherent, interoperable platform for voice-driven web experiences.
These notes are part of a Voice AI white paper by Paola Di Maio. They were shared for discussion with Patric and Francois of the W3C WebDX CG (via email, 29 January 2026) to explore possible usefulness and relevance to the WebDX Community Group.
This is a discussion draft and does not represent consensus of the W3C AI Knowledge Representation Community Group or any other W3C body. Feedback is welcome.
Voice-driven web applications—from browser-based virtual assistants to interactive podcasts and AI-powered conversational interfaces—are proliferating rapidly. However, the current landscape suffers from significant fragmentation. Differences between implementations (e.g., Chrome vs. Safari Speech APIs) lead to poor user experience and increased developer burden.
The code snippets and API proposals in this document outline a hypothetical "Proposed Standards Framework" for configurable voice interactions on the web. They aim to standardize controls for voice-driven UIs, addressing gaps in timing, state management, context, user preferences, task handling, and error recovery.
Without standards, developers are forced to reinvent solutions, risking security (e.g., unhandled states) and privacy issues (e.g., context persistence). Building on W3C foundations ensures interoperability, boosts accessibility (e.g., for non-visual users), and accelerates adoption in domains such as e-commerce and education.
A W3C working group could prototype these proposals in 6–12 months, starting with polyfills. This document recommends prioritizing Temporal Control and Recovery Protocol first for maximum impact.
As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.
The Temporal Control API provides precise user control over spoken content, mimicking media player controls but applied to voice synthesis and AI-generated audio output. This enables users to perform actions such as skipping filler in AI responses or replaying important information.
This API builds on the existing Media Session API [[MEDIA-SESSION]] and extends it for voice-specific use cases.
[Exposed=Window]
interface VoicePlaybackController {
undefined pause();
undefined resume();
undefined rewind(double seconds);
undefined fastForward(double seconds);
undefined setSpeed(double rate);
double getCurrentTime();
};
pause()
Halts audio playback mid-response, preserving the current
position. The user agent MUST maintain the playback position
so that resume() can continue from the same point.
resume()
Resumes playback from the pause point. If pause()
has not been called, calling resume() SHOULD have
no effect.
rewind(seconds)
Moves the playback position backward by the specified number of seconds. If the resulting position would be negative, the position MUST be clamped to zero.
fastForward(seconds)
Skips forward by the given number of seconds. If the resulting position exceeds the total duration, the playback SHOULD complete or stop at the end.
setSpeed(rate)
Adjusts playback speed. The rate parameter accepts
values from 0.5 (half-speed) to 2.0
(double-speed). User agents MAY support wider ranges.
getCurrentTime()
Returns the elapsed time in seconds from the start of the current voice output segment.
This pattern closely mimics the HTML HTMLMediaElement interface, whose play() and pause() methods and currentTime and playbackRate properties enable interactive controls for web-based media. In custom voice wrappers (e.g., React players or Gradio demos), the interface is abstracted into a playback object for cleaner voice-command integration, such as "pause playback" or "rewind 10 seconds."
The SpeechSynthesis interface in the Web Speech API
[[WEB-SPEECH-API]] already provides pause(),
resume(), and cancel() methods, as well
as a rate property. This proposal extends that
capability with rewind, fast-forward, and precise time tracking.
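As a rough illustration, the controller could be polyfilled in script today. The sketch below tracks position and speed bookkeeping only; a real implementation would drive SpeechSynthesis or an audio pipeline. The duration constructor parameter and the getSpeed() convenience method are assumptions not present in the IDL above.

```typescript
// Sketch of a VoicePlaybackController polyfill (bookkeeping only).
class VoicePlaybackController {
  private position = 0; // seconds into the current voice segment
  private rate = 1.0;
  private paused = false;

  // duration is an assumption: total length of the current segment, in seconds
  constructor(private readonly duration: number) {}

  pause(): void {
    this.paused = true; // position is preserved for resume()
  }

  resume(): void {
    if (!this.paused) return; // SHOULD have no effect if not paused
    this.paused = false;
  }

  rewind(seconds: number): void {
    // MUST clamp to zero if the result would be negative
    this.position = Math.max(0, this.position - seconds);
  }

  fastForward(seconds: number): void {
    // SHOULD stop at the end if the result exceeds the duration
    this.position = Math.min(this.duration, this.position + seconds);
  }

  setSpeed(rate: number): void {
    // 0.5–2.0 per the proposal; user agents MAY support wider ranges
    if (rate < 0.5 || rate > 2.0) {
      throw new RangeError("rate must be between 0.5 and 2.0");
    }
    this.rate = rate;
  }

  getSpeed(): number {
    // convenience accessor beyond the proposed IDL
    return this.rate;
  }

  getCurrentTime(): number {
    return this.position;
  }
}
```

A voice command such as "rewind 10 seconds" would then map directly onto `player.rewind(10)`.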
The Interaction State Machine defines the application states during voice exchanges, with user-triggered transitions. This ensures predictable UI feedback (e.g., visual indicators) and prevents confusion in real-time voice applications.
This API partially builds on the SpeechRecognition
interface in [[WEB-SPEECH-API]].
enum VoiceInteractionState {
"listening", // User is speaking
"thinking", // User is formulating
"responding", // System is speaking
"processing" // Background work in progress
};
| State | Description | Typical UI Indicator |
|---|---|---|
| listening | The user is actively speaking; the system is capturing input. | Pulsing microphone icon |
| thinking | The user is formulating a response; the system waits. | Dimmed or paused indicator |
| responding | The system is speaking or producing voice output. | Speaker/waveform animation |
| processing | Background work is in progress (e.g., API call, computation). | Loading spinner |
[Exposed=Window]
interface VoiceInteractionStateMachine : EventTarget {
readonly attribute VoiceInteractionState currentState;
undefined transitionTo(VoiceInteractionState newState);
attribute EventHandler onstatechange;
};
The Context Management API handles conversation continuity across sessions or interruptions. This is vital for complex tasks such as booking travel, multi-step workflows, or long-running voice sessions.
No existing W3C standard directly addresses this need; it represents a custom requirement for voice-driven web applications.
[Exposed=Window]
interface VoiceContext {
undefined bookmark(DOMString label);
undefined switch(VoiceContext newContext);
undefined restore(DOMString bookmarkLabel);
DOMString getSummary();
Promise<undefined> persist();
};
bookmark(label)
Saves a state snapshot of the current conversation context, identified by the given label. User agents MUST ensure that bookmarks are retrievable by label.
switch(newContext)
Loads a different VoiceContext, allowing the user
to transition between conversation threads or topics.
restore(bookmarkLabel)
Returns to a previously saved state identified by bookmarkLabel.
getSummary()
Generates and returns a textual summary or recap of the current conversation context.
persist()
Saves the current context to persistent storage (e.g.,
localStorage, IndexedDB, or a server-side store).
This method returns a Promise that resolves when the persist
operation is complete.
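A minimal in-memory sketch of the bookmark/restore/persist behavior described above. The Turn shape, the addTurn() helper, the naive summary, and the static Map standing in for durable storage are all illustrative assumptions, not part of the proposal.

```typescript
interface Turn {
  speaker: "user" | "system";
  text: string;
}

// Sketch of the proposed VoiceContext. A Map stands in for
// localStorage/IndexedDB/server-side persistence.
class VoiceContext {
  private turns: Turn[] = [];
  private bookmarks = new Map<string, Turn[]>();
  private static storage = new Map<string, string>(); // durable-storage stand-in

  constructor(private readonly id: string) {}

  // assumption: conversation turns are appended as they occur
  addTurn(turn: Turn): void {
    this.turns.push(turn);
  }

  bookmark(label: string): void {
    // snapshot the current state; MUST be retrievable by label
    this.bookmarks.set(label, [...this.turns]);
  }

  restore(label: string): void {
    const snapshot = this.bookmarks.get(label);
    if (!snapshot) throw new Error(`unknown bookmark: ${label}`);
    this.turns = [...snapshot];
  }

  getSummary(): string {
    // naive recap; a real agent might summarize with a language model
    const last = this.turns[this.turns.length - 1];
    return last
      ? `${this.turns.length} turn(s); last (${last.speaker}): "${last.text}"`
      : "empty context";
  }

  async persist(): Promise<void> {
    VoiceContext.storage.set(this.id, JSON.stringify(this.turns));
  }
}
```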
Privacy Consideration: Persistent context storage introduces privacy risks. Implementations SHOULD provide clear user controls for viewing, exporting, and deleting stored context data, consistent with privacy-by-design principles.
The Configuration Schema defines a JSON-based structure for user and application preferences that customize voice interaction behavior. This builds upon the Web App Manifest [[WEB-APP-MANIFEST]] pattern of declarative JSON configuration.
{
"pauseTimeout": 3000,
"interruptionStyle": "patient",
"thinkingMode": "explicit",
"speedPreference": 1.0,
"backgroundProcessing": true
}
| Property | Type | Default | Description |
|---|---|---|---|
| pauseTimeout | unsigned long | 3000 | Milliseconds before the system assumes the user has completed speaking. |
| interruptionStyle | DOMString | "patient" | Defines how the system handles interruptions: "patient" ignores minor interruptions; "eager" responds immediately to any input. |
| thinkingMode | DOMString | "explicit" | "explicit" shows a visible "I'm thinking…" indicator; "implicit" processes silently. |
| speedPreference | double | 1.0 | Default playback speed for voice output (0.5 to 2.0). |
| backgroundProcessing | boolean | true | Whether to allow background task execution during voice sessions. |
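Assuming user agents merge a partial user configuration over the documented defaults, the resolution step might look as follows. The resolveConfig() helper is hypothetical; the range check mirrors the 0.5–2.0 bound on speedPreference stated above.

```typescript
interface VoiceConfig {
  pauseTimeout: number;
  interruptionStyle: "patient" | "eager";
  thinkingMode: "explicit" | "implicit";
  speedPreference: number;
  backgroundProcessing: boolean;
}

// Defaults taken from the schema table above.
const DEFAULTS: VoiceConfig = {
  pauseTimeout: 3000,
  interruptionStyle: "patient",
  thinkingMode: "explicit",
  speedPreference: 1.0,
  backgroundProcessing: true,
};

// Hypothetical helper: fill in defaults, then validate ranges.
function resolveConfig(user: Partial<VoiceConfig>): VoiceConfig {
  const cfg: VoiceConfig = { ...DEFAULTS, ...user };
  if (cfg.speedPreference < 0.5 || cfg.speedPreference > 2.0) {
    throw new RangeError("speedPreference must be between 0.5 and 2.0");
  }
  return cfg;
}
```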
The Task Queue API manages asynchronous operations during voice sessions. It enables non-blocking task execution (e.g., sending an email while the voice assistant continues speaking), with status checks and cancellation support.
This API builds on existing patterns from Web Workers [[WEB-WORKERS]] and the JavaScript Promises model.
[Exposed=Window]
interface VoiceTaskQueue {
DOMString enqueue(VoiceTask task);
VoiceTaskStatus getStatus(DOMString taskId);
undefined onComplete(DOMString taskId, VoiceTaskCallback callback);
undefined cancel(DOMString taskId);
};
callback VoiceTaskCallback = undefined (VoiceTaskResult result);
enum VoiceTaskStatus {
"queued",
"running",
"completed",
"failed",
"cancelled"
};
enqueue(task)
Adds an asynchronous task to the queue. Returns a unique taskId string that can be used for status queries and cancellation.
getStatus(taskId)
Returns the current status of the task identified by taskId.
onComplete(taskId, callback)
Registers a callback to be invoked when the specified task completes (whether successfully or with an error).
cancel(taskId)
Requests cancellation of the specified task. If the task has already completed, this method SHOULD have no effect.
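The status lifecycle above can be sketched as follows. To keep the example self-contained and deterministic, tasks here are plain synchronous functions and drain() is a hypothetical helper that runs queued tasks in FIFO order; the proposed API would instead accept Promise-based VoiceTask objects and run them in the background.

```typescript
type VoiceTaskStatus = "queued" | "running" | "completed" | "failed" | "cancelled";

interface QueuedTask {
  run: () => unknown; // simplification: synchronous task body
  status: VoiceTaskStatus;
  onComplete: Array<(status: VoiceTaskStatus) => void>;
}

// Sketch of the proposed VoiceTaskQueue with a synchronous drain() helper.
class VoiceTaskQueue {
  private tasks = new Map<string, QueuedTask>();
  private nextId = 0;

  enqueue(run: () => unknown): string {
    const taskId = `task-${this.nextId++}`; // unique id for status/cancel
    this.tasks.set(taskId, { run, status: "queued", onComplete: [] });
    return taskId;
  }

  getStatus(taskId: string): VoiceTaskStatus {
    const task = this.tasks.get(taskId);
    if (!task) throw new Error(`unknown task: ${taskId}`);
    return task.status;
  }

  onComplete(taskId: string, cb: (status: VoiceTaskStatus) => void): void {
    this.tasks.get(taskId)?.onComplete.push(cb);
  }

  cancel(taskId: string): void {
    const task = this.tasks.get(taskId);
    // SHOULD have no effect once a task has run
    if (task && task.status === "queued") task.status = "cancelled";
  }

  // Hypothetical helper: run all still-queued tasks in FIFO order.
  drain(): void {
    for (const task of this.tasks.values()) {
      if (task.status !== "queued") continue;
      task.status = "running";
      try {
        task.run();
        task.status = "completed";
      } catch {
        task.status = "failed";
      }
      task.onComplete.forEach((cb) => cb(task.status));
    }
  }
}
```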
The Recovery Protocol provides resilience for voice sessions, restoring state after crashes, disconnections, or other failures. It builds partially on Service Workers [[SERVICE-WORKERS]] patterns for offline and recovery capabilities.
[Exposed=Window]
interface VoiceSessionRecovery {
DOMString summarize();
Promise<undefined> saveState();
Promise<undefined> restoreState();
sequence<VoiceHistoryEntry> getHistory();
};
summarize()
Returns a human-readable summary of the current session state, suitable for resuming a conversation after interruption.
saveState()
Persists the current session state to durable storage. Implementations SHOULD save state automatically at regular intervals and on significant state transitions.
restoreState()
Restores the most recently saved session state, allowing seamless resumption after a failure.
getHistory()
Returns an ordered sequence of history entries for the current session, enabling review and navigation of past interactions.
Storage Consideration: Implementations SHOULD implement a retention policy to prevent unbounded storage growth. Expired or low-priority history entries MAY be pruned automatically.
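A minimal in-memory sketch of save/restore with the retention policy noted above. The record() helper, the VoiceHistoryEntry shape, the MAX_HISTORY bound, and the Map standing in for durable storage are all illustrative assumptions.

```typescript
interface VoiceHistoryEntry {
  timestamp: number;
  text: string;
}

// Sketch of the proposed VoiceSessionRecovery. A Map stands in for
// durable storage (e.g., IndexedDB or a Service Worker cache).
class VoiceSessionRecovery {
  private static storage = new Map<string, string>();
  private static readonly MAX_HISTORY = 100; // retention-policy assumption
  private history: VoiceHistoryEntry[] = [];

  constructor(private readonly sessionId: string) {}

  // assumption: each utterance is recorded as it happens
  record(text: string): void {
    this.history.push({ timestamp: Date.now(), text });
    if (this.history.length > VoiceSessionRecovery.MAX_HISTORY) {
      this.history.shift(); // prune oldest entries to bound storage growth
    }
  }

  summarize(): string {
    const last = this.history[this.history.length - 1];
    return last
      ? `${this.history.length} entries; last: "${last.text}"`
      : "empty session";
  }

  async saveState(): Promise<void> {
    VoiceSessionRecovery.storage.set(this.sessionId, JSON.stringify(this.history));
  }

  async restoreState(): Promise<void> {
    const saved = VoiceSessionRecovery.storage.get(this.sessionId);
    if (saved) this.history = JSON.parse(saved);
  }

  getHistory(): VoiceHistoryEntry[] {
    return [...this.history]; // defensive copy of the ordered history
  }
}
```

After a crash, a new VoiceSessionRecovery with the same sessionId could call restoreState() and then summarize() to offer the user a recap.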
The following table summarizes each proposed API component in terms of its relationship to existing standards, key benefits, potential drawbacks, and assessed usefulness.
| API / Feature | Builds on Existing Standards? | Key Benefits | Potential Drawbacks | Usefulness (1–5) |
|---|---|---|---|---|
| Temporal Control | Yes (Media Session API) | Precise user control; accessibility | Browser consistency challenges | 5 – Essential for UX |
| Interaction State Machine | Partial (SpeechRecognition) | Clear feedback; reduces errors | Overkill for simple apps | 4 – Highly valuable |
| Context Management | No (Custom need) | Long-session support; multi-device | Privacy risks with persistence | 5 – Game-changer |
| Configuration Schema | Yes (Web App Manifest) | Personalization; cross-app harmony | Vendor lock-in if inflexible | 4 – User-centric |
| Task Queue | Yes (Web Workers, Promises) | Non-blocking ops; reliability | Complexity in async error handling | 4 – Practical |
| Recovery Protocol | Partial (Service Workers) | Crash-proof sessions; offline OK | Storage bloat over time | 5 – Critical for production |
The following existing web standards provide the foundation upon which this framework is proposed:
The HTML5 <audio> and <video>
elements support pause(), play(),
currentTime, and playbackRate natively,
providing the foundational playback control model.
The SpeechSynthesis interface [[WEB-SPEECH-API]] handles text-to-speech playback with pause(), resume(), cancel(), and rate adjustment. Satisfying WCAG 2.2 [[WCAG22]] accessibility requirements for full controls (e.g., pause/stop/volume) requires building a custom UI on top of this interface.
The Media Session API [[MEDIA-SESSION]] enables voice/OS integration for play/pause/seek via system notifications, making it ideal for voice assistants and platform-level media controls.
The proposed playback controls represent a de facto standard in voice-enabled AI due to browser consistency and WCAG compliance. These patterns are observed in implementations such as Chrome's "Listen to this page" feature (play/pause/rewind/fast-forward) and Google Assistant media responses (play/pause/stop/start over).
For custom voice AI applications (e.g., Gradio/LLM demos), these
controls can be wrapped in a playback object as
proposed in this specification. While no formal "voice AI standard"
exists beyond the foundational APIs, the patterns described here
are interoperable across platforms including OpenAI Voice and
Web Speech implementations.
The recommended approach for advancing this framework is for a W3C working group to prototype it over 6–12 months, beginning with polyfills. Prioritizing the Temporal Control API and Recovery Protocol is recommended for maximum initial impact.
All proposed APIs SHOULD be implemented with accessibility as a primary concern, consistent with WCAG 2.2 [[WCAG22]] requirements. Key considerations include:
Implementations of this framework MUST address the following security and privacy concerns:
The following external resources informed this proposal: