This document outlines a proposed standards framework for configurable voice interactions on the web, extending W3C efforts such as the Web Speech API [[WEB-SPEECH-API]]. The framework aims to standardize controls for voice-driven user interfaces (e.g., virtual assistants, podcasts, voice-enabled web applications), addressing gaps in timing, state management, context, user preferences, task handling, and error recovery.

Six API components are proposed: Temporal Control, Interaction State Machine, Context Management, Configuration Schema, Task Queue, and Recovery Protocol. Each builds upon or complements existing W3C standards to provide a coherent, interoperable platform for voice-driven web experiences.

This version of the document includes definitions of novel types and interfaces proposed for standardization. These types do not exist in any current W3C or WHATWG specification and represent new contributions requiring community review. A companion version of this document exists without these new type definitions for comparison purposes.

These notes are part of a Voice AI white paper by Paola Di Maio. They were shared for discussion with Patric and Francois of the W3C WebDX CG (via email, 29 January 2026) to explore possible usefulness and relevance to the WebDX Community Group.

This is a discussion draft and does not represent consensus of the W3C AI Knowledge Representation Community Group or any other W3C body. Feedback is welcome.

Introduction

Voice-driven web applications—from browser-based virtual assistants to interactive podcasts and AI-powered conversational interfaces—are proliferating rapidly. However, the current landscape suffers from significant fragmentation. Differences between implementations (e.g., Chrome vs. Safari Speech APIs) lead to poor user experience and increased developer burden.

The code snippets and API proposals in this document outline a hypothetical "Proposed Standards Framework" for configurable voice interactions on the web. They aim to standardize controls for voice-driven UIs, addressing gaps in timing, state management, context, user preferences, task handling, and error recovery.

Without standards, developers are forced to reinvent solutions, risking security (e.g., unhandled states) and privacy issues (e.g., context persistence). Building on W3C foundations ensures interoperability, boosts accessibility (e.g., for non-visual users), and accelerates adoption in domains such as e-commerce and education.

A W3C working group could prototype these proposals in 6–12 months, starting with polyfills. This document recommends prioritizing Temporal Control and Recovery Protocol first for maximum impact.

As well as sections marked as non-normative, all authoring guidelines, diagrams, examples, and notes in this specification are non-normative. Everything else in this specification is normative.

Proposed New Types

This specification introduces the following novel types that do not currently exist in any published W3C, WHATWG, or other web standards document. These types are proposed as new additions to the web platform to support configurable voice interactions. Each requires community review, discussion, and formal standardization.

Proposed Type | Kind | Defined In | Purpose | Nearest Existing Standard
VoicePlaybackController | Interface | § Temporal Control API | Unified playback control for voice synthesis output | Media Session API, HTMLMediaElement
VoiceInteractionState | Enum | § Interaction State Machine | Defines the possible states of a voice interaction session | SpeechRecognition (partial)
VoiceInteractionStateMachine | Interface | § Interaction State Machine | Manages state transitions and emits change events | SpeechRecognition (partial)
VoiceContext | Interface | § Context Management API | Conversation continuity across sessions and interruptions | None (novel requirement)
VoiceTask | Interface | § Task Queue API | Represents an asynchronous task within a voice session | Web Workers, Promises
VoiceTaskResult | Interface | § Task Queue API | Encapsulates the outcome of a completed voice task | Web Workers, Promises
VoiceTaskQueue | Interface | § Task Queue API | Manages queuing, status, and cancellation of async tasks | Web Workers, Promises
VoiceTaskStatus | Enum | § Task Queue API | Status values for queued tasks (queued, running, completed, failed, cancelled) | Web Workers
VoiceTaskCallback | Callback | § Task Queue API | Callback invoked on task completion | Promises
VoiceHistoryEntry | Interface | § Recovery Protocol | A single entry in the session interaction history | Service Workers (partial)
VoiceSessionRecovery | Interface | § Recovery Protocol | Session persistence, restoration, and history retrieval | Service Workers (partial)

Open Question: Should these types be proposed as extensions to existing specifications (e.g., Web Speech API, Media Session API) or as a standalone new specification? Community input is sought on the preferred standardization pathway.

A companion version of this document (voice-ai-standards-withoutnewtypes.html) presents the same API proposals without formal WebIDL definitions of these new types, instead referencing them as forward declarations. That version is intended for initial discussion where the focus is on the API design rather than type-level specifics.

Temporal Control API

The Temporal Control API provides precise user control over spoken content, mimicking media player controls but applied to voice synthesis and AI-generated audio output. This enables users to perform actions such as skipping filler in AI responses or replaying important information.

This API builds on the existing Media Session API [[MEDIA-SESSION]] and extends it for voice-specific use cases.

Interface Definition (New Type)

[Exposed=Window]
interface VoicePlaybackController {
  undefined pause();
  undefined resume();
  undefined rewind(double seconds);
  undefined fastForward(double seconds);
  undefined setSpeed(double rate);
  double getCurrentTime();
};
        

Methods

pause()

Halts audio playback mid-response, preserving the current position. The user agent MUST maintain the playback position so that resume() can continue from the same point.

resume()

Resumes playback from the pause point. If pause() has not been called, calling resume() SHOULD have no effect.

rewind(seconds)

Moves the playback position backward by the specified number of seconds. If the resulting position would be negative, the position MUST be clamped to zero.

fastForward(seconds)

Skips forward by the given number of seconds. If the resulting position exceeds the total duration, the playback SHOULD complete or stop at the end.

setSpeed(rate)

Adjusts playback speed. The rate parameter accepts values from 0.5 (half-speed) to 2.0 (double-speed). User agents MAY support wider ranges.

getCurrentTime()

Returns the elapsed time in seconds from the start of the current voice output segment.
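The clamping and pause/resume semantics above can be sketched as a self-contained playback model. This is an illustrative shim, not an implementation: the class name, the constructor's duration parameter, and the internal fields are assumptions; only the method names and their required behavior come from the proposal.

```javascript
// Illustrative shim of the proposed VoicePlaybackController semantics,
// modelled as plain state rather than real audio output.
class PlaybackControllerShim {
  constructor(duration) {
    this.duration = duration; // total length of the voice segment, in seconds (assumed input)
    this.position = 0;        // current playback position, in seconds
    this.rate = 1.0;          // playback speed multiplier
    this.paused = false;
  }
  pause() { this.paused = true; }   // MUST preserve the position for resume()
  resume() { this.paused = false; } // SHOULD be a no-op if not paused
  rewind(seconds) {
    // rewind() MUST clamp the resulting position to zero.
    this.position = Math.max(0, this.position - seconds);
  }
  fastForward(seconds) {
    // fastForward() SHOULD stop at the end of the segment.
    this.position = Math.min(this.duration, this.position + seconds);
  }
  setSpeed(rate) {
    // The proposed baseline range is 0.5 to 2.0; wider ranges are optional.
    if (rate < 0.5 || rate > 2.0) throw new RangeError("rate out of range");
    this.rate = rate;
  }
  getCurrentTime() { return this.position; }
}
```

A conforming user agent would bind this state to an actual synthesis or audio pipeline; the clamping rules are the normative part being demonstrated.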

Relationship to Existing Standards

This pattern closely mirrors the HTMLMediaElement interface, whose play() and pause() methods and currentTime and playbackRate properties enable interactive controls for web-based media. In custom voice wrappers (e.g., React players or Gradio demos), the interface is abstracted into a playback object for cleaner voice-command integration, such as "pause playback" or "rewind 10 seconds."

The SpeechSynthesis interface in the Web Speech API [[WEB-SPEECH-API]] already provides pause(), resume(), and cancel() methods, as well as a rate property. This proposal extends that capability with rewind, fast-forward, and precise time tracking.

Interaction State Machine

The Interaction State Machine defines the application states during voice exchanges, with user-triggered transitions. This ensures predictable UI feedback (e.g., visual indicators) and prevents confusion in real-time voice applications.

This API partially builds on the SpeechRecognition interface in [[WEB-SPEECH-API]].

Defined States (New Type)

enum VoiceInteractionState {
  "listening",
  "thinking",
  "responding",
  "processing"
};
        
State | Description | Typical UI Indicator
listening | The user is actively speaking; the system is capturing input. | Pulsing microphone icon
thinking | The user is formulating a response; the system waits. | Dimmed or paused indicator
responding | The system is speaking or producing voice output. | Speaker/waveform animation
processing | Background work is in progress (e.g., API call, computation). | Loading spinner

Interface Definition (New Type)

[Exposed=Window]
interface VoiceInteractionStateMachine : EventTarget {
  readonly attribute VoiceInteractionState currentState;
  undefined transitionTo(VoiceInteractionState newState);
  attribute EventHandler onstatechange;
};
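A minimal sketch of this state machine can be built on the platform EventTarget it inherits from. The shim name, the assumed initial state of "listening", and the properties attached to the emitted event are illustrative assumptions; the state names and the statechange event come from the proposal.

```javascript
// Illustrative shim for the proposed VoiceInteractionStateMachine.
const VALID_STATES = ["listening", "thinking", "responding", "processing"];

class StateMachineShim extends EventTarget {
  constructor() {
    super();
    this.currentState = "listening"; // assumed initial state
  }
  transitionTo(newState) {
    if (!VALID_STATES.includes(newState)) {
      throw new TypeError(`unknown state: ${newState}`);
    }
    const previous = this.currentState;
    this.currentState = newState;
    // Emit a statechange event so UI indicators (mic pulse, spinner,
    // waveform) can react to the transition.
    const ev = new Event("statechange");
    ev.previousState = previous; // illustrative event payload
    ev.newState = newState;
    this.dispatchEvent(ev);
  }
}
```

Dispatching through EventTarget lets both onstatechange-style handlers and addEventListener consumers observe transitions, which is the standard pattern for web platform interfaces.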
        

Context Management API

The Context Management API handles conversation continuity across sessions or interruptions. This is vital for complex tasks such as booking travel, multi-step workflows, or long-running voice sessions.

No existing W3C standard directly addresses this need; it represents a custom requirement for voice-driven web applications.

Interface Definition (New Type)

[Exposed=Window]
interface VoiceContext {
  undefined bookmark(DOMString label);
  undefined switch(VoiceContext newContext);
  undefined restore(DOMString bookmarkLabel);
  DOMString getSummary();
  Promise<undefined> persist();
};
        

Methods

bookmark(label)

Saves a state snapshot of the current conversation context, identified by the given label. User agents MUST ensure that bookmarks are retrievable by label.

switch(newContext)

Loads a different VoiceContext, allowing the user to transition between conversation threads or topics.

restore(bookmarkLabel)

Returns to a previously saved state identified by bookmarkLabel.

getSummary()

Generates and returns a textual summary or recap of the current conversation context.

persist()

Saves the current context to persistent storage (e.g., localStorage, IndexedDB, or a server-side store). This method returns a Promise that resolves when the persist operation is complete.
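The bookmark/restore/persist lifecycle described above can be sketched with an in-memory shim. Everything here beyond the method names is an assumption: the topic label, the addTurn() helper, and the Map-based bookmark store are illustrative; a real implementation would persist to IndexedDB or a server-side store.

```javascript
// Illustrative in-memory sketch of the proposed VoiceContext model.
class ContextShim {
  constructor(topic) {
    this.topic = topic;         // illustrative context label
    this.turns = [];            // conversation turns in this context
    this.bookmarks = new Map(); // label -> snapshot of turns
  }
  addTurn(role, text) { this.turns.push({ role, text }); } // illustrative helper
  bookmark(label) {
    // Snapshot the current conversation state under the given label;
    // bookmarks MUST be retrievable by label.
    this.bookmarks.set(label, this.turns.slice());
  }
  restore(label) {
    if (!this.bookmarks.has(label)) throw new Error(`no bookmark: ${label}`);
    this.turns = this.bookmarks.get(label).slice();
  }
  getSummary() {
    // A real implementation would generate a textual recap; this shim
    // returns a trivial summary string.
    return `${this.topic}: ${this.turns.length} turn(s)`;
  }
  async persist() {
    // Stand-in for writing to durable storage (e.g., IndexedDB).
    return JSON.stringify({ topic: this.topic, turns: this.turns });
  }
}
```

Snapshotting with slice() keeps bookmarks immutable against later turns, which is the property restore() depends on.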

Privacy Consideration: Persistent context storage introduces privacy risks. Implementations SHOULD provide clear user controls for viewing, exporting, and deleting stored context data, consistent with privacy-by-design principles.

Configuration Schema

The Configuration Schema defines a JSON-based structure for user and application preferences that customize voice interaction behavior. This builds upon the Web App Manifest [[WEB-APP-MANIFEST]] pattern of declarative JSON configuration.

Schema Definition

{
  "pauseTimeout": 3000,
  "interruptionStyle": "patient",
  "thinkingMode": "explicit",
  "speedPreference": 1.0,
  "backgroundProcessing": true
}
        

Properties

Property | Type | Default | Description
pauseTimeout | unsigned long | 3000 | Milliseconds before the system assumes the user has completed speaking.
interruptionStyle | DOMString | "patient" | Defines how the system handles interruptions. "patient" ignores minor interruptions; "eager" responds immediately to any input.
thinkingMode | DOMString | "explicit" | "explicit" shows a visible "I'm thinking…" indicator; "implicit" processes silently.
speedPreference | double | 1.0 | Default playback speed for voice output (0.5 to 2.0).
backgroundProcessing | boolean | true | Whether to allow background task execution during voice sessions.
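A consumer of this schema would typically merge a partial user configuration with the defaults above and validate the constrained fields. The helper below is an illustrative sketch (the function name resolveConfig is an assumption); the defaults and value ranges come from the schema table.

```javascript
// Defaults taken from the Configuration Schema table above.
const DEFAULT_CONFIG = {
  pauseTimeout: 3000,
  interruptionStyle: "patient",
  thinkingMode: "explicit",
  speedPreference: 1.0,
  backgroundProcessing: true,
};

// Illustrative helper: merge a partial user config with defaults and
// validate the enumerated and ranged properties.
function resolveConfig(userConfig = {}) {
  const config = { ...DEFAULT_CONFIG, ...userConfig };
  if (!["patient", "eager"].includes(config.interruptionStyle)) {
    throw new RangeError("interruptionStyle must be 'patient' or 'eager'");
  }
  if (!["explicit", "implicit"].includes(config.thinkingMode)) {
    throw new RangeError("thinkingMode must be 'explicit' or 'implicit'");
  }
  if (config.speedPreference < 0.5 || config.speedPreference > 2.0) {
    throw new RangeError("speedPreference must be between 0.5 and 2.0");
  }
  return config;
}
```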

Task Queue API

The Task Queue API manages asynchronous operations during voice sessions. It enables non-blocking task execution (e.g., sending an email while the voice assistant continues speaking), with status checks and cancellation support.

This API builds on existing patterns from Web Workers [[WEB-WORKERS]] and the JavaScript Promises model.

Interface Definitions (New Types)

[Exposed=Window]
interface VoiceTask {
  readonly attribute DOMString taskId;
  readonly attribute DOMString type;
  readonly attribute any payload;
};

[Exposed=Window]
interface VoiceTaskResult {
  readonly attribute DOMString taskId;
  readonly attribute VoiceTaskStatus status;
  readonly attribute any result;
  readonly attribute DOMString? error;
};

[Exposed=Window]
interface VoiceTaskQueue {
  DOMString enqueue(VoiceTask task);
  VoiceTaskStatus getStatus(DOMString taskId);
  undefined onComplete(DOMString taskId, VoiceTaskCallback callback);
  undefined cancel(DOMString taskId);
};

callback VoiceTaskCallback = undefined (VoiceTaskResult result);

enum VoiceTaskStatus {
  "queued",
  "running",
  "completed",
  "failed",
  "cancelled"
};
        

Methods

enqueue(task)

Adds an asynchronous task to the queue. Returns a unique taskId string that can be used for status queries and cancellation.

getStatus(taskId)

Returns the current status of the task identified by taskId.

onComplete(taskId, callback)

Registers a callback to be invoked when the specified task completes (whether successfully or with an error).

cancel(taskId)

Requests cancellation of the specified task. If the task has already completed, this method SHOULD have no effect.
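The queue lifecycle above (enqueue, status query, completion callback, cancellation) can be sketched with Promises standing in for real background work. This shim departs from the proposed IDL in one labeled way: enqueue() takes an illustrative executor function, where the proposed API would dispatch on the task's type internally.

```javascript
// Illustrative in-memory sketch of the proposed VoiceTaskQueue.
let nextId = 0;

class TaskQueueShim {
  constructor() {
    this.tasks = new Map(); // taskId -> { status, callbacks }
  }
  enqueue(task, run) {
    // `run` is an illustrative executor returning a Promise; the proposed
    // API would dispatch on task.type instead of taking a callback here.
    const taskId = `task-${nextId++}`;
    const entry = { status: "running", callbacks: [] };
    // This sketch starts tasks immediately; a real queue would hold them
    // in the "queued" state until a worker picks them up.
    this.tasks.set(taskId, entry);
    run(task).then(
      (result) => this._finish(taskId, { taskId, status: "completed", result, error: null }),
      (err) => this._finish(taskId, { taskId, status: "failed", result: null, error: String(err) }),
    );
    return taskId;
  }
  _finish(taskId, result) {
    const entry = this.tasks.get(taskId);
    if (entry.status === "cancelled") return; // cancellation wins
    entry.status = result.status;
    for (const cb of entry.callbacks) cb(result); // VoiceTaskCallback
  }
  getStatus(taskId) { return this.tasks.get(taskId).status; }
  onComplete(taskId, callback) { this.tasks.get(taskId).callbacks.push(callback); }
  cancel(taskId) {
    const entry = this.tasks.get(taskId);
    // cancel() SHOULD have no effect on already-completed tasks.
    if (entry.status === "queued" || entry.status === "running") {
      entry.status = "cancelled";
    }
  }
}
```

The cancellation check inside _finish() illustrates the main async subtlety: a task's underlying Promise may settle after cancellation, and its result must then be discarded.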

Recovery Protocol

The Recovery Protocol provides resilience for voice sessions, restoring state after crashes, disconnections, or other failures. It builds partially on Service Workers [[SERVICE-WORKERS]] patterns for offline and recovery capabilities.

Interface Definitions (New Types)

[Exposed=Window]
interface VoiceHistoryEntry {
  readonly attribute DOMString entryId;
  readonly attribute EpochTimeStamp timestamp;
  readonly attribute DOMString role;
  readonly attribute DOMString content;
  readonly attribute DOMString? state;
};

[Exposed=Window]
interface VoiceSessionRecovery {
  DOMString summarize();
  Promise<undefined> saveState();
  Promise<undefined> restoreState();
  sequence<VoiceHistoryEntry> getHistory();
};
        

Methods

summarize()

Returns a human-readable summary of the current session state, suitable for resuming a conversation after interruption.

saveState()

Persists the current session state to durable storage. Implementations SHOULD save state automatically at regular intervals and on significant state transitions.

restoreState()

Restores the most recently saved session state, allowing seamless resumption after a failure.

getHistory()

Returns an ordered sequence of history entries for the current session, enabling review and navigation of past interactions.
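The save/restore cycle can be sketched against a plain object standing in for durable storage. The record() helper, the entry ID scheme, and the shape of the summary string are illustrative assumptions; the history entry fields mirror the VoiceHistoryEntry interface above.

```javascript
// Illustrative sketch of the proposed VoiceSessionRecovery interface,
// backed by a plain object standing in for durable storage.
class SessionRecoveryShim {
  constructor(store = {}) {
    this.store = store; // stand-in for localStorage/IndexedDB/server store
    this.history = [];  // VoiceHistoryEntry-like records
  }
  record(role, content, state) {
    // Illustrative helper that appends a history entry.
    this.history.push({
      entryId: `e${this.history.length}`,
      timestamp: Date.now(),
      role,
      content,
      state: state ?? null,
    });
  }
  summarize() {
    const last = this.history[this.history.length - 1];
    return last
      ? `${this.history.length} entries, last from ${last.role}`
      : "empty session";
  }
  async saveState() {
    // Persist the full history; a real implementation SHOULD also do this
    // automatically on significant state transitions.
    this.store.session = JSON.stringify(this.history);
  }
  async restoreState() {
    this.history = this.store.session ? JSON.parse(this.store.session) : [];
  }
  getHistory() { return this.history.slice(); }
}
```

Round-tripping through JSON keeps the shim honest about what survives a crash: only serializable state can be recovered.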

Storage Consideration: Implementations SHOULD enforce a retention policy to prevent unbounded storage growth. Expired or low-priority history entries MAY be pruned automatically.

Comparison and Evaluation

The following table summarizes each proposed API component in terms of its relationship to existing standards, key benefits, potential drawbacks, and assessed usefulness.

API / Feature | Builds on Existing Standards? | Key Benefits | Potential Drawbacks | Usefulness (1–5)
Temporal Control | Yes (Media Session API) | Precise user control; accessibility | Browser consistency challenges | 5 – Essential for UX
Interaction State Machine | Partial (SpeechRecognition) | Clear feedback; reduces errors | Overkill for simple apps | 4 – Highly valuable
Context Management | No (custom need) | Long-session support; multi-device | Privacy risks with persistence | 5 – Game-changer
Configuration Schema | Yes (Web App Manifest) | Personalization; cross-app harmony | Vendor lock-in if inflexible | 4 – User-centric
Task Queue | Yes (Web Workers, Promises) | Non-blocking ops; reliability | Complexity in async error handling | 4 – Practical
Recovery Protocol | Partial (Service Workers) | Crash-proof sessions; offline OK | Storage bloat over time | 5 – Critical for production

Relevant Existing Web Standards

The following existing web standards provide the foundation upon which this framework is proposed:

HTML5 Media Elements

The HTML5 <audio> and <video> elements natively support the play() and pause() methods and the currentTime and playbackRate properties, providing the foundational playback control model.

Web Speech API

The SpeechSynthesis interface [[WEB-SPEECH-API]] handles text-to-speech playback with pause(), resume(), and cancel(), and supports rate adjustment via SpeechSynthesisUtterance.rate. Meeting the WCAG 2.2 [[WCAG22]] audio-control requirements (e.g., pause, stop, volume) still requires authors to build custom UI controls on top of it.

Media Session API

The Media Session API [[MEDIA-SESSION]] enables voice/OS integration for play/pause/seek via system notifications, making it ideal for voice assistants and platform-level media controls.

Voice AI Usage Patterns

The proposed playback controls represent a de facto standard in voice-enabled AI due to browser consistency and WCAG compliance. These patterns are observed in implementations such as Chrome's "Listen to this page" feature (play/pause/rewind/fast-forward) and Google Assistant media responses (play/pause/stop/start over).

For custom voice AI applications (e.g., Gradio/LLM demos), these controls can be wrapped in a playback object as proposed in this specification. While no formal "voice AI standard" exists beyond the foundational APIs, the patterns described here are interoperable across platforms including OpenAI Voice and Web Speech implementations.

Proposed Publishing Path

The recommended approach for advancing this framework is:

  1. Propose as a W3C Community Group specification or WHATWG extension to the Media Session / Web Speech standards.
  2. Develop a polyfill repository demonstrating voice commands integrated with ontology and knowledge representation tools.
  3. Gather implementation experience and community feedback.
  4. Refine and advance toward a formal Working Group specification if sufficient interest and support exist.

A W3C working group could prototype this in 6–12 months, beginning with polyfills. Prioritizing the Temporal Control API and Recovery Protocol is recommended for maximum initial impact.

Accessibility Considerations

All proposed APIs SHOULD be implemented with accessibility as a primary concern, consistent with WCAG 2.2 [[WCAG22]] requirements. Key considerations include keyboard and assistive-technology access to all playback and state controls; programmatic exposure of interaction state changes (e.g., via ARIA live regions) so that non-visual users receive the same feedback as visual indicators provide; and user control over speech rate, pausing, and volume.

Security and Privacy Considerations

Implementations of this framework MUST address the following security and privacy concerns: explicit user consent for microphone capture and voice processing; transparency and user control over persisted context and session history, including the ability to view, export, and delete stored data; retention limits on stored history to avoid indefinite accumulation; and safe handling of unexpected or unhandled states so that partial or stale conversation data is not exposed.

External References

The following external resources informed this proposal: