Audio‑Driven Adaptive Playback Speed for Web Video

18 March 2026 by Suraj Barman

Problem Statement

Video platforms play speech at a fixed rate, so viewers must choose between a slow pace and a rushed, hard‑to‑follow one. The extension described here aims to keep the perceived syllable tempo within a comfortable window, letting the underlying video run faster while preserving clarity. It does this by continuously measuring the speaker's natural syllable rate and adjusting playback speed in small increments.

The core challenge is to extract a reliable syllable cadence from a live audio stream without disrupting the main playback thread. The design must operate within the constraints of a browser extension, respect content‑security policies, and consume minimal CPU.

Audio Capture Strategy

The script locates the largest HTMLMediaElement on the page and invokes captureStream(). This yields a MediaStream that mirrors the audio output at the current playback speed. Because the stream is a clone, the original video continues uninterrupted, satisfying the requirement for non‑intrusive monitoring.
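A minimal sketch of the selection and capture step follows. It assumes "largest" means largest rendered area, and the helper names pickLargest and captureAudio are hypothetical. The small MediaLike interface is a stand‑in so the sketch does not depend on DOM typings, which may not declare captureStream() on media elements.

```typescript
// Structural stand-in for a media element (assumption: only these members are needed).
interface MediaLike {
  getBoundingClientRect(): { width: number; height: number };
  captureStream?: () => unknown;
}

// Hypothetical helper: pick the element with the largest rendered area.
function pickLargest<T extends MediaLike>(elements: T[]): T | undefined {
  let best: T | undefined;
  let bestArea = -1;
  for (const el of elements) {
    const rect = el.getBoundingClientRect();
    const area = rect.width * rect.height;
    if (area > bestArea) {
      best = el;
      bestArea = area;
    }
  }
  return best;
}

// Hypothetical helper: returns the captured stream, or undefined when the
// browser does not expose captureStream() on this element.
function captureAudio(el: MediaLike): unknown {
  return typeof el.captureStream === "function" ? el.captureStream() : undefined;
}
```

In a content script, the candidate list could come from Array.from(document.querySelectorAll("video, audio")). Audio elements usually have no rendered area, so a page with only audio would need a different ranking rule.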

Signal Conditioning Pipeline

Two BiquadFilterNode stages form a band‑pass that isolates the 300 to 3000 Hz region where vowel energy predominates. A subsequent AnalyserNode is polled for time‑domain data roughly 33 times per second (once every 30 ms), providing a balance between temporal resolution and processing load.
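The chain might be wired roughly as below. This is a sketch under assumptions: the band‑pass is split into a high‑pass at 300 Hz followed by a low‑pass at 3000 Hz, the fftSize of 1024 is illustrative, and buildChain and frameRms are hypothetical names. The *Like interfaces are structural stand‑ins for the Web Audio types, so the sketch type‑checks without DOM typings.

```typescript
// Structural stand-ins for the Web Audio members used here.
interface NodeLike { connect(destination: NodeLike): unknown; }
interface BiquadLike extends NodeLike {
  type: string;
  frequency: { value: number };
}
interface AnalyserLike extends NodeLike {
  fftSize: number;
  getFloatTimeDomainData(buffer: Float32Array): void;
}
interface AudioContextLike {
  createMediaStreamSource(stream: any): NodeLike;
  createBiquadFilter(): BiquadLike;
  createAnalyser(): AnalyserLike;
}

const LOW_CUT_HZ = 300;   // lower edge of the speech band (from the article)
const HIGH_CUT_HZ = 3000; // upper edge of the speech band (from the article)

// Hypothetical helper: source -> high-pass -> low-pass -> analyser.
function buildChain(ctx: AudioContextLike, stream: unknown): AnalyserLike {
  const source = ctx.createMediaStreamSource(stream);
  const highpass = ctx.createBiquadFilter();
  highpass.type = "highpass";
  highpass.frequency.value = LOW_CUT_HZ;
  const lowpass = ctx.createBiquadFilter();
  lowpass.type = "lowpass";
  lowpass.frequency.value = HIGH_CUT_HZ;
  const analyser = ctx.createAnalyser();
  analyser.fftSize = 1024; // assumed frame size, not from the article
  source.connect(highpass);
  highpass.connect(lowpass);
  lowpass.connect(analyser);
  // Assumption: the chain is measurement-only, so nothing is routed to the speakers.
  return analyser;
}

// Root-mean-square energy of one time-domain frame.
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) return 0;
  const sumSquares = frame.reduce((acc, s) => acc + s * s, 0);
  return Math.sqrt(sumSquares / frame.length);
}
```

In the browser, a call such as buildChain(new AudioContext(), stream) would return the analyser, and each poll would fill a Float32Array of length analyser.fftSize with getFloatTimeDomainData() before passing it to frameRms.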

Syllable Rate Extraction

Earlier attempts used peak detection on a smoothed RMS envelope, but continuous fast speech produced a flat high‑energy region that broke the algorithm. The current method treats the RMS envelope as a low‑frequency carrier, applies a first‑order high‑pass filter, and counts zero‑crossings that rise from negative to positive. Each crossing corresponds to a vowel nucleus, giving a direct estimate of syllable frequency.

A minimum spacing of 70 ms between crossings caps detection at about 14 syllables per second (1 / 0.07 s ≈ 14.3), which comfortably exceeds typical conversational rates. A four‑second sliding window smooths the raw crossing count into a stable measuredRate.
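The zero‑crossing approach can be sketched as follows. The 70 ms spacing and four‑second window come from the text. The high‑pass cutoff of 1.5 Hz and the class name SyllableRateEstimator are assumptions made for illustration, not the extension's actual values.

```typescript
const MIN_SPACING_MS = 70; // minimum gap between counted crossings (from the article)
const WINDOW_MS = 4000;    // sliding window length (from the article)
const HP_CUTOFF_HZ = 1.5;  // assumed cutoff for the first-order high-pass

class SyllableRateEstimator {
  private started = false;
  private lastTimeMs = 0;
  private prevIn = 0;
  private prevOut = 0;
  private lastCrossingMs = -Infinity;
  private crossings: number[] = [];

  // Feed one RMS sample taken at timeMs; returns crossings per second
  // averaged over the sliding window.
  push(rms: number, timeMs: number): number {
    if (!this.started) {
      this.started = true;
      this.prevIn = rms;
      this.lastTimeMs = timeMs;
      return 0;
    }
    const dt = (timeMs - this.lastTimeMs) / 1000;
    this.lastTimeMs = timeMs;

    // First-order high-pass: y[n] = a * (y[n-1] + x[n] - x[n-1]).
    const rc = 1 / (2 * Math.PI * HP_CUTOFF_HZ);
    const a = rc / (rc + dt);
    const out = a * (this.prevOut + rms - this.prevIn);

    // Count only negative-to-positive crossings that respect the spacing limit.
    const rising = this.prevOut < 0 && out >= 0;
    this.prevIn = rms;
    this.prevOut = out;
    if (rising && timeMs - this.lastCrossingMs >= MIN_SPACING_MS) {
      this.lastCrossingMs = timeMs;
      this.crossings.push(timeMs);
    }

    // Drop crossings that have fallen out of the window.
    this.crossings = this.crossings.filter((t) => t > timeMs - WINDOW_MS);
    return this.crossings.length / (WINDOW_MS / 1000);
  }
}
```

Because the high‑pass removes the slowly varying level of the envelope, a sustained high‑energy stretch still produces crossings as long as the envelope ripples around its local mean. A perfectly flat envelope yields none.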

Playback Speed Control Loop

The algorithm computes the speaker's natural rate by dividing measuredRate by the current playback speed, then derives the target speed that would map that natural rate onto a predefined comfortable syllable rate. The target is clamped to user‑specified bounds and blended with the existing speed using an exponential moving average (α = 0.25). This smoothing yields adjustments that are perceptible yet free of abrupt jumps.
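One step of this loop might look like the sketch below. The function name nextSpeed and the SpeedConfig shape are hypothetical, and the numbers in the usage note are illustrative. Only α = 0.25 comes from the text.

```typescript
interface SpeedConfig {
  comfortableRate: number; // target perceived syllables per second (user choice)
  minSpeed: number;        // lower playback bound
  maxSpeed: number;        // upper playback bound
  alpha: number;           // EMA blend factor, 0.25 in the article
}

// Hypothetical helper: compute the next playback speed from one measurement.
function nextSpeed(measuredRate: number, currentSpeed: number, cfg: SpeedConfig): number {
  // Without a usable measurement, leave the speed unchanged.
  if (measuredRate <= 0 || currentSpeed <= 0) return currentSpeed;

  // The captured audio already runs at currentSpeed, so undo that scaling.
  const naturalRate = measuredRate / currentSpeed;

  // Speed at which the natural rate would be heard as the comfortable rate.
  const target = cfg.comfortableRate / naturalRate;
  const clamped = Math.min(cfg.maxSpeed, Math.max(cfg.minSpeed, target));

  // Exponential moving average toward the clamped target.
  return currentSpeed + cfg.alpha * (clamped - currentSpeed);
}
```

For example, with a comfortable rate of 6 syllables per second, bounds of 1× to 2×, and a measured 4 syllables per second at 1×, the target is 1.5× and the blended result is 1.125×. The returned value would then be assigned to the media element's playbackRate.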

When prolonged silence is detected (energy below a threshold for three seconds), the loop gently returns the speed toward 1×, preventing unnecessary acceleration during gaps.
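A sketch of the silence handling follows. The three‑second hold comes from the text. The RMS threshold of 0.01, the return factor of 0.1, and the names SilenceGate and relaxTowardNormal are assumptions chosen for illustration.

```typescript
const SILENCE_RMS = 0.01;      // assumed energy threshold
const SILENCE_HOLD_MS = 3000;  // silence duration before relaxing (from the article)
const RETURN_ALPHA = 0.1;      // assumed blend factor for the gentle return

class SilenceGate {
  private quietSinceMs: number | null = null;

  // Returns true once energy has stayed below the threshold for the hold time.
  update(rms: number, timeMs: number): boolean {
    if (rms >= SILENCE_RMS) {
      this.quietSinceMs = null; // any sound resets the timer
      return false;
    }
    if (this.quietSinceMs === null) this.quietSinceMs = timeMs;
    return timeMs - this.quietSinceMs >= SILENCE_HOLD_MS;
  }
}

// Move the speed a fraction of the way back toward 1x on each poll.
function relaxTowardNormal(currentSpeed: number): number {
  return currentSpeed + RETURN_ALPHA * (1 - currentSpeed);
}
```

While the gate reports silence, the loop would call relaxTowardNormal on each poll instead of the usual speed update, so the speed decays toward 1× rather than snapping to it.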

Performance and Security Considerations

The use of AudioWorklet was abandoned because many video sites enforce a strict CSP that blocks blob URLs required for worklet scripts. By keeping all processing in the main thread and limiting the polling interval to 30 ms, the extension stays within acceptable CPU budgets while avoiding CSP conflicts.

All configurable constants (filter cutoffs, polling interval, smoothing factor) are exposed in a single source file, enabling rapid iteration without rebuilding the extension package.
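Such a file could look like the sketch below. The values are the ones quoted in this article. The object name CONFIG and its field names are hypothetical, and user‑facing settings such as the comfortable rate and speed bounds are left out because the article does not give them.

```typescript
// Hypothetical constants module; every value is taken from the article text.
const CONFIG = Object.freeze({
  lowCutHz: 300,         // band-pass lower edge
  highCutHz: 3000,       // band-pass upper edge
  pollIntervalMs: 30,    // analyser polling interval (about 33 polls per second)
  minCrossingGapMs: 70,  // caps detection near 14 syllables per second
  rateWindowMs: 4000,    // sliding window for measuredRate
  speedAlpha: 0.25,      // EMA factor for playback speed
  silenceHoldMs: 3000,   // silence duration before returning toward 1x
});
```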

Future Extensions

Potential improvements include adaptive filter bandwidth based on detected speaker pitch, integration with subtitle timing to cross‑validate syllable counts, and a fallback path that uses the Web Speech API for languages where vowel formants differ markedly.