---
title: Browser Speak
sdk: static
app_file: index.html
pinned: false
---

# Local Voice LLM

This is a static in-browser voice demo: microphone audio is captured locally, Silero VAD detects turn boundaries, SmolLM2 135M streams text, and Supertonic TTS speaks chunks as soon as punctuation or a length threshold makes them safe to synthesize. The first TTS chunk uses an aggressive short-clause threshold to reduce time-to-first-audio; later chunks are larger for smoother playback. The VAD silence slider defaults to 480 ms after testing showed 280 ms could split natural or synthesized pauses; lower values remain available for latency experiments. There is no server inference path in the app.

## Run

Serve the directory over localhost so module workers and microphone permissions work:

```bash
python3 -m http.server 5173
```

Open `http://localhost:5173`, click **Load models**, then **Start mic**. After a stack is loaded, the same button becomes **Unload models** so another STT/LLM candidate can be selected without losing benchmark rows.

A hosted static Space is available for manual testing at https://huggingface.co/spaces/Mike0021/browser-speak. The direct app host is https://mike0021-browser-speak.static.hf.space. To re-check the deployed Space from this workspace, run:

```bash
node tools/run-hosted-smoke.mjs
```

The hosted smoke runner checks the HF Space SHA, the static host's `x-repo-commit` header, desktop/mobile layout, the evidence export/restore path, and the client-side/no-server smoke path. It writes `/tmp/browser-speak-hosted-smoke.json`, with UI, evidence-export, and no-server artifacts linked from that summary. Set `BROWSER_SPEAK_HOSTED_SKIP_CLIENT_SIDE=true` for a quick deploy/header/UI check without loading the full model stack.
The hosted smoke runner records the exact Space SHA, Hub API card configuration, and static host commit in `/tmp/browser-speak-hosted-smoke.json`; the app also shows the served build in the Runtime panel and includes it in **Download JSON** exports. The latest May 28, 2026 hosted smoke in this workspace passed desktop/mobile UI checks and the monitored benchmark phase reported 0 network requests, 0 server-inference suspects, and 0 benchmark errors while running TTS, identity, and loopback rows. The Hub API parses the Space card as `sdk: static` with `app_file: index.html`; `hf spaces info` may still surface a non-blocking static-runtime `CONFIG_ERROR`, so the authoritative deploy checks are the static host `x-repo-commit` header plus hosted smoke results.

Current validation status from `/tmp/browser-speak-evidence-summary.json`: demo files, UI smoke, evidence-export smoke, client-side/no-server smoke, 3-run synthetic loopback, first-TTS-chunk safety, hosted Space no-server smoke, and artifact freshness pass. The remaining required external evidence is three real human microphone rows and at least one completed hardware WebGPU row from a browser that exposes a real GPU adapter instead of SwiftShader.

On a laptop or desktop with a real microphone, run this from the repo to collect hosted browser evidence in one pass:

```bash
node tools/run-hosted-evidence-capture.mjs
```

The helper opens the hosted Space in visible Chrome, loads the stack with `BROWSER_SPEAK_DEVICE=auto`, collects the 3-row real-mic series, enriches the artifact with hosted metadata from the browser rows, then runs `tools/audit-browser-evidence.mjs` and `tools/summarize-evidence.mjs`. On a browser that exposes hardware WebGPU, those `auto` rows also satisfy the hardware-WebGPU row requirement. Use `BROWSER_SPEAK_HOSTED_EVIDENCE_DRY_RUN=true node tools/run-hosted-evidence-capture.mjs` for a display/mic preflight without launching the full capture.
The first load downloads model files and the selected Supertonic voice embedding from Hugging Face/CDN into the browser cache. The app stages VAD/STT, LLM, and TTS worker loading instead of starting every large Transformers.js load at once; this lowers peak cache/network contention in headless and low-memory browsers without changing post-load conversation latency. The small Supertonic voice embeddings are fetched with a cache bypass and then kept in worker memory, which avoids stale HTTP-cache stalls seen in repeated headless runs. The remaining selectable voice embeddings are preloaded in the background and can also be preloaded explicitly by the benchmark harness. After that, inference runs in the tab with Transformers.js 4.2.0 and ONNX Runtime Web. The Runtime panel shows the browser's WebGPU adapter status before loading. **Auto** uses hardware WebGPU when it is exposed, falls back to WASM when WebGPU is missing, and also falls back to WASM for software adapters such as SwiftShader unless **WebGPU** is selected explicitly for an experiment. The Latency panel includes validation cards for the two external evidence gates: three real microphone rows and at least one hardware WebGPU row. The hardware card uses the same adapter classification as the Runtime panel and updates to ready once a completed benchmark row records real WebGPU execution. On a browser with a real adapter, **Run WebGPU evidence row** loads the default WebGPU evidence stack when needed and captures a fast identity row for the downloaded JSON. **Run evidence capture** is the one-button browser path: it records a hardware WebGPU row when possible, switches from an already-loaded non-WebGPU stack to the default WebGPU evidence stack while preserving existing benchmark rows, runs the 3-row real-mic series, then starts a JSON download and leaves **Download JSON** enabled for saving another copy. 
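
The **Auto** device rule described above can be sketched as a small pure function. This is a minimal illustration, not the app's actual code: the function names, the `"auto"`/`"webgpu"`/`"wasm"` preference strings, and the regular expression used to spot software adapters are assumptions. The `vendor`, `architecture`, `device`, and `description` fields are the ones defined on the standard `GPUAdapterInfo` object.

```javascript
// Hypothetical helper: treat an adapter as "software" when its info mentions
// a known CPU rasterizer. The pattern list is an assumption for illustration.
function isSoftwareAdapter(info) {
  if (!info) return false;
  const text = [info.vendor, info.architecture, info.device, info.description]
    .filter(Boolean)
    .join(" ")
    .toLowerCase();
  return /swiftshader|llvmpipe|software/.test(text);
}

// Hypothetical helper: map the UI preference plus adapter info to a backend.
// adapterInfo is null when the browser exposes no WebGPU adapter.
function chooseDevice(preference, adapterInfo) {
  if (preference === "wasm") return "wasm";
  // An explicit WebGPU choice is honored even on a software adapter,
  // which matches the "selected explicitly for an experiment" case.
  if (preference === "webgpu") return "webgpu";
  // Auto: no adapter, or a software adapter, falls back to WASM.
  if (!adapterInfo || isSoftwareAdapter(adapterInfo)) return "wasm";
  return "webgpu";
}
```

For example, `chooseDevice("auto", { description: "SwiftShader Device" })` yields `"wasm"`, while the same adapter with an explicit `"webgpu"` preference yields `"webgpu"`.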
## Chosen Stack

| Role | Default | Why |
| --- | --- | --- |
| VAD | `onnx-community/silero-vad` | Small ONNX model used in the Transformers.js Moonshine web demo; fast enough to run on 512-sample chunks and drive turn-taking. The app defaults to 480 ms of trailing silence, configurable from 200-800 ms. The fp32 ONNX file is about 2.2 MB. |
| STT | `onnx-community/moonshine-base-ONNX` | Moonshine Base is the default balanced STT after the current fake-mic series reached 3/3 exact rows, 0% median WER, 1.66 s median ASR, and 7.21 s median speech-end-to-audio. Moonshine Tiny remains selectable as the low-latency experiment, but it failed the current exact fake-mic gate by repeatedly hearing variants like "What happens this?" and timing out. Whisper Tiny English is selectable as the higher-accuracy fallback; it reached 3/3 and 0% WER in fake-mic validation, but its 4.54 s median ASR raised speech-end-to-audio to 10.94 s. The demo uses fp32 encoder + q4 merged decoder because the q8 WASM path failed on Transformers.js 4.2.0 in local verification. |
| LLM | `HuggingFaceTB/SmolLM2-135M-Instruct` | The fastest instruct model found for this stack. It is tagged for Transformers.js and includes ONNX q4/q4f16 files; q4 WASM is about 182 MB and q4f16 WebGPU is about 118 MB. In headless WASM it cut first-token latency by roughly 2-3x versus SmolLM2 360M. A tiny pinned identity example keeps the default 135M stack passing the LLM OK gate without switching to the much slower 360M. |
| TTS | `onnx-community/Supertonic-TTS-ONNX` | Transformers.js packaging of `Supertone/supertonic`, using local ONNX inference and Supertonic voice embeddings. The demo defaults to voice F2 and 2 inference steps because F2 is still faster in current full-stack fake-mic and default-suite runs; M2 was slightly faster in isolated TTS but slower in loopback and fake-mic full-stack validation, so it remains a candidate for real-mic/WebGPU validation rather than the default. The selected voice is loaded before the app reports ready, and the other voices are preloaded in the background. The repo is about 263 MB. |

`onnx-community/SmolLM2-360M-Instruct-ONNX` remains an optional higher-quality but slower LLM candidate. Granite 4.0 350M, `onnx-community/Qwen3-0.6B-ONNX`, and SmolLM2 1.7B are exposed as WebGPU-only candidates in the UI. Granite 4.0 350M is a current Transformers.js/browser package with WebGPU demo spaces and a smaller q4f16 footprint than Qwen3, but its q4 WASM external data is still about 576 MB, so it is guarded until it can be measured on real WebGPU. Qwen3 has a current Transformers.js model card and WebGPU q4f16 example, but its q4 WASM path was large enough to terminate the local headless browser during loading.

Relevant sources:

- Supertonic runs locally via ONNX and its current v3 family is 99M parameters with browser/WebGPU examples: https://github.com/supertone-inc/supertonic
- The Transformers.js Supertonic ONNX package documents `pipeline('text-to-speech', 'onnx-community/Supertonic-TTS-ONNX')`: https://huggingface.co/onnx-community/Supertonic-TTS-ONNX
- Transformers.js enables WebGPU by setting `device: "webgpu"`: https://huggingface.co/docs/transformers.js/guides/webgpu
- SmolLM2 135M Instruct is tagged for Transformers.js and includes ONNX files: https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct
- Granite 4.0 350M ONNX Web is tagged for Transformers.js and has WebGPU demo spaces: https://huggingface.co/onnx-community/granite-4.0-350m-ONNX-web
- Qwen3 0.6B ONNX includes a Transformers.js WebGPU example: https://huggingface.co/onnx-community/Qwen3-0.6B-ONNX
- Moonshine Tiny Transformers.js model card: https://huggingface.co/onnx-community/moonshine-tiny-ONNX
- Moonshine paper: https://arxiv.org/abs/2410.15608

## Benchmark Plan

Use **Run benchmark suite** after loading models to execute the current stack's TTS, barge-in, identity, chat, and voice-loopback benchmarks in sequence.
Use **Benchmark real mic** for one real microphone row, **Run 3 real-mic series** to collect the three real-mic repetitions expected by the summary cards, or **Run evidence capture** to collect the hardware WebGPU row when available and then the full real-mic series. The real-mic paths prompt the user to say "What app is this?" so each row captures real-mic STT WER/CER, end-to-end latency, and the identity-answer LLM quality gate. The Input panel includes a live microphone input-level meter sourced from the same audio worklet chunks that feed VAD/STT, so manual testers can confirm that the browser is receiving audio before waiting on recognition. The Latency panel includes a **Real mic validation** card that shows the exact phrase, the current row during a series, and current-stack progress toward the 3-row target. **Run TTS benchmark** isolates Supertonic latency for the current voice and inference-step setting, **Run barge-in check** verifies that the speech-start interruption path cancels in-flight TTS before stale audio can play, **Run identity benchmark** checks the same strict "What app is this?" quality gate without the microphone, **Run chat benchmark** measures a normal non-identity text turn without the identity primer, and **Run voice loopback** synthesizes a local Supertonic utterance and feeds it through the same VAD/STT path when a test environment has no microphone. Each run is appended to the **Benchmarks** table with the selected stack, prompt/transcript, response, latency metrics, loopback or real-mic STT word error rate, and an identity-answer LLM quality gate for the "What app is this?" prompt; failed or timed-out runs are kept as rows with an `error` field so candidate comparisons do not hide unstable stacks. The summary cards focus on the currently loaded stack when models are loaded, otherwise all rows. 
**Copy JSON** exports `{ summary, results }`, where `summary` contains `all`, `current`, and `byStack` aggregate objects with TTS, identity, chat, real-mic, loopback, and barge-in medians for comparing candidate stacks. Microphone startup is capped at 15 seconds. If browser permissions, fake media devices, or audio worklet startup hang, the active mic benchmark is cancelled and the event log records the failure instead of leaving a silent pending run. Loopback audio is also followed by an explicit ASR flush after the synthesized prompt and trailing silence have been fed, so sticky VAD/STT candidates fail or finish as rows instead of waiting for the global benchmark timeout. Exported result rows include runtime metadata under `stack.environment`: page URL, secure-context status, browser user agent, CPU concurrency hint, device memory hint when exposed, WebGPU availability, WebGPU adapter info/features/software-adapter classification when exposed by the browser, microphone or loopback sample rate, and sanitized microphone track settings such as channel count, sample rate, latency, and audio-processing flags when the browser reports them. The benchmark stack also records voice, TTS steps, VAD silence, partial-ASR state, full-stack model load time as `modelLoadMs`, LLM worker acknowledgement latency as `llmStartMs`, LLM prompt build time and input size as `llmPromptBuildMs` / `llmPromptTokens`, first-response latency, and scheduled full-audio completion as `audioEndMs` / `speechEndToAudioEndMs`. Keep that metadata with WebGPU results so latency numbers can be compared across machines. The latency panel counts **speech end** from the detected acoustic end of speech, not from the later worker event that closes the turn after trailing silence. The separate **VAD close delay** metric shows how much silence was observed before the turn was finalized. 
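
The distinction between acoustic speech end and the later VAD close event can be made concrete with a small sketch. It assumes each turn carries four timestamps on one millisecond clock; the field names (`speechEndMs`, `vadCloseMs`, `transcriptMs`, `firstAudioMs`) and the function are hypothetical and are not taken from the app's source.

```javascript
// Hypothetical sketch: derive per-turn latency metrics from raw timestamps.
// Every interval that the UI labels "Speech end -> ..." is anchored at the
// acoustic end of speech, not at the moment the VAD closes the turn.
function turnLatency(marks) {
  const { speechEndMs, vadCloseMs, transcriptMs, firstAudioMs } = marks;
  return {
    // Trailing silence observed before the turn was finalized.
    vadCloseDelayMs: vadCloseMs - speechEndMs,
    speechEndToTranscriptMs: transcriptMs - speechEndMs,
    transcriptToFirstAudioMs: firstAudioMs - transcriptMs,
    speechEndToFirstAudioMs: firstAudioMs - speechEndMs,
  };
}

// Illustrative timestamps chosen to reproduce the default voice-loopback row
// reported under Current Local Measurements (480 ms, 1.80 s, 4.69 s, 6.49 s).
const example = turnLatency({
  speechEndMs: 1000,
  vadCloseMs: 1480,
  transcriptMs: 2800,
  firstAudioMs: 7490,
});
console.log(example);
```

Anchoring at speech end means a longer VAD silence setting shows up in **VAD close delay** and in the end-to-end figure, rather than being hidden by starting the clock late.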

| Metric | Captured in UI |
| --- | --- |
| STT WER | Mic or loopback transcript word error rate against the prompted phrase; exported with character error rate as `sttCer` |
| LLM OK | Pass/fail gate plus concept score for identity prompts; output must mention client/browser/local execution, speech recognition, an LLM, and TTS/Supertonic |
| VAD close delay | Trailing silence observed before the VAD closes the turn |
| Speech end -> transcript | Detected acoustic speech end to final Moonshine transcript; captured for mic and loopback runs |
| Transcript -> first token | Transcript dispatch to first streamed LLM token |
| Transcript -> TTS queued | Transcript dispatch to the first speakable LLM chunk being sent to Supertonic |
| First TTS synth | Worker-side Supertonic inference time for the first playable response chunk |
| Transcript -> first audio | Transcript dispatch to first playable Supertonic chunk |
| Speech end -> first audio | End-to-end turn latency from VAD speech end to first playable response audio |
| Transcript -> audio done | Scheduled end of the last playable Supertonic response chunk, useful for comparing short versus long candidate outputs |
| LLM decode | Streaming token rate from the worker |
| Barge-in check | Synthetic speech-start cancellation of an in-flight TTS chunk; exported with `bargeInMs` and `bargeInPass` |

Candidate comparisons to run:

| Component | Candidates |
| --- | --- |
| STT | Moonshine Base, Moonshine Tiny, Whisper Tiny English |
| LLM | SmolLM2 135M, SmolLM2 360M, Granite 4.0 350M, Qwen3 0.6B, SmolLM2 1.7B |
| TTS | Supertonic steps 2-8, voices F1/F2/M1/M2 |

For model comparisons, choose the candidate model before **Load models**, run **Run benchmark suite**, add at least three real microphone rows with **Run 3 real-mic series**, then click **Unload models** and choose the next candidate.
The benchmark table is preserved across unloads, shows prompt token/build metrics for LLM-backed rows, and **Copy JSON** can export the summary and all rows after the comparison set. For TTS voice/step comparisons, change the voice or step slider after loading and rerun **Run TTS benchmark** plus the text/loopback benchmarks; those values are captured per row. Use **Download JSON** after benchmark or real-mic runs on the hosted Space to save a timestamped evidence file with the same payload as **Copy JSON** and `window.browserSpeakBench.exportResults()`. The export includes `schemaVersion`, a unique `exportId`, `generatedAt`, `hostMetadata` with the hosted Space commit when the static host exposes it, the current `runtime`, a bounded `network` trace of model-worker fetches with benchmark-phase server-inference suspect counts, an `evidence` summary for real-mic/hardware-WebGPU/hosted-build readiness, an embedded `evidenceGuide` with the audit command and row requirements, aggregate `summary`, and raw `results`. Each benchmark row also records worker-network request counts and the recent per-row worker requests when any fetch occurs while the row is active. The **Run evidence capture** button starts a timestamped `browser-speak-evidence-*.json` download when the capture sequence finishes; if the browser blocks the automatic download, **Download JSON** saves the same payload manually. Benchmark rows are also autosaved in browser storage for the current host and Space commit, so a reload during manual evidence capture can restore rows from the same deployed build; **Clear** removes both the visible rows and the saved copy. Use `node tools/audit-browser-evidence.mjs path/to/browser-speak-benchmarks.json` to audit a JSON file downloaded from the hosted browser UI. 
This checks that the export came from the static Space revision, reports the page-generated `evidence` summary, verifies that at least three real-mic rows meet the WER/identity/latency gates from raw rows, and verifies that at least one completed row reports hardware WebGPU rather than a software adapter. Browser-downloaded JSON includes worker-side fetch telemetry, but it cannot prove whole-page no-server network behavior or Chrome launch flags by itself, so pair it with `node tools/run-hosted-smoke.mjs` for CDP network evidence or `node tools/run-real-mic-series.mjs` for harness-collected real-mic provenance. When served from localhost or the hosted Hugging Face Space, the page installs `window.browserSpeakBench` for browser automation. It exposes `loadStack(options)`, `setRuntimeOptions(options)`, `preloadVoice(options)`, `runSuite()`, `runMicSeries()`, `runEvidenceCapture()`, one-off runners such as `runTts()`, `runLoopback()`, and `runWebGpuEvidence()`, `exportResults()`, `downloadResults(options)`, `clearResults()`, `webgpuInfo()`, and `state()`. This lets a Playwright, Puppeteer, or DevTools Protocol harness drive the same UI-backed benchmark paths and collect the same JSON as **Copy JSON**. `setRuntimeOptions()` and `loadStack()` also accept `ttsChunking` so first-response chunk thresholds can be benchmarked without editing source; the exported stack records the active chunk profile. The helper `node tools/run-fake-mic-benchmark.mjs` synthesizes the scripted prompt with local Supertonic, relaunches Chrome with that WAV as `--use-file-for-fake-audio-capture`, and runs three scripted mic rows through getUserMedia, the audio worklet, VAD, STT, LLM, and TTS. Set `BROWSER_SPEAK_REUSE_FAKE_MIC_WAV=true` to reuse an existing fixture WAV and skip the synthesis/load pass during regression checks; row timeouts can be adjusted with `BROWSER_SPEAK_FAKE_MIC_TURN_TIMEOUT_MS` and `BROWSER_SPEAK_FAKE_MIC_ROW_TIMEOUT_MS`. 
The CDP harnesses use a 60-second protocol watchdog by default, configurable with `BROWSER_SPEAK_CDP_TIMEOUT_MS`, short 5-second status-poll watchdogs, configurable with `BROWSER_SPEAK_CDP_POLL_TIMEOUT_MS`, and a page-unresponsive watchdog, configurable with `BROWSER_SPEAK_PAGE_UNRESPONSIVE_TIMEOUT_MS`, so a slow or wedged browser target produces a controlled failure and cleanup instead of blocking indefinitely. A fake capture file is useful for regression testing the mic pipeline, but real microphone rows are still required before judging conversational quality. Use `node tools/run-ui-smoke.mjs` for a fast desktop/mobile UI check before or after model benchmarks. It loads the static page in Chrome, waits for the runtime/WebGPU status to settle, verifies the automation API and key controls, checks for body or visible element overflow outside the intentionally scrollable benchmark table, and writes `/tmp/browser-speak-ui-smoke.json` plus screenshots in `/tmp/browser-speak-ui-smoke`. Override with `BROWSER_SPEAK_UI_VIEWPORTS`, `BROWSER_SPEAK_UI_JSON`, `BROWSER_SPEAK_UI_SCREENSHOT_DIR`, `BROWSER_SPEAK_URL`, `BROWSER_SPEAK_HEADLESS=false`, `BROWSER_SPEAK_CHROME_ARGS`, or `CHROME_BIN`. Use `node tools/run-evidence-export-smoke.mjs` to verify the browser evidence-protection path without loading models. It seeds a same-build saved benchmark row, reloads the page, confirms the row restores, triggers `downloadResults({ prefix: "browser-speak-evidence" })`, verifies a JSON download request, then runs Clear and verifies the saved copy is removed. It writes `/tmp/browser-speak-evidence-export-smoke.json`; override with `BROWSER_SPEAK_EVIDENCE_EXPORT_JSON`, `BROWSER_SPEAK_EVIDENCE_EXPORT_PROFILE_DIR`, `BROWSER_SPEAK_URL`, `BROWSER_SPEAK_HEADLESS=false`, `BROWSER_SPEAK_CHROME_ARGS`, or `CHROME_BIN`. Use `node tools/run-client-side-smoke.mjs` to verify the no-server-inference claim with Chrome DevTools Protocol network events. 
The harness loads the default local stack, preloads every Supertonic voice embedding so background asset fetches have settled, clears any restored benchmark rows, clears the network recorder, then runs worker-backed TTS, identity, and voice-loopback benchmarks. This monitored phase covers TTS, LLM generation, synthesized speech through VAD/STT, and response TTS after all assets are loaded, and the exported rows are only the rows produced by that monitored run. It skips the blocking TTS warmup by default in automation because the benchmark rows still exercise real local Supertonic inference and this avoids a headless-only warmup stall; set `BROWSER_SPEAK_TTS_WARMUP=true` to match the interactive UI load path. It writes `/tmp/browser-speak-client-side-smoke.json` with CDP network evidence plus the page export metadata/host commit, and fails on POST-like traffic, known inference hosts, inference-shaped paths, unexpected hosts, missing benchmark rows, row-level benchmark errors during the benchmark phase, or an unresponsive page that stays inaccessible longer than the watchdog. Override with `BROWSER_SPEAK_CLIENT_SIDE_TASKS`, `BROWSER_SPEAK_CLIENT_SIDE_JSON`, `BROWSER_SPEAK_CLIENT_SIDE_PROFILE_DIR`, `BROWSER_SPEAK_CLIENT_SIDE_REUSE_PROFILE=true`, `BROWSER_SPEAK_LOAD_TIMEOUT_MS`, `BROWSER_SPEAK_TASK_TIMEOUT_MS`, `BROWSER_SPEAK_VOICE_PRELOAD_TIMEOUT_MS`, `BROWSER_SPEAK_PAGE_UNRESPONSIVE_TIMEOUT_MS`, `BROWSER_SPEAK_TTS_WARMUP=true`, `BROWSER_SPEAK_URL`, `BROWSER_SPEAK_HEADLESS=false`, `BROWSER_SPEAK_CHROME_ARGS`, or `CHROME_BIN`. Use `node tools/run-loopback-series.mjs` for a quick repeated synthetic-voice stability check of the local VAD/STT -> LLM -> TTS path after one model load. It writes `/tmp/browser-speak-loopback-series.json` plus raw UI export JSON at `/tmp/browser-speak-loopback-series-raw.json`, defaults to three loopback turns using the synthetic prompt "Identify this browser demo." 
at 1.00x prompt speed, and reports exact-transcript count, WER/CER medians, identity-pass count, and speech-end-to-audio medians. Override with `BROWSER_SPEAK_LOOPBACK_COUNT`, `BROWSER_SPEAK_LOOPBACK_SPEED`, `BROWSER_SPEAK_LOOPBACK_TEXT`, `BROWSER_SPEAK_LOOPBACK_JSON`, `BROWSER_SPEAK_LOOPBACK_RAW_JSON`, `BROWSER_SPEAK_LOAD_TIMEOUT_MS`, `BROWSER_SPEAK_TASK_TIMEOUT_MS`, `BROWSER_SPEAK_URL`, `BROWSER_SPEAK_HEADLESS=false`, `BROWSER_SPEAK_CHROME_ARGS`, `CHROME_BIN`, or the same `BROWSER_SPEAK_LLM` / `BROWSER_SPEAK_ASR` / `BROWSER_SPEAK_VOICE` / `BROWSER_SPEAK_TTS_STEPS` / `BROWSER_SPEAK_VAD_SILENCE_MS` stack overrides. Use `node tools/run-real-mic-series.mjs` on a machine with a real microphone to collect the human-speech validation rows that synthetic loopback cannot prove. It launches a visible Chrome session by default, loads the selected stack, prompts for `What app is this?` three times, keeps the microphone open across rows, then writes `/tmp/browser-speak-real-mic-series.json` with WER/CER, identity-pass count, mic input stats, and end-to-end speech-end-to-audio medians. The script uses Chrome's real input device; it only auto-accepts the media permission by default, and it rejects fake capture flags such as `--use-fake-device-for-media-stream` or `--use-file-for-fake-audio-capture` unless `BROWSER_SPEAK_REAL_MIC_ALLOW_FAKE_CAPTURE=true` is set for harness debugging. Use `BROWSER_SPEAK_REAL_MIC_DRY_RUN=true` to inspect config and Linux display preflight without launching Chrome; the dry run writes `/tmp/browser-speak-real-mic-series-dry-run.json` unless `BROWSER_SPEAK_REAL_MIC_JSON` is set. 
Override with `BROWSER_SPEAK_REAL_MIC_AUTO_ACCEPT=false` if you want to click the browser permission prompt manually, `BROWSER_SPEAK_REAL_MIC_COUNT`, `BROWSER_SPEAK_REAL_MIC_ROW_TIMEOUT_MS`, `BROWSER_SPEAK_REAL_MIC_REQUIRE_EXACT=true`, `BROWSER_SPEAK_REAL_MIC_JSON`, `BROWSER_SPEAK_REAL_MIC_PROFILE_DIR`, `BROWSER_SPEAK_REAL_MIC_REUSE_PROFILE=true`, `BROWSER_SPEAK_KEEP_BROWSER_OPEN=true`, `BROWSER_SPEAK_HEADLESS=true`, `BROWSER_SPEAK_TTS_WARMUP=false`, `BROWSER_SPEAK_URL`, `BROWSER_SPEAK_CHROME_ARGS`, `CHROME_BIN`, or the same stack override variables as the other harnesses. Use `node tools/run-local-candidate-benchmark.mjs` to run a configurable WASM candidate matrix through the same UI-backed benchmark tasks and write `/tmp/browser-speak-local-candidates.json`. The default matrix compares Moonshine Base, Moonshine Tiny, and Whisper Tiny English with the default SmolLM2 135M + Supertonic F2 stack using `tts,loopback` tasks. Override the cross-product matrix with `BROWSER_SPEAK_LOCAL_LLMS`, `BROWSER_SPEAK_LOCAL_ASRS`, `BROWSER_SPEAK_LOCAL_VOICES`, `BROWSER_SPEAK_LOCAL_TTS_STEPS`, `BROWSER_SPEAK_LOCAL_VAD_SILENCE_MS`, and `BROWSER_SPEAK_LOCAL_TASKS`, or pass exact stack objects in `BROWSER_SPEAK_LOCAL_STACKS`; stack objects may include `ttsChunking` with `firstTargetChars`, `targetChars`, `firstMinSpaceChars`, `minSpaceChars`, `firstSentenceMinChars`, `sentenceMinChars`, `firstClauseMinChars`, and `clauseMinChars`. Use `BROWSER_SPEAK_LOCAL_MAX_STACKS=1 BROWSER_SPEAK_LOCAL_TASKS=tts` for a quick smoke, `BROWSER_SPEAK_LOCAL_REUSE_PROFILE=true` to preserve a warmed browser profile between runs, `BROWSER_SPEAK_TTS_WARMUP=true` to include the interactive UI's blocking TTS warmup, `BROWSER_SPEAK_PAGE_UNRESPONSIVE_TIMEOUT_MS` to tune the renderer watchdog, and `BROWSER_SPEAK_LOCAL_DRY_RUN=true` to inspect the planned matrix without launching Chrome or loading models. 
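
The cross-product expansion used by the local candidate harness can be sketched as follows. This is an illustration under stated assumptions, not the harness's code: it assumes the list variables are comma-separated, the default model ids (`smollm2-135m`, `moonshine-base`, and so on) are placeholder strings, and the function names are hypothetical. Only the environment variable names come from the text above.

```javascript
// Hypothetical helper: split a comma-separated override, or use the fallback.
function parseList(value, fallback) {
  if (!value) return fallback;
  return value.split(",").map((item) => item.trim()).filter(Boolean);
}

// Hypothetical sketch: expand every combination of the configured axes,
// then cap the result with BROWSER_SPEAK_LOCAL_MAX_STACKS when it is set.
function buildMatrix(env = {}) {
  const axes = {
    llm: parseList(env.BROWSER_SPEAK_LOCAL_LLMS, ["smollm2-135m"]),
    asr: parseList(env.BROWSER_SPEAK_LOCAL_ASRS, ["moonshine-base", "moonshine-tiny", "whisper-tiny-en"]),
    voice: parseList(env.BROWSER_SPEAK_LOCAL_VOICES, ["F2"]),
    ttsSteps: parseList(env.BROWSER_SPEAK_LOCAL_TTS_STEPS, ["2"]).map(Number),
  };
  let stacks = [{}];
  for (const [key, values] of Object.entries(axes)) {
    // Each axis multiplies the number of stacks by its number of values.
    stacks = stacks.flatMap((stack) => values.map((value) => ({ ...stack, [key]: value })));
  }
  const max = Number(env.BROWSER_SPEAK_LOCAL_MAX_STACKS);
  return Number.isFinite(max) && max > 0 ? stacks.slice(0, max) : stacks;
}

console.log(buildMatrix({ BROWSER_SPEAK_LOCAL_MAX_STACKS: "1" }));
```

With the placeholder defaults the matrix has three stacks, one per STT candidate; adding two voices and two step counts grows it to twelve, which is why a dry run is worth doing before a long sweep.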
The per-candidate output promotes model load time plus TTS, identity, chat, and loopback medians so short smoke runs and larger sweeps are comparable. Use `node tools/run-tts-sweep-benchmark.mjs` for an efficient Supertonic voice/step sweep after one model load. It defaults to F1/F2/M1/M2 at 2, 4, and 8 inference steps, preloads each voice before timing that candidate, writes `/tmp/browser-speak-tts-sweep.json`, and records first-audio, synthesis, round-trip, playback-delay, audio-done, and model-load metrics per candidate. Override with `BROWSER_SPEAK_TTS_SWEEP_VOICES`, `BROWSER_SPEAK_TTS_SWEEP_STEPS`, and `BROWSER_SPEAK_TTS_SWEEP_DRY_RUN=true`. Use `node tools/run-webgpu-benchmark.mjs` on a real WebGPU browser to compare browser-ready LLM candidates with the same TTS, barge-in, identity, chat, and loopback suite. The script first calls `webgpuInfo()` and writes `/tmp/browser-speak-webgpu-results.json`; if no hardware adapter is exposed, it exits successfully with `{ "skipped": true, "reason": "WebGPU unavailable" }` or a software-adapter reason instead of trying to load WebGPU-only models. The script uses the same software-adapter classification surfaced in the Runtime panel. By default it tests SmolLM2 135M, Granite 4.0 350M, and Qwen3 0.6B on WebGPU with Moonshine Base, Supertonic F2, 2 TTS steps, and 480 ms VAD silence. Override the run with `BROWSER_SPEAK_WEBGPU_LLMS`, `BROWSER_SPEAK_ASR`, `BROWSER_SPEAK_URL`, `BROWSER_SPEAK_WEBGPU_JSON`, `BROWSER_SPEAK_HEADLESS=false`, `BROWSER_SPEAK_CHROME_ARGS`, `BROWSER_SPEAK_ALLOW_SOFTWARE_WEBGPU=true`, or `CHROME_BIN`. The output keeps `summary.byStack`, per-candidate status, adapter metadata, model load time, and all benchmark rows, so it can be compared directly with the UI export and fake-mic JSON. Use `node tools/audit-validation.mjs` as the final evidence gate. 
It reads the default JSON artifacts from `/tmp`, writes `/tmp/browser-speak-validation-audit.json`, and requires passing UI smoke, client-side/no-server smoke, 3-row loopback stability, first-TTS-chunk boundary safety, real human microphone rows with no fake-capture Chrome flags, and at least one completed hardware WebGPU candidate. The current container is expected to fail the last two gates because no real microphone artifact exists and Chrome exposes SwiftShader instead of a hardware adapter. Set `BROWSER_SPEAK_AUDIT_SOFT=true` to write the audit without a failing exit code, or set `BROWSER_SPEAK_AUDIT_REQUIRE_REAL_MIC=false` / `BROWSER_SPEAK_AUDIT_REQUIRE_HARDWARE_WEBGPU=false` only for local development checks. Use `node tools/run-final-validation.mjs` to rerun the full validation sequence and write `/tmp/browser-speak-final-validation.json`. By default it runs UI smoke, evidence-export smoke, client-side/no-server smoke, loopback stability, the WebGPU benchmark/probe, and the final audit; it skips real-microphone capture unless `BROWSER_SPEAK_FINAL_REAL_MIC=true` is set because that step needs a human speaker and real input device. Use `BROWSER_SPEAK_FINAL_REAL_MIC=dry-run` for a display/microphone preflight only, `BROWSER_SPEAK_FINAL_SOFT=true` to complete local runs while still recording audit failures for missing real-mic or hardware-WebGPU evidence, and `BROWSER_SPEAK_FINAL_SKIP_LOCAL=true` or the narrower `BROWSER_SPEAK_FINAL_SKIP_UI=true` / `BROWSER_SPEAK_FINAL_SKIP_EVIDENCE_EXPORT=true` / `BROWSER_SPEAK_FINAL_SKIP_CLIENT_SIDE=true` / `BROWSER_SPEAK_FINAL_SKIP_LOOPBACK=true` / `BROWSER_SPEAK_FINAL_SKIP_WEBGPU=true` switches when reusing fresh artifacts. The final JSON separates each step's `commandStatus` from `evidenceStatus`, so a soft audit can show `commandStatus: "pass"` while the overall `passed` field remains `false` until every required evidence gate is satisfied. 
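
The split between `commandStatus` and `evidenceStatus` can be illustrated with a short sketch. The step shape (`name`, `commandStatus`, `evidenceStatus`), the `"fail"` value, and the function name are assumptions for illustration; only the two field names, the `"pass"` value, and the overall `passed` flag appear in the description above.

```javascript
// Hypothetical sketch: a step can run cleanly (commandStatus "pass") and
// still leave an evidence gate unmet (evidenceStatus not "pass").
function summarizeValidation(steps) {
  const failedCommands = steps
    .filter((step) => step.commandStatus !== "pass")
    .map((step) => step.name);
  const failedEvidence = steps
    .filter((step) => step.evidenceStatus !== "pass")
    .map((step) => step.name);
  return {
    // Overall success needs every command to run and every gate to be met.
    passed: failedCommands.length === 0 && failedEvidence.length === 0,
    failedCommands,
    failedEvidence,
  };
}

// A soft audit: the command completed, but the evidence is still missing.
const report = summarizeValidation([
  { name: "ui-smoke", commandStatus: "pass", evidenceStatus: "pass" },
  { name: "audit", commandStatus: "pass", evidenceStatus: "fail" },
]);
console.log(report.passed, report.failedEvidence);
```

Keeping the two statuses apart lets a soft run finish and write its artifact while the summary still names the gate that blocks an overall pass.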
Use `node tools/summarize-evidence.mjs` after local, hosted, browser-downloaded, real-mic, or WebGPU runs to consolidate the current evidence into `/tmp/browser-speak-evidence-summary.json`. It reports artifact freshness against the current source fingerprint, hosted no-server status, browser-downloaded evidence status, local audit checks, and the next missing actions. Current candidate metadata gathered on May 26, 2026: | Role | Candidate | Browser-ready package | Approx selected asset size | | --- | --- | --- | --- | | STT | Moonshine Tiny | `onnx-community/moonshine-tiny-ONNX` | fp32 encoder 29 MB + q4 decoder 43 MB | | STT | Moonshine Base | `onnx-community/moonshine-base-ONNX` | fp32 encoder 77 MB + q4 decoder 69 MB | | STT | Whisper Tiny English | `onnx-community/whisper-tiny.en` | fp32 encoder 31 MB + q4 decoder 83 MB | | LLM | SmolLM2 135M | `HuggingFaceTB/SmolLM2-135M-Instruct` | q4f16 118 MB / q4 182 MB | | LLM | SmolLM2 360M | `onnx-community/SmolLM2-360M-Instruct-ONNX` | q4f16 260 MB / q4 369 MB | | LLM | Granite 4.0 350M | `onnx-community/granite-4.0-350m-ONNX-web` | q4f16 350 MB external data / q4 576 MB external data | | LLM | Qwen3 0.6B | `onnx-community/Qwen3-0.6B-ONNX` | q4f16 543 MB / q4 877 MB | | TTS | Supertonic | `onnx-community/Supertonic-TTS-ONNX` | repo 263 MB, selected voice loaded before ready, other voices lazy/background preloaded | ## Current Local Measurements Measured on this workspace with Chrome 145 headless, x86_64, 4 vCPU, Transformers.js 4.2.0, and no hardware WebGPU adapter. The WASM benchmark harness disables GPU; the dedicated WebGPU probe exposes SwiftShader as a software adapter and skips hardware candidate runs by default. The verification used the actual ASR/LLM/TTS workers. 
A CDP network smoke loaded the stack with automation TTS warmup disabled, preloaded every Supertonic voice embedding, then observed zero benchmark-phase network requests, zero server-inference suspects, zero benchmark row errors, and no missing task rows while running TTS, identity, and loopback benchmarks. Headless Chrome blocks autoplay for programmatic clicks, so the current first-audio benchmark records when the first synthesized Supertonic chunk is received from the worker, which is the earliest playable point; user-gesture playback should add only a small scheduling delay. These are WASM fallback numbers, so they are useful for correctness and worst-case shape, not for the intended WebGPU latency target.

| Test | STT | LLM | TTS steps | Speech end -> transcript | STT WER | LLM OK | VAD close | Transcript -> first token | Transcript -> TTS queued | First TTS synth | Transcript -> first audio | Speech end -> first audio | Decode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| One-load TTS sweep | N/A | N/A | Supertonic M2, 2 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 2.00 s | 2.00 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic F2, 2 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 2.10 s | 2.10 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic M1, 2 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 2.27 s | 2.27 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic M2, 4 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 2.32 s | 2.32 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic F1, 2 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 2.89 s | 2.90 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic M1, 4 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 3.15 s | 3.15 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic F2, 4 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 3.49 s | 3.49 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic F1, 4 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 3.58 s | 3.58 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic M1, 8 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 4.20 s | 4.20 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic M2, 8 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 4.27 s | 4.28 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic F1, 8 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 4.50 s | 4.50 s | N/A | N/A |
| One-load TTS sweep | N/A | N/A | Supertonic F2, 8 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | 5.00 s | 5.00 s | N/A | N/A |
| Current default TTS suite row | N/A | N/A | Supertonic F2, 2 steps | N/A | N/A | N/A | N/A | N/A | 1 ms | 1.56 s | 1.56 s | N/A | N/A |
| Current default chat suite row | N/A | SmolLM2 135M q4 WASM | Supertonic F2, 2 steps | N/A | N/A | N/A | N/A | 2.79 s | 3.35 s | 724 ms | 4.08 s | N/A | 1.6 tok/s |
| Current default identity suite row | N/A | SmolLM2 135M q4 WASM | Supertonic F2, 2 steps | N/A | N/A | pass 4/4 | N/A | 4.37 s | 4.37 s | 643 ms | 5.02 s | N/A | 1.3 tok/s |
| Current default barge-in suite row | N/A | N/A | Supertonic F2, 2 steps | N/A | N/A | N/A | N/A | N/A | 0 ms | cancelled | cancelled | N/A | N/A |
| Text benchmark | N/A | SmolLM2 360M q4 WASM | 2 | N/A | N/A | fail 3/4 | N/A | 9.56 s | 13.43 s | 1.31 s | 14.75 s | N/A | 0.5 tok/s |
| Current default voice loopback suite row | Moonshine Base fp32/q4 | SmolLM2 135M q4 WASM | Supertonic F2, 2 steps | 1.80 s | 0% | pass 4/4 | 480 ms | 4.14 s | 4.14 s | 545 ms | 4.69 s | 6.49 s | 1.5 tok/s |
| 3-run loopback stability series | Moonshine Base fp32/q4 | SmolLM2 135M q4 WASM | Supertonic F2, 2 steps | 1.92 s | 0% median, 3/3 exact | pass 3/3 | 480 ms | 5.13 s | 5.14 s | 610 ms | 5.70 s | 7.57 s | 1.4 tok/s |
| M2 loopback validation | Moonshine Base fp32/q4 | SmolLM2 135M q4 WASM | Supertonic M2, 2 steps | 2.33 s | 0% | pass 4/4 | 480 ms | 6.64 s | 8.48 s | 1.74 s | 10.23 s | 12.56 s | 0.9 tok/s |
| Fake microphone scripted series | Moonshine Base fp32/q4 | SmolLM2 135M q4 WASM | Supertonic F2, 2 steps | 1.66 s | 0% | pass 3/3 rows | 480 ms | 4.98 s | 4.98 s | 599 ms | 5.50 s | 7.21 s | 1.4 tok/s |
| Whisper Tiny fake microphone validation | Whisper Tiny English fp32/q4 | SmolLM2 135M q4 WASM | Supertonic F2, 2 steps | 4.54 s | 0% | pass 4/4 | 480 ms | 4.70 s | 5.80 s | 603 ms | 6.40 s | 10.94 s | 1.5 tok/s |
| M2 fake microphone validation | Moonshine Base fp32/q4 | SmolLM2 135M q4 WASM | Supertonic M2, 2 steps | 2.37 s | 0% | pass 4/4 | 480 ms | 6.64 s | 7.84 s | 1.43 s | 9.27 s | 11.64 s | 1.1 tok/s |
| Previous voice loopback | Moonshine Base fp32/q4 | SmolLM2 135M q4 WASM | Supertonic F1, 2 steps | 1.65 s | 25% | pass 4/4 | 480 ms | 3.74 s | 4.81 s | 773 ms | 5.58 s | 7.24 s | 1.6 tok/s |
| Voice loopback | Moonshine Tiny fp32/q4 | SmolLM2 135M q4 WASM | 2 | 1.08 s | 25% | pass 4/4 | 480 ms | 3.72 s | 4.84 s | 951 ms | 5.79 s | 6.88 s | 1.4 tok/s |
| Voice loopback | Whisper Tiny English fp32/q4 | SmolLM2 135M q4 WASM | 2 | 5.62 s | 0% | pass 4/4 | 480 ms | 4.63 s | 6.63 s | 902 ms | 7.53 s | 13.16 s | 1.3 tok/s |

The first-clause TTS threshold reduced the earlier SmolLM2 360M text benchmark's first-audio time from 45.84 s to 12.25 s in headless WASM by moving TTS queueing from 44.43 s after transcript to 11.01 s. Switching the default LLM to SmolLM2 135M, splitting generic and identity prompts, and using a word-boundary-safe first chunk reduced normal chat first audio to 4.34 s in the earlier default suite. The first response chunk now targets roughly 5 characters, keeps a minimum safe space boundary, and searches forward to the next word boundary before falling back to a hard character cut; this fixed the rejected early failure that produced "I'm read", keeps the current chat first chunk as "I'm ready", and lets identity turns start on the safe first-word chunk "This".
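
The first-chunk rule described above can be sketched as a small pure function. This is a minimal illustration, not the app's actual code: the function name `splitFirstChunk` and the option names `targetChars`, `minBoundary`, and `hardLimit` are hypothetical, and only the "roughly 5 characters" target comes from the text. The values 4 and 40 are assumptions chosen so the sketch reproduces the "I'm ready" and "This" outcomes.

```javascript
// Hypothetical defaults. Only the ~5 character target is stated in the README;
// minBoundary and hardLimit are assumptions for this sketch.
const FIRST_CHUNK_DEFAULTS = { targetChars: 5, minBoundary: 4, hardLimit: 40 };

// Returns { chunk, rest } when a safe first chunk is available,
// or null when the caller should wait for more streamed text.
function splitFirstChunk(buffer, isFinal = false, opts = {}) {
  const { targetChars, minBoundary, hardLimit } = { ...FIRST_CHUNK_DEFAULTS, ...opts };
  const text = buffer.trimStart();

  // Too little text so far: only emit if the stream has ended.
  if (text.length < targetChars) {
    return isFinal && text.length > 0 ? { chunk: text.trimEnd(), rest: "" } : null;
  }

  // Search forward from the minimum safe position for the next whitespace,
  // so the chunk always ends on a complete word.
  for (let i = minBoundary; i < text.length && i <= hardLimit; i++) {
    if (/\s/.test(text[i])) {
      return { chunk: text.slice(0, i), rest: text.slice(i + 1) };
    }
  }

  // No boundary found: flush at end of stream, otherwise fall back to a
  // hard character cut only once the buffer exceeds the hard limit.
  if (isFinal) return { chunk: text.trimEnd(), rest: "" };
  if (text.length > hardLimit) {
    return { chunk: text.slice(0, hardLimit), rest: text.slice(hardLimit) };
  }
  return null;
}
```

With these assumed defaults, a partial buffer such as `"I'm read"` returns `null` instead of being cut mid-word, while `"I'm ready to help."` yields the chunk `"I'm ready"` and `"This is a browser demo."` yields `"This"`.
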
Identity turns use a stricter system prompt plus tiny pinned examples, and repeated identity prompts now ignore prior identity history, keeping the fake-mic prompt at 187 input tokens across all three scripted rows instead of growing turn by turn. The identity-intent detector also tolerates narrow ASR near-misses such as "browse demo" or "browser dome" for "browser demo" and "identifies" for "identify"; the original transcript and WER are still preserved.

The latest fake-mic series measured 4.98 s median transcript-to-first-token and 7.21 s median speech-end-to-first-audio. A current-code 360M text run improved the identity answer from 2/4 to 3/4 concepts, but first audio was still 14.75 s and the answer omitted the LLM component, so 135M remains the default latency/quality choice. Qwen3 0.6B terminated the local headless browser during WASM model load before it reached benchmark-ready state, so the app now blocks Qwen3 and SmolLM2 1.7B on WASM fallback and requires WebGPU for those candidates.

The one-load TTS sweep confirms 2-step Supertonic is the right default for latency; 8 steps roughly doubled first-audio time, and M2 at 2 steps was the fastest isolated voice/step setting at 2.00 s versus F2 at 2.10 s. The app still defaults to F2 because the latest F2 fake-mic series reached 7.21 s median speech-end-to-audio, while M2's repeated fake-mic validation reached 0% WER but a slower 11.64 s speech-end-to-audio and 23.42 s speech-end-to-audio-done; M2 also measured slower in loopback, so real-mic and WebGPU validation are still needed before reconsidering the default. The TTS worker now loads the selected voice before reporting ready, then preloads the remaining voices in the background; the latest hosted client-side smoke measured a 1.65 s TTS first-audio row with all benchmark-phase network requests already settled.
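
The near-miss tolerance in the identity-intent detector, described earlier in this section, could look something like the sketch below. Everything here is illustrative: the function name `detectIdentityIntent`, the rewrite table, and the final matching rule (requiring both "identify" and "browser demo") are assumptions, not the app's real detector. The only details taken from the text are the three near-miss examples and the requirement that the original transcript is preserved.

```javascript
// Hypothetical rewrite table built from the near-miss examples in the text.
const NEAR_MISS_REWRITES = [
  [/\bbrowse demo\b/g, "browser demo"],
  [/\bbrowser dome\b/g, "browser demo"],
  [/\bidentifies\b/g, "identify"],
];

// Returns the untouched transcript alongside a normalized copy used only
// for intent matching, so WER can still be computed on the original.
function detectIdentityIntent(transcript) {
  let normalized = transcript
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ") // drop punctuation
    .replace(/\s+/g, " ")
    .trim();

  for (const [pattern, replacement] of NEAR_MISS_REWRITES) {
    normalized = normalized.replace(pattern, replacement);
  }

  // Assumed matching rule for this sketch only.
  const isIdentity = /\bidentify\b/.test(normalized) && /\bbrowser demo\b/.test(normalized);
  return { transcript, normalized, isIdentity };
}
```

Keeping the rewrite list short and explicit matches the "narrow" tolerance the text describes: a transcript like "Identifies this browser dome." still routes to the identity prompt, while unrelated requests are left alone.
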
The current default benchmark suite was verified with actual workers and produced TTS, barge-in, identity, chat, and loopback rows before re-enabling controls. The barge-in check verified that a synthetic speech-start event cancels in-flight TTS before stale audio plays and returns the TTS tile to Ready.

After repeated loopback checks, the synthetic loopback stability gate now uses the prompt "Identify this browser demo.", 1.00x prompt speed, and a short silence preroll; the latest targeted run completed 3/3 rows, reached 3/3 exact transcripts with 0% median WER, passed the identity answer gate on all 3 rows, and reached 7.57 s median speech-end-to-audio. The previous "Please identify this browser demo." prompt passed the identity gate but intermittently dropped the low-value opening word, pushing WER to 20%; the shorter "What app is this?", "What application is this?", "What is this app?", and "What demo is this?" synthetic prompts also intermittently dropped opening words or confused "application" with "cation/location", so they remain stress options rather than the default loopback gate. A partial-ASR-off loopback comparison was slightly faster at 5.94 s median speech-end-to-audio, but it dropped to 2/3 exact transcripts with "Wap is this.", so partial previews remain enabled by default. The loopback gate is still treated as a synthetic stability signal, not a substitute for real microphone testing. Normal response TTS remains at 1.08x.

The fake microphone scripted series reached 3/3 completed rows with 0% median WER, 1.66 s median ASR, 7.21 s median speech-end-to-audio, and 19.30 s median speech-end-to-audio-done through the browser capture path; because Chrome loops fake WAV input, the harness ignores non-matching partial prompt captures until the reference phrase is heard, then stops the mic so the response can finish.
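
For readers checking the WER figures above, the following is a minimal sketch of word error rate as word-level edit distance divided by reference length, plus a median helper for multi-row series. The function names and the normalization (lowercasing, stripping punctuation) are assumptions; the app's own scoring may normalize differently.

```javascript
// Lowercase, strip punctuation (keeping apostrophes), and split into words.
function normalizeWords(text) {
  return text.toLowerCase().replace(/[^a-z0-9'\s]/g, " ").split(/\s+/).filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a rolling-row Levenshtein distance over words.
function wordErrorRate(reference, hypothesis) {
  const ref = normalizeWords(reference);
  const hyp = normalizeWords(hypothesis);
  // Convention for this sketch: an empty reference scores 0 only if the
  // hypothesis is empty too.
  if (ref.length === 0) return hyp.length === 0 ? 0 : 1;

  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// Median of a numeric series, e.g. per-row WER or latency values.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

Under this definition, dropping the opening word of the five-word prompt "Please identify this browser demo." is one deletion out of five reference words, which is the 20% WER reported above.
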
A stale-ASR guard now ignores queued speech events after scripted fake-mic capture stops; the M2 regression run completed 3/3 rows after previously timing out from those stale interruptions. The live input-level meter uses the same microphone worklet chunks as VAD/STT, so it helps distinguish permission/device silence from VAD or transcription delay during real-mic testing. The loopback benchmark now feeds synthesized audio at real-time chunk intervals, uses a 480 ms VAD close delay after a 280 ms default split the prompt into separate words, and posts an explicit ASR flush after trailing silence so sticky candidates finish.

Moonshine Base remains the default STT because it is much faster than Whisper Tiny English while less error-prone than Moonshine Tiny: Whisper Tiny English also reached 0% WER in fake mic, but raised median ASR to 4.54 s and speech-end-to-audio to 10.94 s, while Moonshine Tiny failed the exact fake-mic gate before completing one row. The scripted mic benchmark still uses "What app is this?" as its reference phrase so real-mic rows can report WER/CER against the short spoken phrase, and the 3-run mic series keeps the same microphone open while collecting the repeated rows.

The latest full-stack model loads completed in about 15-35 seconds cold in this environment, with a same-profile fake-mic reload at about 10.5 seconds. A real WebGPU browser should be benchmarked next; this headless environment exposes only the SwiftShader software adapter.

## Known Limitations

- The default stack is English-first because Moonshine Base and `Supertone/supertonic` are English-focused. Supertonic 3 has broader language coverage, but the current low-friction Transformers.js TTS path uses the Supertonic ONNX package listed above.
- Partial transcripts are periodic previews, not true token-level streaming ASR. VAD and ASR are scheduled on separate worker queues so previews do not block turn-boundary detection, but they still add model inference work on low-end devices, so the UI has a toggle. In the latest synthetic loopback comparison, disabling previews was slightly faster but less exact, so previews stay enabled until real-mic validation proves the opposite tradeoff.
- Initial model download is large and network-bound. This is model loading only; user audio, transcript, LLM generation, TTS inference, and loaded voice embeddings stay client-side afterward. The selected Supertonic voice is loaded before the UI reports ready; other voices load lazily or in the background.
- WASM fallback works, but first-token latency is still borderline for natural conversation in this environment. With the default 2-step TTS setting, first Supertonic synthesis is about 0.8-1.5 seconds depending on chunk length and voice, while LLM first-token/decode still dominate the response delay. The intended low-latency experience should be verified on WebGPU.
- The loopback benchmark validates the local VAD/STT path, but it is not a substitute for real microphone testing. It now feeds chunks at audio-time cadence, but it still uses synthesized speech and can show synthetic-speaker ASR quirks.
- The fake microphone harness validates browser capture plumbing with Chrome's fake audio device, but it still uses synthetic Supertonic speech and cannot prove behavior with human speech, room acoustics, microphone gain, or echo cancellation.
- Microphone capture is routed through a muted Web Audio gain node so the capture worklet stays active without monitoring the user's mic through the speakers.
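
The muted-gain routing in the last limitation can be sketched with standard Web Audio calls (`createGain`, `connect`, `destination`). The helper name `connectMutedCapture` and its parameter names are hypothetical, and the app's actual graph may contain additional nodes.

```javascript
// Wires mic -> capture worklet -> muted gain -> destination.
// Connecting through to the destination keeps the worklet processing audio,
// while the zero gain prevents the mic from being heard on the speakers.
function connectMutedCapture(audioContext, micSource, captureNode) {
  const mute = audioContext.createGain();
  mute.gain.value = 0; // silence the monitoring path

  micSource.connect(captureNode);
  captureNode.connect(mute);
  mute.connect(audioContext.destination);
  return mute;
}
```

In a browser, `micSource` would typically come from `audioContext.createMediaStreamSource(stream)` after `navigator.mediaDevices.getUserMedia({ audio: true })`, and `captureNode` would be an `AudioWorkletNode` registered by the app.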