Mike0021/browser-speak

Mike0021 committed 18 days ago

Commit 2ef02c9 (verified) · 1 parent: d2ae80e

Tolerate browse demo ASR identity near-miss

Files changed (2)

README.md +1 -1
app.js +1 -0

README.md CHANGED

@@ -178,7 +178,7 @@ Measured on this workspace with Chrome 145 headless, x86_64, 4 vCPU, Transformer
 | Voice loopback | Moonshine Tiny fp32/q4 | SmolLM2 135M q4 WASM | 2 | 1.08 s | 25% | pass 4/4 | 480 ms | 3.72 s | 4.84 s | 951 ms | 5.79 s | 6.88 s | 1.4 tok/s |
 | Voice loopback | Whisper Tiny English fp32/q4 | SmolLM2 135M q4 WASM | 2 | 5.62 s | 0% | pass 4/4 | 480 ms | 4.63 s | 6.63 s | 902 ms | 7.53 s | 13.16 s | 1.3 tok/s |
-The first-clause TTS threshold reduced the earlier SmolLM2 360M text benchmark's first-audio time from 45.84 s to 12.25 s in headless WASM by moving TTS queueing from 44.43 s after transcript to 11.01 s. Switching the default LLM to SmolLM2 135M, splitting generic and identity prompts, and using a word-boundary-safe first chunk reduced normal chat first audio to 4.34 s in the earlier default suite. The first response chunk now targets roughly 5 characters, keeps a minimum safe space boundary, and searches forward to the next word boundary before falling back to a hard character cut; this fixed the rejected early failure that produced "I'm read", keeps the current chat first chunk as "I'm ready", and lets identity turns start on the safe first-word chunk "This". Identity turns use a stricter system prompt plus tiny pinned examples, and repeated identity prompts now ignore prior identity history, keeping the fake-mic prompt at 187 input tokens across all three scripted rows instead of growing turn by turn. The identity-intent detector also tolerates narrow ASR near-misses such as "browser dome" for "browser demo" and "identifies" for "identify"; the original transcript and WER are still preserved. The latest fake-mic series measured 4.98 s median transcript-to-first-token and 7.21 s median speech-end-to-first-audio. A current-code 360M text run improved the identity answer from 2/4 to 3/4 concepts, but first audio was still 14.75 s and the answer omitted the LLM component, so 135M remains the default latency/quality choice. Qwen3 0.6B terminated the local headless browser during WASM model load before it reached benchmark-ready state, so the app now blocks Qwen3 and SmolLM2 1.7B on WASM fallback and requires WebGPU for those candidates. The one-load TTS sweep confirms 2-step Supertonic is the right default for latency; 8 steps roughly doubled first-audio time, and M2 at 2 steps was the fastest isolated voice/step setting at 2.00 s versus F2 at 2.10 s. 
The app still defaults to F2 because the latest F2 fake-mic series reached 7.21 s median speech-end-to-audio, while M2's repeated fake-mic validation reached 0% WER but slower 11.64 s speech-end-to-audio and 23.42 s speech-end-to-audio-done; M2 also measured slower in loopback, so real-mic and WebGPU validation are still needed before reconsidering the default. The TTS worker now loads the selected voice before reporting ready, then preloads the remaining voices in the background; the latest hosted client-side smoke measured a 1.65 s TTS first-audio row with all benchmark-phase network requests already settled. The current default benchmark suite was verified with actual workers and produced TTS, barge-in, identity, chat, and loopback rows before re-enabling controls. The barge-in check verified that a synthetic speech-start event cancels in-flight TTS before stale audio plays and returns the TTS tile to Ready. After repeated loopback checks, the synthetic loopback stability gate now uses the prompt "Identify this browser demo.", 1.00x prompt speed, and a short silence preroll; the latest targeted run completed 3/3 rows, reached 3/3 exact transcripts with 0% median WER, passed the identity answer gate on all 3 rows, and reached 7.57 s median speech-end-to-audio. The previous "Please identify this browser demo." prompt passed the identity gate but intermittently dropped the low-value opening word, pushing WER to 20%; the shorter "What app is this?", "What application is this?", "What is this app?", and "What demo is this?" synthetic prompts also intermittently dropped opening words or confused "application" with "cation/location", so they remain stress options rather than the default loopback gate. A partial-ASR-off loopback comparison was slightly faster at 5.94 s median speech-end-to-audio, but it dropped to 2/3 exact transcripts with "Wap is this.", so partial previews remain enabled by default. 
The loopback gate is still treated as a synthetic stability signal, not a substitute for real microphone validation. Normal response TTS remains at 1.08x. The fake microphone scripted series reached 3/3 completed rows with 0% median WER, 1.66 s median ASR, 7.21 s median speech-end-to-audio, and 19.30 s median speech-end-to-audio-done through the browser capture path; because Chrome loops fake WAV input, the harness ignores non-matching partial prompt captures until the reference phrase is heard, then stops the mic so the response can finish. A stale-ASR guard now ignores queued speech events after scripted fake-mic capture stops; the M2 regression run completed 3/3 rows after previously timing out from those stale interruptions. The live input-level meter uses the same microphone worklet chunks as VAD/STT, so it helps distinguish permission/device silence from VAD or transcription delay during real-mic testing. The loopback benchmark now feeds synthesized audio at real-time chunk intervals, uses a 480 ms VAD close delay after a 280 ms default split the prompt into separate words, and posts an explicit ASR flush after trailing silence so sticky candidates finish. Moonshine Base remains the default STT because it is much faster than Whisper Tiny English while less error-prone than Tiny: Whisper Tiny English also reached 0% WER in fake mic, but raised median ASR to 4.54 s and speech-end-to-audio to 10.94 s, while Moonshine Tiny failed the exact fake-mic gate before completing one row. The scripted mic benchmark still uses "What app is this?" as its reference phrase so real-mic rows can report WER/CER against the short spoken phrase, and the 3-run mic series keeps the same microphone open while collecting the repeated rows. The latest full-stack model loads completed in about 15-35 seconds cold in this environment, with a same-profile fake-mic reload at about 10.5 seconds. 
A real WebGPU browser should be benchmarked next; this headless environment exposes only the SwiftShader software adapter.
+The first-clause TTS threshold reduced the earlier SmolLM2 360M text benchmark's first-audio time from 45.84 s to 12.25 s in headless WASM by moving TTS queueing from 44.43 s after transcript to 11.01 s. Switching the default LLM to SmolLM2 135M, splitting generic and identity prompts, and using a word-boundary-safe first chunk reduced normal chat first audio to 4.34 s in the earlier default suite. The first response chunk now targets roughly 5 characters, keeps a minimum safe space boundary, and searches forward to the next word boundary before falling back to a hard character cut; this fixed the rejected early failure that produced "I'm read", keeps the current chat first chunk as "I'm ready", and lets identity turns start on the safe first-word chunk "This". Identity turns use a stricter system prompt plus tiny pinned examples, and repeated identity prompts now ignore prior identity history, keeping the fake-mic prompt at 187 input tokens across all three scripted rows instead of growing turn by turn. The identity-intent detector also tolerates narrow ASR near-misses such as "browse demo" or "browser dome" for "browser demo" and "identifies" for "identify"; the original transcript and WER are still preserved. The latest fake-mic series measured 4.98 s median transcript-to-first-token and 7.21 s median speech-end-to-first-audio. A current-code 360M text run improved the identity answer from 2/4 to 3/4 concepts, but first audio was still 14.75 s and the answer omitted the LLM component, so 135M remains the default latency/quality choice. Qwen3 0.6B terminated the local headless browser during WASM model load before it reached benchmark-ready state, so the app now blocks Qwen3 and SmolLM2 1.7B on WASM fallback and requires WebGPU for those candidates. 
The one-load TTS sweep confirms 2-step Supertonic is the right default for latency; 8 steps roughly doubled first-audio time, and M2 at 2 steps was the fastest isolated voice/step setting at 2.00 s versus F2 at 2.10 s. The app still defaults to F2 because the latest F2 fake-mic series reached 7.21 s median speech-end-to-audio, while M2's repeated fake-mic validation reached 0% WER but slower 11.64 s speech-end-to-audio and 23.42 s speech-end-to-audio-done; M2 also measured slower in loopback, so real-mic and WebGPU validation are still needed before reconsidering the default. The TTS worker now loads the selected voice before reporting ready, then preloads the remaining voices in the background; the latest hosted client-side smoke measured a 1.65 s TTS first-audio row with all benchmark-phase network requests already settled. The current default benchmark suite was verified with actual workers and produced TTS, barge-in, identity, chat, and loopback rows before re-enabling controls. The barge-in check verified that a synthetic speech-start event cancels in-flight TTS before stale audio plays and returns the TTS tile to Ready. After repeated loopback checks, the synthetic loopback stability gate now uses the prompt "Identify this browser demo.", 1.00x prompt speed, and a short silence preroll; the latest targeted run completed 3/3 rows, reached 3/3 exact transcripts with 0% median WER, passed the identity answer gate on all 3 rows, and reached 7.57 s median speech-end-to-audio. The previous "Please identify this browser demo." prompt passed the identity gate but intermittently dropped the low-value opening word, pushing WER to 20%; the shorter "What app is this?", "What application is this?", "What is this app?", and "What demo is this?" synthetic prompts also intermittently dropped opening words or confused "application" with "cation/location", so they remain stress options rather than the default loopback gate. 
A partial-ASR-off loopback comparison was slightly faster at 5.94 s median speech-end-to-audio, but it dropped to 2/3 exact transcripts with "Wap is this.", so partial previews remain enabled by default. The loopback gate is still treated as a synthetic stability signal, not a substitute for real microphone testing. Normal response TTS remains at 1.08x. The fake microphone scripted series reached 3/3 completed rows with 0% median WER, 1.66 s median ASR, 7.21 s median speech-end-to-audio, and 19.30 s median speech-end-to-audio-done through the browser capture path; because Chrome loops fake WAV input, the harness ignores non-matching partial prompt captures until the reference phrase is heard, then stops the mic so the response can finish. A stale-ASR guard now ignores queued speech events after scripted fake-mic capture stops; the M2 regression run completed 3/3 rows after previously timing out from those stale interruptions. The live input-level meter uses the same microphone worklet chunks as VAD/STT, so it helps distinguish permission/device silence from VAD or transcription delay during real-mic testing. The loopback benchmark now feeds synthesized audio at real-time chunk intervals, uses a 480 ms VAD close delay after a 280 ms default split the prompt into separate words, and posts an explicit ASR flush after trailing silence so sticky candidates finish. Moonshine Base remains the default STT because it is much faster than Whisper Tiny English while less error-prone than Tiny: Whisper Tiny English also reached 0% WER in fake mic, but raised median ASR to 4.54 s and speech-end-to-audio to 10.94 s, while Moonshine Tiny failed the exact fake-mic gate before completing one row. The scripted mic benchmark still uses "What app is this?" as its reference phrase so real-mic rows can report WER/CER against the short spoken phrase, and the 3-run mic series keeps the same microphone open while collecting the repeated rows. 
The latest full-stack model loads completed in about 15-35 seconds cold in this environment, with a same-profile fake-mic reload at about 10.5 seconds. A real WebGPU browser should be benchmarked next; this headless environment exposes only the SwiftShader software adapter.
 ## Known Limitations
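
The first-chunk rule described in the README paragraph above (take a short chunk, but only cut at a space that sits at or past a minimum safe offset, and hard-cut only when no boundary exists) can be sketched roughly as follows. This is not the code in `app.js`: the function name, the `minBoundary = 4` offset, and the `maxChars = 32` hard-cut limit are assumptions chosen so the sketch reproduces the two chunks quoted in the README ("I'm ready" and "This"), and it works on a complete string rather than a streaming token buffer.

```javascript
// Hypothetical sketch of a word-boundary-safe first TTS chunk.
// minBoundary and maxChars are assumed values, not taken from app.js.
function firstChunkSketch(text, minBoundary = 4, maxChars = 32) {
  const trimmed = text.trimStart();
  // Find the first whitespace at or after the minimum safe offset,
  // so the chunk never ends in the middle of a word.
  for (let i = minBoundary; i < trimmed.length && i < maxChars; i++) {
    if (/\s/.test(trimmed[i])) {
      return trimmed.slice(0, i);
    }
  }
  // No boundary found: emit short text whole, hard-cut long text.
  return trimmed.length <= maxChars ? trimmed : trimmed.slice(0, maxChars);
}

console.log(firstChunkSketch("I'm ready to help."));      // "I'm ready"
console.log(firstChunkSketch("This is a browser demo.")); // "This"
```

Searching forward from the minimum offset, instead of cutting at a fixed character count, is what avoids a fragment such as "I'm read": the space after "I'm" is below the offset, so the search continues to the next boundary.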

app.js CHANGED

@@ -1612,6 +1612,7 @@ function normalizeForQuality(text) {
 function normalizeIdentityIntent(text) {
   return normalizeForQuality(text)
+    .replace(/\bbrowse\b/g, "browser")
     .replace(/\bdome\b/g, "demo")
     .replace(/\bdemos\b/g, "demo")
     .replace(/\bidentifies\b/g, "identify");
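
To illustrate what the added rule does, the sketch below runs the same replace chain on a few sample transcripts. The body of `normalizeForQuality` is outside this hunk, so the lowercase-and-strip-punctuation stand-in here is an assumption, and both `...Sketch` names are hypothetical.

```javascript
// Hypothetical stand-in: the real normalizeForQuality is not shown in this diff.
function normalizeForQualitySketch(text) {
  return String(text)
    .toLowerCase()
    .replace(/[^a-z0-9\s']/g, " ") // drop punctuation
    .replace(/\s+/g, " ")          // collapse whitespace
    .trim();
}

// Same replace chain as the patched normalizeIdentityIntent.
function normalizeIdentityIntentSketch(text) {
  return normalizeForQualitySketch(text)
    .replace(/\bbrowse\b/g, "browser") // new rule: "browse demo" near-miss
    .replace(/\bdome\b/g, "demo")
    .replace(/\bdemos\b/g, "demo")
    .replace(/\bidentifies\b/g, "identify");
}

const samples = [
  "Identify this browse demo.",
  "Identifies this browser dome.",
  "Identify this browser demo.",
];
for (const transcript of samples) {
  // The transcript itself is left untouched; only the intent key is normalized.
  console.log(JSON.stringify(transcript), "->", normalizeIdentityIntentSketch(transcript));
}
```

Under these assumptions all three samples normalize to `identify this browser demo`. The `\b` anchors keep the new rule from touching an already-correct "browser", but any standalone "browse" (for example in "browse the web") is rewritten as well, which seems acceptable only because the result feeds intent detection and not the reported transcript or WER.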