Series: AgentZero Lite — Part 9. The previous post (Part 8 — When the Model Is Smarter Than Your Test Fixture: Building a TTS↔STT Round-Trip Suite) built a unit-test baseline that proved the model pair (Windows SAPI + Whisper.net) preserves Korean and English content end-to-end on file-based round-trips. This post is what happened when we put that pair on a live mic.
TL;DR
- Unit tests said "the model pair is fine." The live AskBot mic said otherwise. Six rounds of investigation closed the gap.
- The biggest single win was a buffer leak between utterances — captured PCM was 9–17 seconds for 2 seconds of speech, dragging Whisper into YouTube-creator-outro hallucinations on the silence padding.
- RMS-over-whole-clip is the wrong silence measure for streaming captures (silence padding dilutes it). Peak + voice-activity-ratio is right.
- The same user complaint kept recurring across multiple "fixes" because "bypass" was a bad name — my framing of the feature didn't match the user's mental model. Resolved by collapsing two modes into one.
- A WPF thread-affinity gotcha (Application.Current.Windows is dispatcher-owned) hid behind one of those mode-unification fixes.
- The test tool I built to drive the pipeline ended up being more useful as a debugger than as a tester. Every fix was discovered through a single new log line in it.
Commit range: b1a0ec8 → 69eac4f. Net delta: ~700 LoC added, ~200 deleted.
Why the unit test wasn't enough
Part 8's xUnit suite established a deterministic baseline: same TTS + STT models, file in / text out, fold case + punctuation + whitespace, 4 / 5 cases pass with the fifth failing on a Whisper auto-correction (모래 → 모레, that is, "sand" → "the day after tomorrow"). That ruled out the model pair as the problem.
The live AskBot mic added a stack of variables the unit test couldn't see:
- VAD threshold and how it's set (sensitivity slider, maxed out by users who want to catch quiet speech).
- Pre-roll ring (the 1 s of audio captured before the VAD trigger, to preserve the initial consonant).
- Utterance hangover (40 frames ≈ 2 s of trailing silence before declaring the utterance complete).
- Buffer accumulation between utterances.
- Windows mic enhancements (echo cancellation, noise suppression).
- Real-time HTTP latency to OpenAI.
- Thread affinity inside the WPF dispatcher.
None of those exist in dotnet test. They needed a way to be exercised inside the running app.
The answer was a virtual voice test tool: a popup window inside AskBot that lets you type a phrase, synthesise it via TTS, and feed it through the same code path the live mic would use. Quick-phrase buttons for the same fixtures the unit test uses. A history list with replay buttons so you can A/B prior runs. Per-row debug WAV saved for offline listening.
The plan was "tester." What it actually became was "debugger" — a tool with enough diagnostic plumbing that every recurrence of "AskBot didn't react" pointed at exactly one log line.
Round 1 — Instrument the streaming pipe before guessing
The first thing we needed was visibility into where the latency was going. The legacy mic→STT path emitted exactly one log line per turn: Transcript sent (N chars). Useless for "why does this take 10 seconds?"
Five checkpoints went in:
    [t0] utterance-start
    [t1] utterance-end | t1-t0={Nms} · pcm={N bytes} (~Ns audio)
    [t2] pipeline-start | enqueue-lag={Nms} · provider=X · lang=Y · pcm=~Ns
    [stage] STT prep | {Nms} · ready=True
    [stage] STT transcribe | {Nms} · {N.NN}x realtime · chars={N}
    [t3] DONE | end-to-end {Nms} (utt-end → terminal) · enqueue · prep · transcribe · dispatch
The headline number, end-to-end Nms (utt-end → terminal), is what the user perceives. Subtract prep + transcribe + dispatch from it and what's left is structural latency baked into the system (the 2-second VAD hangover, mostly).
This was the single most valuable change in the whole sequence. Once these checkpoints were in the log, every subsequent bug pointed at exactly one of them.
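A minimal sketch of how per-stage durations like these could be collected and subtracted from the end-to-end figure. TurnTimeline and its members are hypothetical names for illustration, not AskBot's actual types; only Stopwatch and TimeSpan come from the .NET base library.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper (not AskBot's actual type): collects per-stage durations
// for one voice turn so a single DONE line can report all of them.
public sealed class TurnTimeline
{
    private readonly List<(string Stage, TimeSpan Elapsed)> _stages =
        new List<(string Stage, TimeSpan Elapsed)>();

    // Times a synchronous stage and stores its duration under the given name.
    public T Measure<T>(string stage, Func<T> work)
    {
        var sw = Stopwatch.StartNew();
        try { return work(); }
        finally { sw.Stop(); _stages.Add((stage, sw.Elapsed)); }
    }

    // Records a duration measured elsewhere (for example around an awaited call).
    public void Record(string stage, TimeSpan elapsed) => _stages.Add((stage, elapsed));

    // Sum of the durations recorded under any of the given stage names.
    public TimeSpan Total(params string[] stages)
    {
        var total = TimeSpan.Zero;
        foreach (var (stage, elapsed) in _stages)
            if (Array.IndexOf(stages, stage) >= 0) total += elapsed;
        return total;
    }

    // End-to-end minus the named stages: the latency no measured stage explains.
    public TimeSpan Unaccounted(TimeSpan endToEnd, params string[] stages)
        => endToEnd - Total(stages);

    // One log line in the spirit of the [t3] DONE checkpoint.
    public string Format(TimeSpan endToEnd)
    {
        var parts = new List<string>();
        foreach (var (stage, elapsed) in _stages)
            parts.Add($"{stage}={elapsed.TotalMilliseconds:F0}ms");
        return $"[t3] DONE | end-to-end {endToEnd.TotalMilliseconds:F0}ms · {string.Join(" · ", parts)}";
    }
}
```

Recording prep, transcribe, and dispatch and then calling Unaccounted(endToEnd, "prep", "transcribe", "dispatch") gives the remainder that the paragraph above attributes to structural latency.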
Round 2 — The silent killer: buffer leak between utterances
The first thing the new logs showed was indefensible:
    [t1] utterance-end | t1-t0=2049ms · pcm=544000 bytes (~17.00s audio)
    [t1] utterance-end | t1-t0=1999ms · pcm=312000 bytes (~9.75s audio)
    [t1] utterance-end | t1-t0=2100ms · pcm=403200 bytes (~12.60s audio)
t1-t0 is the duration of the utterance from start to end. PCM should be roughly pre-roll (1 s) + utterance (~2 s) + hangover (already counted in t1-t0), so 3-ish seconds. It was 9 to 17.
The bug, once we looked, was straight in front of us:
    // VoiceCaptureService.OnDataAvailable: every audio frame, every time
    if (BufferPcm)
    {
        lock (_pcmBuffer)
        {
            _pcmBuffer.AddRange(chunk); // <- always appends. Always.
        }
    }
BufferPcm was set true when the mic turned on. After that, every frame went into the buffer indefinitely. ConsumePcmBuffer cleared it on UtteranceEnded, but between utterances the buffer accumulated everything else: ambient room noise, fan hum, typing, keyboard clicks. By the time the second utterance fired, you'd built up a 5-15 s silence pile, then prepended pre-roll, then added the actual speech, then cleared.
The fix moved buffer accumulation into the VAD state transitions, where the only frames we care about are the ones inside an utterance:
    if (above)
    {
        if (!_inUtterance)
        {
            // Utterance starts: SeedBufferWithPreRoll on the consumer
            // side clears the buffer and adds pre-roll.
            _inUtterance = true;
            UtteranceStarted?.Invoke();
        }
        else
        {
            // Continuing in utterance, above threshold.
            if (BufferPcm) lock (_pcmBuffer) { _pcmBuffer.AddRange(chunk); }
        }
    }
    else if (_inUtterance)
    {
        // Trailing silence: still inside utterance window per hangover.
        if (BufferPcm) lock (_pcmBuffer) { _pcmBuffer.AddRange(chunk); }
        _utteranceSilenceFrames++;
        if (_utteranceSilenceFrames >= UtteranceHangoverFrames)
        {
            _inUtterance = false;
            UtteranceEnded?.Invoke();
        }
    }
    // else: outside utterance, below threshold. Drop.
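To see the gating in isolation, here is a self-contained sketch of the same idea. UtteranceGate is a hypothetical, simplified stand-in for the capture service: the VAD decision is passed in by the caller, the pre-roll ring is left out, and it assumes the silence counter resets whenever speech resumes (the excerpt above does not show where that happens).

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the capture service. Frames are byte
// arrays and the VAD decision ("above") is supplied by the caller.
public sealed class UtteranceGate
{
    private readonly List<byte> _pcm = new List<byte>();
    private readonly int _hangoverFrames;
    private bool _inUtterance;
    private int _silenceFrames;

    public UtteranceGate(int hangoverFrames) => _hangoverFrames = hangoverFrames;

    public int BufferedBytes => _pcm.Count;
    public int CompletedUtterances { get; private set; }
    public int LastUtteranceBytes { get; private set; }

    public void OnFrame(byte[] chunk, bool above)
    {
        if (above)
        {
            if (!_inUtterance)
            {
                // Utterance starts: discard anything stale before buffering.
                // (The real code seeds the buffer from a pre-roll ring here.)
                _pcm.Clear();
                _inUtterance = true;
            }
            _silenceFrames = 0; // assumption: speech restarts the hangover count
            _pcm.AddRange(chunk);
        }
        else if (_inUtterance)
        {
            // Trailing silence belongs to the utterance until the hangover expires.
            _pcm.AddRange(chunk);
            if (++_silenceFrames >= _hangoverFrames)
            {
                _inUtterance = false;
                LastUtteranceBytes = _pcm.Count;
                CompletedUtterances++;
                _pcm.Clear();
            }
        }
        // else: outside an utterance and below threshold, so the frame is dropped.
    }
}
```

However long the silence before or after an utterance, the buffered length in this sketch depends only on the frames between the VAD trigger and the end of the hangover.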
The next test logs were:
    [t1] utterance-end | t1-t0=2050ms · pcm=97600 bytes (~3.05s audio)
    [t1] utterance-end | t1-t0=2000ms · pcm=96000 bytes (~3.00s audio)
    [t1] utterance-end | t1-t0=2399ms · pcm=96000 bytes (~3.00s audio)
PCM length tracks utterance duration. The first big win.
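The byte counts and durations in these logs are consistent with 32,000 bytes per second, which is what 16 kHz, 16-bit, mono PCM produces. That format is inferred from the numbers, not stated in the post, so treat the constants below as assumptions. PcmMath is a hypothetical helper name.

```csharp
using System;

public static class PcmMath
{
    // Assumed capture format, inferred from the log arithmetic: 16 kHz, 16-bit, mono.
    public const int SampleRate = 16000;
    public const int BytesPerSample = 2;
    public const int Channels = 1;

    // Converts a raw PCM byte count into seconds of audio.
    public static double Seconds(long pcmBytes)
        => (double)pcmBytes / (SampleRate * BytesPerSample * Channels);

    // Expected clip length for a healthy capture: pre-roll plus the t1-t0 window
    // (the hangover is already inside t1-t0).
    public static double ExpectedSeconds(double preRollSeconds, double utteranceMs)
        => preRollSeconds + utteranceMs / 1000.0;
}
```

Under these assumptions, PcmMath.Seconds(544000) is 17.0 for an utterance whose expected length is about 3 seconds, which is the mismatch that exposed the leak.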
Round 3 — Whisper hallucinated YouTube outros even on good audio
Buffer leak fixed, PCM length right. Recognition output, however:
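    text="시청해 주셔서 감사합니다."
    text="시청해주셔서 감사합니다! 구독 좋아요 댓글 부탁드려요!"
    text="오늘도 시청해 주셔서 감사합니다!"
In English: "Thank you for watching.", "Thank you for watching! Please subscribe, like, and comment!", "Thank you for watching today as well!"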
These are the Korean-YouTube-creator-outro family. Whisper's training distribution is heavy on YouTube transcripts; on near-silence input it confidently emits these phrases. The audio length was right; the audio content was wrong.
The first attempt was an RMS-based energy gate. Drop turns where rms < -40 dBFS. That seemed reasonable until the user reported: "the audio is clearly audible, why is it being dropped?" The next batch of logs:
user feedback | peak | rms | gap |
--- | ---: | ---: | ---: |
audible speech | -19.6 dBFS | -52.3 dBFS | 32.7 dB |
audible speech | -27.3 dBFS | -56.5 dBFS | 29.2 dB |
audible speech | -27.4 dBFS | -55.5 dBFS | 28.1 dB |
audible speech | -29.5 dBFS | -56.6 dBFS | 27.1 dB |
Peak-to-RMS gap of 27–33 dB. For comparison the bypass-test clip (which is mostly speech) had a 16 dB gap. The streaming clip is structurally "0.5 s of speech burst inside a 3 s envelope" — RMS averages all of that and reports near-silence even when the burst itself is unmistakable.
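A small sketch of the two measurements, to make the dilution concrete. Levels is a hypothetical helper; it assumes 16-bit samples with 32768 as full scale.

```csharp
using System;

public static class Levels
{
    private const double FullScale = 32768.0; // 16-bit PCM full scale

    // Highest absolute sample in the range, in dBFS (0 dBFS = full scale).
    public static double PeakDbfs(short[] samples, int offset, int count)
    {
        int peak = 0;
        for (int i = offset; i < offset + count; i++)
            peak = Math.Max(peak, Math.Abs((int)samples[i]));
        return ToDb(peak / FullScale);
    }

    // Root-mean-square level over the range, in dBFS.
    public static double RmsDbfs(short[] samples, int offset, int count)
    {
        if (count <= 0) return double.NegativeInfinity;
        double sumSquares = 0;
        for (int i = offset; i < offset + count; i++)
        {
            double s = samples[i] / FullScale;
            sumSquares += s * s;
        }
        return ToDb(Math.Sqrt(sumSquares / count));
    }

    private static double ToDb(double linear)
        => linear <= 0 ? double.NegativeInfinity : 20.0 * Math.Log10(linear);
}
```

For a constant-level 0.5 s burst inside a 3 s clip of otherwise digital silence, the whole-clip RMS reads 10·log10(6) ≈ 7.8 dB below the burst's own RMS while the peak is unchanged. Real speech has a much higher peak-to-average ratio than a constant-level burst, which is consistent with the far larger gaps in the table.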
Replacement: a two-tier gate.
    const double MinSpeechPeakDbfs = -38.0;     // any audible moment in the clip
    const double MinFrameLoudDbfs = -45.0;      // per-frame "active" threshold
    const double MinVoiceActivityRatio = 0.10;  // ≥ 10% of 50 ms frames active

    if (peakDb < MinSpeechPeakDbfs) return drop;       // pure silence
    if (var50ms < MinVoiceActivityRatio) return drop;  // single click
Voice-activity-ratio is the ratio of 50 ms frames whose peak crosses the active threshold. For 0.5 s of sustained speech in a 3 s clip you get 10 active frames out of 60 → 17% → pass. For a single keyboard click in 3 s of silence you get 1 active frame out of 60 → 1.7% → reject. Same peak, different sustainment. Right separator.
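The gate described above can be sketched as a self-contained function. The three thresholds are the ones from the post; SpeechGate, Verdict, and the 16 kHz default sample rate are assumptions made for illustration.

```csharp
using System;

public static class SpeechGate
{
    public const double MinSpeechPeakDbfs = -38.0;     // any audible moment in the clip
    public const double MinFrameLoudDbfs = -45.0;      // per-frame "active" threshold
    public const double MinVoiceActivityRatio = 0.10;  // at least 10% of 50 ms frames active

    public enum Verdict { Pass, DropSilence, DropClick }

    private static double Dbfs(int peak)
        => peak <= 0 ? double.NegativeInfinity : 20.0 * Math.Log10(peak / 32768.0);

    // Fraction of whole frames whose peak reaches the per-frame threshold.
    public static double VoiceActivityRatio(short[] samples, int frameSamples)
    {
        int frames = samples.Length / frameSamples;
        if (frames == 0) return 0.0;

        int active = 0;
        for (int f = 0; f < frames; f++)
        {
            int peak = 0;
            for (int i = f * frameSamples; i < (f + 1) * frameSamples; i++)
                peak = Math.Max(peak, Math.Abs((int)samples[i]));
            if (Dbfs(peak) >= MinFrameLoudDbfs) active++;
        }
        return (double)active / frames;
    }

    // Tier 1 rejects clips with no audible moment; tier 2 rejects clips whose
    // loud moments are too brief to be sustained speech.
    public static Verdict Evaluate(short[] samples, int sampleRate = 16000)
    {
        int clipPeak = 0;
        foreach (short s in samples)
            clipPeak = Math.Max(clipPeak, Math.Abs((int)s));
        if (Dbfs(clipPeak) < MinSpeechPeakDbfs) return Verdict.DropSilence;

        int frameSamples = sampleRate * 50 / 1000; // 50 ms frames
        return VoiceActivityRatio(samples, frameSamples) < MinVoiceActivityRatio
            ? Verdict.DropClick
            : Verdict.Pass;
    }
}
```

A 3 s clip at 16 kHz has 60 frames of 800 samples each, so 0.5 s of sustained speech marks 10 of them active and passes, while a single loud sample marks one frame and is dropped as a click.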
Logs after this change cleanly distinguished the two:
    [t2] pipeline-start | ... · peak=-27.3dBFS · rms=-52.3dBFS · VAR=16.7%
    [stage] STT transcribe | ... · text="안녕하세요"
    [t3] DONE | ...

    [t2] pipeline-start | ... · peak=-22.0dBFS · rms=-58.0dBFS · VAR=1.7%
    dropped — VAR gate (active-frame ratio=1.7% < 10% · likely brief click/tap, not sustained speech)
The hallucination outros stopped appearing. Real speech still got through.
Round 4 — Mute and volume as user-controlled prerequisites
For the user to run a clean test, they needed mic mute and a system volume slider directly on the AskBot toolbar, not buried in Sound Settings. Two persisted fields in VoiceSettings, two new toolbar controls:
- MicMuted: bool. Soft mute. Capture stays alive (level meter still moves) but the VAD/STT pipeline ignores frames. Persisted; applied on every mic-on.
- MicVolumePercent: int (-1 = unset). Slider 0–100 controlling the Windows default capture endpoint's master volume via WASAPI (MMDeviceEnumerator + AudioEndpointVolume.MasterVolumeLevelScalar). System-wide effect, same as moving the slider in Sound Settings.
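The MicVolumePercent rule (apply a stored value, or leave the OS level untouched when the field is unset) can be sketched without audio hardware. ICaptureVolume, InMemoryVolume, and MicVolumePolicy are hypothetical names; in the app, the endpoint would presumably wrap the MasterVolumeLevelScalar property named above, which is a 0.0 to 1.0 scalar.

```csharp
using System;

// Hypothetical abstraction over the capture endpoint's master volume (0.0 to 1.0).
public interface ICaptureVolume
{
    float Scalar { get; set; }
}

// In-memory stand-in used only to demonstrate the rule.
public sealed class InMemoryVolume : ICaptureVolume
{
    public float Scalar { get; set; }
}

public static class MicVolumePolicy
{
    public const int Unset = -1;

    // Applies the persisted percent on mic-on and returns the slider position to show.
    public static int ApplyOnMicOn(int persistedPercent, ICaptureVolume endpoint)
    {
        if (persistedPercent == Unset)
        {
            // Leave the OS level alone and mirror it on the slider.
            return (int)Math.Round(endpoint.Scalar * 100f);
        }

        int clamped = Math.Max(0, Math.Min(100, persistedPercent));
        endpoint.Scalar = clamped / 100f;
        return clamped;
    }
}
```

The design choice this illustrates is that reading and writing are separate paths: an unset value only reads the current OS level, so opening AskBot for the first time never changes a system-wide setting.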
A small but useful default: -1 means "leave Windows alone", so a first-launch user sees the slider already at whatever the OS is set to, instead of forcing a value.
Round 5 — Mode unification: "bypass" was a bad name
Here's where the chronology gets interesting. The test tool originally had two modes:
- Bypass (default): synthesise → STT → show in popup. Don't drive AskBot.
- Acoustic loop (off by default): synthesise → speaker → mic → STT → AskBot terminal.
The acoustic loop was the original use case ("test as if user spoke to AskBot"), and it didn't work because Windows mic enhancements zero out the speaker output from the mic input — AskBot only ever heard ambient noise.
Multiple fixes tried to make this clearer:
- Added auto-mute checkbox so AskBot wouldn't pick up speaker echo during bypass tests.
- Rebuilt acoustic loop to do direct STT (no mic round-trip) but still play through the speaker for human verification.
- Updated the description text to explain the new semantics.
The user's complaint kept coming back: "AskBot 무반응" — AskBot doesn't react.
The diagnostic that finally surfaced the real issue wasn't a log line. It was rereading the original feature request from weeks earlier:
가상음성은 ... 음성재생을 일으켜 ... AskBot이 음성입력모드일때 자연스럽게 감지되어 반응함
Translation: "the virtual voice causes audio playback ... when AskBot is in voice input mode, it naturally detects and reacts."
The original ask was always "drive AskBot." My "bypass = popup-only" framing was a different feature than what the user wanted. Each iteration that tried to clarify the mode boundary just confirmed that we were debugging two different products.
The fix wasn't another bug fix. It was a naming change:
- Collapse the two modes. There is now ONE turn shape: synth → STT → inject into AskBot.
- The single remaining checkbox controls only "play through speaker for verification" — defaults to off. The transcript injection happens regardless.
- Mode line below the active-models row reads either "silent (no speaker playback)" or "ALSO speaker playback (verification)". No more "bypass."
The pattern matters more than the specific case: when the same user complaint recurs across multiple "fixes," the disagreement is upstream of the code. It's about what the feature is for. The fix lives in the requirements, not the implementation.
Round 6 — A WPF thread-affinity bug hiding inside the unification
The user re-ran the test after the mode unification:
"자원 점유 에러나서 안되는것같음, 테스트용은 생성 하나더 해야할것같음 (동시사용 성문제로보임)"
Translation: "looks like a resource conflict error — might need to create a separate test instance (looks like a concurrency issue)."
The user's diagnosis was wrong about the mechanism — there was no resource contention. But the symptom was real, and the log carried the actual cause:
    System.InvalidOperationException: 다른 스레드가 이 개체를 소유하고 있어
       at System.Windows.Application.get_Windows()
       at TestToolsWindow.FindAskBot() — line 182
       at <RunTurnAsync>b__0.MoveNext() — inside Task.Run
(The Korean message fragment reads, roughly, "a different thread owns this object.")
Application.Current.Windows is a dispatcher-owned WPF collection. It's only safe to enumerate from the UI thread. The mode-unification commit had moved the AskBot-window lookup into a Task.Run worker by accident.
The fix was three lines:
    // Before Task.Run, on the UI thread:
    var bot = FindAskBot();

    await Task.Run(async () =>
    {
        // ... synth, decode, STT ...
        bot?.SendVoiceTranscript(transcript); // safe: dispatcher-marshals internally
    });
SendVoiceTranscript already had a Dispatcher.CheckAccess() marshal inside it. The dispatcher-bound part was only the enumeration via Application.Current.Windows. Capture the reference once on the UI thread, close over it.
This is the kind of bug where the user's debugging instinct ("must be concurrency") and the actual cause ("must be UI thread affinity") share a vocabulary. They are different bugs with similar surface stories. The log's stack trace settled it in seconds; without that, we'd have spent hours scaffolding "test isolation."
What the test tool actually taught
Six rounds of debugging through one tool. Distilling:
- Per-stage timing log was the primary instrument. It surfaced the buffer leak, the energy-gate-vs-RMS mismatch, the thread-affinity throw, and the structural 2-second VAD hangover. Every fix was triggered by one specific log line.
- Acoustic loops are unreliable on Windows. The native echo-cancellation + noise-suppression filters can zero out the speaker output reaching the mic. For automated voice testing, don't depend on the round trip — feed the synthesised audio directly to STT and play through the speaker only as human verification.
- RMS-over-whole-clip is the wrong silence measure when the audio is structurally "speech burst inside a silence envelope." Peak + voice-activity-ratio is right.
- Naming is requirements, not implementation. When the same complaint keeps surfacing across multiple "fixes," the disagreement is what the feature is for, not how it's coded. Renaming and collapsing modes can resolve more bugs than another patch.
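- Capture WPF dispatcher-owned objects on the UI thread before Task.Run. Application.Current.Windows and a few other globals throw on background-thread access. Closing over the reference is enough only when the members you then call marshal to the dispatcher themselves, as SendVoiceTranscript does with its Dispatcher.CheckAccess() check. WPF does not do this automatically: most members of a dispatcher-owned object throw when called from another thread.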
What's still on the table
- Whisper Small vs Medium accuracy comparison. The unit test uses Medium for quality measurement; the running app uses Small for speed. A [Theory] row that runs both on the same fixtures would quantify the trade-off on the actual deployed phrases.
- GPU acceleration. Vulkan / CUDA Whisper.net runtimes are wired but not on. The unit test's per-stage timings give a clean baseline to compare against once they are.
- Reactor token streaming. The original Akka.Streams Step 2 promise (Part 7) was per-token streaming all the way to the speaker. Currently AskBot waits for the full reactor result before TTS. The plumbing exists; the wiring is one more cycle of work.
- VAD hangover tuning. Currently 40 frames (2 s). At 20 frames (1 s) you'd shave a second off the perceived latency at the cost of mid-utterance pauses occasionally splitting into two segments. Worth measuring.
Closing
What started as "build a virtual voice tester" ended up as a six-round debugger. The model pair worked in the unit test. It didn't work live until the streaming pipeline around it was rebuilt to not leak between-utterance audio, not gate on the wrong silence measure, not depend on a hardware acoustic loop, not split the same feature across two confusing modes, and not enumerate WPF globals from a background thread.
The diagnostic instrumentation stays. The next regression in this area will be visible in a single log line.
Cloud Whisper swap, GPU runtime, and the reactor token stream are the next three numbers worth measuring.
- psmon