
When the Model Is Smarter Than Your Test Fixture — Building a TTS↔STT Round-Trip Suite for AgentZero Lite

Series: AgentZero Lite — Part 8. The previous post (Part 7 — Step 2 Landed: Akka.NET Streams Now Carry the Whole Voice Path) put a streaming substrate under the voice subsystem. This one puts a measurement instrument on top — so we know whether it's getting better or worse between commits.

TL;DR

  • 5-case xUnit suite that runs TTS (Windows SAPI default voice) → WAV → resample → STT (Whisper.net medium, CPU) and compares the transcript to the input.
  • Pure-model, file-based, no Akka, no streams. Deterministic baseline; the streaming variant becomes measurable as a delta against it.
  • Comparison folds case + punctuation + whitespace before equality so tokeniser cosmetics ("hello" → "Hello.") aren't reported as content drift.
  • 4 / 5 pass. The one fail is the interesting case — Whisper auto-corrected the user's input typo 모래 (sand) to the contextually correct 모레 (day after tomorrow). Edit distance 1, similarity 96.8 %. The test reports the drift honestly; the post argues it should.
  • Six lessons worth writing down so the next person doesn't re-derive them. The middle four are technique; the bookends are philosophy.
Implementation: Project/AgentTest/Voice/TtsSttRoundTripTests.cs, commit b1a0ec8.

Why we needed this now

The Part 7 → Part 8 gap is a classic refactor-then-instrument move.
By the end of Part 7 the voice pipeline had a streaming substrate, a barge-in detector, a sentence chunker, and an opt-in feature flag. What it didn't have was a way to answer the question "is the new version better?" — every quality check was a human listening to a speaker and squinting at a transcript.
For a single change that's fine. For a series of small tuning passes (slow the SAPI rate by 15 %, pad the WAV with 1 s of room noise, switch the resampler, swap the model) it's the wrong loop. We were debating whether 안녕하세요 came out as 안녕하세요 or as Thank you for watching, please subscribe and like based on memory. Time to measure.
Constraints I set going in:
  • No mic, no speaker. The acoustic path has its own problems (Windows mic enhancements, room noise, hardware variance). Validating the raw model pair means a file-based round-trip — synthesise to WAV, hand the WAV to STT.
  • No actors, no streams. Those are operational concerns. The question here is "what do TTS and STT do to a string when neither has any other moving parts in the way?"
  • Real providers. Mocking either end defeats the point — we want to know the actual hallucination rate of the actual deployed pipeline, not the rate of a mock.
That gave the topology that landed:
    flowchart LR
        A["input string"] --> B["WindowsTts.SynthesizeAsync"]
        B --> C["WAV bytes"]
        C --> D["File.WriteAllBytes<br/>*-tts.wav"]
        C --> E["WavToPcm.To16kMono<br/>(NAudio resample)"]
        E --> F["PCM 16k mono"]
        F --> G["File.WriteAllBytes<br/>*-stt-input-16k.wav"]
        F --> H["WhisperLocalStt.TranscribeAsync<br/>(medium, CPU)"]
        H --> I["transcript string"]
        A --> J["Normalize<br/>(case + punct + WS)"]
        I --> J
        J --> K{"== ?"}
        K --> L["VERDICT + diff + timings"]
        classDef io fill:#1f2937,stroke:#0ea5e9,color:#e5e7eb
        classDef pure fill:#0b3d2e,stroke:#10b981,color:#d1fae5
        class B,H,E pure
        class D,G io
Five cases — same fixtures the production UI uses for its quick-phrase buttons:
    [Fact]
    public Task Korean_short_안녕하세요() =>
        RunRoundTrip("안녕하세요", language: "ko", caseId: "ko-short");

    [Fact]
    public Task Korean_question_today_weather() =>
        RunRoundTrip("오늘의 날씨는 어때?", language: "ko", caseId: "ko-question");

    [Fact]
    public Task Korean_long_multi_clause() =>
        RunRoundTrip(
            "내일의 날씨는 말고 모래의날씨는 흐리고 그리고 주간내내 비올예정입니다.",
            language: "ko", caseId: "ko-long");

    [Fact]
    public Task English_short_hello() =>
        RunRoundTrip("hello", language: "en", caseId: "en-short");

    [Fact]
    public Task English_question_how_are_you() =>
        RunRoundTrip("how are you?", language: "en", caseId: "en-question");
Mirroring the production UI's fixtures verbatim is a deliberate choice — including the typo in the long one, which gets its own section below.

Lesson 1 — Comparison: == is too strict, Contains is too loose

Whisper outputs proper-case + punctuation regardless of input shape. That's the tokeniser doing its job, not the recogniser failing.
| input | Whisper output | Levenshtein | content drift? |
|---|---|---:|---|
| hello | Hello. | 2 | no (tokeniser cosmetics) |
| how are you? | How are you? | 1 | no (capital H only) |
| 안녕하세요 | 안녕하세요 | 0 | no |
Naive == flunks the first two even though nothing was misrecognised. A loose threshold (say, similarity > 0.9, which passes anything with mostly-overlapping characters) hides the cases where the model actually did substitute or drop a word.
The level that matches what we actually mean by "the model got it" is case-folded, punctuation-stripped, whitespace-stripped equality:
    private static string Normalize(string s)
    {
        if (string.IsNullOrEmpty(s)) return string.Empty;
        var sb = new StringBuilder(s.Length);
        foreach (var c in s)
        {
            if (char.IsWhiteSpace(c)) continue;
            if (char.IsPunctuation(c)) continue; // covers .,?!;:'\"() and Unicode kin
            sb.Append(char.ToLowerInvariant(c)); // no-op for Korean
        }
        return sb.ToString();
    }
Two small details that matter:
  • char.IsPunctuation instead of a hard-coded list. It picks up smart quotes, em-dashes, and full-width punctuation that Whisper happily emits for Korean text.
  • ToLowerInvariant is unconditional. Hangul has no case so applying it costs nothing, and it means the same Normalize works for both languages. Branchless code that's trivially right is worth more than language-aware cleverness.
Whitespace is stripped entirely rather than collapsed. For round-trip equivalence we don't care if the model adds a space between 오늘의 and 날씨 — different tokenisers split differently and there's no semantic change.
After normalising, content drift becomes legible. Which is exactly what we want before talking about lesson 2.
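The Levenshtein column in the table above, and the similarity percentage the suite prints, can be reproduced with a few lines. The following is a minimal sketch, not the suite's actual code: the `TextDrift` class and its method names are hypothetical, and the similarity formula (1 minus distance over the longer length) is an assumption, chosen because it reproduces the 96.8 % the post reports for one edit in a 31-character string.

```csharp
using System;

public static class TextDrift
{
    // Two-row Levenshtein distance over UTF-16 code units.
    // Hangul syllables are single code units, so this counts them as one character each.
    public static int EditDistance(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (var j = 0; j <= b.Length; j++) prev[j] = j;

        for (var i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (var j = 1; j <= b.Length; j++)
            {
                var cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1), // insertion, deletion
                    prev[j - 1] + cost);                    // substitution or match
            }
            (prev, curr) = (curr, prev); // the row just filled becomes "previous"
        }
        return prev[b.Length];
    }

    // Assumed formula: 1 - distance / max(length). Two empty strings count as identical.
    public static double Similarity(string a, string b)
    {
        var longer = Math.Max(a.Length, b.Length);
        return longer == 0 ? 1.0 : 1.0 - (double)EditDistance(a, b) / longer;
    }
}
```

Run on the raw strings, `EditDistance("hello", "Hello.")` is 2 and `EditDistance("how are you?", "How are you?")` is 1, matching the table. Both numbers are readings to print next to the verdict; neither is the pass criterion.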

Lesson 2 — Whisper auto-corrects the fixture's typo. Test still fails. Good.

The long Korean fixture is "내일의 날씨는 말고 모래의날씨는 흐리고 그리고 주간내내 비올예정입니다."
모래 means sand. 모레 means day after tomorrow. They're a single Hangul vowel apart, and the surrounding sentence is unambiguously about weather forecasting, so context tells any Korean reader that 모레 is what was meant. The user's quick-phrase fixture has 모래. It might be a typo. It might be a deliberate use of the near-homophone. We can't tell.
Whisper transcribed it as 모레 — the contextually correct word. The model corrected the input.
The test result:
    ─── COMPARISON ────────────────────────────────────────────────────────
    normalised input : "내일의날씨는말고모래의날씨는흐리고그리고주간내내비올예정입니다"
    normalised output : "내일의날씨는말고모레의날씨는흐리고그리고주간내내비올예정입니다"
    exact match : False
    edit distance : 1
    similarity : 96.8%
    VERDICT : ✗ FAIL — see comparison
This case is going to come up over and over and the temptation is going to be:
"It's only one character. Bump the threshold to >= 0.95 PASS. Move on."
I think that's the wrong call. Three reasons:
  1. The 1-character drift is exactly the signal you wanted. The whole point of the suite is to know whether TTS+STT is preserving content. Whisper changed the content. That's the answer to the question. Hiding it because the change is small and arguably correct doesn't preserve the answer; it discards it.
  2. The next change won't be benign. Today it's a homophone correction nobody objects to. Tomorrow it's 송파구 → 송파의 구 (Songpa's borough) because the model thinks the user meant a possessive: also a single edit after normalisation, also "smart", and it breaks downstream code that's looking for an exact district name. A fuzzy threshold won't tell you which one is happening.
  3. The console dump tells the operator everything. The normalised input and normalised output are printed side by side: if the diff is 모래 → 모레, the reviewer sees "ah, the model fixed my typo" and can move on. If it's 송파구 → 송파의 구, the reviewer sees the actual problem. Both are 1-edit-distance failures; they're not the same kind of problem.
So the rule is: report the drift honestly, let the human read the diff. That's "the console dump IS the report" in concrete form — the test isn't a green/red gate, it's a measurement instrument that prints its readings.
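One way to express that rule in code is a strict gate that also points at where the strings diverge, so the reader of the dump does not have to scan for the difference. This is a sketch under assumptions rather than the suite's implementation: `DriftGate` and `Check` are hypothetical names, and `Normalize` is repeated only to keep the block self-contained.

```csharp
using System;
using System.Text;

public static class DriftGate
{
    // Same folding as the Normalize shown earlier in the post.
    public static string Normalize(string s)
    {
        if (string.IsNullOrEmpty(s)) return string.Empty;
        var sb = new StringBuilder(s.Length);
        foreach (var c in s)
        {
            if (char.IsWhiteSpace(c) || char.IsPunctuation(c)) continue;
            sb.Append(char.ToLowerInvariant(c));
        }
        return sb.ToString();
    }

    // Pass only on exact normalised equality; no similarity threshold is involved.
    // On failure, FirstDiff is the index in the normalised strings where they first diverge.
    public static (bool Pass, int FirstDiff) Check(string input, string transcript)
    {
        var a = Normalize(input);
        var b = Normalize(transcript);
        var shared = Math.Min(a.Length, b.Length);
        var i = 0;
        while (i < shared && a[i] == b[i]) i++;
        var pass = i == shared && a.Length == b.Length;
        return (pass, pass ? -1 : i);
    }
}
```

For the long fixture this returns `(false, 9)`: index 9 of the normalised strings is 래 in the input and 레 in the transcript. The hypothetical 송파구 → 송파의 구 case fails the same way at index 2, and it is the printed characters, not the score, that tell the two apart.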
This is the lesson I most wanted to write down. It's the one nobody believes until they're staring at their fourth voice pipeline failing the same way.

Lesson 3 — Dual output channel: ITestOutputHelper and Console.WriteLine

xUnit's canonical capture is ITestOutputHelper. It's right next to the pass/fail in test explorer. It's also captured by dotnet test regardless of verbosity.
What it doesn't do: show up in CI loggers like "console;verbosity=detailed" that target standard output. Those want Console.WriteLine.
Solution: write to both.
    private void Log(string line)
    {
        try { _output.WriteLine(line); }
        catch { /* helper may be disposed during failed assert teardown */ }
        Console.WriteLine(line);
    }
The try/catch isn't paranoia. When Assert.Fail runs, the test helper's lifecycle can finalize before later cleanup tries to write. That's a secondary failure that masks the original. Catch and move on.
Two outputs is overkill for a happy-path local run. It pays for itself the first time the test fails in CI and you're trying to figure out which logger captured what.

Lesson 4 — Always save the audio as evidence

When the comparison fails, the next question is: "does the audio sound right?"
You can answer it instantly if both intermediate WAVs are on disk:
    %TEMP%\agentzero-tts-stt-tests\
      20260429-180920-162-ko-long-tts.wav             ← what SAPI produced
      20260429-180920-162-ko-long-stt-input-16k.wav   ← what Whisper actually heard
Listening in order is a decision tree:
| TTS WAV sounds | STT-input WAV sounds | Conclusion |
|---|---|---|
| right | right | model hallucination (a real STT bug) |
| right | wrong | resample regression (WavToPcm bug) |
| wrong | (irrelevant) | TTS-side regression or wrong voice picked |
Each branch points at a different fix. Without the artifacts you'd re-instrument the pipeline and re-run.
Cost is rounding-error disk space. Benefit is being able to diagnose any failure without leaving the file explorer.
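A sketch of how the two artifacts could be written, for readers who want the same evidence trail in their own suite. Everything here is an assumption rather than the suite's code: `EvidenceFiles`, `WrapPcm16AsWav` and `Save` are hypothetical names, the timestamp format is inferred from the file names above, and the 44-byte header is the plain RIFF/WAVE PCM layout (the real suite may well use NAudio for this step).

```csharp
using System;
using System.Globalization;
using System.IO;
using System.Text;

public static class EvidenceFiles
{
    // Wraps raw 16-bit little-endian PCM in a minimal 44-byte RIFF/WAVE header
    // so the bytes handed to STT can be opened in any audio player.
    public static byte[] WrapPcm16AsWav(byte[] pcm, int sampleRate = 16000, short channels = 1)
    {
        const short bitsPerSample = 16;
        var blockAlign = (short)(channels * bitsPerSample / 8);
        using var ms = new MemoryStream(44 + pcm.Length);
        using (var w = new BinaryWriter(ms, Encoding.ASCII, leaveOpen: true))
        {
            w.Write(Encoding.ASCII.GetBytes("RIFF"));
            w.Write(36 + pcm.Length);          // size of everything after this field
            w.Write(Encoding.ASCII.GetBytes("WAVE"));
            w.Write(Encoding.ASCII.GetBytes("fmt "));
            w.Write(16);                       // fmt chunk size for PCM
            w.Write((short)1);                 // audio format 1 = uncompressed PCM
            w.Write(channels);
            w.Write(sampleRate);
            w.Write(sampleRate * blockAlign);  // byte rate
            w.Write(blockAlign);
            w.Write(bitsPerSample);
            w.Write(Encoding.ASCII.GetBytes("data"));
            w.Write(pcm.Length);
            w.Write(pcm);
        }
        return ms.ToArray();
    }

    // Writes <temp>/agentzero-tts-stt-tests/<stamp>-<caseId>-<suffix>.wav and returns the path.
    public static string Save(string caseId, string suffix, byte[] bytes, DateTime stamp)
    {
        var dir = Path.Combine(Path.GetTempPath(), "agentzero-tts-stt-tests");
        Directory.CreateDirectory(dir);
        var prefix = stamp.ToString("yyyyMMdd-HHmmss-fff", CultureInfo.InvariantCulture);
        var path = Path.Combine(dir, $"{prefix}-{caseId}-{suffix}.wav");
        File.WriteAllBytes(path, bytes);
        return path;
    }
}
```

Sharing one timestamp prefix between the two files of a case keeps each pair adjacent when the folder is sorted by name, which is what makes the listen-in-order decision tree quick to walk.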

Lesson 5 — Performance baseline. Record now, regress later.

Measured on the maintainer's dev machine (Whisper.net medium model, CPU only):
| input length | TTS ms | decode ms | STT ms | xRT | total ms |
|---|---:|---:|---:|---:|---:|
| 5 chars (Korean) | 30 | 1 | 3000 | 3–5× | ~3,000 |
| 11 chars (Korean) | 33 | 1 | 8915 | 4.09× | 8,949 |
| 42 chars (Korean) | 66 | 5 | 9948 | 1.68× | 10,019 |
| 5 chars (English) | 25 | 1 | 6000 | 3–5× | ~6,000 |
| 12 chars (English) | 43 | 11 | 9478 | 6.30× | 9,532 |
Three observations worth keeping in mind for the next discussion:
  • xRT improves on longer audio. (xRT here is STT time divided by audio duration, so lower is better.) Whisper's per-call overhead (model load, prompt setup, beam-search initialisation) dominates short clips. The 1.68× on the 5.93-second utterance is closer to the steady-state CPU performance of the medium model. Don't draw conclusions from the ~5× xRT on a half-second clip.
  • TTS is two orders of magnitude faster than STT. Synthesis is rarely the bottleneck. If the round-trip is slow, look at STT first.
  • First test in a run pays the model load (~9 s prep). Subsequent tests in the same suite hit cache (prep: 0 ms in the log). A 5-test suite costs ~50 seconds wall-clock end-to-end — fine for local iteration, fine for CI.
These numbers are the regression baseline. When we adopt:
  • a different STT (cloud Whisper, Webnori Gemma, on-device Gemma)
  • a different model size (tiny / small / large)
  • GPU acceleration (Vulkan / CUDA)
…the same suite produces directly comparable numbers in the same units. If a swap shows xRT going up by >1.5× with no quality gain, that's a regression worth investigating before merging. If the swap costs 1.2× xRT but lifts similarity from 96.8 % to 100 % on the long fixture, that's a different conversation.
The point isn't the absolute numbers; it's having them at all so the next change is measurable.
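For anyone reproducing the xRT column, the arithmetic is small enough to show. This is a sketch with hypothetical names (`RoundTripTiming` and its methods); it assumes 16 kHz, 16-bit, mono PCM, which is consistent with the 69736 bytes ≈ 2.18 s reported for the ko-question case, and it encodes the ">1.5×" rule of thumb as a default tolerance.

```csharp
using System;

public static class RoundTripTiming
{
    // Duration of a raw PCM buffer. Defaults assume 16 kHz, 16-bit, mono.
    public static double AudioSeconds(int pcmBytes, int sampleRate = 16000,
                                      int bytesPerSample = 2, int channels = 1)
        => (double)pcmBytes / (sampleRate * bytesPerSample * channels);

    // xRT: processing time divided by audio duration. Lower is better; 1.0 is real time.
    public static double RealtimeFactor(double sttMs, double audioSeconds)
        => sttMs / (audioSeconds * 1000.0);

    // Flags a candidate whose xRT exceeds the baseline by more than the tolerance factor.
    public static bool IsSpeedRegression(double baselineXrt, double candidateXrt,
                                         double tolerance = 1.5)
        => candidateXrt > baselineXrt * tolerance;
}
```

With the ko-question numbers, `AudioSeconds(69736)` is about 2.18 and `RealtimeFactor(8915, 2.17925)` is about 4.09, matching the row in the table. The regression check only covers speed; whether a slower candidate is acceptable still depends on what it does to the transcript.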

Lesson 6 — InternalsVisibleTo is for state, not for helpers

WavToPcm.To16kMono was internal. The test couldn't reach it without InternalsVisibleTo plumbing — which is reasonable when the test needs to peer at production state (private fields, internal exception types, the lock object), but heavyweight when all the test wants is to call the same transformation production calls.
The helper's encapsulation footprint is exactly zero. Flipping internal static class WavToPcm to public static class WavToPcm was the right move. No secret was protected by internal.
The general rule: InternalsVisibleTo is for state, not for helpers. If a static method is the canonical implementation of a transformation and tests will exercise it directly, just make it public. Save InternalsVisibleTo for the cases where the test legitimately needs to inspect production internals — at which point there's a real coupling to acknowledge.

What the console actually looks like

Every case prints something like this — the same data structure for every run, so you can diff two runs and see exactly what changed:
    ═══════════════════════════════════════════════════════════════════════
    Round-trip case: ko-question
    ═══════════════════════════════════════════════════════════════════════
    input : "오늘의 날씨는 어때?"
    language : ko
    input chars : 11
    TTS provider : WindowsTTS (SAPI default voice)
    STT provider : WhisperLocal (model=medium)
    OS : Microsoft Windows NT 10.0.26200.0
    .NET : 10.0.7
    TTS voices available: Microsoft Heami Desktop, Microsoft Zira Desktop, ...
    [STAGE 1/3] TTS : 33 ms · 96150 bytes
    WAV saved : C:\...\agentzero-tts-stt-tests\20260429-180759-870-ko-question-tts.wav
    [STAGE 2/3] decode : 1 ms · 69736 pcm bytes (~2.18s audio)
    PCM level : peak=-5.2 dBFS · rms=-21.3 dBFS
    STT-input WAV saved: C:\...\agentzero-tts-stt-tests\20260429-180759-870-ko-question-stt-input-16k.wav
    STT prep : 0 ms · ready=True
    [STAGE 3/3] STT : 8915 ms · 4.09x realtime
    transcript : "오늘의 날씨는 어때?"
    transcript chars : 11
    ─── COMPARISON ────────────────────────────────────────────────────────
    normalised input : "오늘의날씨는어때"
    normalised output : "오늘의날씨는어때"
    exact match : True · edit distance: 0 · similarity: 100.0%
    TIMING TOTAL : 8949 ms (33 synth + 1 decode + 8915 STT)
    VERDICT : ✓ PASS
    ═══════════════════════════════════════════════════════════════════════
Every parameter, every byte count, every millisecond, every audio level, every saved artifact path. Fail or pass, the same report shape. Strip any of it at your peril — the next person reading the output is probably trying to figure out which of fifteen variables changed.
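The `PCM level` line is the least obvious reading in that dump, so here is one way such a line could be computed. This is a sketch, not the suite's code: `PcmLevel.Measure` is a hypothetical helper, and it assumes 16-bit little-endian mono samples with full scale taken as 32768.

```csharp
using System;

public static class PcmLevel
{
    // Returns peak and RMS level in dBFS for 16-bit little-endian PCM.
    // An empty or all-zero buffer yields negative infinity for both.
    public static (double PeakDb, double RmsDb) Measure(byte[] pcm16le)
    {
        var samples = pcm16le.Length / 2;
        if (samples == 0) return (double.NegativeInfinity, double.NegativeInfinity);

        double peak = 0, sumSquares = 0;
        for (var i = 0; i < samples; i++)
        {
            // Assemble the sample by hand so the result does not depend on host endianness.
            var raw = (short)(pcm16le[2 * i] | (pcm16le[2 * i + 1] << 8));
            var s = raw / 32768.0;
            peak = Math.Max(peak, Math.Abs(s));
            sumSquares += s * s;
        }
        var rms = Math.Sqrt(sumSquares / samples);
        return (20 * Math.Log10(peak), 20 * Math.Log10(rms));
    }
}
```

A reading like this earns its place in the report because it separates "Whisper misheard clear audio" from "the resampler handed Whisper near-silence" without anyone opening the WAV.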

What this enables next

The streaming voice pipeline (Part 7) needs the same measurement to know whether it regresses. Now we have the baseline.
Concrete follow-ups that have a clear shape because the measurement instrument exists:
  • Cloud Whisper swap — same suite with OpenAiWhisperStt instead of WhisperLocalStt. We get a direct latency / quality / cost trade-off table.
  • Whisper small vs medium — currently the running app defaults to small for cost/speed; the test suite uses medium for quality measurement. Add a [Theory] row that runs both per fixture and we can see the small-vs-medium accuracy gap on the actual deployed fixtures, not on synthetic benchmarks.
  • Streaming-graph regression — once the Akka.Streams output graph chunks per sentence and runs progressive TTS, the same five fixtures in a streaming-flavor variant will catch any regression vs the file-based reference here.
  • GPU acceleration — the xRT column in the baseline becomes the "is this faster" check when we wire the Vulkan / CUDA Whisper.net runtimes.
Each of those changes used to feel like "we'll know if it works when we use it." Now each lands with a 5-row delta against the baseline numbers above.

Closing

The technique here is small. The philosophy is what made it worth writing about.
When the model under test is better than your fixture — Whisper correcting 모래 to 모레, an LLM giving a more fluent paraphrase, OCR normalising mojibake — the test has a choice: pass it via fuzzy threshold, or fail it and surface the diff. I think the second is right and I wanted that argument written down with the actual case attached, so the next person to face it has somewhere to look.
The next post is whatever the cloud Whisper swap teaches us once these numbers have a sibling.
  • psmon
